Hey fellas, sorry for being absent for so long; I have been busy with lots of work on other projects.

In this post I am going to show you how to screen scrape using NodeJS and cheerio, a library that gives you jQuery-style selectors on the server. It's relatively easy; here is the code:

var request = require('request'); // we need the request library
var cheerio = require('cheerio'); // and the cheerio library (jQuery-style selectors)
// set some defaults
var req = request.defaults({
  jar: true,                 // keep cookies between requests
  rejectUnauthorized: false, // do not fail on invalid SSL certificates
  followAllRedirects: true   // follow redirects
});
// scrape the page
req.get({
  url: 'http://www.whatsmyip.org/',
  headers: {
    'User-Agent': 'Google' // you can send any user-agent you want
  }
}, function (err, resp, body) {
  // stop if the request itself failed
  if (err) {
    return console.error('Request failed: ' + err.message);
  }

  // load the HTML into cheerio
  var $ = cheerio.load(body);

  // scrape the values using CSS selectors and print them to the console
  console.log('IP: ' + $('#ip').text());
  console.log('Host: ' + $('#hostname').text());
  console.log('User-Agent: ' + $('#useragent').text());
});
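
The callback above receives an error object, the response and the body. Before you parse anything it is usually worth checking the first two. Here is a small sketch of how that could look; checkResponse is a name I made up for illustration and is not part of request or cheerio, and treating anything other than status 200 as a problem is just an assumption that fits this simple case.

```javascript
// Hypothetical helper: returns an error message,
// or null when the response looks usable.
function checkResponse(err, resp) {
  if (err) {
    return 'Request failed: ' + err.message;
  }
  if (!resp || resp.statusCode !== 200) {
    return 'Unexpected status code: ' + (resp ? resp.statusCode : 'none');
  }
  return null;
}

// Inside the request callback it could be used like this:
//   var problem = checkResponse(err, resp);
//   if (problem) { return console.error(problem); }
//   var $ = cheerio.load(body);

// Quick demonstration with fake inputs
console.log(checkResponse(new Error('socket hang up'), null));
console.log(checkResponse(null, { statusCode: 404 }));
console.log(checkResponse(null, { statusCode: 200 }));
```

This way an empty or broken page shows up as a clear message instead of three blank values in the console.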

 

You can do the same kind of scraping in PHP. Simple HTML DOM is a PHP library that parses HTML and lets you find elements in the DOM quickly, so you do not have to spend hours writing your own parsing code in plain PHP. There is a similar library called PHP Selector.

In this case we want to grab the search results from Bing.

 

<?php
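// pass the keyword as an argument when you run the script: php this.php "your keyword"
$keyword = $argv[1];
require_once('simple_html_dom.php');
// urlencode() keeps spaces and special characters from breaking the URL
$bing = 'http://www.bing.com/search?q=' . urlencode($keyword) . '&count=50';
// We do it with Bing, but it is almost the same with other search engines.
echo "#####################################\n";
echo "###        SEARCHING IN BING     ####\n";
echo "#####################################\n";
$html = file_get_html($bing);
// file_get_html() returns false when the page could not be loaded
if (!$html) {
    exit("Could not load the results page\n");
}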
$linkObjs = $html->find('li h2 a');
foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);
    
    // if it is not a direct link but carries a URL in its q= parameter, extract that URL
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;    
    }
    
    print '<p>Title: ' . $title . "<br />\n"; // double quotes so \n is a real newline
    print 'Link: ' . $link . "</p>\n";
}

So that is practically it: you call the find() function from the Simple HTML DOM library with a CSS selector to pull the result links out of the DOM.
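
If you would rather stay in NodeJS for this part too, here is a rough sketch of the two steps of the PHP script that do not touch the DOM: building the search URL and telling direct links apart from wrapped ones. buildSearchUrl and resolveLink are names I made up for illustration, and the example.com links are placeholders, so treat it as a starting point rather than a finished scraper.

```javascript
// Hypothetical helper: build the Bing search URL with a safely encoded keyword.
function buildSearchUrl(keyword, count) {
  // The built-in URL class takes care of encoding the query parameters
  var url = new URL('http://www.bing.com/search');
  url.searchParams.set('q', keyword);
  url.searchParams.set('count', String(count || 50));
  return url.toString();
}

// Hypothetical helper: mirrors the if / else if logic of the PHP loop.
// Returns the usable link, or null when the link should be skipped.
function resolveLink(href) {
  var link = href.trim();

  // already a direct link
  if (/^https?/.test(link)) {
    return link;
  }

  // wrapped link: take the URL out of the q= parameter
  // (.+? is lazy, like the /U flag in the PHP pattern; the separator
  // is accepted both as "&amp;" and as a plain "&")
  var matches = link.match(/q=(.+?)&(?:amp;)?sa=/);
  if (matches && /^https?/.test(matches[1])) {
    return matches[1];
  }

  // not a valid link
  return null;
}

console.log(buildSearchUrl('node scraping & cheerio'));
console.log(resolveLink('/url?q=http://example.com/page&amp;sa=U'));
console.log(resolveLink('/images/search?q=cats'));
```

Returning null for links to skip keeps the loop simple: you call resolveLink on every href and only print the ones that come back with a value.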

If you need more help with this, remember that I can help with custom bots 😉