scraping Archives - Wizard Of Bots

Tag: scraping

Scrape with NodeJS and JQuery: Get your IP, Host and User-Agent

September 2, 2016 by wizardofbots

Hey fellas, sorry for being absent for so long. I have been busy with lots of work on other projects.

In this post I am going to teach you how to screen scrape using NodeJS and cheerio, a library that gives you JQuery-style selectors on the server. It's relatively easy. Here is the code:

var request = require('request'); // HTTP client library
var cheerio = require('cheerio'); // JQuery-style HTML parser

// set some defaults
var req = request.defaults({
  jar: true,                 // save cookies to a jar
  rejectUnauthorized: false, // accept invalid TLS certificates (for testing only)
  followAllRedirects: true   // follow redirects, including on non-GET requests
});

// scrape the page
req.get({
  url: "http://www.whatsmyip.org/",
  headers: {
    'User-Agent': 'Google' // you can put any user-agent you want
  }
}, function(err, resp, body) {
  // stop if the request failed, otherwise body is undefined
  if (err) {
    console.error('Request failed: ' + err.message);
    return;
  }

  // load the html into cheerio
  var $ = cheerio.load(body);

  // get the data and output to console
  console.log('IP: ' + $('#ip').text()); // scrape using CSS selectors
  console.log('Host: ' + $('#hostname').text());
  console.log('User-Agent: ' + $('#useragent').text());
});
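
If you want to see what those three selectors pull out without hitting the network, here is a small dependency-free sketch. The sample HTML and the textById helper are my own assumptions for illustration, not part of cheerio. A regex stands in for the real parser, so it only copes with simple elements and it trims whitespace, which cheerio's text() does not do.

```javascript
// Sample markup: an assumption that mimics the three IDs the script reads.
const sampleHtml = `
<div>
  <span id="ip">203.0.113.7</span>
  <span id="hostname">host.example.net</span>
  <span id="useragent">Google</span>
</div>`;

// Hypothetical helper: text of the first element carrying the given id.
// Regex stand-in for $('#id').text(); it breaks on nested elements that
// share the same tag name, so use cheerio for real pages.
function textById(html, id) {
  // escape regex metacharacters in the id before building the pattern
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    '<([a-zA-Z][\\w-]*)[^>]*\\sid=["\']' + escapedId + '["\'][^>]*>([\\s\\S]*?)</\\1>'
  );
  const match = pattern.exec(html);
  if (!match) return ''; // no element with that id
  return match[2].replace(/<[^>]*>/g, '').trim(); // drop inner tags, keep text
}

console.log('IP: ' + textById(sampleHtml, 'ip'));
console.log('Host: ' + textById(sampleHtml, 'hostname'));
console.log('User-Agent: ' + textById(sampleHtml, 'useragent'));
```

With cheerio installed, you would replace textById(sampleHtml, 'ip') with cheerio.load(sampleHtml)('#ip').text().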

Crawling for Bing results with Simple HTML DOM

July 15, 2016 by wizardofbots

Simple HTML DOM is a PHP library that parses a page's DOM and lets you find elements in it quickly with CSS-style selectors, instead of spending hours writing your own parsing code in plain PHP. There is a similar library called PHP Selector.

In this case, we want to grab the search results from Bing.

<?php
// Run the script like this: php this.php your keyword
require_once('simple_html_dom.php');

if ($argc < 2) {
    exit("Usage: php this.php your keyword\n");
}
// join all arguments so multi-word keywords work, then URL-encode them
$keyword = implode(' ', array_slice($argv, 1));
$bing = 'http://www.bing.com/search?q=' . urlencode($keyword) . '&count=50';
// We do it with Bing but it is almost the same with the other search engines.
echo "#####################################\n";
echo "###        SEARCHING IN BING     ####\n";
echo "#####################################\n";
$html = file_get_html($bing);
if (!$html) { // file_get_html() returns false when the page cannot be loaded
    exit("Could not load the results page\n");
}
$linkObjs = $html->find('li h2 a');
foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but a url reference is found inside it, then extract it
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }

    // double quotes so that \n is printed as a real newline
    print '<p>Title: ' . $title . "<br />\n";
    print 'Link: ' . $link . "</p>\n";
}

So that is practically it: you call the find() function from the Simple HTML DOM library on the parsed page to find the result links, here with the selector 'li h2 a'.
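
For the NodeJS folks, the two fiddly parts of the script (building the search URL and deciding which links to keep) can be sketched in JavaScript like this. The function names buildBingUrl and resolveLink and the example links are my own assumptions for illustration. The regex is the same one the PHP script uses, with a lazy match doing the job of the /U flag.

```javascript
// Hypothetical helper: build the search URL with the keyword safely encoded.
function buildBingUrl(keyword, count = 50) {
  const url = new URL('http://www.bing.com/search');
  url.searchParams.set('q', keyword);          // spaces become +, & becomes %26
  url.searchParams.set('count', String(count));
  return url.toString();
}

// Hypothetical helper: mirror of the if / else-if check in the PHP loop.
// Returns the usable link, or null when the result should be skipped.
function resolveLink(href) {
  const link = href.trim();
  if (/^https?/.test(link)) {
    return link;                               // already a direct link
  }
  const match = /q=(.+?)&amp;sa=/.exec(link);  // url reference inside the link
  if (match && /^https?/.test(match[1])) {
    return match[1];
  }
  return null;                                 // not a valid link
}

console.log(buildBingUrl('node js scraping'));
console.log(resolveLink('/url?q=http://example.org/page&amp;sa=U'));
console.log(resolveLink('/images/search?x=1'));
```

Returning null plays the role of continue in the PHP loop: the caller just skips that result.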

If you need more help with this, remember I can help with custom bots 😉