Integrate Scrapoxy with Scrapy

Goal

Is it easy to find a good Python developer in Paris? No!

So, it’s time to build a scraper with Scrapy to find our perfect profile.

The site Scraping Challenge indexes a lot of profiles (fake, for demo purposes). We want to grab them and create a CSV file.

However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.

Step 1: Install Scrapy

Install Python 2.7

Scrapy works with Python 2.7.
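
You can check which Python version is active:

python --version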

Install dependencies

On Ubuntu:

apt-get install python-dev libxml2-dev libxslt1-dev libffi-dev

On Windows (with Babun):

wget https://bootstrap.pypa.io/ez_setup.py -O - | python
easy_install pip
pact install libffi-devel libxml2-devel libxslt-devel

Install Scrapy and Scrapoxy connector

pip install scrapy scrapoxy
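
To check that both packages are installed correctly, try importing them (a quick sanity check; the version number will vary):

import scrapy
import scrapoxy

print scrapy.__version__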

Step 2: Create the scraper myscraper

Create a new project

Bootstrap the skeleton of the project:

scrapy startproject myscraper
cd myscraper
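
The command generates a skeleton like this (details can vary with your Scrapy version):

myscraper/
    scrapy.cfg
    myscraper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py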

Add a new spider

Add this content to myscraper/spiders/scraper.py:

# -*- coding: utf-8 -*-

from scrapy import Request, Spider

class Scraper(Spider):
    name = u'scraper'

    def start_requests(self):
        """This is our first request to grab all the urls of the profiles.
        """
        yield Request(
            url=u'http://scraping-challenge-2.herokuapp.com',
            callback=self.parse,
        )

    def parse(self, response):
        """We have all the urls of the profiles. Let's make a request for each profile.
        """
        urls = response.xpath(u'//a/@href').extract()
        for url in urls:
            yield Request(
                url=response.urljoin(url),
                callback=self.parse_profile,
            )

    def parse_profile(self, response):
        """We have a profile. Let's extract the name
        """
        name_el = response.css(u'.profile-info-name::text').extract()
        if name_el:
            yield {
                'name': name_el[0]
            }

If you want to learn more about Scrapy, see this Tutorial.
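
You can also test the selectors interactively before running a full crawl, with Scrapy's built-in shell:

scrapy shell http://scraping-challenge-2.herokuapp.com

Inside the shell, response.xpath(u'//a/@href').extract() should return the list of profile links. Keep in mind that the site's protection may block this direct request too, just like the unproxied crawl below.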

Run the spider

Let’s try our new scraper!

Run this command:

scrapy crawl scraper -o profiles.csv

Scrapy scrapes the site and extracts the profiles to profiles.csv.

However, Scraping Challenge is protected! profiles.csv is empty…

We will integrate Scrapoxy to bypass the protection.

Step 3: Integrate Scrapoxy with Scrapy

Install Scrapoxy

See Quick Start to install Scrapoxy.

Start Scrapoxy

Set the maximum number of instances to 6, and start Scrapoxy (see Change scaling with GUI).

Warning

Don’t forget to set the maximum number of instances!

Edit settings of the Scraper

Add this content to myscraper/settings.py:

CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
RETRY_TIMES = 0  # do not retry failed requests

# PROXY
PROXY = 'http://127.0.0.1:8888/?noconnect'

# SCRAPOXY
API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 102,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}

What are these middlewares?

  • ProxyMiddleware relays requests to Scrapoxy. It is a helper that sets the proxy from the PROXY setting on every request (see the sketch after this list).

  • WaitMiddleware stops the scraper and waits for Scrapoxy to be ready.

  • ScaleMiddleware asks Scrapoxy to maximize the number of instances at the beginning, and to stop them at the end.
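
Conceptually, ProxyMiddleware boils down to setting the standard proxy meta key on every request. Here is a simplified sketch, not the actual connector source:

# -*- coding: utf-8 -*-

class SimpleProxyMiddleware(object):
    """A minimal stand-in for the scrapoxy ProxyMiddleware."""

    def __init__(self, proxy):
        self._proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Read the PROXY setting defined in settings.py
        return cls(crawler.settings.get('PROXY'))

    def process_request(self, request, spider):
        # Scrapy's HTTP downloader honours the 'proxy' meta key,
        # so every request is routed through Scrapoxy
        request.meta['proxy'] = self._proxy

The real ProxyMiddleware from the scrapoxy package handles more cases; this sketch only shows how the PROXY setting ends up on each request.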

Note

ScaleMiddleware pauses the scraper, like WaitMiddleware. After 2 minutes, all instances are ready and the scraper resumes scraping.

Warning

Don’t forget to change the password! It must match the password of the Scrapoxy commander.

Run the spider

Run this command:

scrapy crawl scraper -o profiles.csv

Now, all profiles are saved to profiles.csv!
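
To verify the result, you can count the extracted profiles with a few lines of Python (a minimal sketch; it only assumes the name column produced by the spider above):

import csv

with open('profiles.csv') as f:
    reader = csv.DictReader(f)
    names = [row['name'] for row in reader]

print '%d profiles extracted' % len(names)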