crawley-project/crawley

View on GitHub
doc/source/example.rst

Summary

Maintainability
Test Coverage
Pypi Example
------------

Lets make a scraper that extract data from the pypi's front page

http://pypi.python.org/pypi

Suppose that we want extract data from the packages table and
store it in a database.

First at all we need to start a new crawley's project. In this case
it will be called pypi_packages.

.. code-block:: bash
    
    crawley startproject pypi_packages
    cd pypi_packages
    
Then, inside the pypi_packages there will a settings.py file and another
directory containing a crawlers.py file and a models.py file.

Lets start defining the models.py which is very simple in this case.

Models
======

.. code-block:: python

    from crawley.persistance import Entity, UrlEntity, Field, Unicode

    class Package(Entity):
    
        #add your table fields here
        updated = Field(Unicode(255))    
        package = Field(Unicode(255))
        description = Field(Unicode(255))

As you can see, we just need a simple Package table that matches the data
we want to extract.

Ok, now is time to code. Lets make the scraper. It must be located inside
the crawlers.py file

Crawlers
========

.. code-block:: python

    from crawley.crawlers import BaseCrawler
    from crawley.scrapers import BaseScraper
    from crawley.extractors import XPathExtractor
    from models import *

    class pypiScraper(BaseScraper):
        
        #specify the urls that can be scraped by this class
        matching_urls = ["%"]
        
        def scrape(self, response):
                            
            #getting the html table
            table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
            
            #for rows 1 to n-1
            for tr in table[1:-1]:
                            
                #obtaining the searched html inside the rows
                td_updated = tr[0]
                td_package = tr[1]
                package_link = td_package[0]
                td_description = tr[2]
                
                #storing data in Packages table
                Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


    class pypiCrawler(BaseCrawler):
        
        #add your starting urls here
        start_urls = ["http://pypi.python.org/pypi"]
        
        #add your scraper classes here    
        scrapers = [pypiScraper]
        
        #specify you maximum crawling depth level    
        max_depth = 0
        
        #select your favourite HTML parsing tool
        extractor = XPathExtractor


The interesting part of this is the scrape method defined inside the
pypiScraper class. It uses XPath in order to obtain the parsed html
and then stores the extracted data in the Packages table.

At this time we have finished our work. Very simple. Isn't it?

Run crawley
===========

Finally, just run the crawler (Ensure you are in the same directory where the
settings.py file is, in other case you can specify your settings directory
with --settings=path/to/your/settings.py)

.. code-block:: bash
    
    ~$ crawley run

And we are done. Check the results in your database!

Downloading the Code
====================

The entire code is located in the crawley's official repository at:

https://github.com/jmg/crawley/tree/master/examples/pypi_packages