.. _intro-tutorials-link-crawler:

=============
Link crawlers
=============

(Check out another example on the `GitHub repo readme`_.)

For this example, we will extract content from all talks on `pyvideo`_. We will use the `event listing`_ as the base page.

.. _GitHub repo readme: https://github.com/AlexMathew/scrapple
.. _pyvideo: http://pyvideo.org/
.. _event listing: http://pyvideo.org/category

To generate a skeleton configuration file, use the ``genconfig`` command. Its primary arguments are the project name and the URL of the base page. To get the skeleton for a crawler, add the ``--type=crawler`` argument.

::

    $ scrapple genconfig pyvideo http://pyvideo.org/category \
    > --type=crawler

This will create pyvideo.json, which will initially look like this:

::

    {

        "scraping": {
            "url": "http://pyvideo.org/category",
            "data": [
                {
                    "default": "",
                    "field": "",
                    "attr": "",
                    "selector": ""
                }
            ],
            "next": [
                {
                    "follow_link": "",
                    "scraping": {
                        "data": [
                            {
                                "default": "",
                                "field": "",
                                "attr": "",
                                "selector": ""
                            }
                        ]
                    }
                }
            ]
        },
        "project_name": "pyvideo",
        "selector_type": "xpath"

    }

You can edit this JSON file to specify selectors for the data you want to extract from the given page.

For example:

::

    {

        "scraping": {
            "url": "http://pyvideo.org/category/",
            "data": [
                {
                    "field": "",
                    "attr": "",
                    "selector": "",
                    "default": ""
                }
            ],
            "next": [
                {
                    "follow_link": "//table//td[1]//a",
                    "scraping": {
                        "data": [
                            {
                                "field": "event",
                                "attr": "text",
                                "selector": "//h1",
                                "default": ""
                            },
                            {
                                "field": "event_url",
                                "attr": "",
                                "selector": "url",
                                "default": ""
                            }
                        ],
                        "next": [
                            {
                                "follow_link": "//div[@id='video-summary-content']/div//strong/a",
                                "scraping": {
                                    "data": [
                                        {
                                            "field": "talk_title",
                                            "attr": "text",
                                            "selector": "//h3",
                                            "default": "<unknown>"
                                        },
                                        {
                                            "field": "speaker",
                                            "attr": "text",
                                            "selector": "//div[@id='sidebar']//dd[2]",
                                            "default": "<unknown>"
                                        },
                                        {
                                            "field": "talk_url",
                                            "attr": "",
                                            "selector": "url",
                                            "default": ""
                                        }
                                    ]
                                }
                            }
                        ]
                    }
                }
            ]
        },
        "project_name": "pyvideo",
        "selector_type": "xpath"

    }
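
Each ``next`` entry pairs a ``follow_link`` selector with a nested ``scraping`` block, so the depth of nesting is the depth of the crawl. As a rough illustration, the levels can be listed with a short recursive walk. This is not part of Scrapple: ``describe_levels`` is a hypothetical helper, and the dictionary is a trimmed copy of the configuration above.

.. code-block:: python

    # Trimmed copy of the configuration above; only the parts needed here.
    config = {
        "scraping": {
            "url": "http://pyvideo.org/category/",
            "data": [],
            "next": [
                {
                    "follow_link": "//table//td[1]//a",
                    "scraping": {
                        "data": [{"field": "event"}, {"field": "event_url"}],
                        "next": [
                            {
                                "follow_link": "//div[@id='video-summary-content']/div//strong/a",
                                "scraping": {
                                    "data": [
                                        {"field": "talk_title"},
                                        {"field": "speaker"},
                                        {"field": "talk_url"},
                                    ]
                                },
                            }
                        ],
                    },
                }
            ],
        }
    }


    def describe_levels(scraping, depth=0, via=None):
        """Hypothetical helper: return (depth, follow_link, fields) per level."""
        fields = [d["field"] for d in scraping.get("data", []) if d.get("field")]
        levels = [(depth, via, fields)]
        for step in scraping.get("next", []):
            levels.extend(
                describe_levels(step["scraping"], depth + 1, step["follow_link"])
            )
        return levels


    levels = describe_levels(config["scraping"])
    for depth, via, fields in levels:
        print(depth, via, fields)

Here the base page is level 0, the event pages are level 1, and the talk pages are level 2.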

With this configuration file, you can generate a Python script using ``scrapple generate``, or run the scraper directly using ``scrapple run``.

The ``generate`` and ``run`` commands take two positional arguments: the project name and the output file name.

To generate the Python script:

::

    $ scrapple generate pyvideo talk_list

This will create talk_list.py, which is the script that can be run to replicate the action of ``scrapple run``.

.. code-block:: python

    from __future__ import print_function
    import json
    import os

    from scrapple.selectors.xpath import XpathSelector


    def task_pyvideo():
        """
        Script generated using 
        `Scrapple <http://scrappleapp.github.io/scrapple>`_
        """
        results = dict()
        results['project'] = "pyvideo"
        results['data'] = list()
        try:
            r0 = dict()
            page0 = XpathSelector("http://pyvideo.org/category/")
            
            for page1 in page0.extract_links(
            "//table//td[1]//a"):
                r1 = r0.copy()
                r1["event"] = page1.extract_content(
                "//h1", "text", ""
                )
                r1["event_url"] = page1.extract_content(
                "url", "", ""
                )
                    
                
                for page2 in page1.extract_links(
                "//div[@id='video-summary-content']/div//strong/a"):
                    r2 = r1.copy()
                    r2["talk_title"] = page2.extract_content(
                    "//h3", "text", "<unknown>"
                    )
                    r2["speaker"] = page2.extract_content(
                    "//div[@id='sidebar']//dd[2]", "text", "<unknown>"
                    )
                    r2["talk_url"] = page2.extract_content(
                    "url", "", ""
                    )
                    results['data'].append(r2)
        except KeyboardInterrupt:
            pass
        except Exception as e:
            print(e)
        finally:
            with open(os.path.join(os.getcwd(), 'talk_list.json'), 'w') as f:
                json.dump(results, f)
        

    if __name__ == '__main__':
        task_pyvideo()
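
The ``r1 = r0.copy()`` and ``r2 = r1.copy()`` lines are what let every talk record carry the fields of the event page it was reached from. The pattern can be seen in isolation with plain dictionaries. The event and talk names below are made-up stand-ins for scraped values, and no network access is involved.

.. code-block:: python

    # Hypothetical stand-in data: each event maps to the talks on its page.
    events = {
        "Event A": ["Talk 1", "Talk 2"],
        "Event B": ["Talk 3"],
    }

    records = []
    r0 = dict()
    for event, talks in events.items():
        r1 = r0.copy()           # start from the parent (base page) record
        r1["event"] = event
        for talk in talks:
            r2 = r1.copy()       # inherit the event-level fields
            r2["talk_title"] = talk
            records.append(r2)   # only the deepest level is stored

    for record in records:
        print(record)

Copying rather than mutating the parent record keeps sibling pages from overwriting each other's fields.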



To run the scraper:

::

    $ scrapple run pyvideo talk_list

This will create talk_list.json, which contains the extracted information.

A portion of talk_list.json will look like this:

::

    {

        "project": "pyvideo",
        "data": [
            {
                "talk_title": "Boston Python Meetup: ...",
                "talk_url": "http://pyvideo.org/video/591/...",
                "event_url": "http://pyvideo.org/category/15/...",
                "speaker": "Stephan Richter",
                "event": "Boston Python Meetup"
            },
            {
                "talk_title": "Boston Python Meetup: ...",
                "talk_url": "http://pyvideo.org/video/592/...",
                "event_url": "http://pyvideo.org/category/15/...",
                "speaker": "Marshall Weir",
                "event": "Boston Python Meetup"
            },
            {
                "talk_title": "November 2014 ...",
                "talk_url": "http://pyvideo.org/video/3359/...",
                "event_url": "http://pyvideo.org/category/14/...",
                "speaker": "Asma Mehjabeen Isaac Adorno",
                "event": "ChiPy"
            },


            ### talk_list.json continues


            {
                "talk_title": "Python 2.7 & Python 3: ...",
                "talk_url": "http://pyvideo.org/video/3373/...",
                "event_url": "http://pyvideo.org/category/64/...",
                "speaker": "Kenneth Reitz",
                "event": "Twitter University 2014"
            }
        ]

    }
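
Because the output is plain JSON, it can be post-processed with the standard library. The sketch below groups talk titles by event. To stay self-contained it parses an inline sample with the same shape as the output above, using placeholder values; in practice one would ``json.load`` the talk_list.json file instead.

.. code-block:: python

    import json
    from collections import defaultdict

    # Inline sample shaped like talk_list.json; the values are placeholders.
    sample = """
    {
        "project": "pyvideo",
        "data": [
            {"talk_title": "Talk 1", "speaker": "Speaker X", "event": "Event A"},
            {"talk_title": "Talk 2", "speaker": "Speaker Y", "event": "Event A"},
            {"talk_title": "Talk 3", "speaker": "Speaker Z", "event": "Event B"}
        ]
    }
    """

    results = json.loads(sample)

    # Collect the talk titles under the event they belong to.
    talks_by_event = defaultdict(list)
    for record in results["data"]:
        talks_by_event[record["event"]].append(record["talk_title"])

    for event, titles in sorted(talks_by_event.items()):
        print(event, len(titles))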