derwentx/feature-hunter

Feature Hunter
====

```
                    ____           __
                   / __/__  ____ _/ /___  __________
                  / /_/ _ \/ __ `/ __/ / / / ___/ _ \
                 / __/  __/ /_/ / /_/ /_/ / /  /  __/
                /_/  \___/\__,_/\__/\__,_/_/   \___/
                    __                __
                   / /_  __  ______  / /____  _____
                  / __ \/ / / / __ \/ __/ _ \/ ___/
                 / / / / /_/ / / / / /_/  __/ /
                /_/ /_/\__,_/_/ /_/\__/\___/_/

```


[![Maintainability](https://api.codeclimate.com/v1/badges/6d9a7038a57ee928ca27/maintainability)](https://codeclimate.com/github/derwentx/feature-hunter/maintainability)

A Python module that trawls music websites, detects changes in lists of feature albums, and sends notifications by email.

Pre-install
====
For Scrapy to work, you first need to install a few packages. The Scrapy installation guide explains it all: https://doc.scrapy.org/en/latest/intro/install.html

Install
====

Clone this repository and `cd` into it:

```bash
git clone https://github.com/derwentx/feature-hunter
cd feature-hunter/

```

Install and test the Python package:

```bash
sudo python setup.py install
python test/test_basic.py
```
Play with some databases (an example database is provided):

```bash
cp example_db.json ~
cd ~
python -m feature_hunter --db example_db.json
```

Schedule that bad boi in your crontab for alerts!

```
0 * * * * python -m feature_hunter --db ~/example_db.json --enable-alerts --smtp-host smtp.gmail.com --smtp-port 465 --smtp-pass <email_password> --smtp-sender <your_email> --smtp-domain <your_domain>
```

Configuration
====
Targets are configured by modifying the `targets` table of the database file. Here's the example DB, which reads feature albums from the Triple J website:

```json
{
    "targets": {
        "1": {
            "url": "http://www.abc.net.au/triplej/music/featurealbums/",
            "record_spec": "{\"css\": \"div.podlist_item\"}",
            "field_specs": "{\"album\": {\"regex\": \" - \\\\s*(\\\\S[\\\\s\\\\S]+\\\\S)\\\\s*$\", \"css\": \"div.text div.title::text\"}, \"artist\": {\"regex\": \"^\\\\s*(\\\\S[\\\\s\\\\S]+\\\\S)\\\\s* - \", \"css\": \"div.text div.title::text\"}}",
            "name": "triplej"
        }
    }
}

```

It's JSON within JSON (JSON all the way down), so quote characters have to be backslash-escaped. That makes it easier to create your own database using `feature_hunter.db.DBWrapper.insert_target()`. If I get enough interest in this repo, I'll add something to make targets easier to enter into the database.
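
Because the nested specs are just JSON strings, one way to avoid hand-escaping is to build them as Python dicts and let `json.dumps` do the quoting. This is a minimal sketch that assumes only the layout of the example database shown above; the `make_target` helper is hypothetical and not part of feature_hunter:

```python
import json


def make_target(name, url, record_spec, field_specs):
    """Build one target entry; the nested specs are stored as JSON strings.

    Hypothetical helper for illustration, not part of feature_hunter.
    """
    return {
        "url": url,
        "record_spec": json.dumps(record_spec),
        "field_specs": json.dumps(field_specs),
        "name": name,
    }


title_css = "div.text div.title::text"
target = make_target(
    name="triplej",
    url="http://www.abc.net.au/triplej/music/featurealbums/",
    record_spec={"css": "div.podlist_item"},
    field_specs={
        # Raw strings keep the regex backslashes readable in Python.
        "album": {"regex": r" - \s*(\S[\s\S]+\S)\s*$", "css": title_css},
        "artist": {"regex": r"^\s*(\S[\s\S]+\S)\s* - ", "css": title_css},
    },
)

# Serialising the whole database escapes the inner quotes and backslashes.
db_text = json.dumps({"targets": {"1": target}}, indent=4)
print(db_text)
```

Decoding `db_text` and then decoding the `field_specs` string gives back the original regexes, so the escaping round-trips.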

In this example, our target webpage looks something like this:
```html
<!-- ... -->
<div id="two_col">
    <h2 id="latest">latest feature albums</h2>

    <!--item start-->
    <div class="podlist_item">
        <a href="http://www.abc.net.au/triplej/review/album/s4547791.htm"><img width="300" height="300" alt="Banks - The Altar" src="http://www.abc.net.au/triplej/review/album/img/banks_thealtar.jpg"></a>
        <div class="text" style="height: 66px;">
            <div class="title">Banks - The Altar</div>
            Following up her 2013 debut <i>Goddess</i>, the L.A. singer pushes personal boundaries with her alt-pop R&amp;B sound.
        </div>
        <a href="http://www.abc.net.au/triplej/review/album/s4547791.htm" class="more">More</a>
        <div class="clear"></div>
    </div>
    <!--item end-->
    <!-- ... -->

</div>
<!-- ... -->
```

We want to target every `<div class="podlist_item">` as a record, using the CSS target spec `div.podlist_item` (XPath targeting is also supported). To obtain the fields `album` and `artist` from each record, we apply another CSS target spec, `div.text div.title::text`. Since the format of the title is `<artist> - <album>`, we then pick each field out of this text with a regular expression: ` - \s*(\S[\s\S]+\S)\s*$` for the album and `^\s*(\S[\s\S]+\S)\s* - ` for the artist.

That's all you need to specify a target: a CSS or XPath target spec for the records, and a CSS or XPath target spec for each field. The regex is optional, and not needed if your fields are already separated in the HTML.
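
To see what the two regular expressions pick out of a title, you can try them with Python's standard `re` module. This sketch only exercises the regex step on the title text from the example page; the `split_title` helper is made up for illustration, and how feature_hunter applies the patterns internally is not shown here:

```python
import re

# Regexes from the example target, written as Python raw strings.
ALBUM_RE = re.compile(r" - \s*(\S[\s\S]+\S)\s*$")
ARTIST_RE = re.compile(r"^\s*(\S[\s\S]+\S)\s* - ")


def split_title(title):
    """Return (artist, album) from a '<artist> - <album>' title, or None."""
    artist = ARTIST_RE.search(title)
    album = ALBUM_RE.search(title)
    if artist is None or album is None:
        return None
    return artist.group(1), album.group(1)


print(split_title("Banks - The Altar"))  # ('Banks', 'The Altar')
```

Note that, as written, each capture group needs at least three characters (`\S`, then one or more of anything, then `\S`), so a very short artist or album name would not match.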

Mail
----
You may need to mess around with mail settings to get mail to work. At the moment it connects to localhost as a plaintext SMTP server, so if you're using macOS you'll have to follow this guide: http://www.developerfiles.com/how-to-send-emails-from-localhost-mac-os-x-el-capitan/

If I get enough interest I'll write an SSL SMTP client, because plaintext credentials are bad.
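
For reference, Python's standard `smtplib` covers both the plaintext and the SSL case. The following is only a sketch of what sending an alert could look like, not feature_hunter's actual mail code; the `build_alert` and `send_alert` helpers, the addresses, and the message wording are all assumptions made for illustration:

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipient, target_name, new_albums):
    """Compose a plain-text alert listing new (artist, album) pairs."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "New feature albums on %s" % target_name
    msg.set_content("\n".join("%s - %s" % pair for pair in new_albums))
    return msg


def send_alert(msg, host="localhost", port=25, password=None):
    """Send via plaintext SMTP, or via SMTP over SSL when a password is given."""
    if password is None:
        # Plaintext connection, e.g. to a mail server running on localhost.
        with smtplib.SMTP(host, port) as server:
            server.send_message(msg)
    else:
        # Encrypted connection, so the credentials are not sent in the clear.
        with smtplib.SMTP_SSL(host, port) as server:
            server.login(msg["From"], password)
            server.send_message(msg)


# Building the message needs no network access; sending is left to the caller.
alert = build_alert("me@example.com", "me@example.com", "triplej",
                    [("Banks", "The Altar")])
print(alert["Subject"])
```

Calling `send_alert(alert)` would try the local plaintext server, while something like `send_alert(alert, host="smtp.gmail.com", port=465, password=...)` would go over SSL.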


Roadmap
====
 - [x] Correctly identify changes in targets specified in database
 - [ ] Interface to easily add targets to database
 - [ ] Send alerts when changes are detected
 - [ ] Get rid of ScrapyDeprecationWarning