Feature Hunter
====
```
    ____           __
   / __/__  ____ _/ /___  __________
  / /_/ _ \/ __ `/ __/ / / / ___/ _ \
 / __/  __/ /_/ / /_/ /_/ / /  /  __/
/_/  \___/\__,_/\__/\__,_/_/   \___/
    __                __
   / /_  __  ______  / /____  _____
  / __ \/ / / / __ \/ __/ _ \/ ___/
 / / / / /_/ / / / / /_/  __/ /
/_/ /_/\__,_/_/ /_/\__/\___/_/
```
[![Maintainability](https://api.codeclimate.com/v1/badges/6d9a7038a57ee928ca27/maintainability)](https://codeclimate.com/github/derwentx/feature-hunter/maintainability)
A Python module that trawls music websites, detects changes in their lists of feature albums, and sends notifications by email.
Pre-install
====
Scrapy needs a couple of system packages installed before it will work. The Scrapy installation guide explains it all: https://doc.scrapy.org/en/latest/intro/install.html
Install
====
Clone this repository and cd into it
```bash
git clone https://github.com/derwentx/feature-hunter
cd feature-hunter/
```
Install and test the Python package:
```bash
sudo python setup.py install
python test/test_basic.py
```
Play with some databases (an example database is provided):
```bash
cp example_db.json ~
cd ~
python -m feature_hunter --db example_db.json
```
Schedule that bad boi in your crontab for alerts!
```
0 * * * * python -m feature_hunter --db ~/example_db.json --enable-alerts --smtp-host smtp.gmail.com --smtp-port 465 --smtp-pass <email_password> --smtp-sender <your_email> --smtp-domain <your_domain>
```
Configuration
====
Targets are configured by modifying the `targets` table of the database file. Here's the example DB, which reads the feature albums from the Triple J website:
```json
{
"targets": {
"1": {
"url": "http://www.abc.net.au/triplej/music/featurealbums/",
"record_spec": "{\"css\": \"div.podlist_item\"}",
"field_specs": "{\"album\": {\"regex\": \" - \\\\s*(\\\\S[\\\\s\\\\S]+\\\\S)\\\\s*$\", \"css\": \"div.text div.title::text\"}, \"artist\": {\"regex\": \"^\\\\s*(\\\\S[\\\\s\\\\S]+\\\\S)\\\\s* - \", \"css\": \"div.text div.title::text\"}}",
"name": "triplej"
}
}
}
```
It's JSON within JSON (JSON all the way down), so the quote characters and backslashes inside each spec have to be backslash-escaped. That makes it easier to create your own database using `feature_hunter.db.DBWrapper.insert_target()` than to edit the file by hand. If I get enough interest in this repo, I'll add something to make targets easier to enter into the database.
In this example, our target webpage looks something like this:
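To see where the escaping comes from, here is a minimal sketch that rebuilds the example target using only the standard `json` module. It does not use `DBWrapper`, and it assumes the database file is plain JSON shaped like the example above. The inner specs are ordinary dicts, and `json.dumps` is applied twice: once to turn each spec into a string, and once to serialise the whole database.

```python
import json

# Inner specs as plain Python dicts. Raw strings keep the regexes readable.
record_spec = {"css": "div.podlist_item"}
field_specs = {
    "album": {
        "css": "div.text div.title::text",
        "regex": r" - \s*(\S[\s\S]+\S)\s*$",
    },
    "artist": {
        "css": "div.text div.title::text",
        "regex": r"^\s*(\S[\s\S]+\S)\s* - ",
    },
}

# First level of encoding: each spec becomes a JSON string.
target = {
    "url": "http://www.abc.net.au/triplej/music/featurealbums/",
    "record_spec": json.dumps(record_spec),
    "field_specs": json.dumps(field_specs),
    "name": "triplej",
}

# Second level of encoding: the whole database becomes JSON text,
# which escapes the quotes and backslashes of the inner strings again.
db = {"targets": {"1": target}}
db_text = json.dumps(db, indent=2)
print(db_text)
```

Each `\s` in a regex ends up as `\\\\s` in the printed text, which matches the example database.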
```html
<!-- ... -->
<div id="two_col">
<h2 id="latest">latest feature albums</h2>
<!--item start-->
<div class="podlist_item">
<a href="http://www.abc.net.au/triplej/review/album/s4547791.htm"><img width="300" height="300" alt="Banks - The Altar" src="http://www.abc.net.au/triplej/review/album/img/banks_thealtar.jpg"></a>
<div class="text" style="height: 66px;">
<div class="title">Banks - The Altar</div>
Following up her 2013 debut <i>Goddess</i>, the L.A. singer pushes personal boundaries with her alt-pop R&B sound.
</div>
<a href="http://www.abc.net.au/triplej/review/album/s4547791.htm" class="more">More</a>
<div class="clear"></div>
</div>
<!--item end-->
<!-- ... -->
</div>
<!-- ... -->
```
We want to target every `<div class="podlist_item">` as a record, using the CSS target spec `div.podlist_item` (XPath targeting is also supported). To obtain the fields `album` and `artist` from each record, we apply another CSS target spec, `div.text div.title::text`. Since the format of the title is `<artist> - <album>`, we then pick each field out of this text element with a regular expression: ` - \s*(\S[\s\S]+\S)\s*$` for the album and `^\s*(\S[\s\S]+\S)\s* - ` for the artist.
That's all you need to specify a target: a CSS or XPath target spec for the records, and a CSS or XPath target spec for each field. The regex is optional, and not needed if your fields are already separated in the HTML.
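The two regular expressions can be checked on their own before they go into a database. This is a small sketch using only the standard `re` module on the title text from the example page. The `extract` helper is hypothetical, and how feature_hunter applies the regex internally is an assumption here (first capture group of the first match).

```python
import re

# The regexes from the example target, written as raw strings.
ALBUM_RE = r" - \s*(\S[\s\S]+\S)\s*$"
ARTIST_RE = r"^\s*(\S[\s\S]+\S)\s* - "


def extract(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# The text selected by `div.text div.title::text` in the example page.
title = "Banks - The Altar"
artist = extract(ARTIST_RE, title)
album = extract(ALBUM_RE, title)
print(artist, "|", album)  # prints: Banks | The Altar
```

Note that the group `\S[\s\S]+\S` needs at least three characters, so these particular patterns would not match an artist or album name shorter than that.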
Mail
----
You may need to dick around with mail settings to get mail to work. At the moment it connects to localhost as a plaintext SMTP server, so if you're using macOS you'll have to follow this guide: http://www.developerfiles.com/how-to-send-emails-from-localhost-mac-os-x-el-capitan/
If I get enough interest I'll write an SSL SMTP client, because plaintext creds r bad
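For reference, an SSL SMTP client does not take much code with the Python 3 standard library. The following is only a sketch of what one could look like, not something feature_hunter ships: the function names, the subject line, and the message layout are all made up for illustration. Only `smtplib.SMTP_SSL`, `ssl.create_default_context` and `email.message.EmailMessage` are real standard library APIs.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_alert(sender, recipient, new_albums):
    """Build an alert email from a list of (artist, album) tuples."""
    msg = EmailMessage()
    msg["Subject"] = "feature-hunter: %d new feature album(s)" % len(new_albums)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "\n".join("%s - %s" % (artist, album) for artist, album in new_albums)
    )
    return msg


def send_alert(msg, host, port, password):
    """Send the message over an encrypted connection (hypothetical helper)."""
    context = ssl.create_default_context()
    # SMTP_SSL encrypts from the first byte, so credentials never travel
    # in plaintext. Port 465 is the usual choice for this mode.
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.login(msg["From"], password)
        server.send_message(msg)


# Building the message needs no network access.
message = build_alert("me@example.com", "me@example.com", [("Banks", "The Altar")])
print(message)
```

Calling `send_alert(message, "smtp.gmail.com", 465, "<email_password>")` would then deliver it, assuming the account allows SMTP logins.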
Roadmap
====
- [x] Correctly identify changes in targets specified in database
- [ ] Interface to easily add targets to database
- [ ] Send alerts when changes are detected
- [ ] Get rid of `ScrapyDeprecationWarning`