albertyw/albertyw.com

View on GitHub
app/notes/20190225-0559.md

Summary

Maintainability
Test Coverage
Scraping Images From Tumblr

scraping-tumblr-images

1551074385

With my [Reaction.Pics](https://www.reaction.pics/) project, I had to scrape a
bunch of tumblr accounts for data to assemble its database.  Since I was trying
to not hotlink to thousands of images, I made local copies of images (about
22 GB raw).  However, given the uncurated nature of tumblr posts, I found that
there were tons of broken images.  Going through them, I noticed a few common
themes, including:

- empty files (I assume from 404s)
- malformed files (just binary crap)
- HTML (also mostly 404s from sites that don't obey `Accept` HTTP headers)
- Non-standard images like `.raw` and `.tiff`

After processing the database several times with multiple scripts that checked
various heuristics like file extension and guessing MIME encoding, I found
that the single most useful way of checking images is having
[Python Pillow](https://python-pillow.org/) parse the image binary:

```python
# Given an image path
path = "abcd.gif"

# Have PIL verify the image
from PIL import Image
Image.open(path).verify()
```

After filtering images and removing duplicates, I was able to bring the image
databaes down to 8 GB.

Thanks to these sites for providing data:

- [devopsreactions](https://devopsreactions.tumblr.com/)
- [lifeofasoftwareengineer](https://lifeofasoftwareengineer.tumblr.com/)
- [dbareactions](https://dbareactions.tumblr.com/)
- [securityreactions](https://securityreactions.tumblr.com/)