app/notes/20190225-0559.md
Scraping Images From Tumblr
scraping-tumblr-images
1551074385
With my [Reaction.Pics](https://www.reaction.pics/) project, I had to scrape a
bunch of tumblr accounts for data to assemble its database. Since I was trying
to not hotlink to thousands of images, I made local copies of images (about
22 GB raw). However, given the uncurated nature of tumblr posts, I found that
there were tons of broken images. Going through them, I noticed a few common
themes, including:
- empty files (I assume from 404s)
- malformed files (just binary crap)
- HTML (also mostly 404s from sites that don't obey `Accept` HTTP headers)
- Non-standard images like `.raw` and `.tiff`
After processing the database several times with multiple scripts that checked
various heuristics like file extension and guessing MIME encoding, I found
that the single most useful way of checking images is having
[Python Pillow](https://python-pillow.org/) parse the image binary:
```python
# Given an image path
path = "abcd.gif"
# Have PIL verify the image
from PIL import Image
Image.open(path).verify()
```
After filtering images and removing duplicates, I was able to bring the image
databaes down to 8 GB.
Thanks to these sites for providing data:
- [devopsreactions](https://devopsreactions.tumblr.com/)
- [lifeofasoftwareengineer](https://lifeofasoftwareengineer.tumblr.com/)
- [dbareactions](https://dbareactions.tumblr.com/)
- [securityreactions](https://securityreactions.tumblr.com/)