{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Movie Parsing\n",
    "\n",
    "## Disclaimer: this code takes a couple of hours to run.\n",
    "## You can download the parsed data [here](https://drive.google.com/open?id=1t0LNCbqLjiLkAMFwtP8OIYU-zPUCNAjK)\n",
    "\n",
    "\n",
    "## OMDB\n",
    "OMDB is the Open Movie Database. Although it is open, you need to pay 1 dollar to get a key, which lets you send up to 100k requests/day. For 5 dollars you also get access to the poster API.\n",
    "\n",
    "http://www.omdbapi.com/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import requests\n",
    "from tqdm import tqdm_notebook as tqdm\n",
    "import json\n",
    "\n",
    "myOmdbKey = 'your key here' # you need to buy an OMDB key for $1 on Patreon\n",
    "movies = pd.read_csv('../../../../data/ml-20m/links.csv')\n",
    "# left-pad the numeric IMDb ids with zeros to 8 characters\n",
    "movies['imdbId'] = movies['imdbId'].apply(lambda i: str(i).zfill(8))\n",
    "movies['tmdbId'] = movies['tmdbId'].fillna(-1).astype(int).apply(str)\n",
    "movies = movies.set_index('movieId')\n",
    "movies = movies.to_dict(orient='index')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> If `failed` is printed, run the request loop below again until it finishes without failures. Movies that were already fetched are skipped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to resume from a previously saved result\n",
    "# movies = json.load(open(\"../../../../data/parsed/omdb.json\", \"r\") )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c1f10c26028c4326b7c7e6e2e43d1894",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=27278), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
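   ],
   "source": [
    "for id in tqdm(movies.keys()):\n",
    "    imdbId = movies[id]['imdbId']\n",
    "    # skip movies fetched on a previous run, so this cell can be re-run safely\n",
    "    if movies[id].get('omdb', False):\n",
    "        continue\n",
    "    try:\n",
    "        movies[id]['omdb'] = requests.get(\"http://www.omdbapi.com/?i=tt{}&apikey={}&plot=full\".format(imdbId,\n",
    "                                                                                         myOmdbKey)).json()\n",
    "    # network errors or a non-JSON body; a bare except would also swallow KeyboardInterrupt\n",
    "    except (requests.RequestException, ValueError):\n",
    "        print(id, imdbId, 'failed')"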
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"../../../../data/parsed/omdb.json\", \"w\") as write_file:\n",
    "    json.dump(movies, write_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'imdbId': '00114709',\n",
       " 'tmdbId': '862',\n",
       " 'omdb': {'Title': 'Toy Story',\n",
       "  'Year': '1995',\n",
       "  'Rated': 'G',\n",
       "  'Released': '22 Nov 1995',\n",
       "  'Runtime': '81 min',\n",
       "  'Genre': 'Animation, Adventure, Comedy, Family, Fantasy',\n",
       "  'Director': 'John Lasseter',\n",
       "  'Writer': 'John Lasseter (original story by), Pete Docter (original story by), Andrew Stanton (original story by), Joe Ranft (original story by), Joss Whedon (screenplay by), Andrew Stanton (screenplay by), Joel Cohen (screenplay by), Alec Sokolow (screenplay by)',\n",
       "  'Actors': 'Tom Hanks, Tim Allen, Don Rickles, Jim Varney',\n",
       "  'Plot': 'A little boy named Andy loves to be in his room, playing with his toys, especially his doll named \"Woody\". But, what do the toys do when Andy is not with them, they come to life. Woody believes that he has life (as a toy) good. However, he must worry about Andy\\'s family moving, and what Woody does not know is about Andy\\'s birthday party. Woody does not realize that Andy\\'s mother gave him an action figure known as Buzz Lightyear, who does not believe that he is a toy, and quickly becomes Andy\\'s new favorite toy. Woody, who is now consumed with jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are now lost. They must find a way to get back to Andy before he moves without them, but they will have to pass through a ruthless toy killer, Sid Phillips.',\n",
       "  'Language': 'English',\n",
       "  'Country': 'USA',\n",
       "  'Awards': 'Nominated for 3 Oscars. Another 23 wins & 17 nominations.',\n",
       "  'Poster': 'https://m.media-amazon.com/images/M/MV5BMDU2ZWJlMjktMTRhMy00ZTA5LWEzNDgtYmNmZTEwZTViZWJkXkEyXkFqcGdeQXVyNDQ2OTk4MzI@._V1_SX300.jpg',\n",
       "  'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.3/10'},\n",
       "   {'Source': 'Rotten Tomatoes', 'Value': '100%'},\n",
       "   {'Source': 'Metacritic', 'Value': '95/100'}],\n",
       "  'Metascore': '95',\n",
       "  'imdbRating': '8.3',\n",
       "  'imdbVotes': '810,875',\n",
       "  'imdbID': 'tt0114709',\n",
       "  'Type': 'movie',\n",
       "  'DVD': '20 Mar 2001',\n",
       "  'BoxOffice': 'N/A',\n",
       "  'Production': 'Buena Vista',\n",
       "  'Website': 'http://www.disney.com/ToyStory',\n",
       "  'Response': 'True'}}"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies['1']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TMDB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import requests\n",
    "from tqdm import tqdm_notebook as tqdm\n",
    "import json\n",
    "\n",
    "myTmdbKey = 'your key here' # you can get it for free if you ask them nicely\n",
    "movies = pd.read_csv('../../../../data/ml-20m/links.csv')\n",
    "movies['imdbId'] = movies['imdbId'].apply(lambda i: str(i).zfill(8))\n",
    "movies['tmdbId'] = movies['tmdbId'].fillna(-1).astype(int).apply(str)\n",
    "movies = movies.set_index('movieId')\n",
    "movies = movies.to_dict(orient='index')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "# ! pip install aiohttp --user\n",
    "import aiohttp\n",
    "# ! pip install asyncio-throttle --user\n",
    "from asyncio_throttle import Throttler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to resume from a previously saved result\n",
    "# movies = json.load(open(\"../../../../data/parsed/tmdb.json\", \"r\") )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> You can also run this cell multiple times: movies that already have a valid TMDB response are skipped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "aaacbda54580430396f4ae20b114f146",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=27278), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "throttler = Throttler(rate_limit=4, period=2)\n",
    "\n",
    "async def tmdb(session, id, tmdbId):\n",
    "    url = \"https://api.themoviedb.org/3/movie/{}?api_key={}\".format(tmdbId, myTmdbKey)\n",
    "    async with throttler:\n",
    "        async with session.get(url) as resp:\n",
    "            if resp.status == 429:\n",
    "                # rate limited: back off briefly; if the error body stored below\n",
    "                # has a 'status_code' key, main() retries this movie on the next run\n",
    "                print('throttling')\n",
    "                await asyncio.sleep(0.2)\n",
    "            \n",
    "            movies[id]['tmdb'] = await resp.json()\n",
    "    \n",
    "    # this also controls the time span between calls\n",
    "    await asyncio.sleep(0.05)\n",
    "    \n",
    "\n",
    "async def main():\n",
    "    async with aiohttp.ClientSession() as session:\n",
    "        for id in tqdm(movies.keys()):\n",
    "            tmdbId = movies[id]['tmdbId']\n",
    "            # skip movies that already have a response without an error 'status_code'\n",
    "            if movies[id].get('tmdb', False) and 'status_code' not in movies[id]['tmdb']:\n",
    "                continue\n",
    "            await tmdb(session, id, tmdbId)\n",
    "        \n",
    "        \n",
    "if __name__ == '__main__':\n",
    "    # Jupyter already runs an event loop, so main() is scheduled on it as a\n",
    "    # background task; wait for the progress bar to finish before saving\n",
    "    loop = asyncio.get_event_loop()\n",
    "    loop.create_task(main())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"../../../../data/parsed/tmdb.json\", \"w\") as write_file:\n",
    "    json.dump(movies, write_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'imdbId': '00114709',\n",
       " 'tmdbId': '862',\n",
       " 'tmdb': {'adult': False,\n",
       "  'backdrop_path': '/dji4Fm0gCDVb9DQQMRvAI8YNnTz.jpg',\n",
       "  'belongs_to_collection': {'id': 10194,\n",
       "   'name': 'Toy Story Collection',\n",
       "   'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',\n",
       "   'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'},\n",
       "  'budget': 30000000,\n",
       "  'genres': [{'id': 16, 'name': 'Animation'},\n",
       "   {'id': 35, 'name': 'Comedy'},\n",
       "   {'id': 10751, 'name': 'Family'}],\n",
       "  'homepage': 'http://toystory.disney.com/toy-story',\n",
       "  'id': 862,\n",
       "  'imdb_id': 'tt0114709',\n",
       "  'original_language': 'en',\n",
       "  'original_title': 'Toy Story',\n",
       "  'overview': \"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.\",\n",
       "  'popularity': 29.3,\n",
       "  'poster_path': '/rhIRbceoE9lR4veEXuwCC2wARtG.jpg',\n",
       "  'production_companies': [{'id': 3,\n",
       "    'logo_path': '/1TjvGVDMYsj6JBxOAkUHpPEwLf7.png',\n",
       "    'name': 'Pixar',\n",
       "    'origin_country': 'US'}],\n",
       "  'production_countries': [{'iso_3166_1': 'US',\n",
       "    'name': 'United States of America'}],\n",
       "  'release_date': '1995-10-30',\n",
       "  'revenue': 373554033,\n",
       "  'runtime': 81,\n",
       "  'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],\n",
       "  'status': 'Released',\n",
       "  'tagline': '',\n",
       "  'title': 'Toy Story',\n",
       "  'video': False,\n",
       "  'vote_average': 7.9,\n",
       "  'vote_count': 10896}}"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies['1']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}