Lambda-School-Labs/allay-ds
exploration/data2lemma2vec.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains the preprocesing of the 'tweet' column of `data/combined_deduped.csv` into lemmas and vectors using [spaCy](https://spacy.io).\n",
    "\n",
    "When run top to bottom, it creates four new files in the data directory. These files contain the compressed, pickled dataframes for lemmas and vectors of the documents and are named `lemmas_<UTC DATETIME>.pkl.xz` and `vectors_[sm|md|lg]_<UTC DATETIME>.pkl.xz`, respectively. \n",
    "\n",
    "The lemmas file contains separate columns for lemmas created with spaCy's `en-core-web-sm`, `en-core-web-md`, and `en-core-web-lg` models. These columns are named 'sm_lemmas', 'md_vectors', and 'lg_lemmas'.\n",
    "\n",
    "The vectors files are separate files for vectors created with spaCy's `en-core-web-sm`, `en-core-web-md`, and `en-core-web-lg` models. These files are named 'vectors\\_sm\\_...', 'vectors\\_md\\_...', 'vectors\\_lg\\_...'. The vectors are in separate files due to their large size.\n",
    "\n",
    "The dataframes will also contain the 'inappropriate' column from `combined_deduped.csv`.\n",
    "\n",
    "This notebook is intended to make preprocesing of the data that becomes necessary (stop words, normalizing, etc.) a standardized process that results in reproducible and trackable results.\n",
    "\n",
    "**NOTE:** This process takes ~1.5 hours on my local machine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To load these files:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_pickle(filename, compression='xz')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "File sizes (approx.):\n",
    "- lemmas: 12MB\n",
    "- vectors_sm: 50MB\n",
    "- vectors_md: 155MB\n",
    "- vectors_lg: 15MB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from datetime import datetime\n",
    "\n",
    "import spacy\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize spaCy models\n",
    "\n",
    "nlp_sm = spacy.load('en_core_web_sm')\n",
    "nlp_md = spacy.load('en_core_web_md')\n",
    "nlp_lg = spacy.load('en_core_web_lg')"
   ]
  },
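  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three pretrained pipelines above must be installed before `spacy.load` will succeed. If any are missing, they can be fetched once with spaCy's download command (run from a shell, or prefix each line with `!` in a notebook cell):\n",
    "\n",
    "```\n",
    "python -m spacy download en_core_web_sm\n",
    "python -m spacy download en_core_web_md\n",
    "python -m spacy download en_core_web_lg\n",
    "```"
   ]
  },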
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load source training data\n",
    "\n",
    "training_data = pd.read_csv('data/combined_deduped.csv')"
   ]
  },
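  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Everything below assumes the CSV provides a 'tweet' text column and an 'inappropriate' label column. A one-line guard (illustrative, not part of the original pipeline) fails fast if the file ever changes shape:\n",
    "\n",
    "```python\n",
    "assert {'tweet', 'inappropriate'} <= set(training_data.columns)\n",
    "```"
   ]
  },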
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lemmatization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preprocessing for lemmatization\n",
    "# add / remove stop words, normalize, any text processing as necessary\n",
    "\n",
    "# additional tokens to ignore\n",
    "STOP_WORDS = []\n",
    "\n",
    "is_empty_pattern = re.compile(r'^\\s*$')\n",
    "\n",
    "def make_lemmas(nlp, docs):\n",
    "    \"\"\"Creates a list of documents containing the lemmas of each document in the input docs.\n",
    "\n",
    "    :param nlp: spaCy NLP model to use\n",
    "    :param docs: list of documents to lemmatize\n",
    "\n",
    "    :returns: list of lemmatized documents\n",
    "    \"\"\"\n",
    "    lemmas = []\n",
    "    for doc in nlp.pipe(docs, batch_size=500):\n",
    "        doc_lemmas = []\n",
    "        for token in doc:\n",
    "            if (\n",
    "                not token.is_stop\n",
    "                and not token.is_punct\n",
    "                and token.pos_ != 'PRON'\n",
    "                and not is_empty_pattern.match(token.text)\n",
    "                and len(token.lemma_) > 2\n",
    "                and token.lemma_ not in STOP_WORDS\n",
    "            ):\n",
    "                doc_lemmas.append(token.lemma_)\n",
    "        lemmas.append(doc_lemmas)\n",
    "    return lemmas"
   ]
  },
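  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick way to sanity-check `make_lemmas` before committing to the full ~1.5 hour run is to feed it a couple of made-up strings (the sample text below is purely illustrative):\n",
    "\n",
    "```python\n",
    "# Smoke test on two invented example documents\n",
    "sample_docs = ['The cats were running faster than the dogs!',\n",
    "               'RT @user check this out http://example.com']\n",
    "print(make_lemmas(nlp_sm, sample_docs))\n",
    "# Expect short lists of lemmas, with stop words, punctuation,\n",
    "# pronouns, and lemmas of 2 characters or fewer filtered out\n",
    "```"
   ]
  },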
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "Lemmatization with en-core-web-sm\nLemmatization with en-core-web-md\nLemmatization with en-core-web-lg\n"
    }
   ],
   "source": [
    "# Create lemmas with each of the spaCy models\n",
    "\n",
    "lemmas = pd.DataFrame()\n",
    "\n",
    "print('Lemmatization with en-core-web-sm')\n",
    "lemmas['sm_lemmas'] = make_lemmas(nlp_sm, training_data['tweet'])\n",
    "\n",
    "print('Lemmatization with en-core-web-md')\n",
    "lemmas['md_lemmas'] = make_lemmas(nlp_md, training_data['tweet'])\n",
    "\n",
    "print('Lemmatization with en-core-web-lg')\n",
    "lemmas['lg_lemmas'] = make_lemmas(nlp_lg, training_data['tweet'])\n",
    "\n",
    "lemmas['inappropriate'] = training_data['inappropriate']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "Training data shape: (146264, 2)\nLemmas shape: (146264, 4)\n\n"
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "                                           sm_lemmas  \\\n0  [beat, Dr., Dre, urbeat, Wired, Ear, Headphone...   \n1  [@Papapishu, man, fucking, rule, party, perpet...   \n2  [time, draw, close, 128591;&#127995, Father, d...   \n3  [notice, start, act, different, distant, peep,...   \n4  [forget, unfollower, believe, grow, new, follo...   \n\n                                           md_lemmas  \\\n0  [beat, Dr., Dre, urbeat, Wired, ear, Headphone...   \n1  [@Papapishu, man, fucking, rule, party, perpet...   \n2  [time, draw, close, 128591;&#127995, Father, d...   \n3  [notice, start, act, different, distant, peep,...   \n4  [forget, unfollower, believe, grow, new, follo...   \n\n                                           lg_lemmas  inappropriate  \n0  [beat, Dr., Dre, urBeats, wire, Ear, Headphone...           True  \n1  [@Papapishu, man, fucking, rule, party, perpet...           True  \n2  [time, draw, close, 128591;&#127995, Father, d...          False  \n3  [notice, start, act, different, distant, peep,...          False  \n4  [forget, unfollower, believe, grow, new, follo...          False  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>sm_lemmas</th>\n      <th>md_lemmas</th>\n      <th>lg_lemmas</th>\n      <th>inappropriate</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>[beat, Dr., Dre, urbeat, Wired, Ear, Headphone...</td>\n      <td>[beat, Dr., Dre, urbeat, Wired, ear, Headphone...</td>\n      <td>[beat, Dr., Dre, urBeats, wire, Ear, Headphone...</td>\n      <td>True</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>[@Papapishu, man, fucking, rule, party, perpet...</td>\n      <td>[@Papapishu, man, fucking, rule, party, perpet...</td>\n      <td>[@Papapishu, man, fucking, rule, party, perpet...</td>\n      <td>True</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>[time, draw, close, 128591;&amp;#127995, Father, d...</td>\n      <td>[time, draw, close, 128591;&amp;#127995, Father, d...</td>\n      <td>[time, draw, close, 128591;&amp;#127995, Father, d...</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>[notice, start, act, different, distant, peep,...</td>\n      <td>[notice, start, act, different, distant, peep,...</td>\n      <td>[notice, start, act, different, distant, peep,...</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>[forget, unfollower, believe, grow, new, follo...</td>\n      <td>[forget, unfollower, believe, grow, new, follo...</td>\n      <td>[forget, unfollower, believe, grow, new, follo...</td>\n      <td>False</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "# Quick glance verification of proper output\n",
    "\n",
    "print('Training data shape:', training_data.shape)\n",
    "print('Lemmas shape:', lemmas.shape)\n",
    "print()\n",
    "lemmas.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectorization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_vectors(nlp, docs):\n",
    "    \"\"\"Creates a list of documents containing the vectors of each document in the input docs.\n",
    "\n",
    "    :param nlp: spaCy NLP model to use\n",
    "    :param docs: list of documents to vectorize\n",
    "\n",
    "    :returns: list of vectorized documents\n",
    "    \"\"\"\n",
    "    vectors = []\n",
    "    for doc in nlp.pipe(docs, batch_size=500):\n",
    "        vectors.append(doc.vector)\n",
    "    return vectors"
   ]
  },
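  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`doc.vector` is spaCy's document-level vector: the average of the token vectors (or of the model's context tensors when no static vectors are available). A minimal check of what `make_vectors` returns, again on an invented string:\n",
    "\n",
    "```python\n",
    "vec = make_vectors(nlp_md, ['an example tweet'])[0]\n",
    "print(type(vec), vec.shape)  # numpy ndarray of shape (300,) for the md model\n",
    "```"
   ]
  },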
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "Vectorization with en-core-web-sm\n - shape: (146264, 97)\nVectorization with en-core-web-md\n - shape: (146264, 301)\nVectorization with en-core-web-lg\n - shape: (146264, 301)\n"
    }
   ],
   "source": [
    "# Create vectors of the documents with each of the spaCy models\n",
    "\n",
    "print('Vectorization with en-core-web-sm')\n",
    "vectors_sm = pd.DataFrame(make_vectors(nlp_sm, training_data['tweet']))\n",
    "vectors_sm['inappropriate'] = training_data['inappropriate']\n",
    "print(' - shape:', vectors_sm.shape)\n",
    "\n",
    "print('Vectorization with en-core-web-md')\n",
    "vectors_md = pd.DataFrame(make_vectors(nlp_md, training_data['tweet']))\n",
    "vectors_md['inappropriate'] = training_data['inappropriate']\n",
    "print(' - shape:', vectors_md.shape)\n",
    "\n",
    "print('Vectorization with en-core-web-lg')\n",
    "vectors_lg = pd.DataFrame(make_vectors(nlp_lg, training_data['tweet']))\n",
    "vectors_lg['inappropriate'] = training_data['inappropriate']\n",
    "print(' - shape:', vectors_lg.shape)\n"
   ]
  },
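  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shapes printed above differ because `en_core_web_sm` ships without static word vectors, so `doc.vector` falls back to averaging its 96-dimensional context tensors, while `en_core_web_md` and `en_core_web_lg` provide 300-dimensional word vectors; the extra column in each dataframe is 'inappropriate'. This is easy to confirm directly:\n",
    "\n",
    "```python\n",
    "print(nlp_sm('sample text').vector.shape)  # (96,)\n",
    "print(nlp_md('sample text').vector.shape)  # (300,)\n",
    "```"
   ]
  },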
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export lemmas and vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "Creating data/lemmas_2020-05-04-16-27-18Z.pkl.xz\nCreating data/vectors_sm_2020-05-04-16-27-18Z.pkl.xz\nCreating data/vectors_md_2020-05-04-16-27-18Z.pkl.xz\nCreating data/vectors_lg_2020-05-04-16-27-18Z.pkl.xz\n"
    }
   ],
   "source": [
    "# UTC datetime stamp to correlate lemmas and vectors to a specific run\n",
    "utc_now_formatted = datetime.utcnow().strftime(r'%Y-%m-%d-%H-%M-%SZ')\n",
    "\n",
    "lemmas_filename = f'data/lemmas_{utc_now_formatted}.pkl.xz'\n",
    "vectors_sm_filename = f'data/vectors_sm_{utc_now_formatted}.pkl.xz'\n",
    "vectors_md_filename = f'data/vectors_md_{utc_now_formatted}.pkl.xz'\n",
    "vectors_lg_filename = f'data/vectors_lg_{utc_now_formatted}.pkl.xz'\n",
    "\n",
    "print(f'Creating {lemmas_filename}')\n",
    "lemmas.to_pickle(lemmas_filename, compression='xz')\n",
    "\n",
    "print(f'Creating {vectors_sm_filename}')\n",
    "vectors_sm.to_pickle(vectors_sm_filename, compression='xz')\n",
    "\n",
    "print(f'Creating {vectors_md_filename}')\n",
    "vectors_md.to_pickle(vectors_md_filename, compression='xz')\n",
    "\n",
    "print(f'Creating {vectors_lg_filename}')\n",
    "vectors_lg.to_pickle(vectors_lg_filename, compression='xz')"
   ]
  },
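  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the UTC stamp in the filenames is zero-padded, lexicographic order matches chronological order, so a later notebook can pick up the most recent export with a sketch like this (the glob pattern is assumed to match only this notebook's output files):\n",
    "\n",
    "```python\n",
    "import glob\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# max() works because the zero-padded timestamp sorts lexicographically\n",
    "latest = max(glob.glob('data/lemmas_*.pkl.xz'))\n",
    "lemmas = pd.read_pickle(latest, compression='xz')\n",
    "```"
   ]
  }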
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python37664bitallaydspipenv9dc12dab3da348b6bc255254c19ea33e",
   "display_name": "Python 3.7.6 64-bit ('allay-ds': pipenv)"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}