{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "Traditional job sites can be overwhelming. Lambda School graduates can save time and reduce anxiety by focusing on companies that understand the unique value Lambda graduates bring as employees. We are therefore developing a website where Lambda School students and alumni can post about their company and interview experiences and find helpful posts that others have made.\n",
    "\n",
    "For its first user feature, the Data Science team is developing automated content moderation. As the website scales, manual content moderation may no longer be a practical way of enforcing content rules. Furthermore, inappropriate content undermines the site's core mission of saving Lambda students time and easing their anxieties. We therefore seek a model that can automatically and accurately classify posts that should be flagged or removed.\n",
    "\n",
    "It is worth noting that many tweets in the data used within this notebook contain hateful or obscene content. Those who may be traumatized by such content may want to avoid reading this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "## Modeling Plan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "At the risk of tautology, data science requires data, so we first assess the data we have available and the data we can get. We face a classic problem in building features for a fledgling site: because we don't have users, we don't have actual user data; however, we will never attract enough users to generate sufficient data if they encounter abusive and hateful content. For this reason, we concluded that starting with a model trained on external data was a viable route.\n",
    "\n",
    "We found several labeled data sets that flag hateful, abusive, or spam content (or some combination of the three). Our initial strategy is to use these to train, validate, and test models. It is worth noting that all three data sets are composed of tweets. Our hope is that patterns of abusive text are similar enough across websites that what a model learns from tweets will transfer to posts on our site.\n",
    "\n",
    "We will load and explore the data, then fit progressively more complex models in order to maximize predictive performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "# Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import spacy\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import roc_curve, roc_auc_score\n",
    "import matplotlib.pyplot as plt\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\Users\\\\ajenk\\\\GitHub\\\\allay-ds\\\\exploration'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "os.getcwd()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "## Data Set One: \"Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "The first data set consists of roughly 100,000 tweets (99,996 rows as loaded in this notebook) labeled for use in the paper \"Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior.\" The text of these tweets has been provided for research use courtesy of the author Antigoni-Maria Founta.\n",
    "\n",
    "https://github.com/ENCASEH2020/hatespeech-twitter\n",
    "\n",
    "@inproceedings{founta2018large,\n",
    "    title={Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior},\n",
    "    author={Founta, Antigoni-Maria and Djouvas, Constantinos and Chatzakou, Despoina and Leontiadis, Ilias and Blackburn, Jeremy and Stringhini, Gianluca and Vakali, Athena and Sirivianos, Michael and Kourtellis, Nicolas},\n",
    "    booktitle={11th International Conference on Web and Social Media, ICWSM 2018},\n",
    "    year={2018},\n",
    "    organization={AAAI Press}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "Collapsed": "false"
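   },
   "outputs": [],
   "source": [
    "# Tab-separated file with no header row; os.path.join keeps the path portable across operating systems.\n",
    "hundred_k_tweets = pd.read_csv(os.path.join(\"data\", \"hatespeech_text_label_vote.csv\"), sep='\\t', header=None, names=[\"tweet\", \"category\", \"votes\"])"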
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet</th>\n",
       "      <th>category</th>\n",
       "      <th>votes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Beats by Dr. Dre urBeats Wired In-Ear Headphon...</td>\n",
       "      <td>spam</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RT @Papapishu: Man it would fucking rule if we...</td>\n",
       "      <td>abusive</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>It is time to draw close to Him &amp;#128591;&amp;#127...</td>\n",
       "      <td>normal</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>if you notice me start to act different or dis...</td>\n",
       "      <td>normal</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Forget unfollowers, I believe in growing. 7 ne...</td>\n",
       "      <td>normal</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               tweet category  votes\n",
       "0  Beats by Dr. Dre urBeats Wired In-Ear Headphon...     spam      4\n",
       "1  RT @Papapishu: Man it would fucking rule if we...  abusive      4\n",
       "2  It is time to draw close to Him &#128591;&#127...   normal      4\n",
       "3  if you notice me start to act different or dis...   normal      5\n",
       "4  Forget unfollowers, I believe in growing. 7 ne...   normal      3"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hundred_k_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(99996, 3)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hundred_k_tweets.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "normal     53851\n",
       "abusive    27150\n",
       "spam       14030\n",
       "hateful     4965\n",
       "Name: category, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hundred_k_tweets['category'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "For our purposes, the model should identify content that is abusive, spam, or hateful. The \"votes\" column is the number of human annotators who chose the majority label. This is a feature we may want to incorporate into the model down the line (examples where all five annotators agreed that an item is appropriate should plausibly be weighted differently in training than examples where they were split). For the purposes of building a baseline model, I'll simplify the data sets down to a binary target and the body of text.\n",
    "\n",
    "A note: emojis appear in the text as HTML numeric character references of the form \"&#NNNNNN;\", where each N is a digit. As advanced models are capable of learning these patterns, I leave them as is for the time being. Human readers should simply interpret these patterns as emojis wherever they appear."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "hundred_k_tweets[\"inappropriate\"] = (hundred_k_tweets[\"category\"].isin([\"spam\", \"abusive\", \"hateful\"]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet</th>\n",
       "      <th>inappropriate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Beats by Dr. Dre urBeats Wired In-Ear Headphon...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RT @Papapishu: Man it would fucking rule if we...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>It is time to draw close to Him &amp;#128591;&amp;#127...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>if you notice me start to act different or dis...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Forget unfollowers, I believe in growing. 7 ne...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               tweet  inappropriate\n",
       "0  Beats by Dr. Dre urBeats Wired In-Ear Headphon...           True\n",
       "1  RT @Papapishu: Man it would fucking rule if we...           True\n",
       "2  It is time to draw close to Him &#128591;&#127...          False\n",
       "3  if you notice me start to act different or dis...          False\n",
       "4  Forget unfollowers, I believe in growing. 7 ne...          False"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hundred_k_tweets = hundred_k_tweets.drop([\"category\", \"votes\"], axis=1)\n",
    "hundred_k_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False    0.538532\n",
       "True     0.461468\n",
       "Name: inappropriate, dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hundred_k_tweets[\"inappropriate\"].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "## Data Set Two: \"Automated Hate Speech Detection and the Problem of Offensive Language\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "additional_tweets = pd.read_csv(os.path.join(\"data\", \"labeled_data.csv\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>count</th>\n",
       "      <th>hate_speech</th>\n",
       "      <th>offensive_language</th>\n",
       "      <th>neither</th>\n",
       "      <th>class</th>\n",
       "      <th>tweet</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>!!! RT @mayasolovely: As a woman you shouldn't...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!! RT @mleew17: boy dats cold...tyga dwn ba...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \\\n",
       "0           0      3            0                   0        3      2   \n",
       "1           1      3            0                   3        0      1   \n",
       "2           2      3            0                   3        0      1   \n",
       "3           3      3            0                   2        1      1   \n",
       "4           4      6            0                   6        0      1   \n",
       "\n",
       "                                               tweet  \n",
       "0  !!! RT @mayasolovely: As a woman you shouldn't...  \n",
       "1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  \n",
       "2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...  \n",
       "3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...  \n",
       "4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "additional_tweets = additional_tweets.drop([\"Unnamed: 0\"], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(24783, 6)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "This data set is composed of 24,783 tweets that have been manually labeled by CrowdFlower users. \"count\" is the number of users who voted on each tweet; \"hate_speech\", \"offensive_language\", and \"neither\" are the numbers of votes for each category. \"class\" is the majority label.\n",
    "\n",
    "https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    19190\n",
       "2     4163\n",
       "0     1430\n",
       "Name: class, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets[\"class\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "In the \"class\" column, \"0\" represents hate speech, \"1\" represents offensive language, and \"2\" represents neither.\n",
    "\n",
    "It's worth noting that inappropriate tweets are far more common in both data sets than they are in the real world: about 46% of the first data set and about 83% of the second (classes 0 and 1 combined) are labeled inappropriate. This is something to be cognizant of when training the model: a weakly predictive model will tend to fall back on the base rate of its training data, which in this case would flag many more posts than is desirable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>hate_speech</th>\n",
       "      <th>offensive_language</th>\n",
       "      <th>neither</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>24783.000000</td>\n",
       "      <td>24783.000000</td>\n",
       "      <td>24783.000000</td>\n",
       "      <td>24783.000000</td>\n",
       "      <td>24783.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>3.243473</td>\n",
       "      <td>0.280515</td>\n",
       "      <td>2.413711</td>\n",
       "      <td>0.549247</td>\n",
       "      <td>1.110277</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.883060</td>\n",
       "      <td>0.631851</td>\n",
       "      <td>1.399459</td>\n",
       "      <td>1.113299</td>\n",
       "      <td>0.462089</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>9.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              count   hate_speech  offensive_language       neither  \\\n",
       "count  24783.000000  24783.000000        24783.000000  24783.000000   \n",
       "mean       3.243473      0.280515            2.413711      0.549247   \n",
       "std        0.883060      0.631851            1.399459      1.113299   \n",
       "min        3.000000      0.000000            0.000000      0.000000   \n",
       "25%        3.000000      0.000000            2.000000      0.000000   \n",
       "50%        3.000000      0.000000            3.000000      0.000000   \n",
       "75%        3.000000      0.000000            3.000000      0.000000   \n",
       "max        9.000000      7.000000            9.000000      9.000000   \n",
       "\n",
       "              class  \n",
       "count  24783.000000  \n",
       "mean       1.110277  \n",
       "std        0.462089  \n",
       "min        0.000000  \n",
       "25%        1.000000  \n",
       "50%        1.000000  \n",
       "75%        1.000000  \n",
       "max        2.000000  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>18766</th>\n",
       "      <th>20246</th>\n",
       "      <th>7706</th>\n",
       "      <th>23560</th>\n",
       "      <th>1600</th>\n",
       "      <th>20292</th>\n",
       "      <th>17020</th>\n",
       "      <th>4107</th>\n",
       "      <th>13926</th>\n",
       "      <th>7909</th>\n",
       "      <th>15488</th>\n",
       "      <th>4867</th>\n",
       "      <th>22464</th>\n",
       "      <th>20474</th>\n",
       "      <th>16174</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hate_speech</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>offensive_language</th>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neither</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>tweet</th>\n",
       "      <td>RT @chvmpagne: Bitches that dress like this pu...</td>\n",
       "      <td>RT @tupactopus: bad bitches get in free night ...</td>\n",
       "      <td>Amennn!!!!! \"@WomenLoveBrickz: If your girlfri...</td>\n",
       "      <td>all these beautiful bitches, sucha beautiful t...</td>\n",
       "      <td>&amp;#8220;@WEEEDITH: All I want is bitches, big b...</td>\n",
       "      <td>RT @vinnycrack: this bitch got the itunes term...</td>\n",
       "      <td>RT @RakwonOGOD: Lmaoo bitch what? http://t.co/...</td>\n",
       "      <td>@Mijo_LGI you're such a bitch</td>\n",
       "      <td>Pull up on my ex make that bitch mad</td>\n",
       "      <td>Bad bitch, chest out......no wonder why Miss R...</td>\n",
       "      <td>RT @HilariousSelfie: how ugly bitches take sel...</td>\n",
       "      <td>@Tee_Bizzle i aint shit, you aint shit...bitch...</td>\n",
       "      <td>Walk in the party boot up yo hoe</td>\n",
       "      <td>Ray Rice is a bitch &amp;amp; his wife is stupid f...</td>\n",
       "      <td>RT @LeezyTheWarrior: Can't be letting them mes...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                18766  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @chvmpagne: Bitches that dress like this pu...   \n",
       "\n",
       "                                                                20246  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @tupactopus: bad bitches get in free night ...   \n",
       "\n",
       "                                                                7706   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               Amennn!!!!! \"@WomenLoveBrickz: If your girlfri...   \n",
       "\n",
       "                                                                23560  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               all these beautiful bitches, sucha beautiful t...   \n",
       "\n",
       "                                                                1600   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               &#8220;@WEEEDITH: All I want is bitches, big b...   \n",
       "\n",
       "                                                                20292  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @vinnycrack: this bitch got the itunes term...   \n",
       "\n",
       "                                                                17020  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @RakwonOGOD: Lmaoo bitch what? http://t.co/...   \n",
       "\n",
       "                                            4107   \\\n",
       "count                                           9   \n",
       "hate_speech                                     0   \n",
       "offensive_language                              9   \n",
       "neither                                         0   \n",
       "class                                           1   \n",
       "tweet               @Mijo_LGI you're such a bitch   \n",
       "\n",
       "                                                   13926  \\\n",
       "count                                                  9   \n",
       "hate_speech                                            0   \n",
       "offensive_language                                     9   \n",
       "neither                                                0   \n",
       "class                                                  1   \n",
       "tweet               Pull up on my ex make that bitch mad   \n",
       "\n",
       "                                                                7909   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               Bad bitch, chest out......no wonder why Miss R...   \n",
       "\n",
       "                                                                15488  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @HilariousSelfie: how ugly bitches take sel...   \n",
       "\n",
       "                                                                4867   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               @Tee_Bizzle i aint shit, you aint shit...bitch...   \n",
       "\n",
       "                                               22464  \\\n",
       "count                                              9   \n",
       "hate_speech                                        0   \n",
       "offensive_language                                 9   \n",
       "neither                                            0   \n",
       "class                                              1   \n",
       "tweet               Walk in the party boot up yo hoe   \n",
       "\n",
       "                                                                20474  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  9   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               Ray Rice is a bitch &amp; his wife is stupid f...   \n",
       "\n",
       "                                                                16174  \n",
       "count                                                               9  \n",
       "hate_speech                                                         0  \n",
       "offensive_language                                                  9  \n",
       "neither                                                             0  \n",
       "class                                                               1  \n",
       "tweet               RT @LeezyTheWarrior: Can't be letting them mes...  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.sort_values(by=\"offensive_language\", ascending=False).head(15).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>6171</th>\n",
       "      <th>15658</th>\n",
       "      <th>10451</th>\n",
       "      <th>15809</th>\n",
       "      <th>3869</th>\n",
       "      <th>19136</th>\n",
       "      <th>6378</th>\n",
       "      <th>3404</th>\n",
       "      <th>23633</th>\n",
       "      <th>12949</th>\n",
       "      <th>14973</th>\n",
       "      <th>17488</th>\n",
       "      <th>5749</th>\n",
       "      <th>9357</th>\n",
       "      <th>9206</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>7</td>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hate_speech</th>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>offensive_language</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neither</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>tweet</th>\n",
       "      <td>@infidelpamelaLC I'm going to blame the black ...</td>\n",
       "      <td>RT @Isa__Lopez: @D_Lo520 but you're still a fa...</td>\n",
       "      <td>I hate fat bitches</td>\n",
       "      <td>RT @JihadistJoe: We Muslims have no military h...</td>\n",
       "      <td>@L1LTR4P fucking losers wetbacks #SorryNotSorry</td>\n",
       "      <td>RT @iBeZo: Stupid fucking nigger LeBron. You f...</td>\n",
       "      <td>@kcSnowWhite7 @SamSaunders42 don't forget napp...</td>\n",
       "      <td>@Hovaa_ ya I know all the slang I'm racist I h...</td>\n",
       "      <td>bitch kill yoself, go on to the bathroom and e...</td>\n",
       "      <td>My advice of the day: If your a tranny.....go ...</td>\n",
       "      <td>RT @DefendWallSt: Tell me how you really feel....</td>\n",
       "      <td>RT @SwaaggyyV: Fucking chinks in Clash of Clan...</td>\n",
       "      <td>@clinchmtn316 @sixonesixband AMERICA today, th...</td>\n",
       "      <td>GEEZ..... I think #NorthKorea may be right. #B...</td>\n",
       "      <td>From now on, I will call all radical MUSLIMS n...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                6171   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         7   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             1   \n",
       "class                                                               0   \n",
       "tweet               @infidelpamelaLC I'm going to blame the black ...   \n",
       "\n",
       "                                                                15658  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         7   \n",
       "offensive_language                                                  2   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               RT @Isa__Lopez: @D_Lo520 but you're still a fa...   \n",
       "\n",
       "                                 10451  \\\n",
       "count                                9   \n",
       "hate_speech                          7   \n",
       "offensive_language                   2   \n",
       "neither                              0   \n",
       "class                                0   \n",
       "tweet               I hate fat bitches   \n",
       "\n",
       "                                                                15809  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         6   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             3   \n",
       "class                                                               0   \n",
       "tweet               RT @JihadistJoe: We Muslims have no military h...   \n",
       "\n",
       "                                                              3869   \\\n",
       "count                                                             7   \n",
       "hate_speech                                                       6   \n",
       "offensive_language                                                1   \n",
       "neither                                                           0   \n",
       "class                                                             0   \n",
       "tweet               @L1LTR4P fucking losers wetbacks #SorryNotSorry   \n",
       "\n",
       "                                                                19136  \\\n",
       "count                                                               6   \n",
       "hate_speech                                                         6   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               RT @iBeZo: Stupid fucking nigger LeBron. You f...   \n",
       "\n",
       "                                                                6378   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         6   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               @kcSnowWhite7 @SamSaunders42 don't forget napp...   \n",
       "\n",
       "                                                                3404   \\\n",
       "count                                                               6   \n",
       "hate_speech                                                         6   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               @Hovaa_ ya I know all the slang I'm racist I h...   \n",
       "\n",
       "                                                                23633  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  4   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               bitch kill yoself, go on to the bathroom and e...   \n",
       "\n",
       "                                                                12949  \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  4   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               My advice of the day: If your a tranny.....go ...   \n",
       "\n",
       "                                                                14973  \\\n",
       "count                                                               6   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               RT @DefendWallSt: Tell me how you really feel....   \n",
       "\n",
       "                                                                17488  \\\n",
       "count                                                               6   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               RT @SwaaggyyV: Fucking chinks in Clash of Clan...   \n",
       "\n",
       "                                                                5749   \\\n",
       "count                                                               9   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             4   \n",
       "class                                                               0   \n",
       "tweet               @clinchmtn316 @sixonesixband AMERICA today, th...   \n",
       "\n",
       "                                                                9357   \\\n",
       "count                                                               6   \n",
       "hate_speech                                                         5   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             0   \n",
       "class                                                               0   \n",
       "tweet               GEEZ..... I think #NorthKorea may be right. #B...   \n",
       "\n",
       "                                                                9206   \n",
       "count                                                               6  \n",
       "hate_speech                                                         5  \n",
       "offensive_language                                                  1  \n",
       "neither                                                             0  \n",
       "class                                                               0  \n",
       "tweet               From now on, I will call all radical MUSLIMS n...  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.sort_values(by=\"hate_speech\", ascending=False).head(15).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>15719</th>\n",
       "      <th>15718</th>\n",
       "      <th>15717</th>\n",
       "      <th>15715</th>\n",
       "      <th>15713</th>\n",
       "      <th>15712</th>\n",
       "      <th>15710</th>\n",
       "      <th>15709</th>\n",
       "      <th>15708</th>\n",
       "      <th>15705</th>\n",
       "      <th>15704</th>\n",
       "      <th>15703</th>\n",
       "      <th>15702</th>\n",
       "      <th>15700</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hate_speech</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>offensive_language</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neither</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>tweet</th>\n",
       "      <td>!!! RT @mayasolovely: As a woman you shouldn't...</td>\n",
       "      <td>RT @JZolly23: Kickin trash cans on the golf ca...</td>\n",
       "      <td>RT @JStac825: There's coon classic (R. Kelly, ...</td>\n",
       "      <td>RT @JRsBBQ: TV wrestling villains must lie, se...</td>\n",
       "      <td>RT @JOscarJr: @paullemat Happy Birthday! Take ...</td>\n",
       "      <td>RT @JOEL9ONE: Thanks Carolina fans 4 the flipp...</td>\n",
       "      <td>RT @JMuggaaa: \"@TheKaosYatti: &amp;#8220;@JMuggaaa...</td>\n",
       "      <td>RT @JMK0728: Lmao...........poor pussy!!!!! ht...</td>\n",
       "      <td>RT @JLewyville: &amp;#128563; @Treslyon: Keyair an...</td>\n",
       "      <td>RT @JLM_2014: It ain't nothin to cut that bitc...</td>\n",
       "      <td>RT @JFlocka: Need a down bitch to bring me pizza</td>\n",
       "      <td>RT @JFish13: Where has this been all year? But...</td>\n",
       "      <td>RT @JETzLyfe412: Started off wit nuttin I was ...</td>\n",
       "      <td>RT @JEN_JEN_2014: My pussy is totes adorbs whe...</td>\n",
       "      <td>RT @JDYDFF: know the bitch before you call you...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                0      \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             3   \n",
       "class                                                               2   \n",
       "tweet               !!! RT @mayasolovely: As a woman you shouldn't...   \n",
       "\n",
       "                                                                15719  \\\n",
       "count                                                               4   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             3   \n",
       "class                                                               2   \n",
       "tweet               RT @JZolly23: Kickin trash cans on the golf ca...   \n",
       "\n",
       "                                                                15718  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             2   \n",
       "class                                                               2   \n",
       "tweet               RT @JStac825: There's coon classic (R. Kelly, ...   \n",
       "\n",
       "                                                                15717  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  2   \n",
       "neither                                                             1   \n",
       "class                                                               1   \n",
       "tweet               RT @JRsBBQ: TV wrestling villains must lie, se...   \n",
       "\n",
       "                                                                15715  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  1   \n",
       "neither                                                             2   \n",
       "class                                                               2   \n",
       "tweet               RT @JOscarJr: @paullemat Happy Birthday! Take ...   \n",
       "\n",
       "                                                                15713  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             3   \n",
       "class                                                               2   \n",
       "tweet               RT @JOEL9ONE: Thanks Carolina fans 4 the flipp...   \n",
       "\n",
       "                                                                15712  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JMuggaaa: \"@TheKaosYatti: &#8220;@JMuggaaa...   \n",
       "\n",
       "                                                                15710  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JMK0728: Lmao...........poor pussy!!!!! ht...   \n",
       "\n",
       "                                                                15709  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JLewyville: &#128563; @Treslyon: Keyair an...   \n",
       "\n",
       "                                                                15708  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JLM_2014: It ain't nothin to cut that bitc...   \n",
       "\n",
       "                                                               15705  \\\n",
       "count                                                              3   \n",
       "hate_speech                                                        0   \n",
       "offensive_language                                                 3   \n",
       "neither                                                            0   \n",
       "class                                                              1   \n",
       "tweet               RT @JFlocka: Need a down bitch to bring me pizza   \n",
       "\n",
       "                                                                15704  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  0   \n",
       "neither                                                             3   \n",
       "class                                                               2   \n",
       "tweet               RT @JFish13: Where has this been all year? But...   \n",
       "\n",
       "                                                                15703  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JETzLyfe412: Started off wit nuttin I was ...   \n",
       "\n",
       "                                                                15702  \\\n",
       "count                                                               3   \n",
       "hate_speech                                                         0   \n",
       "offensive_language                                                  3   \n",
       "neither                                                             0   \n",
       "class                                                               1   \n",
       "tweet               RT @JEN_JEN_2014: My pussy is totes adorbs whe...   \n",
       "\n",
       "                                                                15700  \n",
       "count                                                               3  \n",
       "hate_speech                                                         0  \n",
       "offensive_language                                                  3  \n",
       "neither                                                             0  \n",
       "class                                                               1  \n",
       "tweet               RT @JDYDFF: know the bitch before you call you...  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.sort_values(by=\"hate_speech\", ascending=True).head(15).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "additional_tweets[\"inappropriate\"] = additional_tweets[\"class\"] != 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>hate_speech</th>\n",
       "      <th>offensive_language</th>\n",
       "      <th>neither</th>\n",
       "      <th>class</th>\n",
       "      <th>tweet</th>\n",
       "      <th>inappropriate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>!!! RT @mayasolovely: As a woman you shouldn't...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!! RT @mleew17: boy dats cold...tyga dwn ba...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   count  hate_speech  offensive_language  neither  class  \\\n",
       "0      3            0                   0        3      2   \n",
       "1      3            0                   3        0      1   \n",
       "2      3            0                   3        0      1   \n",
       "3      3            0                   2        1      1   \n",
       "4      6            0                   6        0      1   \n",
       "\n",
       "                                               tweet  inappropriate  \n",
       "0  !!! RT @mayasolovely: As a woman you shouldn't...          False  \n",
       "1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...           True  \n",
       "2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...           True  \n",
       "3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...           True  \n",
       "4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...           True  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "additional_tweets = additional_tweets.drop([\"count\", \"hate_speech\", \"offensive_language\", \"neither\", \"class\"], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet</th>\n",
       "      <th>inappropriate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>!!! RT @mayasolovely: As a woman you shouldn't...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>!!!!! RT @mleew17: boy dats cold...tyga dwn ba...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               tweet  inappropriate\n",
       "0  !!! RT @mayasolovely: As a woman you shouldn't...          False\n",
       "1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...           True\n",
       "2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...           True\n",
       "3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...           True\n",
       "4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...           True"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True     20620\n",
       "False     4163\n",
       "Name: inappropriate, dtype: int64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tweets[\"inappropriate\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "## Data Set 3: Kaggle Tweets - https://www.kaggle.com/vkrahul/twitter-hate-speech#train_E6oV3lV.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "kaggle_tweets = pd.read_csv(\"data/train_E6oV3lV.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>label</th>\n",
       "      <th>tweet</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>@user when a father is dysfunctional and is s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>bihday your majesty</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>#model   i love u take with u all the time in ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>factsguide: society now    #motivation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id  label                                              tweet\n",
       "0   1      0   @user when a father is dysfunctional and is s...\n",
       "1   2      0  @user @user thanks for #lyft credit i can't us...\n",
       "2   3      0                                bihday your majesty\n",
       "3   4      0  #model   i love u take with u all the time in ...\n",
       "4   5      0             factsguide: society now    #motivation"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kaggle_tweets.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "kaggle_tweets[\"inappropriate\"] = kaggle_tweets[\"label\"]\n",
    "kaggle_tweets = kaggle_tweets.drop([\"id\", \"label\"], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    29720\n",
       "1     2242\n",
       "Name: inappropriate, dtype: int64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kaggle_tweets[\"inappropriate\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet</th>\n",
       "      <th>inappropriate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@user when a father is dysfunctional and is s...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>bihday your majesty</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#model   i love u take with u all the time in ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>factsguide: society now    #motivation</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               tweet  inappropriate\n",
       "0   @user when a father is dysfunctional and is s...              0\n",
       "1  @user @user thanks for #lyft credit i can't us...              0\n",
       "2                                bihday your majesty              0\n",
       "3  #model   i love u take with u all the time in ...              0\n",
       "4             factsguide: society now    #motivation              0"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kaggle_tweets.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "Lastly, we add a data set of tweets sourced from Kaggle. It contains far more appropriate tweets (29,720) than inappropriate ones (2,242), which helps counterbalance the previous data set's skew toward inappropriate tweets (20,620 inappropriate versus 4,163 appropriate)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "# Merging Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "dfs = [hundred_k_tweets, additional_tweets, kaggle_tweets]\n",
    "df = pd.concat(dfs, ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(156741, 2)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "Some tweets appear more than once in the combined data. When the duplicates carry identical labels, we keep one copy and drop the rest. When the duplicates carry contradictory labels, the tweets are all inappropriate, so we drop the copies labeled appropriate and keep the ones labeled inappropriate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
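    "# Drop exact duplicates (same tweet and same label), keeping the first copy.\n",
    "df = df.drop_duplicates(subset=['tweet', 'inappropriate'])\n",
    "# Cast to bool first: ~ on 0/1 integer labels is a bitwise NOT, not a logical negation.\n",
    "appropriate = ~df['inappropriate'].astype(bool)\n",
    "# Remaining duplicate tweets have contradictory labels; drop the copies labeled appropriate.\n",
    "dupe_tweet = df.duplicated(subset=['tweet'], keep=False)\n",
    "df = df[~(dupe_tweet & appropriate)].copy()"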
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet</th>\n",
       "      <th>inappropriate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Beats by Dr. Dre urBeats Wired In-Ear Headphon...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RT @Papapishu: Man it would fucking rule if we...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>It is time to draw close to Him &amp;#128591;&amp;#127...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>if you notice me start to act different or dis...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Forget unfollowers, I believe in growing. 7 ne...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               tweet  inappropriate\n",
       "0  Beats by Dr. Dre urBeats Wired In-Ear Headphon...              1\n",
       "1  RT @Papapishu: Man it would fucking rule if we...              1\n",
       "2  It is time to draw close to Him &#128591;&#127...              0\n",
       "3  if you notice me start to act different or dis...              0\n",
       "4  Forget unfollowers, I believe in growing. 7 ne...              0"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.duplicated(subset=['tweet']).any()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": [
    "# df.to_csv('combined_deduped.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "## Model Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "Collapsed": "false"
   },
   "source": [
    "With the combined, deduplicated data saved to a CSV file (`combined_deduped.csv`), we can move on to model training. That process is explored in `exploration/train_models.ipynb`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "Collapsed": "false"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "allay-ds-cRyEcJS9",
   "language": "python",
   "name": "allay-ds-cryecjs9"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}