Lambda-School-Labs/Labs26-StorySquad-DS-TeamB

View on GitHub
notebooks/count_spelling_errors.ipynb

Summary

Maintainability
Test Coverage
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python_defaultSpec_1600891421656",
   "display_name": "Python 3.7.6 64-bit ('base': conda)"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Implementing a Spell Checker after transcriptions\n",
    "### Testing multiple functions to count spelling mistakes in UGC\n",
    "- For now we decided not to use this. The google transcriptions do not give an accurate enough representation of the children's writing to get an accurate count of spelling errrs. Seems to add in extra errors for bad handwriting"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import pandas as pd\n",
    "import pkg_resources\n",
    "from symspellpy import SymSpell, Verbosity\n",
    "from spellchecker import SpellChecker\n",
    "import string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   Unnamed: 0  Submission ID  \\\n0           0           3132   \n1           1           3104   \n2           2           3103   \n3           3           3117   \n4           4           3102   \n\n                                    Transcribed Text  \n0  Page. I 3132 Once there was a little cheatah a...  \n1  3106 D she was very, a berenang The pony that ...  \n2  3103 Rainbow the Unica unicom named some een P...  \n3  3117 O gum drop land gumdrop. land is prace We...  \n4  3102 The secret fifth grade E am Anella I am s...  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>Submission ID</th>\n      <th>Transcribed Text</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>3132</td>\n      <td>Page. I 3132 Once there was a little cheatah a...</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>3104</td>\n      <td>3106 D she was very, a berenang The pony that ...</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>3103</td>\n      <td>3103 Rainbow the Unica unicom named some een P...</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3</td>\n      <td>3117</td>\n      <td>3117 O gum drop land gumdrop. land is prace We...</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>4</td>\n      <td>3102</td>\n      <td>3102 The secret fifth grade E am Anella I am s...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "# Load transribed stories df\n",
    "df = pd.read_csv('transcribed_stories.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check spell check for the first entry\n",
    "input_term = df['Transcribed Text'][0]"
   ]
  },
  {
   "source": [
    "### Symspell package\n",
    "- Slow"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "True"
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "# Load dictionary\n",
    "sym_spell = SymSpell(max_dictionary_edit_distance=6, prefix_length=7, count_threshold= 15)\n",
    "dictionary_path = pkg_resources.resource_filename(\n",
    "    \"symspellpy\", \"frequency_dictionary_en_82_765.txt\")\n",
    "bigram_path = pkg_resources.resource_filename(\n",
    "    \"symspellpy\", \"frequency_bigramdictionary_en_243_342.txt\")\n",
    "\n",
    "sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)\n",
    "sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)"
   ]
  },
  {
   "source": [
    "# Run transcribed text through Symspell to get spelling suggestions back\n",
    "suggestions = sym_spell.lookup_compound(input_term, max_edit_distance=6, ignore_term_with_digits= True, ignore_non_words= True)\n",
    "\n",
    "suggestions"
   ],
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "------\npage i 3132 once there was a little cheetah and the cheetah had a best friend lion the cheetahs name was paws and the lions name was dylan they always played with each other after they went hunting dylan went to play with paws after hunting with pack and paws wont to play after hunting with his mon they always met up near the same rock art the same lake they would talk and have water fights here is what the talked about your parents and my pack might have a fight said dylan we both don't want that to happen we might have to fight against each other said paws after a few minutes it was time for them bale to their family next morning they saw there family having a fight dylan family won and plus mon was here at their us all he meeting time they talked about it my mon got here its not fair said paws i think the best way to fix this is to tell are family we are friend said dylan i think your right said paws so after there talk time and water fight they went back to there family and told we have been friends since 3 years ago said paws and dylan to their family okay their parents said i will let you goy switch spot for one day and we will see to them page a 3132 they were going to get paws spend time with bylaws pack and dylan spend time with paws mon it went well and everyone was happy and now the family all went to the lake the end, 89, 0\n"
    }
   ],
   "source": [
    "for suggestion in suggestions:\n",
    "    print('------')\n",
    "    print(suggestion)"
   ]
  },
  {
   "source": [
    "### Python Spellchecker"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def spell_check(text):\n",
    "    # Initialize spellchecker\n",
    "    spell = SpellChecker()\n",
    "\n",
    "    # Strip the punctuation\n",
    "    exclude = set(string.punctuation)\n",
    "    text = ''.join(ch for ch in text if ch not in exclude)\n",
    "\n",
    "    # Split text into list of words\n",
    "    text = text.split()\n",
    "\n",
    "    # Find misspelled words\n",
    "    misspelled = spell.unknown(text)\n",
    "\n",
    "    # Get corrections\n",
    "    for word in misspelled:\n",
    "        # Get the most likely correct spelling\n",
    "        print(f'{word}: {spell.correction(word)}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_term = df['Transcribed Text'][4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "\"3102 The secret fifth grade E am Anella I am starting fifth grade I have a little ster named Emma, Emma is goining into kindergarden withiber ter plete Ara and Isabella, but I don't care about theme when I was a 3rd grader the learned how to invent this year I will try to make an invention that arobot will go to school forme Got to go to sleep. Ezzzzz. I wake up get changed, brush teeth, and do hain I have long curly brown hair. I used to wear glasses and braces, and once broke my leg I am 10, I will turn il in November. The school is called i Los Angeles elementry school, I live in Nevada and will drive there. It is probably an hour When I walked in Mrs. Begula said Hello I Mrsklapen, my last name. I said hi I have to share a desk with Beltness Ay Schedule: Math writting reading vocab spelling typing, and social stadies. It had said No Lunch in Fifth grade! When I was in second grade Kayla called me a division Decimal I did not like it. we had to make recimals. Class dismissed said division problems and they were all with Now I was not a witch wampice or godess, but I Begular made asock from Emania and made it do my homework. When it was time to go Mrs. came 3102 I didn't I made a blue robot the whole night Blue is my favorite color, but it only had a blue shirt because my teacher would hotice it is not me, I sent it out with my mom. I went on my phone, mtLAVA in my soom She said loudly Why are you not at school. Good thing The did not notice. I had to say something un I am sick of (of cof, oht see let me tell Isabella Isabella I heard she said Mo suid Ava. Should I tell Mom if she's really sick um no said Isabella play with me. Isabella already new about it. It the robot came back. and that is my year of fifth grade. When the home work is home you are doing my home world It worker yay \""
     },
     "metadata": {},
     "execution_count": 32
    }
   ],
   "source": [
    "df['Transcribed Text'][4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "kayla: layla\nrecimals: decimals\nhotice: notice\nwithiber: wither\ngoining: joining\n3rd: ord\nezzzzz: ezzzzz\narobot: robot\nsuid: said\nshes: she\nsoom: room\ngodess: goddess\nasock: sock\nmtlava: lava\nwampice: vampire\nster: step\nkindergarden: kindergarten\nanella: nella\nwritting: writing\noht: out\ncof: of\nbeltness: boldness\ndidnt: dint\nbegula: betula\nelementry: elementary\nvocab: vocal\nstadies: studies\nmrsklapen: mrsklapen\nbegular: regular\nplete: plate\nemania: mania\n"
    }
   ],
   "source": [
    "# Return identified typo and the suggested correction\n",
    "spell_check(input_term)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ]
}