Features/Feature_Engineering1.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compound Features from raw data\n",
"One of the most important parameters that decide irrespective of the learning model used(Perceptron,Decision Trees etc.) is the features or the input data itself.\n",
"If we can arrange the input data in a coherent way and do not take many features that convey more or less the same information we can improve the accuracy of our model to a great extent.\n",
"For example the Title of a person Mr. ,Mrs. carry the same information as that of Sex.So it will be redundant if we account for both in our model.Instead we can extract Titles from Names by RegularExpression matcher"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re\n",
"def get_title(name):\n",
" search=re.search(' ([A-Za-z]+)\\.',name)\n",
" if search:\n",
" return search.group(1)\n",
" #.group returns the string matched by the regular expression if we find a valid one\n",
" return \"\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Mrs'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_title(\"Cumings, Mrs. John Bradley Florence Briggs Thayer\")\n",
"#Demonstration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also make use of the fact that in the Names, the Title is always followed by a \".\"(which is why we use a escape sequence \"/\" to code it in our pattern)."
]
},
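{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of why the dot needs escaping, assuming a made-up name that has no period after the title (`'Braund, Mr Owen'` is hypothetical, not a row from the dataset). With the unescaped pattern the `.` matches the space after `Mr`, so a title is still reported; with the escaped pattern nothing matches."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Hypothetical name with no '.' after the title (not taken from the dataset).\n",
"name = 'Braund, Mr Owen'\n",
"\n",
"# Unescaped: '.' matches any character, so the space after 'Mr' satisfies it.\n",
"unescaped = re.search(r' ([A-Za-z]+).', name)\n",
"# Escaped: '\\.' matches only a literal period, so nothing matches here.\n",
"escaped = re.search(r' ([A-Za-z]+)\\.', name)\n",
"\n",
"print(unescaped.group(1) if unescaped else None)\n",
"print(escaped)"
]
},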
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Title \n",
"0 0 A/5 21171 7.2500 NaN S 0 \n",
"1 0 PC 17599 71.2833 C85 C 0 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 \n",
"3 0 113803 53.1000 C123 S 0 \n",
"4 0 373450 8.0500 NaN S 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'''Now to see how much the features Sex and Title are correlated we will use Pearson's correlation matrix which shows\n",
"correlation between two variables.One must also add a new column Title in the original sheet'''\n",
"import pandas as pd\n",
"train=pd.read_csv('../../titanic_data.csv')\n",
"train.insert(len(train.columns),'Title',value=0)\n",
"train.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train['Title']=train['Name'].apply(get_title)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Mr\n",
"1 Mrs\n",
"2 Miss\n",
"Name: Title, dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train['Title'].head(3)\n",
"#To check whether everything works fine or not"
]
},
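{
"cell_type": "markdown",
"metadata": {},
"source": [
"One simple way to see how much Title overlaps with Sex is a contingency table. Pearson's correlation is defined for numeric variables, so for two categorical columns a cross-tabulation is an easier first check. The sketch below is illustrative only: `toy` and `table` are hypothetical names, and the four rows are copied from the `train.head()` output shown earlier rather than read from the CSV."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy frame with four of the names shown in train.head() (illustrative only).\n",
"toy = pd.DataFrame({\n",
"    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',\n",
"             'Allen, Mr. William Henry', 'Futrelle, Mrs. Jacques Heath'],\n",
"    'Sex': ['male', 'female', 'male', 'female'],\n",
"})\n",
"# With a single capture group and expand=False, str.extract returns a Series.\n",
"toy['Title'] = toy['Name'].str.extract(r' ([A-Za-z]+)\\.', expand=False)\n",
"\n",
"# Rows are titles, columns are sexes, cells are passenger counts.\n",
"table = pd.crosstab(toy['Title'], toy['Sex'])\n",
"print(table)"
]
},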
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Dr', 'Sir', 'Mme', 'Miss', 'Mlle', 'Mrs', 'Rev', 'Mr', 'Lady', 'Master', 'Capt', 'Countess', 'Col', 'Jonkheer', 'Ms', 'Don', 'Major'}\n"
]
}
],
"source": [
"label=set(train['Title'])\n",
"#Checking no. of unique titles\n",
"print(label)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train['Title']=train['Title'].replace('Mlle','Miss')\n",
"train['Title']=train['Title'].replace('Ms','Miss')\n",
"train['Title']=train['Title'].replace('Mme','Miss')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dict={\"Mr\": 1, \"Master\": 2, \"Mrs\": 3, \"Miss\": 4, \"Col\": 5,\n",
" \"Countess\":6,\"Don\":7,\"Sir\":8,\"Lady\":9,\"Capt\":10,\"Rev\":11\n",
" ,\"Jonkheer\":12,\"Major\":13,\"Dr\":14}\n",
"unique_title=(\"Mr\",\"Master\",\"Mrs\",\"Miss\",\"Col\",\"Countess\"\n",
" ,\"Don\",\"Sir\",\"Lady\",\"Capt\",\"Rev\",\"Jonkheer\"\n",
" ,\"Major\",\"Dr\")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train[\"Title\"]=train['Title'].apply(lambda x:dict[x])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAADxJJREFUeJzt3X+s3Xddx/HnixbHz7gtu1xrW7yNaTAdkY3c1OmMUSqs\nOkL311IipMYl/afqMCSkxUTjHzU1GoREp2kG7iZMmoYfWcMQqWWEmCDjbgy2dsw2bKOt7XqBIKDJ\ntOPtH/eLOZTd3nPuvaff3g/PR9Kc7/mc7/d+37e599nvPffc21QVkqR2vaTvASRJ42XoJalxhl6S\nGmfoJalxhl6SGmfoJalxhl6SGmfoJalxhl6SGre27wEAbrjhhpqamup7DElaVR555JFvVtXEYvtd\nFaGfmppidna27zEkaVVJ8uww+/nUjSQ1ztBLUuMMvSQ1ztBLUuMMvSQ1ztBLUuMMvSQ1ztBLUuMM\nvSQ17qr4ydjlmtr7YC/nfebA7b2cV5JG4RW9JDXO0EtS4wy9JDXO0EtS4wy9JDXO0EtS4wy9JDXO\n0EtS4wy9JDXO0EtS4wy9JDXO0EtS4wy9JDXO0EtS4wy9JDXO0EtS44YKfZJnkjye5LEks93a9UmO\nJjnZ3V43sP++JKeSPJXktnENL0la3ChX9L9RVTdV1XR3fy9wrKo2A8e6+yTZAuwEbgS2A/ckWbOC\nM0uSRrCcp252ADPd9gxwx8D6oap6vqqeBk4BW5dxHknSMgwb+gL+JckjSXZ3a5NVda7bPg9Mdtvr\ngdMDx57p1iRJPRj2Pwf/1ao6m+Q1wNEkXxt8sKoqSY1y4u4fjN0Ar33ta0c5VJI0gqGu6KvqbHd7\nAfgE80/FPJdkHUB3e6Hb/SywceDwDd3apW/zYFVNV9X0xMTE0t8DSdJlLRr6JK9M8uofbgNvAZ4A\njgC7ut12AQ9020eAnUmuSbIJ2Aw8vNKDS5KGM8xTN5PAJ5L8cP9/rKpPJ/kScDjJXcCzwJ0AVXU8\nyWHgBHAR2FNVL4xleknSohYNfVV9HXjDi6x/C9i2wDH7gf3Lnk6StGz+ZKwkNc7QS1LjDL0kNc7Q\nS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1Lj\nDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1LjDL0kNc7QS1Ljhg59\nkjVJvpzkk93965McTXKyu71uYN99SU4leSrJbeMYXJI0nFGu6O8Gnhy4vxc4VlWbgWPdfZJsAXYC\nNwLbgXuSrFmZcSVJoxoq9Ek2ALcD9w4s7wBmuu0Z4I6B9UNV9XxVPQ2cArauzLiSpFENe0X/fuA9\nwA8G1iar6ly3fR6Y7LbXA6cH9jvTrUmSerBo6JO8FbhQVY8stE9VFVCjnDjJ7iSzSWbn5uZGOVSS\nNIJhruhvBd6W5BngEPCmJB8GnkuyDqC7vdDtfxbYOHD8hm7tR1TVwaqarqrpiYmJZbwLkqTLWTT0\nVbWvqjZU1RTz32T9bFW9AzgC7Op22wU80G0fAXYmuSbJJmAz8PCKTy5JGsraZRx7ADic5C7gWeBO\ngKo6nuQwcAK4COypqheWPakkaUlGCn1VfQ74XLf9LWDbAvvtB/YvczZJ0grwJ2MlqXGGXpIaZ+gl\nqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGG\nXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIaZ+glqXGGXpIa\nZ+glqXGLhj7Jy5I8nOQrSY4n+bNu/fokR5Oc7G6vGzhmX5JTSZ5Kcts43wFJ0uUNc0X/PPCmqnoD\ncBOwPcktwF7gWFVtBo5190myBdgJ3AhsB+5J
smYcw0uSFrdo6Gve97u7L+3+FLADmOnWZ4A7uu0d\nwKGqer6qngZOAVtXdGpJ0tCGeo4+yZokjwEXgKNV9UVgsqrOdbucBya77fXA6YHDz3Rrl77N3Ulm\nk8zOzc0t+R2QJF3eUKGvqheq6iZgA7A1yesvebyYv8ofWlUdrKrpqpqemJgY5VBJ0ghGetVNVX0H\neIj5596fS7IOoLu90O12Ftg4cNiGbk2S1INhXnUzkeTabvvlwJuBrwFHgF3dbruAB7rtI8DOJNck\n2QRsBh5e6cElScNZO8Q+64CZ7pUzLwEOV9Unk3wBOJzkLuBZ4E6Aqjqe5DBwArgI7KmqF8YzviRp\nMYuGvqq+Ctz8IuvfArYtcMx+YP+yp5MkLZs/GStJjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9J\njTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0\nktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktS4RUOfZGOSh5KcSHI8yd3d+vVJ\njiY52d1eN3DMviSnkjyV5LZxvgOSpMsb5or+IvDuqtoC3ALsSbIF2Ascq6rNwLHuPt1jO4Ebge3A\nPUnWjGN4SdLiFg19VZ2rqke77e8BTwLrgR3ATLfbDHBHt70DOFRVz1fV08ApYOtKDy5JGs5Iz9En\nmQJuBr4ITFbVue6h88Bkt70eOD1w2Jlu7dK3tTvJbJLZubm5EceWJA1r6NAneRXwMeBdVfXdwceq\nqoAa5cRVdbCqpqtqemJiYpRDJUkjGCr0SV7KfOTvr6qPd8vPJVnXPb4OuNCtnwU2Dhy+oVuTJPVg\nmFfdBPgg8GRVvW/goSPArm57F/DAwPrOJNck2QRsBh5euZElSaNYO8Q+twLvBB5P8li39l7gAHA4\nyV3As8CdAFV1PMlh4ATzr9jZU1UvrPjkkqShLBr6qvpXIAs8vG2BY/YD+5cxlyRphQxzRa+rzNTe\nB3s79zMHbu/t3JKWxl+BIEmNM/SS1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNM/SS\n1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNM/SS1DhDL0mNW9v3AKvZ\n1N4H+x5BkhblFb0kNc7QS1LjDL0kNc7QS1LjDL0kNW7R0Cf5UJILSZ4YWLs+ydEkJ7vb6wYe25fk\nVJKnktw2rsElScMZ5or+PmD7JWt7gWNVtRk41t0nyRZgJ3Bjd8w9Sdas2LSSpJEtGvqq+jzw7UuW\ndwAz3fYMcMfA+qGqer6qngZOAVtXaFZJ0hIs9Tn6yao6122fBya77fXA6YH9znRrkqSeLPubsVVV\nQI16XJLdSWaTzM7NzS13DEnSApYa+ueSrAPobi9062eBjQP7bejWfkxVHayq6aqanpiYWOIYkqTF\nLDX0R4Bd3fYu4IGB9Z1JrkmyCdgMPLy8ESVJy7HoLzVL8hHg14EbkpwB/hQ4ABxOchfwLHAnQFUd\nT3IYOAFcBPZU1Qtjml2SNIRFQ19Vb1/goW0L7L8f2L+coSRJK8efjJWkxhl6SWqcoZekxhl6SWqc\noZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZek\nxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWqcoZekxhl6SWrc2nG94STb\ngQ8Aa4B7q+rAuM4lScsxtffB3s79zIHbx36OsVzRJ1kD/C3wW8AW4O1JtozjXJKkyxvXFf1W4FRV\nfR0gySFgB3BiTOeTtIL6usK9Ele3P4nGFfr1wOmB+2eAXxrTuXQF9fklrtrnx9d4jO05+sUk2Q3s\n7u5+P8lT
fc2yiBuAb/Y9xBI5+5W3WucGZ+9F/mJZs//cMDuNK/RngY0D9zd0a/+vqg4CB8d0/hWT\nZLaqpvueYymc/cpbrXODs/flSsw+rpdXfgnYnGRTkp8CdgJHxnQuSdJljOWKvqouJvl94J+Zf3nl\nh6rq+DjOJUm6vLE9R19VnwI+Na63fwVd9U8vXYazX3mrdW5w9r6MffZU1bjPIUnqkb8CQZIaZ+gX\nkGRjkoeSnEhyPMndfc80iiRrknw5ySf7nmUUSa5N8tEkX0vyZJJf7numYSX5o+5j5YkkH0nysr5n\nWkiSDyW5kOSJgbXrkxxNcrK7va7PGReywOx/2X3MfDXJJ5Jc2+eMC3mx2Qcee3eSSnLDSp/X0C/s\nIvDuqtoC3ALsWWW/xuFu4Mm+h1iCDwCfrqpfAN7AKnkfkqwH/hCYrqrXM/8ihJ39TnVZ9wHbL1nb\nCxyrqs3Ase7+1eg+fnz2o8Drq+oXgX8H9l3poYZ0Hz8+O0k2Am8BvjGOkxr6BVTVuap6tNv+HvPB\nWd/vVMNJsgG4Hbi371lGkeSngV8DPghQVf9TVd/pd6qRrAVenmQt8ArgP3qeZ0FV9Xng25cs7wBm\nuu0Z4I4rOtSQXmz2qvpMVV3s7v4b8z+7c9VZ4O8d4K+B9wBj+aapoR9CkingZuCL/U4ytPcz/0Hz\ng74HGdEmYA74h+5pp3uTvLLvoYZRVWeBv2L+iuwc8J9V9Zl+pxrZZFWd67bPA5N9DrMMvwf8U99D\nDCvJDuBsVX1lXOcw9ItI8irgY8C7quq7fc+zmCRvBS5U1SN9z7IEa4E3An9XVTcD/8XV+/TBj+ie\nz97B/D9WPwu8Msk7+p1q6Wr+5Xir7iV5Sf6Y+add7+97lmEkeQXwXuBPxnkeQ38ZSV7KfOTvr6qP\n9z3PkG4F3pbkGeAQ8KYkH+53pKGdAc5U1Q+/cvoo8+FfDX4TeLqq5qrqf4GPA7/S80yjei7JOoDu\n9kLP84wkye8CbwV+p1bP68Z/nvmLg690n7MbgEeT/MxKnsTQLyBJmH+u+Mmqel/f8wyrqvZV1Yaq\nmmL+m4GfrapVcWVZVeeB00le1y1tY/X8autvALckeUX3sbONVfKN5AFHgF3d9i7ggR5nGUn3Hx29\nB3hbVf133/MMq6oer6rXVNVU9zl7Bnhj97mwYgz9wm4F3sn8FfFj3Z/f7nuonwB/ANyf5KvATcCf\n9zzPULqvQj4KPAo8zvzn1lX705pJPgJ8AXhdkjNJ7gIOAG9OcpL5r1Cuyv8VboHZ/wZ4NXC0+1z9\n+16HXMACs4//vKvnKxxJ0lJ4RS9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktQ4Qy9JjTP0ktS4/wNK\n0b5rTZiDIwAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fd720d04cf8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"plt.hist(train['Title'])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see that the titles Jonkheer, Dr and others are rare we could just simply group them under a 'Rare' group"
]
},
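{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of listing rare titles by hand, they could also be picked by frequency. This is only a sketch under stated assumptions: it works on a column that still holds the title strings, the toy counts are invented, and `min_count` is a hypothetical cutoff that would need tuning on the real data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy title column (invented counts, not the real Titanic frequencies).\n",
"titles = pd.Series(['Mr'] * 5 + ['Miss'] * 3 + ['Mrs'] * 2 + ['Dr', 'Rev'])\n",
"\n",
"# Assumed cutoff: titles seen fewer than min_count times count as rare.\n",
"min_count = 2\n",
"counts = titles.value_counts()\n",
"rare = counts[counts < min_count].index\n",
"\n",
"# Keep frequent titles unchanged and relabel the rest as 'Rare'.\n",
"grouped = titles.where(~titles.isin(rare), 'Rare')\n",
"print(grouped.value_counts())"
]
},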
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train['Title'] = train['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n",
"#Replacing some similar titles and replacing the rare titles by 'Rare' tag"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'From this feature by examinig the correlation matrix we can observe a few things and improve out the accuracy ,like\\nmost of the people who had a title Mr. died'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'''From this feature by examinig the correlation matrix we can observe a few things and improve out the accuracy ,like\n",
"most of the people who had a title Mr. died'''"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}