{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualization of OVK data\n",
    "\n",
    "First, read the data from various anonymized files and store them in six lists: \n",
    "\n",
    "1. mails: text of the emails\n",
    "2. senders: sender of a mail: counselor or client\n",
    "3. nbrOfChars: length of a mail in characters\n",
    "4. counselor: id of the counselor\n",
    "5. ids: id of the client\n",
    "6. treatment type: either AS (auto-biographic writing) or ES (expressive writing)\n",
    "\n",
    "For comparison, we include two additional data sets beside the two OVK data sets (AS and ES): biblical text from 1888 and newspaper text from 1985 (NRC). Each article is treated as an email (list mails). Dummy values have been used for the other five data lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import os\n",
    "import re\n",
    "import sys\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "ASFILE = \"../usb/ovk/data/eriktks/AS/text/AS-mails.csv\"\n",
    "ESFILE = \"../usb/ovk/data/eriktks/ES/text/ES-mails.csv\"\n",
    "BIBLE = \"../usb/ovk/data/eriktks/othertexts/bible.csv\"\n",
    "NEWSGAC = \"../usb/ovk/data/eriktks/othertexts/newsgac.csv\"\n",
    "TACTUS = \"../usb/output/emails-all.csv\"\n",
    "INFILENAMES = [ ASFILE,ESFILE ] # [ ASFILE,ESFILE,BIBLE,NEWSGAC ] [ TACTUS ]\n",
    "CLIENT = \"CLIENT\"\n",
    "BIBLE = \"BIBLE\"\n",
    "NEWSPAPER = \"NEWSPAPER\"\n",
    "COUNSELOR = \"counselor\"\n",
    "ID = \"client-id\"\n",
    "NBROFCHARS = \"nbrOfCharsInWords\"\n",
    "NBROFSENTS = \"nbrOfSents\"\n",
    "NBROFWORDS = \"nbrOfWords\"\n",
    "SENDER = \"sender\"\n",
    "SEPARATOR = \",\"\n",
    "TEXT = \"text\"\n",
    "MINWORDS = 0\n",
    "\n",
    "(counselors,ids,mails,nbrOfChars,senders,treatments) = ([],[],[],[],[],[])\n",
    "for inFileName in INFILENAMES:\n",
    "    try: inFile = open(inFileName,\"r\")\n",
    "    except Exception as e: sys.exit(\"cannot read file \"+inFileName+\": \"+str(e))\n",
    "    csvReader = csv.DictReader(inFile,delimiter=SEPARATOR)\n",
    "    for row in csvReader:\n",
    "        try:\n",
    "            if True:\n",
    "                if (inFileName == TACTUS or int(row[NBROFWORDS]) > MINWORDS):\n",
    "                    mails.append(row[TEXT])\n",
    "                    senders.append(row[SENDER])\n",
    "                    if inFileName != TACTUS:\n",
    "                        nbrOfChars.append(int(row[NBROFCHARS]))\n",
    "                        counselors.append(row[COUNSELOR])\n",
    "                        #ids.append(row[ID]+\"-\"+row[COUNSELOR]+\"-\"+row[SENDER]+\"-\"+str(len(ids)))\n",
    "                        ids.append(row[ID])\n",
    "                    else:\n",
    "                        #ids.append(row[ID]+\"-\"+\"-\"+row[SENDER]+\"-\"+str(len(ids)))\n",
    "                        ids.append(row[ID])\n",
    "                    if inFileName == ASFILE: treatments.append(\"AS\")\n",
    "                    else: treatments.append(\"ES\")\n",
    "        except: sys.exit(\"unexpected row in file \"+INFILENAME+\": \"+str(row))\n",
    "    inFile.close()"
   ]
  },
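  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick, optional check we can print how many mails were read and how many distinct client ids they contain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# optional check: number of mails read and number of distinct client ids\n",
    "print(len(mails),\"mails,\",len(set(ids)),\"distinct client ids\")"
   ]
  },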
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, count the tokens in the mails. We use a standard Python library for this, TfidfVectorizer, which normalizes the counts with respect to the lengths of the mails and prefers tokens that appear in a few mails over tokens that appear in every mail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "NGRAMMIN = 1\n",
    "NGRAMMAX = 1\n",
    "\n",
    "tfidf_vectorizer = TfidfVectorizer(max_df=0.8,max_features=200000,min_df=0.2,use_idf=True,ngram_range=(NGRAMMIN,NGRAMMAX))\n",
    "tfidf_matrix = tfidf_vectorizer.fit_transform(mails)"
   ]
  },
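  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check, we can inspect the size of the resulting matrix and a sample of the selected tokens. Note that `get_feature_names` is the scikit-learn API of the version used here; newer versions provide `get_feature_names_out` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# optional: dimensions of the tf-idf matrix (mails x tokens) and ten sample tokens\n",
    "print(tfidf_matrix.shape)\n",
    "print(tfidf_vectorizer.get_feature_names()[:10])"
   ]
  },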
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The AS and ES contain 2,000 mails and 25,000 different tokens. Therefore the previous analysis resulted in a table with 2,000 rows and 25,000 columns. We use principal component analysis to summarize this table to 2,000 rows and 4 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=4)\n",
    "pca.fit(tfidf_matrix.toarray())\n",
    "newSpace = pca.transform(tfidf_matrix.toarray())"
   ]
  },
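  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, `pca.explained_variance_ratio_` shows which fraction of the variation in the data each of the four components captures, an indication of how much information the summary retains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# optional: fraction of the variance explained by each of the four components\n",
    "print(pca.explained_variance_ratio_)"
   ]
  },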
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next the mails can be shown in a graph. Each row in the table corresponds with a mail. The values in the columns can be used as x-coordinates and y-coordinates to position the mails in the graph. We use the first two column values because they are expected to contain the most important information for creating interesting groups of mails in the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# draw graph of pca data: red: from client; blue: from counselor\n",
    "\n",
    "%matplotlib notebook\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "import re\n",
    "\n",
    "EXPERIMENT = \"Emails\"\n",
    "DOTSIZE = 5\n",
    "THRESHOLD = 5\n",
    "RANDOMFACTOR=0.00\n",
    "(idsCli,idsCou,idsBib,idsNew,xCli,xCou,xBib,xNew,yCli,yCou,yBib,yNew) = ([],[],[],[],[],[],[],[],[],[],[],[])\n",
    "\n",
    "random.seed()\n",
    "for i in range(0,len(newSpace)):\n",
    "    if senders[i] == \"CLIENT\":\n",
    "        xCli.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yCli.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsCli.append(ids[i])\n",
    "    # top left counselors: C16 C88 C77 C65\n",
    "    elif senders[i] == \"BIBLE\":\n",
    "        xBib.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yBib.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsBib.append(ids[i])\n",
    "    elif senders[i] == \"NEWSPAPER\":\n",
    "        xNew.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yNew.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsNew.append(ids[i])\n",
    "    else: # senders[i] == COUNSELOR\n",
    "        xCou.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yCou.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsCou.append(ids[i])\n",
    "\n",
    "def makePlot(colorCli,colorCou,colorBib,colorNew):\n",
    "    def pickScatter(event):\n",
    "        dataIds = []\n",
    "        for i in range(0,len(event.ind)): \n",
    "            if event.artist == cliDots: dataIds.append(idsCli[event.ind[i]])\n",
    "            elif event.artist == bibDots: dataIds.append(idsBib[event.ind[i]])\n",
    "            elif event.artist == newDots: dataIds.append(idsNew[event.ind[i]])\n",
    "            else: dataIds.append(idsCou[event.ind[i]])\n",
    "        plt.gca().set_title(EXPERIMENT+str(dataIds),fontsize=12)\n",
    "\n",
    "    fig = plt.figure(figsize=(9,5))\n",
    "    plt.gca().set_title(EXPERIMENT)\n",
    "    if len(xCou) > 0: couDots = plt.scatter(xCou,yCou,s=DOTSIZE,color=colorCou,picker=2, \\\n",
    "                      label=\"from counselor (\"+str(len(idsCou))+\")\".format(THRESHOLD))\n",
    "    else: couDots = plt.scatter(xCou,yCou,s=DOTSIZE,color=colorCou,picker=DOTSIZE)\n",
    "    if len(xCli) > 0: cliDots = plt.scatter(xCli,yCli,s=DOTSIZE,color=colorCli,picker=2, \\\n",
    "                          label=\"from client (\"+str(len(idsCli))+\")\".format(THRESHOLD))\n",
    "    else: cliDots = plt.scatter(xCli,yCli,s=DOTSIZE,color=colorCli,picker=DOTSIZE)\n",
    "    if len(xBib) > 0:  bibDots = plt.scatter(xBib,yBib,s=DOTSIZE,color=colorBib,picker=2, \\\n",
    "                       label=\"from Bible (\"+str(len(idsBib))+\")\".format(THRESHOLD))\n",
    "    else: bibDots = plt.scatter(xBib,yBib,s=DOTSIZE,color=colorBib,picker=DOTSIZE)\n",
    "    if len(xNew) > 0: newDots = plt.scatter(xNew,yNew,s=DOTSIZE,color=colorNew,picker=2, \\\n",
    "                      label=\"from newspaper (\"+str(len(idsNew))+\")\".format(THRESHOLD))\n",
    "    else: newDots = plt.scatter(xNew,yNew,s=DOTSIZE,color=colorNew,picker=DOTSIZE)\n",
    "    plt.legend(fontsize=8)\n",
    "    plt.gcf().canvas.mpl_connect(\"pick_event\",pickScatter)\n",
    "    plt.savefig(\"image.png\")\n",
    "\n",
    "makePlot(\"red\",\"blue\",\"green\",\"black\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The counselor cloud is divided in two parts: top-left and right-bottom. We examine the vocabulary of the two part to find out what the differences are between the two parts. We draw a imaginary vertical line in the graph at x = 0.1 and count the words of the counselor mails to the left and to the right of the line in two lists: textLargerThan01 and textSmallerThan01."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# plot distribution of article lengths in words\n",
    "# add ot, newspaper\n",
    "# collect counselor texts of pca1 > 0.1, pca1 < 0.1; compute t-scores\n",
    "\n",
    "THRESHOLD = 0.0\n",
    "\n",
    "def getTokens(text):\n",
    "    unigrams = text.split()\n",
    "    ngrams = []\n",
    "    for ngramSize in range(NGRAMMIN,NGRAMMAX+1):\n",
    "        for i in range(0,len(unigrams)):\n",
    "            if i+ngramSize-1 < len(unigrams):\n",
    "                ngram = unigrams[i]\n",
    "                for j in range(2,ngramSize+1):\n",
    "                    ngram += \" \"+unigrams[i+j-1]\n",
    "                ngrams.append(ngram)\n",
    "    return(ngrams)\n",
    "\n",
    "textLargerThan01 = {}\n",
    "textSmallerThan01 = {}\n",
    "totalLarger = 0\n",
    "totalSmaller = 0\n",
    "nbrOfMailsLarger = 0\n",
    "nbrOfMailsSmaller = 0\n",
    "for i in range(0,len(newSpace)):\n",
    "    tokens = getTokens(mails[i])\n",
    "    x = newSpace[i][0]\n",
    "    y = newSpace[i][1]\n",
    "    #if senders[i] == \"COUNSELOR\" and x > THRESHOLD: # counselors: top-left vs bottom-right (threshold: 0.15)\n",
    "    if (senders[i] == \"COUNSELOR\" or senders[i] == \"CLIENT\") and x > THRESHOLD: # counselors vs clients (0.0)\n",
    "    # if (senders[i] == \"COUNSELOR\" or senders[i] == \"CLIENT\") and x-y > THRESHOLD: # tactus cou vs cli (-0.2)\n",
    "        nbrOfMailsLarger += 1\n",
    "        for i in range(0,len(tokens)):\n",
    "            if i < len(tokens):\n",
    "               ngram = tokens[i]\n",
    "               if not ngram in textLargerThan01: textLargerThan01[ngram] = 0\n",
    "               textLargerThan01[ngram] += 1\n",
    "               totalLarger += 1\n",
    "    elif senders[i] == \"COUNSELOR\" or senders[i] == \"CLIENT\": # counselors vs clients\n",
    "    #elif senders[i] == \"COUNSELOR\": # counselors: top-left vs bottom-right\n",
    "        nbrOfMailsSmaller += 1\n",
    "        for i in range(0,len(tokens)):\n",
    "            if i < len(tokens):\n",
    "               ngram = tokens[i]\n",
    "               if not ngram in textSmallerThan01: textSmallerThan01[ngram] = 0\n",
    "               textSmallerThan01[ngram] += 1\n",
    "               totalSmaller += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we determine which words appear more often in the top-left group and which words appear more often in the bottom-right group. For this purpose we use the t-score, a statistical measure for comparing frequencies in two data sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "tscores = {}\n",
    "for token in textLargerThan01:\n",
    "    if token in textSmallerThan01:\n",
    "        freqLarger = textLargerThan01[token]/totalLarger\n",
    "        freqSmaller = textSmallerThan01[token]/totalSmaller\n",
    "        seLarger = freqLarger/totalLarger\n",
    "        seSmaller = freqSmaller/totalSmaller\n",
    "        tscores[token] = (freqLarger-freqSmaller)/math.sqrt(seLarger+seSmaller)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we display the top ten most specific words of each group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import operator\n",
    "import re\n",
    "\n",
    "N = 10\n",
    "sortedTscores = sorted(tscores.items(), key=operator.itemgetter(1))\n",
    "\n",
    "print(\"GROUP LEFT OF LINE (\"+str(nbrOfMailsSmaller)+\")\")\n",
    "shown = 0\n",
    "for i in range(0,len(sortedTscores)):\n",
    "    if re.match(r\"^[a-zA-Z ]+$\",sortedTscores[i][0]):\n",
    "        print(str(shown+1)+\". \"+sortedTscores[i][0])\n",
    "        shown += 1\n",
    "    if shown >= N: break\n",
    "\n",
    "print(\"\\nGROUP RIGHT OF LINE (\"+str(nbrOfMailsLarger)+\")\")\n",
    "shown = 0\n",
    "for i in range(len(sortedTscores)-1,-1,-1):\n",
    "    if re.match(r\"^[a-zA-Z ]+$\",sortedTscores[i][0]):\n",
    "        print(str(shown+1)+\". \"+sortedTscores[i][0])\n",
    "        shown += 1\n",
    "    if shown >= N: break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The word frequency lists display an important style difference between the two mail groups: mails in the top-left group use a more formal style (*u*, *uw* and *U*) while the mails in the bottom-right group are written in a less formal style (*je*, *Je*, *jou*, *jezelf*, *jij* and *jouw*)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Next:\n",
    "\n",
    "* explore linking mails of the same session\n",
    "* explore topic classification with nmf"
   ]
  },
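  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for the second item, here is a minimal sketch of NMF-based topic exploration, reusing the tfidf_matrix and tfidf_vectorizer defined above. The number of topics (4) is an arbitrary choice for illustration, and random_state only makes the run reproducible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import NMF\n",
    "\n",
    "# sketch: factorize the tf-idf matrix into 4 topics and show the ten\n",
    "# highest-weighted tokens per topic; 4 is an arbitrary choice\n",
    "nmf = NMF(n_components=4, random_state=1)\n",
    "topicWeights = nmf.fit_transform(tfidf_matrix)\n",
    "terms = tfidf_vectorizer.get_feature_names()\n",
    "for topicId, topic in enumerate(nmf.components_):\n",
    "    topTokens = [ terms[j] for j in topic.argsort()[-10:][::-1] ]\n",
    "    print(str(topicId)+\": \"+\" \".join(topTokens))"
   ]
  },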
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}