pca-results-tactus.ipynb from e-mental-health/data-processing

pca-results-tactus.ipynb
Summary

Maintainability

Test Coverage

Issues
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualization of Tactus data\n",
    "\n",
    "First, read the data from various anonymized files and store them in six lists: \n",
    "\n",
    "1. mails: text of the emails\n",
    "2. senders: sender of a mail: counselor or client\n",
    "3. nbrOfChars: length of a mail in characters\n",
    "4. counselor: id of the counselor (not yet available in Tactus data)\n",
    "5. ids: id of the client\n",
    "6. treatment type: either AS (auto-biographic writing) or ES (expressive writing) (not available in Tactus data)\n",
    "\n",
    "For comparison, we include two additional data sets beside the two OVK data sets (AS and ES): biblical text from 1888 and newspaper text from 1985 (NRC). Each article is treated as an email (list mails). Dummy values have been used for the other five data lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import os\n",
    "import re\n",
    "import sys\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "ASFILE = \"../usb/ovk/data/eriktks/AS/text/AS-mails.csv\"\n",
    "ESFILE = \"../usb/ovk/data/eriktks/ES/text/ES-mails.csv\"\n",
    "BIBLE = \"../usb/ovk/data/eriktks/othertexts/bible.csv\"\n",
    "NEWSGAC = \"../usb/ovk/data/eriktks/othertexts/newsgac.csv\"\n",
    "TACTUS = \"../usb/output/emails-all.csv\"\n",
    "INFILENAMES = [ TACTUS ] # [ ASFILE,ESFILE,BIBLE,NEWSGAC ] [ TACTUS ]\n",
    "CLIENT = \"CLIENT\"\n",
    "BIBLE = \"BIBLE\"\n",
    "NEWSPAPER = \"NEWSPAPER\"\n",
    "COUNSELOR = \"counselor\"\n",
    "GENDER = \"GeslachtA\"\n",
    "ID = \"client-id\"\n",
    "NBROFCHARS = \"nbrOfCharsInWords\"\n",
    "NBROFSENTS = \"nbrOfSents\"\n",
    "NBROFWORDS = \"nbrOfWords\"\n",
    "SENDER = \"sender\"\n",
    "SEPARATOR = \",\"\n",
    "TEXT = \"text\"\n",
    "MINWORDS = 0\n",
    "MAXMAILS = 0\n",
    "\n",
    "(counselors,ids,mails,nbrOfChars,senders,treatments) = ([],[],[],[],[],[])\n",
    "nbrOfMails = {}\n",
    "for inFileName in INFILENAMES:\n",
    "    try: inFile = open(inFileName,\"r\")\n",
    "    except Exception as e: sys.exit(\"cannot read file \"+inFileName+\": \"+str(e))\n",
    "    csvReader = csv.DictReader(inFile,delimiter=SEPARATOR)\n",
    "    for row in csvReader:\n",
    "        try:\n",
    "            if True:\n",
    "                if (inFileName == TACTUS or int(row[NBROFWORDS]) > MINWORDS) and row[SENDER] == \"CLIENT\" and \\\n",
    "                   (not row[ID] in nbrOfMails or MAXMAILS <= 0 or nbrOfMails[row[ID]] < MAXMAILS):\n",
    "                    mails.append(row[TEXT])\n",
    "                    senders.append(row[SENDER])\n",
    "                    if row[ID] in nbrOfMails: nbrOfMails[row[ID]] += 1\n",
    "                    else: nbrOfMails[row[ID]] = 1\n",
    "                    if inFileName != TACTUS:\n",
    "                        nbrOfChars.append(int(row[NBROFCHARS]))\n",
    "                        counselors.append(row[COUNSELOR])\n",
    "                        #ids.append(row[ID]+\"-\"+row[COUNSELOR]+\"-\"+row[SENDER]+\"-\"+str(len(ids)))\n",
    "                        ids.append(row[ID])\n",
    "                    else:\n",
    "                        #ids.append(row[ID]+\"-\"+\"-\"+row[SENDER]+\"-\"+str(len(ids)))\n",
    "                        ids.append(row[ID])\n",
    "                    if inFileName == ASFILE: treatments.append(\"AS\")\n",
    "                    else: treatments.append(\"ES\")\n",
    "        except: sys.exit(\"unexpected row in file \"+INFILENAME+\": \"+str(row))\n",
    "    inFile.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check if we want to have separate emails or have all emails for one person collapsed in one text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "COLLAPSE = False\n",
    "\n",
    "if COLLAPSE:\n",
    "    (newCounselors,newIds,newMails,newNbrOfChars,newSenders,newTreatments) = ([],[],[],[],[],[])\n",
    "    collapse = {}\n",
    "    for i in range(0,len(ids)):\n",
    "        if ids[i] in collapse:\n",
    "            thisId = collapse[ids[i]]\n",
    "            newMails[thisId] += \" \"+mails[i]\n",
    "            newNbrOfChars[thisId] += nbrOfChars[i]\n",
    "        else:\n",
    "            collapse[ids[i]] = len(newIds)\n",
    "            if i < len(counselors): newCounselors.append(counselors[i])\n",
    "            newIds.append(ids[i])\n",
    "            newMails.append(mails[i])\n",
    "            if i < len(nbrOfChars): newNbrOfChars.append(nbrOfChars[i])\n",
    "            newSenders.append(senders[i])\n",
    "            newTreatments.append(treatments[i])\n",
    "    counselors = newCounselors\n",
    "    ids = newIds\n",
    "    mails = newMails\n",
    "    nbrOfChars = newNbrOfChars\n",
    "    senders = newSenders\n",
    "    treatments = newTreatments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the meta data for the OVK data set: exit phrase and the scores CESD and MHC. If the exit phrase is non-empty the patient has abandoned the treatment. The lower the CESD (depression) score, the better. The higher the MHC (mental health) score, the better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import re\n",
    "\n",
    "DIR = \"../usb/output\"\n",
    "FILEINTAKE = DIR+\"/Intake.csv\"\n",
    "FILEEND = DIR+\"/Lijst nameting.csv\"\n",
    "FILEMAIL = DIR+\"/emails.csv\"\n",
    "FILEOUT = DIR+\"/marcel.csv\"\n",
    "CSVSEP = \",\"\n",
    "ID = \"id\"\n",
    "WEEK = \"week\"\n",
    "WEEK0 = \"week0\"\n",
    "WEEKT0 = \"weekt0\"\n",
    "WEEK2 = \"week2\"\n",
    "WEEKN = \"weekn\"\n",
    "MAXINT = \"MAXINT\"\n",
    "\n",
    "def cleanupWeekConsumption(string):\n",
    "    return(re.sub(r\"^\\s*±\\s*\",\"\",string))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "intake = {}\n",
    "with open(FILEINTAKE,\"r\",encoding=\"utf8\") as csvFile:\n",
    "    csvreader = csv.DictReader(csvFile,delimiter=CSVSEP)\n",
    "    for row in csvreader:\n",
    "        weekConsumption = \"\"\n",
    "        if row[WEEK] != \"\": weekConsumption = row[WEEK]\n",
    "        elif row[WEEK0] != \"\": weekConsumption = row[WEEK0]\n",
    "        elif row[WEEKT0] != \"\": weekConsumption = row[WEEKT0]\n",
    "        if row[ID] != \"\" and weekConsumption !=\"\": \n",
    "            intake[row[ID]] = int(cleanupWeekConsumption(weekConsumption))\n",
    "    csvFile.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "end = {}\n",
    "with open(FILEEND,\"r\",encoding=\"utf8\") as csvFile:\n",
    "    csvreader = csv.DictReader(csvFile,delimiter=CSVSEP)\n",
    "    for row in csvreader:\n",
    "        weekConsumption = \"\"\n",
    "        if row[WEEK2] != \"\": weekConsumption = row[WEEK2]\n",
    "        elif row[WEEKN] != \"\": weekConsumption = row[WEEKN]\n",
    "        if row[ID] != \"\" and weekConsumption != \"\":\n",
    "            end[row[ID]] = int(cleanupWeekConsumption(weekConsumption))\n",
    "    csvFile.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, count the tokens in the mails. We use a standard Python library for this, TfidfVectorizer, which normalizes the counts with respect to the lengths of the mails and prefers tokens that appear in a few mails over tokens that appear in every mail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(end.keys()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "NGRAMMIN = 1\n",
    "NGRAMMAX = 1\n",
    "\n",
    "tfidf_vectorizer = TfidfVectorizer(max_df=0.8,max_features=200000,min_df=0.2,use_idf=True,ngram_range=(NGRAMMIN,NGRAMMAX))\n",
    "tfidf_matrix = tfidf_vectorizer.fit_transform(mails)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The AS and ES contain 2,000 mails and 25,000 different tokens. Therefore the previous analysis resulted in a table with 2,000 rows and 25,000 columns. We use principal component analysis to summarize this table to 2,000 rows and 4 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=4)\n",
    "pca.fit(tfidf_matrix.toarray())\n",
    "newSpace = pca.transform(tfidf_matrix.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next the mails can be shown in a graph. Each row in the table corresponds with a mail. The values in the columns can be used as x-coordinates and y-coordinates to position the mails in the graph. We use the first two column values because they are expected to contain the most important information for creating interesting groups of mails in the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "%matplotlib notebook\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "import re\n",
    "\n",
    "MINMAILS = 10\n",
    "THRESHOLD = 5\n",
    "RANDOMFACTOR=0.00\n",
    "DOTSIZE = 5\n",
    "(idsG1,idsG2,xG1,xG2,yG1,yG2) = ([],[],[],[],[],[])\n",
    "#counselorDict = {}\n",
    "\n",
    "def testFunction(thisId):\n",
    "    return(stoppedTreatment(thisId))\n",
    "\n",
    "def stoppedTreatment(thisId):\n",
    "    if thisId == \"NAME\": return(\"stoppedTreatment\")\n",
    "    #else: return(re.search(r\"[a-zA-Z]\",exitData[thisId]))\n",
    "    else: return(nbrOfMails[thisId] < MINMAILS)\n",
    "\n",
    "def moreDepressed(thisId):\n",
    "    if thisId == \"NAME\": return(\"moreDepressed\")\n",
    "    else: return(ids[i] in cesdDiff and cesdDiff[ids[i]] > 0)\n",
    "\n",
    "def worseMental(thisId):\n",
    "    if thisId == \"NAME\": return(\"worseMental\")\n",
    "    else: return(ids[i] in mhcDiff and mhcDiff[ids[i]] < 0)\n",
    "\n",
    "def isMan(thisId):\n",
    "    if thisId == \"NAME\": return(\"isMan\")\n",
    "    else: return(gender[thisId] == \"man\")\n",
    "\n",
    "def false(thisId):\n",
    "    if thisId == \"NAME\": return(\"false\")\n",
    "    else: return(False)\n",
    "\n",
    "random.seed()\n",
    "for i in range(0,len(newSpace)):\n",
    "    if testFunction(ids[i]):\n",
    "        xG1.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yG1.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsG1.append(ids[i])\n",
    "        #counselorDict[ids[i]] = counselors[i]\n",
    "    else: # testFunction()\n",
    "        xG2.append(newSpace[i][0]+random.random()*RANDOMFACTOR)\n",
    "        yG2.append(newSpace[i][1]+random.random()*RANDOMFACTOR)\n",
    "        idsG2.append(ids[i])\n",
    "        #counselorDict[ids[i]] = counselors[i]\n",
    "    \n",
    "def makePlot(colorG1,colorG2):\n",
    "    EXPERIMENT = testFunction(\"NAME\")\n",
    "    def pickScatter(event):\n",
    "        dataIds = []\n",
    "        for i in range(0,len(event.ind)):\n",
    "            if event.artist == g2Dots: dataIds.append(idsG2[event.ind[i]])\n",
    "            else: dataIds.append(idsG1[event.ind[i]])\n",
    "        title = EXPERIMENT+\" [\"\n",
    "        for i in range(0,len(dataIds)):\n",
    "            title += dataIds[i]+\":\"+str(nbrOfMails[dataIds[i]]) # +\":\"+counselorDict[dataIds[i]]\n",
    "            if i < len(dataIds)-1: title += \",\"\n",
    "        title += \"]\"\n",
    "        plt.gca().set_title(title,fontsize=12)\n",
    "\n",
    "    fig = plt.figure(figsize=(9,5))\n",
    "    plt.gca().set_title(EXPERIMENT)\n",
    "    # show blue dots\n",
    "    g1Dots = plt.scatter(xG1,yG1,s=DOTSIZE,color=colorG1,picker=DOTSIZE, \n",
    "                         label=\"group 1 (\"+str(len(idsG1))+\")\".format(THRESHOLD))\n",
    "    # show red dots\n",
    "    g2Dots = plt.scatter(xG2,yG2,s=DOTSIZE,color=colorG2,picker=DOTSIZE, \\\n",
    "                        label=\"group 2 (\"+str(len(idsG2))+\")\".format(THRESHOLD))\n",
    "    plt.legend(fontsize=8)\n",
    "    plt.gcf().canvas.mpl_connect(\"pick_event\",pickScatter)\n",
    "\n",
    "makePlot(\"blue\",\"red\")\n",
    "plt.savefig(\"image.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cloud contains different colors for clients that completed the treatment (red) and clients that stopped the treatment (blue). Some blue dots are grouped in a seperate area but some are placed among the red dots. We now can do two things:\n",
    "\n",
    "* we can build a model for all the blue dots (or all the red dots). But this model will need to separate the difficult cases as well: the blue dots among the red dots\n",
    "* we can build a model for the blue dots that are in a separate area (top right). Actually this PCP analysis already is such a model but we want to see if we can define the group with lexical features and then check if the chose features make sense"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# plot distribution of article lengths in words\n",
    "# add ot, newspaper\n",
    "# collect counselor texts of pca1 > 0.1, pca1 < 0.1; compute t-scores\n",
    "\n",
    "THRESHOLDX = 0.2\n",
    "THRESHOLDY = 0.0\n",
    "\n",
    "def getTokens(text):\n",
    "    unigrams = text.split()\n",
    "    ngrams = []\n",
    "    for ngramSize in range(NGRAMMIN,NGRAMMAX+1):\n",
    "        for i in range(0,len(unigrams)):\n",
    "            if i+ngramSize-1 < len(unigrams):\n",
    "                ngram = unigrams[i]\n",
    "                for j in range(2,ngramSize+1):\n",
    "                    ngram += \" \"+unigrams[i+j-1]\n",
    "                ngrams.append(ngram)\n",
    "    return(ngrams)\n",
    "\n",
    "textCompleted = {}\n",
    "textStopped = {}\n",
    "totalCompleted = 0\n",
    "totalStopped = 0\n",
    "nbrOfMailsCompleted = 0\n",
    "nbrOfMailsStopped = 0\n",
    "for i in range(0,len(newSpace)):\n",
    "    tokens = getTokens(mails[i])\n",
    "    x = newSpace[i][0]\n",
    "    y = newSpace[i][1]\n",
    "    #if x > THRESHOLDX and y < THRESHOLDY: \n",
    "    if testFunction(ids[i]):\n",
    "        nbrOfMailsCompleted += 1\n",
    "        for i in range(0,len(tokens)):\n",
    "            if i < len(tokens):\n",
    "               ngram = tokens[i]\n",
    "               if not ngram in textCompleted: textCompleted[ngram] = 0\n",
    "               textCompleted[ngram] += 1\n",
    "               totalCompleted += 1\n",
    "    else:\n",
    "        nbrOfMailsStopped += 1\n",
    "        for i in range(0,len(tokens)):\n",
    "            if i < len(tokens):\n",
    "               ngram = tokens[i]\n",
    "               if not ngram in textStopped: textStopped[ngram] = 0\n",
    "               textStopped[ngram] += 1\n",
    "               totalStopped += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we determine which words appear more often in the top-left group and which words appear more often in the bottom-right group. For this purpose we use the t-score, a statistical measure for comparing frequencies in two data sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "tscores = {}\n",
    "for token in textCompleted:\n",
    "    freqCompleted = textCompleted[token]/totalCompleted\n",
    "    seCompleted = freqCompleted/totalCompleted\n",
    "    if token in textStopped:\n",
    "        freqStopped = textStopped[token]/totalStopped\n",
    "    else:\n",
    "        freqStopped = 0.5/totalStopped\n",
    "    seStopped = freqStopped/totalStopped\n",
    "    tscores[token] = (freqCompleted-freqStopped)/math.sqrt(seCompleted+seStopped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we display the top ten most specific words of each group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import operator\n",
    "import re\n",
    "\n",
    "N = 20\n",
    "sortedTscores = sorted(tscores.items(), key=operator.itemgetter(1), reverse=True)\n",
    "\n",
    "print(\"GROUP 1 (\"+str(nbrOfMailsCompleted)+\")\")\n",
    "shown = 0\n",
    "for i in range(0,len(sortedTscores)):\n",
    "    if re.match(r\"^[a-zA-Z ]+$\",sortedTscores[i][0]):\n",
    "        print(str(shown+1)+\". \"+sortedTscores[i][0])\n",
    "        shown += 1\n",
    "    if shown >= N: break\n",
    "\n",
    "print(\"\\nGROUP 2 (\"+str(nbrOfMailsStopped)+\")\")\n",
    "shown = 0\n",
    "for i in range(len(sortedTscores)-1,-1,-1):\n",
    "    if re.match(r\"^[a-zA-Z ]+$\",sortedTscores[i][0]):\n",
    "        print(str(shown+1)+\". \"+sortedTscores[i][0])\n",
    "        shown += 1\n",
    "    if shown >= N: break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Next:\n",
    "\n",
    "* explore linking mails of the same session (+)\n",
    "* explore topic classification with nmf (-)\n",
    "* solve click in graph bug\n",
    "* check mails of stopped patients"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}