Experiment Data
===============

Dallinger keeps track of experiment data using the database. All generated
data about Dallinger constructs, such as networks, nodes, and participants, is
tracked by the system. In addition, experiment-specific data, such as
questions and infos, can be stored.
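
Because the data lives in a standard SQL database, it can also be inspected
directly through SQLAlchemy while an experiment is running. The snippet below
is a minimal sketch, assuming the standard `dallinger.db` session and the
built-in models:

::

    >>> from dallinger.db import session
    >>> from dallinger.models import Node, Participant
    >>> session.query(Node).count()          # how many nodes have been created
    >>> session.query(Participant).all()     # every participant row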

The `info` table is perhaps the most useful for experiment creators. It is
intended for saving data specific to an experiment. Whenever an important
event needs to be recorded for an experiment, an Info can be created:

::

    from dallinger.db import session
    from dallinger.models import Info

    def record_event(self, node, contents, details=None):
        # Store the event as an Info attached to the relevant node.
        info = Info(origin=node, contents=contents, details=details)
        session.add(info)
        session.commit()

In the above example, we have a function to record an event; it would be part
of a larger experiment's code. Each time something important happens in the
experiment, the function is called. In this case, we take the related node as
the first parameter, then a string representation of the event, and finally an
optional `details` parameter, which can hold a dictionary or other data
structure with additional details.
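
For instance, the helper might be called like this from elsewhere in the
experiment class (a hypothetical snippet; the node variable, event name, and
details are made up for illustration):

::

    # Hypothetical call site inside an experiment method:
    self.record_event(
        node=chosen_node,                     # the Node the event relates to
        contents="round_completed",           # string describing the event
        details={"round": 3, "score": 42},    # optional structured details
    )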

Dallinger allows users to export experiment data for analysis with the tools
of their choice. Data from all experiment tables is exported in CSV format,
which makes it easy to use with a variety of tools.

To export the data, the Dallinger `export` command is used. The command
requires passing in the application id. Example:

::

    $ dallinger export --app 6ab5e918-44c0-f9bc-5d97-a5ddbbddb68a

This will connect to the database and export the data, which will be saved as
a zip file inside the `data` directory:

::
    
    $ ls data
    6ab5e918-44c0-f9bc-5d97-a5ddbbddb68a.zip

To use the exported data, it is recommended that you unzip the file inside a
working directory. This will create a new `data` directory, which will
contain the experiment's exported tables as CSV files:

::

    $ unzip 6ab5e918-44c0-f9bc-5d97-a5ddbbddb68a.zip
    Archive:  6ab5e918-44c0-f9bc-5d97-a5ddbbddb68a.zip
      inflating: experiment_id.md
      inflating: data/network.csv
      inflating: data/info.csv
      inflating: data/notification.csv
      inflating: data/question.csv
      inflating: data/transformation.csv
      inflating: data/vector.csv
      inflating: data/transmission.csv
      inflating: data/participant.csv
      inflating: data/node.csv

Once the data is uncompressed, you can analyze it using many different
applications. Excel, for example, will easily import the data just by
double-clicking one of the files.

In Python, the pandas library is a popular way of manipulating data. It is
required by Dallinger, so if you already have Dallinger running you can begin
using it right away:

::

    $ python
    >>> import pandas
    >>> df = pandas.read_csv('data/question.csv')

Pandas has a handy `read_csv` function, which reads a CSV file and converts it
into a DataFrame, a spreadsheet-like structure that Pandas uses to work with
data. Once the data is in a DataFrame, we can use all the DataFrame features to
work with it:

::

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 6 entries, 0 to 5
    Data columns (total 14 columns):
    id                6 non-null int64
    creation_time     6 non-null datetime64[ns]
    property1         0 non-null object
    property2         0 non-null object
    property3         0 non-null object
    property4         0 non-null object
    property5         0 non-null object
    failed            6 non-null object
    time_of_death     0 non-null object
    type              6 non-null object
    participant_id    6 non-null int64
    number            6 non-null int64
    question          6 non-null object
    response          6 non-null object
    dtypes: datetime64[ns](1), int64(3), object(10)
    memory usage: 744.0+ bytes
    None
    >>> df.response.describe()
    count                                       6
    unique                                      5
    top       {"engagement":"7","difficulty":"4"}
    freq                                        2
    Name: response, dtype: object

In this case, let's say we want to analyze questionnaire responses at the end
of an experiment. We will only need the `response` column from the `question`
table. Since this column is stored as a string but holds a dictionary with the
answers to the questions, we also need to convert it into a format suitable
for analysis:

::

    >>> df = pandas.read_csv('data/question.csv', usecols=['response'],
                converters={'response': lambda x: list(eval(x).values())})
    >>> df
      response
    0   [4, 7]
    1   [1, 6]
    2   [4, 7]
    3   [7, 7]
    4   [3, 6]
    5   [0, 3]
    >>> responses = pandas.DataFrame(df['response'].values.tolist(),
                columns=['engagement', 'difficulty'], dtype='int64')
    >>> responses
      engagement difficulty
    0          4          7
    1          1          6
    2          4          7
    3          7          7
    4          3          6
    5          0          3

First we create a DataFrame using `read_csv` as before, but this time we
specify which columns to load with the `usecols` parameter. To get the values
for each response, we use a converter that turns the string back into a
dictionary and extracts its values.
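
Since the responses in this example are stored as JSON strings, `json.loads`
from the standard library is a safer alternative to `eval` in the converter;
this variation is otherwise equivalent:

::

    >>> import json
    >>> df = pandas.read_csv('data/question.csv', usecols=['response'],
                converters={'response': lambda x: list(json.loads(x).values())})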

At this point, we have both values in the `response` column. We really want
one column for each value, so we create a new DataFrame, converting each
response to a list and assigning each element to a named column. We also make
sure the values are integers, with the `dtype` parameter, so they can be
plotted.

We can now make a simple bar chart of the responses using plot:

::

    >>> responses.plot(kind='bar')
    <matplotlib.axes._subplots.AxesSubplot at 0x7f7f0092dc90>

If you are running this in a `Jupyter notebook <https://jupyter.org/>`__, this
would be the result:

.. figure:: _static/barplot.png
   :alt: Bar chart of the engagement and difficulty responses

Of course, these are very simple examples. Pandas is a powerful library and
offers many analysis and visualization methods, but this should at least give
an idea of what can be achieved.
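
For example, summary statistics for the responses take a single call. Assuming
the `responses` DataFrame built above, the output should look roughly like
this:

::

    >>> responses.mean()
    engagement    3.166667
    difficulty    6.000000
    dtype: float64
    >>> responses.describe()    # count, mean, std, min, quartiles and max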

Dallinger also has a helper class that allows us to handle experiment data in
different formats. You can get the DataFrame using this, as well:

::

    $ python
    >>> from dallinger.data import Table
    >>> data = Table('data/info.csv')
    >>> df = data.df

It might seem like a roundabout way to get the DataFrame, but the Table class
has the advantage that the data can easily be converted to many other formats.
All of these formats are accessed as properties of the Table instance, like
`data.df` above. Supported formats are:

    - csv. Comma-separated values.
    - dict. A Python dictionary.
    - df. A pandas DataFrame.
    - html. An HTML table.
    - latex. A LaTeX table.
    - list. A Python list.
    - ods. An OpenDocument spreadsheet.
    - tsv. Tab-separated values.
    - xls. Legacy Excel spreadsheet.
    - xlsx. Excel spreadsheet.
    - yaml. YAML format.

From the list above, `dict`, `df`, and `list` can be used to handle the data
inside a Python interpreter or program, while the rest are better suited for
display or analysis using other tools.
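
For example, using the `data` object created above, the same table can be
turned into a plain Python list for further processing, or written out as an
HTML snippet for a report. This is a small sketch that follows the
property-access pattern described above (it assumes the `html` property
returns the rendered table as a string):

::

    >>> rows = data.list                   # plain Python list of rows
    >>> from pathlib import Path
    >>> Path('infos.html').write_text(data.html)   # save the HTML rendering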