# The `data` Directory
## Overview
This directory contains the files used to pre-load our production server
database with *Pseudomonas aeruginosa*-related samples and annotations. All of
these files are loaded into the system by the management commands listed in
[the load_default_pseudomonas_data.sh script](https://github.com/greenelab/adage-server/blob/master/load_default_pseudomonas_data.sh).
Each command takes the location and filename of the data file to load as
an argument.
If you use [Fabric](http://www.fabfile.org/), the files can also be loaded with
the `import_data_and_index()` function in
[fabfile/adage_server.py](https://github.com/greenelab/adage-server/blob/master/fabfile/adage_server.py).
The
[adage/adage/config-template.py file](https://github.com/greenelab/adage-server/blob/master/adage/adage/config-template.py)
specifies which files `import_data_and_index()` loads.
All of the files below are tab-delimited text files. This includes
`gene_gene_network_cutoff_0.2.txt.gz`, which is a tab-delimited text file
once decompressed. If you wish to load your own files into your adage-server
instance, they must use the same format as the corresponding files below.
## File Listing and Description
### **PseudomonasAnnotation.tsv**
Lists samples retrieved from ArrayExpress along with manually-curated
annotations for those samples. Each row is a sample, and the columns
contain information about that sample.
### **all-pseudomonas-gene-normalized.pcl**
Contains normalized expression levels for every gene in each sample in
the compendium. In this file, genes are rows, and samples are columns.
This file is the input expression data for building an ADAGE model. It was
generated using
[this script](https://bitbucket.org/greenelab/eadage/src/tip/data_collection/data_collection.sh).
### **sample_signature_activity.txt**
Contains signature (node) activity levels (as absolute values) for each
sample in the compendium derived from the eADAGE machine learning model.
In this file, samples are rows, and signatures are columns. This file is
based on:
```
all-pseudomonas-gene-normalized_HWActivity_perGene_with_net300_100models_1_100_k=300_seed=123_ClusterByweighted_avgweight_network_ADAGE.txt
```
which is now in the `data/old/` folder and was generated by
[this script](https://bitbucket.org/greenelab/eadage/src/tip/node_interpretation/write_HWactivity.R).
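The samples-by-signatures layout described above can be read with a few lines
of Python. This is only a sketch: it assumes the first row is a header whose
remaining cells are signature names and that the first column of each later
row holds the sample name. Check the actual file before relying on either
assumption. The function name `read_activity` is hypothetical and is not part
of adage-server.

```python
import csv


def read_activity(lines):
    """Parse tab-delimited activity rows into {sample: {signature: value}}.

    Assumes (not confirmed by this README) that the first row is a header
    and that the first column of every other row is the sample name.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    signatures = header[1:]  # skip the label above the sample column
    activity = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        values = [float(cell) for cell in row[1:]]
        activity[row[0]] = dict(zip(signatures, values))
    return activity
```

To use it on a file, pass an open handle, for example
`read_activity(open("sample_signature_activity.txt", newline=""))`.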
**Note:** The web interface lists signatures in **alphabetical order** (view
[this page](http://adage.greenelab.com/#/signature/search?mlmodel=1) to see
an example), so a name such as `Sig10` sorts before `Sig9`. If you would like
them ordered numerically, you may have to zero-pad the numbers in some
signature names (such as changing `Sig9` into `Sig009`).
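As an illustration of the padding described in the note, the sketch below
zero-pads the first run of digits in a signature name. The helper name
`pad_signature_name` and the default width of 3 are assumptions made for this
example. Choose a width that fits the largest signature number in your model.

```python
import re


def pad_signature_name(name, width=3):
    """Zero-pad the first run of digits in a name, e.g. Sig9 -> Sig009."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), name, count=1)


names = ["Sig10", "Sig9", "Sig100"]
# Plain alphabetical sorting puts Sig9 last.
print(sorted(names))
# After padding, alphabetical order matches numeric order.
print(sorted(pad_signature_name(n) for n in names))
```

If you rename signatures this way, apply the same renaming to every file that
mentions them so that the names stay consistent.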
### **gene_gene_network_cutoff_0.2.txt.gz**
Contains gene-gene network information derived from the eADAGE machine
learning model. Each row gives the edge weight between two genes in this
model and whether that weight is positive or negative. Because of its size,
this file has been compressed using [gzip](http://www.gzip.org/). Once
decompressed, it is a tab-delimited text file like the others.
The first two columns name the two genes connected by the edge. The order of
these two columns ("from"/"to") does not matter, because the weight of the
edge between the two genes is the same in either direction. The third column
is the edge's weight, and the fourth column indicates whether the weight is
positive or negative.
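The four-column layout above can be read directly from the compressed file
with Python's standard `gzip` and `csv` modules. The sketch below is
illustrative only. The function name `read_edges` is hypothetical, and the
`has_header` flag reflects an assumption that the file may start with a
header row; set it to match your file. Because gene order does not matter,
each edge is keyed by an unordered pair.

```python
import csv
import gzip


def read_edges(path, has_header=True):
    """Read a gzipped, tab-delimited edge list.

    Returns {frozenset({gene_a, gene_b}): (weight, sign)}, where sign is the
    raw text of the fourth column. has_header is an assumption about the
    file, not something this README guarantees.
    """
    edges = {}
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row if present
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            sign = row[3] if len(row) > 3 else ""
            edges[frozenset((row[0], row[1]))] = (float(row[2]), sign)
    return edges
```

Using a `frozenset` as the key means a lookup succeeds whichever gene is
listed first.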
### **signature_gene_network.txt**
Lists high-weight genes participating in each signature (node) in the eADAGE
machine learning model. Each row is a signature (node), and the tab-delimited
values following each signature name are the high-weight gene names that
participate in it. This is similar to a
[GMT file format](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GMT),
where each row represents a gene set, except this file does not include a
description column, which is the second column in the GMT file format.
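To make the relationship to GMT concrete, the sketch below parses rows in the
layout described above and shows how a row could be written as a GMT line by
inserting a description field. The function names and the `"na"` placeholder
description are assumptions made for illustration. They are not part of
adage-server.

```python
def parse_signature_genes(lines):
    """Map each signature name to its list of high-weight gene names."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        signatures[fields[0]] = [gene for gene in fields[1:] if gene]
    return signatures


def to_gmt_line(name, genes, description="na"):
    """Build a GMT row: name, description, then the gene names."""
    return "\t".join([name, description, *genes])
```

For example, `parse_signature_genes(open("signature_gene_network.txt"))`
returns a dictionary keyed by signature name.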
As noted in the `sample_signature_activity.txt` section above, you may wish
to zero-pad the numbers in signature names so that the web interface lists
them in the intended order.
### **The `old/` subdirectory**
The `old/` folder contains previous versions of the files above. They are
now obsolete and are kept for backup purposes only.