# The `data` Directory

## Overview

This directory contains the files used to pre-load our production server
database with *Pseudomonas aeruginosa*-related samples and annotations. All of
these files are loaded into the system using the management commands listed in
[the load_default_pseudomonas_data.sh script](https://github.com/greenelab/adage-server/blob/master/load_default_pseudomonas_data.sh).
Each command takes the path (location and filename) of the data file to load
as an argument.

If using [fabric](http://www.fabfile.org/), these can also be loaded using the
`import_data_and_index()` function in
[fabfile/adage_server.py](https://github.com/greenelab/adage-server/blob/master/fabfile/adage_server.py).
The
[adage/adage/config-template.py file](https://github.com/greenelab/adage-server/blob/master/adage/adage/config-template.py)
specifies which files are loaded by this `import_data_and_index()` function.

All of the files below are tab-delimited text files (the decompressed form
of the gzipped file `gene_gene_network_cutoff_0.2.txt.gz` is a tab-delimited
text file as well). If you wish to load your own files into your adage-server
instance, they should be in the same format as the files below.

## File Listing and Description

### **PseudomonasAnnotation.tsv**
  Lists samples retrieved from ArrayExpress along with manually-curated
  annotations for those samples. Each row is a sample, and the columns
  contain information about that sample.

### **all-pseudomonas-gene-normalized.pcl**
  Contains normalized expression levels for every gene in each sample in
  the compendium. In this file, genes are rows, and samples are columns.
  This is the input expression data for building an ADAGE model, and was
  generated using
  [this script](https://bitbucket.org/greenelab/eadage/src/tip/data_collection/data_collection.sh).
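
  The layout described above (genes as rows, samples as columns) could be read
  with a short Python sketch like the one below. It assumes a single header row
  whose first cell labels the gene column, and that every other cell is
  numeric. `read_expression` and the sample data are hypothetical and are not
  part of adage-server. PCL files from other sources may carry extra annotation
  columns or rows, which this sketch does not handle.

```python
import csv
import io


def read_expression(handle):
    """Return (sample_names, {gene_id: [values]}) from a tab-delimited table.

    Assumes one header row and the gene identifier in the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # every column after the first is a sample
    expression = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        expression[row[0]] = [float(value) for value in row[1:]]
    return samples, expression


# Made-up data in the assumed layout; a real file would be opened with
# open(path, newline="") and passed in the same way.
demo = io.StringIO("Gene\tsampleA\tsampleB\nPA0001\t0.25\t0.5\nPA0002\t0.75\t1.0\n")
samples, expression = read_expression(demo)
print(samples)
print(expression["PA0001"])
```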

### **sample_signature_activity.txt**
  Contains the activity level (as an absolute value) of each signature (node)
  in each sample of the compendium, as derived from the eADAGE machine learning
  model. In this file, samples are rows, and signatures are columns. This file
  is based on:
  ```
  all-pseudomonas-gene-normalized_HWActivity_perGene_with_net300_100models_1_100_k=300_seed=123_ClusterByweighted_avgweight_network_ADAGE.txt
  ```

  which is now in the `data/old/` folder and was generated by
  [this script](https://bitbucket.org/greenelab/eadage/src/tip/node_interpretation/write_HWactivity.R).

  **Note:** The signatures shown on the web interface (see
  [this page](http://adage.greenelab.com/#/signature/search?mlmodel=1) for
  an example) are listed in **alphabetical order**, so `Sig10` sorts before
  `Sig9`. If you would like them ordered differently, you may have to zero-pad
  the numbers in some signature names (for example, changing `Sig9` to
  `Sig009`).
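
  The padding described in the note could be done with a small helper like the
  one below. `pad_signature_name` is a hypothetical function, not part of
  adage-server, and the width of 3 is an assumption that fits up to 999
  signatures. It pads only the first run of digits in each name.

```python
import re


def pad_signature_name(name, width=3):
    """Zero-pad the first run of digits in a signature name."""
    return re.sub(r"\d+", lambda match: match.group(0).zfill(width), name, count=1)


names = ["Sig10", "Sig9", "Sig100"]
print(sorted(names))  # ['Sig10', 'Sig100', 'Sig9']
print(sorted(pad_signature_name(n) for n in names))  # ['Sig009', 'Sig010', 'Sig100']
```

  The same renaming would need to be applied consistently wherever signature
  names appear, so that the names still match across files.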

### **gene_gene_network_cutoff_0.2.txt.gz**
  Contains gene-gene network information derived from the eADAGE machine
  learning model. Each row contains the edge weight between two genes in this
  model and whether that weight is positive or negative. Because of its size,
  this file has been compressed using [gzip](http://www.gzip.org/); its
  decompressed content is a tab-delimited text file like the others.

  The first two columns are the genes for the given edge weight. The order of
  these first two columns ("from"/"to") does not matter, as the weight of the
  edge between the two genes is the same. The third column is the edge's
  weight, and the fourth column indicates whether the weight is positive or
  negative.
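
  As a rough illustration of the four-column layout, the sketch below streams
  edges from the gzipped file and stores each gene pair in sorted order, so the
  "from"/"to" order does not affect lookups. The function names are
  hypothetical. The presence of a header row and the `+` label in the fourth
  column are assumptions made for the demo, so check them against the actual
  file.

```python
import csv
import gzip
import io


def read_edges(handle, has_header=True):
    """Yield ((gene_a, gene_b), weight, sign) for each row of the network file."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or incomplete rows
        gene_a, gene_b, weight, sign = row[:4]
        # Sort the pair so that (A, B) and (B, A) map to the same key.
        yield tuple(sorted((gene_a, gene_b))), float(weight), sign


def read_edges_gz(path, has_header=True):
    """Stream edges from the gzipped file without decompressing it on disk."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from read_edges(handle, has_header)


# Made-up row; "+" is a placeholder for the real fourth-column value.
demo = io.StringIO("from\tto\tweight\tsign\nPA0002\tPA0001\t0.35\t+\n")
edges = list(read_edges(demo))
print(edges)
```

  With the real file, `read_edges_gz("gene_gene_network_cutoff_0.2.txt.gz")`
  would yield the same kind of tuples one row at a time.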

### **signature_gene_network.txt**
  Lists high-weight genes participating in each signature (node) in the eADAGE
  machine learning model. Each row is a signature (node), and the tab-delimited
  values following each signature name are the high-weight gene names that
  participate in it. This is similar to the
  [GMT file format](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GMT),
  where each row represents a gene set, except that this file omits the
  description column, which is the second column in the GMT file format.

  As noted under *sample_signature_activity.txt* above, you may wish to
  zero-pad the numbers in signature names so that they sort in a useful order
  on the web interface.
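
  To make the relationship to GMT concrete, the sketch below parses rows in the
  layout described above and shows how a GMT row differs only by the added
  description column. Both function names are hypothetical, the gene and
  signature names are made up, and `"na"` is an arbitrary placeholder
  description.

```python
def parse_signature_genes(lines):
    """Map each signature name to its list of high-weight genes."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        signatures[fields[0]] = [gene for gene in fields[1:] if gene]
    return signatures


def to_gmt_line(name, genes, description="na"):
    """Build a GMT row by inserting a description as the second column."""
    return "\t".join([name, description, *genes])


# Made-up rows; a real file object can be passed in place of this list.
demo_lines = ["Sig001\tPA0001\tPA0002\n", "Sig002\tPA0003\n"]
signatures = parse_signature_genes(demo_lines)
print(to_gmt_line("Sig001", signatures["Sig001"]))
```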

### **The `old/` subdirectory**
  The `old/` folder contains previous versions of the files above. These are
  now obsolete and are kept only as backups.