NationalGenomicsInfrastructure/ngi_pipeline

DELIVERY.README.txt

README
=======

The README describes the content of the delivered data.
-----------------------------------------------------------------

Root level
----------
The root folder, which is named by the project id, contains one report folder
and one folder for each sample. Each sample folder is accompanied by a .lst-file
containing a list of the files in the folder and a .md5-file containing the
MD5-checksums of the files in the folder. Use the MD5-checksums to verify the
integrity of the files after transfer.

|--ProjectID
   |--00-Report
   |--Sample1
   |--Sample1.lst
   |--Sample1.md5
   |--Sample2
   |--Sample2.lst
   |--Sample2.md5
   ...
   ...
   |--SampleN
   |--SampleN.lst
   |--SampleN.md5
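
The checksum verification described above can be sketched in Python. This is
a minimal example, not part of the delivery. It assumes that each .md5-file
uses md5sum-style lines ("<checksum>  <path>") and that the listed paths are
relative to the folder holding the .md5-file; neither is stated in this
README, so adjust base_dir if the paths are relative to another folder. The
function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_file(md5_path, base_dir=None):
    """Compare each checksum listed in md5_path with the file on disk.

    Assumption: md5sum-style lines ("<checksum>  <path>") with paths
    relative to base_dir (default: the folder holding the .md5-file).
    Returns a dict mapping each listed path to "OK", "FAILED" or "MISSING".
    """
    md5_path = Path(md5_path)
    base_dir = Path(base_dir) if base_dir is not None else md5_path.parent
    results = {}
    with open(md5_path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            expected, name = line.rstrip("\n").split(None, 1)
            name = name.lstrip("*").strip()  # "*" marks binary mode in md5sum
            target = base_dir / name
            if not target.is_file():
                results[name] = "MISSING"
            elif md5_of_file(target) == expected.lower():
                results[name] = "OK"
            else:
                results[name] = "FAILED"
    return results
```

Calling verify_md5_file("ProjectID/Sample1.md5") returns one status per listed
file; any "FAILED" or "MISSING" entry suggests the transfer should be repeated
for that file.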

====================================================================
ProjectID -> 00-Report
====================================================================
The 00-Report folder contains sequence QC data, aggregate statistics and a
report of the software versions used.

|--ProjectID
   |--00-Report
      |--ProjectID_aggregate_report.csv
      |--version_report.txt
      |--SequenceQC

--ProjectID_aggregate_report.csv
A tab-delimited file (despite the .csv extension) with sequence, alignment and
variant statistics for each sample in the project.
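
Because the report is tab-delimited, it can be loaded with Python's csv module
by setting the delimiter explicitly. The sketch below assumes that the first
line of the file holds the column names, which this README does not state; no
particular column names are assumed. The function name is illustrative.

```python
import csv


def read_aggregate_report(path):
    """Read a tab-delimited report into a list of dicts, one per row.

    Assumption: the first line of the file is a header row. The column
    names are taken from that line; none are hard-coded here.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each returned dict maps a column name to the value for one row, with all
values kept as strings.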

--SequenceQC
The SequenceQC folder contains sequence QC data, which provides information
about the quality and other features of the fastq-files, at both sample and
lane level. The reports are organized by sequencing run.

--version_report.txt
Information from piper about the data sources and software versions
that have been used.

====================================================================
ProjectID -> SampleN
====================================================================
Each sample folder contains the following subfolders:

|--ProjectID
   |--SampleN
      |--00-Reports
      |--01-QC
      |--02-FASTQ
      |--03-BAM
      |--04-VCF

ProjectID -> SampleN -> 00-Reports
--------------------------------------------------------------------
Contains two types of reports: the snpEff summary and the sample report.

|--ProjectID
   |--SampleN
      |--00-Reports
         |--SampleID_ign_sample_report.html
         |--SampleID.clean.dedup.recal.bam.raw.annotated.vcf.gz.snpEff.summary.csv
         |--SampleID.clean.dedup.recal.bam.raw.annotated.vcf.gz.snpEff.summary.genes.txt

--snpEff summary report
Reports generated by snpEff. The summary.csv file shows basic statistics about
the analyzed variants. The genes.txt file is a tab-separated file with the
number of variants affecting each transcript and gene.

--Sample report
Report summarizing information about the analysis, alignment and variants.

ProjectID -> SampleN -> 01-QC
--------------------------------------------------------------------
--SampleName.clean.dedup.recal.qc
Qualimap QC report from mapping

--SampleName.metrics
Picard MarkDuplicates output

--SampleName.clean.dedup.recal.bam.snp.eval
GATK variant calling evaluation for SNPs

--SampleName.clean.dedup.recal.bam.indel.eval
GATK variant calling evaluation for indels

ProjectID -> SampleN -> 02-FASTQ
--------------------------------------------------------------------
--bam2fastq.sh
SLURM (or bash) script that can be used for generating FASTQ files from
a BAM file. The script is formatted to be submitted and run on a
UPPMAX SLURM cluster. See top of the script file for further usage
instructions.

ProjectID -> SampleN -> 03-BAM
--------------------------------------------------------------------
--SampleName.clean.dedup.bam
--SampleName.clean.dedup.bai
BAM file (and index file) generated by piper. The file name consists of the
sample name followed by the modifications that have been applied to the file,
according to:
  -clean => applied GATK IndelRealigner on bam file
  -dedup => marked duplicates on bam file

Note that variant calling has been performed after recalibrating the base quality
scores of this BAM file (BQSR). However, due to the drastic increase in file
size during base quality recalibration, the BAM file without recalibrated
base quality scores is delivered. If recalibrated base qualities are required
for downstream applications, the script and resources below can be used to
obtain a recalibrated BAM file.

--applyRecalibration.sh
SLURM (or bash) script that can be used to apply recalibration data and
generate a recalibrated BAM file. See top of script file for further usage
instructions.

--SampleName.pre_recal.table
Recalibration covariate data to be used for obtaining a recalibrated BAM
file, using the script above.

ProjectID -> SampleN -> 04-VCF
--------------------------------------------------------------------
Contains the final VCF files and indexes, in total 6 files.

--SampleN.clean.dedup.recal.bam.genomic.vcf.gz
--SampleN.clean.dedup.recal.bam.genomic.vcf.gz.tbi
Genomic VCF (gVCF) (and index file) containing sequencing information for both
variant and non-variant positions. Can be used for downstream cohort calling.

--SampleN.clean.dedup.recal.bam.raw.annotated.vcf.gz
--SampleN.clean.dedup.recal.bam.raw.annotated.vcf.gz.tbi
VCF file containing variants called from the recalibrated BAM file and
annotated with variation effects generated with snpEff.
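
As a quick sanity check after transfer, the compressed VCF files can be read
directly with Python's gzip module, without decompressing them to disk first.
The sketch below counts the data lines in a file; the function name is
illustrative. Note that in the genomic VCF the data lines also cover
non-variant positions, so the count there is not a number of variants.

```python
import gzip


def count_vcf_records(path):
    """Count data lines (lines not starting with "#") in a .vcf.gz file."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Header lines, which start with "#" in the VCF format, and blank lines are
skipped; every other line is counted as one record.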