This page contains general information about the structure and operation of the ITAG distributed annotation pipeline.

For specific details of the functioning of the pipeline repository (file format specs, directory structure, validation, etc.), see PipelineRepository.

For a step-by-step guide to creating a new ITAG analysis, see AnalysisImplementationHowTo.

Summary

The ITAG distributed annotation pipeline is a system for annotating tomato genomic sequences. At first, these genomic sequences will be BAC sequences, overlapping samples of the tomato genome averaging about 100,000 nucleotides in length. Later, these sequences will be used to build pseudomolecules with sequences that approximate the actual sequences of the euchromatic portions of the tomato genome.

The pipeline consists of a central file repository and several remote sites that upload and download files containing analysis results. To annotate a sequence under this system:

  • the sequence is placed in the central repository
  • several remote sites download it, analyze it in some way, then upload the results back to the repository
  • other remote sites might download these results, analyze or integrate them, and upload more results
  • finally, when all analyses are finished, the results are integrated and published

Design Concepts

Pipeline Versioning

To allow continuous pipeline development while at the same time ensuring consistent results, changes to the pipeline are strictly controlled. Each working pipeline is assigned a version number, and any changes to it require issuance of a new version number. In this way, results generated by one pipeline version can be considered comparable, and the pipeline's development history can be rigorously tracked. The exception to this is pipeline version 0, which is not change-controlled, since it will be used for initial pipeline development.

Pipeline Batches

For each version of the pipeline, the input sequences are divided into batches, each of which contains at least 10 but not more than 100 sequences. This batching scheme attempts to balance the demands of:

  • turnaround time - quickly producing an annotation when a new sequence becomes available

  • computational efficiency - larger batches can be more efficient, but very large batches can strain computational resources

  • organizational burden - too many small batches can be overly burdensome if analyses are run manually, or if computational time must be scheduled in advance
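
As a rough illustration of the batch size limits above, the following Python sketch splits a list of sequence identifiers into evenly sized batches. The function name make_batches and the even-split strategy are assumptions made for this example; this page does not specify how sequences are actually assigned to batches.

```python
import math


def make_batches(seq_ids, min_size=10, max_size=100):
    """Split seq_ids into batches of min_size..max_size sequences each.

    Hypothetical helper: uses the fewest batches that respect max_size,
    then spreads the sequences as evenly as possible across them.
    """
    n = len(seq_ids)
    if n < min_size:
        raise ValueError(f"need at least {min_size} sequences, got {n}")

    n_batches = math.ceil(n / max_size)
    base, extra = divmod(n, n_batches)

    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches each take one additional sequence.
        size = base + (1 if i < extra else 0)
        batches.append(seq_ids[start:start + size])
        start += size

    # Guard against limits that cannot both be satisfied (e.g. min 10, max 12, n 13).
    if any(len(b) < min_size for b in batches):
        raise ValueError("cannot satisfy both min_size and max_size")
    return batches
```

With the default limits, 250 sequences would be split into three batches of 84, 83, and 83 sequences, while fewer than 10 sequences would be rejected until more become available.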

Analysis Tags

Each analysis in the pipeline has a lower-case shorthand name, called its analysis tag or just its tag. Each analysis tag is unique. By convention, analysis tags for related analyses share a common prefix. For example, a BLASTN search against E. coli sequences might have the tag "blastn_ecoli", and a BLASTN search against tomato chloroplast sequences might have the tag "blastn_chloro".

Analysis tags may contain only lower-case letters, numerals, and underscores.
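
The character rule above can be checked mechanically. The following Python sketch is only an illustration: the names TAG_PATTERN, is_valid_tag, and tag_prefix are hypothetical, and treating the text before the first underscore as the "common prefix" is an assumption based on the examples above, not a rule stated on this page.

```python
import re

# Lower-case ASCII letters, numerals, and underscores only; at least one character.
TAG_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_tag(tag):
    """Return True if tag uses only the characters allowed in analysis tags."""
    return TAG_PATTERN.fullmatch(tag) is not None


def tag_prefix(tag):
    """Return the part of the tag before the first underscore (assumed prefix)."""
    return tag.split("_", 1)[0]
```

Under these assumptions, "blastn_ecoli" and "blastn_chloro" are both valid and share the prefix "blastn", while "BlastN_ecoli" and "blastn-ecoli" are rejected.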

Tag        Process    Database
infernal   Infernal   Rfam

Conventions

This section describes format conventions used for data in the pipeline repository.

Sequence Identifiers

  • BAC sequences - sequence identifiers for BAC sequences are versioned GenBank/EMBL/DDBJ accessions, e.g. AC193776.1

  • Pseudomolecule sequences - the format for pseudomolecule sequence identifiers has not yet been decided
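
For BAC sequences, a versioned accession such as AC193776.1 can be separated into its accession and version parts. The Python sketch below is a minimal illustration; the function name split_versioned_accession is hypothetical, and the code checks only the "accession, period, integer version" shape, not the full accession rules of GenBank/EMBL/DDBJ.

```python
def split_versioned_accession(seq_id):
    """Split a versioned accession like 'AC193776.1' into ('AC193776', 1).

    Hypothetical helper: raises ValueError if there is no '.<integer>' suffix.
    """
    accession, dot, version = seq_id.rpartition(".")
    if not dot or not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not a versioned accession: {seq_id!r}")
    return accession, int(version)
```

An unversioned identifier such as "AC193776" would be rejected by this check.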

File Type Abbreviations

All files in the pipeline repository (including control files or other special files) must end with a period followed by one of the following file type abbreviations:

Abbr.   Description                        General Specification
fasta   FASTA format sequence file         http://www.ebi.ac.uk/help/formats.html#fasta
gff3    GFF version 3                      http://www.sequenceontology.org/gff3.shtml
game    GAME XML                           http://www.fruitfly.org/annot/apollo/game.rng.txt
raw     Interpro raw format
gff2    GFF version 2
xml     XML format
txt     Plain text
agp     Accessioned Golden Path assembly   http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html
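
The file type rule above lends itself to a simple check. The following Python sketch is illustrative only: the names ALLOWED_FILE_TYPES, file_type_of, and has_allowed_file_type are hypothetical, and the case-sensitive comparison is an assumption, since this page does not say how case is handled.

```python
# File type abbreviations taken from the table above.
ALLOWED_FILE_TYPES = frozenset(
    {"fasta", "gff3", "game", "raw", "gff2", "xml", "txt", "agp"}
)


def file_type_of(filename):
    """Return the text after the last period, or None if there is no period."""
    _stem, dot, ext = filename.rpartition(".")
    return ext if dot else None


def has_allowed_file_type(filename):
    """Return True if filename ends with a period plus a listed abbreviation."""
    return file_type_of(filename) in ALLOWED_FILE_TYPES
```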

File Naming

All analysis result files in the central pipeline repository are named according to the following format:

<seq>.<analysis>.<desc>.itag<pipe ver>.batch<batch>.v<file ver>.<file type>
  • seq - the sequence's identifier, see "Sequence Identifiers" above

  • analysis - the tag of the analysis that produced the file, see "Analysis Tags" above

  • desc - a very short mnemonic for the file's contents, containing only lower-case letters, numerals, and underscores

  • pipe ver - the version number of the pipeline that produced the file, zero-padded to three digits

  • batch - the pipeline batch that the file came from, zero-padded to three digits

  • file ver - the version of this specific file. This may be incremented, for example, when an error is discovered in a file that does not affect the entire batch. This must be an integer.

  • file type - must be listed in "File Type Abbreviations" above
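
The naming format above can be assembled from its components as in the following Python sketch. The function name build_result_filename and its parameter names are hypothetical, chosen to mirror the fields listed above; the sketch does not validate the individual fields.

```python
def build_result_filename(seq, analysis, desc, pipe_ver, batch, file_ver, file_type):
    """Assemble an analysis result file name from its components.

    pipe_ver, batch, and file_ver are integers; pipe_ver and batch are
    zero-padded to three digits, file_ver is written without padding.
    """
    return (
        f"{seq}.{analysis}.{desc}"
        f".itag{pipe_ver:03d}.batch{batch:03d}.v{file_ver:d}.{file_type}"
    )
```

For example, build_result_filename("AC193776.1", "seq", "vecscreened", 1, 1, 1, "fasta") produces AC193776.1.seq.vecscreened.itag001.batch001.v1.fasta.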

Examples

AC193776.1.seq.vecscreened.itag001.batch001.v1.fasta

is file version 1 of the FASTA-formatted, vector-screened sequence with identifier AC193776.1, produced by the analysis with the tag "seq" in pipeline version 1, batch 1.
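
Going the other way, a file name like the one above can be split back into its fields. The Python sketch below is one possible approach, not the repository's own validation code; RESULT_NAME and parse_result_filename are hypothetical names. Because sequence identifiers themselves contain a period, the pattern relies on the fixed "itag", "batch", and "v" fields to locate the rest.

```python
import re

# Everything before the last two period-free fields preceding ".itag" is the
# sequence identifier, which may itself contain periods (e.g. "AC193776.1").
RESULT_NAME = re.compile(
    r"(?P<seq>.+)"
    r"\.(?P<analysis>[a-z0-9_]+)"
    r"\.(?P<desc>[a-z0-9_]+)"
    r"\.itag(?P<pipe_ver>\d{3,})"
    r"\.batch(?P<batch>\d{3,})"
    r"\.v(?P<file_ver>\d+)"
    r"\.(?P<file_type>[a-z0-9]+)"
)


def parse_result_filename(filename):
    """Return the fields of a result file name as a dict, or None if it does not match."""
    match = RESULT_NAME.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("pipe_ver", "batch", "file_ver"):
        fields[key] = int(fields[key])  # strip zero-padding
    return fields
```

Applied to the example name above, this yields seq "AC193776.1", analysis "seq", desc "vecscreened", pipeline version 1, batch 1, file version 1, and file type "fasta".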

PipelineGeneral (last edited 2011-02-01 20:35:11 by RobertBuels)