This page contains general information about the structure and operation of the ITAG distributed annotation pipeline.
For specific details of the functioning of the pipeline repository (file format specs, directory structure, validation, etc.), see PipelineRepository.
For a step-by-step guide to implementing a new ITAG analysis, see AnalysisImplementationHowTo.
Summary
The ITAG distributed annotation pipeline is a system for annotating tomato genomic sequences. At first, these genomic sequences will be BAC sequences: overlapping samples of the tomato genome averaging about 100,000 nucleotides in length. Later, the BAC sequences will be used to build pseudomolecules whose sequences approximate the actual sequences of the euchromatic portions of the tomato genome.
The pipeline consists of a central file repository and several remote sites that upload and download files containing analysis results. To annotate a sequence under this system:
- the sequence is placed in the central repository
- several remote sites download it, analyze it in some way, then upload the results back to the repository
- other remote sites might download these results, analyze or integrate them, and upload more results
- finally, when all analyses are finished, the results are integrated and published
Design Concepts
Pipeline Versioning
To allow continuous pipeline development while at the same time ensuring consistent results, changes to the pipeline are strictly controlled. Each working pipeline is assigned a version number, and any changes to it require issuance of a new version number. In this way, results generated by one pipeline version can be considered comparable, and the pipeline's development history can be rigorously tracked. The exception to this is pipeline version 0, which is not change-controlled, since it will be used for initial pipeline development.
Pipeline Batches
For each version of the pipeline, the input sequences are divided into batches, each of which contains at least 10 but not more than 100 sequences. This batching scheme attempts to balance the demands of:
- turnaround time: quickly producing an annotation when a new sequence becomes available
- computational efficiency: larger batches can be more efficient, but very large batches can strain computational resources
- organizational burden: too many small batches can be overly burdensome if analyses are run manually, or if computational time must be scheduled in advance
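The batch size limits can be illustrated with a minimal sketch. The function name `make_batches`, the default size of 50, and the policy for handling a short final batch are all assumptions made for this example; the pipeline itself does not prescribe how batches are formed.

```python
MIN_BATCH = 10   # lower bound stated above
MAX_BATCH = 100  # upper bound stated above

def make_batches(seq_ids, size=50):
    """Split a list of sequence identifiers into consecutive batches.

    Hypothetical policy: if the final batch would hold fewer than
    MIN_BATCH sequences, it is combined with the previous batch, and
    the combined batch is split in half if it would exceed MAX_BATCH.
    """
    if not MIN_BATCH <= size <= MAX_BATCH:
        raise ValueError("batch size must be between 10 and 100")
    if len(seq_ids) < MIN_BATCH:
        raise ValueError("at least 10 sequences are needed to form a batch")

    batches = [seq_ids[i:i + size] for i in range(0, len(seq_ids), size)]

    if len(batches) > 1 and len(batches[-1]) < MIN_BATCH:
        tail = batches[-2] + batches[-1]
        if len(tail) <= MAX_BATCH:
            # The combined batch still fits within the upper bound.
            batches[-2:] = [tail]
        else:
            # Too large to merge: rebalance into two similar-sized batches.
            mid = len(tail) // 2
            batches[-2:] = [tail[:mid], tail[mid:]]
    return batches
```

A smaller `size` favors turnaround time, while a larger one favors computational efficiency and fewer batches to organize.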
Analysis Tags
Each analysis in the pipeline has a lower-case shorthand name, called its analysis tag or just its tag. Each analysis tag is unique. By convention, analysis tags for related analyses share a common prefix. For example, a BLASTN search against E. coli sequences might have the tag "blastn_ecoli", and a BLASTN search against tomato chloroplast sequences might have the tag "blastn_chloro".
Analysis tags may contain only lower-case letters, numerals, and underscores.
Tag | Process | Database
infernal | Infernal | Rfam
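The character rule for analysis tags can be checked with a short regular expression. This is a minimal sketch: the helper names `is_valid_tag` and `tag_prefix` are hypothetical, and treating the text before the first underscore as the shared prefix is an assumption based on the "blastn_ecoli" and "blastn_chloro" examples.

```python
import re

# Lower-case letters, numerals, and underscores only, as stated above.
TAG_RE = re.compile(r"[a-z0-9_]+")

def is_valid_tag(tag):
    """Return True if tag uses only the characters allowed in analysis tags."""
    return TAG_RE.fullmatch(tag) is not None

def tag_prefix(tag):
    """Return the part of the tag before the first underscore.

    Assumption: related analyses share this leading component.
    """
    return tag.split("_", 1)[0]
```

For example, `is_valid_tag("blastn_ecoli")` is true, `is_valid_tag("BlastN")` is false, and `tag_prefix("blastn_chloro")` gives `"blastn"`.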
Conventions
This section describes format conventions used for data in the pipeline repository.
Sequence Identifiers
BAC sequences - sequence identifiers for BAC sequences are versioned GenBank/EMBL/DDBJ accessions, e.g. AC193776.1
Pseudomolecule sequences - the format for pseudomolecule sequence identifiers has not yet been decided
File Type Abbreviations
All files in the pipeline repository (including control files or other special files) must end with a period followed by one of the following file type abbreviations:
Abbr. | Description | General Specification
fasta | FASTA format sequence file |
gff3 | GFF version 3 |
game | GAME XML |
raw | Interpro raw format |
gff2 | GFF version 2 |
xml | XML format |
txt | Plain text |
agp | Accessioned Golden Path assembly | http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html
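As an illustration of the rule that every repository file must end with one of these abbreviations, the following sketch flags file names with an unrecognized ending. The function names are hypothetical and are not part of the pipeline's own validation tools.

```python
# Abbreviations copied from the table above.
FILE_TYPES = {"fasta", "gff3", "game", "raw", "gff2", "xml", "txt", "agp"}

def file_type(filename):
    """Return the file type abbreviation of filename, or None if not allowed."""
    stem, dot, ext = filename.rpartition(".")
    # Assumption: a name with no period, or with nothing before the
    # final period, is treated as having no valid file type.
    if not dot or not stem:
        return None
    return ext if ext in FILE_TYPES else None

def unknown_files(filenames):
    """Return the file names whose ending is not a listed abbreviation."""
    return [name for name in filenames if file_type(name) is None]
```

Calling `unknown_files` on a directory listing would give the files that need renaming before they can be placed in the repository.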
File Naming
All analysis result files in the central pipeline repository are named according to the following format:
<seq>.<analysis>.<desc>.itag<pipe ver>.batch<batch>.v<file ver>.<file type>
- seq: the sequence's identifier, see "Sequence Identifiers" above
- analysis: the tag of the analysis that produced the file, see "Analysis Tags" above
- desc: a very short mnemonic for the file's contents, containing only lower-case letters, numerals, and underscores
- pipe ver: the version number of the pipeline that produced the file, zero-padded to three digits
- batch: the pipeline batch that the file came from, zero-padded to three digits
- file ver: the version of this specific file. This may be incremented, for example, when an error is discovered in a file that does not affect the entire batch. This must be an integer.
- file type: must be listed in "File Type Abbreviations" above
Examples
AC193776.1.seq.vecscreened.itag001.batch001.v1.fasta
would be file version 1 of the FASTA-formatted, vector-screened sequence with identifier AC193776.1, produced by the analysis with the tag "seq" in pipeline version 1, batch number 1.
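The naming format can be sketched as a pair of helper functions that build and parse file names. The names `build_name` and `parse_name` are hypothetical. The sketch also assumes that version or batch numbers above 999 would simply use more digits, and it does not check the file type against the list of allowed abbreviations.

```python
import re

# Sequence identifiers may themselves contain periods (e.g. AC193776.1),
# so the pattern is anchored on the fixed "itag", "batch", and "v" fields.
NAME_RE = re.compile(
    r"(?P<seq>.+)"
    r"\.(?P<analysis>[a-z0-9_]+)"
    r"\.(?P<desc>[a-z0-9_]+)"
    r"\.itag(?P<pipe_ver>\d{3,})"
    r"\.batch(?P<batch>\d{3,})"
    r"\.v(?P<file_ver>\d+)"
    r"\.(?P<file_type>[a-z0-9]+)"
)

def build_name(seq, analysis, desc, pipe_ver, batch, file_ver, file_type):
    """Assemble a repository file name from its components."""
    return (f"{seq}.{analysis}.{desc}.itag{pipe_ver:03d}"
            f".batch{batch:03d}.v{file_ver}.{file_type}")

def parse_name(filename):
    """Split a file name into its components, or return None if it does not match."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("pipe_ver", "batch", "file_ver"):
        fields[key] = int(fields[key])  # numeric fields become integers
    return fields
```

Parsing the example name above yields the sequence identifier "AC193776.1", the analysis tag "seq", and the description "vecscreened", and passing the parsed fields back to `build_name` reproduces the original file name.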
