This page contains general information about the structure and operation of the ITAG distributed annotation pipeline.

For specific details of the functioning of the pipeline repository (file format specs, directory structure, validation, etc.), see PipelineRepository.

For a step-by-step guide to creating a new ITAG analysis, see AnalysisImplementationHowTo.

Summary

The ITAG distributed annotation pipeline is a system for annotating tomato genomic sequences. At first, these genomic sequences will be BAC sequences, overlapping samples of the tomato genome averaging about 100,000 nucleotides in length. Later, these sequences will be used to build pseudomolecules with sequences that approximate the actual sequences of the euchromatic portions of the tomato genome.

The pipeline consists of a central file repository and several remote sites that upload and download files containing analysis results. To annotate a sequence under this system:

Design Concepts

Pipeline Versioning

To allow continuous pipeline development while at the same time ensuring consistent results, changes to the pipeline are strictly controlled. Each working pipeline is assigned a version number, and any changes to it require issuance of a new version number. In this way, results generated by one pipeline version can be considered comparable, and the pipeline's development history can be rigorously tracked. The exception to this is pipeline version 0, which is not change-controlled, since it will be used for initial pipeline development.

Pipeline Batches

For each version of the pipeline, the input sequences are divided into batches, each of which contains at least 10 but not more than 100 sequences. This batching scheme attempts to balance the demands of:

Analysis Tags

Each analysis in the pipeline has a lower-case shorthand name, called its analysis tag or just its tag. Each analysis tag is unique. By convention, analysis tags for related analyses share a common prefix. For example, a BLASTN search against E. coli sequences might have the tag "blastn_ecoli", and a BLASTN search against tomato chloroplast sequences might have the tag "blastn_chloro".

Analysis tags may contain only lower-case letters, numerals, and underscores.

Tag       Process   Database
infernal  Infernal  Rfam
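
The character rule for analysis tags can be checked mechanically. The following is a minimal sketch in Python, not part of the pipeline itself. The function name is_valid_tag is hypothetical, and the sketch assumes that a tag must contain at least one character.

```python
import re

# Lower-case letters, numerals, and underscores only, per the rule above.
# Assumption: a tag must contain at least one character.
_TAG_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_tag(tag: str) -> bool:
    """Return True if `tag` uses only the characters allowed in analysis tags."""
    return _TAG_PATTERN.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ("blastn_ecoli", "blastn_chloro", "BlastN-Ecoli"):
        print(candidate, is_valid_tag(candidate))
```

Note that this checks only the allowed characters. Uniqueness of a tag and the shared-prefix convention would need to be checked against the full list of analyses.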

Conventions

This section describes format conventions used for data in the pipeline repository.

Sequence Identifiers

File Type Abbreviations

All files in the pipeline repository (including control files or other special files) must end with a period followed by one of the following file type abbreviations:

Abbr.  Description                       General Specification
fasta  FASTA format sequence file        http://www.ebi.ac.uk/help/formats.html#fasta
gff3   GFF version 3                     http://www.sequenceontology.org/gff3.shtml
game   GAME XML                          http://www.fruitfly.org/annot/apollo/game.rng.txt
raw    Interpro raw format
gff2   GFF version 2
xml    XML format
txt    Plain text
agp    Accessioned Golden Path assembly  http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html
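
As an illustration of the file type rule, the sketch below tests whether a file name ends with a period followed by one of the listed abbreviations. It is a hedged example, not the repository's own validation code. The names KNOWN_FILE_TYPES and has_known_file_type are hypothetical, and the comparison is assumed to be case-sensitive.

```python
# File type abbreviations copied from the table above.
KNOWN_FILE_TYPES = frozenset(
    {"fasta", "gff3", "game", "raw", "gff2", "xml", "txt", "agp"}
)


def has_known_file_type(file_name: str) -> bool:
    """Return True if `file_name` ends with a period plus a listed abbreviation."""
    _stem, dot, suffix = file_name.rpartition(".")
    # rpartition yields an empty `dot` when the name contains no period at all.
    # Assumption: matching is case-sensitive, so "FASTA" is not accepted.
    return dot == "." and suffix in KNOWN_FILE_TYPES


if __name__ == "__main__":
    for name in ("example.gff3", "example.gz", "README"):
        print(name, has_known_file_type(name))
```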

File Naming

All analysis result files in the central pipeline repository are named according to the following format:

<seq>.<analysis>.<desc>.itag<pipe ver>.batch<batch>.v<file ver>.<file type>

Examples

AC193776.1.seq.vecscreened.itag001.batch001.v1.fasta

would be file version 1 of the FASTA-formatted, vector-screened sequence with identifier AC193776.1, produced by the analysis with the tag "seq", in pipeline version 1, batch number 1.
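
The naming format can be taken apart with a regular expression. The following Python sketch is one possible reading of the format, not the pipeline's own parser. The names ResultFileName and parse_result_file_name are hypothetical. It assumes that the description field contains no periods, that the version and batch fields are plain decimal numbers with any amount of zero padding, and that only the sequence identifier may itself contain periods (as AC193776.1 does).

```python
import re
from typing import NamedTuple, Optional


class ResultFileName(NamedTuple):
    """Fields of an analysis result file name (hypothetical container)."""
    seq: str
    analysis: str
    desc: str
    pipeline_version: int
    batch: int
    file_version: int
    file_type: str


# The fixed fields are matched from the right, so the sequence identifier
# (which may contain periods) takes whatever remains at the front.
# Assumption: <desc> contains no periods.
_NAME_PATTERN = re.compile(
    r"(?P<seq>.+)"
    r"\.(?P<analysis>[a-z0-9_]+)"
    r"\.(?P<desc>[^.]+)"
    r"\.itag(?P<pipe_ver>[0-9]+)"
    r"\.batch(?P<batch>[0-9]+)"
    r"\.v(?P<file_ver>[0-9]+)"
    r"\.(?P<file_type>[a-z0-9]+)"
)


def parse_result_file_name(name: str) -> Optional[ResultFileName]:
    """Split a result file name into its fields, or return None if it does not match."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return ResultFileName(
        seq=match["seq"],
        analysis=match["analysis"],
        desc=match["desc"],
        pipeline_version=int(match["pipe_ver"]),
        batch=int(match["batch"]),
        file_version=int(match["file_ver"]),
        file_type=match["file_type"],
    )


if __name__ == "__main__":
    print(parse_result_file_name(
        "AC193776.1.seq.vecscreened.itag001.batch001.v1.fasta"
    ))
```

Applied to the example name above, the sketch yields the sequence identifier AC193776.1, the analysis tag seq, the description vecscreened, pipeline version 1, batch 1, file version 1, and file type fasta.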

PipelineGeneral (last edited 2011-02-01 20:35:11 by RobertBuels)