This page contains specific technical information about the pipeline central repository.

For a general overview of the concepts and motivations behind the ITAG distributed pipeline, see PipelineGeneral.

For technical descriptions of each analysis in the pipeline, see Pipeline000.

For a step-by-step how-to of how to make a new ITAG analysis, see AnalysisImplementationHowTo.

Summary

SGN manages the central file repository for the distributed pipeline. At its heart, the pipeline repository is a collection of files and directories available to ITAG members via SCP and SFTP.

The pipeline repository is managed by software written at SGN. The pipeline repository management system provides important services for ITAG members, chiefly Status Reporting, used to coordinate the running of analyses, and Validation, used to ensure that the analysis results uploaded by ITAG members are consistent and error-free.

Accessing the Repository Files

To gain access to the SGN pipeline repository, email RobertBuels with

  • your desired login name (lower-case letters only, no underscores, starting with 'itag'). Example: itagmips

  • the tagnames of the analyses you will be running (see Pipeline000 for a list of analyses and their tags)

Robert will then email you back with directions for accessing the repository.

Directory Structure

The directory structure of the pipeline repository looks like:

/itag/
   pipe000/
      analysis_defs/

      batch001/
         analysis_tag/
            <analysis result files>
            ...
         ...
      batch002/
         analysis_tag/
            <analysis result files>
            ...
         ...
      ...
   pipe001/
       analysis_defs/

       batch001/
         analysis_tag/
            <analysis result files>
            ...
         ...
       ...
   ...

Each pipe### directory corresponds to one version of the pipeline, each batch### directory holds one batch for that version of the pipeline, and each analysis_tag directory (sometimes called the "analysis directory") contains the results of that analysis for that batch.
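As a rough illustration of this layout, the following Python sketch builds repository paths from a pipeline version, batch number, and analysis tag. The function names are hypothetical, and the three-digit zero padding is an assumption inferred from names such as pipe000 and batch001.

```python
from pathlib import PurePosixPath


def analysis_dir(pipe_version, batch, analysis_tag, root="/itag"):
    """Return the analysis directory for one analysis in one batch."""
    # Assumption: version and batch numbers are zero-padded to three digits,
    # as in the pipe000 / batch001 names shown in the layout.
    return (
        PurePosixPath(root)
        / f"pipe{pipe_version:03d}"
        / f"batch{batch:03d}"
        / analysis_tag
    )


def analysis_defs_dir(pipe_version, root="/itag"):
    """Return the analysis_defs directory for one pipeline version."""
    return PurePosixPath(root) / f"pipe{pipe_version:03d}" / "analysis_defs"


print(analysis_dir(0, 1, "interpro"))  # /itag/pipe000/batch001/interpro
```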

File Naming

All files in the repository must conform to the ITAG file naming conventions covered in PipelineGeneral, Conventions->File Naming.

Pipeline Repository Management System

The pipeline repository management system is a suite of software written in object-oriented Perl. At SGN, it is used to create, delete, list, and report on the status of pipeline versions, the batches they contain, and the analyses run on each of the batches.

For analysis implementors and maintainers, however, the most important features of the management system are the services it provides: Status Reporting, which lets them know when their analyses are ready to be run, and Validation, which ensures that the analysis results in the central repository are error-free and published correctly.

Status Reporting

The management system tracks the status of each analysis in each batch of each pipeline version. These statuses can be checked in real-time on SGN via either a human-readable web page (http://www.sgn.cornell.edu/sequencing/itag/status_html.pl), or a machine-readable web service (see PipelineStatusWebService). These services are available now.

Analysis Definition Files

Each analysis in the pipeline has a plain-text definition file that specifies:

  • the system username and email address of its owner/developer
  • what other analyses it depends on for its input
  • the names of file(s) it is expected to produce for each input sequence

The analysis definition files are stored in the itag/pipeXXX/analysis_defs directory.

The definition file is multi-column and tab- or space-delimited: column 1 is a "key", and all remaining columns are values for that key. For example, the analysis definition file for the interpro analysis, interpro.def.txt, looks like:

  owner_user    itagimperial
  owner_group   itagimperial
  owner_email   d.buchan@imperial.ac.uk

  #this is a comment, ignored by the pipeline software.
  #blank lines are also ignored.

  depends_on            eugene
  produces_files        proteins:gff3 proteins:xml proteins:raw

This indicates that the interpro analysis is owned by the itagimperial user, depends on the eugene analysis for its input, and produces files with names like AC123456.interpro.proteins.itag000.batch001.v1.gff3 (plus the corresponding xml and raw files).
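A minimal Python sketch of how such a definition file could be read is shown here. It is not the pipeline's own Perl code. The function names are hypothetical, and the file name pattern is an assumption inferred from the single example name above; the authoritative naming rules are on PipelineGeneral.

```python
from typing import Dict, List

EXAMPLE = """\
owner_user    itagimperial
owner_group   itagimperial
owner_email   d.buchan@imperial.ac.uk

#this is a comment, ignored by the pipeline software.

depends_on            eugene
produces_files        proteins:gff3 proteins:xml proteins:raw
"""


def parse_definition(text: str) -> Dict[str, List[str]]:
    """Parse definition-file text into {key: [value, ...]}."""
    settings: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        key, *values = line.split()  # columns may be tab- or space-delimited
        settings.setdefault(key, []).extend(values)
    return settings


def expected_files(tag, settings, sequence, pipe=0, batch=1, version=1):
    """List the file names one input sequence is expected to yield.

    Assumption: names follow the pattern of the example in the text,
    <sequence>.<tag>.<desc>.itag<pipe>.batch<batch>.v<version>.<ext>
    """
    names = []
    for pair in settings.get("produces_files", []):
        desc, ext = pair.split(":", 1)
        names.append(
            f"{sequence}.{tag}.{desc}.itag{pipe:03d}.batch{batch:03d}.v{version}.{ext}"
        )
    return names


settings = parse_definition(EXAMPLE)
print(expected_files("interpro", settings, "AC123456")[0])
# AC123456.interpro.proteins.itag000.batch001.v1.gff3
```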

Defined keys for the analysis definition file are:

Key | Required? | Valid Values | Description
----|-----------|--------------|------------
owner_user | yes | any itag username | Username of the owner of this analysis. Used to set permissions on the analysis's directory.
owner_group | no | any group name | Name of the user's group. Defaults to the same name as the user. Most analyses do not need to include this line.
owner_email | yes | valid email address | Email address of the owner, used for sending automated emails.
depends_on | yes | whitespace-separated list of analysis tags | Whitespace-separated list of analysis tag names that must be done before this analysis will be ready.
produces_files | yes | whitespace-separated list of <desc>:<ext> | Whitespace-separated list of description:extension pairs for files this analysis will produce. The extension must be one of the filename extensions listed on PipelineGeneral. If you need to add one, please add it there and notify RobertBuels so he can update the pipeline code.
include | no | whitespace-separated list of filenames in the same directory as the current file | Includes the settings in the named file. Use for implementing groups of closely related analyses. For examples of this in use, see the def files itag/pipe000/analysis_defs/blastp_* in the pipeline repository.

The format of the definition file is extensible, allowing for more keys to be added later.

Any key not defined above, or any other error in the definition file, causes the pipeline system to report an error status for the analysis.
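Before uploading a definition file, an implementor might run a check along the following lines. This Python sketch only mirrors the defined keys listed above; the function name is hypothetical, the pipeline's real checks may be stricter, and the sketch assumes that any include files have already been merged into the settings.

```python
DEFINED_KEYS = {
    "owner_user", "owner_group", "owner_email",
    "depends_on", "produces_files", "include",
}
REQUIRED_KEYS = {"owner_user", "owner_email", "depends_on", "produces_files"}


def definition_problems(settings):
    """Return a list of problems found in parsed definition settings.

    `settings` maps each key to its list of values, for example
    {"owner_user": ["itagimperial"], "depends_on": ["eugene"]}.
    """
    problems = []
    for key in sorted(set(settings) - DEFINED_KEYS):
        problems.append(f"undefined key: {key}")
    for key in sorted(REQUIRED_KEYS - set(settings)):
        problems.append(f"missing required key: {key}")
    for pair in settings.get("produces_files", []):
        desc, sep, ext = pair.partition(":")
        if not (desc and sep and ext):
            problems.append(f"malformed produces_files entry: {pair}")
    return problems


print(definition_problems({"owner_user": ["itagimperial"], "colour": ["red"]}))
```

An empty list means that no problems were found by this sketch, not that the pipeline itself will accept the file.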

Control Files

Each analysis has the option of communicating with the pipeline management system by placing a file named control.txt in its analysis directory (e.g. itag/pipeXXX/batchXXX/repeats/control.txt). The control file's format is two-column tab-delimited, with column 1 being a "key", and column 2 being its "value".

Example:

running 1

Defined keys for the control file are:

Key | Valid Values | Description
----|--------------|------------
running | 1 or 0 | If this key is present in the control file and has a value of 1, the pipeline management system will be advised that the analysis is currently running.

The format of the control file is extensible, allowing for more keys to be added later.

Any keys not defined above are ignored by the pipeline system.
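The control file format is simple enough to read and write with a few lines of code. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes a single tab between key and value, as the format description states.

```python
CONTROL_KEYS = {"running"}  # the only key defined so far


def format_control(running):
    """Return control.txt content that sets the running flag."""
    return f"running\t{1 if running else 0}\n"


def parse_control(text):
    """Parse control.txt content, dropping keys that are not defined."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        if key in CONTROL_KEYS:  # undefined keys are ignored
            values[key] = value.strip()
    return values


def is_running(text):
    """True only if the running key is present with a value of 1."""
    return parse_control(text).get("running") == "1"


print(is_running(format_control(True)))  # True
```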

Validation

The pipeline management system provides four different types of services for validating the output of each analysis. Analysis implementors are strongly encouraged to take advantage of these services, especially Custom Validation, in order to prevent errors being introduced into the pipeline.

File Size Validation

If an analysis implementor uploads a file called manifest.txt into an analysis directory, the pipeline will use its contents to check the size of each (non-control) file in that directory. If the check fails, the analysis's status will be set to error.

The manifest.txt file is two-column and tab-delimited. The first column contains the name of the file, and the second column contains its size in bytes.

Example:

AC171727.1.seq.vecscreened.itag000.batch001.v1.fasta    136366
AC171728.2.seq.vecscreened.itag000.batch001.v1.fasta    191581
...

On UNIX-like systems with perl installed, the following perl one-liner will produce output in the correct format:

perl -MFile::Basename -e 'print basename($_)."\t".(-s $_)."\n" foreach @ARGV' <analysis_dir>/*.itag*
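Where perl is not available, a manifest could be produced and checked locally before upload with something like the Python sketch below. The function names are hypothetical. The *.itag* pattern is taken from the one-liner above, and the local check only approximates what the pipeline does on the server.

```python
from pathlib import Path


def write_manifest(analysis_dir):
    """Write manifest.txt listing the name and size of each *.itag* file."""
    directory = Path(analysis_dir)
    lines = [
        f"{path.name}\t{path.stat().st_size}\n"
        for path in sorted(directory.glob("*.itag*"))
        if path.is_file()
    ]
    (directory / "manifest.txt").write_text("".join(lines))


def manifest_mismatches(analysis_dir):
    """Return names listed in manifest.txt that are missing or the wrong size."""
    directory = Path(analysis_dir)
    mismatches = []
    for line in (directory / "manifest.txt").read_text().splitlines():
        if not line.strip():
            continue
        name, size = line.split("\t")
        path = directory / name
        if not path.is_file() or path.stat().st_size != int(size):
            mismatches.append(name)
    return mismatches
```

Calling write_manifest("<analysis_dir>") just before upload and manifest_mismatches("<analysis_dir>") on a downloaded copy gives a quick check that nothing was truncated in transfer.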

File MD5 Checksum Validation

If the analysis implementor uploads a file called md5sums.txt into an analysis directory, the pipeline will use its contents to check the MD5 checksum of each (non-control) file in that directory. If the check fails, the analysis's status will be set to error.

The format of the md5sums.txt file is the same as the default format produced by the GNU md5sum program, which is already installed on most Linux systems and available for most other platforms. MD5 checksums may also be generated using another method, provided the output adheres to this format.

The format is two-column tabular with the first column containing the 128-bit md5 checksum in hexadecimal format, and the second column containing the file name. The columns should be separated by two space characters.

Example:

564496c3ef6728d22f4fb24bfb54bbff  AC171727.1.seq.vecscreened.itag000.batch001.v1.fasta
421e3305130abfaf21d61477500c2b71  AC171728.2.seq.vecscreened.itag000.batch001.v1.fasta
...

On UNIX-like systems with GNU md5sum installed, the following one-liner will produce output in the correct format:

md5sum *.itag* > md5sums.txt
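On platforms without GNU md5sum, the same file can be generated with Python's standard hashlib module. The following is a sketch under the format rules stated above (hexadecimal checksum, two spaces, file name); the function names are hypothetical, and the *.itag* pattern is taken from the one-liner.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5sums(analysis_dir):
    """Write md5sums.txt covering each *.itag* file in the directory."""
    directory = Path(analysis_dir)
    lines = [
        f"{md5_hex(path)}  {path.name}\n"  # two spaces separate the columns
        for path in sorted(directory.glob("*.itag*"))
        if path.is_file()
    ]
    (directory / "md5sums.txt").write_text("".join(lines))


def md5_mismatches(analysis_dir):
    """Return names listed in md5sums.txt whose checksum does not match."""
    directory = Path(analysis_dir)
    mismatches = []
    for line in (directory / "md5sums.txt").read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.split("  ", 1)
        path = directory / name
        if not path.is_file() or md5_hex(path) != checksum:
            mismatches.append(name)
    return mismatches
```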

File Format Validation

The pipeline management system also contains routines that check that each file conforms to the general format specification for its file type. For example, GFF3 must have 9 columns, with some attribute names in the 9th column being reserved and capitalized. GAME XML must have a top-level <game> element. FASTA sequence files must have correct definition lines, have only certain allowed letters in their sequence, and contain no blank lines.

The file format validation routines are built into the pipeline, and are automatically run on the output of all analyses.
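The pipeline's own validation routines are not reproduced here, but implementors may want a rough local pre-check before uploading. The Python sketch below covers only two of the rules mentioned above: nine tab-separated columns for GFF3 feature lines, and FASTA files with definition lines, no blank lines, and a restricted alphabet. The function names are hypothetical, and the allowed letter set is an assumption, not the pipeline's actual list.

```python
# Assumption: a small nucleotide alphabet; the pipeline's real list may differ.
ALLOWED_LETTERS = frozenset("ACGTN")


def gff3_problems(text):
    """Report GFF3 feature lines that do not have exactly 9 columns."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # skip empty lines, comments, and ## directives
        columns = len(line.split("\t"))
        if columns != 9:
            problems.append(f"line {number}: {columns} columns, expected 9")
    return problems


def fasta_problems(text, allowed=ALLOWED_LETTERS):
    """Report blank lines, bad definition lines, and unexpected letters."""
    problems = []
    seen_defline = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
        elif line.startswith(">"):
            if len(line) == 1 or line[1].isspace():
                problems.append(f"line {number}: definition line has no identifier")
            seen_defline = True
        elif not seen_defline:
            problems.append(f"line {number}: sequence before first definition line")
        elif not set(line.upper()) <= allowed:
            problems.append(f"line {number}: unexpected sequence characters")
    return problems


print(fasta_problems(">AC171727.1\nACGTNNACGT\n"))  # []
```

Passing this pre-check does not guarantee that a file will pass the pipeline's validation, which checks more than is sketched here.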

Custom Validation

When the expected output of an analysis has been thoroughly documented, a custom or semi-custom validation routine for the output of the analysis can be integrated into the SGN pipeline management code. Steps to make this happen are:

  1. Implement the analysis.
  2. Document as fully as possible the characteristics of "good" output from this analysis. This is
  3. Email RobertBuels, who will integrate full validation of the output of your analysis into the pipeline management code. This full validation can be implemented either by SGN or by the analysis implementor in the form of an executable program or script, runnable on a Linux system.

  4. From that point forward, any deviation from "good" output will cause the analysis to report an error status.

PipelineRepository (last edited 2011-02-01 20:34:52 by RobertBuels)