

Summary


Tag

transcripts_tomato/transcripts_sol

Owner

CAB Group

Input

BAC sequences from seq

Output

GenomeThreader based analysis of tomato/other solanaceae ESTs, in GFF3 format

This analysis runs GenomeThreader against the BAC sequences and reports the GenomeThreader output in GFF3 format.


Input


Requirements

  • files must be in ITAG standard FASTA format


Processing


GenomeThreader is used to create spliced alignments of each EST against the S. lycopersicum BAC sequences. The resulting alignments are made available in GFF3 format.

Parameter settings

  • Alignments are filtered using thresholds of 98% identity and 95% coverage, for both tomato species and other Solanaceae (as announced by e-mail on 22 Jan 2007 to sol-bioinformatics, itag and sol-steering).
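
The thresholds above can be illustrated with a minimal Python sketch. It assumes that both values are inclusive minimums and that alignments meeting both are retained; the page does not state this explicitly. The `SplicedAlignment` record and its field names are hypothetical and are not part of GenomeThreader's output format.

```python
from dataclasses import dataclass

# Thresholds taken from the parameter settings above.
MIN_IDENTITY = 0.98
MIN_COVERAGE = 0.95


@dataclass
class SplicedAlignment:
    """Hypothetical record; field names are illustrative only."""
    est_id: str
    identity: float  # fraction of identical aligned positions, 0..1
    coverage: float  # fraction of the EST covered by the alignment, 0..1


def passes_thresholds(aln, min_identity=MIN_IDENTITY, min_coverage=MIN_COVERAGE):
    # Assumption: both thresholds are inclusive minimums.
    return aln.identity >= min_identity and aln.coverage >= min_coverage


def filter_alignments(alignments):
    """Return only the alignments that satisfy both thresholds."""
    return [a for a in alignments if passes_thresholds(a)]
```

The same function would apply to the tomato and the other Solanaceae datasets, since the page gives one pair of thresholds for both.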

GFF3 release

  • The GFF3 format is validated with the online GMOD service available at http://www.gmod.org/gff3.
    The BAC GenBank id was used for naming each GFF3 file, as in the following examples:

  • BACGenBankid.transcripts_tomato.itag000.batch001.v1.gff3

  • BACGenBankid.transcripts_sol.itag000.batch001.v1.gff3
    for ITAG pipeline 000 and batch file 001.
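
Before submitting a file to an online validator, a quick local sanity check can catch obvious problems. The sketch below parses a single GFF3 feature line into its nine tab-separated columns. It is a minimal illustration only, not the validation performed by the GMOD service and not part of the CAB pipeline.

```python
from urllib.parse import unquote

# The nine columns defined by the GFF3 format, in order.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for blank or comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")
    # Column 9 holds semicolon-separated tag=value pairs, URL-escaped.
    attrs = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attrs[unquote(key)] = unquote(value)
    record["attributes"] = attrs
    return record
```

For example, a made-up line such as `"BAC0001\tGenomeThreader\tmatch\t100\t250\t.\t+\t.\tID=aln1;Name=EST42"` yields a dictionary with integer `start` and `end` values and an `attributes` dictionary. The seqid, source, feature type and attribute values in that line are invented for illustration.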

Filenames

  • Two GFF3 files are produced for each BAC sequence in the submission, describing the spliced alignments of tomato ESTs and of other Solanaceae ESTs respectively. They are named as:

<acc>.<ver>.transcripts_tomato.spliced_alignment.itag<pipever>.batch<batchnum>.v<ver>.gff3

<acc>.<ver>.transcripts_sol.spliced_alignment.itag<pipever>.batch<batchnum>.v<ver>.gff3
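
The naming pattern above could be assembled with a small helper like the one sketched here. Several points are assumptions: the first `<ver>` is taken to be the accession version and the last one the file version, and the pipeline and batch numbers are zero-padded to three digits, following the `itag000` and `batch001` examples. The function name and the accession used in the usage note are hypothetical.

```python
VALID_DATASETS = ("transcripts_tomato", "transcripts_sol")


def gff3_filename(accession, accession_version, dataset,
                  pipeline_version, batch_number, file_version):
    """Build a GFF3 file name following the pattern shown above.

    Assumption: pipeline and batch numbers are zero-padded to three digits.
    """
    if dataset not in VALID_DATASETS:
        raise ValueError(f"dataset must be one of {VALID_DATASETS}")
    return (
        f"{accession}.{accession_version}.{dataset}.spliced_alignment."
        f"itag{pipeline_version:03d}.batch{batch_number:03d}.v{file_version}.gff3"
    )
```

Calling `gff3_filename("AC123456", 1, "transcripts_sol", 0, 1, 1)` with the invented accession `AC123456` gives `AC123456.1.transcripts_sol.spliced_alignment.itag000.batch001.v1.gff3`.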




Details of EST DATA SOURCE AND CAB PROCESSING


  • EST sequences are downloaded from the dbEST division of GenBank (the current release was updated in October 2008). So far, sequences from the following sources are available:

  • TOMATO:
    • SOLLC = Solanum lycopersicum;

    • SOLHA = Solanum habrochaites;

    • SOLPN = Solanum pennellii;

    • SOLLP = Solanum lycopersicum X Solanum pimpinellifolium;

  • OTHER_SOLANACEAE:
    • SOLTU = Solanum tuberosum;

    • SOLCH = Solanum chacoense;

    • CAPAN = Capsicum annuum;

    • CAPCH = Capsicum chinense;

    • TOBAC = Nicotiana tabacum;

    • NICBE = Nicotiana benthamiana;

    • NICSY = Nicotiana sylvestris;

    • NICAT = Nicotiana attenuata;

    • NICLS = Nicotiana langsdorffii x Nicotiana sanderae;

    • PETHY = Petunia x hybrida;

  • OTHER_RELATED_SPECIES (RUBIACEAE):
    • COFCA = Coffea canephora;

    • COFAR = Coffea arabica;

    Each dataset of EST sequences is processed as follows:

    - Vector contamination is trimmed: RepeatMasker is used to detect and mask vectors against NCBI's Vector database (update of October 2008).
    - Low-complexity subsequences and simple repeats are masked using RepeatMasker.
    - The vector-cleaned and masked ESTs of each dataset are spliced-aligned against the S. lycopersicum BAC sequences.
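
The species codes listed above fall into three groups. As a minimal sketch, a processing script could look up the group of an EST from its species code as shown below. The dictionary reproduces the grouping on this page; the function name is hypothetical, and the sketch makes no claim about how the CAB scripts actually organise the datasets.

```python
# Species codes grouped as listed on this page.
EST_SOURCE_GROUPS = {
    "TOMATO": {"SOLLC", "SOLHA", "SOLPN", "SOLLP"},
    "OTHER_SOLANACEAE": {
        "SOLTU", "SOLCH", "CAPAN", "CAPCH", "TOBAC",
        "NICBE", "NICSY", "NICAT", "NICLS", "PETHY",
    },
    "OTHER_RELATED_SPECIES": {"COFCA", "COFAR"},
}


def dataset_group(species_code):
    """Return the group name for a species code, or raise KeyError."""
    for group, codes in EST_SOURCE_GROUPS.items():
        if species_code in codes:
            return group
    raise KeyError(f"unknown species code: {species_code}")
```

Keeping the grouping in one table means that adding a new source species only requires a single new entry.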




Current BAC uploading and annotation at CAB


  • EST to BAC alignments are released to the SGN repository as requested according to the BAC BATCH files.


  • CAB releases all annotated S. lycopersicum BAC sequences at http://biosrv.cab.unina.it/GBrowse/.
    Currently, the annotation is updated automatically each time a BAC is released into the GenBank database.

    On 22.01.07, 123 BACs are annotated.
    On 03.05.07, 129 BACs are annotated.
    On 15.10.07, 381 BACs are annotated.
    On 21.10.07, 493 BACs are annotated.
    On 22.01.08, 586 BACs are annotated.
    On 29.10.09, 1307 BACs are annotated.

AnEST000 (last edited 2009-11-11 11:18:23 by AlessandraTraini)