

Summary


||Tag||rnaseq_454||
||Owner||Imperial||
||Input||BAC sequences masked for repeats (masked:fasta)||
||Output||Spaln- and GenomeThreader-based analysis of tomato 454 rnaseq data, in GFF3 format||

This analysis:

  • runs the Spaln program against the BAC sequences using all 454 data

  • runs the GenomeThreader program against the BAC sequences using only the 454 reads rejected by Spaln (due to short read length) as well as the ones that could be mapped by Spaln only with lower thresholds

  • merges the Spaln and GenomeThreader output (using GenomeTools gt gff3 -sort in addition to customized programs)

  • filters out reads mapped multiple times (more than 3) as well as non-spliced mapped reads that are "lonely" (no other reads within a 1000 nt range) and reports the output in GFF3
  • computes the consensus spliced alignment (using GenomeTools gt csa) and reports the output in GFF3




Input


  • the BAC sequence FASTA files (masked:fasta version) produced by AnSgnRepmask000




Processing


Spaln (version 1.4.4d) is used to map all 454 reads against the BAC sequences. All BAC sequences are concatenated into a single Spaln-formatted file (genome), while the reads are split into chunks. Spaln is run as

  • spaln -Q7 -LS -ya3 -M5 -S3 -O12 -o output_<chunk_id> -dgenome reads_<chunk_id>
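The chunking step is not described in detail on this page. As a minimal sketch of how reads could be split into chunks before running Spaln, the following Python groups FASTA records into fixed-size batches. The function names and the chunk size are assumptions made for illustration; they are not taken from the pipeline.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists holding at most chunk_size records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical toy input; real chunks would be written to reads_<chunk_id> files.
example = [">r1", "ACGT", ">r2", "GGCC", "TT", ">r3", "AAAA"]
chunks = list(chunk_records(read_fasta(example), chunk_size=2))
```

Each resulting chunk would then be written to its own file and passed to one Spaln invocation.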

The output is merged and filtered at 98% percent identity (p.i.) and 95% coverage (cov.). Thresholds of 90% p.i. and 90% cov. are also used, to identify reads that are mapped only in the range [90-98)% p.i. and [90-95)% cov.

  • sortgrcd -O0 -P98 -C95 -n3 output_*.grd
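The two threshold sets can be pictured as bands. The sketch below is one possible reading, not the pipeline's own code: an alignment is "strict" if it passes 98% p.i. and 95% cov., "relaxed_only" if it passes 90%/90% but not the strict thresholds, and "rejected" otherwise. A read is then treated as mapped only at the lower thresholds when its best alignment is "relaxed_only". The labels, function names and input tuples are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Ordering of the bands, from worst to best.
RANK = {"rejected": 0, "relaxed_only": 1, "strict": 2}


def classify_alignment(pct_identity: float, pct_coverage: float) -> str:
    """Assign an alignment to a threshold band."""
    if pct_identity >= 98.0 and pct_coverage >= 95.0:
        return "strict"
    if pct_identity >= 90.0 and pct_coverage >= 90.0:
        return "relaxed_only"
    return "rejected"


def reads_mapped_only_at_relaxed(
    alignments: Iterable[Tuple[str, float, float]]
) -> List[str]:
    """Return ids of reads whose best alignment passes only the lower thresholds.

    alignments: (read_id, percent identity, percent coverage) tuples.
    """
    best: Dict[str, str] = {}
    for read_id, pid, cov in alignments:
        label = classify_alignment(pid, cov)
        best[read_id] = max(best.get(read_id, "rejected"), label, key=RANK.get)
    return sorted(r for r, label in best.items() if label == "relaxed_only")
```

Under this reading, the returned reads are the ones that would be handed on for a second mapping attempt.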


GenomeThreader (version 1.4.3) is used to map all 454 reads rejected by Spaln (due to short read length) and the ones that were mapped by Spaln only when lower thresholds were used. GenomeThreader (gth) is run with a minimum alignment score of 0.98 and a minimum coverage of 0.95:

  • gth -intermediate -gzip -gff3out -minalignmentscore 0.98 -mincoverage 0.95 -species arabidopsis -cdna reads_<chunk_id> -genomic BAC -o output_<chunk_id>

The output is merged and sorted using GenomeTools


Custom filtering is applied to remove reads

  • that are mapped to more than 3 regions
  • that are mapped in a non-spliced way and in a "lonely" region (no other reads within a 1000 nt range)

The filter outputs one GFF3 file per BAC sequence with the filtered alignments of the 454 reads:

<acc>.<ver>.rnaseq_454.spliced_alignment_filt.itag<pipever>.batch<batchnum>.v<ver>.gff3
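The custom filter itself is not shown on this page. The following Python is a minimal sketch of the two rules as stated, under several assumptions: the multi-mapping rule is applied first, "other reads" means alignments of a different read on the same sequence, and distance is the gap between alignment intervals. The `Alignment` class and all names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alignment:
    read_id: str
    seqid: str    # BAC sequence the read maps to
    start: int
    end: int
    spliced: bool


def gap(a: Alignment, b: Alignment) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def filter_alignments(alns: List[Alignment], max_regions: int = 3,
                      window: int = 1000) -> List[Alignment]:
    # Rule 1: drop every alignment of reads mapped to more than max_regions regions.
    counts = Counter(a.read_id for a in alns)
    kept = [a for a in alns if counts[a.read_id] <= max_regions]

    # Rule 2: drop non-spliced alignments with no other read within `window` nt.
    # Quadratic scan for clarity; a sorted sweep would scale better.
    result = []
    for a in kept:
        if a.spliced or any(
            b.read_id != a.read_id and b.seqid == a.seqid and gap(a, b) <= window
            for b in kept
        ):
            result.append(a)
    return result
```

Spliced alignments are kept even when isolated, since the page restricts the "lonely" rule to reads mapped in a non-spliced way.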

GenomeTools (version 1.3.4) is used to produce consensus spliced alignments. It is run as

  • gt csa -o <acc>.<ver>.rnaseq_454.spliced_alignment_csa.itag<pipever>.batch<batchnum>.v<ver>.gff3 <acc>.<ver>.rnaseq_454.spliced_alignment_filt.itag<pipever>.batch<batchnum>.v<ver>.gff3

and outputs one GFF3 file per BAC sequence with the consensus alignments of the 454 reads

<acc>.<ver>.rnaseq_454.spliced_alignment_csa.itag<pipever>.batch<batchnum>.v<ver>.gff3
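Because the filtered and consensus files share one naming pattern, the per-BAC gt csa call can be assembled from a few fields. This is a sketch only: it assumes that the two <ver> placeholders are independent (sequence version and file version), and the accession and version values in the example are made up.

```python
from typing import List


def output_name(acc: str, seq_ver: str, kind: str, pipe_ver: str,
                batch: str, file_ver: str) -> str:
    """Build a file name following the pattern shown above.

    kind is "filt" for filtered alignments or "csa" for consensus alignments.
    """
    return (f"{acc}.{seq_ver}.rnaseq_454.spliced_alignment_{kind}"
            f".itag{pipe_ver}.batch{batch}.v{file_ver}.gff3")


def csa_command(acc: str, seq_ver: str, pipe_ver: str,
                batch: str, file_ver: str) -> List[str]:
    """Return the gt csa argument list for one BAC sequence."""
    filt = output_name(acc, seq_ver, "filt", pipe_ver, batch, file_ver)
    csa = output_name(acc, seq_ver, "csa", pipe_ver, batch, file_ver)
    return ["gt", "csa", "-o", csa, filt]


# Hypothetical values, for illustration only.
cmd = csa_command("AC123456", "1", "1", "5", "1")
```

The list form can be passed directly to a process launcher without shell quoting.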




Details of 454 rnaseq data


So far, reads from the following sources are available:

  • Data supplied by Dr. Brad Barbazuk to SGN for tomato genome annotation purposes
  • Data supplied by Dr. Zangjun Fei to SGN, ONLY for tomato genome annotation purposes.
  • Data supplied by Dr. Giovanni Giuliano for tomato genome annotation purposes
  • Trichome data supplied to SGN by Dr. M. Larsson. These data are equivalent to the following SRR files in the SRA database: SRR015436, SRR015435, SRR027943, SRR027942, SRR027941, SRR027940, SRR027939


Data are cleaned with the following procedure:

  • Trimming low-quality regions with lucy. Trimmed reads shorter than 50 nt are excluded (option -m 50). Reads are actually trimmed using a lucy script.

  • Running seqclean to screen for adapters, vectors and contaminants (SMART Clontech adapters, UniVec, NC_000913.2 - Escherichia coli genome, NC_002692.1 - Tomato mosaic virus, NC_007898.1 - Solanum lycopersicum chloroplast genome). Cleaned reads shorter than 50 nt are excluded.

    • seqclean all_reads_lucy.fasta -v adapters.fasta -s UniVec,NC_000913.2.fasta,NC_007898.1.fasta,NC_002692.1.fasta -r seqclean.info -o all_reads_lucy_seqclean.fasta -c 10 -l 50 -x 96 2> seqclean.run

  • Running nrdb (bioperl: bp_nrdb.pl) to remove duplicated reads.
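The effect of the length cut-off and the duplicate removal can be illustrated with a few lines of Python. This is a simplified stand-in and not a description of how lucy, seqclean or bp_nrdb.pl work internally: duplicates are taken to be reads with identical sequence (case-insensitive), and the first copy encountered is kept. Both choices are assumptions.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (read id, sequence)


def drop_short(records: List[Record], min_len: int = 50) -> List[Record]:
    """Exclude reads shorter than min_len nucleotides."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


def drop_duplicates(records: List[Record]) -> List[Record]:
    """Keep only the first read seen for each distinct sequence."""
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        unique.append((name, seq))
    return unique


# Hypothetical reads: "b" duplicates "a", and "c" is 49 nt long.
reads = [("a", "A" * 60), ("b", "a" * 60), ("c", "C" * 49), ("d", "G" * 50)]
cleaned = drop_duplicates(drop_short(reads))
```

Only one id survives for each distinct sequence, which matters when results are later split by library.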


Read ids are renamed to <library id>.<read id>. In this way, results for each library can be easily extracted. Here are the library ids:

||Library id||Source||SGN ITAG Filename||RNA extracted from||Link||
||1||Brad Barbazuk||B_Barbazuk_Sl454.fasta|| || ||
||2||Giovanni Giuliano||F_Fuligni_Sl454.fasta|| || ||
||3||SRR027939||Trichomes_LA0716_Spe454.fasta||total trichomes isolated from leaf tissue of Solanum pennellii LA0716 plants||Link||
||4||SRR027942||Trichomes_LA1589_Spi454.fasta||enriched fractions of type VI trichomes isolated from leaf tissue of Solanum pimpinellifolium LA1589 plants||Link||
||5||SRR027943||Trichomes_LA1708_Sar454.fasta||enriched fractions of type VI trichomes isolated from stem tissue of Solanum arcanum LA1708 plants||Link||
||6||SRR027940 + SRR027941||Trichomes_LA1777_Sha454.fasta||total trichomes and enriched fractions of type VI trichomes isolated from leaf tissue of Solanum habrochaites LA1777 plants||Link1 Link2||
||7||SRR015435 + SRR015436||Trichomes_LA3475_Sly454.fasta||total and type VI trichomes isolated from stem and petiole tissues of 3-week-old Solanum lycopersicum cv M82 plants||Link1 Link2||
||8||Zangjun Fei||Z_Fei_Tom454.fasta|| || ||

Comments

Extracting from the GFF3 all reads that start with a specific library id will not give all reads of that library that map to the genome. This is because duplicate reads, which can occur across libraries, are removed.
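The caveat can be made concrete with a toy example. The sketch below assumes the <library id>.<read id> naming described above and a duplicate removal that keeps the first copy of each sequence; the read ids and sequences are invented.

```python
from typing import List, Tuple


def rename(library_id: int, read_id: str) -> str:
    """Prefix a read id with its library id: <library id>.<read id>."""
    return f"{library_id}.{read_id}"


def library_of(renamed_id: str) -> int:
    """Recover the library id from a renamed read id."""
    return int(renamed_id.split(".", 1)[0])


def reads_of_library(renamed_ids: List[str], library_id: int) -> List[str]:
    """Select the ids that carry the given library prefix."""
    return [r for r in renamed_ids if library_of(r) == library_id]


def keep_first_copy(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Assumed duplicate removal: keep the first read for each sequence."""
    seen, out = set(), []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            out.append((name, seq))
    return out


# Read Y7 of library 2 has the same sequence as read X1 of library 1.
pooled = [
    (rename(1, "X1"), "ACGTACGT"),
    (rename(2, "Y7"), "ACGTACGT"),
    (rename(2, "Y8"), "TTTTGGGG"),
]
surviving_ids = [name for name, _ in keep_first_copy(pooled)]
library_2 = reads_of_library(surviving_ids, 2)
```

Here only 2.Y8 is reported for library 2, although the sequence of 2.Y7 is still represented in the data set under the id 1.X1.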

AnRnaseq454000 (last edited 2010-06-17 20:15:48 by IFilippis)