Summary
| Tag | rnaseq_454 |
| Owner | Imperial |
| Input | BAC sequences (masked:fasta) from repeats |
| Output | Spaln and GenomeThreader based analysis of tomato 454 rnaseq data, in GFF3 format |
This analysis:
- runs the Spaln program against the BAC sequences using all 454 data
- runs the GenomeThreader program against the BAC sequences using only the 454 reads rejected by Spaln (due to short read length) as well as the ones that Spaln could map only with lower thresholds
- merges the Spaln and GenomeThreader output (using GenomeTools gt gff3 sort in addition to custom programs)
- filters out reads mapped multiple times (more than 3) as well as non-spliced mapped reads that are "lonely" (no other reads within a 1000 nt range) and reports the output in GFF3
- computes consensus spliced alignments (using GenomeTools gt csa) and reports the output in GFF3
Input
the BAC sequence FASTA files (masked:fasta version) produced by AnSgnRepmask000
Processing
Spaln (version 1.4.4d) is used to map all 454 reads against the BAC sequences. All BAC sequences are concatenated into a single spaln-formatted file (genome), while the reads are split into chunks. Spaln is run as
spaln -Q7 -LS -ya3 -M5 -S3 -O12 -o output_<chunk_id> -dgenome reads_<chunk_id>
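The splitting of the reads into chunks can be sketched as follows. This is only an illustration, not the script used by the pipeline: the function names and the chunk size are hypothetical, and the page does not state how many reads go into each chunk.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunk_reads(records, chunk_size):
    """Yield lists of at most chunk_size records (chunk_size is an assumed parameter)."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each chunk would then be written to its own reads_<chunk_id> file before Spaln is run on it.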
The output is merged and filtered at 98% percent identity (p.i.) and 95% coverage (cov.). Relaxed thresholds of 90% p.i. and 90% cov. are also used to identify the reads that map only in the range [90-98)% p.i. and [90-95)% cov.
sortgrcd -O0 -P98 -C95 -n3 output_*.grd
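The two threshold levels can be illustrated with a small sketch. It assumes that "mapped only with lower thresholds" means a read has at least one alignment passing 90% p.i. and 90% cov. but none passing 98% p.i. and 95% cov.; this reading, and all names below, are assumptions and not taken from the pipeline code.

```python
STRICT = (98.0, 95.0)   # min percent identity, min coverage
RELAXED = (90.0, 90.0)
RANK = {"rejected": 0, "relaxed_only": 1, "strict": 2}


def passes(identity, coverage, thresholds):
    """True if an alignment meets both thresholds."""
    min_identity, min_coverage = thresholds
    return identity >= min_identity and coverage >= min_coverage


def classify(identity, coverage):
    """Label one alignment as strict, relaxed_only or rejected."""
    if passes(identity, coverage, STRICT):
        return "strict"
    if passes(identity, coverage, RELAXED):
        return "relaxed_only"
    return "rejected"


def relaxed_only_reads(alignments):
    """Return ids of reads whose best alignment passes only the relaxed thresholds.

    alignments: iterable of (read_id, identity, coverage) tuples.
    """
    best = {}
    for read_id, identity, coverage in alignments:
        label = classify(identity, coverage)
        if RANK[label] > RANK[best.get(read_id, "rejected")]:
            best[read_id] = label
    return {read_id for read_id, label in best.items() if label == "relaxed_only"}
```

Under this reading, the reads returned by relaxed_only_reads are the ones passed on to GenomeThreader together with the reads that Spaln rejected.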
GenomeThreader (version 1.4.3) is used to map all 454 reads rejected by Spaln (due to short read length) and the ones that were mapped by Spaln only when lower thresholds were used. gth is run with 98% alignment score and 95% coverage thresholds:
gth -intermediate -gzip -gff3out -minalignmentscore 0.98 -mincoverage 0.95 -species arabidopsis -cdna reads_<chunk_id> -genomic BAC -o output_<chunk_id>
The output is merged and sorted using GenomeTools.
Custom filtering is then applied to remove reads
- that are mapped to more than 3 regions
- that are mapped in a non-spliced way and in a "lonely" region (no other reads within a 1000 nt range)
The filter outputs one GFF3 file per BAC sequence with the filtered alignments of the 454 reads:
<acc>.<ver>.rnaseq_454.spliced_alignment_filt.itag<pipever>.batch<batchnum>.v<ver>.gff3
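The two filtering rules above can be sketched as follows. The custom filter itself is not shown on this page, so this is only an approximation under stated assumptions: "within 1000 nt range" is taken as the gap between two alignments on the same BAC, "other reads" means alignments with a different read id, loneliness is checked after multi-mapped reads are removed, and only the lonely non-spliced alignment is dropped. The Alignment class and all names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

MAX_LOCI = 3           # reads mapped to more regions than this are removed
NEIGHBOR_RANGE = 1000  # nt; assumed to be the gap between two alignments


@dataclass(frozen=True)
class Alignment:
    read_id: str
    seqid: str     # BAC sequence id
    start: int     # 1-based inclusive, as in GFF3
    end: int
    spliced: bool


def gap(a, b):
    """Number of bases between two intervals (0 if they touch or overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def is_lonely(aln, others):
    """True if no alignment of another read lies within NEIGHBOR_RANGE on the same BAC."""
    return not any(
        o.read_id != aln.read_id
        and o.seqid == aln.seqid
        and gap(aln, o) <= NEIGHBOR_RANGE
        for o in others
    )


def filter_alignments(alignments):
    """Apply both rules; quadratic in the number of alignments, fine for a sketch."""
    alignments = list(alignments)
    loci = Counter(a.read_id for a in alignments)
    kept = [a for a in alignments if loci[a.read_id] <= MAX_LOCI]
    return [a for a in kept if a.spliced or not is_lonely(a, kept)]
```

For real data the pairwise scan would be replaced by a sort on (seqid, start) so that only nearby alignments are compared.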
GenomeTools (version 1.3.4) is used to produce consensus spliced alignments. It is run as
gt csa -o <acc>.<ver>.rnaseq_454.spliced_alignment_csa.itag<pipever>.batch<batchnum>.v<ver>.gff3 <acc>.<ver>.rnaseq_454.spliced_alignment_filt.itag<pipever>.batch<batchnum>.v<ver>.gff3
and outputs one GFF3 file per BAC sequence with the consensus alignments of the 454 reads
<acc>.<ver>.rnaseq_454.spliced_alignment_csa.itag<pipever>.batch<batchnum>.v<ver>.gff3
Details of 454 rnaseq data
So far, reads are available from the following sources:
- Data supplied by Dr. Brad Barbazuk to SGN for tomato genome annotation purposes.
- Data supplied by Dr. Zangjun Fei to SGN, ONLY for tomato genome annotation purposes.
- Data supplied by Dr. Giovanni Giuliano for tomato genome annotation purposes.
- Trichome data supplied to SGN by Dr. M. Larsson. These data are equivalent to the following SRR files in the SRA database: SRR015436, SRR015435, SRR027943, SRR027942, SRR027941, SRR027940, SRR027939.
Data are cleaned with the following procedure:
- Trimming low-quality regions by running lucy. Trimmed reads shorter than 50 nt are excluded (option -m 50). The actual trimming of the reads is done with a lucy script.
- Running seqclean to screen for adapters, vectors and contaminants (SMART Clontech adapters, UniVec, NC_000913.2 - Escherichia coli genome, NC_002692.1 - Tomato mosaic virus, NC_007898.1 - Solanum lycopersicum chloroplast genome). Cleaned reads shorter than 50 nt are excluded.
seqclean all_reads_lucy.fasta -v adapters.fasta -s UniVec,NC_000913.2.fasta,NC_007898.1.fasta,NC_002692.1.fasta -r seqclean.info -o all_reads_lucy_seqclean.fasta -c 10 -l 50 -x 96 2> seqclean.run
- Running nrdb (bioperl: bp_nrdb.pl) to remove duplicated reads.
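The effect of the duplicate-removal step can be sketched as below. This is not bp_nrdb.pl: it assumes that two reads are duplicates when their sequences are identical (ignoring case) and that the first occurrence is kept, which may differ from what the BioPerl script does. The function name is hypothetical.

```python
def remove_duplicate_reads(records):
    """Keep the first read seen for each distinct sequence.

    records: iterable of (read_id, sequence) tuples.
    """
    seen = set()
    unique = []
    for read_id, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((read_id, sequence))
    return unique
```

When the input pools several libraries, a sequence present in more than one of them survives under a single read id only.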
Read ids are renamed to <library id>.<read id>. In this way, results for each library can be easily extracted. Here are the library ids:
| Library id | Source | SGN ITAG Filename | RNA extracted from | Link |
| 1 | Brad Barbazuk | B_Barbazuk_Sl454.fasta | | |
| 2 | Giovanni Giuliano | F_Fuligni_Sl454.fasta | | |
| 3 | SRR027939 | Trichomes_LA0716_Spe454.fasta | total trichomes isolated from leaf tissue of Solanum pennellii LA0716 plants | |
| 4 | SRR027942 | Trichomes_LA1589_Spi454.fasta | enriched fractions of type VI trichomes isolated from leaf tissue of Solanum pimpinellifolium LA1589 plants | |
| 5 | SRR027943 | Trichomes_LA1708_Sar454.fasta | enriched fractions of type VI trichomes isolated from stem tissue of Solanum arcanum LA1708 plants | |
| 6 | SRR027940 + SRR027941 | Trichomes_LA1777_Sha454.fasta | total trichomes and enriched fractions of type VI trichomes isolated from leaf tissue of Solanum habrochaites LA1777 plants | |
| 7 | SRR015435 + SRR015436 | Trichomes_LA3475_Sly454.fasta | total and type VI trichomes isolated from stem and petiole tissues of 3-week-old Solanum lycopersicum cv M82 plants | |
| 8 | Zangjun Fei | Z_Fei_Tom454.fasta | | |
Comments
Extracting from the GFF3 all reads that start with a specific library id will NOT give all reads of that library that map to the genome. This is because duplicated reads are removed across libraries, so a read found in several libraries is kept under only one library id.
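For reference, extracting the reads of one library by id prefix could look like the sketch below. It assumes the read id is the first token of the Target attribute in column 9 of the GFF3 files; the attribute that actually carries the read id in these files is not stated on this page, so it is left as a parameter. The function name is hypothetical.

```python
def reads_of_library(gff3_lines, library_id, attribute="Target"):
    """Collect read ids of the form <library id>.<read id> for one library.

    attribute is the assumed GFF3 attribute holding the read id.
    """
    prefix = f"{library_id}."
    reads = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        for field in columns[8].split(";"):
            key, _, value = field.partition("=")
            if key == attribute and value:
                read_id = value.split()[0]
                if read_id.startswith(prefix):
                    reads.add(read_id)
    return reads
```

The returned set is a lower bound for the library: reads removed as duplicates of a read from another library will not appear in it.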
