Summary
Tag | eugene
Owner | VIB GENT
Input | seq, extrinsic data (EST, BlastX, TBlastX, repeats), and other ab-initio predictions
Output | EuGene predictions in GFF3 format; protein, cds, and cDNA sequences in FASTA format
This analysis runs EuGene gene prediction on BAC sequences, using extrinsic data and ab-initio predictions from other gene finders. It produces the predictions as GFF3 output, and the protein, cds, and cDNA sequences of each predicted gene in separate FASTA files.
Input
BAC sequences from seq
mapped EST data from transcripts_tomato, transcripts_sol
BlastX and TBlastX output converted into a EuGene-specific format
repeatmasking data from repeats
ab-initio gene predictions from GeneID, AUGUSTUS, GeneMark, & GlimmerHMM
BlastX
Splits the query sequences into segments of 10000 nt and runs BlastX against Arabidopsis proteins and Swissprot_uniprot.
blastall -p blastx -d $blastDB -i tempseq.tfa -g F -v 1000 -b 1000 -e 0.1 -L $x,$y
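The $x,$y pair passed to -L could be generated along the following lines. This is a minimal sketch, not the pipeline's actual script: the function name blastx_windows is hypothetical, and non-overlapping windows are an assumption, since this page does not say whether the 10000 nt segments overlap.

```bash
#!/usr/bin/env bash
# Print 1-based, inclusive "start,end" windows that cover a sequence of
# length $1 in segments of $2 nt (default 10000).
# Assumption: windows do not overlap; the last one may be shorter.
blastx_windows() {
    local len=$1 size=${2:-10000} start end
    for (( start = 1; start <= len; start += size )); do
        end=$(( start + size - 1 ))
        (( end > len )) && end=$len
        echo "${start},${end}"
    done
}

# Example: a 25000 nt sequence yields three windows:
# 1,10000  10001,20000  20001,25000
blastx_windows 25000
```

Each printed pair can then be substituted for $x,$y in the blastall call above.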
TBlastX
Runs decypher-TBlastX against finished dicot genomes (Arabidopsis & Poplar) with a cutoff of 0.0001, max scores and max alignments of 5000
dc_template_rt -template tera-tblastx.txt -query $fasta -targ dicot_genomes
Protein Mapping
Maps Solanaceae proteins from NCBI with GenomeThreader. A very high weight is assigned to the mapped proteins via EuGene's AnnotaStruct.
gth -genomic ${i}.fasta -protein Solanaceae_Proteins_NCBI.tfa -species arabidopsis -mincoverage 0.95 -minalignmentscore 0.9 -force -o ${i}.gth.out
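The ${i} variable suggests the command is run once per BAC. A minimal sketch of such a loop follows, assuming one <name>.fasta file per BAC in a single directory. The function name map_proteins and the GTH_BIN override are hypothetical and exist only to allow a dry run; the gth options are copied unchanged from the command above.

```bash
#!/usr/bin/env bash
# Run the GenomeThreader command from this page on every *.fasta file in a
# directory (default: current directory).
# GTH_BIN is a hypothetical override (default: gth), e.g. GTH_BIN=echo
# prints the arguments instead of running GenomeThreader.
map_proteins() {
    local dir=${1:-.} f i
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue      # no match: the glob stays literal, skip it
        i=${f%.fasta}                # plays the role of ${i} in the command above
        "${GTH_BIN:-gth}" -genomic "${i}.fasta" \
            -protein Solanaceae_Proteins_NCBI.tfa -species arabidopsis \
            -mincoverage 0.95 -minalignmentscore 0.9 -force -o "${i}.gth.out"
    done
}

# Dry run in the current directory:
#   GTH_BIN=echo map_proteins .
```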
EuGene
(eugene -A eugene.par -p g -d -b01 -B -E -r -s $IN ) >& ${IN}.log
Output
Gene predictions, protein, cds, & cDNA sequences of the predictions
Filenames
annotations:gff3 (ID.VER.eugene.annotations.itagXXX.batchXXX.vX.gff3)
proteins:fasta (ID.VER.eugene.proteins.itagXXX.batchXXX.vX.fasta)
cds:fasta (ID.VER.eugene.cds.itagXXX.batchXXX.vX.fasta)
cdna:fasta (ID.VER.eugene.cdna.itagXXX.batchXXX.vX.fasta)
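The naming pattern above can be assembled mechanically. The sketch below is illustrative only: the function name eugene_filename and the example values (BAC0001, 001, 002) are hypothetical and not taken from the pipeline. The extension is chosen from the file names listed above (gff3 for annotations, fasta for the sequence files).

```bash
#!/usr/bin/env bash
# Build an output file name of the form
#   ID.VER.eugene.KIND.itagXXX.batchXXX.vX.EXT
# Arguments: id, version, kind, itag number, batch number, file revision.
eugene_filename() {
    local id=$1 ver=$2 kind=$3 itag=$4 batch=$5 rev=$6 ext
    case $kind in
        annotations)       ext=gff3 ;;
        proteins|cds|cdna) ext=fasta ;;
        *) echo "unknown kind: $kind" >&2; return 1 ;;
    esac
    echo "${id}.${ver}.eugene.${kind}.itag${itag}.batch${batch}.v${rev}.${ext}"
}

# Example with made-up values; prints
# BAC0001.1.eugene.proteins.itag001.batch002.v1.fasta
eugene_filename BAC0001 1 proteins 001 002 1
```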
Gene Descriptions
The gene name and description line follow the Guidelines Doc, except that the functional description is not included.
We have also modified the 'Evidence Code' by expanding the single-letter code used in Medicago into a multi-letter tag and adding some extra information. This 'structural tag' is still under discussion within our group and may change in the future.
For the moment, the tag has the form XXF()H()E()I()L(), e.g. 08F0H1E0IEGL1:
XX | Two digits giving the year the tag was assigned, e.g. '08'
F | Whether expressed sequences covering the region from translation start to translation stop (FL-cDNA or a combination of multiple ESTs) were used to derive the gene model - F0 or F1
H | Whether protein-similarity information was used to derive the gene model - H0 or H1
E | Whether similarity to expressed sequences was used to derive the CDS of the gene model - E0 or E1
I | Two letters identifying the program used to generate the gene model, in this case EG for EuGene - IEG
L | A single digit (0-9) giving the length class of the CDS in nt - 0 (0-150), 1 (151-300), 2 (301-600), 3 (601-1200), 4 (1201-1800), 5 (1801-2400), 6 (2401-4500), 7 (4501-15000), 8 (15001-30000), 9 (30001 and above)
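The tag layout described above can be checked and taken apart with a short script. This is a sketch under stated assumptions, not part of the pipeline: the function names are hypothetical, the program code is assumed to be two uppercase letters, and a CDS of exactly 30000 nt is assigned to class 8.

```bash
#!/usr/bin/env bash
# Map a CDS length in nt to the single-digit L class.
# Assumption: exactly 30000 nt falls in class 8.
cds_length_class() {
    local n=$1
    if   (( n <= 150 ));   then echo 0
    elif (( n <= 300 ));   then echo 1
    elif (( n <= 600 ));   then echo 2
    elif (( n <= 1200 ));  then echo 3
    elif (( n <= 1800 ));  then echo 4
    elif (( n <= 2400 ));  then echo 5
    elif (( n <= 4500 ));  then echo 6
    elif (( n <= 15000 )); then echo 7
    elif (( n <= 30000 )); then echo 8
    else                        echo 9
    fi
}

# Split a structural tag into its fields and print "year F H E program L".
# Returns non-zero if the tag does not match the XXF()H()E()I()L() layout.
parse_structural_tag() {
    local re='^([0-9]{2})F([01])H([01])E([01])I([A-Z]{2})L([0-9])$'
    [[ $1 =~ $re ]] || return 1
    echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}" \
         "${BASH_REMATCH[4]} ${BASH_REMATCH[5]} ${BASH_REMATCH[6]}"
}

parse_structural_tag 08F0H1E0IEGL1   # prints: 08 0 1 0 EG 1
cds_length_class 200                 # prints: 1
```

For the example tag, the output reads as: assigned in 2008, no full-length expressed-sequence support, protein similarity used, no EST similarity used for the CDS, predicted by EuGene, CDS length between 151 and 300 nt.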
