GenDBWiki/CoreDocumentation/RegionPrediction


Documentation for the GenDB Gene Finding Components

Currently integrated programs

|| Program || Task || Reference ||
|| REGANOR || Prediction of CDSs || http://bioinformatics.oupjournals.org/cgi/reprint/20/10/1622 ||
|| Glimmer-2.1 || Prediction of CDSs || http://nar.oupjournals.org/cgi/content/full/27/23/4636 ||
|| Critica || Prediction of CDSs, RBSs, Frameshifts || http://mbe.oupjournals.org/cgi/reprint/16/4/512 ||
|| RBSFinder || Relocation of CDS starts, Prediction of RBSs || http://bioinformatics.oupjournals.org/cgi/reprint/17/12/1123 ||
|| SearchforRNAs || Prediction of t- and rRNAs || Niels Larsen, University Aarhus, Denmark ||
|| tRNAscan-SE || Prediction of tRNAs || http://nar.oupjournals.org/cgi/content/full/25/5/955 ||
|| Getorf || Lists all ORFs || Contained in EMBOSS package, http://www.hgmp.mrc.ac.uk/Software/EMBOSS/ ||

Documentation for individual programs

In the following sections you can find detailed descriptions for all programs that can be used to predict regions within GenDB.

REGANOR

Reganor is a pipeline that automates the complete gene finding procedure for a sequence within GenDB. For CDS prediction, a combined strategy based on the gene finders Glimmer and Critica is applied: Glimmer is run using the Critica predictions on the sequence as a training set of CDSs. In an extensive evaluation on 113 microbial genome sequences, this strategy was shown to perform significantly better overall than the standard Glimmer application, especially for GC-rich genomes. For the prediction of tRNA genes, tRNAscan-SE is run. For the prediction of RNA genes, SearchforRNAs is run, which itself uses tRNAscan-SE. Based on the observations from the different tools, CDS and RNA genes are automatically annotated. Besides the CDSs predicted by Critica, additional CDSs predicted by Glimmer(ct) are annotated if they do not overlap these CDSs or the RNA genes by more than 50 bp. This way, sensitivity is increased compared to Critica without a significant loss of specificity.
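The overlap rule for additional Glimmer(ct) predictions can be illustrated with a minimal Python sketch. It is not the REGANOR implementation: the function names, the representation of a region as a (start, stop) tuple with start <= stop, and the decision to ignore strand are assumptions made for illustration only.

```python
def overlap_length(a, b):
    """Number of positions shared by two closed intervals (start, stop)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def additional_predictions(glimmer, critica, rna, max_overlap=50):
    """Keep Glimmer predictions that overlap no Critica CDS and no RNA gene
    by more than max_overlap bases (assumed reading of the 50 bp rule)."""
    annotated = list(critica) + list(rna)
    return [
        g for g in glimmer
        if all(overlap_length(g, r) <= max_overlap for r in annotated)
    ]


# Hypothetical coordinates, for illustration only.
critica_cds = [(100, 500)]
rna_genes = [(900, 1000)]
glimmer_cds = [(460, 800), (400, 800), (950, 1200), (1500, 1800)]

kept = additional_predictions(glimmer_cds, critica_cds, rna_genes)
print(kept)
```

In this example the prediction (460, 800) overlaps the Critica CDS by 41 bases and is kept, whereas (400, 800) and (950, 1200) exceed the limit and are dropped.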

The reliability of the different predictions is reflected by the 'status region' of these regions. CDSs predicted by Critica are assigned 'status 2'; additionally annotated Glimmer(ct) predictions with high 'Vote Scores' are assigned 'status 1'; low-scoring additional Glimmer(ct)-based annotations are assigned 'attention needed', because besides some true positive predictions this last group is also likely to contain some false positive predictions.

Usage: reganor.pl -p <project name> (-r <region> | -a) [-t -f -n]
where

       -p project name - the GenDB project to be run on
       -r region       - contig to be analyzed - or -
       -a              - run on all contigs
       -n              - do not run autoannotation
       -f              - restart failed, submitted or finished jobs
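As an illustration of how the options combine, the following Python sketch assembles a reganor.pl argument list from the usage line above. The helper name reganor_command, the project name "myproject" and the contig name "contig1" are hypothetical; only the flags -p, -r, -a, -n and -f are taken from the usage text, and the sketch does not run the script.

```python
def reganor_command(project, region=None, all_contigs=False,
                    no_autoannotation=False, restart=False):
    """Build the argument list for reganor.pl (hypothetical helper).

    Exactly one of region (-r) or all_contigs (-a) must be given,
    mirroring the '(-r <region> | -a)' part of the usage line.
    """
    if (region is not None) == all_contigs:
        raise ValueError("give either a region (-r) or all_contigs (-a)")
    cmd = ["reganor.pl", "-p", project]
    cmd += ["-a"] if all_contigs else ["-r", region]
    if no_autoannotation:
        cmd.append("-n")
    if restart:
        cmd.append("-f")
    return cmd


# Hypothetical project and contig names.
single = reganor_command("myproject", region="contig1")
everything = reganor_command("myproject", all_contigs=True,
                             no_autoannotation=True)
print(" ".join(single))
print(" ".join(everything))
```

The resulting list could be passed to subprocess.run on a machine where GenDB is installed.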

REGANOR is also available as a web-server at: https://www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/index.html

GLIMMER-2.1

The complete Glimmer package is integrated into GenDB in a comfortable manner. A Glimmer tool optimally configured according to the evaluation given in McHardy et al. (Bioinformatics, 2004) can be created with the default_tool_creation script or simply by using reganor.pl, where this tool will be created and run along with the other gene finding programs of the default GenDB gene finding pipeline. Configuration of individually designed Glimmer tools can be done via the ToolConfigurationWizard in the graphical user interface. All options currently configurable are explicitly specified in this interface. Note that there is an additional wrapper around the Glimmer functionality, which does not recognize the options which can be given directly to the glimmer2 program, so do not specify such options as 'other command line options'. What follows are extracts from the Glimmer documentation and comments on how the programs are integrated within the GenDB system:

Glimmer 1.0 had 4 read me files, and Glimmer 2.0 maintains that structure. The four main programs are:

  1. long-orfs
  2. extract
  3. build-icm
  4. glimmer2

1. Program long-orfs takes a sequence file (in FASTA format) and outputs a list of all long "potential genes" in it that do not overlap by too much. By "potential gene" I mean the portion of an orf from the first start codon to the stop codon at the end.

2. Program extract takes a FASTA format sequence file and a file with a list of start/stop positions in that file (e.g., as produced by the long-orfs program) and extracts and outputs the specified sequences.

3. Program build-icm.c creates and outputs an interpolated Markov model (IMM) as described in the paper

 A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg.
 Improved Microbial Gene Identification with Glimmer.  
 Nucleic Acids Research, 1999, in press.

4. Program glimmer2 takes two inputs: a sequence file (in FASTA format) and a collection of Markov models for genes as produced by the program build-icm. It outputs a list of all open reading frames (orfs) together with scores for each as a gene.
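The notion of a "potential gene" used by long-orfs (item 1 above) can be sketched in Python as follows. This is a simplified illustration, not Glimmer's algorithm: it scans only the three forward frames, accepts only ATG as a start codon, and omits the overlap filtering; the function name and the min_len parameter are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # assumption: alternative start codons are ignored here


def potential_genes(seq, min_len=0):
    """Return (start, stop) pairs, 1-based and inclusive, running from the
    first start codon of each orf to the stop codon that ends it.
    Forward strand only."""
    seq = seq.upper()
    genes = []
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STARTS and first_start is None:
                first_start = i  # later starts in the same orf are ignored
            elif codon in STOPS:
                if first_start is not None and i + 3 - first_start >= min_len:
                    genes.append((first_start + 1, i + 3))
                first_start = None  # the next orf begins after this stop
    return sorted(genes)


# Toy sequence with two in-frame ATG codons before a single stop.
print(potential_genes("ATGAAAATGCCCTAA"))
```

For the toy sequence, the potential gene starts at the first ATG (position 1) rather than the second one (position 7).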

Comment: Programs 1-3 are used to create a model of CDS properties based on putative CDSs derived from an input set of sequences. In GenDB, these are called the 'training sequences' and can be specified during configuration (e.g. if multiple contigs belong to one genome). The 'training tool' defines the method by which the putative CDSs are extracted from these sequences. If no 'training tool' is chosen, the Glimmer default 'long-orfs' is applied. The REGANOR wizard uses the gene finder Critica as the default 'training tool'. During the Glimmer tool run, glimmer2 is then used with the ICM model created in the training phase on the sequence specified to be analyzed.

Output: The extended output resulting from a Glimmer run contains three kinds of scores for every prediction: a probability, a 'Vote Score' (stored as the 'result') and a 'Raw Score' (stored in GenDB along with other comments from the Glimmer output as 'comment').

Configurable options:

|| Training tool || Any CDS prediction method for which observations on the specified training sequence exist. ||
|| Training sequence || GenDB regions or an external sequence file in FASTA format. For external sequences, long-orfs will be used as the training tool. ||
|| Look for RBS in start prediction || Yes/No, along with the pattern to search for. ||
|| Minimum gene length || Minimum length of genes to be predicted. ||
|| Linear region || Linear vs. circular; for circular regions, genes are predicted across the end of the sequence. ||
|| Threshold || Probability score cut-off for prediction. Default is 90 (out of 99). ||

CRITICA

Critica is part of the GenDB default gene finding pipeline usable with the Reganor Wizard. Options that differ from the default are configurable via the ToolCreationWizard. Critica as integrated into GenDB does not allow separation of the training and prediction steps. If you want to run it for a set of smaller contigs from one genome, the best way to do this is to 'chain' the contigs together with linkers (containing stops in every frame) during the import into the GenDB system.
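The idea of chaining contigs with a linker that contains stops in every frame can be sketched in Python as follows. The linker sequence TTAATTAATTAA and all function names are illustrative assumptions, not what the GenDB importer uses; the linker was chosen because it is its own reverse complement and carries a TAA in each of the three reading frames.

```python
STOPS = {"TAA", "TAG", "TGA"}
LINKER = "TTAATTAATTAA"  # assumed example linker, not taken from GenDB


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def frames_with_stop(seq):
    """Set of forward frames (0, 1, 2) in which seq contains a stop codon."""
    return {i % 3 for i in range(len(seq) - 2) if seq[i:i + 3] in STOPS}


def has_stop_in_all_six_frames(linker):
    """True if both strands of the linker have a stop in every frame."""
    all_frames = {0, 1, 2}
    return (frames_with_stop(linker) == all_frames
            and frames_with_stop(reverse_complement(linker)) == all_frames)


def chain_contigs(contigs, linker=LINKER):
    """Join contig sequences with the linker between neighbours."""
    if not has_stop_in_all_six_frames(linker):
        raise ValueError("linker does not stop all six reading frames")
    return linker.join(contigs)


chained = chain_contigs(["ACGTACGT", "GGCCGGCC"])
print(chained)
```

Because the linker covers all three frames on both strands, no open reading frame can run from one contig into the next, whatever the offset at which the linker ends up in the chained sequence.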

Output: Observations resulting from a Critica run include predicted CDSs, predicted ribosome binding sites and predictions of possible frameshifts. The value stored under 'results' for the CDS predictions in GenDB is the P-value given in the Critica output: the P-value is the amount of statistical support for the coding region. As with BLAST, lower values are better.

Configurable options:

|| Look for RBS in start detection || Use ribosome binding pattern to choose start. ||

Note: If Critica does not detect any CDSs, it dies with an error message, although this is not really an error. As a result, the corresponding GenDB job is assigned the state 'failed'. This is most likely to happen if you try to analyze very short sequences (<= 10000 bp) with Critica, which is not what the program is intended for.

RBSFinder

From the documentation: The program "rbs_finder.pl" implements an algorithm to find ribosome binding sites (RBS) in the upstream regions of the genes annotated by Glimmer2, GeneMark, or other prokaryotic gene finders. If there is no RBS-like pattern in this region, the program searches for a start codon having an RBS-like pattern in the same reading frame, upstream or downstream, and relocates the start codon accordingly.

Explanations for some directly configurable options in the ToolConfigurationWizard:

|| Window Size || This parameter determines how far the program should look for an RBS-like pattern in the upstream region of each gene. The best results were obtained using a window size of 50 bp. ||
|| Iterations || RBSFinder achieves better results if run iteratively. ||

RBSFinder options configurable as 'other command-line options':

The default consensus sequence is "aggag". However, a computed sequence can be used to get better results. The method to compute the consensus sequence is as follows:
- Take the complement of the last 30 bp of the 16S rRNA.
- Find the 5 bp subsequence of this complement that occurs most often in the 30 bp upstream regions of the start codons annotated by Glimmer2.
- Use this sequence as the consensus sequence.
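The three steps above can be sketched in Python as follows. This is an illustration only and not part of RBSFinder: the function names are hypothetical, "complement" is interpreted here as the reverse complement (an assumption, since the anti-Shine-Dalgarno region at the 3' end of the 16S rRNA pairs antiparallel with the mRNA), and the sequences in the example are toy data.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def rbs_consensus(rrna_16s, upstream_regions, k=5, tail=30):
    """Return the k-mer from the complemented 16S rRNA tail that occurs most
    often in the given upstream regions, or None if none occurs."""
    # Step 1: complement of the last `tail` bases (assumed: reverse complement).
    tail_seq = rrna_16s[-tail:].upper().replace("U", "T")
    anti = reverse_complement(tail_seq)
    candidates = {anti[i:i + k] for i in range(len(anti) - k + 1)}

    # Step 2: count candidate k-mers in the upstream regions.
    counts = Counter()
    for region in upstream_regions:
        region = region.upper()
        for i in range(len(region) - k + 1):
            kmer = region[i:i + k]
            if kmer in candidates:
                counts[kmer] += 1

    # Step 3: the most abundant candidate is used as the consensus.
    return counts.most_common(1)[0][0] if counts else None


# Toy data: a short 16S rRNA 3' end and three made-up upstream regions.
rrna_tail = "GGATCACCTCCTTA"
upstream = ["TTTAGGAGTTTT", "CCAGGAGCC", "GGAGGAAA"]
consensus = rbs_consensus(rrna_tail, upstream)
print(consensus)
```

In real use, the upstream regions would be the 30 bp preceding each start codon annotated by Glimmer2.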
The coordinates that the user wants to relocate or check for an RBS, which can be a subset of the coordinates annotated by Glimmer2. This file should be in the following format:
            || <Gene id> || <Start Codon Coord>  || <Stop Codon Coord> ||
            || 1         || 1030    ||       1140 ||
            || 2         || 1214    ||       3010 ||

SearchforRNAs

From the documentation: Searches a fasta formatted contig file for RNA's and prints tbl style output. It works by using the RNA's from a set of closest (or explicitly named) organisms as search probes and then tries to guess the ends if the matches are not perfect. search_for_rnas includes a run of tRNAscan-SE.

Configurable options:

Type of RNA D = all; a string like "tRNA, 16S, 23S, 5S"
Organism ID D = none; Give your organism an ID like "EC", "BS" ..
Domain D = all; phylogenetic domain of organism, A,B,E
Organism Genus D = all; genus name of organism
Organism Species D = all; species name of organism

Other command line options:

--probes D = none; an organism name, or part of
--complete D = 95; min. pct completeness of probe sequence(s)

tRNAscan-SE

From the documentation:

tRNAscan-SE combines the specificity of the Cove probabilistic RNA prediction package (1) with the speed and sensitivity of tRNAscan 1.3 (2) plus an implementation of an algorithm described by Pavesi and colleagues (3), which searches for eukaryotic pol III tRNA promoters (our implementation referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds: (1) a false positive rate equally low to using Cove analysis, (2) the combined sensitivities of tRNAscan and EufindtRNA (detection of 98-99% of true tRNAs), and (3) search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 300-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).

Note: The current version of tRNAscan-SE (v 1.21) fails on sequences containing non-ATGC characters. Within the reganor pipeline, tRNAscan-SE is only included if the sequences do not contain such characters.
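A check of this kind can be sketched in Python as follows. It only illustrates the idea and is not the code used in the reganor pipeline; the function names and the contig names in the example are hypothetical.

```python
def has_only_acgt(seq):
    """True if the sequence consists solely of A, C, G and T (any case)."""
    return set(seq.upper()) <= set("ACGT")


def split_for_trnascan(contigs):
    """Split a {name: sequence} dict into contigs that are safe to pass to
    tRNAscan-SE and contigs that should be skipped."""
    safe = {n: s for n, s in contigs.items() if has_only_acgt(s)}
    skipped = {n: s for n, s in contigs.items() if not has_only_acgt(s)}
    return safe, skipped


# Hypothetical contigs: the second one contains ambiguity codes.
contigs = {"contig1": "ACGTacgtGGCC", "contig2": "ACGTNNNNRYAC"}
safe, skipped = split_for_trnascan(contigs)
print(sorted(safe), sorted(skipped))
```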

Configurable as additional options:

-B or -P search for bacterial tRNAs (use bacterial tRNA model)
-A search for archaeal tRNAs (use archaeal tRNA model)
-O search for organellar (mitochondrial/chloroplast) tRNAs
-G use general tRNA model (cytoplasmic tRNAs from all 3 domains included)
-C search using covariance model analysis only (max sensitivity, slow)
-H show both primary and secondary structure components to covariance model bit scores
-D disable pseudogene checking

Specify Alternate Cutoffs / Data Files:

-X <score> set cutoff score (in bits) for reporting tRNAs (default=20)
-L <length> set max length of tRNA intron+variable region (default=116bp)
-I <score> manually set "intermediate" cutoff score for EufindtRNA
-z <number> use <number> nucleotides padding when passing first-pass tRNA bounds predictions to CM analysis (default=7)
-g <file> use alternate genetic codes specified in <file> for determining tRNA type
-c <file> use an alternate covariance model in <file>

Misc Options:

-h print this help message
-Q do not prompt user before overwriting pre-existing result files (for batch processing)
-n <EXPR> search only sequences with names matching <EXPR> string (<EXPR> may contain * or ? wildcard chars)
-s <EXPR> start search at sequence with name matching <EXPR> string and continue to end of input sequence file(s)

Special Options (for testing & special purposes)

-T search using tRNAscan only (defaults to strict params)
-t <mode> explicitly set tRNAscan params, where <mode>=R or S (R=relaxed, S=strict tRNAscan v1.3 params)
-E search using Eukaryotic tRNA finder (EufindtRNA) only (defaults to Normal search parameters when run alone, or to Relaxed search params when run with Cove)
-e <mode> explicitly set EufindtRNA params, where <mode>=R, N, or S (relaxed, normal, or strict)
-r <file> save first-pass scan results from EufindtRNA and/or tRNAscan in <file> in tabular results format
-u <file> search with Cove only those sequences & regions delimited in <file> (tabular results file format)
-F <file> save first-pass candidate tRNAs in <file> that were then found to be false positives by Cove analysis
-M <file> save all seqs that do NOT have at least one tRNA prediction in them (aka "missed" seqs)
-v <file> save verbose tRNAscan 1.3 output to <file>
-V <vers> run an alternate version of tRNAscan where <vers> = 1.3, 1.39, 1.4 (default), or 2.0
-K Keep redundant tRNAscan 1.3 hits (don't filter out multiple predictions per tRNA identification)

Getorf

Besides the options that can be chosen directly, the following additional options for this program can be specified as 'additional command line options' during tool configuration with the ToolConfigurationWizard:

-maxsize integer Maximum nucleotide size of ORF to report
-find menu This is a small menu of possible output options. The first four options are to select either the protein translation or the original nucleic acid sequence of the open reading frame. There are two possible definitions of an open reading frame: it can either be a region that is free of STOP codons or a region that begins with a START codon and ends with a STOP codon. The last three options are probably only of interest to people who wish to investigate the statistical properties of the regions around potential START or STOP codons. The last option assumes that ORF lengths are calculated between two STOP codons.
-[no]methionine boolean START codons at the beginning of protein products will usually code for Methionine, despite what the codon will code for when it is internal to a protein. This qualifier sets all such START codons to code for Methionine by default.
-[no]reverse boolean Set this to be false if you do not wish to find ORFs in the reverse complement of the sequence.
-flanking integer If you have chosen one of the options of the type of sequence to find that gives the flanking sequence around a STOP or START codon, this allows you to set the number of nucleotides either side of that codon to output. If the region of flanking nucleotides crosses the start or end of the sequence, no output is given for this codon.
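The boundary behaviour described for -flanking can be illustrated with a short Python sketch. It is not EMBOSS code: the function name, the 0-based codon position and the exact coordinate handling are assumptions made for illustration.

```python
def flanking_region(seq, codon_start, flanking):
    """Return the codon at codon_start (0-based) plus `flanking` nucleotides
    on either side, or None if that window crosses the start or end of seq."""
    start = codon_start - flanking
    end = codon_start + 3 + flanking
    if start < 0 or end > len(seq):
        return None  # no output is given for this codon
    return seq[start:end]


seq = "AAAATGCCC"  # toy sequence with a START codon at index 3
inside = flanking_region(seq, 3, 3)
outside = flanking_region(seq, 3, 4)
print(inside, outside)
```

With three flanking nucleotides the window fits within the sequence; with four it would extend beyond both ends, so nothing is reported for that codon.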