Documentation for the GenDB Gene Finding Components

Currently integrated programs

Program	Task	Reference
REGANOR	Prediction of CDSs	(http://bioinformatics.oupjournals.org/cgi/reprint/20/10/1622)
Glimmer-2.1	Prediction of CDSs	(http://nar.oupjournals.org/cgi/content/full/27/23/4636)
Critica	Prediction of CDSs, RBSs, Frameshifts	(http://mbe.oupjournals.org/cgi/reprint/16/4/512)
RBSFinder	Relocation of CDS starts, Prediction of RBSs	(http://bioinformatics.oupjournals.org/cgi/reprint/17/12/1123)
SearchforRNAs	Prediction of t- and rRNAs	(Niels Larsen, University Aarhus, Denmark)
tRNAscan-SE	Prediction of tRNAs	(http://nar.oupjournals.org/cgi/content/full/25/5/955)
Getorf	Lists all ORFs	(Contained in EMBOSS package, http://www.hgmp.mrc.ac.uk/Software/EMBOSS/)

Documentation for individual programs

In the following sections you can find detailed descriptions for all programs that can be used to predict regions within GenDB.

REGANOR

Reganor is a pipeline which automates the complete gene finding procedure for a sequence within GenDB. For CDS prediction, a combined strategy based on the gene finders Glimmer and Critica is applied: Glimmer is run using the Critica predictions on the sequence as a training set of CDSs. In an extensive evaluation on 113 microbial genome sequences, this strategy was shown to have a significantly improved overall performance compared to the Glimmer standard application, especially for GC-rich genomes. For the prediction of tRNA genes, tRNAscan-SE is run. For the prediction of RNA genes, SearchforRNAs is run, which uses tRNAscan-SE. Based on the observations from the different tools, CDS and RNA genes are automatically annotated. Besides the Critica-predicted CDSs, additional Glimmer(ct)-predicted CDSs which do not overlap these by more than 50 bp are annotated. This way, sensitivity compared to Critica is increased without a significant loss in specificity.

The reliability of the different predictions is reflected by the 'status region' of these regions. CDSs predicted by Critica are assigned 'status 2'; additionally annotated Glimmer(ct) predictions with high 'Vote Scores' are assigned 'status 1'; low-scoring additional Glimmer(ct)-based annotations are assigned 'attention needed', since besides some true positive predictions this last group is likely to also contain some false positive predictions.
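The annotation rules described above can be sketched in a few lines of Python. This is only an illustration and not the code used by reganor.pl: the Cds class, the combine function and in particular the vote_cutoff value are hypothetical, since the text does not state which 'Vote Score' counts as high.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cds:
    start: int               # 1-based, inclusive, start < stop on either strand
    stop: int
    source: str              # "critica" or "glimmer"
    vote_score: float = 0.0  # only meaningful for Glimmer predictions


def overlap(a, b):
    """Number of bases shared by two CDSs (0 if they do not overlap)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start) + 1)


def combine(critica, glimmer, max_overlap=50, vote_cutoff=10.0):
    """Return (cds, status) pairs following the rules described in the text.

    vote_cutoff is a made-up value for illustration only.
    """
    # Every Critica prediction is annotated with the highest reliability.
    annotated = [(c, "status 2") for c in critica]
    for g in glimmer:
        # Skip Glimmer predictions overlapping a Critica CDS by more than 50 bp.
        if any(overlap(g, c) > max_overlap for c in critica):
            continue
        status = "status 1" if g.vote_score >= vote_cutoff else "attention needed"
        annotated.append((g, status))
    return annotated
```

With one Critica CDS at 100..400, a Glimmer CDS at 380..700 overlaps it by only 21 bp and is therefore kept, while a Glimmer CDS at 120..390 is dropped.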

Usage: reganor.pl -p <project name> (-r <region> | -a) [-t -f -n]
where

       -p project name - the GenDB project to be run on
       -r region       - contig to be analyzed   - or -
       -a                run on all contigs
       -n                do not run autoannotation
       -f                restart failed, submitted or finished jobs
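When the pipeline is started from a script, the command line can be assembled programmatically. The helper below is a hypothetical sketch (reganor_command is not part of GenDB); it only uses the options explained above and leaves out -t, which is not described here. The project and contig names in the usage note are placeholders.

```python
def reganor_command(project, region=None, all_contigs=False,
                    no_autoannotation=False, restart=False):
    """Build the argument list for a reganor.pl call.

    Exactly one of region (-r) or all_contigs (-a) must be given.
    """
    if (region is not None) == bool(all_contigs):
        raise ValueError("give either a region (-r) or all_contigs (-a)")
    argv = ["reganor.pl", "-p", project]
    if all_contigs:
        argv.append("-a")
    else:
        argv += ["-r", region]
    if no_autoannotation:
        argv.append("-n")   # do not run autoannotation
    if restart:
        argv.append("-f")   # restart failed, submitted or finished jobs
    return argv
```

For example, reganor_command("myproject", region="contig1", no_autoannotation=True) yields the arguments for "reganor.pl -p myproject -r contig1 -n", which could then be handed to subprocess.run.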

GLIMMER-2.1

The complete Glimmer package is integrated into GenDB in a comfortable manner. A Glimmer tool optimally configured according to the evaluation given in McHardy et al. (Bioinformatics, 2004) can be created with the default_tool_creation script or simply by using reganor.pl, where this tool will be created and run along with the other gene finding programs of the default GenDB gene finding pipeline. Configuration of individually designed Glimmer tools can be done via the ToolConfigurationWizard in the graphical user interface. All options currently configurable are explicitly specified in this interface. Note that there is an additional wrapper around the Glimmer functionality, which does not recognize the options which can be given directly to the glimmer2 program, so do not specify such options as 'other command line options'. What follows are extracts from the Glimmer documentation and comments on how the programs are integrated within the GenDB system:

Glimmer 1.0 had 4 read me files, and Glimmer 2.0 maintains that structure. The four main programs are:

long-orfs
extract
build-icm
glimmer2

1. Program long-orfs takes a sequence file (in FASTA format) and outputs a list of all long "potential genes" in it that do not overlap by too much. By "potential gene" I mean the portion of an orf from the first start codon to the stop codon at the end.

2. Program extract takes a FASTA format sequence file and a file with a list of start/stop positions in that file (e.g., as produced by the long-orfs program) and extracts and outputs the specified sequences.

3. Program build-icm.c creates and outputs an interpolated Markov model (IMM) as described in the paper

 A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg.
 Improved Microbial Gene Identification with Glimmer.  
 Nucleic Acids Research, 1999, in press.

4. Program glimmer2 takes two inputs: a sequence file (in FASTA format) and a collection of Markov models for genes as produced by the program build-icm. It outputs a list of all open reading frames (orfs) together with scores for each as a gene.
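To make the notion of a "potential gene" from item 1 concrete, here is a minimal Python sketch that lists, for the forward strand only, every stretch running from the first start codon of an orf to the stop codon at its end. It is not the long-orfs implementation: the start codon set (ATG, GTG, TTG) is an assumption, and the filtering of overlapping candidates is left out.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codon set


def potential_genes(seq, min_len=0):
    """Return sorted (start, end) pairs, 0-based and half-open, forward strand.

    Each pair spans the first start codon of an orf up to and including
    the stop codon that terminates it.
    """
    seq = seq.upper()
    genes = []
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None and i + 3 - first_start >= min_len:
                    genes.append((first_start, i + 3))
                first_start = None          # the next orf begins after this stop
            elif codon in STARTS and first_start is None:
                first_start = i             # remember only the first start codon
    return sorted(genes)
```

Running the same function on the reverse complement of the sequence would cover the other strand.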

Comment: Programs 1-3 are used to create a model of CDS properties based on putative CDSs derived from an input set of sequences. In GenDB, these are called the 'training sequences' and can be specified during configuration (e.g. if multiple contigs belong to one genome). The 'training tool' defines the method by which the putative CDSs are extracted from these sequences. If no 'training tool' is chosen, the Glimmer default 'long-orfs' is applied. The REGANOR wizard uses the gene finder Critica as the default 'training tool'. During the Glimmer tool run, glimmer2 is then applied to the sequence to be analyzed, using the ICM model created in the training phase.

Output: The extended output resulting from a Glimmer run contains three kinds of scores for every prediction: a probability, a 'Vote Score' (stored as the 'result') and a 'Raw Score' (stored in GenDB as 'comment', along with other comments from the Glimmer output).

Configurable options:

Training tool: Any CDS prediction method for which observations on the specified training sequence exist.

Training sequence: GenDB regions or an external sequence file in FASTA format. For external sequences, long-orfs will be used as training tool.

Look for RBS in start prediction: Yes/No, along with the pattern to search for.

Minimum gene length: Minimum length of genes to be predicted.

Linear region: Treat the sequence as linear rather than circular. For circular sequences, genes are also predicted across the end of the sequence.

Threshold: Probability score cut-off for prediction. Default is 90 (out of 99).

CRITICA

Critica is part of the GenDB default gene finding pipeline usable with the Reganor Wizard. Options differing from the defaults can be configured via the ToolCreationWizard. Critica as integrated into GenDB does not allow separation of the training and prediction step. If you want to run it for a set of smaller contigs from one genome, the best way to do this is by 'chaining' the contigs together with linkers (containing stops in every frame) during the import into the GenDB system.
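The 'chaining' of contigs can be illustrated with a short Python sketch. It is not the GenDB import code: the function names are hypothetical and the linker "TTAATTAATTAA" is merely one example of a sequence that carries a stop codon in all three frames of both strands. The sketch checks this property before joining the contigs.

```python
STOPS = {"TAA", "TAG", "TGA"}
LINKER = "TTAATTAATTAA"  # example linker, reads the same on both strands


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def frames_with_stop(seq):
    """Return the set of forward frames (0, 1, 2) of seq containing a stop codon."""
    return {i % 3 for i in range(len(seq) - 2) if seq[i:i + 3] in STOPS}


def chain_contigs(contigs, linker=LINKER):
    """Join contigs with the linker; return (sequence, 0-based contig offsets)."""
    for strand in (linker, reverse_complement(linker)):
        if frames_with_stop(strand) != {0, 1, 2}:
            raise ValueError("linker lacks a stop codon in some frame")
    offsets, pos = [], 0
    for contig in contigs:
        offsets.append(pos)
        pos += len(contig) + len(linker)
    return linker.join(contigs), offsets
```

The returned offsets allow predictions on the chained sequence to be mapped back to the individual contigs.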

Output: Observations resulting from a Critica run include predicted CDSs, predicted ribosome binding sites and predictions of possible frameshifts. The value stored under 'results' for the CDS predictions in GenDB is the P-value given in the Critica output, which expresses the amount of statistical support for the coding region. As with BLAST, lower values are better.

Configurable options:

Look for RBS in start detection: Use ribosome binding pattern to choose start.

Note: If Critica does not detect any CDSs, it dies with an error message, although this is not really an error. As a result, the corresponding GenDB job is assigned the state 'failed'. This is most likely to happen if you are trying to analyze very short sequences (<= 10000 bp) with Critica, which is not really what the program is intended for.

RBSFinder

From the documentation: The program "rbs_finder.pl" implements an algorithm to find ribosome binding sites (RBS) in the upstream regions of the genes annotated by Glimmer2, GeneMark, or other prokaryotic gene finders. If there are no RBS-like patterns in this region, the program searches for a start codon having an RBS-like pattern, in the same reading frame upstream or downstream, and relocates the start codon accordingly.

Explanations for some directly configurable options in the ToolConfigurationWizard:

Window Size: This parameter determines how far the program should look for an RBS-like pattern in the upstream region of each of the genes. The best results were obtained using a window size of 50 bp.

Iterations: RBSFinder achieves better results if run iteratively.

RBSFinder options configurable as 'other command-line options':

    <Consensus_seq>: The default consensus sequence is ("aggag"). However, a
         computed sequence can be used to get better results. The method to
         compute the consensus sequence is as follows:
           - Take the complement of the last 30bps of the 16S rRNA.
           - Find the most abundantly found 5bps subsequence of this
             complement in the 30bps upstream regions of the start codons
             annotated by Glimmer2.
           - Use this sequence as consensus sequence.
    <Partial_Coord_File>: The coordinates that the user wants to relocate or
         check for RBS site, which can be a subset of the coordinates
         annotated by Glimmer2. This file should be in the following format:

             <Gene id> <Start Codon Coord> <Stop Codon Coord>
             1          1030           1140
             2          1214           3010
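The procedure given above for computing a consensus sequence can be sketched as follows. This is a minimal illustration and not part of RBSFinder: the function names are hypothetical, "complement" is interpreted here as the reverse complement (so that the result reads in the same direction as the upstream regions), and abundance is counted as the total number of occurrences over all upstream regions.

```python
def reverse_complement(seq):
    # U is treated like T so that RNA sequences can be passed in as well.
    return seq.upper().translate(str.maketrans("ACGTU", "TGCAA"))[::-1]


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def consensus_rbs(rrna_16s, upstream_regions, tail_len=30, k=5):
    """Pick the k-mer from the 16S rRNA tail that is most abundant upstream.

    upstream_regions: the sequences upstream of the annotated start codons
    (30 bp each in the procedure described in the text).
    """
    anti = reverse_complement(rrna_16s[-tail_len:])
    # Candidate k-mers in order of appearance, duplicates removed.
    candidates = list(dict.fromkeys(anti[i:i + k]
                                    for i in range(len(anti) - k + 1)))
    regions = [r.upper() for r in upstream_regions]
    # max() returns the first candidate in case of a tie.
    return max(candidates,
               key=lambda kmer: sum(count_overlapping(r, kmer) for r in regions))
```

The returned sequence is in upper case; whether rbs_finder.pl expects lower case, as in the default "aggag", is not covered by this sketch.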

SearchforRNAs

From the documentation: Searches a fasta formatted contig file for RNA's and prints tbl style output. It works by using the RNA's from a set of closest (or explicitly named) organisms as search probes and then tries to guess the ends if the matches are not perfect. search_for_rnas includes a run of tRNAscan-SE.

Configurable options:

 Type of RNA        D = all; a string like "tRNA, 16S, 23S, 5S"
 Organism ID        D = none; Give your organism an ID like "EC", "BS" ..
 Domain             D = all; phylogenetic domain of organism, A,B,E
 Organism Genus     D = all; genus name of organism
 Organism Species   D = all; species name of organism

Other command line options:

 --probes     D = none; an organism name, or part of
 --complete   D = 95; min. pct completeness of probe sequence(s)

tRNAscan-SE

From the documentation:

tRNAscan-SE combines the specificity of the Cove probabilistic RNA prediction package (1) with the speed and sensitivity of tRNAscan 1.3 (2) plus an implementation of an algorithm described by Pavesi and colleagues (3), which searches for eukaryotic pol III tRNA promoters (our implementation referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds: (1) a false positive rate equally low to using Cove analysis, (2) the combined sensitivities of tRNAscan and EufindtRNA (detection of 98-99% of true tRNAs), and (3) search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 300-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).

Note: The current version of tRNAscan-SE (v 1.21) fails on sequences containing non-ATGC characters. Within the Reganor pipeline, tRNAscan-SE is only included if the sequences do not contain such characters.
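A check of this kind can be run on sequences before a tRNAscan-SE job is configured manually. The following Python sketch is not taken from the Reganor pipeline; the function name is hypothetical, and treating lower-case letters as valid is an assumption.

```python
import re

# Matches any character other than A, C, G or T (case-insensitive by assumption).
_NON_ACGT = re.compile(r"[^ACGT]", re.IGNORECASE)


def trnascan_applicable(sequences):
    """Return True only if no sequence contains a non-ATGC character."""
    return all(_NON_ACGT.search(seq) is None for seq in sequences)
```

A sequence with ambiguity codes such as N makes the function return False, in which case tRNAscan-SE would be left out.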

Configurable as additional options:

 -B or -P   : search for bacterial tRNAs (use bacterial tRNA model)
 -A         : search for archaeal tRNAs    (use archaeal tRNA model)
 -O         : search for organellar (mitochondrial/chloroplast) tRNAs
 -G         : use general tRNA model (cytoplasmic tRNAs from all 3 domains included)

 -C         : search using covariance model analysis only (max sensitivity, slow)

 -H         : show both primary and secondary structure components to
              covariance model bit scores
 -D         : disable pseudogene checking

Specify Alternate Cutoffs / Data Files:

 -X <score> : set cutoff score (in bits) for reporting tRNAs (default=20)
 -L <length>: set max length of tRNA intron+variable region (default=116bp)

 -I <score>  : manually set "intermediate" cutoff score for EufindtRNA
 -z <number> : use <number> nucleotides padding when passing first-pass
               tRNA bounds predictions to CM analysis (default=7)

 -g <file>   : use alternate genetic codes specified in <file> for
               determining tRNA type
 -c <file>   : use an alternate covariance model in <file>

Misc Options:

 -h         : print this help message
 -Q         : do not prompt user before overwriting pre-existing
              result files  (for batch processing)

 -n <EXPR>  : search only sequences with names matching <EXPR> string
               (<EXPR> may contain * or ? wildcard chars)
 -s <EXPR>  : start search at sequence with name matching <EXPR> string
               and continue to end of input sequence file(s)

Special Options (for testing & special purposes)

 -T          : search using tRNAscan only (defaults to strict params)
 -t <mode>   : explicitly set tRNAscan params, where <mode>=R or S
               (R=relaxed, S=strict tRNAscan v1.3 params)

 -E          : search using Eukaryotic tRNA finder (EufindtRNA) only
               (defaults to Normal search parameters when run alone,
                     or to Relaxed search params when run with Cove)
 -e <mode>   : explicitly set EufindtRNA params, where <mode>=R, N, or S
               (relaxed, normal, or strict)

 -r <file>   : save first-pass scan results from EufindtRNA and/or
               tRNAscan in <file> in tabular results format
 -u <file>   : search with Cove only those sequences & regions delimited
               in <file> (tabular results file format)
 -F <file>   : save first-pass candidate tRNAs in <file> that were then
               found to be false positives by Cove analysis
 -M <file>   : save all seqs that do NOT have at least one
               tRNA prediction in them (aka "missed" seqs)
 -v <file>   : save verbose tRNAscan 1.3 output to <file>
 -V <vers>   : run an alternate version of tRNAscan
               where <vers> = 1.3, 1.39, 1.4 (default), or 2.0
 -K          : Keep redundant tRNAscan 1.3 hits (don't filter out multiple
               predictions per tRNA identification)

Getorf

Besides the options which can be chosen directly, the following additional options for this program can be specified as 'additional command line options' during tool configuration with the ToolConfigurationWizard:

  -maxsize            integer    Maximum nucleotide size of ORF to report
  -find               menu       This is a small menu of possible output
                                 options. The first four options are to
                                 select either the protein translation or the
                                 original nucleic acid sequence of the open
                                 reading frame. There are two possible
                                 definitions of an open reading frame: it can
                                 either be a region that is free of STOP
                                 codons or a region that begins with a START
                                 codon and ends with a STOP codon. The last
                                 three options are probably only of interest
                                 to people who wish to investigate the
                                 statistical properties of the regions around
                                 potential START or STOP codons. The last
                                 option assumes that ORF lengths are
                                 calculated between two STOP codons.

  -[no]methionine     boolean    START codons at the beginning of protein
                                 products will usually code for Methionine,
                                 despite what the codon will code for when it
                                 is internal to a protein. This qualifier
                                 sets all such START codons to code for
                                 Methionine by default.
  -[no]reverse        boolean    Set this to be false if you do not wish to
                                 find ORFs in the reverse complement of the
                                 sequence.
  -flanking           integer    If you have chosen one of the options of the
                                 type of sequence to find that gives the
                                 flanking sequence around a STOP or START
                                 codon, this allows you to set the number of
                                 nucleotides either side of that codon to
                                 output. If the region of flanking
                                 nucleotides crosses the start or end of the
                                 sequence, no output is given for this codon.
