GenDBWiki/CoreDocumentation/FunctionPrediction: Difference between revisions
imported>LutzKrause No edit summary |
Revision as of 12:34, 23 November 2004
Tools for function prediction integrated in GenDB
This section describes the tools that can be used within the GenDB annotation system to predict the function of regions such as protein-coding genes (CDS).
Currently integrated programs
Program | Task | Reference |
BLAST | Search for sequence similarities | (http://nar.oupjournals.org/cgi/content/full/25/17/3389) |
HMMER | Profile HMMs for protein sequence analysis | (http://hmmer.wustl.edu/) |
Critica | Prediction of CDSs, RBSs, Frameshifts | (http://mbe.oupjournals.org/cgi/reprint/16/4/512) |
RBSFinder | Relocation of CDS starts, Prediction of RBSs | (http://bioinformatics.oupjournals.org/cgi/reprint/17/12/1123) |
SearchforRNAs | Prediction of t- and rRNAs | (Niels Larsen, University Aarhus, Denmark) |
tRNAscan-SE | Prediction of tRNAs | (http://nar.oupjournals.org/cgi/content/full/25/5/955) |
Getorf | Lists all ORFs | (Contained in EMBOSS package, http://www.hgmp.mrc.ac.uk/Software/EMBOSS/) |
Documentation for individual programs
In the following sections you can find detailed descriptions for all programs that can be used to predict region functions within GenDB.
BLAST
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. In GenDB they are the major tools used to generate observations.
Using genome sequences as Blast target
One of the main features of GenDB is that it stores only an abstract of each tool result as an observation instead of putting the whole result (e.g. a full Blast report with alignments) into the database. To recreate the alignment, there has to be a method for retrieving a single entry from a Blast-able database. GenDB currently uses a BioPerl-based index or an SRS query for the main genomic databases like EMBL or SwissProt. If you Blast some regions against a genome (typically the genome of a closely related organism), these default methods fail. Because such a genome database is small (approx. 3-8 MB instead of 1-2 GB), GenDB recreates the alignment by rerunning Blast instead. There is a small performance penalty, but in our tests the Blast run took 2-3 seconds on several systems, including high-end servers, desktop PCs and laptops.
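The choice between fetching a single entry and rerunning Blast can be sketched as follows. This is only an illustration and not GenDB's actual code: the `Observation` fields, the function names and the size threshold are assumptions made for this example. The only real interface used is the legacy NCBI `blastall` command line with its `-p`, `-d` and `-i` options.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold. The text only states that genome targets are
# about 3-8 MB, while databases like EMBL or SwissProt are 1-2 GB.
RERUN_LIMIT_BYTES = 100 * 1024 * 1024


@dataclass
class Observation:
    """Abstract of one Blast hit; the alignment itself is not stored."""
    query_id: str
    subject_id: str
    evalue: float
    database: str


def alignment_strategy(db_size_bytes: int, has_entry_index: bool) -> str:
    """Decide how the alignment for a stored observation is recreated."""
    if has_entry_index:
        return "fetch-entry"   # BioPerl index or SRS query
    if db_size_bytes <= RERUN_LIMIT_BYTES:
        return "rerun-blast"   # small genome database, rerun is cheap
    raise ValueError("large database without an entry index")


def blastall_command(program: str, database: str, query_file: str) -> List[str]:
    """Build a legacy blastall call: -p program, -d database, -i query."""
    return ["blastall", "-p", program, "-d", database, "-i", query_file]
```

The command is returned as an argument list so that it can be passed to `subprocess.run` without shell quoting issues.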
HMMER
Next to Blast and its databases, the Pfam database and the HMMER tools are the most important analysis tools integrated into GenDB. Profile hidden Markov models (profile HMMs) allow sensitive database searches based on a statistical description of a sequence family's consensus. HMMER is an implementation of profile HMMs for protein sequence analysis. hmmpfam, which is part of the HMMER package, can be used to scan e.g. the Pfam or TIGRFAM database for homologous protein domains. You need this tool (and most of the HMMER package) to use Pfam in GenDB!
Pfam database
The Pfam protein families database (http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D138) is a large collection of protein families and accurate protein domain definitions. The families are represented as profile hidden Markov models that are constructed and searched using HMMER. The main application of Pfam in GenDB is to predict the domain organization of proteins.
TIGRFAM database
TIGRFAMs is a database of protein families based on profile Hidden Markov Models (http://www.tigr.org/TIGRFAMs/index.shtml). The database can be searched using hmmpfam from the HMMER package.
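A search of either HMM database with hmmpfam could be prepared roughly like this. The sketch assumes the HMMER 2.x command line, in which `hmmpfam` takes the HMM database followed by the sequence file and accepts `-E` as an E-value cutoff. The function name and the file names are hypothetical, and this is not how GenDB itself invokes the tool.

```python
from typing import List, Optional


def hmmpfam_command(hmm_db: str, seq_file: str,
                    evalue: Optional[float] = None) -> List[str]:
    """Build an hmmpfam call (HMMER 2.x style) as an argument list."""
    cmd = ["hmmpfam"]
    if evalue is not None:
        # -E sets the E-value cutoff for reported hits
        cmd += ["-E", str(evalue)]
    # Positional arguments: HMM database first, then the query sequences
    cmd += [hmm_db, seq_file]
    return cmd
```

The same function covers Pfam and TIGRFAMs, since only the path of the HMM database changes.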
Pathname of HMMFETCH
Instead of fetching single entries from an SRS server, GenDB uses hmmfetch, a tool from the HMMER package that extracts single HMMs from the HMM file. It is used similarly to the BioPerl indices described above to recompute the full Pfam result.
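Extracting one model with hmmfetch might look like the sketch below. It assumes HMMER 2.x, where the HMM file has to be indexed once with `hmmindex` before `hmmfetch` can retrieve a model by name. The function name, file name and model name are placeholders chosen for this example.

```python
from typing import List


def hmmfetch_commands(hmm_file: str, model_name: str) -> List[List[str]]:
    """Return the two calls needed to extract one HMM (HMMER 2.x style)."""
    return [
        ["hmmindex", hmm_file],              # one-time index of the HMM file
        ["hmmfetch", hmm_file, model_name],  # writes the single HMM to stdout
    ]
```

The extracted model can then be searched against the original sequence to recompute the full result for a stored observation.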
InterPro
InterProScan is a meta tool that combines several well-known analysis tools like Blast, hmmpfam, etc. and several databases to predict the function of a protein. Hits found by InterProScan are associated with entries of the InterPro database, which contains descriptions of functional categories. There is also a mapping from InterPro to GO categories, which can be used for assigning functional categories to genes and other regions. More information about InterPro itself can be found at the InterPro site on the EBI web server.
InterProScan support has been integrated into GenDB. To use it, you have to download and install InterProScan, the backend binaries and the data tarball from the download site (approx. 90 MB).
Setting up InterProScan is beyond the scope of this documentation. Refer to the included README and FAQ.
An InterPro tool is set up using the Tool Creator Wizard or the Tool Configuration Wizard.
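The InterPro to GO mapping mentioned above is commonly distributed as a plain-text file named interpro2go. The following sketch reads such a file into a dictionary. It assumes the usual layout, with comment lines starting with `!` and data lines of the form `InterPro:IPR... description > GO:term name ; GO:0000000`. The accessions in the usage example are made up for illustration, and the function is not part of GenDB.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# One mapping line: InterPro accession, free text, then "; GO:<digits>" at the end
LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d+)\s*$")


def parse_interpro2go(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the GO identifiers listed for each InterPro accession."""
    mapping = defaultdict(list)
    for line in lines:
        if line.startswith("!"):
            continue  # header and comment lines
        match = LINE_RE.match(line)
        if match:
            mapping[match.group(1)].append(match.group(2))
    return dict(mapping)


# Made-up sample lines, for illustration only
sample = [
    "!comment line\n",
    "InterPro:IPR000001 Example entry > GO:example term ; GO:0000001\n",
    "InterPro:IPR000001 Example entry > GO:other term ; GO:0000002\n",
]
interpro_to_go = parse_interpro2go(sample)
```

With such a dictionary, the InterPro accession of a hit can be translated into GO categories for the annotated region.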
SignalP
SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences. Information about it can be found here. The tool is not freely available; you have to send a request to the author in order to receive it.
Configuring GenDB to support SignalP is done by the installer. Individual tool instances of SignalP can be created by using the Tool Creator Wizard or the Tool Configuration Wizard.
TMHMM
TMHMM predicts transmembrane regions of proteins. Information about it can be found here. The tool is not freely available; you have to send a request to the author.
Configuring GenDB to support TMHMM is done by the installer. Individual tool instances of TMHMM can be created by using the Tool Creator Wizard or the Tool Configuration Wizard.