GenDBWiki/CoreDocumentation/FunctionPrediction: Difference between revisions

From BRF-Software
imported>AlexanderGoesmann
No edit summary
m (21 revisions)
 
(19 intermediate revisions by one other user not shown)
= Tools for function prediction integrated in GenDB =


== Blast ==
This section describes the tools that can be used to predict the function of DNA regions such as protein-coding genes (CDS) within the GenDB annotation system. These tools are automatically configured by the installation system.
Blast 2 is one of the major tools used in GenDB to generate observations. We have included two options to influence the way Blast facts are generated:


=== Blast's built in filter ===
== Currently integrated programs ==
Before running the HSP search, Blast scans the input sequence for regions of low complexity, using the filter "DUST" for Blastn and "SEG" for the other flavours of Blast 2. See the [http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#Filter Blast FAQ] for information about filters. For some applications these filters should be disabled. This can be done by setting the appropriate options in the Blast Tool Configuration.
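
As a rough illustration of what disabling the filter amounts to, the sketch below assembles a command line for the legacy NCBI ''blastall'' program, whose <code>-F</code> option switches low-complexity filtering on (<code>T</code>) or off (<code>F</code>). The function name <code>build_blast_command</code> and the file names are hypothetical and not part of GenDB; inside GenDB the option is set through the Blast Tool Configuration rather than by hand.

```python
from typing import List


def build_blast_command(program: str, database: str, query_file: str,
                        use_filter: bool = True) -> List[str]:
    """Assemble a legacy NCBI blastall call (nothing is executed here)."""
    return [
        "blastall",
        "-p", program,      # Blast flavour, e.g. blastn or blastp
        "-d", database,     # Blast-able target database
        "-i", query_file,   # query sequence in FASTA format
        "-F", "T" if use_filter else "F",  # low-complexity filter on/off
    ]


if __name__ == "__main__":
    # Hypothetical database and file names, for illustration only.
    print(" ".join(build_blast_command("blastp", "swissprot", "cds.fasta",
                                       use_filter=False)))
```

Returning the command as a list rather than a single string keeps each argument separate, so it could be handed to a process launcher without shell quoting problems.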
 
{| border="1" cellpadding="2" cellspacing="0"
| '''Program'''
|  '''Task'''
| '''Reference'''
|-
| [[#blast|BLAST]]
|  Search for sequence similarities 
|  (http://nar.oupjournals.org/cgi/content/full/25/17/3389)
|-
| [[#hmmer|HMMER]]
|  Prediction of protein families and domains 
|  (http://hmmer.wustl.edu/)
|-
| [[#interpro|InterPro]]
|  Prediction of protein function and structure
|  (http://nar.oupjournals.org/cgi/content/full/31/1/315)
|-
| [[#signalp|SignalP]]
|  Detection of signal peptide cleavage sites in proteins
| [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WK7-4CKBS0M-3&_coverDate=07%2F16%2F2004&_alid=222995838&_rdoc=1&_fmt=&_orig=search&_qd=1&_cdi=6899&_sort=d&view=c&_acct=C000057302&_version=1&_urlVersion=0&_userid=2459438&md5=d6cec1333e0b2b1aa2d190ff62d26e01 SignalP]
|-
| [[#tmhmm|TMHMM]]
|  Search for transmembrane regions of proteins 
|  [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11152613&itool=iconabstr TMHMM]
|-
| [[#hth|Helixturnhelix]]
|  Detection of helix-turn-helix nucleic acid binding motifs in proteins
|  [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TCY-40CRS27-G&_coverDate=06%2F01%2F2000&_alid=223069367&_rdoc=1&_fmt=&_orig=search&_qd=1&_cdi=5183&_sort=d&view=c&_acct=C000057302&_version=1&_urlVersion=0&_userid=2459438&md5=fcef5227189fa9f0d87d36eeb435c121 EMBOSS]
|}
 
== Documentation for individual programs ==
 
In the following sections you can find detailed descriptions for all programs that can be used to predict region functions within GenDB.
 
<span id="blast"></span>
 
== BLAST ==
The [http://nar.oupjournals.org/cgi/content/full/25/17/3389 BLAST] programs are widely used tools for searching protein and DNA databases for sequence similarities. In GenDB
they are the major tools used to generate observations.


=== Using genome sequences as Blast target ===
One of the main features of GenDB is the ability to store only the abstracts of tool results as observations instead of putting the whole result (e.g. a full Blast report with alignments) into the database. To recreate the alignment, a method to get a single entry from a Blast-able database has to exist. GenDB currently uses a BioPerl based index or an SRS query for the main genomic databases like EMBL or SwissProt. If you want to Blast some regions versus a genome (probably the genome of a closely related organism), the default methods will fail. Because such a genome database is small (approx. 3-8 MB instead of 1-2 GB), GenDB recreates the alignment by rerunning Blast instead. There is a small performance penalty, but in our tests the Blast run took only 2-3 seconds on several systems, including high-end servers, desktop PCs and laptops.
 
<span id="hmmer"></span>


== HMMER ==


Next to Blast and its databases, the HMMER tools are the most important analysis tools integrated into GenDB. Profile hidden Markov models (profile HMMs) can be used for sensitive database searching, using statistical descriptions of a sequence family's consensus sequence.
HMMER is an implementation of profile HMM software for protein sequence analysis. ''hmmpfam'' is part of the HMMER package. This tool can be used for scanning e.g. the Pfam or TIGRFAM database for homologous protein domains. You need this tool (and most of the [http://hmmer.wustl.edu/#download HMMER] package) to use Pfam in GenDB!


=== Pfam database ===
The Pfam database consists of several files, such as entry descriptions and HMMs. To use this database you have to specify the complete path to the HMM model file. It is also necessary to build a special Pfam index, using the ''hmmindex'' tool (part of the HMMER package). The last major release of Pfam introduced a new file structure. Instead of a single HMM model file, the new release contains two files, ''Pfam_fs'' (optimized for local alignments) and ''Pfam_ls'' (optimized for global alignments). Usually you will use the Pfam_ls file, since these models are more selective.
The [http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D138 Pfam] protein families database is a large collection of protein families and accurate protein domain definitions. These families are represented as profile hidden Markov models (profile HMMs) that
are constructed and searched using HMMER. The parameters of the profile HMMs are derived from multiple alignments
of family member sequences. The main application of Pfam in GenDB is to predict the domain organization of proteins.
 
To use Pfam within GenDB you have to install the [http://hmmer.wustl.edu/#download HMMER] package and download the Pfam database (Pfam_ls.gz) from the [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/ Sanger site].
 
=== TIGRFAM database ===
[http://nar.oupjournals.org/cgi/content/full/31/1/371 TIGRFAMs] is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, etc. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains.
The TIGRFAMs models are designed to support function annotation of genomes. The database can be searched using hmmpfam from the HMMER package.
 
To use TIGRFAM within GenDB you have to install the [http://hmmer.wustl.edu/#download HMMER] package and to download the TIGRFAM database from the [ftp://ftp.tigr.org/pub/data/TIGRFAMs/ download site].
 
=== HMMFETCH ===
Instead of fetching single entries from an SRS server, the HMMER package provides a tool called ''hmmfetch'', which extracts single HMMs from the HMM file. This tool is used similarly to the BioPerl indices described above for recomputing the full Pfam result.


<span id="interpro"></span>


== [[InterPro]] ==
 
InterPRO-Scan is a meta tool that combines several well-known analysis tools like Blast and hmmpfam and several databases like PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs to predict protein families, domains and functional sites of proteins. Hits found by InterPRO-Scan are associated with entries in the InterPRO database, which contains descriptions for functional categories. There is also a mapping from InterPRO to GO categories, which can be used for assigning functional categories to genes and other regions. More information about InterPRO itself can be found at the [http://www.ebi.ac.uk/interpro/scan.html InterPRO site] on the EBI web server.


InterPRO-Scan support has been integrated into GenDB. To use it, you have to download and install InterPRO-Scan, the backend binaries and the data tarball from the [ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan download site] (approx. 90 MB).
Setting up InterPRO-Scan is beyond the scope of this documentation. Refer to the included README and FAQ.


Setting up an InterPRO tool is done using the Tool Creator Wizard or via the Tool Configuration Wizard.
<span id="signalp"></span>


== SignalP ==
SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences. Information about it can be found [http://www.cbs.dtu.dk/services/SignalP/ here]. The tool is not freely available; you have to send a request to the author in order to receive it.


Configuring GenDB to support SignalP is done by the installer. Individual tool instances of SignalP can be created by using the Tool Creator Wizard or the Tool Configuration Wizard.
<span id="hth"></span>
 
== TMHMM ==


TMHMM predicts transmembrane regions of proteins. Information about it can be found [http://www.cbs.dtu.dk/services/TMHMM/ here]. The tool is not freely available; you have to send a request to the author.
== Helixturnhelix ==


Configuring GenDB to support TMHMM is done by the installer. Individual tool instances of TMHMM can be created by using the Tool Creator Wizard or the Tool Configuration Wizard.
Helixturnhelix is part of the EMBOSS package and finds helix-turn-helix nucleic acid binding motifs in proteins. To use
the program in GenDB you need to install the EMBOSS package from the MRC [http://www.rfcgr.mrc.ac.uk/Software/EMBOSS/download.html download site].

Latest revision as of 07:13, 26 October 2011

Tools for function prediction integrated in GenDB

This section describes the tools that can be used to predict the function of DNA regions such as protein-coding genes (CDS) within the GenDB annotation system. These tools are automatically configured by the installation system.

Currently integrated programs

{| border="1" cellpadding="2" cellspacing="0"
| '''Program'''
| '''Task'''
| '''Reference'''
|-
| BLAST
| Search for sequence similarities
| (http://nar.oupjournals.org/cgi/content/full/25/17/3389)
|-
| HMMER
| Prediction of protein families and domains
| (http://hmmer.wustl.edu/)
|-
| InterPro
| Prediction of protein function and structure
| (http://nar.oupjournals.org/cgi/content/full/31/1/315)
|-
| SignalP
| Detection of signal peptide cleavage sites in proteins
| SignalP
|-
| TMHMM
| Search for transmembrane regions of proteins
| TMHMM
|-
| Helixturnhelix
| Detection of helix-turn-helix nucleic acid binding motifs in proteins
| EMBOSS
|}

Documentation for individual programs

In the following sections you can find detailed descriptions for all programs that can be used to predict region functions within GenDB.

BLAST

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. In GenDB they are the major tools used to generate observations.

Using genome sequences as Blast target

One of the main features of GenDB is the ability to store only the abstracts of tool results as observations instead of putting the whole result (e.g. a full Blast report with alignments) into the database. To recreate the alignment, a method to get a single entry from a Blast-able database has to exist. GenDB currently uses a BioPerl based index or an SRS query for the main genomic databases like EMBL or SwissProt. If you want to Blast some regions versus a genome (probably the genome of a closely related organism), the default methods will fail. Because such a genome database is small (approx. 3-8 MB instead of 1-2 GB), GenDB recreates the alignment by rerunning Blast instead. There is a small performance penalty, but in our tests the Blast run took only 2-3 seconds on several systems, including high-end servers, desktop PCs and laptops.
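
The decision described above can be pictured with a small sketch. It is not GenDB code: the <code>Observation</code> fields, the function name <code>alignment_recreation_plan</code> and the returned labels are hypothetical, chosen only to show that an observation stores an abstract of the hit and that the way the alignment is recreated depends on the kind of target database.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Observation:
    """Abstract of one Blast hit: enough to find the hit again, no alignment."""
    query_id: str
    target_db: str
    target_id: str
    evalue: float


def alignment_recreation_plan(obs: Observation,
                              genome_dbs: Set[str]) -> List[str]:
    """Describe how the full alignment for an observation would be recreated."""
    if obs.target_db in genome_dbs:
        # Small genome database without a per-entry index: rerun Blast.
        return ["rerun-blast", obs.query_id, obs.target_db]
    # Large indexed database (e.g. EMBL, SwissProt): fetch the single
    # entry via an index or SRS query first.
    return ["fetch-entry", obs.target_db, obs.target_id]


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    hit = Observation("cds_0001", "related_genome", "contig_1", 1e-30)
    print(alignment_recreation_plan(hit, genome_dbs={"related_genome"}))
```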

HMMER

Next to Blast and its databases, the HMMER tools are the most important analysis tools integrated into GenDB. Profile hidden Markov models (profile HMMs) can be used for sensitive database searching, using statistical descriptions of a sequence family's consensus sequence. HMMER is an implementation of profile HMM software for protein sequence analysis. hmmpfam is part of the HMMER package. This tool can be used for scanning e.g. the Pfam or TIGRFAM database for homologous protein domains. You need this tool (and most of the HMMER package) to use Pfam in GenDB!

Pfam database

The Pfam protein families database is a large collection of protein families and accurate protein domain definitions. These families are represented as profile hidden Markov models (profile HMMs) that are constructed and searched using HMMER. The parameters of the profile HMMs are derived from multiple alignments of family member sequences. The main application of Pfam in GenDB is to predict the domain organization of proteins.

To use Pfam within GenDB you have to install the HMMER package and download the Pfam database (Pfam_ls.gz) from the Sanger site.
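
Once HMMER is installed and the Pfam file has been uncompressed, the two HMMER calls involved can be sketched as follows. This is only an illustration of the command lines, assuming the HMMER 2 programs hmmindex and hmmpfam are on the search path; the helper name <code>pfam_scan_commands</code> and the file names are hypothetical, and GenDB's installer normally takes care of this configuration.

```python
from typing import List


def pfam_scan_commands(hmm_db: str, protein_fasta: str) -> List[List[str]]:
    """Return the HMMER command lines for indexing and scanning (not executed)."""
    return [
        ["hmmindex", hmm_db],                # index the HMM file for single-model lookup
        ["hmmpfam", hmm_db, protein_fasta],  # scan the proteins against all models
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in pfam_scan_commands("Pfam_ls", "proteins.fasta"):
        print(" ".join(command))
```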

TIGRFAM database

TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, etc. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains. The TIGRFAMs models are designed to support function annotation of genomes. The database can be searched using hmmpfam from the HMMER package.

To use TIGRFAM within GenDB you have to install the HMMER package and to download the TIGRFAM database from the download site.

HMMFETCH

Instead of fetching single entries from an SRS server, the HMMER package provides a tool called hmmfetch, which extracts single HMMs from the HMM file. This tool is used similarly to the BioPerl indices described above for recomputing the full Pfam result.
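
To make the idea of extracting a single model concrete, here is a minimal sketch that pulls one record out of a multi-model text file by name. It is not a replacement for hmmfetch (which works on an index rather than scanning the file) and it rests on an assumption about the file layout: each record carries a "NAME" line and ends with a line containing only "//". The function name <code>fetch_model</code> and the toy records are made up for the example.

```python
from typing import Iterable, List, Optional


def fetch_model(lines: Iterable[str], name: str) -> Optional[str]:
    """Return the text of the first record whose NAME field equals `name`.

    Assumes every record contains a "NAME <identifier>" line and is
    terminated by a line holding only "//".
    """
    record: List[str] = []
    matched = False
    for line in lines:
        record.append(line)
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "NAME" and fields[1] == name:
            matched = True
        if line.strip() == "//":   # end of the current record
            if matched:
                return "".join(record)
            record = []            # discard it and start the next record
    return None


if __name__ == "__main__":
    # Toy records with made-up names; real HMM files hold many more fields.
    sample = ("HMMER2.0\nNAME  fam_a\nLENG  10\n//\n"
              "HMMER2.0\nNAME  fam_b\nLENG  20\n//\n")
    print(fetch_model(sample.splitlines(keepends=True), "fam_b"))
```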

InterPro

InterPRO-Scan is a meta tool that combines several well-known analysis tools like Blast and hmmpfam and several databases like PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs to predict protein families, domains and functional sites of proteins. Hits found by InterPRO-Scan are associated with entries in the InterPRO database, which contains descriptions for functional categories. There is also a mapping from InterPRO to GO categories, which can be used for assigning functional categories to genes and other regions. More information about InterPRO itself can be found at the InterPRO site on the EBI web server.
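
How such a mapping could be used to turn hits into GO categories is sketched below. The line layout ("InterPro:<accession> <name> > GO:<term> ; GO:<id>", with comment lines starting with "!") is an assumption about the mapping file, the sample lines are invented and not taken from any release, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed layout: "InterPro:IPR000000 some name > GO:term name ; GO:0000000"
_MAPPING_LINE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d+)\s*$")


def parse_interpro2go(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the GO identifiers listed for each InterPro accession."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.startswith("!"):          # comment line
            continue
        match = _MAPPING_LINE.match(line.strip())
        if match:
            mapping[match.group(1)].append(match.group(2))
    return dict(mapping)


def go_terms_for_hits(hits: Iterable[str],
                      mapping: Dict[str, List[str]]) -> List[str]:
    """Return the sorted, de-duplicated GO identifiers for a list of hits."""
    terms = {go for hit in hits for go in mapping.get(hit, [])}
    return sorted(terms)


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "! made-up example data",
        "InterPro:IPR999991 Example family > GO:example term ; GO:9999991",
        "InterPro:IPR999991 Example family > GO:other term ; GO:9999992",
        "InterPro:IPR999992 Another family > GO:example term ; GO:9999991",
    ]
    table = parse_interpro2go(sample)
    print(go_terms_for_hits(["IPR999991", "IPR999992"], table))
```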

InterPRO-Scan support has been integrated into GenDB. To use it, you have to download and install InterPRO-Scan, the backend binaries and the data tarball from the download site (approx. 90 MB).

Setting up InterPRO-Scan is beyond the scope of this documentation. Refer to the included README and FAQ.

SignalP

SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences. Information about it can be found here. The tool is not freely available; you have to send a request to the author in order to receive it.

Helixturnhelix

Helixturnhelix is part of the EMBOSS package and finds helix-turn-helix nucleic acid binding motifs in proteins. To use the program in GenDB you need to install the EMBOSS package from the MRC download site.