BRF-Software - User contributions [en]

GenDBWiki/WebDocumentation/FAQs

2011-10-28T14:03:42Z

Tk:

__NOTOC__
= FAQs for the GenDB Web Interface =

1.) [[#Q1|What is the difference between the 'Add New' and the 'Accept' button in the Annotation Dialog?]]

2.) [[#Q2|How can I view an alignment?]]

3.) [[#Q3|How do I search for a gene name like dnaA?]]

4.) [[#Q4|?]]

----


== Q1: What is the difference between the 'Add New' and the 'Accept' button in the Annotation Dialog? ==

Both buttons create a new annotation. Pressing the <tt>Accept</tt> button automatically sets the <tt>status region</tt> to <tt>finished</tt> and the <tt>status function</tt> to <tt>annotated</tt>. If you want to set a different status for a region, you have to use the <tt>Add New</tt> button.


== Q2: How can I view an alignment? ==

In the list of observations for a selected CDS you can find entries for each tool. Click on 'Result' in the rightmost column to view the alignment. If you don't see any observations of a tool in your list, you have to adjust the tool settings in the configuration dialog.


== Q3: How do I search for a gene name like dnaA? ==

Open the search dialog and enter the gene name into the field Genename (NOT Name, since this is the name of a region assigned automatically by GenDB). Select 'exact' if you know the exact gene name, such as 'dnaA'. If you want to search for names that follow a specific pattern, select 'regexp'. You can then use regular expressions, for example:

* <tt>gyr.</tt> searches for gyrA, gyr1, gyrB, ...
* <tt>gyr.*</tt> searches for gyrA, gyr1, gyrB, gyrAA, gyrABC, gyr123ABCD, ...
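
The difference between the two patterns can be checked with a short script. This is a minimal sketch in Python, not GenDB code: the helper <tt>search_gene_names</tt> is hypothetical, and it assumes that the pattern has to match the whole gene name. How the GenDB search anchors the expression internally is not documented here.

```python
import re

# Gene names taken from the examples above.
names = ["dnaA", "gyrA", "gyr1", "gyrB", "gyrAA", "gyrABC", "gyr123ABCD"]


def search_gene_names(pattern, candidates):
    """Return the candidates that match the pattern as a whole (assumed anchoring)."""
    regex = re.compile(pattern)
    return [name for name in candidates if regex.fullmatch(name)]


# 'gyr.' requires exactly one character after 'gyr'.
one_char = search_gene_names("gyr.", names)

# 'gyr.*' accepts any number of characters after 'gyr'.
any_tail = search_gene_names("gyr.*", names)
```

Under this assumption, <tt>gyr.</tt> selects only the four-letter names, while <tt>gyr.*</tt> selects every name that starts with gyr.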


== Q4: ? ==

GenDBWiki/WebDocumentation/FAQs

2011-10-28T14:03:01Z

Tk:

GenDBWiki

2011-10-28T14:01:54Z

Tk: /* Contact */

__NOTOC__

= GenDB Overview Page =

== GenDB - An open source genome annotation system for prokaryote genomes ==

The GenDB system is an open source genome annotation system developed by the "Bioinformatics Resource Facility (BRF)" of the "Center for Biotechnology ([http://www.cebitec.uni-bielefeld.de/ CeBiTec])" at "Bielefeld University".

Please have a look at the [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/ GenDB website] for general information, a guided tour, login to a demo project, a list of FAQs and more information about the availability of this software. Within this GenDB Wiki you will find more detailed information about installing and maintaining GenDB, using the system, and writing your own programs. Since this information is likely to change quite frequently we decided to put it into a Wiki system. If you discover new features that are not listed here you are of course welcome to extend this documentation.

Since GenDB version 2.2 the sources have been split into three separate major packages:

* [[GenDBWiki/CoreDocumentation|GenDB CORE]] includes the complete GenDB backend with all O2DBI class modules and several extensions but no graphical user interface. It contains the basic functionality for annotating a genome automatically by using a number of simple scripts available within this package.
* [[GenDBWiki/WebDocumentation|GenDB WEB]] contains the modules required for running the GenDB web frontend. It can be used for a distributed manual genome annotation.
* [[GenDBWiki/GUIDocumentation|GenDB GUI]] includes the sources of the graphical user interface written in Gtk. This frontend provides some enhanced functionality and it has more features than the web interface.

The following sections contain detailed information about each package and how to use it for annotating a genome. You can also find some documentation for software developers and system administrators in separate Wiki sections. If you are using the GenDB system for the first time, you should also read the [[GenDBWiki/TermsAndConcepts|Terms and Concepts]] section where the conceptual details of the system are explained.

== Installation ==

* [[GenDBWiki/AdministratorDocumentation/GenDBInstallation|Installation Guide]]
* [[GenDBWiki/AdministratorDocumentation/GenDBInstallationFAQ|Installation FAQ]]

== Administration ==

* [[GenDBWiki/AdministratorDocumentation/ManagingUsers|Managing Users and Projects]]

== Documentation ==
* [[GenDBWiki/CoreDocumentation|Core Documentation]]
* [[GenDBWiki/WebDocumentation|Web Documentation]]
* [[GenDBWiki/GUIDocumentation|GUI Documentation]]
* [[GenDBWiki/DeveloperDocumentation|Developer Documentation]]
* [[GenDBWiki/AdministratorDocumentation|Administrator Documentation]]

== General ==
* [[GenDBWiki/FeatureTable|Feature Table]]
* [[GenDBWiki/TermsAndConcepts|Terms and Concepts]]
* [[GenDBWiki/FuturePlans|Future Plans]]

== Other Issues ==

For more information about project setup, license, application examples, and publications, please go to the [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/ GenDB website].

== Contact ==

Please send an e-mail for account requests or questions concerning the use of GenDB to:
<tt>gendb AT cebitec DOT uni DASH bielefeld DOT de</tt>

For bug reports, please use our bug reporting system [https://bugs.cebitec.uni-bielefeld.de/ CeBiTec Bugzilla].

Author: [http://www.cebitec.uni-bielefeld.de/~agoesman Alexander Goesmann]

File:Mainmenu.png

2011-10-28T12:35:15Z

Tk:

GenDBWiki/WebDocumentation/MainMenu

2011-10-28T12:34:49Z

Tk: Created page with "__NOTOC__ = GenDB-Web - Main Menu = The main navigation in the Web Interface is done using the black menu bar in the main window. In the 'Contigs' menu all the contigs in the G..."

__NOTOC__
= GenDB-Web - Main Menu =

The main navigation in the Web Interface is done using the black menu bar in the main window.

In the 'Contigs' menu all the contigs in the GenDB project are listed and can be selected. Choosing a contig redisplays the [[GenDBWiki/WebDocumentation/MainViews/ContigView|Contig View]], jumping to the beginning of the selected contig.

[[File:mainmenu.png]]

The 'Tools' menu contains tools that work on the whole genome. One tool is the [[GenDBWiki/WebDocumentation/DialogWindows/CircularPlot|Circular Plot]], which displays the complete selected contig graphically. Another tool searches for patterns in the genome via [[GenDBWiki/WebDocumentation/DialogWindows/PatScan|PatScan]].

Clicking the 'Search' menu opens a [[GenDBWiki/WebDocumentation/DialogWindows/SearchDialog|dialog window for searching the GenDB regions]].

Three different views are currently available for classification. The [[GenDBWiki/WebDocumentation/MainViews/KEGGView|KEGG View]] shows the data in its metabolic context. The [[GenDBWiki/WebDocumentation/MainViews/COGView|COG View]] and the [[GenDBWiki/WebDocumentation/MainViews/GOView|GO View]] show the functional classifications for the genes in a GenDB project.

The 'Preferences' menu gives access to the [[GenDBWiki/WebDocumentation/DialogWindows/ConfigDialog|Config Dialog]]. Another option in this menu turns the Region Observations in the [[GenDBWiki/WebDocumentation/MainViews/ContigView|Contig View]] on and off.

With 'Logout' the user can end their GenDB session.

GenDBWiki/WebDocumentation

2011-10-28T12:33:02Z

Tk:

__NOTOC__
= Web Interface Documentation =

In the following sections you can find detailed information and guidelines for using the GenDB web interface.

== Table of Contents ==

* [[GenDBWiki/WebDocumentation/GeneralUsage|General Usage]]
* [[GenDBWiki/WebDocumentation/MainMenu|Main Menu]]
* [[GenDBWiki/WebDocumentation/MainViews|Main Views]]
* [[GenDBWiki/WebDocumentation/ContextMenus|Context Menus]]
* [[GenDBWiki/WebDocumentation/DialogWindows|Dialog Windows]]
* [[GenDBWiki/WebDocumentation/HowTos|HowTos]]
* [[GenDBWiki/WebDocumentation/FAQs|FAQs]]

Authors: [http://www.cebitec.uni-bielefeld.de/~jomuna Jomuna V. Choudhuri], [http://www.cebitec.uni-bielefeld.de/~agoesman Alexander Goesmann]

GenDBWiki/RolesAndRights

2011-10-28T12:28:56Z

Tk: Created page with "__NOTOC__ = GenDB Roles and Rights = This section describes the ''Roles'' and ''Rights'' as they were defined for the genome annotation system Gen``DB which extensively uses dif..."

__NOTOC__
= GenDB Roles and Rights =

This section describes the ''Roles'' and ''Rights'' as they were defined for the genome annotation system GenDB, which makes extensive use of different roles for sophisticated access control.

== GenDB Roles ==

<pre>
PROJECT_CLASS GENDB

# user with read only permissions and almost completely restricted access
ROLE Guest
RIGHT basic_access

# user who is allowed to write annotations and recompute the observations
# for a single region
ROLE Annotator
RIGHT basic_access
RIGHT annotate
RIGHT export_region_data
RIGHT recompute

# (external) user who is allowed to do most of the necessary tasks to maintain a project
# (e.g. import/export/edit/delete sequence, add tools and submit all jobs)
# this role should be used if several persons have to edit the sequence e.g. to correct frame-shifts
ROLE Maintainer
RIGHT basic_access
RIGHT recompute
RIGHT submit_jobs
RIGHT contig_import_export
RIGHT edit_sequence
RIGHT add_tools
RIGHT export_region_data
RIGHT delete_contig
RIGHT annotate
RIGHT region_prediction

# user who is responsible for the database and for the solution of bugs and problems
# can do almost everything and also MODIFY THE DATABASE (e.g. alter table)
ROLE Developer
RIGHT contig_import_export
RIGHT region_prediction
RIGHT submit_jobs
RIGHT recompute
# frame-shift correction and contig update
RIGHT edit_sequence
RIGHT add_tools
RIGHT export_region_data
RIGHT delete_contig
RIGHT configure_project
RIGHT basic_access
RIGHT annotate
RIGHT modify_db

# user who is responsible for the project (in the majority of cases this is one of the
# GenDB developers in Bielefeld), can do everything (e.g. configure project) except
# modifying the database
# has to add Maintainers, Annotators and Guests but cannot add Developers
ROLE Chief
RIGHT annotate
RIGHT add_user
RIGHT contig_import_export
RIGHT region_prediction
RIGHT submit_jobs
RIGHT recompute
# frame-shift correction and contig update
RIGHT edit_sequence
RIGHT add_tools
RIGHT export_region_data
RIGHT delete_contig
RIGHT configure_project
RIGHT basic_access
</pre>
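
The role listing can be read as a mapping from each role to a set of rights. The following minimal Python sketch transcribes the listing above into such a mapping and compares two roles. The names <tt>ROLES</tt> and <tt>has_right</tt> are hypothetical and only serve as an illustration; they are not part of the GenDB code base.

```python
# Role definitions transcribed from the listing above.
COMMON_MAINTENANCE = {
    "basic_access", "recompute", "submit_jobs", "contig_import_export",
    "edit_sequence", "add_tools", "export_region_data", "delete_contig",
    "annotate", "region_prediction",
}

ROLES = {
    "Guest": {"basic_access"},
    "Annotator": {"basic_access", "annotate", "export_region_data", "recompute"},
    "Maintainer": set(COMMON_MAINTENANCE),
    "Developer": COMMON_MAINTENANCE | {"configure_project", "modify_db"},
    "Chief": COMMON_MAINTENANCE | {"configure_project", "add_user"},
}


def has_right(role, right):
    """Return True if the given role includes the given right."""
    return right in ROLES[role]


# Rights that distinguish the Chief from the Developer, and vice versa.
chief_only = ROLES["Chief"] - ROLES["Developer"]
developer_only = ROLES["Developer"] - ROLES["Chief"]
```

The two set differences reflect the comments in the listing: only the Chief can add users, and only the Developer can modify the database.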

== GenDB Rights ==

<pre>
CLASS GENDB

RIGHT basic_access
DS_TYPE GENDB
DB select
DS_TYPE GPMSDB
DB select
TABLE sessions delete update insert
TABLE sessions_not_permanent delete update insert
TABLE sessions_permanent delete update insert
TABLE Member_User_Project_Configs update delete insert
TABLE Member_User_Project_Configs_hash_value update delete insert
TABLE ProjectManagement_counters update

RIGHT annotate
DS_TYPE GENDB
DB insert update

RIGHT export_region_data

RIGHT recompute
DS_TYPE GENDB
DB delete update insert

RIGHT submit_jobs
DS_TYPE GENDB
DB insert update delete

RIGHT contig_import_export
DS_TYPE GENDB
DB insert update delete

# may only be granted to user if user has right annotate
RIGHT edit_sequence
DS_TYPE GENDB
DB update insert

RIGHT add_tools
DS_TYPE GENDB
DB insert update

RIGHT delete_contig
DS_TYPE GENDB
DB delete

RIGHT region_prediction
DS_TYPE GENDB
DB insert update delete

RIGHT configure_project
DS_TYPE GENDB
DB insert update delete

RIGHT modify_db
DS_TYPE GENDB
DB insert update delete alter index create drop references

RIGHT add_user
DS_TYPE GENDB
DB grant insert update delete
DS_TYPE GPMSDB
DB grant insert update delete
</pre>
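
To illustrate how the rights listing is structured, the following Python sketch parses a short excerpt of it into nested dictionaries. It is a minimal sketch under stated assumptions: the function <tt>parse_rights</tt> is hypothetical, and it assumes that every <tt>DB</tt> and <tt>TABLE</tt> line belongs to the most recent <tt>DS_TYPE</tt> line, since the listing shows no indentation.

```python
# Excerpt copied from the rights listing above.
EXCERPT = """
CLASS GENDB

RIGHT basic_access
DS_TYPE GENDB
DB select
DS_TYPE GPMSDB
DB select
TABLE sessions delete update insert

RIGHT annotate
DS_TYPE GENDB
DB insert update

RIGHT export_region_data

RIGHT modify_db
DS_TYPE GENDB
DB insert update delete alter index create drop references
"""


def parse_rights(text):
    """Parse RIGHT / DS_TYPE / DB / TABLE lines into nested dictionaries."""
    rights = {}
    current_right = None
    current_ds = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and the CLASS header.
        if not line or line.startswith("#") or line.startswith("CLASS"):
            continue
        keyword, *values = line.split()
        if keyword == "RIGHT":
            current_right = rights.setdefault(values[0], {})
            current_ds = None
        elif keyword == "DS_TYPE":
            current_ds = current_right.setdefault(
                values[0], {"DB": set(), "TABLE": {}}
            )
        elif keyword == "DB":
            # Database-wide privileges (assumed to apply to the last DS_TYPE).
            current_ds["DB"].update(values)
        elif keyword == "TABLE":
            # Table-specific privileges: table name followed by privileges.
            current_ds["TABLE"][values[0]] = set(values[1:])
    return rights


rights = parse_rights(EXCERPT)
```

A right without any <tt>DS_TYPE</tt> entry, such as <tt>export_region_data</tt>, ends up with an empty dictionary, i.e. it grants no database privileges of its own.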

File:ToolConcept.png

2011-10-28T12:27:11Z

Tk:

GenDBWiki/ToolAndJobConcept

2011-10-28T12:26:52Z

Tk:

= The GenDB Tool and Job Concept =

One major improvement of the GenDB system in comparison to the first version is the modular concept for the integration of bioinformatics tools (e.g. Blast). GenDB allows the incorporation of arbitrary programs for different kinds of bioinformatics analysis. According to the system design, each of these programs is integrated as a ''Tool'' (e.g. ''Tool::Function::Blast''), which creates ''Observations'' for a specific kind of ''Region''. A ''Job'' that can be submitted to the scheduling system thus contains the information about a valid tool and region combination as illustrated below.

[[File:ToolConcept.png]]

For most tools, GenDB also features simple automatic annotators that can be activated. They are started upon completion of a tool run and create automatic annotations employing a simple "best hit" strategy based on the observations created by the tool run.

For an automated large scale computation of various bioinformatics tools, a scalable framework was developed and implemented which allows a batch submission of thousands of ''Jobs'' in a very simple manner. To do so, the following steps have to be performed:

1. The desired ''Jobs'' have to be created, e.g. for region or function prediction by using the ''JobSubmitter Wizard''. This can be done quite easily with the ''submit_job.pl'' script or via the graphical user interface. For all valid region and tool combinations as defined by the user, the requested ''Jobs'' will be created and stored in the GenDB project database. Initially, these new ''Jobs'' have the status ''PENDING''.

2. Before the ''submit_job.pl'' script finishes, it calls the ''submit'' method of the ''JobSubmitter Wizard''. Thus, all previously created ''Jobs'' will be registered as a ''Job Array'' in the ''Scheduler::Codine'' using the ''Scheduler::Codine->freeze'' method. Finally, the array of all ''Jobs'' is submitted by calling ''Scheduler::Codine->thaw''. All ''Jobs'' should now have the status ''SUBMITTED'' and a queue of ''Jobs'' should appear in the status report of the Sun GridEngine's ''qstat'' output.

3. In the previous step, each ''Job'' was submitted to the scheduler by adding the command line for each single ''Job'' computation to the list of ''Jobs''. Actually, the script ''runtool.pl'' is called for each ''Job'' with the corresponding arguments such as ''runtool.pl -p <projectname> -j <jobid> [-a]''.

4. When such a command line is executed by one of the compute hosts, the script ''runtool.pl'' tries to initialize the ''Job'' object for the given id and project name. Since a ''Job'' contains the information about a specific region and a single tool that should be computed for that region, this script can now execute the ''run'' method that has to be defined for each tool. Such a ''run'' method normally starts a bioinformatics tool (e.g. Blast, Pfam, InterPro) for the given region and stores some observations for the results obtained. During this computation the status of the current ''Job'' is ''RUNNING''. If the option ''-a'' was specified an automatic annotation will be started upon successful computation of the tool. These are only very simple automatic annotations since they are based on the results of a single tool and region combination. Whenever the computation itself or the automatic annotation fails, the status of a ''Job'' is set to ''FAILED'', otherwise the status is ''FINISHED'' and the computation is complete.
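
The status changes described in steps 1 to 4 can be summarized as a small state machine. The following Python sketch is only an illustration of that life cycle, not the GenDB implementation (GenDB itself uses the scripts named above). The names <tt>JobStatus</tt>, <tt>advance</tt>, and <tt>runtool_command</tt> are hypothetical, and the sketch assumes that ''FINISHED'' and ''FAILED'' are terminal states, which the text does not state explicitly.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


# Allowed transitions as described in steps 1 to 4.
# Assumption: FINISHED and FAILED have no outgoing transitions.
TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.SUBMITTED},
    JobStatus.SUBMITTED: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.FINISHED, JobStatus.FAILED},
    JobStatus.FINISHED: set(),
    JobStatus.FAILED: set(),
}


def advance(current, new):
    """Return the new status, or raise ValueError for an illegal transition."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def runtool_command(project, job_id, auto_annotate=False):
    """Build the argument list for the runtool.pl call quoted in step 3."""
    command = ["runtool.pl", "-p", project, "-j", str(job_id)]
    if auto_annotate:
        # -a requests an automatic annotation after a successful tool run.
        command.append("-a")
    return command
```

For example, <tt>runtool_command("demo_project", 42, auto_annotate=True)</tt> yields the argument list for a call with the <tt>-a</tt> option, where the project name and job id are made-up values.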

The inclusion of new tools in GenDB is very easy, with the most time-consuming step typically being the implementation of a parser for the result files. For the prediction of regions, such as coding sequences (CDS) or tRNAs, GLIMMER, CRITICA, tRNAscan-SE, and others have been integrated into the system.

Homology searches on DNA or amino acid level in arbitrary sequence databases can be done using the Blast program suite. In addition to using HMMer for motif searches, we also search the BLOCKS and InterPro databases to classify sequence data based on a combination of different kinds of motif search tools. A number of additional tools have been integrated for the characterization of certain features of coding sequences, such as TMHMM for the prediction of alpha-helical transmembrane regions, SignalP for signal peptide prediction, or CoBias for analyzing trends in codon usage.

Since all tools have to be defined separately for each project, a tool configuration wizard was implemented to support this task.

Whereas some tools only return a numeric score and/or an E-value as a result, other tools like Blast or HMMer additionally provide more detailed information, such as an alignment. Although the complete tool results are available to the annotator, only a minimum data subset is stored in the form of observations. Based on this subset, the complete tool result record can be recomputed on demand. Storing only a minimal subset of data reduces the storage demands by two orders of magnitude when compared to the traditional "store everything" approach. Our performance measurements have shown this also to be more time efficient than data retrieval from a disk subsystem for any realistic genome project.

GenDBWiki/ToolAndJobConcept

2011-10-28T12:26:31Z

Tk: Created page with "= The GenDB Tool and Job Concept = One major improvement of the Gen``DB system in comparison to the first version, is the modular concept for the integration of bioinformatics t..."

GenDBWiki/AnnotationConcepts

2011-10-28T12:25:21Z

Tk: Created page with "= Details about specific GenDB Concepts = The following sections illustrate some specific concepts that are implemented within the GenDB genome annotation system. Before you sta..."

= Details about specific GenDB Concepts =

The following sections illustrate some specific concepts that are implemented within the GenDB genome annotation system. Before you start annotating a genome you should read these sections in order to understand how the system works with your data and how some information is stored systematically to facilitate further analyses.

== Annotation Concepts ==

In general, annotations are used within the GenDB framework to store information about a region. They are created either by an automatic annotation step or by a human annotator. In contrast to observations, which can be recomputed on demand, an annotation is '''never deleted'''. Instead, all annotations are stored with a timestamp in chronological order. Storing such a history of annotations allows you to reproduce how the collected information about a region has evolved over time.

Furthermore, the GenDB system distinguishes two different types of annotations:

* '''Region annotations''' are used to describe why, how, and when a region (e.g. a CDS) was created or modified.
* '''Function annotations''' explain the functional role of a region and the (potential) tasks that a region is involved in within an organism. For the annotation of coding sequences, the function annotation is used to characterize a gene: a gene name, a gene product, a description, a functional category, and other information can be assigned when creating a new annotation.

In addition to the history of all annotations, each region can refer to a single region annotation and a single function annotation as its latest annotation. This '''latest_annotation_region''' / '''latest_annotation_function''' contains the currently valid annotation, which is presented to the user when a region is selected. This is usually, but not necessarily, the latest annotation by date (recomputing an automatic annotation may or may not set the latest annotation). Whenever a region is exported (e.g. EMBL or GenBank export), only the latest annotation is used by default for data generation. It is also important to note that the status of a region is usually only changed by creating a new annotation, in order to record how the status evolved.
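
The append-only annotation history and the separate pointers to the currently valid annotations can be sketched as follows. This is a minimal Python illustration of the concept only. The classes <tt>Annotation</tt> and <tt>Region</tt>, the method <tt>annotate</tt>, and its <tt>set_latest</tt> parameter are hypothetical and do not mirror the actual GenDB class modules.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    kind: str            # "region" or "function"
    description: str
    timestamp: datetime


@dataclass
class Region:
    name: str
    history: List[Annotation] = field(default_factory=list)
    latest_annotation_region: Optional[Annotation] = None
    latest_annotation_function: Optional[Annotation] = None

    def annotate(self, annotation, set_latest=True):
        """Add an annotation to the history; existing entries are never removed."""
        if annotation.kind not in ("region", "function"):
            raise ValueError(f"unknown annotation kind: {annotation.kind}")
        self.history.append(annotation)
        # Keep the history in chronological order.
        self.history.sort(key=lambda entry: entry.timestamp)
        if set_latest:
            if annotation.kind == "region":
                self.latest_annotation_region = annotation
            else:
                self.latest_annotation_function = annotation
```

Calling <tt>annotate</tt> with <tt>set_latest=False</tt> models the case in which a newer annotation is stored in the history while the currently valid annotation stays unchanged.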

== Observation Levels ==

Within the GenDB annotation system all results from bioinformatics tools (e.g. Glimmer gene predictions or alignments from BLAST runs) are called observations. In order to allow a comparison of results computed by different tools (e.g. by BLAST and Pfam), all observations can be grouped into levels ranging from 1 (high) to 5 (low) according to their quality. As an example, this could mean that level 1 is assigned to all BLAST results with an e-value lower than 1E-50 while the same level is assigned to Pfam observations only if they have an e-value lower than 1E-80. Since GenDB version 2.2 the corresponding ranges for assigning the level to an observation can be set individually for each tool instance (e.g. you can assign different levels for a BLAST vs. SwissProt and a BLAST vs. EMBL search).
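
The level assignment can be pictured as a lookup of tool-specific e-value bounds. The following Python sketch is an illustration under stated assumptions: only the level 1 bounds (1E-50 for BLAST and 1E-80 for Pfam) are taken from the example above, all other bounds are invented, and the names <tt>LEVEL_BOUNDS</tt> and <tt>observation_level</tt> are hypothetical.

```python
# Upper e-value bounds for levels 1 to 4; anything else falls into level 5.
# Only the first bound of each tool comes from the text above.
# The remaining bounds are invented for illustration.
LEVEL_BOUNDS = {
    "BLAST": [1e-50, 1e-30, 1e-10, 1e-3],
    "Pfam": [1e-80, 1e-50, 1e-20, 1e-5],
}


def observation_level(tool, e_value):
    """Return the level (1 = high, 5 = low) for an observation of the given tool."""
    for level, bound in enumerate(LEVEL_BOUNDS[tool], start=1):
        if e_value < bound:
            return level
    return 5
```

With these bounds, an e-value of 1E-60 is a level 1 observation for BLAST but only a level 2 observation for Pfam, which shows how the same e-value can be rated differently per tool.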

== Status Region vs. Status Function ==

The '''region status''' can be used to indicate the progress of an ongoing annotation with respect to the reliability of each region. A specific '''region status''' could be set to indicate a potential frameshift or wrong gene start. Setting a special '''region status''' could also be used to distinguish the predictions from different gene finding tools, e.g. Glimmer and Critica. The '''region status''' can be one of the following:

* '''ignored''': Setting this status will ignore a region for all further analysis and also for all exports, e.g. to an EMBL file. Instead of deleting a region, which is usually not supported by GenDB, the status should be set to ignored so that other annotators can still see that region and know immediately that someone has already checked and discarded it.

* '''putative''': This status is assigned to regions that have just been created, e.g. by a human annotator.

* '''attention needed''': This status indicates that the region annotation of this region is not reliable and needs to be revised.

* '''status 1''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''status 2''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''status 3''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''finished''': The region and function annotation of this region has been confirmed by a human annotator. This region needs no more work, and there is probably also some experimental evidence that confirms the annotation.

The '''functional status''' can be used to indicate the status of a region during an annotation with respect to the information content of a functional annotation. Whenever the function of a region is annotated a status from the following list should be assigned:

* '''putative''': All regions are initially set to this status after creation. There is neither an automatic nor a manual function annotation.

* '''attention needed''': This status indicates that the functional annotation of this region is not reliable and needs to be revised.

* '''automatically annotated''': The function of this region was automatically annotated, e.g. by Metanor. In the beginning of a manual annotation, most of the regions will have this function status.

* '''status 1''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''status 2''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''status 3''': This status can be used to assign an individual property defined specifically for each project (sometimes unused).

* '''annotated''': The function of this region has been verified by a human annotator. This implies that the annotation of such a region is done, and therefore the region status will usually be set to ''finished''.

File:HomoserineDehydrogenase.png

2011-10-28T12:23:27Z

Tk:

File:GenDBFlowDetail.png

2011-10-28T12:23:10Z

Tk:

File:RegionPrediction.png

2011-10-28T12:22:54Z

Tk:

File:WholeGenomeShotgun.png

2011-10-28T12:22:34Z

Tk:

File:CloneByClone.png

2011-10-28T12:22:15Z

Tk:

GenDBWiki/IntroductionToGenomics

2011-10-28T12:20:38Z

Tk: added images

= A brief Introduction to Genomics Research =

After James D. Watson and Francis H. C. Crick described the structure of the DNA helix in 1953, the basic mechanisms of DNA replication and recombination, protein synthesis, and gene expression were rapidly unravelled. Technological advances like the invention of the polymerase chain reaction (PCR) and automated DNA sequencing methods have progressed to the point that today the entire genomic sequence of any organism can be obtained within a very short time. As of this writing, the [http://www.genomesonline.org GOLD database] reports more than 900 organisms, including completely sequenced genomes and genomes for which sequencing is in progress. For more than 800 genomes the (partial) sequence is already available in the [http://www.ncbi.nlm.nih.gov/About/tools/index.html NCBI databases].

== Genome sequence analysis ==

All efforts for a complete analysis of almost every genome start by reading the DNA sequence of the whole organism. Ideally, the complete and correct order of the four bases A, T, G, and C has to be determined before any further research can be initiated (i.e. the complete and correct DNA sequence is vital for a correct gene prediction based on characteristic DNA features). Nowadays, whole genome sequencing is done either by a hierarchical (map based) sequencing approach (see Figure 1) or by whole genome shotgun sequencing (see Figure 2).

[[File:CloneByClone.png]]

Figure 1: The hierarchical sequencing strategy first splits the genome into pieces of approximately 40 to 200 kb. These pieces are then cloned into ''large insert libraries'' (e.g. BACs, YACs, cosmids, fosmids). From the huge number of insert clones a ''minimal tiling path'' is created, selecting a subset of clones that cover the genome with minimal overlap between the individual clones. Since a map of clones is used, this approach is sometimes referred to as ''map based shotgun''. The individual clones are sequenced using a shotgun approach for each one.

[[File:WholeGenomeShotgun.png]]

Figure 2: For whole genome shotgun sequencing, the genome is split into a multitude of fragments of approximately 1 to 12 kb (shotgun phase). The resulting fragments are then cloned into sequencing vectors and transformed into bacterial cells (usually E. coli). The so-called vectors are small replicons that include a "multiple cloning site" where the fragments can be inserted. The fragment is thus flanked by the well known sequence of the vector, and this sequence can be used to define a sequencing primer. This primer binds to the DNA of the vector. Two primers are used, yielding two sequences per "insert", a forward and a reverse sequence. Then the resulting DNA sequences can be assembled. Using overlaps between the individual sequences, an attempt is made to determine the genomic sequence from the sets of fragments.

While the hierarchical approach first splits up the genomic DNA into a set of clones which have to be ordered based on their overlapping ends along the minimal tiling path, the shotgun approach simply cuts the whole genome into a large number of small fragments which are then sequenced and re-assembled.
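
The idea of re-assembling fragments from their overlaps can be illustrated with a toy example. The following Python sketch merges two error-free reads that share an exact overlap. It is a strongly simplified illustration with hypothetical function names and made-up sequences: real assembly programs additionally have to deal with sequencing errors, reverse complements, and repeats.

```python
def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_length=3):
    """Merge two reads along their overlap, or return None if there is none."""
    length = overlap_length(left, right, min_length)
    if length == 0:
        return None
    return left + right[length:]


# Two made-up reads that overlap in the four bases GTAC.
merged = merge_reads("ATGCGTAC", "GTACCTGA")
```

The parameter <tt>min_length</tt> guards against merging reads on the basis of very short overlaps that are likely to occur by chance.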

The whole genome shotgun approach in particular depends on efficient assembly algorithms and requires considerable hardware and software support. In general, minimizing the manual effort for the shotgun approach by automated high-throughput sequencing pipelines has greatly decreased the cost of whole genome sequencing projects. After the sequencing and assembly phase, the obtained genomic sequence (usually a small number of contigs) has to be finished by closing the gaps between the contigs. Furthermore, the genome has to be polished in order to improve the quality of the consensus sequence. Finally, the complete genomic DNA sequence is ideally obtained in a single large contig as a basis for all further research. Although the completion of the sequencing phase in a genome project is always an important step towards understanding the genome and the basic genetic principles behind it, the DNA sequence is actually just the starting point for large scale downstream analysis.

== Finding genes - region prediction ==

The first step towards a detailed analysis of the DNA sequence in any genome is the identification of potentially functional regions like protein coding sequences (CDS) and other functional non-coding genes like transfer RNAs (tRNAs), ribosomal RNA genes (rRNAs), ribosomal binding sites (RBS), etc. The prediction of such regions can be considered the most important task, which has led to the development of various approaches for gene prediction.

Due to their coding potential, the protein coding sequences in a bacterial genome typically exhibit certain characteristic sequence properties which distinguish them from non-coding Open Reading Frames (ORFs) in the sequence. An additional useful property for gene identification is sequence homology of a potential coding region to genes of other organisms. Ab initio or intrinsic gene-finders exclusively use the statistical analysis of sequence properties (e.g. Hidden Markov Models) to distinguish real protein coding CDSs from ORFs. Examples of such ab initio gene-finders for prokaryotic sequence data are Glimmer (Gene Locator and Interpolated Markov Modeler) and ZCURVE. Programs like Critica (Coding Region Identification Tool Invoking Comparative Analysis) and Orpheus, which additionally use homology-based information for gene prediction, are also called extrinsic gene-finders.

[[File:RegionPrediction.png]]

Figure 3: Prediction of functional regions. Protein coding sequences (CDS) as well as other functional non-coding genes (tRNAs, rRNAs, promoters, terminators, etc.) can be identified by analyzing characteristic sequence properties.

For the prediction of other non-coding regions of interest such as tRNAs, rRNAs, signal peptides, etc. a number of tools exist at different levels of quality (tRNAscan-SE, SignalP, helix-turn-helix, TMHMM, etc.). Some of the obtained predictions are also strongly related to functional assignments for the identified regions so that it is not always possible to clearly distinguish the prediction of region and function.

An objective evaluation of the predictive accuracy of different gene-finders is difficult since an experimentally verified annotation for all genes of a bacterial genome does not yet exist (even for E. coli, only a few hundred genes have been verified experimentally by now). Therefore, the current state-of-the-art is the comparison with available genome annotation data, which more or less reflects the manual annotation work of human experts. The reliability of these kinds of annotations varies, however, and depends heavily on the methods used and the manual effort involved in the annotation process. Furthermore, the state of the experimental knowledge concerning the respective organism differs quite a lot and thus reflects a certain degree of reliability for a given annotation. Nevertheless, the success of one or another gene prediction strategy can be evaluated to some degree by comparing the number of predicted genes to the number of genes found in an existing annotation and by calculating the selectivity and sensitivity for the gene numbers obtained.
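
The comparison of a gene prediction with an existing annotation can be expressed in a few lines. The following Python sketch is an illustration under stated assumptions: genes are represented as (start, stop, strand) tuples and only count as found if they match exactly, sensitivity is taken as the fraction of annotated genes that were predicted, and selectivity as the fraction of predicted genes that are present in the annotation. The text above does not define these terms, so other definitions are possible. The function name and the coordinates are made up.

```python
def prediction_accuracy(predicted, reference):
    """Return (sensitivity, selectivity) of a prediction against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Fraction of reference genes that were found by the gene-finder.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Fraction of predicted genes that are confirmed by the reference.
    selectivity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, selectivity


# Made-up coordinates: two of three annotated genes are found,
# and two of four predictions are confirmed.
reference_genes = {(1, 300, "+"), (400, 900, "-"), (1600, 1900, "-")}
predicted_genes = {(1, 300, "+"), (400, 900, "-"), (1000, 1500, "+"), (2000, 2300, "+")}
sensitivity, selectivity = prediction_accuracy(predicted_genes, reference_genes)
```

Requiring exact matches is a strict choice. A more lenient comparison could, for example, accept predictions that share the stop position with an annotated gene but differ in the start position.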

== Prediction of functions ==

After identifying the regions of interest in the genomic sequence, researchers find themselves confronted with the challenging task of assigning potential functions and biological meaning to more or less unimposing parts in the genomic sequence. Since the cost and manual effort for detailed wet lab experiments on each of these regions would clearly exceed the resources of every genome project, bioinformatics tools have been implemented that allow an automated prediction of potential gene functions.

Many of these tools rely on different strategies that compare unknown sequences to DNA or protein sequences that have already been determined by researchers in the past 20 years. Almost all of these sequences have been deposited in a number of so-called sequence databases (from a computer scientist's point of view these are merely data collections). The most current list of these sequence repositories can be found either in the first issue of NAR (Nucleic Acids Research) each year or on the web via one of the different sequence retrieval servers (e.g. via this [http://srs.ebi.ac.uk SRS server]).

While we can easily query these sequence databases for a gene with a specific name, the naming of genes is by no means consistent and each gene may have several names. So one reason for doing database searches based on sequence similarity is the chaotic state of the sequence databases.

The most important reason for performing similarity searches is the determination of putative functions for newly sequenced stretches of DNA. By comparing the new sequences to the databases of "well known" sequences and their "annotations", we can derive a putative gene function.

If we find a database "match" for a new sequence, we can assume that the function of our new sequence may in fact be related to that of our match. This is based on a dictum by Carl Woese who stated that:

* Two proteins of identical function will have a similar protein structure, because protein structure determines the protein function.
* Two proteins of similar structure will have similar amino acid sequences.
* Two similar amino acid sequences will have some degree of DNA sequence similarity.
* Thus from a similar DNA or amino acid sequence a similar protein function might be inferred.

Although this is true for many proteins, it should be clearly stated that even small changes in the DNA sequence can render the gene product useless or completely change its function. In contrast to similarity in function, the term homology indicates a genetic relationship based on correspondence or relation in the type of a structure (here in the DNA or amino-acid sequence itself).

Unfortunately, a "match" in a DNA or protein database needs to be interpreted; the uninitiated may mistake a chance hit (the databases are very large) for a meaningful "match".

Prominent and commonly applied tools like BLAST or FASTA compare the DNA or amino-acid query sequence with huge databases of known sequences by computing alignments. The results of these tools are supposed to reflect the degree of similarity between two genes in different organisms, following the thesis that the same (or a similar) gene function should have an (almost) identical underlying genomic sequence. Although these comparisons often reveal the homology among evolutionarily related organisms, the results have to be interpreted carefully since they can only be as reliable as the database entry itself.

Other tools like Pfam, Blocks, iPSORT, and PROSITE are based on (manually) curated motif or domain databases that allow the classification of proteins based on hidden Markov models and other techniques. Recently developed tools like InterPro also combine the results of several other applications, thus trying to compute more reliable and quite exact predictions that classify partial genomic sequences.

== Genome annotation ==

Annotation is generally thought to possess best quality when performed by a human expert. The large amounts of data which have to be evaluated in any whole-genome annotation project, however, have led to the (partial) automation of the procedure. Hence, software assistance for computation, storage, retrieval, and analysis of relevant data has become essential for the success of any genome project. Genome annotation can be done automatically (e.g. by using the "best Blast hit") or manually. The latter is supposed to possess a higher quality but on the other hand takes much more time. However, to be sure about the "real biological function", each annotation of a gene would have to be confirmed by wet lab experiments.

[[File:GenDBFlowDetail.png]]

Figure 4: Traditional flowchart of a genome annotation pipeline. The process of genome annotation can be defined as assigning a meaning to sequence data that would otherwise be almost devoid of information. By identifying regions of interest and defining putative functions for those areas, the genome can be understood and further research may be initiated. Since genome annotation is a dynamic process, the arrows indicate different mutual influences between the different steps. For example, the region prediction (1), the computation of observations (5), and the annotation (4) depend on the quality of the sequence (because of frameshifts etc.). On the other hand, "surprising" observations (2) or inconsistencies that were discovered during the annotation (6) may require updates of the region prediction. Changes of a region will thus produce new observations which have to be considered carefully for a novel annotation (3).

Figure 4 shows the flowchart of an often employed genome annotation pipeline and also displays the interactions and dependencies between the individual steps: e.g. a correct gene prediction depends heavily on the quality of the genomic sequence. Conversely, questionable predictions of regions can help to identify sequencing errors (e.g. frameshifts) that require further improvement of the sequence itself in some positions.

Another important aspect for the success of any genome annotation project is the use of a consistent nomenclature when assigning gene names. Comparing just a few existing genome annotations shows that there is no commonly used systematic naming scheme: for example, the genes coding for the enzyme homoserine dehydrogenase are named completely differently in the corresponding annotations for E. coli (THRA or THRA1 or THRA2 or B0002), B. subtilis (HOM or TDM), and S. cerevisiae (HOM6 or YJR139C or J2132) as illustrated in figure 5.

[[File:HomoserineDehydrogenase.png]]

Figure 5: Searching for a homoserine dehydrogenase in the database using the SRS system results in a number of hits for various organisms. The hits shown here illustrate that for only three organisms 9 different gene names were assigned.

They can only be identified as the same encoded enzyme because each database entry is additionally mapped onto the same enzyme classification number EC 1.1.1.3. This inconsistent naming not only prevents simple comparisons between different organisms but also complicates the identification of genes with the same or similar function. Using a standardized vocabulary like the Gene Ontology (GO) might therefore be one of the most fruitful efforts towards a unified standard for genome annotations.
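The role of the EC number as a common key can be illustrated with a short Python sketch. The gene names and the EC number are the ones mentioned above; the tuple layout and the function name <tt>group_by_ec</tt> are assumptions made for this example and do not reflect the SRS or GenDB data structures.

```python
from collections import defaultdict

# (organism, gene name, EC number) for homoserine dehydrogenase, as listed in the text.
entries = [
    ("E. coli", "THRA", "1.1.1.3"),
    ("E. coli", "THRA1", "1.1.1.3"),
    ("E. coli", "THRA2", "1.1.1.3"),
    ("E. coli", "B0002", "1.1.1.3"),
    ("B. subtilis", "HOM", "1.1.1.3"),
    ("B. subtilis", "TDM", "1.1.1.3"),
    ("S. cerevisiae", "HOM6", "1.1.1.3"),
    ("S. cerevisiae", "YJR139C", "1.1.1.3"),
    ("S. cerevisiae", "J2132", "1.1.1.3"),
]


def group_by_ec(entries):
    """Group (organism, gene name) pairs by their EC number."""
    groups = defaultdict(set)
    for organism, gene_name, ec_number in entries:
        groups[ec_number].add((organism, gene_name))
    return dict(groups)


groups = group_by_ec(entries)
names_for_enzyme = sorted(name for _, name in groups["1.1.1.3"])
```

Grouping by gene name alone would yield nine unrelated entries; only the shared classification number reveals that they describe the same enzyme.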

File:AnnotationPipeline.png

2011-10-28T12:18:06Z

Tk:

GenDBWiki/TermsAndConcepts/GenDBAnnotationWorkflow

2011-10-28T12:17:44Z

Tk: Created page with "= A typical GenDB Annotation Workflow = The GenDB-2 system features all steps for the analysis and annotation of bacterial genomes starting from the raw contig sequence. The fig..."

= A typical GenDB Annotation Workflow =

The GenDB-2 system features all steps for the analysis and annotation of bacterial genomes starting from the raw contig sequence. The figure below shows an example for a genome annotation pipeline that has been implemented with GenDB. Upon import of the raw sequence data, a parent region object describing the genome sequence is created. Following this step, user-defined tools for the prediction of different kinds of regions, such as coding sequences (CDS) or tRNA-encoding genes, can be run. The output of these tools is stored as observations which refer to the parent region object. Based on these observations, an annotator, human or machine, performs a ''region annotation''. This means confirming or rejecting the results of gene prediction tools by creating region objects like CDSs or tRNAs. The annotations form a complete protocol of all ''region annotation'' events. Following the creation of different kinds of regions, additional tools such as BLAST, HMMer, or CoBias can be run, creating information related to their potential function. Each of these tools can have its own automatic annotator that creates a very simple annotation based solely on the results of a single tool run. After computing a number of standard tools, a more sophisticated automatic annotation can be accomplished by combining the results of different tools. Finally, a manual ''function annotation'' step can be performed by an annotator in which a putative function is assigned to these regions by an interpretation of the observations (see below).

[[File:AnnotationPipeline.png]]

This standard sample pipeline implemented with GenDB-2 starts with an import of a contig sequence. Afterwards, regions are predicted and created by a regional annotation (''Annotation::Region''). A biological function for these regions can then be assigned by computing different bioinformatics tools that often generate large numbers of observations. Based on these results an automatic or manual functional annotation (''Annotation::Function'') can be assigned.

The current Gtk version of the GenDB system features a graphical interface (''Annotation Pipeline Wizard'') for the configuration of different individual pipelines. The user can choose one or more steps (''Import'', ''Edit Sequence'', ''Region Prediction'' and ''Function Prediction'') which are then combined into a separate pipeline. After some initial configuration, the pipeline is submitted as a special job and the corresponding steps are executed in the specified order without any further user interaction. Using these pipelines allows a very comfortable automated annotation and increases the productivity in large-scale genome annotation projects.

Nevertheless, it is still a laborious task to manually check the predicted regions and their function assignments. Both GenDB frontends therefore provide almost identical wizards for editing the start of a gene and annotation interfaces that allow recording a comprehensive set of information about each region. Since the final manual annotation of a genome is not only the most time-consuming step but also the most error-prone one, exactly defined rules and guidelines for annotating a gene are essential in order to prevent inconsistent entries.

GenDBWiki/DataModel

2011-10-28T12:16:03Z

Tk: added images

= The GenDB Data Model =

GenDB is based on a data model with three core types of objects. ''Regions'' describe arbitrary (sub-) sequences. A region can be related to a parent region, e.g. a CDS is part of a contig. ''Observations'' correspond to information computed by various tools (e.g. Blast or InterPro) for those regions. ''Annotations'' store the interpretation of a (human) annotator. They describe regions based on the evidence stored in the observations.
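The relationship between the three core object types can be sketched in a few lines of Python. This is purely illustrative: the class and attribute names below are assumptions made for this example and are not the actual GenDB classes, which are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """Result computed by a tool for a region (hypothetical fields)."""
    tool: str
    description: str


@dataclass
class Annotation:
    """Interpretation by an annotator, based on observations as evidence."""
    annotator: str
    description: str
    evidence: List[Observation] = field(default_factory=list)


@dataclass
class Region:
    """Arbitrary (sub-)sequence; may refer to a parent region."""
    name: str
    start: int
    stop: int
    region_type: str
    parent: Optional["Region"] = None
    observations: List[Observation] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


# A CDS is part of a contig (names and coordinates are made up).
contig = Region("contig_1", 1, 5000, "contig")
cds = Region("cds_0001", 100, 400, "CDS", parent=contig)

# A tool result is stored as an observation for the region ...
hit = Observation(tool="Blast", description="placeholder hit description")
cds.observations.append(hit)

# ... and the annotator's interpretation refers to it as evidence.
cds.annotations.append(
    Annotation(annotator="example_user", description="putative function", evidence=[hit])
)
```

Keeping observations and annotations in separate classes mirrors the distinction between tool results and their interpretation.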

[[File:GenDB-UML-DataModel.jpg]]

The figure shown above illustrates the relationships between the different core objects. As can be seen, there is a clear distinction between the results from various bioinformatics tools (observations) and their interpretation (annotations) which was implemented in the data model. While this data model seems very generic, it represents a hierarchy of classes, including the complete EMBL feature set for prokaryotes with several extensions as illustrated in the figure below.

[[File:GenDB-Regions.jpg]]

to be continued ...

File:GenDB-Regions.jpg

2011-10-28T12:15:11Z

Tk:

File:GenDB-UML-DataModel.jpg

2011-10-28T12:14:31Z

Tk:

GenDBWiki/DataModel

2011-10-28T12:11:58Z

Tk: Created page with "= The GenDB Data Model = Gen``DB is based on a data model with three core types of objects. ''Regions'' describe arbitrary (sub-) sequences. A region can be related to a parent ..."

= The GenDB Data Model =

GenDB is based on a data model with three core types of objects. ''Regions'' describe arbitrary (sub-) sequences. A region can be related to a parent region, e.g. a CDS is part of a contig. ''Observations'' correspond to information computed by various tools (e.g. Blast or InterPro) for those regions. ''Annotations'' store the interpretation of a (human) annotator. They describe regions based on the evidence stored in the observations.

attachment:GenDB-UML-DataModel.jpg

The figure shown above illustrates the relationships between the different core objects. As can be seen, there is a clear distinction between the results from various bioinformatics tools (observations) and their interpretation (annotations) which was implemented in the data model. While this data model seems very generic, it represents a hierarchy of classes, including the complete EMBL feature set for prokaryotes with several extensions as illustrated in the figure below.

attachment:GenDB-Regions.jpg

to be continued ...

GenDBWiki/IntroductionToGenomics

2011-10-28T12:10:00Z

Tk: /* Prediction of functions */

= A brief Introduction to Genomics Research =

After James D. Watson and Francis H. C. Crick described the structure of the DNA helix in 1953, the basic mechanisms of DNA replication and recombination, protein synthesis, and gene expression were rapidly unravelled. Technological advances like the invention of the polymerase chain reaction (PCR) and automated DNA sequencing methods have progressed to the point that today the entire genomic sequence of any organism can be obtained in a short time. As of this writing, the [http://www.genomesonline.org GOLD database] reports more than 900 organisms, including completely sequenced genomes and genomes for which sequencing is in progress. For more than 800 genomes the (partial) sequence is already available in the [http://www.ncbi.nlm.nih.gov/About/tools/index.html NCBI databases].

== Genome sequence analysis ==

All efforts for a complete analysis of almost every genome start by reading the DNA sequence of the whole organism. Ideally, the complete and correct order of the four bases A, T, G, and C has to be determined before any further research can be initiated (i.e. the complete and correct DNA sequence is vital for a correct gene prediction based on characteristic DNA features). Nowadays, whole genome sequencing is done either by a hierarchical (map based) sequencing approach (see figure 1) or by whole genome shotgun sequencing (see figure 2).

attachment:CloneByClone.png

Figure 1: The hierarchical sequencing strategy first splits the genome into pieces of approximately 40 to 200 kb. These pieces are then cloned into ''large insert libraries'' (e.g. BACs, YACs, cosmids, fosmids). From the huge number of insert clones a ''minimal tiling path'' is created, selecting a subset of clones that cover the genome with minimal overlap between the individual clones. Since a map of clones is used, this approach is sometimes referred to as ''map based shotgun''. The individual clones are sequenced using a shotgun approach for each one.

attachment:WholeGenomeShotgun.png

Figure 2: For whole genome shotgun sequencing, the genome is split into a multitude of fragments of approximately 1 to 12 kb (shotgun phase). The resulting fragments are then cloned into sequencing vectors and transformed into bacterial cells (usually E. coli). The so-called vectors are small replicons that include a "multiple cloning site" where the fragments can be inserted. The fragment is thus flanked by the well known sequence of the vector and this sequence can be used to define a sequencing primer. This primer binds to the DNA of the vector. Two primers are used, yielding two sequences per "insert", a forward and a reverse sequence. Then the resulting DNA sequences can be assembled. Using overlaps between the individual sequences, an attempt is made to determine the genomic sequence from the sets of fragments.

While the hierarchical approach first splits up the genomic DNA into a set of clones which have to be ordered based on their overlapping ends along the minimal tiling path, the shotgun approach simply cuts the whole genome into a large number of small fragments which are then sequenced and re-assembled.

The whole genome shotgun approach in particular depends on efficient assembly algorithms and requires considerable hardware and software support. In general, minimizing the manual effort for the shotgun approach by automated high-throughput sequencing pipelines has greatly decreased the cost of whole genome sequencing projects. After the sequencing and assembly phase, the obtained genomic sequence (usually a small number of contigs) has to be finished by closing the gaps between the contigs. Furthermore, the genome has to be polished in order to improve the quality of the consensus sequence. Finally, the complete genomic DNA sequence is ideally obtained in a single large contig as a basis for all further research. Although the completion of the sequencing phase in a genome project is always an important step towards understanding the genome and the basic genetic principles behind it, the DNA sequence is actually just the starting point for large scale downstream analysis.

== Finding genes - region prediction ==

The first step towards a detailed analysis of the DNA sequence in any genome is the identification of potentially functional regions like protein coding sequences (CDS) and other functional non-coding genes like transfer RNAs (tRNAs), ribosomal RNA genes (rRNAs), ribosomal binding sites (RBS), etc. Thereby, the prediction of such regions can be considered the most important task leading to the development of various approaches for gene prediction.

Due to their coding potential, the protein coding sequences in a bacterial genome typically exhibit certain characteristic sequence properties which distinguish them from non-coding Open Reading Frames (ORFs) in the sequence. An additional useful property for gene identification is sequence homology of a potential coding region to genes of other organisms. Ab initio or intrinsic gene-finders exclusively use the statistical analysis of sequence properties (e.g. Hidden Markov Models) to distinguish real protein coding CDSs from ORFs. Examples of such ab initio gene-finders for prokaryotic sequence data are Glimmer (Gene Locator and Interpolated Context Modeller) and ZCURVE. Programs like Critica (Coding Region Identification Tool Invoking Comparative Analysis) and Orpheus, which additionally use homology-based information for gene prediction, are also called extrinsic gene-finders.

attachment:RegionPrediction.png

Figure 3: Prediction of functional regions. Protein coding sequences (CDS) as well as other functional non-coding genes (tRNAs, rRNAs, promoters, terminators, etc.) can be identified by analyzing characteristic sequence properties.

For the prediction of other non-coding regions of interest such as tRNAs, rRNAs, signal peptides, etc. a number of tools exist at different levels of quality (tRNAscan-SE, SignalP, helix-turn-helix, TMHMM, etc.). Some of the obtained predictions are also strongly related to functional assignments for the identified regions so that it is not always possible to clearly distinguish the prediction of region and function.

An objective evaluation of the predictive accuracy of different gene-finders is difficult since an experimentally verified annotation for all genes of a bacterial genome does not yet exist (even for E. coli, only a few hundred genes have been verified experimentally by now). Therefore, the current state-of-the-art is the comparison with available genome annotation data, which more or less reflects the manual annotation work of human experts. The reliability of these kinds of annotations varies, however, and depends heavily on the methods used and the manual effort involved in the annotation process. Furthermore, the state of the experimental knowledge concerning the respective organism differs quite a lot and thus reflects a certain degree of reliability for a given annotation. Nevertheless, the success of one or another gene prediction strategy can be evaluated to some degree by comparing the number of predicted genes to the number of genes found in an existing annotation and by calculating the selectivity and sensitivity for the gene numbers obtained.

== Prediction of functions ==

After identifying the regions of interest in the genomic sequence, researchers find themselves confronted with the challenging task of assigning potential functions and biological meaning to more or less unimposing parts in the genomic sequence. Since the cost and manual effort for detailed wet lab experiments on each of these regions would clearly exceed the resources of every genome project, bioinformatics tools have been implemented that allow an automated prediction of potential gene functions.

Many of these tools rely on different strategies that compare unknown sequences to DNA or protein sequences that have already been determined by researchers in the past 20 years. Almost all of these sequences have been deposited in a number of so-called sequence databases (from a computer scientist's point of view these are merely data collections). The most current list of these sequence repositories can be found either in the first issue of NAR (Nucleic Acids Research) each year or on the web via one of the different sequence retrieval servers (e.g. via this [http://srs.ebi.ac.uk SRS server]).

While we can easily query these sequence databases for a gene with a specific name, the naming of genes is by no means consistent and each gene may have several names. So one reason for doing database searches based on sequence similarity is the chaotic state of the sequence databases.

The most important reason for performing similarity searches is the determination of putative functions for newly sequenced stretches of DNA. By comparing the new sequences to the databases of "well known" sequences and their "annotations", we can derive a putative gene function.

If we find a database "match" for a new sequence, we can assume that the function of our new sequence may in fact be related to that of our match. This is based on a dictum by Carl Woese who stated that:

* Two proteins of identical function will have a similar protein structure, because protein structure determines the protein function.
* Two proteins of similar structure will have similar amino acid sequences.
* Two similar amino acid sequences will have some degree of DNA sequence similarity.
* Thus from a similar DNA or amino acid sequence a similar protein function might be inferred.

Although this is true for many proteins, it should be clearly stated that even small changes in the DNA sequence can render the gene product useless or completely change its function. In contrast to similarity in function, the term homology indicates a genetic relationship based on correspondence or relation in the type of a structure (here in the DNA or amino-acid sequence itself).

Unfortunately, a "match" in a DNA or protein database needs to be interpreted; the uninitiated may mistake a chance hit (the databases are very large) for a meaningful "match".

Prominent and commonly applied tools like BLAST or FASTA compare the DNA or amino-acid query sequence with huge databases of known sequences by computing alignments. The results of these tools are supposed to reflect the degree of similarity between two genes in different organisms, following the thesis that the same (or a similar) gene function should have an (almost) identical underlying genomic sequence. Although these comparisons often reveal the homology among evolutionarily related organisms, the results have to be interpreted carefully since they can only be as reliable as the database entry itself.

Other tools like Pfam, Blocks, iPSORT, and PROSITE are based on (manually) curated motif or domain databases that allow the classification of proteins based on hidden Markov models and other techniques. Recently developed tools like InterPro also combine the results of several other applications, thus trying to compute more reliable and quite exact predictions that classify partial genomic sequences.

== Genome annotation ==

Annotation is generally thought to possess best quality when performed by a human expert. The large amounts of data which have to be evaluated in any whole-genome annotation project, however, have led to the (partial) automation of the procedure. Hence, software assistance for computation, storage, retrieval, and analysis of relevant data has become essential for the success of any genome project. Genome annotation can be done automatically (e.g. by using the "best Blast hit") or manually. The latter is supposed to possess a higher quality but on the other hand takes much more time. However, to be sure about the "real biological function", each annotation of a gene would have to be confirmed by wet lab experiments.

attachment:GenDBFlowDetail.png

Figure 4: Traditional flowchart of a genome annotation pipeline. The process of genome annotation can be defined as assigning a meaning to sequence data that would otherwise be almost devoid of information. By identifying regions of interest and defining putative functions for those areas, the genome can be understood and further research may be initiated. Since genome annotation is a dynamic process, the arrows indicate different mutual influences between the different steps. For example, the region prediction (1), the computation of observations (5), and the annotation (4) depend on the quality of the sequence (because of frameshifts etc.). On the other hand, "surprising" observations (2) or inconsistencies that were discovered during the annotation (6) may require updates of the region prediction. Changes of a region will thus produce new observations which have to be considered carefully for a novel annotation (3).

Figure 4 shows the flowchart of an often employed genome annotation pipeline and also displays the interactions and dependencies between the individual steps: e.g. a correct gene prediction depends heavily on the quality of the genomic sequence. Conversely, questionable predictions of regions can help to identify sequencing errors (e.g. frameshifts) that require further improvement of the sequence itself in some positions.

Another important aspect for the success of any genome annotation project is the use of a consistent nomenclature when assigning gene names. Comparing just a few existing genome annotations shows that there is no commonly used systematic naming scheme: for example, the genes coding for the enzyme homoserine dehydrogenase are named completely differently in the corresponding annotations for E. coli (THRA or THRA1 or THRA2 or B0002), B. subtilis (HOM or TDM), and S. cerevisiae (HOM6 or YJR139C or J2132) as illustrated in figure 5.

attachment:HomoserineDehydrogenase.png

Figure 5: Searching for a homoserine dehydrogenase in the database using the SRS system results in a number of hits for various organisms. The hits shown here illustrate that for only three organisms 9 different gene names were assigned.

They can only be identified as the same encoded enzyme because each database entry is additionally mapped onto the same enzyme classification number EC 1.1.1.3. This inconsistent naming not only prevents simple comparisons between different organisms but also complicates the identification of genes with the same or similar function. Using a standardized vocabulary like the Gene Ontology (GO) might therefore be one of the most fruitful efforts towards a unified standard for genome annotations.

GenDBWiki/IntroductionToGenomics

2011-10-28T12:09:08Z

Tk: Created page with "= A brief Introduction to Genomics Research = After James D. Watson and Francis H. C. Crick described the structure of the DNA helix in 1953, the basic mechanisms of DNA replica..."

= A brief Introduction to Genomics Research =

After James D. Watson and Francis H. C. Crick described the structure of the DNA helix in 1953, the basic mechanisms of DNA replication and recombination, protein synthesis, and gene expression were rapidly unravelled. Technological advances like the invention of the polymerase chain reaction (PCR) and automated DNA sequencing methods have progressed to the point that today the entire genomic sequence of any organism can be obtained in a short time. As of this writing, the [http://www.genomesonline.org GOLD database] reports more than 900 organisms, including completely sequenced genomes and genomes for which sequencing is in progress. For more than 800 genomes the (partial) sequence is already available in the [http://www.ncbi.nlm.nih.gov/About/tools/index.html NCBI databases].

== Genome sequence analysis ==

All efforts for a complete analysis of almost every genome start by reading the DNA sequence of the whole organism. Ideally, the complete and correct order of the four bases A, T, G, and C has to be determined before any further research can be initiated (i.e. the complete and correct DNA sequence is vital for a correct gene prediction based on characteristic DNA features). Nowadays, whole genome sequencing is done either by a hierarchical (map based) sequencing approach (see figure 1) or by whole genome shotgun sequencing (see figure 2).

attachment:CloneByClone.png

Figure 1: The hierarchical sequencing strategy first splits the genome into pieces of approximately 40 to 200 kb. These pieces are then cloned into ''large insert libraries'' (e.g. BACs, YACs, cosmids, fosmids). From the huge number of insert clones a ''minimal tiling path'' is created, selecting a subset of clones that cover the genome with minimal overlap between the individual clones. Since a map of clones is used, this approach is sometimes referred to as ''map based shotgun''. The individual clones are sequenced using a shotgun approach for each one.

attachment:WholeGenomeShotgun.png

Figure 2: For whole genome shotgun sequencing, the genome is split into a multitude of fragments of approximately 1 to 12 kb (shotgun phase). The resulting fragments are then cloned into sequencing vectors and transformed into bacterial cells (usually E. coli). The so-called vectors are small replicons that include a "multiple cloning site" where the fragments can be inserted. The fragment is thus flanked by the well known sequence of the vector and this sequence can be used to define a sequencing primer. This primer binds to the DNA of the vector. Two primers are used, yielding two sequences per "insert", a forward and a reverse sequence. Then the resulting DNA sequences can be assembled. Using overlaps between the individual sequences, an attempt is made to determine the genomic sequence from the sets of fragments.

While the hierarchical approach first splits up the genomic DNA into a set of clones which have to be ordered based on their overlapping ends along the minimal tiling path, the shotgun approach simply cuts the whole genome into a large number of small fragments which are then sequenced and re-assembled.

The whole genome shotgun approach in particular depends on efficient assembly algorithms and requires considerable hardware and software support. In general, minimizing the manual effort for the shotgun approach by automated high-throughput sequencing pipelines has greatly decreased the cost of whole genome sequencing projects. After the sequencing and assembly phase, the obtained genomic sequence (usually a small number of contigs) has to be finished by closing the gaps between the contigs. Furthermore, the genome has to be polished in order to improve the quality of the consensus sequence. Finally, the complete genomic DNA sequence is ideally obtained in a single large contig as a basis for all further research. Although the completion of the sequencing phase in a genome project is always an important step towards understanding the genome and the basic genetic principles behind it, the DNA sequence is actually just the starting point for large scale downstream analysis.

== Finding genes - region prediction ==

The first step towards a detailed analysis of the DNA sequence of any genome is the identification of potentially functional regions such as protein coding sequences (CDS) and other functional elements like transfer RNA genes (tRNAs), ribosomal RNA genes (rRNAs), ribosomal binding sites (RBS), etc. The prediction of such regions can be considered the most important task, which has led to the development of various approaches for gene prediction.

Due to their coding potential, the protein coding sequences in a bacterial genome typically exhibit certain characteristic sequence properties which distinguish them from non-coding Open Reading Frames (ORFs) in the sequence. An additional useful property for gene identification is sequence homology of a potential coding region to genes of other organisms. Ab initio or intrinsic gene-finders exclusively use the statistical analysis of sequence properties (e.g. Hidden Markov Models) to distinguish real protein coding CDSs from ORFs. Examples of such ab initio gene-finders for prokaryotic sequence data are Glimmer (Gene Locator and Interpolated Markov ModelER) and ZCURVE. Programs like Critica (Coding Region Identification Tool Invoking Comparative Analysis) and Orpheus, which additionally use homology-based information for gene prediction, are also called extrinsic gene-finders.
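To make the distinction between an ORF and a predicted CDS more concrete, the following minimal sketch enumerates candidate ORFs, which a gene-finder would then have to score. It is not how Glimmer, ZCURVE, Critica, or Orpheus work internally. It assumes that an ORF runs from an ATG start codon to the next in-frame stop codon and scans only the three forward frames; alternative start codons and the reverse strand are ignored. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
from __future__ import annotations

START = "ATG"                   # assumption: only ATG counts as a start codon
STOPS = {"TAA", "TAG", "TGA"}   # stop codons of the standard genetic code


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for each forward-strand ORF.

    `start` is the 0-based position of the ATG; `end` is exclusive and
    includes the stop codon. ORFs with fewer than `min_codons` codons
    before the stop are skipped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START:
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None
    return orfs


for frame, start, end in find_orfs("CCATGAAACCCGGGTAAGG"):
    print(frame, start, end)
```

Every ORF found this way is only a candidate; deciding which candidates are real protein coding sequences is exactly the task of the intrinsic and extrinsic gene-finders described above.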

attachment:RegionPrediction.png

Figure 3: Prediction of functional regions. Protein coding sequences (CDS) as well as other functional non-coding elements (tRNAs, rRNAs, promoters, terminators, etc.) can be identified by analyzing characteristic sequence properties.

For the prediction of other regions of interest such as tRNAs, rRNAs, signal peptides, etc. a number of tools of varying quality exist (tRNAscan-SE, SignalP, helix-turn-helix, TMHMM, etc.). Some of the obtained predictions are also strongly related to functional assignments for the identified regions, so that it is not always possible to clearly distinguish between the prediction of a region and the prediction of its function.

An objective evaluation of the predictive accuracy of different gene-finders is difficult since an experimentally verified annotation for all genes of a bacterial genome does not yet exist (even for E. coli, only a few hundred genes have been verified experimentally so far). Therefore, the current state of the art is the comparison with available genome annotation data, which more or less reflects the manual annotation work of human experts. The reliability of these kinds of annotations varies, however, and depends heavily on the methods used and the manual effort involved in the annotation process. Furthermore, the state of experimental knowledge differs considerably between organisms, which also affects the reliability of a given annotation. Nevertheless, the success of one or another gene prediction strategy can be evaluated to some degree by comparing the number of predicted genes to the number of genes found in an existing annotation and by calculating the selectivity and sensitivity for the gene numbers obtained.
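The comparison described above can be sketched in a few lines. The definitions used here are an assumption, since the text does not spell them out: sensitivity is taken as the fraction of annotated genes that were predicted, and selectivity as the fraction of predicted genes that are found in the annotation. Genes are represented as hypothetical `(start, stop, strand)` tuples and must match exactly to count as a hit; the coordinates below are made up for illustration.

```python
from __future__ import annotations


def evaluate(predicted: set[tuple[int, int, str]],
             annotated: set[tuple[int, int, str]]) -> dict[str, float]:
    """Compare predicted genes with a reference annotation.

    Assumed definitions (not taken from the original text):
      sensitivity = hits / number of annotated genes
      selectivity = hits / number of predicted genes
    """
    hits = len(predicted & annotated)
    return {
        "hits": hits,
        "sensitivity": hits / len(annotated) if annotated else 0.0,
        "selectivity": hits / len(predicted) if predicted else 0.0,
    }


# Made-up coordinates, purely for illustration.
annotated = {(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+"), (1500, 1800, "+")}
predicted = {(100, 400, "+"), (500, 900, "-"), (2000, 2300, "+")}
print(evaluate(predicted, annotated))
```

In practice a looser matching rule is often more appropriate, for example counting a prediction as correct when only the stop position and strand agree, because the exact start codon is frequently uncertain even in curated annotations.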

== Prediction of functions ==

After identifying the regions of interest in the genomic sequence, researchers find themselves confronted with the challenging task of assigning potential functions and biological meaning to more or less inconspicuous parts of the genomic sequence. Since the cost and manual effort for detailed wet lab experiments on each of these regions would clearly exceed the resources of any genome project, bioinformatics tools have been implemented that allow an automated prediction of potential gene functions.

Many of these tools rely on different strategies that compare unknown sequences to DNA or protein sequences that have already been determined by researchers in the past 20 years. Almost all of these sequences have been deposited in a number of so-called sequence databases (from a computer scientist's point of view these are merely data collections). The most current list of these sequence repositories can be found either in the first issue of NAR (Nucleic Acids Research) each year or on the web via one of the different sequence retrieval servers (e.g. via this [http://srs.ebi.ac.uk SRS server]).

While we can easily query these sequence databases for a gene with a specific name, the naming of genes is by no means consistent and each gene may have several names. So one reason for doing database searches based on sequence similarity is the chaotic state of the sequence databases.

The most important reason for performing similarity searches is the determination of putative functions for newly sequenced stretches of DNA. By comparing the new sequences to the databases of "well known" sequences and their "annotations", we can derive a putative gene function.

If we find a database "match" for a new sequence, we can assume that the function of our new sequence may in fact be related to that of our match. This is based on a dictum by Carl Woese who stated that:

* Two proteins of identical function will have a similar protein structure, because protein structure determines the protein function.
* Two proteins of similar structure will have similar amino acid sequences.
* Two similar amino acid sequences will have some degree of DNA sequence similarity.
* Thus from a similar DNA or amino acid sequence a similar protein function might be inferred.

Although this is true for many proteins, it should be clearly stated that even small changes in the DNA sequence can render the gene product useless or completely change its function. In contrast to similarity in function, the term homology indicates a genetic relationship based on correspondence or relation in the type of a structure (here in the DNA or amino-acid sequence itself).

Unfortunately, a "match" in a DNA or protein database needs to be interpreted; the uninitiated may mistake a chance hit (the databases are very large) for a meaningful "match".

Prominent and commonly applied tools like BLAST or FASTA compare a DNA or amino-acid query sequence with huge databases of already known sequences by computing alignments. The results of these tools are supposed to reflect the degree of similarity between two genes in different organisms, following the assumption that the same (or a similar) gene function should have an (almost) identical underlying sequence. Although these comparisons often reveal homology among evolutionarily related organisms, the results have to be interpreted carefully since they can only be as reliable as the database entry itself.
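To give an impression of what "computing alignments" means, the sketch below calculates the score of the best local alignment between two sequences with the Smith-Waterman dynamic programming algorithm. This is not the BLAST or FASTA implementation: both tools use faster heuristics that approximate this kind of exhaustive search. The scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions chosen for illustration; real searches use substitution matrices and separate gap-open and gap-extension penalties.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between `a` and `b`.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming matrix, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A raw score like this says nothing about significance on its own. Tools such as BLAST therefore also report statistics that estimate how likely a hit of that score is to occur by chance in a database of the given size, which is the basis for telling a chance hit from a meaningful match.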

Other tools like Pfam, Blocks, iPSORT, and PROSITE are based on (manually) curated motif or domain databases that allow the classification of proteins based on hidden Markov models and other techniques. More recently developed tools like InterPro combine the results of several other applications in an attempt to compute more reliable and more precise classifications of partial genomic sequences.

== Genome annotation ==

Annotation is generally thought to be of the highest quality when performed by a human expert. The large amounts of data that have to be evaluated in any whole-genome annotation project, however, have led to the (partial) automation of the procedure. Hence, software assistance for computation, storage, retrieval, and analysis of relevant data has become essential for the success of any genome project. Genome annotation can be done automatically (e.g. by using the "best BLAST hit") or manually. Manual annotation is expected to be of higher quality but takes much more time. However, to be sure about the "real biological function", each annotation of a gene would have to be confirmed by wet lab experiments.

attachment:GenDBFlowDetail.png

Figure 4: Traditional flowchart of a genome annotation pipeline. The process of genome annotation can be defined as assigning a meaning to sequence data that would otherwise be almost devoid of information. By identifying regions of interest and defining putative functions for those areas, the genome can be understood and further research may be initiated. Since genome annotation is a dynamic process, the arrows indicate different mutual influences between the different steps. For example, the region prediction (1), the computation of observations (5), and the annotation (4) depend on the quality of the sequence (because of frameshifts etc.). On the other hand, "surprising" observations (2) or inconsistencies that were discovered during the annotation (6) may require updates of the region prediction. Changes of a region will thus produce new observations which have to be considered carefully for a novel annotation (3).

Figure 4 shows the flowchart of a commonly employed genome annotation pipeline and also displays the interactions and dependencies between the individual steps: for example, a correct gene prediction depends heavily on the quality of the genomic sequence. Conversely, questionable region predictions can help to identify sequencing errors (e.g. frameshifts) that require further improvement of the sequence itself at some positions.

Another important aspect for the success of any genome annotation project is the use of a consistent nomenclature when assigning gene names. Comparing just a few existing genome annotations shows that there is no commonly used systematic naming scheme: for example, the genes coding for the enzyme homoserine dehydrogenase are named completely differently in the corresponding annotations for E. coli (THRA or THRA1 or THRA2 or B0002), B. subtilis (HOM or TDM), and S. cerevisiae (HOM6 or YJR139C or J2132), as illustrated in Figure 5.

attachment:HomoserineDehydrogenase.png

Figure 5: Searching for a homoserine dehydrogenase in the database using the SRS system results in a number of hits for various organisms. The hits shown here illustrate that for only three organisms 9 different gene names were assigned.

These genes can only be identified as encoding the same enzyme because each database entry is additionally mapped onto the same enzyme classification number, EC 1.1.1.3. Such inconsistent naming not only prevents simple comparisons between different organisms but also complicates the identification of genes with the same or similar function. Using a standardized vocabulary like the Gene Ontology (GO) might therefore be one of the most fruitful efforts towards a unified standard for genome annotations.
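The benefit of a shared identifier can be shown with the gene names mentioned above. The following minimal sketch groups the names by EC number instead of by name. The flat `(organism, gene name, EC number)` record layout and the function name `group_by_ec` are assumptions made for this example and do not reflect the format of any particular database or of the SRS output.

```python
from __future__ import annotations

from collections import defaultdict

# Gene names for homoserine dehydrogenase as listed in the text above.
# The (organism, gene name, EC number) layout is a hypothetical record format.
ENTRIES = [
    ("E. coli", "THRA", "1.1.1.3"),
    ("E. coli", "THRA1", "1.1.1.3"),
    ("E. coli", "THRA2", "1.1.1.3"),
    ("E. coli", "B0002", "1.1.1.3"),
    ("B. subtilis", "HOM", "1.1.1.3"),
    ("B. subtilis", "TDM", "1.1.1.3"),
    ("S. cerevisiae", "HOM6", "1.1.1.3"),
    ("S. cerevisiae", "YJR139C", "1.1.1.3"),
    ("S. cerevisiae", "J2132", "1.1.1.3"),
]


def group_by_ec(entries: list[tuple[str, str, str]]) -> dict[str, dict[str, list[str]]]:
    """Map each EC number to the gene names used per organism."""
    grouped = defaultdict(lambda: defaultdict(list))
    for organism, gene_name, ec_number in entries:
        grouped[ec_number][organism].append(gene_name)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {ec: dict(orgs) for ec, orgs in grouped.items()}


for ec, organisms in group_by_ec(ENTRIES).items():
    for organism, names in organisms.items():
        print(f"EC {ec}  {organism}: {', '.join(names)}")
```

A lookup by gene name would need to know all nine spellings in advance, whereas a lookup by EC number finds every entry with a single key.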

GenDBWiki/TermsAndConcepts

2011-10-28T12:08:55Z

Tk: fixed links

__NOTOC__
= GenDB Terms and Concepts =

The sections listed below describe the most important terms and concepts used for the development of the GenDB annotation system.

* [[GenDBWiki/IntroductionToGenomics|Introduction to Genomics]]
* [[GenDBWiki/DataModel|Data Model]]
* [[GenDBWiki/TermsAndConcepts/GenDBAnnotationWorkflow| GenDB Annotation Workflow]]
* [[GenDBWiki/AnnotationConcepts|Annotation Concepts]]
* [[GenDBWiki/ToolAndJobConcept|Tool and Job Concept]]
* [[GenDBWiki/RolesAndRights|Roles and Rights]]

GenDBWiki

2011-10-28T09:27:01Z

Tk: /* Contact */

__NOTOC__

= GenDB Overview Page =

== GenDB - An open source genome annotation system for prokaryote genomes ==

The GenDB system is an open source genome annotation system developed by the "Bioinformatics Resource Facility (BRF)" of the "Center for Biotechnology ([http://www.cebitec.uni-bielefeld.de/ CeBiTec])" at "Bielefeld University".

Please have a look at the [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/ GenDB website] for general information, a guided tour, login to a demo project, a list of FAQs and more information about the availability of this software. Within this GenDB Wiki you will find more detailed information about installing and maintaining GenDB, using the system, and writing your own programs. Since this information is likely to change quite frequently we decided to put it into a Wiki system. If you discover new features that are not listed here you are of course welcome to extend this documentation.

Since GenDB version 2.2 the sources are split into 3 separate major packages:

* [[GenDBWiki/CoreDocumentation|GenDB CORE]] includes the complete GenDB backend with all O2DBI class modules and several extensions but no graphical user interface. It contains the basic functionality for annotating a genome automatically by using a number of simple scripts available within this package.
* [[GenDBWiki/WebDocumentation|GenDB WEB]] contains the modules required for running the GenDB web frontend. It can be used for a distributed manual genome annotation.
* [[GenDBWiki/GUIDocumentation|GenDB GUI]] includes the sources of the graphical user interface written in Gtk. This frontend provides some enhanced functionality and it has more features than the web interface.

The following sections contain detailed information about each package and how to use it for annotating a genome. You can also find some documentation for software developers and system administrators in separate Wiki sections. If you are using the GenDB system for the first time, you should also read the [[GenDBWiki/TermsAndConcepts|Terms and Concepts]] section where the conceptual details of the system are explained.

== Installation ==

* [[GenDBWiki/AdministratorDocumentation/GenDBInstallation|Installation Guide]]
* [[GenDBWiki/AdministratorDocumentation/GenDBInstallationFAQ|Installation FAQ]]

== Administration ==

* [[GenDBWiki/AdministratorDocumentation/ManagingUsers|Managing Users and Projects]]

== Documentation ==
* [[GenDBWiki/CoreDocumentation|Core Documentation]]
* [[GenDBWiki/WebDocumentation|Web Documentation]]
* [[GenDBWiki/GUIDocumentation|GUI Documentation]]
* [[GenDBWiki/DeveloperDocumentation|Developer Documentation]]
* [[GenDBWiki/AdministratorDocumentation|Administrator Documentation]]

== General ==
* [[GenDBWiki/FeatureTable|Feature Table]]
* [[GenDBWiki/TermsAndConcepts|Terms and Concepts]]
* [[GenDBWiki/FuturePlans|Future Plans]]

== Other Issues ==

For more information about project setup, license, application examples, and publications, please go to the [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/ GenDB website].

== Contact ==

Please send an e-mail for account requests or questions concerning the use of GenDB to:
[[MailTo(gendb AT cebitec DOT uni DASH bielefeld DOT de)]]

For bug reports, please use our bug reporting system [https://bugs.cebitec.uni-bielefeld.de/ CeBiTec Bugzilla].

Author: [http://www.cebitec.uni-bielefeld.de/~agoesman Alexander Goesmann]

GenDBWiki/CoreDocumentation

2011-10-28T09:20:19Z

Tk: fixing more links...

__NOTOC__
= GenDB CORE Documentation =

The GenDB CORE documentation is intended to provide some general background knowledge about the GenDB annotation system. You should read the following chapters if you are planning to set up the system in order to annotate genomes.

* [[GenDBWiki/CoreDocumentation/CoreScripts|CoreScripts]]: This section describes the most important scripts for running a standard genome annotation pipeline.
* [[GenDBWiki/CoreDocumentation/RegionPrediction|RegionPrediction]]: In this part you can find a number of details about the GenDB region prediction components.
* [[GenDBWiki/CoreDocumentation/FunctionPrediction|FunctionPrediction]]: Tools used for function prediction within the GenDB system are described here.
* [[GenDBWiki/CoreDocumentation/AdditionalScripts|AdditionalScripts]]: Additional scripts for special purpose tasks are explained in this last section of the GenDB CORE documentation.

For writing your own programs using the GenDB API please take a look at the GenDB developer section and the API documentation.
