GenDBWiki/DeveloperDocumentation/GenDBProgrammingTutorial

From BRF-Software
Jump to navigation Jump to search

GenDB Programming Tutorial

In the following sections you can learn how to write a simple GenDB script that fetches all coding sequences (CDS) of a contig, prints out some detailed information about the current annotation of each CDS and shows the best tool hits from several bioinformatics analyses. Before you go through this example step by step you should also take a look at the [wiki:GenDBWiki/DeveloperDocumentation/GenDBDemoScript GenDB demo script] and the [wiki:GenDBWiki/DeveloperDocumentation/ClassDiagram GenDB class diagrams]. Most of the terminology used here is explained in detail in the [wiki:GenDBWiki/TermsAndConcepts terms and concepts section] of the GenDB Wiki. A detailed explanation of each method applied in the script can be found in the API documentation of the corresponding class or module. For producing some reasonable output, this tutorial requires of course that you have installed GenDB successfully and that you have a project that has all information that is used within the script. For testing purposes you can setup a project with the test data set that is provided with the GenDB tarball. The complete source code of this script is available for [attachment:gendb_tutorial.pl download here].

Before we can perform some useful actions on the GenDB data we have to establish a connection to the O2DBI backend via the GenDB Application Frame:

 #!/usr/bin/env perl

 # simple GenDB script that is used for the GenDB programming tutorial

 use strict;
 use Carp;
 use Getopt::Std;
 use Term::ReadKey;
 use IO::Handle;
 use GPMS::Application_Frame::GENDB;

 #
 #  this is necessary if the script is started via rsh(1)
 #  otherwise you won't see any output until the first <RETURN>
 #
 STDOUT->autoflush(1);
 sub usage {
   print "gendb_tutorial - print some information about all CDS regions of a contig\n";
   print "usage: gendb_tutorial -p <project> -c contig name\n\n";
 }

 # global variables
 our($opt_p, $opt_c);
 getopts('p:c:');

 # start sanity checks
 if (!$opt_p) {
   usage;
   print "ERROR: Can't initialize GenDB: No project name given!\n";
   exit 1;
 };

 if (!$opt_c) {
   usage;
   print "ERROR: Don't know what to do: No contig name given!\n";
   exit 1;
 };

 # get the login name of the current user
 my $user = defined( $ENV{'LOGNAME'} ) ? $ENV{'LOGNAME'} : (getpwuid( $> ))[0];
 print "Enter your database password: ";
 ReadMode('noecho');
 my $password = ReadLine(0);
 chomp $password;
 print "\n";
 ReadMode('normal');

 # try to initialize GenDB project
 # initialize an Application_Frame for the current project
 my $gendbAppFrame = GPMS::Application_Frame::GENDB->new($user, $password);

 # check if the initialization succeeded
 die "Unable to initialize ApplicationFrame for GenDB project!" unless (ref $gendbAppFrame);

 # try to initialize a project for the given name
 unless ($gendbAppFrame->project($opt_p)) {
   print STDERR "ERROR: Could not initialize project $opt_p!\n";
   print STDERR $gendbAppFrame->error()."\n";
   exit 1;
 }

 # check if the user has the 'annotate' privilege
 exit unless $gendbAppFrame->right("annotate");
 # get a global O2DBI-2 master object
 my $master = $gendbAppFrame->application_master();

After successfully connecting to a GenDB project we can start by initializing a contig object with a given name. Since contigs belong to those objects that are directly associated with a DNA sequence, they are stored as GENDB::DB::Region::Source::Contig objects. All GENDB::DB::Region::Source classes have a sequence attribute which is a reference on a GENDB::DB::Sequence object containing the plain DNA sequence. Since the name of all regions (and so of all GENDB::DB::Region::Source::Contig objects) is unique and indexable you can initialize the corresponding object using the init_name method provided automatically by O2DBI. If you don't know the names of the contigs available within a project you can use the [wiki:GenDBWiki/DeveloperDocumentation/GenDBDemoScript GenDB demo script] to get a list of all contigs.

# try to initialize a contig object for the given name
my $contig = $master->Region->Source->Contig->init_name($opt_c);

After successfully initializing a contig object (you should always check the return value of all init methods using ref and isa) we can go on and print out some information about the contig:

if (ref $contig && $contig->isa("GENDB::DB::Region::Source::Contig")) {
  # print header with some contig information
  printf "\nContig %s, %d bp\n", $contig->name, $contig->length;

Before we continue by fetching all subregions of this contig that are coding sequence we have to clearify a subtle implementation detail that could lead to some irritations here since the GenDB region hierarchy is different from the class hierarchy: Although a GENDB::DB::Region::CDS is a subregion of a contig it is not a subclass of a GENDB::DB::Region::Source::Contig since only all source regions refer to a sequence object. On the other hand, CDS regions do not need a direct reference to a sequence object since their sequence can be derived by using an association to a corresponding parent region, e.g. a contig and relative start and stop positions with respect to the contig sequence.

For retrieving all CDS regions of a contig we have different possibilities that are already implemented in the GENDB::DB::Region superclass of the GenDB backend:

You can either find an appropriate automatically generated method for fetching all CDS regions with a specific parent region (fetchallby_parent_region()), or you can use a manually programmed (but very generic) method for fetching all subregions of a given region (get_subregions). Since we are only interested in CDS regions in this example we use the prior method that is quite sufficient for our purposes here:

# get all CDS subregions of the contig
my $subregions = $master->Region->CDS->fetchallby_parent_region($contig);

As explained in the API documentation, this method returns a reference on an array of CDS region objects that have the initialized contig as their parent region. In the next step we will now loop through this list of coding sequences and print out some of their attributes (start, stop, iso-electric point, molecular weight).

# loop through all CDS regions
foreach my $cds (@$subregions) {
  printf "%s\t%d\t%d\t%.2f\t%.2f\n", $cds->name, $cds->start, $cds->stop, $cds->iep, $cds->molweight;

As you can see from the API some of these methods are automatically generated by O2DBI as standard getter/setter methods, while others (like aasequence) were manually implemented.

In addition to its own methods, a CDS has an association (attribute latest_annotation_function) to a GENDB::DB::Annotation::Function. This annotation contains the latest (see Wiki for more details about the difference between the latest and the current annotation) information about the functional assignments to a gene. It is now a straightforward exercise to print out some information about the functional assignments of a gene:

# try to get the latest functional annotation
my $annotation = $cds->latest_annotation_function;
if (ref $annotation && $annotation->isa("GENDB::DB::Annotation::Function")) {
  printf "%s\t%s\t%s\n", $annotation->name, $annotation->geneproduct, $annotation->description;


As you will see from the printed attribute values, a CDS can have a gene name that is stored in an annotation in addition to the unique region name that has to be assigned to all regions upon creation. This means that you will find gene names like dnaA or gyrB stored in the name attribute of an annotation but NOT in the name attribute of the region object itself.

In the next step of this little sample program we will print out the best results (observations) obtained by computing different bioinformatics tools like BLAST, HMMPfam, or !InterPro. Instead of selecting one or more tools stored within a project database we are using a more generic approach here: First we initialize all GENDB::DB::Tool::Function objects and then we fetch the best observation for each combination of a tool and CDS region. Each observation is printed as some simple text graphics followed by some of the attribute values as illustrated below:

output format:
========== region name, start, stop, IEP, MW, gene name, gene product, description
++        tool name, start, stop, hit description
++++++++++ tool name, start, stop, hit description
++++++ tool name, start, stop, hit description

Please notice that we have slightly modified the output of a CDS region by adding some '=' characters at the beginning representing the (normalized) length of each coding sequence. The source fragment below shows you how all function tools are initialized using the fetchall method provided by O2DBI.

# initialize all function tools only once before we loop through all CDS regions
my $tools_ref = $master->Tool->Function->fetchall();

The following foreach loop is nested within the previously implemented loop that goes through all CDS regions on a contig. This inner loop is used for fetching the best observation of each available tool for the current CDS. The start and stop positions are normalized and the result is printed with some attribute values.

# loop through all tools and try to get the best observation for each tool and the current CDS
foreach my $tool (@$tools_ref) {
# get the best observation
my $best_obs;
my $ref = ref $tool;
my ($type) = $ref =~ /GENDB::DB::Tool::Function::(\w+)/;
if ($master->Observation->Function->can($type)) {
$best_obs = $master->Observation->Function->$type->best_observation($tool, $cds);
}

if (ref $best_obs && $best_obs->isa("GENDB::DB::Observation::Function")) {
# compute a normalized length
my ($start, $stop) = ($best_obs->start, $best_obs->stop);
my $factor = $cds->length / 10;
my $start_pos = int ($start / $factor);
my $stop_pos = int ($stop / $factor);

# simple loop for graphical output
for (my $i=0; $i<10; $i++) {
if ($i < $start_pos) {
print " ";
next;
}

if ($i <= $stop_pos) {
print "+";
next;
}

print " ";
}

# print a short tool result
printf " %s\t%d\t%d\t%s\t%s\n", $tool->name, $start, $stop, $best_obs->description;
}
}

Running the complete script will print all genes of a contig and their annotations, similar to an EMBL or !GenBank file, although in a completely different output format. You should have learned by now how to access and print out some information that is stored within a GenDB project database using the API.

But what about manipulating and modifying the stored data? Using an object-oriented approach, modifying the attributes of an object is fairly simple since all getter methods for retrieving the value of an attribute can be used (in an overloaded manner) as setter methods as well: If a method obtains an attribute of the corresponding type as parameter, the attribute is set immediately to the new value. This means that the new value is also stored in the project database at once. As an example, you can see in the last steps how the date of the latest export is stored in the comment of a region annotation for a contig.

Due to the GenDB concept that annotations are NEVER deleted so that you can always reproduce how a region evolved over time (this is especially important to protocol the progress of an ongoing manual annotation), we create a new annotation with the timestamp of the last export and store our script name in the comment. For creating a new annotation we also need a mandatory GENDB::DB::Annotator object that we try to initialize from the project, and if it does not exist we create it. Finally, we also set the new annotation as the latest annotation:

# register the export by creating a new annotation
# at first we have to get an annotator
my $annotator_name = "Demo Exporter";
my $annotator = $master->Annotator->init_name($annotator_name);
if (!ref $annotator) {
$annotator = $master->Annotator->Machine->create($annotator_name);

die "Unable to create new annotator!" unless ref $annotator;

$annotator->description("GenDB-2.2 test annotator from tutorial script.");
}

# create new annotation, the previous region status is kept
my $new_annotation = $master->Annotation->Region->create(time(),
$annotator,
$contig,
$contig->status_region
);

die "Unable to create new annotation!" unless ref $new_annotation;

# set a new comment
$new_annotation->comment("This contig was exported by the gendb_tutorial script.");

# all other attribute values are copied from the latest annotation of the contig
# each contig should have at least one annotation from the initial import
my $prev_contig_annotation = $contig->latest_annotation_region;

if (ref $prev_contig_annotation) {
$new_annotation->description($prev_contig_annotation->description);
$new_annotation->start($prev_contig_annotation->start);
$new_annotation->stop($prev_contig_annotation->stop);
}
else {
$new_annotation->start($contig->start);
$new_annotation->stop($contig->stop);
}

# set the new annotation as the latest annotation
$contig->latest_annotation_region($new_annotation);

Now you can always reproduce the exports with this script from the list of stored annotations of a contig. Although the last procedure abuses the GenDB concept of annotating a region a little bit, this example hopefully showed you how to create a new object and how to manipulate existing ones. Usually you would only write a new annotation in order to protocol that a region was changed, how it was done, and who did what.

Where to go on from here?

This short tutorial should have shown you how to access the most important data objects within the GenDB system. Of course, this sample script only used a minimal subset of all GenDB classes (there are more than 180 in the 2.2 release), but understanding the basic principles of the O2DBI API should enable you to comprehend the use of other classes and their methods as well. For getting into the details of the system you should take a closer look at the GenDB API documentation and read the source code of some [wiki:GenDBWiki/DeveloperDocumentation/ContributedScripts sample scripts]. You may also find more information in the [wiki:GenDBWiki/DeveloperDocumentation developer sections] of the GenDB Wiki. If you are new to programming with GenDB you should probably setup a test project where you keep an initial dump of the database as a backup that can be easily restored if you finally managed to corrupt the complete dataset :D

Well this is how we all started ...