GenDBWiki/AdministratorDocumentation/GenDBInstallation

GenDB Installation Instructions

Below are basic instructions for installing GenDB from a tarball. There is also a FAQ page describing some of the problems that may occur during installation.

Although we test our software on a number of systems prior to release, it may contain errors and, in the worst case, may not work the way you expect. GenDB 2.2 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Checking system requirements

As a Perl application, GenDB 2.2 requires a recent version of Perl and a number of additional tools that are not specific to bioinformatics:

  • Perl (version 5.8 or higher)
  • MySQL (version 4.0 or higher)
  • Sun Grid Engine (version 6.0 or higher) or another DRMAA-compatible scheduling system
  • gnuplot
  • NetPBM
  • ImageMagick
  • GraphViz (http://www.graphviz.org)
  • wget
  • dialog
  • Apache ([1])
  • mod_perl (optional, but "highly recommended" [2])

You also need an administrative account at the MySQL server for creating several databases during the installation.
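A quick pre-flight check can catch missing or outdated tools before the installation script runs. The following is only a minimal sketch and not part of GenDB: the helper names `missing_tools` and `version_at_least` are hypothetical, and the command names in the usage note are the binaries these packages usually install, which may differ on your system.

```shell
# Print every command from the argument list that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if the dotted version ACTUAL is greater than or equal to MINIMUM.
# Usage: version_at_least MINIMUM ACTUAL
version_at_least() {
    awk -v min="$1" -v got="$2" 'BEGIN {
        n = split(min, a, ".")
        m = split(got, b, ".")
        len = (n > m) ? n : m
        # Compare component by component; missing components count as 0.
        for (i = 1; i <= len; i++) {
            x = (i <= n) ? a[i] + 0 : 0
            y = (i <= m) ? b[i] + 0 : 0
            if (y > x) exit 0
            if (y < x) exit 1
        }
        exit 0
    }'
}
```

For example, `missing_tools perl mysql gnuplot dot convert wget dialog` lists the commands that still need to be installed (`dot` belongs to GraphViz, `convert` to ImageMagick), and `version_at_least 5.8 "$(perl -e 'printf "%vd", $^V')"` succeeds only if the installed Perl is recent enough. The comparison is numeric per component, so 5.10 correctly counts as newer than 5.8.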

Installing the required Perl modules

GenDB 2.2 relies on a number of freely available Perl modules, all of which can be obtained from the CPAN archive ([3]). You need to install these modules prior to installing GenDB 2.2. Depending on your operating system, the modules may also be available as packages provided by your operating system vendor; using these packages is recommended. List of required Perl modules (in no particular order):

  • Graph
  • GD
  • Config::IniFiles
  • LWP::UserAgent
  • HTTP::Request
  • URI
  • Chart::Graph::Gnuplot
  • Crypt::GeneratePassword
  • Mail::Mailer
  • DBI
  • DBD::mysql
  • Term::ReadKey
  • DB_File::Lock
  • HTML::Template
  • CGI
  • Digest::MD5
  • UI::Dialog
  • BioPerl (also available at [4])

Please keep in mind that these modules may have further requirements and may depend on other modules or libraries. If you are going to install the modules yourself instead of using ready-made packages, the so-called "CPAN shell" will help you resolve dependencies and install modules. See the section about installing Perl modules at the CPAN website or the manpage of the CPAN module (if installed). Newer versions of Perl (> 5.8) also provide an improved version of the CPAN shell, called "CPANPLUS".
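To see which of the modules listed above are still missing, you can ask Perl to load each one in turn. This is a minimal sketch, not something shipped with GenDB; the function name `missing_perl_modules` is hypothetical, and it assumes that `perl` on your PATH is the interpreter GenDB will use.

```shell
# Print every Perl module from the argument list that cannot be loaded.
missing_perl_modules() {
    for module in "$@"; do
        # "perl -MName -e 1" exits with a non-zero status if the module
        # (or one of its own dependencies) fails to load.
        perl "-M$module" -e 1 >/dev/null 2>&1 || printf '%s\n' "$module"
    done
}
```

A call such as `missing_perl_modules Graph GD Config::IniFiles DBI DBD::mysql HTML::Template Bio::Seq` prints one missing module per line (BioPerl is probed here through `Bio::Seq`, one of its core modules). Each reported name can then be installed from the CPAN shell, started with `perl -MCPAN -e shell`, using `install <module>`.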

Installing bioinformatics software

GenDB 2.2 does not contain methods for gene calling or predicting gene functions; it uses a number of mostly freely available, well-known tools. Most of these tools are required for running GenDB 2.2.

  • Gene calling:
    • Critica
    • Glimmer 2
    • tRNAScan-SE
    • SearchForRNAs (available at our ftp server)
    • QRNA (optional)
    • rbsfinder (optional)
  • Function prediction:
    • NCBI Blast ([5])
    • HMMER ([6])
    • InterProScan ([7])
    • EMBOSS ([8])
    • SAPS (optional)
    • TMHMM (optional)
    • SignalP (optional)

At least the mandatory tools need to be set up and configured properly.

Getting the required sequence databases

The tools described above operate on various databases containing biological sequences, patterns, HMMs, etc. You need at least the following databases for GenDB 2.2:

  • a non-redundant nucleotide database (usually called "nt", available by ftp at the NCBI and the EBI)
  • a non-redundant protein database ("nr", also available at the NCBI/EBI)
  • a blastable SwissProt database
    You may either use a standalone database or a subset of the "nr" database provided by the NCBI or the EBI. Setting up this subset is beyond the scope of this documentation.
  • a database containing the protein sequences of all genomes available in the KEGG database
    You can build this database by downloading the sequences from the KEGG ftp server and concatenating the necessary files.
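Concatenating the downloaded sequence files can be done with a few lines of shell. The sketch below is an illustration under assumptions: the function name `concat_fasta` is hypothetical, and the file names in the usage note are placeholders, since the layout of the ftp server is not described here.

```shell
# Concatenate FASTA files into a single database file.
# Usage: concat_fasta OUTPUT INPUT...
concat_fasta() {
    out=$1
    shift
    # "awk 1" copies every line and terminates the last line of each input
    # file with a newline, so the final sequence line of one file can never
    # run into the header line of the next file.
    awk 1 "$@" > "$out"
}
```

For example, `concat_fasta kegg_genes.fasta downloads/*.pep` followed by `formatdb -i kegg_genes.fasta -p T -o T` would build a protein BLAST database with parsed sequence identifiers. Plain `cat` works as well, as long as every input file ends with a newline.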

You have to ensure that the FASTA header lines follow the defline format required by the NCBI. See the documentation for "formatdb" and "fastacmd", and [9]. If you want to use custom databases within GenDB, their header lines also have to follow the defline standard.
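Header lines that do not follow the defline format are easier to fix before the database is formatted. The following sketch is only a heuristic and not a full validator: it flags headers whose first identifier does not begin with one of the commonly used NCBI database tags followed by a `|`, and it does not check the number or content of the remaining fields. The function name `nonstandard_deflines` is hypothetical.

```shell
# Print "line-number: header" for every FASTA header whose first identifier
# does not start with a known NCBI database tag such as gi|, sp| or lcl|.
# Reads the given files, or standard input if no file is given.
nonstandard_deflines() {
    awk '/^>/ && !/^>(gi|gb|emb|dbj|pir|prf|sp|pdb|pat|bbs|gnl|ref|lcl)[|]/ {
        print FNR ": " $0
    }' "$@"
}
```

Running `nonstandard_deflines my_custom_db.fasta` (a placeholder file name) prints nothing if all headers pass the check. For custom sequences without a public identifier, the `lcl|` tag for local identifiers is usually the appropriate choice.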

Installing GenDB

  • Get the most recent GenDB tarball from SourceForge ([10])
  • Unpack the tarball in a temporary directory
  • Run the installation script ("sh install_gendb.sh <target directory>")

The installation script will check for the Perl modules and for the paths to binaries and databases, install the components, and set up the system.
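The unpack-and-run steps above can be sketched as a small shell function. This is an illustration under assumptions, not an official wrapper: the name `install_from_tarball` is hypothetical, the tarball is assumed to be gzip-compressed, and because the directory layout inside the tarball is not documented here, the function searches for `install_gendb.sh` instead of hard-coding its location.

```shell
# Unpack a GenDB tarball into a temporary directory and run its installer.
# Usage: install_from_tarball TARBALL TARGET_DIR   (use absolute paths)
install_from_tarball() {
    tarball=$1
    target=$2
    workdir=$(mktemp -d) || return 1
    # Assumes a gzip-compressed tarball; drop the "z" for a plain .tar file.
    tar -xzf "$tarball" -C "$workdir" || return 1
    # Locate the installation script wherever the tarball placed it.
    script=$(find "$workdir" -name install_gendb.sh | head -n 1)
    if [ -z "$script" ]; then
        echo "install_gendb.sh not found in $tarball" >&2
        return 1
    fi
    # Run the script from its own directory, in a subshell so that the
    # caller's working directory is left unchanged.
    ( cd "$(dirname "$script")" && sh install_gendb.sh "$target" )
}
```

The temporary directory is left in place after the run, which makes it easier to inspect the unpacked files if the installation reports a problem.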

Note:

  • The GOPArc component downloads several databases from external servers, such as the COG and KOG databases and the KEGG databases (other than the database mentioned above), so you need an internet connection during the installation. The installation script will ask for a proxy server to be used for downloading. If you do not need a proxy, leave the field empty. Otherwise enter the complete proxy URL, e.g. `http://proxy.my-institute.org:1234`.
  • During the download the script also tries to fetch the KEGG pathway reference maps. These are about 2,300 individual files, which unfortunately have to be fetched one by one.
  • Some of the installation scripts produce a lot of output with warnings and error messages. We are aware of this and will fix it in the next release.
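Because a mistyped proxy setting only shows up once the downloads start to fail, it can be worth checking the value beforehand. The sketch below assumes that a "complete" proxy URL consists of a scheme, a host name and a numeric port, as in the example above; proxy URLs with embedded credentials are deliberately not handled. The function name `is_complete_proxy_url` is hypothetical.

```shell
# Succeed if the argument looks like scheme://host:port
# (an optional trailing slash is accepted).
is_complete_proxy_url() {
    printf '%s\n' "$1" | grep -Eq '^https?://[^/:@[:space:]]+:[0-9]+/?$'
}
```

For example, `is_complete_proxy_url http://proxy.my-institute.org:1234` succeeds, whereas the same value without the `http://` prefix or without the port number is rejected.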

Setting up the web server

The Apache web server is usually available as a package provided by the operating system vendor; however, the way the server is configured differs between operating systems and distributions. We provide a script called "setup_web_interface" that tries to detect the Apache version and can create a configuration fragment for setting up the web server. The fragment contains the necessary configuration directives, but may need to be adapted to your local installation.
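If you need to know which Apache version is installed, for example to pick the right variant of a configuration directive when adapting the fragment, the version can be read from the server binary itself. This sketch is independent of "setup_web_interface" and does not necessarily reflect how that script detects the version; the function name `apache_version` is hypothetical.

```shell
# Read the output of "httpd -v" (or "apache2 -v") on standard input and
# print the bare version number from the "Server version:" line.
apache_version() {
    sed -n 's|^Server version: Apache/\([0-9][0-9.]*\).*|\1|p'
}
```

Depending on the distribution, the call is `httpd -v | apache_version` or `apache2 -v | apache_version`. After including the generated fragment in your server configuration, `apachectl configtest` reports syntax errors before the server is restarted.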

Starting the dispatcher

The "dispatcher" is the part of GenDB 2.2 responsible for managing external tools, running them on the cluster, parsing output and so on. It needs to be started prior to submitting tools within the GenDB 2.2 web interface or by a command line script. Starting a dispatcher is simple:

  • `cd` into the `bin` directory of your GenDB installation
  • execute `./dispatcher -l /tmp/your_dispatcher.log`
  • for advanced options just execute `./dispatcher` without any arguments
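To keep the dispatcher running after you log out, the steps above can be wrapped in a small function. This is a minimal sketch that assumes the dispatcher stays in the foreground when started as shown; if it detaches on its own, the `nohup` wrapper is unnecessary and the printed process ID will not be that of the running dispatcher. The function name `start_dispatcher` is hypothetical.

```shell
# Start the dispatcher in the background and print its process ID.
# Usage: start_dispatcher GENDB_DIR LOGFILE
# The body runs in a subshell, so the caller's directory is not changed.
start_dispatcher() (
    cd "$1/bin" || exit 1
    # nohup lets the process survive the end of the login session; the
    # dispatcher writes its own log to the file given with -l.
    nohup ./dispatcher -l "$2" </dev/null >/dev/null 2>&1 &
    echo $!
)
```

For example, `pid=$(start_dispatcher /path/to/gendb /tmp/your_dispatcher.log)` starts the dispatcher (the installation path is a placeholder), and `kill "$pid"` stops it again.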

You may start the dispatcher at system boot time using a so-called "init script". Again, these scripts differ between operating systems and distributions, so we are not able to provide a ready-made one. Ask your local system administrator for more information about how to write an init script.

The installation is now finished, and you can start by adding a new GenDB project and processing the first data. See the core scripts page for more information.