IGetDBWiki/Specification

From BRF-Software

IGetDB Specification

Introduction

IGetDB is a search facility for expression data. It enables the user to quickly access experiments and their results by formulating queries such as "Give me all temperature-related stress experiments containing gene XY" or "Give me a list of all up-regulated genes in experiments on Xanthomonas campestris". IGetDB combines data from EMMA, GenDB and ProDB. The data is imported from each of these databases and processed for searchability.

Search Interface

Possible search results are lists of genes and lists of experiments. Possible search parameters are (cf. fig. 1):

  • the organism
  • experiment design and description
  • the regulation status of genes (e.g. up- or down-regulated, same regulation level)
  • a range of expression-level values, for data where a specific value for the expression level is stated.

Each of these search parameters can be set to a wildcard (i.e. it has no influence on the search; this is the default), and parameters can be combined with the operations “and” and “but not”. The search system should also be extensible so that it can accommodate additional combinations of search parameters.

Fig. 1: Search parameters (SearchParameter.png)
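
The combination rules described above can be sketched as follows. This is only a minimal illustration, not the planned implementation: the record type `Result`, its field names, the helper functions and the sample data are all assumptions made for this example. A parameter whose value is `None` acts as the wildcard.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


# Hypothetical record type; the field names are illustrative only.
@dataclass(frozen=True)
class Result:
    gene: str
    organism: str
    design: str
    regulation: str  # "up", "down" or "same"
    level: float     # expression level, where one is stated


Predicate = Callable[[Result], bool]


def param(field: str, value: Optional[str] = None) -> Predicate:
    """Match one field; value=None is the wildcard and matches everything."""
    return lambda r: value is None or getattr(r, field) == value


def level_range(low: float, high: float) -> Predicate:
    """Match results whose expression level lies in [low, high]."""
    return lambda r: low <= r.level <= high


def and_(a: Predicate, b: Predicate) -> Predicate:
    return lambda r: a(r) and b(r)


def but_not(a: Predicate, b: Predicate) -> Predicate:
    return lambda r: a(r) and not b(r)


def search(data: List[Result], query: Predicate) -> List[Result]:
    return [r for r in data if query(r)]


# Made-up sample data, for illustration only.
DATA = [
    Result("geneA", "Xanthomonas campestris", "temperature related stress", "up", 2.5),
    Result("geneB", "Xanthomonas campestris", "chemical stress", "up", 1.2),
    Result("geneC", "Xanthomonas campestris", "temperature related stress", "down", -1.8),
]

# "Up-regulated genes in X. campestris, but not from chemical stress experiments."
query = but_not(
    and_(param("organism", "Xanthomonas campestris"), param("regulation", "up")),
    param("design", "chemical stress"),
)
hits = search(DATA, query)
print([r.gene for r in hits])  # ['geneA']
```

Because every parameter is just a predicate, further operations could be added as additional combinator functions without changing the existing ones.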

Implementation Details

The object model will be created with the O2DBI system. The search facility will be created as an additional layer on top of these classes. In order to simplify the implementation of the graphical user interface (GUI), the search layer will have an introspection mechanism that tells the GUI what can be searched for, which search parameters can be entered and how these search parameters can be combined.
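
One way such an introspection mechanism could look is sketched below. The class `SearchLayer`, the `describe()` method and the structure of its return value are hypothetical and are not part of O2DBI; the sketch only shows the idea that the GUI asks the search layer what it offers instead of hard-coding it.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ParameterInfo:
    name: str
    kind: str                      # "text", "choice" or "range" (assumed kinds)
    choices: Tuple[str, ...] = ()  # only used for kind == "choice"


class SearchLayer:
    """Hypothetical search layer that can describe itself to a GUI."""

    RESULT_TYPES = ("genes", "experiments")
    OPERATIONS = ("and", "but not")
    PARAMETERS = (
        ParameterInfo("organism", "text"),
        ParameterInfo("experiment design", "text"),
        ParameterInfo("regulation", "choice", ("up", "down", "same")),
        ParameterInfo("expression level", "range"),
    )

    def describe(self) -> dict:
        """Return everything a GUI needs to build a search form."""
        return {
            "result_types": list(self.RESULT_TYPES),
            "operations": list(self.OPERATIONS),
            "parameters": [asdict(p) for p in self.PARAMETERS],
        }


description = SearchLayer().describe()
for p in description["parameters"]:
    print(p["name"], "->", p["kind"])
```

With this design, adding a new search parameter to the layer would make it appear in the GUI without any GUI code changes.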

Ontology

In principle, the controlled vocabularies for the experiment design types, the organisms and the experiment categories can be stored in simple (e.g. comma- or tab-separated) lists. They should, however, be stored in an appropriate database so that data from ontologies such as the MGED ontology can be imported and used. The MGED object model (MAGE-OM) provides some basic classes for referencing an ontology entry; unfortunately, these classes are not sufficient for storing a complete ontology. The discussion of why the MGED classes are not sufficient, and the specification of such an ontology database, can be found in a separate document, OntologyDBWiki.
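
To illustrate what a database gains over a flat list, the following sketch stores terms together with a parent link and retrieves a term with all of its descendants. It is not the schema specified in OntologyDBWiki: the table layout is an assumption, the parent term "stress" is invented for the example, and the single-parent column is a simplification (real ontologies may allow several parents per term). SQLite is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE term (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE,
        parent_id INTEGER REFERENCES term(id)  -- NULL for a root term
    );
    """
)
conn.executemany(
    "INSERT INTO term (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "stress", None),  # hypothetical parent term
        (2, "temperature related stress", 1),
        (3, "chemical stress", 1),
    ],
)


def term_and_descendants(conn: sqlite3.Connection, name: str) -> list:
    """Return the named term plus all terms below it, sorted by name."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id, name) AS (
            SELECT id, name FROM term WHERE name = ?
            UNION ALL
            SELECT t.id, t.name
            FROM term AS t JOIN subtree AS s ON t.parent_id = s.id
        )
        SELECT name FROM subtree ORDER BY name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(term_and_descendants(conn, "stress"))
# ['chemical stress', 'stress', 'temperature related stress']
```

A search for the broader term can then be expanded to all narrower terms, which a flat comma-separated list cannot express.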

Schedule

  • The main aim is to have a running prototype by the end of 2006.
  • A coarse list of milestones for such a prototype would be:
    • Project setup (i.e. requesting GPMS project, setting up database).
    • Implementation of an import prototype for EMMA (which will require refinements of the object model). This includes evaluating and defining the information required for an import.
    • Implementation of a search layer and search user interface. The search layer will need several controlled vocabularies. These will be implemented in simple text files (CSV or RDF).
  • This prototype will then be shown to biologists in order to evaluate the usability of the user interface and to find missing features.
  • The milestones of the main tool are then:
    • Implementation of the features requested by the biology experts.
    • Refinements of the EMMA import facility and implementation of a GenDB and ProDB import facility.
    • Optimisation of the search layer and implementation of a database for the controlled vocabularies.

Requirements

To ensure that experiment descriptions stay comparable and searchable, we need to introduce controlled vocabularies for experiment categories (e.g. GenDB, ProDB, EMMA) and experiment design types (temperature related stress, chemical stress, etc.).
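
As a rough sketch of how such vocabularies could be enforced while they still live in simple text files, the example below loads terms from CSV and rejects unknown terms. The two-column file layout and the function names are assumptions made for this example; the terms themselves are the ones named above.

```python
import csv
import io

# Assumed CSV layout: one row per term, tagged with the vocabulary it belongs to.
VOCABULARY_CSV = """vocabulary,term
experiment category,EMMA
experiment category,GenDB
experiment category,ProDB
experiment design type,temperature related stress
experiment design type,chemical stress
"""


def load_vocabularies(text: str) -> dict:
    """Parse the CSV text into {vocabulary name: set of allowed terms}."""
    vocabularies = {}
    for row in csv.DictReader(io.StringIO(text)):
        vocabularies.setdefault(row["vocabulary"], set()).add(row["term"])
    return vocabularies


def validate(vocabularies: dict, vocabulary: str, term: str) -> str:
    """Return the term unchanged, or raise if it is not in the vocabulary."""
    if term not in vocabularies.get(vocabulary, set()):
        raise ValueError(f"{term!r} is not a known {vocabulary}")
    return term


vocabularies = load_vocabularies(VOCABULARY_CSV)
print(validate(vocabularies, "experiment design type", "chemical stress"))

try:
    validate(vocabularies, "experiment design type", "heat shock")
except ValueError as error:
    print("rejected:", error)
```

Running such a check at import time would keep free-text variants of the same design type out of the database.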

Questions

  • Should a query such as "Give me a list of all up-regulated genes in experiments on Xanthomonas campestris" include the taxonomic range? That is, should this request return results for "Xanthomonas", "Xanthomonas bromi", "Xanthomonas campestris pv. badrii", "Xanthomonas campestris (pv. campestris)", "Xanthomonas campestris pv. Carotae"? If yes, how should this be done: simply by string comparison, or by taxonomy?
  • How should attributes such as up/down-regulated be applied to genes? Asking the user to apply these attributes does not sound feasible. Alternatively, we could normalise the data; but to make experiments comparable with each other, we would need to normalise across the complete set of experiments, and this would have to be repeated with each new import of experiments. This is computationally expensive and also requires that we store all the original data alongside the normalised data.
  • Who is allowed to import data into IGetDB? Only the experiment owner, or only the project leader, or both?
  • It also needs to be clarified where the right to import information into IGetDB should be defined: in IGetDB itself, or in the exporting databases such as EMMA, GenDB and ProDB?
  • Andi pointed out that it might be useful to have a flag such as "preliminary information". "Preliminary information" would flag data that is stored in IGetDB, but is not deemed good enough (yet) to be accessible by the search interface.
  • A "gene" is currently defined as an external reference to Region::CDS. As Andi stated (IMHO correctly), genes will not be the only entities for which we have expression or regulation data. Instead, there should be a “substance” class that can be sub-classed for more specific definitions, for example a gene (but also a protein and, later on, a metabolite).
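
The “substance” idea from the last question could be sketched roughly as follows. The class and attribute names, as well as the identifiers in the sample data, are purely illustrative; in practice the classes would be generated by O2DBI, and only the reference from a gene to Region::CDS is taken from the current definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    """Anything we can hold expression or regulation data on."""
    name: str
    external_ref: str  # reference into the exporting database


@dataclass(frozen=True)
class Gene(Substance):
    """A gene; external_ref would point to a Region::CDS object."""


@dataclass(frozen=True)
class Protein(Substance):
    """A protein, e.g. imported from ProDB."""


@dataclass(frozen=True)
class Metabolite(Substance):
    """A metabolite (planned for later on)."""


@dataclass(frozen=True)
class Measurement:
    substance: Substance  # any subclass is accepted here
    experiment: str
    regulation: str       # "up", "down" or "same"


# Made-up identifiers, for illustration only.
measurements = [
    Measurement(Gene("geneA", "Region::CDS/1"), "experiment 1", "up"),
    Measurement(Protein("proteinA", "ProDB/7"), "experiment 2", "down"),
]

# The search layer can treat all substances alike, or filter by kind.
genes_only = [m for m in measurements if isinstance(m.substance, Gene)]
print([m.substance.name for m in genes_only])  # ['geneA']
```

The regulation data then refers to a substance in general, so adding metabolites later would not require changes to the measurement or search classes.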