SAMSWiki/WebDocumentation/IntroDuction

From BRF-Software

Introduction to the "Sequence Analysis and Management System" (SAMS) and general concepts

Every genome project generates thousands of ESTs or shotgun reads. Roughly 70% of all sequences in GenBank/EMBL/DDBJ are ESTs (Expressed Sequence Tags), which are generated by reverse transcribing mRNAs into complementary cDNAs and then performing single-pass sequencing on those cDNAs. ESTs thus represent segments of DNA (a few hundred nucleotides long) that code for an mRNA, and they are a fast, inexpensive way to determine which genes are being actively transcribed in a tissue at a given stage of development. Other common experimental methods for sequence generation include Sequence-Tagged Sites (STS), used to derive physical maps in genome construction, and Genome Survey Sequences (GSS), short random sequences commonly used to quickly sample the type of DNA sequences that could be found in a genome.

Users are therefore highly interested in a first look at the DNA sequence content of the individual reads before they are assembled (in the case of shotgun reads) or clustered (in the case of ESTs). Several steps are necessary to provide the researcher with high-quality sequences as well as an overview of their content. For all these purposes we have implemented some additional extensions to GenDB within the SAMS system. SAMS is a simple open source system, easy to install and maintain, that provides the mechanisms to run a variety of tools on each read/EST and presents the results in a web form. The EST pipeline consists of four parts (figure 1.1):

  • /PreProcessing
  • /ClusTering
  • /AssemBly
  • /AnaLysis

Preprocessing begins with importing the sequences, followed by user-defined quality clipping, vector clipping against a standard database or custom sequences, and repeat masking. In the second part, the pipeline clusters the ESTs with user-specified parameters. The assembly is then carried out, generating Tentative Consensus Sequences (TCs). Finally, the user can run the automatic annotation of the ESTs and TCs, which launches BLAST searches against several databases, such as KEGG and SwissProt, as well as InterPro and Pfam. Once the annotation is done, the user can go through it manually to check it for correctness.
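To make the order of the preprocessing steps concrete, here is a minimal Python sketch of quality clipping followed by vector clipping on a single read. It is an illustration only and not the SAMS implementation: the `Read` class, the function names, the default threshold of 20, and the exact-prefix vector match are all assumptions chosen for brevity. Real pipelines typically rely on dedicated tools for these steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """Hypothetical container for one read/EST (not a SAMS class)."""
    name: str
    bases: str
    quals: List[int]  # one Phred-style quality score per base


def quality_clip(read: Read, min_qual: int = 20) -> Read:
    """Trim bases from both ends until a base with score >= min_qual is reached."""
    start = 0
    end = len(read.quals)
    while start < end and read.quals[start] < min_qual:
        start += 1
    while end > start and read.quals[end - 1] < min_qual:
        end -= 1
    return Read(read.name, read.bases[start:end], read.quals[start:end])


def vector_clip(read: Read, vector: str) -> Read:
    """Remove a vector sequence that occurs as an exact prefix of the read.

    Assumption: an exact prefix match stands in for a real similarity
    search against a vector database.
    """
    if vector and read.bases.startswith(vector):
        n = len(vector)
        return Read(read.name, read.bases[n:], read.quals[n:])
    return read


def preprocess(read: Read, vector: str, min_qual: int = 20) -> Read:
    """Apply quality clipping first, then vector clipping, as in the text."""
    return vector_clip(quality_clip(read, min_qual), vector)


if __name__ == "__main__":
    raw = Read("est1", "NNGATCACGTACGTNN", [5, 5] + [30] * 12 + [5, 5])
    clean = preprocess(raw, vector="GATC")
    print(clean.name, clean.bases, len(clean.quals))
```

Keeping each step as a function that takes a read and returns a new read means further steps, such as repeat masking, could be chained in the same way without changing the earlier ones.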

attachment: file.png (Figure 1.1: EST pipeline)