Biological Databases

Bioinformatics is the branch of science that applies information technology and computer science to molecular biology. The term was coined by Paulien Hogeweg and Ben Hesper in 1970 to describe the study of informatic processes in biotic systems.

Since the late 1980s, the primary use of bioinformatics has been in genomics and genetics, particularly in areas involving large-scale DNA sequencing. Bioinformatics now involves the creation and advancement of databases, algorithms, and computational and statistical techniques. The most common activities include mapping and analyzing DNA and protein sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

The major aim of bioinformatics is to increase the understanding of biological processes. It focuses on developing and applying computationally intensive techniques, such as data mining, machine learning, and visualization, to achieve this goal. Active research areas include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, and the prediction of gene expression and protein-protein interactions.

Biological databases

Biological databases are collections of data derived from the study of cells at the molecular level. They are mainly classified into primary and secondary databases.

Primary Database

Primary databases consist of experimentally derived data, such as nucleotide sequences and three-dimensional structures of molecules. Important primary databases include genome databases, protein databases, and protein-nucleic acid complex databases.

Secondary Database

Secondary databases are derived from the analysis or treatment of primary data, and hold results such as secondary structures, hydrophobicity plots, and domains. Examples include protein pattern and family databases such as PROSITE, PRINTS, Blocks, and Pfam.
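
To make the idea of derived data concrete, the sketch below computes a simple hydrophobicity profile from a protein sequence, the kind of result a secondary database might store. It assumes the Kyte-Doolittle hydropathy scale and a sliding-window average; the function name `hydropathy_profile`, the window size, and the example peptide are illustrative choices, not taken from any particular database.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (one-letter codes). Positive values are more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Return the mean hydropathy of each sliding window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide, used only for illustration.
    profile = hydropathy_profile("MKTLLILAVVAAALA", window=5)
    for position, score in enumerate(profile, start=1):
        print(position, round(score, 2))
```

Plotting these window averages against position gives the hydrophobicity plot; the primary sequence is the only input.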

The EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database is Europe's primary nucleotide sequence resource. Its main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects, and patent applications. The database is produced in an international collaboration with GenBank and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.


GenBank

GenBank is an open-access sequence database: an annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced at the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration. GenBank receives sequences produced in laboratories throughout the world and contains over 65 billion nucleotide bases.
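
As a rough illustration of what a "protein translation" of a nucleotide sequence means, the sketch below translates a DNA coding sequence using the standard genetic code. The names `CODON_TABLE` and `translate`, the `to_stop` flag, and the example sequence are assumptions made for this example; this is not how GenBank itself generates its annotations.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=True):
    """Translate a DNA coding sequence, reading codons from the first base.

    Unknown codons (for example, those containing N) become "X".
    Trailing bases that do not form a full codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*" and to_stop:
            break  # stop at the first stop codon
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up coding sequence, used only for illustration.
    print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
```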

Protein Database

A variety of protein sequence databases exist, ranging from simple sequence repositories to universal databases. Simple repositories store data with little or no manual intervention in the creation of the records, while universal databases are curated and cover all species. Curated databases contain the original sequence data enhanced by the manual addition of further information to each sequence record. Such databases are expected to play an increasingly important role as central, comprehensive resources of protein information. UniProt, the Universal Protein Knowledgebase, is one such database.

PIR – The International Protein Sequence Database

It is a non-redundant, annotated database of protein sequences and related information. A variety of biological information is provided by PIR including protein function, homology information, and sequence-related information.

MIPS-GSF, Munich Information Center for Protein Sequences

MIPS supports both national and European sequencing and functional analysis projects. It develops and maintains automatically generated and manually annotated genome-specific databases. Its main goals are to develop systematic classification schemes for the functional annotation of protein sequences and to provide tools for the comprehensive analysis of protein sequences.

Madison Metabolomics Consortium Database (MMCD)

The MMCD is maintained by the National Magnetic Resonance Facility at Madison. This database is a resource for metabolomics research based on nuclear magnetic resonance spectroscopy and mass spectrometry. Its role is to support the identification and quantification of metabolites present in biological samples. Each metabolite entry is supported by information in an average of 50 separate data fields, which provide the chemical formula, names and synonyms, structure, and physical and chemical properties.


PROSITE

PROSITE consists of documentation entries describing protein domains, families, and functional sites, as well as the associated patterns and profiles used to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns that increases their discriminatory power by providing additional information about functionally and structurally critical amino acids. Other databases of this kind include PRINTS, Blocks, and Pfam. They include varying levels of annotation describing the protein families they cover and technical details concerning how each pattern was derived.
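
To show how a sequence pattern of this kind can be used to find a functional site, the sketch below converts a pattern written in PROSITE's pattern syntax into a Python regular expression and scans a sequence with it. The helper names `prosite_to_regex` and `find_sites` and the test peptide are hypothetical; the converter covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors) and is not PROSITE's own scanning software.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a regular expression.

    Handles: x (any residue), [ABC] (any of), {ABC} (none of),
    e(n) and e(n,m) (repeats), < and > (sequence start and end).
    Anchors nested inside square brackets are not supported.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B, or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x is the wildcard; residues are uppercase.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern; the peptide is made up for illustration.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_sites("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))
```

In the example peptide, the asparagine at position 3 matches the pattern, while the one at position 8 does not because it is followed by a proline.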

Sequence Alignment

Sequence alignment is a method of arranging DNA, RNA, or protein sequences to identify regions of similarity. The aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
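
The way gaps bring matching residues into the same column can be sketched with a small global alignment routine. The example below uses the Needleman-Wunsch dynamic programming method with a simple scoring scheme; the function name `global_align` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and real tools use substitution matrices and more elaborate gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


if __name__ == "__main__":
    row_a, row_b, total = global_align("ACGT", "AGT")
    print(row_a)
    print(row_b)
    print("score:", total)
```

For the two short sequences in the example, the second row comes out as `A-GT`: a single gap lets the remaining three residues line up in identical columns.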

BLAST – Basic Local Alignment Search Tool

BLAST finds regions of local similarity between sequences by comparing a query sequence against sequence databases. It is used to infer functional and evolutionary relationships between sequences and to help identify members of gene families. The FASTA programs similarly find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence.
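
The notion of local similarity can be illustrated with the Smith-Waterman dynamic programming method, which scores the best-matching region shared by two sequences. This is an exact method, not the faster heuristic search that BLAST and FASTA use against large databases; the function name `local_alignment_score` and the scoring values are assumptions for this sketch.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below zero, so a new local region
            # can start anywhere in either sequence.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    # Two made-up sequences that share only the region ACGT.
    print(local_alignment_score("TTTACGTTT", "GGACGTGG"))
```

Because the score is reset to zero wherever the alignment stops paying off, unrelated flanking regions do not penalize the shared region, which is what distinguishes a local from a global comparison.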

ClustalW2 and Clustal X

ClustalW2 is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences: it calculates the best match for the selected sequences and lines them up so that the identities, similarities, and differences can be seen.
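
Once sequences are lined up, identities can be read off column by column. The sketch below takes rows that are already aligned and builds a marker line with `*` under every fully conserved column, in the spirit of the marker line printed under Clustal alignments. The function name `identity_line` and the example rows are hypothetical, and the sketch does not perform the alignment itself or mark similar (non-identical) residues.

```python
def identity_line(aligned_rows):
    """Return a string with '*' under columns where all rows share one residue."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*aligned_rows):
        residues = set(column)
        # A column counts as conserved only if it holds one residue and no gap.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Made-up, pre-aligned rows used only for illustration.
    rows = ["ACG-T", "ACGAT", "ATGAT"]
    for row in rows:
        print(row)
    print(identity_line(rows))
```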