TOOLS FOR BIOINFORMATICS AND BIOTECHNOLOGY

GENOMICS

Many of the tools that one needs for the analysis of genomes can be found in the DNA Sequence Analysis section. Here we have unique tools for genomic analysis which do not fit easily in that section.

DNA sequencing:

The DNA Sequence Quality Machine at IFOM - Phred (The FIRC Institute for Molecular Oncology, Italy)  - provides base calling, chromatogram display and high quality sequence region evaluation and presentation for up to five sequences simultaneously.  For further information on Phred see here.

Sequence assembly - you don't need your own contig assembly program when you can use:

CAP online at Infobiogen (France)
CAP3 (PBIL, France),
CAP EST Assembler (Istituto FIRC di Oncologia Molecolare, Italy)
Divide-and-Conquer Multiple Sequence Alignment (Universitat Bielefeld, Germany)

Sequencing errors - if your DNA sequence doesn't match the expected protein sequence you can check for errors at ERR_WISE or Wise2: Intelligent algorithms for DNA searches (EBI, United Kingdom) or SEQERR - Detection of Frameshift Errors in Coding Regions.

In-silico.com (Dr. Joseba Bikandi & co-workers, Faculty of Pharmacy, University of the Basque Country) - allows in silico experiments including theoretical PCR amplification, AFLP-PCR, restriction analysis and pulsed field gel electrophoresis [PFGE] with bacterial & archaeal genomes found in the public database.

Genome comparisons:

  GeneOrder 2.0 (D. Seto,  Bioinformatics & Computational Biology, George Mason Univ., U.S.A.)  is ideal for comparing small GenBank genomes (up to 0.25 Mb), while GeneOrder 3.0 extends the limits to approx. 2.0Mb. Each gene from the Query sequence is compared to all of the genes from the Reference database using BLASTP. There are two display formats: graphical and tabular. Currently the graph is an applet and must be saved as a "SCREEN SHOT".

CoreGenes  (D. Seto,  Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is designed to analyze two to five genomes simultaneously, generating a table of related genes - orthologs and putative orthologs. These entries are linked to their GenBank data.  It has a limit of 0.35 Mb, while the newer version CoreGenes 2.0 extends the limit to  approx. 2.0Mb. If your data is not present in GenBank use this site.

CoreGenes 3 (D. Seto & P. Mahadevan, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) - tallies the total number of genes in common between the two genomes being compared; displays the percentage of genes in common with a specific genome; and determines the unique genes contained in a pair of proteomes.

  WebACT - this is the web version of ACT (Artemis Comparison Tool) a DNA sequence comparison viewer based on Artemis (Reference: T.J. Carver et al. Bioinformatics 21: 3422 - 3423).   Visit the database page of EMBL-EBI and select EMBL and "Standard Query Form"  to determine the EMBL accession number for the sequence you are interested in.

 WebGMAP - is a public web service for annotating and mapping individual cDNA sequences to the genomes of many eukaryote species, currently including Arabidopsis thaliana, Chlamydomonas reinhardtii, Glycine max, Oryza sativa, Physcomitrella patens and Populus trichocarpa. (Reference: C. Liang et al. 2009. Nucl. Acids Res. 37(Web Server issue):W77-W83)

Genome annotation and/or visualization:

BASys Bacterial Annotation Tool - this incredible tool supports automated, in-depth annotation of bacterial genomic sequences. It accepts raw DNA sequence data and an optional list of gene identification information (Glimmer) and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. (Reference: G.H. Van Domselaar et al. 2005. Nucl. Acids Res. 33(Web Server issue):W455-W459).

ORF (Groningen Biomolecular Sciences and Biotechnology Institute, Haren, the Netherlands) - offers a choice of Glimmer, ZCurve or GeneMark predictions coupled with GenBank or Fasta-formatted output. Works very well and quickly with phage-sized genomes.

BAGEL (Groningen Biomolecular Sciences and Biotechnology Institute, Haren, the Netherlands) - determines the presence of bacteriocins in an existing or unsubmitted GenBank file, using a database of known bacteriocins and of adjacent genes involved in bacteriocin activity.

MICheck (MIcrobial genome Checker) - enables rapid verification of sets of annotated genes and frameshifts in previously published bacterial genomes, or genomes for which the user has a *.gbk file. This tool can be seen as a preliminary step before the functional re-annotation step to check quickly for missing or wrongly annotated genes. It worked nicely with phage genomes from 43-135kb. (Reference: S. Cruveiller et al. 2005. Nucl. Acids Res. 33: W471- W479).

RibEx: Riboswitch Explorer - scans <40kb DNA for potential genes (which are linked to BLASTP) and several hundred regulatory elements, including riboswitches. If you click on "search for attenuators" it finds terminators and antiterminators. It presents the calculated genes and permits BLAST analysis at NCBI (Reference: C. Abreu-Goodger & E. Merino. 2005. Nucl. Acids Res. 33: W690-W692).

TransTerm (Michael Nuhn, Nano+Bio-Center) - TransTerm searches for rho-independent terminators in the vicinity of annotated genes. This TIGR program can be accessed online in two ways. If you have the genome in GenBank format, use this program, since it will only look for terminators in the vicinity of the annotated genes. If the genome has not been annotated, use this site. The latter site combines Glimmer and RBSfinder with TransTerm.

Prophage Finder - this tool predicts potential prophage loci in prokaryotic genome sequences.  However, it does not make any predictions as to whether the identified prophage is functional and it is also important to note the identified prophage region will most likely not represent the entire prophage. (Reference: Bose, M. & Barber, R. 2006.  In Silico Biol. 6: 0020).

tRNAs: tRNAscan-SE (University of California at San Diego, U.S.A.) and FAStRNA (N. El-Mabrouck, Pasteur Institute, Paris, France). The former site is incredibly sensitive & also provides secondary structure diagrams of the tRNA molecules. Alternatively use ARAGORN (Reference: Laslett, D. & Canback, B. 2004. Nucleic Acids Research 32:11-16).
Test sequences.

 CRISPRfinder  Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) present a curious repeat structure found in many prokaryotic genomes. They show characteristics of both tandem and interspaced repeats. (Reference: I. Grissa et al. 2007. Nucl. Acids Res. 35(Web Server issue): W52-W57).

 GenomeVx - makes editable, publication-quality maps of mitochondrial and chloroplast genomes and of large plasmids. These maps show the location of genes and chromosomal features as well as a position scale. The program takes as input either raw feature positions or GenBank records. In the latter case, features are automatically extracted and colored, an example of which is given. Output is in the Adobe Portable Document Format (PDF) and can be edited by programs such as Adobe Illustrator. (Reference: G. Conant & K. Wolfe. 2008. Bioinformatics 24:861-862)

 LTR_Finder - is an efficient program for finding full-length LTR retrotransposons in genome sequences. The size of the input file is now limited to 50MB (Reference: Z. Xu & H. Wang. 2007. Nucl. Acids Res. 35(Web Server issue): W265-W268).
 RTAnalyzer - finds retrotransposons and detects L1 retrotransposition signatures (Reference: J-F. Lucier et al. 2007. Nucl. Acids Res. 35(Web Server issue):W269-W274).

 MobilomeFINDER - web-based tools for in silico and experimental discovery of bacterial genomic islands (Reference: H-Y. Ou et al. Nucl. Acids Res. 35(Web Server issue):W97-W104)

 FancyGene - is a fast and user-friendly web-based tool for producing images of one or more genes directly on the corresponding genomic locus. Starting from a variety of input formats, FancyGene rebuilds the basic components of a gene (UTRs, intron, exons). Once the initial representation is obtained, the user can superimpose additional features—such as protein domains and/or a variety of biological markers—in specific positions. (Reference: D. Rambaldi & F.D. Ciccarelli. 2009. Bioinformatics 25: 2281-2282)

 Synthetic genes:

  GeneDesign - is an excellent resource for designing synthetic genes. It includes tools for codon optimization and removal of restriction sites (Reference: Richardson, S.M. et al. 2006. Genome Research 16:550-556)

 Metagenomics:

 Orphelia - Orphelia is a metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin. Orphelia is based on a two-stage machine learning approach introduced by its authors. After the initial extraction of ORFs, linear discriminants are used to extract features from those ORFs. Subsequently, an artificial neural network combines the features and computes a gene probability for each ORF in a fragment. A greedy strategy computes a likely combination of high scoring ORFs with an overlap constraint. (Reference: K.J. Hoff et al. 2009. Nucl. Acids Res. 37(Web Server issue):W101-W105)

 

CONVERT

Several sites are available for conversion of sequence from one format to another.  These include:

 Readseq (D.G. Gilbert, Univ. Indiana, U.S.A.) which is also accessible here.

 JaMBW (European Molecular Biology Laboratory of Heidelberg, Germany). Java based Molecular Biologist's Workbench. Select Chapter 1 for sequence format conversion (upper <---> lower case; T <---> U; reverse or complement sequence).

 Nucleic Acid Sequence Massager  (Allotron Biosensor Corporation) which in addition to removing spurious material (numbers, breaks, HTML, spaces) changes the format (upper to low case, complement, reverse, RNA to DNA, and triplets). 

 gbk2ptt (A. Villegas, Public Health Agency of Canada) - this will convert a GenBank flat file (*.gbk) to an NCBI Protein Table (*.ptt) file.  The latter is a tab-delimited table of protein features.

 gbk2faa (A. Villegas, Public Health Agency of Canada) - this will convert a GenBank flat file (*.gbk) to a FASTA file including the coding sequences (CDS) translated into amino acids (*.faa).

 gbk2fna (A. Villegas, Public Health Agency of Canada) - this will convert a GenBank flat file (*.gbk) to a FASTA file of the whole genome (a single sequence; *.fna)

 gbk2ffn (P. Konczy, Public Health Agency of Canada) - this will extract from a GenBank flat file (*.gbk) the DNA sequence of each gene, presented in FASTA format (*.ffn).  The program will also extract the features of your gbk file in EXCEL format (coordinates, strand (+/-), length of gene in nt, gene name, description, and any notes associated with the description).  N.B. this program cannot deal with genes whose location is split into joined segments, designated as follows: 125...250 join 500..725.

 gbk2sqn (A. Villegas, Public Health Agency of Canada) - this will convert a GenBank flat file (*.gbk) to an NCBI Sequin submission (*.sqn) file.  This program was designed to convert data generated in Kodon (Applied Maths, Austin, TX) to Sequin format.  N.B. If using the "Bacterial and Plastid" genetic code please note that the translations of certain CDS will appear as /translation="-XXX...." In Sequin, select the "Bacterial and Plastid" genetic code and retranslate so that they appear as /translation="MXXX...."

 Convert GenBank to Fasta (G. Rocap, School of Oceanography, University of Washington, U.S.A.) - Select a GenBank formatted file containing a feature table. Select whether to extract translated peptide sequences, the DNA sequence for each feature, or the entire DNA sequence of the whole record. If you choose "Peptide Sequence", your feature table must have "translation" sub-features.

    FeatureExtract - this very useful service extracts sequence and feature annotation, such as intron/exon structure, from GenBank entries and other GenBank format files. (Reference: R. Wernersson.  2005. Nucl. Acids Res. 33 Web Server issue W567-W569).  Also possible is extraction of 5' and 3' sequences.
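The simplest of the conversions listed above, GenBank flat file to a whole-genome FASTA file (as gbk2fna does), can be sketched in a few lines of Python. This is only an illustration and not the code behind any of the listed services: the function name genbank_to_fasta is hypothetical, it assumes a single-record file, and it takes the sequence name from the LOCUS line and the bases from the ORIGIN block.

```python
def genbank_to_fasta(gbk_text, width=60):
    """Convert the text of a single-record GenBank flat file to FASTA.

    Assumptions of this sketch: one record per file, the name is the
    second field of the LOCUS line, and the sequence is everything
    between the ORIGIN line and the terminating // line.
    """
    name = "unknown"
    seq_parts = []
    in_origin = False
    for line in gbk_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Drop the position numbers and spaces, keep only the letters.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(seq_parts).upper()
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```

Extracting individual genes or translations (as gbk2ffn and gbk2faa do) additionally requires parsing the FEATURES table, including joined locations, which this sketch does not attempt.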

   Sequence editor - carries out numerous functions:

 Antiparallel - Create the antiparallel DNA or RNA strand. For example the sequence ATGC will be converted into GCAT. It combines the Complement and Inverse functions.
 Complement - Create the complement DNA or RNA strand. For example the sequence ATGC will be converted into TACG.
 Inverse - Create the inverse DNA or RNA strand. For example the sequence ATGC will be converted into CGTA.
 T to U - Replace all thymine (T) by uracil (U). For example the sequence ATUGC will be converted into AUUGC.
 U to T - Replace all uracil (U) by thymine (T). For example the sequence ATUGC will be converted into ATTGC.
 UCase - Convert the sequence into upper case.
 LCase - Convert the sequence into lower case.
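
The functions listed above are simple string operations. A minimal Python sketch follows; the function names are chosen here for illustration and are not taken from the Sequence editor itself, and degenerate (IUB) bases are left unchanged by the complement.

```python
# Translation tables for complementing DNA and RNA (case is preserved).
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
_RNA_COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")

def complement(seq, rna=False):
    """Complement each base without changing the order."""
    return seq.translate(_RNA_COMPLEMENT if rna else _DNA_COMPLEMENT)

def inverse(seq):
    """Reverse the order of the bases without complementing them."""
    return seq[::-1]

def antiparallel(seq, rna=False):
    """Reverse complement: Complement followed by Inverse."""
    return inverse(complement(seq, rna))

def t_to_u(seq):
    """Replace every T by U (DNA to RNA alphabet)."""
    return seq.replace("T", "U").replace("t", "u")

def u_to_t(seq):
    """Replace every U by T (RNA to DNA alphabet)."""
    return seq.replace("U", "T").replace("u", "t")
```

With these definitions, antiparallel("ATGC") gives GCAT, complement("ATGC") gives TACG and inverse("ATGC") gives CGTA, matching the examples in the list.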

Sequence Shuffling - (Arizona Research Labs) In some cases (BLAST, M-Fold) one might want a randomized sequence to compare with one's own. Other sites include: Shuffle DNA and Sequence Randomizer
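
For readers who prefer to randomize a sequence locally, the following Python sketch shuffles the bases while keeping the overall base composition unchanged. It is an assumption that this matches what the listed sites do; some randomizers also preserve dinucleotide frequencies, which this simple version does not. The function name and the optional seed argument are illustrative.

```python
import random

def shuffle_sequence(seq, seed=None):
    """Return seq with its characters in random order.

    The mononucleotide composition is preserved exactly; higher-order
    (e.g. dinucleotide) composition is not.
    """
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)
```

Comparing a BLAST or folding result for the real sequence with results for several shuffled copies gives a rough idea of what to expect by chance from composition alone.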

COMPOSITION

IUB (Degenerate Bases) Code Table

IUB Code    Bases
N           A,C,G,T
V           G,A,C
B           G,T,C
H           A,T,C
D           G,A,T
K           G,T
S           G,C
W           A,T
M           A,C
Y           C,T
R           A,G
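
One practical use of the code table is to turn a degenerate consensus into a pattern that can be searched for in a sequence. The Python sketch below does this with a regular expression; the dictionary simply restates the table, and the function name is illustrative.

```python
import re

# IUB code -> the bases it stands for (restating the table above).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "V": "ACG", "B": "CGT", "H": "ACT", "D": "AGT",
    "K": "GT", "S": "CG", "W": "AT", "M": "AC", "Y": "CT", "R": "AG",
}

def degenerate_to_regex(motif):
    """Translate an IUB-coded motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUB[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_motif(motif, seq):
    """Return the 0-based start positions of all (overlapping) matches."""
    # The lookahead lets the search report overlapping occurrences.
    pattern = re.compile("(?=" + degenerate_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

For example, the motif TTR becomes the pattern TT[AG]. Only the given strand is searched; to cover both strands, search the reverse complement as well.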

VecScreen (National Center for Biotechnology Information) - screens your DNA sequence for potential vector sequence.  Well worth running before doing any other analysis.

Base composition - consider WORDCOUNT (Pasteur Institute, France) which gives one the option of choosing the "word size", and GEMS (Genomatix, Germany).  The latter provides a nice output of mono-, di- and trinucleotide frequencies. Select "create statistics" and "start task" to get to the sequence entry page.
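
Counting "words" of a chosen size is straightforward to reproduce locally. The Python sketch below counts overlapping words, so word size 1, 2 and 3 give mono-, di- and trinucleotide counts. Whether WORDCOUNT and GEMS count overlapping or non-overlapping words is not stated here, so the overlapping choice is an assumption, and the function names are illustrative.

```python
from collections import Counter

def word_counts(seq, word_size):
    """Count every overlapping word of length word_size in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))

def word_frequencies(seq, word_size):
    """Return each word's share of all words counted (values sum to 1)."""
    counts = word_counts(seq, word_size)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```

For the sequence ATATA and word size 2 the words are AT, TA, AT, TA, so each of AT and TA is counted twice.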

Compositional heterogeneity - Graphe:ADN riche en: (Atelier BioInformatique l'Université de Provence, France) N.B. In French but obvious (Soumettre = Submit). Presents in graphic format AT, GC or single base enrichment in the sequence.
Graph DNA: DNA Skew Graphing (Viral Bioinformatics Resource Center, University of Victoria, Canada) - this Java applet performs DNA walks, purine, AT and GC skews on small (<1 Mb) genomes. Alternative locations for cumulative GC skew are the GC  Skew Tool (University of Pittsburgh, U.S.A.), and GenSkew (Munich Information Center for Protein Sequences, Germany). In the first two cases one can only analyze ca. 30 kb of DNA sequence.

Z curve (Centre of BioInformatics, Tianjin University, China) - produces a unique three-dimensional curve representation for a given DNA sequence, which is composed of three components (x_n, y_n and z_n):

· the x-component of a Z curve, x_n, displays the distribution of purine/pyrimidine (R/Y) bases along the sequence;

· the y-component of a Z curve, y_n, displays the distribution of amino/keto (M/K) bases along the sequence;

· the z-component of a Z curve, z_n, displays the distribution of strong-H bond/weak-H bond (S/W) bases along the sequence.
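
The three components can be computed from running base counts. The Python sketch below uses the usual Z curve definition, in which, after the first n bases, x_n = (A_n + G_n) - (C_n + T_n), y_n = (A_n + C_n) - (G_n + T_n) and z_n = (A_n + T_n) - (G_n + C_n), where A_n, C_n, G_n and T_n are the cumulative counts of each base. The sign convention (weak minus strong for z_n) is assumed here and not taken from the site itself, and characters other than A, C, G and T are skipped.

```python
def z_curve(seq):
    """Return the list of (x_n, y_n, z_n) points for a DNA sequence."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            continue  # ambiguous bases and gaps contribute no point
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        x = (a + g) - (c + t)  # purine minus pyrimidine
        y = (a + c) - (g + t)  # amino minus keto
        z = (a + t) - (g + c)  # weak minus strong hydrogen bonding
        points.append((x, y, z))
    return points
```

Plotting the three coordinates against n (or as a curve in three dimensions) gives the representation described above.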

DNA base composition analysis tool  (J. Zheng, Queen's University, Canada)  - This program can analyze a 30 kb DNA sequence in three different ways. It computes the percentage of one or two selectable nucleotide(s), the normal skew of two selectable nucleotides, and the cumulative skew of two selectable nucleotides for a given sequence. The result can be displayed in both graphic and value data format.
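
The skew calculations mentioned above can be sketched as follows. The skew of two nucleotides X and Y in a window is taken here as (X - Y)/(X + Y), the common definition of GC skew when X = G and Y = C, and the cumulative skew is the running sum of the window values. The window size, the use of non-overlapping windows and the function names are assumptions made for this illustration, not details of any of the tools listed.

```python
from itertools import accumulate

def window_skew(seq, first="G", second="C", window=1000):
    """Skew (first - second) / (first + second) in consecutive windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        a = chunk.count(first)
        b = chunk.count(second)
        # A window containing neither base is given a skew of zero.
        skews.append((a - b) / (a + b) if a + b else 0.0)
    return skews

def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    return list(accumulate(skews))
```

In many bacterial genomes the cumulative GC skew reaches its minimum near the replication origin and its maximum near the terminus, which is why these plots are popular for small genomes.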

Sequence Shuffling - (Arizona Research Labs) In some cases (BLAST, M-Fold) one might want a randomized sequence to compare with one's own.

 JaMBW (European Molecular Biology Laboratory of Heidelberg, Germany). Java based Molecular Biologist's Workbench. Select Chapter 1 for sequence format conversion (upper <---> lower case; T <---> U; reverse or complement sequence).  N.B. Also check out Chapter 5 "Buffer Calculator."  Another site offering a variety of output styles (MSF, Phylip, Fasta, GCG etc.) is ReadSeq (Pasteur Institute, France).  N.B. this serves as a complement to the former site.

 DSHIFT - a web server for predicting DNA 1H, 13C & 31P chemical shifts (Reference: S.L. Lam. 2007. Nucl. Acids Res. 35(Web Server issue): W713-W717)

DNA MOTIFS

While one can use established lists of motifs to search one's DNA sequence, one can also discover them directly. In order to do this one has to derive a consensus sequence or probability matrix.  In the case of bacterial proteins for which the binding sites have been determined, a good place to start is the E. coli DNA-Binding Site Matrices (A.M. McGuire, Harvard University, U.S.A.).  The following sites provide one with a training set which can be used to derive a Gibbs screening matrix.

See additional pages on Promoters, Terminators, and Transcriptional Factors.

An assessment of a set of motif identifiers can be found in Nature Biotechnology, 2005, 23(1):137-144.

Gibbs Motif Sampler Homepage (E.C. Rouchka and B. Thompson, Bioinformatics Laboratory of Wadsworth Center, U.S.A.) - I have linked to the prokaryotic DNA default setting page. On the next page I have presented data for the IHF-binding site (consensus: WWWTCAA[N4]TTR).

RSA-tools - Gibbs (A. Neuwald & Jacques van Helden, Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, Belgium) - type in the matrix size desired and deselect "add reverse complement strand."  After running the program once I would delete those sequences from the discovery set which align imperfectly.

 BindGene (C. Lockwood, University of Manchester, United Kingdom) - I have found this site particularly useful.  If your sequence is less than 2kb use the default settings.  If you paste 10kb of data, I suggest changing the "shuffled matrices" to "0."

 TESS - String Search Page (Center for Bioinformatics, University of Pennsylvania, U.S.A.) - this site requires that one enter the motif consensus sequence (Search My Site Strings), and is limited to 2000 nt per search. N.B. This site also permits searching TRANSFAC Strings. Choose "small Java applets" to view the results of the search.

Create Matrix File (J. Zheng, Queen's University, Canada) - creates a matrix from a DNA Clustal alignment and also presents the consensus:

Number of sequences: 11
Length of alignment: 29
Consensus sequence representing: 80% matching base(s)

A 0  9 11 10 0 4 1 1 2 1 0 1 2 5 1 2 2 2 1 2 1  0  0  11 0  10 8 3 4 
C 0  1 0  0  2 0 1 2 2 5 7 5 3 1 4 4 6 3 1 0 10 11 0  0  1  0  0 4 3 
G 0  0 0  1  8 1 0 0 3 3 1 2 4 2 1 2 3 3 3 9 0  0  11 0  0  1  1 0 2 
T 11 1 0  0  1 6 9 8 4 2 3 3 2 3 5 3 0 3 6 0 0  0  0  0  10 0  2 4 2 

  T  A A  A  S W T Y D B Y B V D Y H S B K G C  C  G  A  T  A  W H V
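
The step from count matrix to consensus can be illustrated in Python. The rule used below is inferred from the example output above and is not taken from the Create Matrix File program itself: in each column, bases are added in order of decreasing count (ties broken alphabetically) until they account for at least 80% of the sequences, and the resulting set of bases is written as its IUB code. Function names are illustrative.

```python
# IUB code for each set of bases (keys list the bases alphabetically).
IUB_BY_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "AC": "M", "GT": "K", "CG": "S", "AT": "W",
    "ACT": "H", "CGT": "B", "ACG": "V", "AGT": "D", "ACGT": "N",
}

def count_matrix(aligned):
    """Count A, C, G and T in each column of equal-length aligned sequences."""
    length = len(aligned[0])
    counts = {base: [0] * length for base in "ACGT"}
    for seq in aligned:
        for i, ch in enumerate(seq.upper()):
            if ch in counts:  # gaps and other symbols are ignored
                counts[ch][i] += 1
    return counts

def consensus(counts, percent=80):
    """Derive an IUB consensus from a count matrix (inferred rule, see text)."""
    result = []
    for i in range(len(counts["A"])):
        column = {base: counts[base][i] for base in "ACGT"}
        total = sum(column.values())
        if total == 0:
            result.append("N")  # nothing observed in this column
            continue
        # Highest count first; equal counts are taken in alphabetical order.
        ranked = sorted("ACGT", key=lambda base: (-column[base], base))
        chosen, covered = [], 0
        for base in ranked:
            chosen.append(base)
            covered += column[base]
            if covered * 100 >= percent * total:
                break
        result.append(IUB_BY_BASES["".join(sorted(chosen))])
    return "".join(result)
```

Applied to the four count rows shown above with the 80% threshold, this rule gives the same 29-letter consensus as the last line of the example. A different tie-breaking order would change a few positions (for instance column 9, where A and C are tied), so the alphabetical choice should be regarded as a guess.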

DNA Motifs Gibbs Sampler - SeSiMCMC - the Sequence Similarities by Markov Chain Monte-Carlo algorithm finds DNA motifs of unknown length and complicated structure in a set of unaligned DNA sequences. It uses an improved motif length estimator and careful Bayesian analysis of the possibility of a site absence in a sequence. (Reference: A.V. Favorov et al. 2005. Bioinformatics 21: 2240-2245).

PromScan (D.J. Studholme & R. Dixon. 2003. J. Bacteriol. 185:1757-67; as modified by S. Richards, Queen's University, Canada). Scans small genomes for potential factor-binding sites including IHF-binding sites. If a *.ptt file is included the results will indicate the position of the promoter relative to the nearest gene.

FindTerm (Softberry Inc.) - only two tools exist on the internet for mapping rho-independent terminators: FindTerm and TransTerm. You might consider using the advanced feature options and minimally increasing the default energy threshold to -12.0.

TransTerm (Michael Nuhn, Nano+Bio-Center) - TransTerm searches for rho-independent terminators in the vicinity of annotated genes. This TIGR program can be accessed online in two ways. If you have the genome in GenBank format, use "Sequence Analysis" and choose TransTerm, since it will only look for terminators in the vicinity of the annotated genes. If the genome has not been annotated, choose "Annotation" and Glimmer2.02, RBSfinder & TransTerm. The latter option combines Glimmer and RBSfinder with TransTerm.

Tools to find motif clusters in DNA sequences - one should probably start at ZLAB (Dr. Zhiping Weng, Boston University, U.S.A.) which has developed a wide range of tools for studying the interaction between regulatory proteins and their DNA/RNA target sites including:

 Cluster-Buster
 Comet
 Cister

Find short split motifs in DNA sequences with YMF (Reference: Sinha, S. & Tompa, M. 2002. Nucleic Acids Research)
Motif Sampler - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to "uncheck" the appropriate box if you don't want the complementary strand included in the analysis. (Reference: G. Thijs et al. 2002. J. Comput. Biol. 9: 447-464.)
Melina - Motif Elucidator in Nucleotide Sequence Assembly (Human Genome Center, University of Tokyo, Japan) - helps one extract a set of common motifs shared by functionally-related DNA sequences. It utilizes CONSENSUS, GIBBS DNA, MEME and Coresearch, which are considered to be the most progressive motif search algorithms. Each algorithm is supplied with an impressive set of selection parameters.

BioProspector  Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes (Stanford University AI Lab, U.S.A.) - uses a Gibbs sampling strategy to examine the upstream region of genes in the same gene expression pattern group and looks for regulatory sequence motifs. BioProspector uses Markov background to model the base dependencies of non-motif bases, which greatly improved the specificity of the reported motifs.
BioOptimizer - is an algorithm designed to clean up the motifs found by BioProspector, Consensus, AlignACE & MEME by finding the configuration of motif start sites that maximizes a scoring function (Reference: S.T. Jensen & J.S. Liu 2004 Bioinformatics 20:1557-1564).

 SCOPE (Suite for Computational identification Of Promoter Elements), an ensemble of programs aimed at identifying novel cis-regulatory elements from groups of upstream sequences. (Reference: J.M. Carlson et al. 2007