As the number of sequenced genomes grows faster than Moore’s law, biology will provide numerous cyber challenges in the next 10-15 years.
Biology is using more and more computational resources as it turns into a data-driven discipline. Most notably, the emergence of genome and post-genome technology has made vast amounts of data available, all of which demands analysis. Hundreds of bacterial (more precisely, prokaryotic) genomes are available today and have already proven to be a very valuable tool for many applications. A prominent example is the reconstruction of the metabolic pathways1 of several bacterial organisms. The analysis of the rising number of genomes is already an application of cyber technologies2 and is to some extent limited by the available cyber resources. As more data becomes available, this trend is likely to continue.
An important factor in this equation is that the number of available complete genomic sequences is doubling almost every 12 months3 at the current state of technology, whereas according to Moore’s law, available compute cycles double only every 18 months. The analysis of genomic sequences requires serious computational effort: most analysis techniques require binary (pairwise) comparison of genomes or of the genes within genomes. Since the number of binary comparisons grows as the square of the number of sequences involved, the computational overhead of the sequence comparisons alone will become staggering. Whether we are trying to reconstruct the evolutionary history of a set of proteins, to characterize the shape they fold into, or to determine correspondences between genes in distinct genomes, we are often using these binary operations, and the cost is rapidly climbing.
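The arithmetic behind this argument can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the doubling times (12 and 18 months) are the ones quoted above, the function names are hypothetical, and the workload is approximated as the square of the data volume, which ignores constant factors and any algorithmic shortcuts.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def growth_factor(years: float, doubling_months: float) -> float:
    """Factor by which a quantity grows after `years` given its doubling time."""
    return 2.0 ** (12.0 * years / doubling_months)


def workload_gap(years: float,
                 data_doubling_months: float = 12.0,
                 compute_doubling_months: float = 18.0) -> float:
    """Ratio of all-against-all comparison work to available compute.

    Assumes the work scales with the square of the number of sequences
    (the large-n limit of pairwise_comparisons) and that both quantities
    start from a ratio of 1.
    """
    data = growth_factor(years, data_doubling_months)
    compute = growth_factor(years, compute_doubling_months)
    return data ** 2 / compute


if __name__ == "__main__":
    for years in (3, 6, 9):
        print(f"after {years} years the gap has grown by a factor of "
              f"{workload_gap(years):.0f}")
```

Under these assumptions, three years bring eight times as many sequences and roughly 64 times as many pairwise comparisons, but only four times the compute, so the shortfall grows by a factor of about 16.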
Today, traditional research teams in bioinformatics either rely entirely on resources provided by institutions like the National Center for Biotechnology Information (NCBI)4 for sequence analysis purposes5 or build up their own local resources. The NCBI provides services including comprehensive sequence databases and online sequence comparison via a browser interface. Researchers with private compute resources have the advantage of running algorithms of their choosing on their own machines. However, to keep up with the data flood, they must either accept long waiting times or continue to invest in cluster resources to meet their growing sequence analysis needs.
However, as the number of available sequences grows, the number of algorithms available for their analysis also increases. Today, numerous bioinformatics techniques exist or are being developed that use considerably more computational power than the traditionally used BLAST algorithm and that yield different results for sequence comparison. Over the last five years, the influx of machine learning techniques most notably has led to an increased consumption of compute cycles in computational biology.
When researchers began to use Markov models to search for sequence similarities not visible with BLAST, and also began building databases of common sequence motifs represented as hidden Markov models (e.g. HMMer or InterPro), the CPU requirements increased dramatically. A BLAST search against the NCBI’s comprehensive, non-redundant collection of known proteins can be run in a matter of minutes for several hundred query sequences, either locally or on NCBI’s BLAST server (remember that a single genome contains thousands of genes). In contrast, no resource exists that allows querying several hundred (let alone several thousand) proteins for protein motifs using the European Bioinformatics Institute’s (EBI) InterPro tool.
Today, few resources exist outside TeraGrid6 that could provide the computational power needed to run a comprehensive protein motif search for more than a few complete bacterial genomes. Only a massive, high-performance computing resource like TeraGrid can provide the CPU-hours that will be required for this and other future challenges stemming from the increasing amount of sequence data.