Specifications for the Next-Generation Computational Biology Infrastructure
Eric Jakobsson, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
CTWatch Quarterly
August 2006

All leading-edge research in biology now utilizes computation, as a result of the development of useful tools for data gathering, data management, analysis, and simulation of biological systems. While there is still much to be done to improve these tools, there is also a completely new frontier to attack. The new initiatives will require much more interaction between applications scientists and cyberinfrastructure architects than has previously been the case. The single word that provides a common thread for the initiatives needed in the next few years is Integration, specifically:

Integration of time and length scales of description

Biological systems display important dynamics on time scales ranging from femtoseconds and faster (e.g., interactions with electromagnetic radiation) to billions of years (evolution), and length scales ranging from single atoms to the entire biosphere. Events at all time and length scales are linked to each other. For the most extreme example, the emergence of the photosynthetic reaction center (a protein that couples absorption of photons with synthesis of other biological molecules) over a billion years ago produced as a by-product a major change in the composition of the atmosphere (an increase in oxygen) that profoundly altered the course of biological evolution from that time on. Yet the vast majority of the computational tools that we use to understand biology are specialized to a particular narrow range of time and length scales. We badly need computing environments that facilitate analysis and simulation across time and length scales, so that we may achieve a quantitative understanding of how these scales link to each other.

Integration of informatics, dynamics, and physics-based approaches

There are three core foundations of computational biology: a) information-based approaches, exemplified by sequence-based informatics and correlational analysis of systems biology data; b) physics-based approaches, in which analysis and simulation of biological data are grounded in physical and chemical theory; and c) approaches based on dynamical analysis and simulation, notably exemplified by successful dynamics models in neuroscience, ecology, and viral-immune system interactions. Typically these approaches are developed by different communities of computational biologists and pursued largely independently of each other. There is great synergy, however, when the three approaches are integrated in pursuing solutions to major biological problems. This can be seen notably in molecular and cellular neuroscience. Understanding of the entire field is largely organized around the dynamical systems model first put forth by Hodgkin and Huxley, which also had an underpinning of continuum physical chemistry and electrical engineering theory. Extension of the systems and continuum understanding to the molecular level depended on using informatics to identify crystallizable versions of the membrane proteins underlying excitability. Physics-based computing has been essential to interpreting the structural data and to understanding the relationship between the structures and the function of the excitability proteins. All areas of biology need a comparable synergy among the different types of computing. As a corollary, we need to train computational biologists who can use, and participate in developing, all three types of approaches.
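As a concrete illustration of the dynamical-systems approach, the Hodgkin-Huxley equations can be integrated numerically in a few dozen lines. The sketch below is plain Python with forward-Euler integration and the standard squid-axon parameters; the function names and structure are our illustrative choices, not taken from any particular package.

```python
import math

def _ratio(x, scale):
    """x / (1 - exp(-x/scale)), with its limit `scale` as x -> 0."""
    if abs(x) < 1e-7:
        return scale
    return x / (1.0 - math.exp(-x / scale))

def gate_rates(v):
    """Hodgkin-Huxley gate opening/closing rates (1/ms) at voltage v (mV)."""
    a_m = 0.1 * _ratio(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * _ratio(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n

def simulate(i_ext=10.0, t_total=50.0, dt=0.01):
    """Forward-Euler integration; returns the membrane voltage trace (mV).
    i_ext is a constant injected current density (uA/cm^2)."""
    c_m, g_na, g_k, g_l = 1.0, 120.0, 36.0, 0.3   # uF/cm^2, mS/cm^2
    e_na, e_k, e_l = 50.0, -77.0, -54.387          # reversal potentials, mV
    v = -65.0
    a_m, b_m, a_h, b_h, a_n, b_n = gate_rates(v)
    # Start the gating variables at their steady-state values at rest.
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    trace = [v]
    for _ in range(int(t_total / dt)):
        i_ion = (g_na * m**3 * h * (v - e_na)
                 + g_k * n**4 * (v - e_k)
                 + g_l * (v - e_l))
        v += dt * (i_ext - i_ion) / c_m
        a_m, b_m, a_h, b_h, a_n, b_n = gate_rates(v)
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        trace.append(v)
    return trace
```

With `i_ext = 10.0` the membrane fires action potentials (the trace crosses 0 mV); with `i_ext = 0.0` it stays near the -65 mV resting potential. This handful of coupled ordinary differential equations is the dynamical core around which the molecular and structural work described above is organized.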

Integration of Heterogeneous Data Forms

The types of data relevant to any particular biological problem are quite varied: literature reports, sequence data, microarray data, proteomics data, a wide array of spectroscopies, diffraction data, time series of dynamical systems, simulation results, and many more. There is a major need for an integrated infrastructure that enables the researcher to search, visualize, analyze, and build models from all of the data relevant to a given problem. The Biology Workbench1 is a notable example of such integration in the specific domain of sequence data. This approach needs to be extended to much more varied and complex data forms.
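Even at the sequence level, integration begins with getting heterogeneous flat-file formats into a common in-memory form so that many tools can operate on them. The toy sketch below is plain Python; the motif shown (the Walker A / P-loop pattern) and the record names are illustrative choices of ours, not part of the Biology Workbench. It parses FASTA text and runs a regular-expression motif scan of the kind a sequence workbench unifies behind one interface.

```python
import re

def parse_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # record id = first word of header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}

def find_motif(seq, pattern):
    """Return 0-based start positions where the regex motif matches."""
    return [m.start() for m in re.finditer(pattern, seq)]

# Example: scan a made-up fragment for the Walker A motif G-x(4)-G-K-[S/T].
demo = ">kinase_fragment demo record\nMAGSS\nGSGKTT\n>other\nACGT\n"
seqs = parse_fasta(demo)
hits = find_motif(seqs["kinase_fragment"], r"G.{4}GK[ST]")
```

Here `hits == [2]`: the motif begins at position 2 of the fragment. The point of an integrated infrastructure is that the same uniform access pattern should extend beyond sequences to spectra, structures, time series, and simulation output.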

Integration of Basic Science with Engineering Design

Biology is different from other basic sciences such as chemistry and physics, in that adaptation for function is an integral part of all biological phenomena. Physical and chemical phenomena have only one type of cause; i.e., the underlying laws of physics. Biological phenomena have two types of cause: 1) the underlying laws of physics, and 2) the imperatives of evolution, which select the actualities of biology out of all the possibilities one could imagine for how living systems are organized and function. In this sense, biological systems are like engineered systems - purpose contributes along with the laws of physics to define the nature of both biological and human-engineered systems. Elaborate and sophisticated computer-aided design (CAD) systems have been developed to guide the creation of human-engineered devices, materials, and systems. The principles of CAD systems (optimization, network analysis, multiscale modeling, etc.) need to be incorporated into our computational approaches to understanding biology. A direct target for such cross-fertilization of biology and engineering is nanotechnology, where we seek to engineer devices and processes on the size scale of biomolecular complexes. The Network for Computational Nanotechnology2 is a notable cyberinfrastructure project in nanotechnology, with a versatile computational environment delivered through its nanoHUB web site.3

Integration of algorithmic development with computing architecture

The different types of biological computing have vastly different patterns of computer utilization. Some applications are very CPU-intensive, some require large amounts of memory, some must access enormous data stores, some are much more readily parallelizable than others, and there are highly varied requirements for bandwidth between hard drive, memory, and processor. We need much more extensive mutual tuning of computer architectures and applications software, to be able to do more with existing and projected computational resources. A remarkable instance of such tuning is the molecular simulation code Blue Matter, written specifically to exploit the architecture of the IBM Blue Gene supercomputer. The Blue Matter-Blue Gene combination has done biomolecular dynamics on an unprecedented scale and is directly enabling fundamentally new discoveries.4

There is finally one more critical issue with respect to the development of a suitable cyberinfrastructure for biology. Our society is not training nearly enough prospective workers in the area of computational biology, nor enough quantitative biology researchers in general, to make progress in biological computing commensurate with the increased availability and power of computing resources. We need focused training at both the undergraduate and graduate levels to produce a generation of computational biologists who will be capable of integrating the physics, systems, and informatics approaches to biological computing, and to produce a generation of biologists who will be able to use computational tools in the service of quantitative design perspectives in understanding living systems. The need for such training is well articulated in the National Academy of Sciences report BIO 2010.5 The first university to unreservedly embrace the BIO 2010 recommendations by fully integrating computing into all levels of its biology curriculum is the University of California at Merced.6

1 Biology Workbench - http://workbench.sdsc.edu/
2 Network for Computational Nanotechnology - http://www.ncn.purdue.edu/
3 nanoHUB - http://www.nanohub.org/
4 See Grossfield, A., Feller, S. E., Pitman, M. C. A role for direct interactions in the modulation of rhodopsin by ω-3 polyunsaturated lipids. PNAS 103: 4888-4893, 2006.
5 http://www.nap.edu/books/0309085357/html/
6 http://biology.ucmerced.edu/

URL to article: http://www.ctwatch.org/quarterly/articles/2006/08/specifications-for-the-next-generation-computational-biology-infrastructure/