November 2007
Software Enabling Technologies for Petascale Science
E. Wes Bethel, Lawrence Berkeley National Laboratory
Chris Johnson, University of Utah
Cecilia Aragon, Lawrence Berkeley National Laboratory
Prabhat, Lawrence Berkeley National Laboratory
Oliver Rübel, Lawrence Berkeley National Laboratory
Gunther Weber, Lawrence Berkeley National Laboratory
Valerio Pascucci, Lawrence Livermore National Laboratory
Hank Childs, Lawrence Livermore National Laboratory
Peer-Timo Bremer, Lawrence Livermore National Laboratory
Brad Whitlock, Lawrence Livermore National Laboratory
Sean Ahern, Oak Ridge National Laboratory
Jeremey Meredith, Oak Ridge National Laboratory
George Ostrouchov, Oak Ridge National Laboratory
Ken Joy, University of California, Davis
Bernd Hamann, University of California, Davis
Christoph Garth, University of California, Davis
Martin Cole, University of Utah
Charles Hansen, University of Utah
Steven Parker, University of Utah
Allen Sanderson, University of Utah
Claudio Silva, University of Utah
Xavier Tricoche, University of Utah

High Performance Implementation

So far, we’ve discussed how one might go about specifying queries, or “defining interesting,” and have also shown a couple of different ways to present the results that show only “the interesting data.” Here, we want to turn our attention to the underlying machinery that makes this kind of approach feasible in high performance implementations suitable for use with very large datasets.

All Computer Science undergraduates are introduced to the idea of binary trees and their use as an indexing data structure. Briefly, if you have a sorted array of data of N items, you can construct a binary tree that will have N-1 nodes and N leaves where each interior node partitions the data in deeper nodes and leaves into two groups – “greater than” and “less than or equal to” the value of a key. Once you have constructed this data structure, the search for the data record having the value of some key is performed in log2N search steps assuming an optimal, or balanced tree. This basic idea – called tree-based indexing – is widely used in many types of relational and object-oriented database systems. One obvious limitation of this type of approach when considering very large data is that the size of the indexing structure – the tree – is linear with respect to the size of the dataset being indexed. As this size grows larger, we clearly don’t want to incur a commensurately larger storage cost for our search indices. Another problem, which may not be quite as obvious, is that these tree-based approaches require the original data to be sorted. For scientific data, where you typically write the data once then examine it over and over again, this may not be a serious limitation. In some instances, it may simply be impractical to sort the data.

Of greater concern is the so-called “Curse of Dimensionality”13 The previous paragraph calls out that the storage complexity for a tree-based structure is O(N) when there are N data points. If these data points, or records, have two variables, and we want to create a two-dimensional tree that spans both variables, we end up with a storage complexity of O(N2). If there are three variables, the storage requirements are of O(N3). The basic premise is that storage requirements for tree-based indices grow exponentially with respect to the number of variables being indexed. Many modern simulations routinely have on the order of 100 variables that are computed and saved at each time step. It should be obvious that tree-based indexing is simply not practical for large and complex scientific data.

This well-known problem has received a great deal of attention from our colleagues in the field of scientific data management. They have developed a unique technology called “compressed bitmap indices” that have very favorable storage and search complexity.14 This technology has been applied with great success to index/query problems of some of the world’s largest datasets.15 In a series of collaborative research projects, members of VACET and DOE’s Scientific Data Management Center have demonstrated the practicality of combining fast bitmap indexing with high performance visual data analysis, to implement a novel approach to query-driven visualization applied to visual data analysis of problems in combustion modeling8 and large-scale network traffic analysis.16

Pages: 1 2 3 4 5 6 7 8

Reference this article
Bethel, E. W., Johnson, C., Aragon, C., Prabhat, Rübel, O., Weber, G., Pascucci, V., Childs, H., Bremer, P.-T., Whitlock, B., Ahern, S., Meredith, J., Ostrouchov, G., Joy, K., Hamann, B., Garth, C., Cole, M., Hansen, C., Parker, S., Sanderson, A., Silva, C., Tricoche, X. "DOE's SciDAC Visualization and Analytics Center for Enabling Technologies - Strategy for Petascale Visual Data Analysis Success," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/does-scidac-visualization-and-analytics-center-for-enabling-technologies-strategy-for-petascale-visual-data-analysis-success/

Any opinions expressed on this site belong to their respective authors and are not necessarily shared by the sponsoring institutions or the National Science Foundation (NSF).

Any trademarks or trade names, registered or otherwise, that appear on this site are the property of their respective owners and, unless noted, do not represent endorsement by the editors, publishers, sponsoring institutions, the National Science Foundation, or any other member of the CTWatch team.

No guarantee is granted by CTWatch that information appearing in articles published by the Quarterly or appearing in the Blog is complete or accurate. Information on this site is not intended for commercial purposes.