May 2006
Designing and Supporting Science-Driven Infrastructure
Charlie Catlett, Pete Beckman, Dane Skow and Ian Foster, The Computation Institute, University of Chicago and Argonne National Laboratory

3.4 Operational Services

While largely transparent to end users, any national grid facility must be supported by a deep foundation of operational infrastructure. This need is particularly acute for facilities such as TeraGrid, which operate national-scale resources purchased and supported on behalf of government agencies. Such facilities must account for how those resources are used and must allocate access to them through an open peer-review process. The operational services discussed here also include networking, security coordination, and an operations center.

Resource Allocation and Management
Many national-scale grid consortia operate “best-effort” services that give stakeholder user groups access to excess capacity. In contrast, TeraGrid operates resources on behalf of broad national communities and allocates them through a formal process: a peer-review committee meets quarterly to review user requests for allocations. (Allocations are specified in service units, analogous to CPU hours.) Supporting this nationally peer-reviewed system requires a distributed accounting system that works in concert with authentication and authorization systems. Each project's allocation is debited according to the usage of the users whom the project's principal investigator has authorized. The allocation review process itself requires a proposal request and review infrastructure, databases for users and usage, and information exchange systems for usage data and user credentials. TeraGrid obtained much of this infrastructure from its predecessor, the NSF Partnerships for Advanced Computational Infrastructure (PACI) program, in which several million dollars of software development was invested during the past decade.
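The debit step described above can be illustrated with a minimal sketch. It is not TeraGrid's accounting software: the `Project` class, its `authorize` and `charge` methods, the `AllocationError` exception, and the `su_per_cpu_hour` conversion factor are hypothetical names chosen for illustration, and the default one-to-one rate between CPU hours and service units is an assumption.

```python
from dataclasses import dataclass, field


class AllocationError(Exception):
    """Raised when a charge or authorization request is rejected."""


@dataclass
class Project:
    name: str
    pi: str                 # principal investigator who controls access
    balance_su: float       # remaining allocation, in service units
    authorized: set = field(default_factory=set)

    def authorize(self, requester: str, user: str) -> None:
        # Only the PI may add users to the project.
        if requester != self.pi:
            raise AllocationError(f"{requester} is not the PI of {self.name}")
        self.authorized.add(user)

    def charge(self, user: str, cpu_hours: float,
               su_per_cpu_hour: float = 1.0) -> float:
        # The PI is implicitly authorized; anyone else must be listed.
        if user != self.pi and user not in self.authorized:
            raise AllocationError(f"{user} is not authorized on {self.name}")
        cost = cpu_hours * su_per_cpu_hour
        if cost > self.balance_su:
            raise AllocationError(f"{self.name} has insufficient service units")
        self.balance_su -= cost
        return cost


# Hypothetical project: 1000 service units, one user added by the PI.
project = Project(name="demo", pi="alice", balance_su=1000.0)
project.authorize("alice", "bob")
project.charge("bob", cpu_hours=64 * 2.5)   # 64 CPUs for 2.5 hours
print(project.balance_su)                   # 840.0
```

In this sketch the authorization check precedes the debit, so usage by an unlisted user never reaches the project balance. A production system would also record who was charged and when, which is omitted here.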

The operation of the TeraGrid resource allocation and management infrastructure requires four GIG FTEs for coordination along with seven FTEs at resource provider facilities to support the various databases and proposal support systems, and to perform local accounting integration with the distributed TeraGrid system.

Security Coordination
Security management in a national grid facility requires a high degree of coordination among security professionals at many sites. TeraGrid security coordination is based on a set of agreed-upon policies ranging from minimum security practices to change management and protocols for incident response and notification.

The GIG team coordinates the distributed security team, covering general communication, incident response management, and analysis of the security impact of system changes (e.g., new software or new systems). The provision of distributed authentication and authorization services for individual users and groups (or “virtual organizations” [27]), as is required in grid facilities, is also a significant part of the security coordination effort.

Security coordination across TeraGrid requires two GIG FTEs working with three FTEs at resource provider sites, with participation from additional security operations staff from each resource provider organization. While participation in a national grid security coordination team requires investment of time on the part of local security staff, the benefits to the site are high in terms of training, assistance, and early notification of events that might impact the local site.

Networking
Many national-scale grid facilities rely on existing Internet connectivity. In contrast, TeraGrid operates a dedicated network. Irrespective of the networking strategy, effort is needed to optimize services over networks between resource provider locations, particularly with respect to data movement over networks with a high bandwidth-delay product. In addition, distributed applications and services often require assistance from networking experts at multiple sites. Thus, a national-scale grid facility such as TeraGrid requires a networking team consisting of contacts from each resource provider site. As with the security team, the benefits to the site far outweigh the time investment on the part of local networking staff.
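To show why data movement over such networks needs tuning, the following sketch computes the bandwidth-delay product (the amount of data that must be in flight to keep a path full) and the throughput ceiling of a single stream limited by a fixed window. The function names are hypothetical, and the 10 Gb/s link, 60 ms round-trip time, and 64 KiB window are illustrative figures, not measurements of the TeraGrid network.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to fill the path: bandwidth x RTT."""
    return bandwidth_bps * rtt_ms / 1000 / 8


def max_throughput_bps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound for one window-limited stream: one window per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


# Illustrative values only (assumptions, not TeraGrid measurements).
link_bps = 10e9          # 10 Gb/s path
rtt_ms = 60              # cross-country round-trip time
window = 64 * 1024       # 64 KiB window without window scaling

bdp = bandwidth_delay_product_bytes(link_bps, rtt_ms)
ceiling = max_throughput_bps(window, rtt_ms)

print(f"Bandwidth-delay product: {bdp / 1e6:.1f} MB")        # 75.0 MB
print(f"Single-stream ceiling:   {ceiling / 1e6:.1f} Mb/s")  # 8.7 Mb/s
```

Under these assumed figures a single untuned stream uses well under one percent of the path, which is why larger windows or parallel streams are typically needed for bulk transfers between distant sites.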

In the case of TeraGrid, this component of the support infrastructure comprises a network architect/coordinator within the GIG to oversee the networking team, which includes five FTEs from resource provider facilities along with general networking contacts at all sites. The networking working group coordinates the operation of the TeraGrid network. Participants also assist in user support, such as diagnosing problems and optimizing performance of distributed services and applications.

Operations Center
TeraGrid provides a distributed operations center, leveraging the 24/7 operations centers at two of the resource provider facilities (NCSA and SDSC) to provide around-the-clock support. The distributed 24/7 operations center plays several essential roles in the TeraGrid facility, including the management of a common trouble-ticket system and ongoing measurement of key metrics related to the health and performance of the facility. TeraGrid operations requirements also include management of the distributed accounting system, which involves the collection of usage information into a central usage database. The TeraGrid GIG funds two FTEs for various aspects of operations and two FTEs at resource provider facilities.
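The collection of usage information into a central database can be sketched as follows. This is an assumption-laden illustration rather than a description of the actual TeraGrid system: the table layout, the function names (`open_central_db`, `ingest`, `usage_by_project`), and the site and project identifiers are all hypothetical, and an in-memory SQLite database stands in for the central usage database.

```python
import sqlite3


def open_central_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the central usage database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage ("
        " site TEXT NOT NULL,"
        " job_id TEXT NOT NULL,"
        " project TEXT NOT NULL,"
        " service_units REAL NOT NULL,"
        " PRIMARY KEY (site, job_id))"
    )
    return conn


def ingest(conn: sqlite3.Connection, site: str, records) -> int:
    """Load one site's (job_id, project, service_units) records.

    Records already present for that site are ignored, so a site can
    safely resend a batch. Returns the number of new rows stored.
    """
    before = conn.execute("SELECT COUNT(*) FROM usage").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO usage (site, job_id, project, service_units)"
        " VALUES (?, ?, ?, ?)",
        [(site, job_id, project, su) for job_id, project, su in records],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM usage").fetchone()[0]
    return after - before


def usage_by_project(conn: sqlite3.Connection) -> dict:
    """Total service units charged to each project across all sites."""
    rows = conn.execute(
        "SELECT project, SUM(service_units) FROM usage"
        " GROUP BY project ORDER BY project"
    )
    return dict(rows.fetchall())


# Hypothetical sites and projects.
db = open_central_db()
ingest(db, "site-a", [("1", "proj-1", 10.0), ("2", "proj-2", 5.0)])
ingest(db, "site-b", [("1", "proj-1", 2.5)])
ingest(db, "site-a", [("1", "proj-1", 10.0)])   # resent batch, ignored
print(usage_by_project(db))   # {'proj-1': 12.5, 'proj-2': 5.0}
```

Keying each record on the pair of site and job identifier is the design choice that makes ingestion idempotent here: job identifiers need only be unique within a site, and a repeated upload does not double-count usage.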

Reference this article
Catlett, C., Beckman, P., Skow, D., Foster, I. "Creating and Operating National-Scale Cyberinfrastructure Services," CTWatch Quarterly, Volume 2, Number 2, May 2006. http://www.ctwatch.org/quarterly/articles/2006/05/creating-and-operating-national-scale-cyberinfrastructure-services/
