We need to understand how to broaden the success of parallelizing science and engineering applications to the much larger range of applications that could or should exploit multicore chips. Indeed, this broader set of applications must get good speedups on multicore chips if Moore's law is to continue as we move from the single-CPU architecture and clock-speed improvements that drove past exponential performance increases to performance improvement driven by increasing cores per chip. We will focus on "scalable" parallelism, where a given application can get good performance on 16-32 cores or more. On general principles, backed up by experience from science and engineering, lessons from a small number of cores are a good harbinger for scaling to larger systems only if they are supported by a good model of the parallel execution. So in this section we analyze lessons from science and engineering on scalable parallelism: how one can get speedup that is essentially proportional to the number of cores as one scales from 4 to 16 to 128 and more cores. These lessons include, in particular, a methodology for measuring speedup and for identifying and measuring bottlenecks, which is intrinsically important and should be examined and extended for general multicore applications. Note that multicore offers not just an implementation platform for science and engineering but an opportunity for improved software environments sustained by a broad application base. Thus there are clear mutual benefits in incorporating lessons and technologies developed for scalable parallel science and engineering applications into broader commodity software environments.

Consider the linear algebra example described later in this article; it illustrates the key features of most scalable, parallel science and engineering applications. These applications have a "space" with degrees of freedom (for linear algebra, the space is formed by vectors and matrices, and the matrix and vector elements are the degrees of freedom). This space is divided into parts, and each core is responsible for "evolving" its part. The evolution consists of multiple algorithmic steps, time-steps, or iterations. The parts synchronize with each other at the end of each step, and this synchronization leads to "overhead." However, with carefully designed parallel algorithms, this overhead can be quite small: the speedup S for an application on N cores has the form S = εN, where the efficiency ε often has the form ε = 1 - c/n^{α}; here n is the (grain) size of the part in each core, and c and α are application- and hardware-dependent constants. This form has the well-understood scaled-speedup consequence that the speedup is directly proportional to the number of cores N as long as the grain size n is held constant; this implies that the total problem size nN also grows proportionally to N. The exponent α is 0.5 for linear algebra, so for a fixed total problem size, where n is proportional to 1/N, the efficiency decreases as N increases. The success of parallel computing in science and engineering partly reflects that users do need to run larger and larger problems, so grain sizes are not decreasing as machines scale up; in fact, as both the number of nodes and the memory size per CPU have grown in successive generations of parallel computers, grain sizes have been increasing, not decreasing. It is important to ask whether commodity applications share this characteristic: will they scale up in size as chips increase in number of cores?
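The consequences of this efficiency model can be seen in a short calculation. The values of c and the grain sizes below are illustrative assumptions, not measured constants; only α = 0.5 (the linear algebra case) comes from the discussion above:

```python
# Speedup model S = eps * N with efficiency eps = 1 - c / n**alpha.
# The constant c = 10.0 and the grain sizes are illustrative values.

def efficiency(n, c, alpha):
    """Parallel efficiency for grain size n (degrees of freedom per core)."""
    return 1.0 - c / n ** alpha

def speedup(n, N, c=10.0, alpha=0.5):
    """Speedup on N cores; alpha = 0.5 is the linear algebra case."""
    return efficiency(n, c, alpha) * N

# Scaled speedup: keep the grain size n fixed while N grows, so the
# efficiency is constant and the speedup grows linearly with N.
for N in (4, 16, 128):
    print(N, speedup(n=1_000_000, N=N))

# Fixed total problem size: n = total / N shrinks as N grows,
# so the efficiency falls off as more cores are added.
total = 1_000_000
for N in (4, 16, 128):
    print(N, efficiency(total // N, c=10.0, alpha=0.5))
```

The first loop shows why growing the problem with the machine preserves efficiency; the second shows the penalty of strong scaling at fixed total size.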
This class of applications consists of successive phases: each compute phase is followed by a communication (synchronization) phase, and the pattern repeats. Note that the synchronization is typically not a global operation; rather, it is achieved by each part (core) communicating with the other parts to which it is linked by the algorithm.
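The phase pattern can be sketched in a few lines. The one-dimensional three-point average below is a hypothetical stand-in for a real update rule, and the communication phase is simulated serially rather than run on separate cores:

```python
# Compute/communicate pattern: the domain is split into parts, each part
# updates its interior, and neighboring parts exchange boundary ("halo")
# values before the next step. Serial simulation of the parallel pattern.

def exchange(parts):
    """Communication phase: each part receives its neighbors' edge
    values as halos (fixed boundary value 0.0 at the domain ends)."""
    halos = []
    for i in range(len(parts)):
        left = parts[i - 1][-1] if i > 0 else 0.0
        right = parts[i + 1][0] if i < len(parts) - 1 else 0.0
        halos.append((left, right))
    return halos

def compute(parts, halos):
    """Compute phase: three-point average within each part."""
    out = []
    for part, (left, right) in zip(parts, halos):
        padded = [left] + part + [right]
        out.append([(padded[j - 1] + padded[j] + padded[j + 1]) / 3
                    for j in range(1, len(padded) - 1)])
    return out

# Two steps of the pattern on a domain split into three parts.
parts = [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0]]
for _ in range(2):
    parts = compute(parts, exchange(parts))
print(parts)
```

On a real machine each part would live on its own core, and exchange() would become point-to-point messages between neighbors, e.g., MPI sends and receives.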

Our second example is deliberately chosen to be very different in structure. Consider a typical Web server farm, where parallelism arises from multiple users accessing some sort of resource; this is typical of job scheduling on multiple computers or cores. Here, efficient operation is achieved not by linked, balanced processes but by the statistical balancing of heterogeneous processes assigned dynamically to cores. Such applications are termed "pleasingly" or "embarrassingly" parallel and are quite common; as the jobs are largely independent, they do not require significant communication, and Grids are a popular implementation. Such (essentially) pleasingly parallel applications include the data analysis of events from the Large Hadron Collider (LHC) but also, for example, genetic algorithms for optimization.
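A minimal sketch of this pattern using Python's standard process pool; the analyze() body here is a placeholder for a real independent job, such as the analysis of one detector event:

```python
# Pleasingly parallel pattern: independent jobs of varying cost are
# handed dynamically to a pool of workers. No job communicates with
# another, so load balance is achieved statistically, not by linking
# the parts as in the first example.
from concurrent.futures import ProcessPoolExecutor

def analyze(event):
    """Stand-in for an independent, heterogeneous-cost job."""
    return sum(i * i for i in range(event))  # cost varies with `event`

if __name__ == "__main__":
    events = [10_000, 500, 80_000, 2_000]  # heterogeneous job sizes
    with ProcessPoolExecutor(max_workers=4) as pool:
        # map hands each job to the next free worker; the only
        # "synchronization" is collecting the results at the end.
        results = list(pool.map(analyze, events))
    print(results)
```

The same structure describes a server farm or a Grid job scheduler: a queue of independent tasks, a pool of workers, and no inter-task communication.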

There are of course other types of scalably parallel applications, but the two described illustrate an important characteristic: good parallel performance is not an "accident." Rather, it is straightforward (albeit often hard!) to build a model for applications of these classes to predict performance; for the first application class, this means predicting the constants c and α, and hence the grain size n needed to meet speedup goals. Science and engineering applications start with an abstract description of the problem (say, the Navier-Stokes partial differential equations for computational fluid dynamics) and map it into a numerical formulation through multiple steps, such as a discretization and the choice of a solver for the sparse matrix equations. It is usually easiest to model the performance at this stage, as the description is manifestly parallelizable. Then one chooses a software model (and, implicitly, often a runtime) to express the problem, choosing between technologies like MPI, OpenMP, and compiler-discerned parallelism. Note that there is no known universal way of expressing a clearly parallel model in software that is guaranteed to preserve the natural parallelism of the model. Formally, the mapping is not invertible, and the parallelism may or may not be discovered statically. Further, it may not even be realistic to discover the parallelism dynamically as the application runs. The differences between the abstract description, the numerical model, and the software and hardware platforms are significant and lead to some of the hardest and most important problems in parallel computing. Explicit message passing with MPI is successful even though it is tough for the programmer.
It directly expresses the inherent parallelism; many languages, originating with CM Fortran and HPF and continuing with Global Array approaches and the DARPA HPCS languages, permit the expression of collective operations and data decompositions that capture the parallelism of many, but not all, problems. OpenMP provides hints that help compilers fill in information lost in the translation of numerical models to software. The success of the fusion and tiling strategies described below for efficient compilation is a consequence of characteristics, such as geometric locality, typical of problems whose parallelism is "not accidental." The development of powerful parallel libraries and of dynamic run-time strategies such as inspector-executor loops allows the runtime to support efficient parallel execution. The science and engineering community is still debating the pluses and minuses of these different approaches, but the issues are clear, and the process of identifying and expressing parallelism is understood.
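As a concrete illustration of tiling, here is a matrix transpose restructured into blocks. In Python the cache benefit is not observable, so this sketch only shows the loop restructuring that a compiler derives when geometric locality is present:

```python
# Loop tiling: the same traversal restructured into blocks so that each
# tile of the matrix is reused while it is cache-resident. The payoff
# appears on real hardware; here the two versions simply agree.

def transpose_naive(a):
    n = len(a)
    return [[a[j][i] for j in range(n)] for i in range(n)]

def transpose_tiled(a, tile=2):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):                      # iterate over tiles
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):    # within one tile
                for j in range(jj, min(jj + tile, n)):
                    out[i][j] = a[j][i]
    return out

a = [[1, 2], [3, 4]]
assert transpose_tiled(a) == transpose_naive(a)
```

The tiled version touches the source matrix in small square blocks rather than striding through whole columns, which is exactly the locality a compiler exploits when the problem's geometry makes it available.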

It seems important to examine this process for important commodity applications. Can one identify the same process? Note that the process has not typically been to start with a lump of software with unclear parallel properties. Some applications of interest, such as machine learning and image processing, are related to well-understood parallel applications, and we can be certain that parallelism is possible with good efficiency if the problem is large enough. Are we aiming to parallelize applications with clear models that suggest good performance, or are we hoping for a miracle by tossing a bunch of interacting threads onto a bunch of cores? The latter could in fact be very important but is likely to require a different approach and different expectations from the cases where parallelism is "no accident." Applications based on large numbers of concurrent threads, all running in the same address space, are not uncommon in commercial software such as portals and Web services, games, and desktop application frameworks. In these applications, data structures are shared, and communication is through explicit synchronization involving locks and semaphores. Unfortunately, locked critical sections do not scale well to large numbers of cores, and alternative solutions are required. One extremely promising approach is based on a concept called software transactional memory (STM),^{3} where critical sections are not locked; instead, conflicts are detected and bad transactions are rolled back. While some argue that STM will require additional hardware support to work efficiently, the jury is still out. It may be that smart compilers and the runtime architecture will address the efficiency issues without changes to the hardware.
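The rollback-and-retry idea behind STM can be sketched with a toy single-variable transactional cell. This is an illustrative model of optimistic concurrency, not a real STM implementation; the TVar class and its methods are inventions for this sketch:

```python
# STM idea in miniature: a transaction reads a version, computes
# speculatively without holding a lock, and commits only if no other
# transaction committed in between; otherwise it discards its work
# (rolls back) and retries.
import threading

class TVar:
    """Toy single-cell transactional variable (illustrative only)."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # guards the commit step only

    def atomically(self, update):
        while True:                            # retry loop
            seen_version = self.version
            new_value = update(self.value)     # speculative, lock-free work
            with self._commit_lock:
                if self.version == seen_version:   # no conflicting commit
                    self.value = new_value
                    self.version += 1
                    return new_value
            # conflict detected: roll back (drop new_value) and retry

counter = TVar(0)

def worker():
    for _ in range(1000):
        counter.atomically(lambda v: v + 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 4000: every increment commits exactly once
```

The critical section here shrinks to the version check and write; all real work happens outside any lock, which is what lets transactions proceed optimistically in parallel and pay only on actual conflicts.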
Note that even if the same approaches can be used for commodity and for science and engineering applications, we may well find different solutions, as there will now be a broad commodity base rather than a niche application field for parallel applications; this could motivate the development of powerful software development environments or HPF-style languages that were previously not economical.