August 2007
The Coming Revolution in Scholarly Communications & Cyberinfrastructure
Clifford Lynch, Coalition for Networked Information (CNI)

Scientific Articles and their Relationships to Data

The vast majority of scientific articles present and/or analyze data. (Yes, in mathematics, and some parts of theoretical physics, they do something else, and if, when and how this particular sub-genre of writings will be transformed is a fascinating question that deserves an extensive discussion in its own right. But that is another question, for another time.) As this data becomes more complex, more extensive, more elaborate, more community-based, and more mediated by software, the relationships between articles and the data upon which they are based are becoming more complex and more variable. Recognize, too, that implicit in these relationships is a whole series of disciplinary norms and supporting organizational and technical cyberinfrastructure services.

To what extent should articles incorporate the data they present and analyze, and to what extent should they simply reference that data? The issues here are profoundly complex. First, there’s the question of whether the data is original and being presented for the first time; certainly it is commonplace to draw upon, compare, combine and perhaps reinterpret data presented in earlier articles or otherwise made available, and here the right approaches would presumably be citation or similar forms of reference. Repeated publication of the same data is clearly undesirable.

For newly publicized data there is a range of approaches. Some journals offer to accept it as “supplementary materials” that accompany the article, but often with very equivocal commitments about preserving the data or the tools to work with it, as opposed to the article proper. Not all journals offer this as an option, and some place constraints on the amount of data they will accept, or on the terms of access to the data (e.g., subscribers only).

For certain types of data, specific communities (for example, crystallographers, astronomers, and molecular biologists) have established norms, enforced by the editorial policies of their journals, which call for deposit of specific types of data within an international disciplinary system of data repositories, and for the article to make reference to this data by an accession identifier assigned upon deposit in the relevant repository. Clearly, this works best when there are well agreed-upon structures for specific kinds of data that occur frequently (genomic sequencing, observations of the sky, etc.); it also assumes an established, trustworthy and sustainable disciplinary repository system. Indeed, we have already seen the emergence of what might be characterized as a “stub article” that in effect announces the deposit of an important new dataset in a disciplinary repository and perhaps provides some background on its creation, but offers little analysis of the data, leaving that to subsequent publications. This allows the compilers of the dataset to have their work widely recognized, acknowledged, and cited within the traditional system familiar to tenure and promotion committees.

Lynch, C. "The Shape of the Scientific Article in The Developing Cyberinfrastructure," CTWatch Quarterly, Volume 3, Number 3, August 2007.

