Publications are the currency of research. They are the mechanism through which scientists communicate their results to their peers and the means by which we evaluate each other. This model is unlikely to change completely. However, the electronic age – the introduction of cyberinfrastructure – is altering this paradigm, a phenomenon that has been observed previously.3,4,5
One significant difference is the requirement by many publishers that authors deposit the data described in a publication in an appropriate public repository concomitant with publication of the manuscript. For example, macromolecular structure data must be deposited in the Protein Data Bank (PDB) when a manuscript describing the macromolecule’s structure is published. As part of the deposition process, a reference to the publication is included. However, this is generally the only link between paper and data repository, even though the paper may contain a wealth of information that would be relevant to someone viewing the deposition record.
A similar scenario exists with publications that reference a record in one of these repositories. For example, if someone has used structural data describing a protein from the PDB, the structure will be referenced by ID in the publication, but this reference is usually not included in the PDB record itself. Someone viewing the structure in the database will see only the citation and abstract of the publication describing the generation of the initial data. Much of the research performed subsequent to structure deposition concerns functional information about the protein – information that would surely be useful to anyone interested in that molecule but that is not trivial to obtain.
One possible reason why a link between this secondary publication and the database is not made is that there is no professional value associated with database annotation.4 Scientists are evaluated on their peer-reviewed publications, not on their database annotations (which are not independently peer-reviewed). Furthermore, proper database annotation takes time and effort, so there is little incentive for this endeavor.
Another barrier contributing to the disconnection between publications and deposited data is the slow adoption of cyberinfrastructure by scientific publishers. Most publishers have at least an online presence and generally make the articles they publish available for download, viewing, or printing, but they do little else with the information they are communicating. For them, cyberinfrastructure is little more than a new means of distribution.
A significant issue complicating extensive use of publication content is intellectual property rights, an issue that is currently quite controversial. Some publishers have risen to the challenge and adopted the open access philosophy, publishing their articles under a Creative Commons Attribution License. This means that the content is free to use and distribute as long as the original attribution is maintained. The Public Library of Science (PLoS) and BioMed Central (BMC) are good examples of life science publishers who have embraced the open access model. Articles from these and other publishers are collected in a central repository, PubMed Central, which archives open access articles in the life sciences. To date, there are over 54,000 articles from over 300 journals – hardly a major representation of the field, but still a solid beginning.
The centralization of open access articles is a significant step forward, but even more significant is the storage of these articles in a standardized and machine-readable format: the NLM DTD. This document format allows all open access articles to be archived as XML files that include some semantic mark-up of the content along with unique identifiers for the article itself and the objects (figures and tables) within it. This format also allows the articles to be parsed for relevant information. Unfortunately, little value is added to the article content itself. To recall the earlier example, most authors who reference a protein structure do not include a link to the structural data in the PDB. To find a mention of a PDB ID, one would have to perform a full-text search on the article content (including figure captions). Even then, a successful search result is not guaranteed to reference a PDB ID – the matched string could belong to a different database or have an entirely different meaning, since there is no semantic context for that string of text. (This is not always true: some papers do include direct references to the PDB using the xlink tag.)
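The ambiguity of plain-text searching is easy to demonstrate. The sketch below scans a snippet of article text with a regular expression matching the four-character PDB ID format (a digit followed by three alphanumeric characters); the snippet itself is an invented illustration, not text from any actual article:

```python
import re

# PDB IDs are four characters: a digit followed by three
# alphanumeric characters, e.g. "1tim".
PDB_ID_PATTERN = re.compile(r"\b[1-9][a-zA-Z0-9]{3}\b")

# Illustrative sentence (not from a real paper).
article_text = (
    "The coordinates of PDB entry 1tim were used as a starting model, "
    "following methods in continuous use since 1950."
)

matches = PDB_ID_PATTERN.findall(article_text)
print(matches)  # → ['1tim', '1950'] – the year is a false positive
```

Without semantic context, the scanner cannot distinguish the genuine accession code from the year; only explicit tagging, such as the xlink references some papers already use, removes the ambiguity.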
Even if a link to an article that mentions a PDB ID were included in the PDB, it is not clear what the value of that reference would be to the reader. Does the article describe the biochemical function of the protein, or was the structure merely used to train a computational prediction algorithm? Rather than direct the reader to an article that may not be of interest, it would be useful to indicate what kind of content the article contains. This requires semantic mark-up of the article content: using ontologies or controlled vocabularies within the framework of the NLM DTD would increase the usefulness of article content dramatically.
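As a sketch of what such typed references could look like, the fragment below attaches a controlled-vocabulary term to a citation and maps it to a reader-facing description. The element style mimics the NLM DTD's ext-link tag, but the vocabulary terms and their use here are invented for illustration – they are not part of the actual DTD or of current PDB practice:

```python
# A citation carrying a hypothetical controlled-vocabulary term that
# classifies *why* the article references the structure. The element
# style mimics the NLM DTD's <ext-link>; the vocabulary is invented.
citation = (
    '<ext-link ext-link-type="pdb" xlink:href="1tim" '
    'specific-use="functional-characterization">1TIM</ext-link>'
)

# A database front end could translate the term for readers browsing
# the entry, so they know what a linked article is likely to contain.
VOCABULARY = {
    "functional-characterization": "describes the protein's biochemical function",
    "method-benchmark": "uses the structure to evaluate a prediction algorithm",
}

def describe(term: str) -> str:
    return VOCABULARY.get(term, "relationship unspecified")

print(describe("functional-characterization"))
```

With even a small vocabulary like this, a reader at the database could filter linked articles by the kind of information they carry rather than opening each one in turn.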
All of these tools exist – the standardized document format, the ability to create hyperlinks in electronic documents, field-specific ontologies – but they have yet to be used to their full advantage. This may be due to the legacy of static manuscripts, which is largely perpetuated by scientists who did not have access to cyberinfrastructure during their formative years. Today’s scientists do, however, and it is time to make this happen.