Next-Generation Implications of Open Access
The technological transformation of scholarly communication infrastructure began in earnest by the mid-1990s. Its effects are ubiquitous in the daily activities of typical researchers, instructors, and students, permitting discovery of, access to, and reuse of material with an ease and rapidity difficult to anticipate as little as a decade ago. An instructor preparing for lecture, for example, undertakes a simple web search for up-to-date information in some technical area and finds not only a wealth of freely available, peer-reviewed articles from scholarly publishers, but also background and pedagogic material provided by their authors, together with slides used by authors to present the material, perhaps video of a seminar or colloquium on the material, related software, on-line animations illustrating relevant concepts, explanatory discussions on blog sites, and often useful notes posted by third-party instructors of similar recent or ongoing courses at other institutions. Any and all of these can be adapted for use during lecture, added to a course website for student use prior to lecture, or provided as reference material afterwards. Questions or confusions that arise during lecture are either resolved in real time using a network-connected laptop, or deferred until afterwards, with the instructor's clarification propagated instantly via a course website or e-mail. Or such lingering issues are left as an exercise for students to test and hone their own information-gathering skills in the current web-based scholarly environment. Some courses formalize the above procedures with a course blog that also permits posting of student writing assignments for commentary by other students and the instructor. Other courses employ wikis, so that taking lecture notes becomes a collaborative exercise for students.
Many of these developments had been foreseen a decade ago, at least in principle, though certainly not in all the particulars. When the mass media and general public became aware of the Internet and World Wide Web in the mid-1990s, this new "information superhighway" was heavily promoted for its likely impact on commerce and media, but widespread adoption of social networking sites facilitating file, photo, music, and video sharing was not regularly touted. Web access is now being built into cell phones, music players, and other mobile devices, so it will become that much more ubiquitous in the coming decade. People currently receiving their PhDs became fluent in web search engine usage in high school, and people receiving their PhDs a decade from now will have had web access since early elementary school. (My 3.5-year-old son has had access to web movie trailers on demand since the age of 1, but is at least two decades from a doctorate.)
Many aspects of teaching and scholarship will remain unchanged. Web access will not fundamentally alter the rate at which students can learn Maxwell's equations of electromagnetism, and many web resources of uncertain provenance, e.g., Wikipedia entries, will require independent expertise to evaluate. We have also already learned that having a vast array of information readily available does not necessarily lead to a better-informed public, but can instead exacerbate the problem of finding reliable signal in the multitude of voices. Recent political experience suggests that people tend to gravitate to an information feed that supports their preexisting notions: the new communications technologies now virtually guarantee that such a feed will exist, and moreover make it easy to find. In what follows, I will focus on questions related to the dissemination and use of scholarly research results, as well as their likely evolution over the next decade or so. The issues of generating, sharing, discovering, and validating these results all have parallels in non-academic pursuits. To guide the anticipation of the future, I'll begin by looking backwards to developments of the early 1990s.
The e-print arXiv1 was initiated in August 1991 to accommodate the needs of a segment of the High Energy Physics theoretical research community. (For general background, see Ginsparg, P., "Winners and losers....".2) Due to its prior history of paper preprint distribution (for a history, see O'Connell, H. B., "Physicists Thriving...".3), it was natural for this community to port its prepublication distribution to the Internet, even before the widespread adoption of the World Wide Web in the mid-1990s. A primary motivation for this electronic initiative was to level the research playing field, by providing equal and timely access to new research results to researchers all over the globe and at all levels, from students to postdocs to professors.
arXiv has since effectively transformed the research communication infrastructure of multiple fields of physics and will likely continue to play a prominent role in a unified set of global resources for physics, mathematics, and computer science. It has grown to contain over 430,000 articles (as of July 2007), with over 56,000 new submissions expected in calendar year 2007, and over 45 million full-text downloads per year. It is an international project, with dedicated mirror sites in 16 countries, collaborations with U.S. and foreign professional societies and other international organizations, and has also provided a crucial lifeline for isolated researchers in developing countries. It also helped initiate, and continues to play a leading role in, the growing movement for open access to scholarly literature. The arXiv is entirely scientist driven: articles are deposited by researchers when they choose - either prior to, simultaneously with, or after peer review - and the articles are immediately available to researchers throughout the world.
As a pure dissemination system, it operates at a cost a factor of 100 to 1000 lower than that of a conventionally peer-reviewed system.4 This is the real lesson of the move to electronic formats and distribution: not that everything should somehow be free, but that with many of the production tasks automatable or off-loadable to the authors, the editorial costs then dominate the costs of an unreviewed distribution system by many orders of magnitude. Even with the majority of science research journals now on-line, researchers continue to enjoy both the rapid availability of materials, even if not yet reviewed, and open archival access to those same materials, even if held in parallel by conventional publishers. The methodology works within copyright law, as long as the depositor has the authority to deposit the materials and assign a non-exclusive license to distribute at the time of deposition, since such a license takes precedence over any subsequent copyright assignment.
The site has never been a random UseNet newsgroup or blogspace-like free-for-all. From the outset, arXiv.org relied on a variety of heuristic screening mechanisms, including a filter on institutional affiliation of submitter, to ensure insofar as possible that submissions are at least "of refereeable quality." That means they satisfy the minimal criterion that they would not be peremptorily rejected by any competent journal editor as nutty, offensive, or otherwise manifestly inappropriate, and would instead at least in principle be suitable for review. These mechanisms are an important - if not essential - component of why readers find the arXiv site so useful.
The arXiv repository functions are flexible enough either to co-exist with the pre-existing publication system or to help it evolve to something better optimized for researcher needs. A recent study5 suggests the extent to which researchers continue to use the conventional publication venues in parallel with arXiv distribution. While there are no comprehensive editorial operations administered by the site, the vast majority of the 56,000 new articles per year are nonetheless subject to some form of review, whether by journals, conference organizers, or thesis committees. Physics and astronomy journals have learned to take active advantage of the prior availability of the materials, and the resulting symbiotic relation was not anticipated 15 years ago. Though the most recently submitted articles have not yet necessarily undergone formal review, most can, would, or do eventually satisfy editorial requirements somewhere.
arXiv's success has relied upon its highly efficient use of both author and administrative effort, and has served its large and ever-growing user-base with only a fixed-size skeletal staff. In this respect, it long anticipated many of the current "Web 2.0" and social networking trends: providing a framework in which a community of users can deposit, share and annotate content provided by others.
In recent polls,6 users of physics information servers rated as most important their breadth and depth of coverage, comprehensiveness, timeliness, free and readily accessible full-text content, powerful search index, organization and quality of content, spam filtering, non-commerciality, and convenience of browsing. They also emphasized the importance of notification functions (newly received articles, author/keyword/reference alerts), whether by e-mail, separate webpage listing, or RSS feed. Users have grown to expect seamless access to older articles, most conveniently from one universal portal. Considered less important were user friendliness and general ease of use, quality of submission interface, availability of citation analysis, measures of readership, multimedia content, personalization, keywords and classification, and other collaborative tools. Users also mentioned the expected future utility of citation and reference linkages to open access articles, comment threads, access to associated data and ancillary material (data underlying tables and figures, code fragments underlying plots), and various forms of smarter search tools. A majority reports being willing to spend at least a few minutes per day tagging materials for collaborative filtering purposes.
More than 80% of respondents reported a desire for article synopses and overviews, although these would likely be just as labor-intensive to produce as in the paper age. A variety of other navigation tools were suggested, some potentially automatable, including personalized tables of contents, and flow diagrams representing the relations among research articles on a topic. Some suggested in addition a descriptive page for each research area, including lists of recent reviews, cutting-edge research articles, experimental results, and primary researchers involved, with links to their own web pages. Many of the desired features are prompted by the increasing wealth of information available in bibliographic and other databases. Their realization will rely on tools for organizing a hierarchical literature search that are able to preserve the context of retrieved items. Users also extolled the potential utility of having conference presentations in the form of slides or video linked to articles, and more generally of having all instances of a given piece of research linked together: notes, thesis, conference slides, articles, video, simulations, and associated data.
The connection between scientific literature and data in astronomy and astrophysics was illustrated by Kurtz.7 The new research methodology employed can involve multiple steps back and forth through a bibliographic database: following citations and references, looking for other articles by the same authors, or searching for keywords in abstracts of articles. It can also involve looking for related articles, related objects, or articles about those objects in over a dozen different databases, e.g., finding reference codes of objects, finding the experimental source, examining the observation itself, finding publications at the website of the experiment, following references to other external archives, including finding object catalogs and active missions, checking a sky atlas or digital sky survey at yet another site, checking an extragalactic database, cross-checking an archival catalog search, etc. While the possibilities enabled by these distributed resources are already quite impressive and result in fundamental improvements over prior methods, various manual steps in the search process are quite awkward due to the independent configurations of all the pieces of the system.7 Many navigation functions could be automated and usefully centralized if the separate databases were set up to interoperate more efficiently. Analogs of these inefficiencies exist in other fields of research.
"Next generation" implications depend significantly on what the next generation of users will expect. Some clues in this respect are available from surveying the generic functionality of current commonly used websites. Many top-level browsing features are held in common among sites devoted to scholarship on the one hand, and to those designed for shopping, entertainment and popular file-sharing on the other, including: browse by groups, categories, subject area, most recent, or by a variety of "popularity" measures including recently featured, most viewed, top rated, most discussed, top favorites, most linked, most honored, most shared, most blogged, or most searched. The convergence of features in the parallel realms is also evident in pages devoted to specific items, including standard descriptive metadata (title, author(s), submitter), links to browse "related items," "more from this user," "related keywords" (both local at the site or linked to 3rd party aggregators), and in collaborative features including functions to add tags and labels, rate items, flag as inappropriate, save to favorites, add to groups, share/e-mail to friend, blog item, post to 3rd party site, and add or read comments and responses. Some features specific to publisher sites are links to retrieve full text and supplemental data, show references/citations, addenda/corrigenda, related web pages, export citation, cite or link using DOI, alert when cited, find same object in 3rd party database, search 3rd party database, or find similar articles by various flavors of relatedness (text similarity, co-citation, co-reference, co-readership). Many sites include a subscribe function, with the option to be alerted to new issues or when specific keywords appear, the possibility to upload content, and as well there are various forms of personalization, including provisions for a private library of "my articles" with view, and add/subtract functions. 
These private libraries can optionally be made open to other users, providing a potential collective enhancement, though with possible privacy issues.
All of the above can be expected to comprise the core functionality of future sites intended for scholarly participation. A recently implemented example is Nature Precedings,8 a free service from the publishers of Nature that permits researchers in the biological and life sciences to share preliminary findings, solicit community feedback, and stake priority claims. It distributes preprints as does arXiv, but it also accepts other document types such as posters and presentations. The site advertises its intention to help researchers find useful content through collaborative filtering features including tagging, voting and commenting. It is possible that preprint and file sharing have historically been impeded in the biological and life sciences communities due to the perception that such prior appearance could prevent later appearance of the work in premier journals. Since it is connected to a premier journal, Nature Precedings could significantly foster acceptance of dissemination of prepublication and ancillary materials within these important communities.
In another example of the use of new technologies to port social research networking to a more distributed form, arXiv began accepting blog trackbacks in 2005. This technology permits blog software to signal to arXiv.org that a blog entry discusses a specific article; a link added from the relevant article abstract at arXiv.org back to the blog site can then facilitate discovery of useful comment threads by readers. Blogging can require a substantial time commitment, but a number of serious researchers have nonetheless joined in, providing rich content and links to other informative resources that wouldn't otherwise be readily discoverable. arXiv.org now points back from about 7000 articles to about 2000 blog entries, still a small volume. The underlying idea is to replicate in some on-line form the common experience of going to a meeting or conference and receiving from a friend or expert some informal recent research thoughts and an instant overview of a subject area. Though without the in-person contact, the blog links provide some semblance of the above discussion framework and are moreover available to all, helping to level the playing field just as does the open article dissemination. It is not yet known whether the useful blogger lifetime will be months, years, or decades, but more researcher bloggers are currently joining in than are dropping out. Perhaps trackbacks from a heavily used archival site such as arXiv will provide some additional incentive for bloggers, giving comfort that they're not just typing into the wind. While the current number of bloggers remains a minuscule percentage of the number of authors, it is possible that externally moderated discussion fora, where people can post occasional comments non-anonymously without having to maintain their own dedicated long-term blogs, will be the most important long-term usage.
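The trackback mechanism itself is a very lightweight protocol: the blog software sends an HTTP POST of a few standard form fields to a per-article ping URL, and the server answers with a small XML document. A minimal sketch follows; the arXiv ping URL pattern shown is an assumption for illustration, and real clients would add error handling and spam defenses.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_trackback(arxiv_id, post_url, title, excerpt, blog_name):
    """Assemble the ping URL and the standard Trackback form body."""
    ping_url = f"https://arxiv.org/trackback/{arxiv_id}"  # assumed URL pattern
    body = urlencode({
        "url": post_url,          # required: permalink of the blog entry
        "title": title,           # optional descriptive fields
        "excerpt": excerpt,
        "blog_name": blog_name,
    })
    return ping_url, body

def send_trackback(arxiv_id, post_url, title, excerpt, blog_name):
    """POST the ping; the XML reply's <error>0</error> indicates success."""
    ping_url, body = build_trackback(arxiv_id, post_url, title, excerpt, blog_name)
    req = Request(ping_url, data=body.encode("utf-8"),
                  headers={"Content-Type": "application/x-www-form-urlencoded"})
    return urlopen(req).read().decode("utf-8")
```

On the receiving side, the archive need only record the (article, blog-entry URL) pair and surface it on the abstract page, which is what makes the scheme cheap enough to run at arXiv's scale.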
Before expecting too rapid a pace of change, however, it is useful to consider as well some current habits of the high energy and condensed matter physicists surveyed above. A large majority reports using arXiv as its primary information source, and a large majority has personal web pages. But only a small percentage (< 10%) uses RSS feeds, and less than 20% listens to podcasts. Only a small percentage (< 10%) follows blogs regularly, a smaller percentage participates in blog discussions, and an even smaller percentage (< 1%) maintains its own blogs. Less than 10% have ever tried any social bookmarking sites, and only 1% found them useful. To be fair, many of these new resources have become widespread only in the past 3-4 years, so there may be an adoption lag for people already past their PhDs and already focused on research. Older generations of users can be expected to expand their repertoires as new features become commonplace at Internet commerce and other non-research sites, but can't be relied upon to anticipate the most useful future features.
There is currently much discussion of free access to the on-line scholarly literature. It has long been argued that this material becomes that much more valuable when freely accessible, and moreover that it is in the public policy interest to make the results of publicly funded research freely available as a public good.9 It is also suggested that the move to open access could ultimately lead to a more cost-efficient scholarly publication system. There are recent indications that the U.S. and other governments may become directly involved, by mandating some form of open access to research funded by government agencies. The message to legislators is deceptively short and simple: "The taxpayers have paid for the research, so deserve access to the results." The counter-argument is somewhat more subtle and takes paragraphs to elucidate, so the U.S. Congress can be expected to legislate some form of open access, beginning with a requirement that articles based on certain forms of federally funded research be deposited in public repositories within one year of publication.10 It may seem a non sequitur to force researchers to act in what should be their self-interest, while the general public spontaneously populates file-sharing sites such as Photobucket and YouTube, but such is the current politics of scholarly publication.
The response of the publishing community is essentially that their editorial processes provide a critical service to the research community, that these are labor-intensive and hence costly, and that even if delayed, free access could impair their ability to support these operations. In short, it costs real money to implement quality control by the time-honored methodology. If that methodology is maintained, the requisite funds must continue to flow to the same, or perhaps more cost-efficient, intermediaries between authors and readers. If the flow of funds is reduced, then a different methodology for quality control and authentication of the materials will be required. The basic tension is that authors and readers want some form of quality control, but the most efficient mechanism for providing it, and for paying for it, is still unclear. The problems of trust, integrity, and authentication mentioned earlier for the web at large remain critical to the scholarly communities from which it sprang.
A complicating factor is that the current costs of doing business vary significantly from publisher to publisher, as do the profits. One proposal is for authors to pay for publication once an article is accepted, making it free for all to read without subscription costs. As a side-effect, this proposal exposes the real costs of publication, ordinarily visible to the libraries paying the subscriptions but not to the researchers themselves. If this provides a mechanism to influence author choice of journals and if lowered profit margins necessary to attract authors persuade some of the more profitable commercial publishers to shift to other more lucrative endeavors, then other entities would still have to be available to fill the large gap in capacity left by their departure.
There are not only hierarchies of cost, but also hierarchies of funding from research discipline to research discipline. The average amount of research funding can vary from a few hundred thousand dollars per article in some areas of biomedical research to zero for the majority of mathematicians, who have no grant funding at all.11 The areas with higher levels of funding per article are more likely to be able to take advantage of an author-pays model. Another recent proposal12 within the High Energy Physics community is sponsorship of journals by large organizations, starting with major research laboratories. The concern is again the long-term sustainability of the commitment: while there is a loss of access for failure to pay subscriptions, there is no immediate downside for failure to meet sponsorship commitments. The "we don't do charity" sentiment expressed by librarians is also understandable. So with subscriptions already on the decline since long before the advent of Internet access, it is difficult to argue that journals should accept the transition to sponsored open access without knowing whether it would be permanent. While some new open access journals13 are accepted by scientists, there are, as pointed out by Blume,14 no examples of any journal of significant size that has been converted from subscription to open access, and few if any open access examples of sustained cost recovery.
Studies have shown a correlation between openly accessible materials and citation impact,15 though a direct causal link is more difficult to establish, and other mechanisms accounting for the effect are easily imagined. It is worthwhile to note, however, that even if some articles currently receive more citations by virtue of being open access, it doesn't follow that the benefit would continue to accrue through widespread expansion of open access publication. Indeed, once the bulk of publication is moved to open access, then whatever relative boost might be enjoyed by early adopters would long since have disappeared, with relative numbers of citations once again determined by the usual independent mechanisms. Citation impact per se is consequently not a serious argument for encouraging more authors to adopt open access publication. A different potential impact and benefit to the general public, on the other hand, is the greater ease with which science journalists and bloggers can write about and link to open access articles.
A form of open access appears to be happening by a backdoor route regardless: using standard search engines, over a third of the high impact journal articles in a sample of biological/medical journals published in 2003 were found at non-journal websites.16 Informal surveys17 of publications in other fields, freely available via straightforward web search, suggest that many communities may already be further along in the direction of open access than most realize. Most significantly, the current generation of students has grown up with a variety of forms of file and content sharing, legal and otherwise. This generation greets with dumbfounded mystification the explanation of how researchers perform research, write an article, make the figures, and then are not permitted to do as they please with the final product. Since the current generation of undergraduates, and next generation of researchers, already takes it for granted that such materials should be readily accessible from anywhere, it is more than likely that the percentage of backdoor materials will only increase over time, and that the publishing community will need to adapt to the reality of some form of open access, regardless of the outcome of the government mandate debate.
There is more to open access than just free access. True open access permits any third party to aggregate and data-mine the articles, themselves treated as computable objects, linkable and interoperable with associated databases. The range of possibilities for large and comprehensive full-text aggregations is just starting to be probed. The PubMed Central database,18 operated in conjunction with GenBank and other biological databases at the U.S. National Library of Medicine, is a prime exemplar of a forward-looking approach. It is growing rapidly and (as of June 2007) contains over 333,000 recent articles in fully functional XML from over 200 journals (and additionally over 683,000 scanned articles from back issues19). A congressionally mandated open access policy for NIH-supported publications would generate an additional 70,000 articles a year for PubMed Central.20
The full text XML documents in this database are parsed to permit multiple different "related material views" for a given article, with links to genomic, nucleotide, inheritance, gene expression, protein, chemical, taxonomic, and other databases. For example, GenBank accession numbers are recognized in articles referring to sequence data and linked directly to the relevant records in the genomic databases. Protein names are recognized and their appearances in articles are linked automatically to the protein and protein interaction databases. Names of organisms are recognized and linked directly to the taxonomic databases, which are then used to compute a minimal spanning tree of all the organisms contained in a given document. In yet another "view," technical terms are recognized and linked directly to the glossary items in the relevant standard biology or biochemistry textbook in the books database. Sets of selected articles resulting from bibliographic queries can also have their aggregated full texts searched simultaneously for links to over 25 different databases, including those mentioned above. The enormously powerful sorts of data-mining and number-crunching, already taken for granted as applied to the open access genomics databases, can be applied to the full text of the entirety of the biology and life sciences literature, and will have just as great a transformative effect on the research done with it.
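The entity recognition underlying these "views" can be sketched very simply: scan the article text for identifiers matching a known grammar and rewrite them as links into the corresponding database. The toy example below handles only the classic GenBank-style nucleotide accession forms (one letter plus five digits, or two letters plus six digits); real accession grammars and the production pipelines at PubMed Central are considerably richer, and the link target shown is merely illustrative.

```python
import re

# Classic GenBank-style accession forms: U12345 or AF123456.
# (A deliberate simplification of the real accession grammar.)
ACCESSION = re.compile(r"\b([A-Z]\d{5}|[A-Z]{2}\d{6})\b")

def link_accessions(text):
    """Rewrite each recognized accession number as an HTML link into
    the sequence database (illustrative target URL)."""
    def to_link(match):
        acc = match.group(1)
        return f'<a href="https://www.ncbi.nlm.nih.gov/nuccore/{acc}">{acc}</a>'
    return ACCESSION.sub(to_link, text)
```

The same pattern, with dictionaries rather than regular expressions, applies to protein names, organism names, and glossary terms; the hard part in practice is disambiguation, not matching.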
On the one-decade time scale, it is likely that more research communities will join some form of global unified archive system, without the current partitioning and access restrictions familiar from the paper medium, for the simple reason that it is the best way to communicate knowledge and hence to create new knowledge. The genomic and related resources described above are naturally interlinked by virtue of their common hosting by a single organization, a situation very different from that described earlier for astronomy research. For most disciplines, the key to progress will be development of common web service protocols, common languages (e.g., for manipulating and visualizing data), and common data interchange standards, to facilitate distributed forms of the above resources. The adoption of these protocols will be hastened as independent data repositories adopt dissemination of seamlessly discoverable content as their raison d'être. The text parsings described above have natural analogs in all fields: astronomical objects and experiments in astronomy; mathematical terms and theorems in mathematics; physical objects, terminology, and experiments in physics; chemical structures and experiments in chemistry; etc. Many of the external databases that would provide targets for such automated markup already exist.
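One concrete instance of such a common web service protocol already in wide repository use is OAI-PMH, which arXiv itself exposes. The sketch below builds a metadata harvesting request and extracts Dublin Core titles from a response; the arXiv endpoint named in the test is its actual OAI-PMH interface, but error handling, flow control (resumption tokens), and network code are omitted, so treat this as a sketch of the protocol shape rather than a working harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

def list_records_url(base_url, set_spec, date_from):
    """Build an OAI-PMH ListRecords request for Dublin Core metadata."""
    return base_url + "?" + urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "set": set_spec,          # e.g. a subject-area set
        "from": date_from,        # incremental harvesting by date
    })

def titles(oai_xml):
    """Pull all dc:title elements out of an OAI-PMH response document."""
    root = ET.fromstring(oai_xml)
    return [el.text for el in root.iter(DC_TITLE)]
```

Because every compliant repository answers the same six verbs in the same XML envelope, a harvester written once can aggregate content across arbitrarily many independent archives, which is precisely the interoperability the paragraph above calls for.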
One of the surprises of the past two decades is how little progress has been made in the underlying document format employed. Equation-intensive physicists, mathematicians, and computer scientists now generally create PDF from TeX, a methodology based on a pre-1980s print-on-paper mentality and not optimized for network distribution. The implications of widespread usage of newer document formats, such as Microsoft's Office Open XML or the OASIS OpenDocument format, and the attendant ability to extract semantic information and modularize documents, are scarcely appreciated by the research communities. Machine learning techniques familiar from artificial intelligence research will assist in the extraction of metadata and classification information, assisting authors and improving services based on the cleaned metadata. Semantic analysis of the document bodies will facilitate the automated interlinking to external resources described above and lead to improved navigation and discovery services for readers. A related question is what authoring tools and functions should be added to word processing software, both commercial and otherwise, to provide an optimal environment for scientific authorship. Many of the interoperability protocols for distributed database systems will equally accommodate individual authoring clients or their proxies, and we can expect many new applications beyond real-time automated markup and autonomous reference finding.
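The classification step mentioned above can be illustrated with the simplest of machine learning techniques: a bag-of-words naive Bayes classifier that suggests a subject category from an abstract. A production system would train on tens of thousands of labeled submissions and use richer features; everything here, including the category names and training snippets, is purely illustrative.

```python
import math
from collections import Counter, defaultdict

class SubjectClassifier:
    """Toy naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training doc count
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat, counts in self.word_counts.items():
            # log prior + smoothed log likelihood of each word
            score = math.log(self.doc_counts[cat] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best
```

Even this crude model captures the key economy: category suggestions cost the author a glance rather than a form, and the resulting cleaned metadata feeds every downstream navigation service.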
Every generation thinks itself somehow unique, but there are nonetheless objective reasons to believe that we are witnessing an essential change in the way information is accessed, and in the way it is communicated to and from the general public and among research professionals - fundamental methodological changes that will leave the terrain 10-20 years from now more altered than over any comparable time period in the past.