CrossRef Text and Data Mining Services: Simplifying Life for Researchers and Publishers

Good science editors spend a great deal of time improving life for their readers. They arrange for peer review, choose appropriate content, and take steps to ensure that authors and editors comply with ethical guidelines. They also spend time and resources on copyediting and markup to improve the print and online reading experiences.

Today, researchers read less, and machines process more, so editors must also consider the needs of nonhuman readers. Scientific content can be mined for insights and information in ways that no one could have imagined in 1665, when the Royal Society published its first article. For example, a few years after JSTOR was founded in 1995, researchers such as Fred Shapiro of Yale Law School began to use the new online resource of historical texts to do something no one had anticipated: mine the scholarly literature to discover the earliest uses of particular terms and quotations.1,2

Scholarly publications are optimized for human readers, not for robots. In fact, some Web-hosting platforms built when scholarly journals first moved online shut off access to any robot that they detected, because of the potential for piracy or denial of service.

What a Publisher Wants

Publishers want to support legitimate research use of their content, and they don’t want to spend a lot of time working out one-off agreements. They want miners to get good-quality data. They want to ensure that services to their human readers are not inadvertently disrupted by high-volume robotic activity. Some publishers may not yet feel the urgency to support machine use of their content. “We only get a few requests a year,” they say.

What a Researcher Wants

Quantifying the market need for text- and data-mining access is difficult, but data-mining researchers maintain that the few requests that publishers are aware of constitute a mere leak in the dam. They anticipate that a flood of requests will eventually overflow the barriers unless policies and systems are put into place to handle them now. Heather Piwowar, cofounder of ImpactStory, compares the dearth of researcher requests for data-mining access with people’s demand for elevators before they were widely available. No one saw the need for an elevator in a three- or four-story building. Eventually, of course, the elevator became a technology that enabled the building of skyscrapers. Researchers want “taller” knowledge stores.3

In many ways, researchers want the same things that publishers want. They want to spend their time in research rather than in requesting and following up on time-consuming permissions. They would rather analyze data than negotiate with potentially hundreds of publishers for needed content. Text- and data-mining researchers need programs that can crawl and download text without undue hurdles. They want the licenses that their institutions already pay for to cover legitimate research activities. Many are passionate about new forms of knowledge to be discovered; many are frustrated when roadblocks slow them down.

What to Do: CrossRef Text and Data Mining Services

Enter CrossRef Text and Data Mining Services, launched in May 2014. CrossRef is a not-for-profit association of publishers known for innovative services that rely on collaboration to improve scholarly communication through better linking, discoverability, and tools for evaluating quality. Its newest service gives researchers a single method for accessing and mining full-text content from participating publishers, without having to approach each publisher individually and regardless of the publishers’ business models.

What Researchers Do

Researchers use an application programming interface (API) to access the full text of content on the basis of CrossRef digital object identifiers (DOIs) that point to the content most appropriate for mining. The API is the same for all participating publisher sites, whether the content is publicly accessible or requires subscriptions or other payment for access.
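A DOI-based lookup might be sketched as follows. This is a minimal illustration rather than the service's documented interface: the base URL `https://api.crossref.org/works/` and the helper name `metadata_url` are assumptions not given in this article, and the DOI shown is a placeholder.

```python
from urllib.parse import quote

# Assumed base URL for DOI metadata lookups; not stated in this article.
ASSUMED_API_BASE = "https://api.crossref.org/works/"


def metadata_url(doi: str) -> str:
    """Return a lookup URL for one DOI (hypothetical helper).

    The DOI is percent-encoded so that characters such as '<', '>' or '#',
    which can appear in DOIs, cannot corrupt the URL. The '/' between the
    DOI prefix and suffix is left as-is.
    """
    return ASSUMED_API_BASE + quote(doi.strip(), safe="/")


if __name__ == "__main__":
    # Placeholder DOI used purely for illustration.
    print(metadata_url("10.5555/12345678"))
```

Because the lookup is keyed only on the DOI, the same call shape would apply to every participating publisher, whatever its access model.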

Researchers continue to use their favorite discovery tools (such as Google Scholar, PubMed, Scopus, and Web of Science) to identify the content that they are interested in mining. CrossRef does not store any full text. It does store the Web addresses of minable content and the license governing its use, even for an open-access (OA) license. Access control to the content, if any, always remains with the publisher, not with CrossRef. OA publishers can simply return the full text when it is requested by the researcher via the API, and subscription-based publishers will continue to use their existing access-control systems before allowing the researchers’ programs inside.
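To make this division of labor concrete, the sketch below reads the two things CrossRef stores (the addresses of minable content and the governing license) from a metadata record. The record layout (`link`, `license`, `URL`, `content-type`, `intended-application`) and every URL in the sample are assumptions for illustration; this article does not specify the response format.

```python
from typing import Dict, List

# Illustrative record only; field names and URLs are assumed, not taken
# from this article.
SAMPLE_RECORD = {
    "DOI": "10.5555/12345678",
    "license": [{"URL": "https://example.org/licenses/tdm-license"}],
    "link": [
        {
            "URL": "https://publisher.example.org/fulltext/12345678.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example.org/fulltext/12345678.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
    ],
}


def license_urls(record: dict) -> List[str]:
    """Return the license URLs attached to a record (empty list if none)."""
    return [entry["URL"] for entry in record.get("license", []) if "URL" in entry]


def mining_links(record: dict) -> Dict[str, str]:
    """Map each content type to the full-text address flagged for mining."""
    return {
        entry.get("content-type", "unspecified"): entry["URL"]
        for entry in record.get("link", [])
        if entry.get("intended-application") == "text-mining" and "URL" in entry
    }


if __name__ == "__main__":
    print(license_urls(SAMPLE_RECORD))
    print(mining_links(SAMPLE_RECORD))
```

The full text itself would then be requested from the publisher's address, so any access control is applied by the publisher and not by CrossRef.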

What Publishers Do: The Minimum

Many publisher licenses already allow text and data mining. Some countries have enacted copyright exemptions for text- and data-mining uses. To participate in CrossRef Text and Data Mining Services, publishers need only deposit two new pieces of metadata:

  • The license information, even if it is an open license.
  • Full-text links to the mining-optimized version for each article.

That’s it. Researchers are ready to take advantage of the standard interface for multiple publishers.
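A publisher could check that both pieces are present for each article before depositing. The sketch below is hypothetical throughout: the `MiningMetadata` class and its field names are invented for illustration and do not represent CrossRef's deposit schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiningMetadata:
    """Hypothetical container for the two pieces of metadata per article."""

    doi: str
    license_url: str = ""
    full_text_links: List[str] = field(default_factory=list)


def missing_pieces(meta: MiningMetadata) -> List[str]:
    """List which of the two required pieces are still missing."""
    missing = []
    if not meta.license_url.strip():
        missing.append("license information")
    if not any(link.strip() for link in meta.full_text_links):
        missing.append("full-text link")
    return missing


if __name__ == "__main__":
    # Placeholder DOI and URLs, for illustration only.
    draft = MiningMetadata(
        doi="10.5555/12345678",
        license_url="https://example.org/licenses/open-license",
    )
    print(missing_pieces(draft))
```

Note that an open license still counts as license information here, matching the first requirement in the list above.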

What Publishers Do: The Options

What of publishers with other concerns? Perhaps a publisher struggles with response time on its existing platform and cannot immediately increase bandwidth. Another publisher wants to encourage data mining but lacks the rights to grant researchers permission for some of the figures and tables in its content.

Option I: Rate Limiting

For publishers who support mining but might need to protect the user experience of their core human readers from sluggish performance, CrossRef allows—but does not require—publishers to communicate download rate limits to programs.
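On the researcher's side, a well-behaved mining program would read whatever limit the publisher communicates and pause accordingly. The sketch below assumes the limit arrives as HTTP response headers named `CR-TDM-Rate-Limit-Remaining` and `CR-TDM-Rate-Limit-Reset`, with the reset given as a Unix timestamp in seconds; this article does not specify the mechanism, so treat those names and semantics as assumptions.

```python
import time
from typing import Mapping, Optional

# Assumed header names; the article does not say how limits are communicated.
REMAINING_HEADER = "CR-TDM-Rate-Limit-Remaining"
RESET_HEADER = "CR-TDM-Rate-Limit-Reset"


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long a mining program should pause before its next download.

    Assumes REMAINING_HEADER holds the number of requests left in the current
    window and RESET_HEADER holds the end of that window as a Unix timestamp
    in seconds. Missing or malformed headers are treated as "no limit stated".
    """
    if now is None:
        now = time.time()
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        remaining = int(lowered[REMAINING_HEADER.lower()])
        reset_at = float(lowered[RESET_HEADER.lower()])
    except (KeyError, ValueError):
        return 0.0
    if remaining > 0:
        return 0.0
    return max(0.0, reset_at - now)


if __name__ == "__main__":
    sample = {REMAINING_HEADER: "0", RESET_HEADER: "1000"}
    print(seconds_to_wait(sample, now=990.0))
```

Because the limits are optional, the function returns zero when no limit is stated, so the same client code works against publishers that do not use this option.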

Option II: Click-Through Agreements

In a few cases, a publisher may determine that its institutional license is not sufficient to allow text and data mining. Such a publisher can, at its option, deposit a click-through license with CrossRef that outlines additional terms. Again, that is not a required part of the service, and CrossRef does not expect it to be heavily used.

If a publisher chooses to require a click-through agreement, the researcher can download the license and review it before choosing whether to agree to the terms. The terms themselves are determined by the business practices of the publisher. CrossRef provides the services to display and serve the license, but it does not have any control over or responsibility for publishers’ terms.
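A mining program could respect this flow by refusing to request full text until the researcher has explicitly accepted every required license. The sketch below is illustrative only: the helper names are hypothetical, and the idea that acceptance is proven by sending a token in a `CR-Clickthrough-Client-Token` request header is an assumption not described in this article.

```python
from typing import Dict, Iterable, List, Set

# Assumed header name for presenting a researcher's click-through token.
TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def unaccepted_licenses(required: Iterable[str], accepted: Set[str]) -> List[str]:
    """Return required license URLs the researcher has not yet agreed to."""
    return [url for url in required if url not in accepted]


def request_headers(
    required: Iterable[str], accepted: Set[str], token: str = ""
) -> Dict[str, str]:
    """Build headers for a full-text request, or raise if terms are pending.

    Raising keeps the decision with the researcher: nothing is downloaded
    under terms that have not been reviewed and accepted.
    """
    pending = unaccepted_licenses(required, accepted)
    if pending:
        raise PermissionError("Review and accept first: " + ", ".join(pending))
    return {TOKEN_HEADER: token} if token else {}


if __name__ == "__main__":
    # Placeholder license URL and token, for illustration only.
    terms = ["https://publisher.example.org/licenses/tdm-terms"]
    print(request_headers(terms, accepted=set(terms), token="example-token"))
```

Content with no click-through requirement passes through with an empty `required` list, which reflects the optional nature of this feature.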

Where It Stands

CrossRef Text and Data Mining Services launched in May 2014, after a pilot period that involved a number of publishers, including Elsevier, Wiley, Springer, Taylor & Francis, and Walter de Gruyter. Researchers interested in text and data mining provided comments. CrossRef is working with publishers to add full-text links and license information to existing CrossRef metadata. Once publishers have done that, their content is effectively enabled for mining via the API.

More than 370,000 CrossRef records include full-text link and license information fields at this writing. The number of articles and other documents is growing as more publishers adopt the service.

CrossRef Text and Data Mining Services provides a common, simple way for text- and data-mining researchers to access the content that they need. It meets the demand for publisher content to be used in increasingly sophisticated ways as online scholarly research continues to evolve. The API is free for researchers and the public to use, and publishers incur no costs to implement the services through 2014. Additional information is available on the CrossRef Web site.4

References

  1. Hafner K. A new way of verifying old and familiar sayings. New York Times. 1 February 2001. www.nytimes.com/2001/02/01/technology/01YALE.html. Accessed 1 August 2014.
  2. Guthrie KM, Kirchhoff A, Tapp WN. The JSTOR solution, six years later. In: Digital Libraries: A Vision for the 21st Century, Patricia Hodges, et al. Ann Arbor, MI: Michigan Publishing, University of Michigan Library. 2003. http://dx.doi.org/10.3998/spobooks.bbv9812.0001.001. Accessed 1 August 2014.
  3. Piwowar H. Building skyscrapers with our scholarship. Presentation at CrossRef annual meeting, 11 November 2013, Cambridge, MA. www.slideshare.net/CrossRef/2013-crossref-annualmeeting-building-skyscrapers-heatherpiwowar. Accessed 1 August 2014.
  4. CrossRef Web site. www.crossref.org/tdm.