Keynote Address: The Research Data Revolution

Of the many striking points G. Sayeed Choudhury made in his keynote address, the one that got the biggest audience reaction was “data is the new bacon.” The more fastidious editors among us might have preferred it to read “data are the new bacon,” but the point was well taken: data management is becoming ever more relevant and essential, with the topic appearing in such diverse forums as the Harvard Business Review (“Data Scientist: The Sexiest Job of the 21st Century”) and Twitter. The statement that “data is the new oil” (Clive Humby) encourages us to think about data as a natural resource: they flow, can be extracted, and must be preserved for the use of future generations, whose goals and results cannot be predicted by today’s researchers.

An engineer by trade, Choudhury described engineering as a liberal art, involving people, processes, and products and the workflows that connect them. He applies this same outlook to the challenges of data preservation through his work with the Data Conservancy (DC; https://dataconservancy.org/), which is devoted to data preservation and the cross-disciplinary aspects of data management using a three-pronged approach: preserve, share, and discover. DC aims to collect research data, reveal their potential across many disciplines, and promote re-use and new combinations of data. At the governmental level, data preservation may be addressed by programs such as the Commons (a program of the National Institutes of Health), which multiple agencies may use to store and share data, and the White House Open Data Initiative (www.data.gov), designed to give the public access to tools and resources for research and analysis.

The traditional definitions of “Big Data” are based on its volume, velocity, and variety. But in the future, the size of a given collection will matter less than what can be done with the data and how they interface with other available services. For Choudhury, the definition of “Big Data” is less about size and more about methods (or the lack thereof): When a community’s ability to deal with the data is overwhelmed and new methods are required, that’s when data become “big.” We are currently grappling with this issue with scientific data: They have become so massive that we need new systems to effectively manage and preserve them.

Much of this address focused on libraries, but the concepts can also be applied to publications. Both libraries and publishers can be considered in terms of the three pillars of collections, services, and infrastructure. Data are a new form of collection. Storage is basically the same regardless of content, and if data sets are open, libraries must distinguish themselves by the services they offer. The existing infrastructure cannot interpret data in as sophisticated a way as humans can. As with libraries, “publishing is about content, not format” (Wendy Queen).

The comments during the question-and-answer period touched on issues that affect every publisher: We can identify plagiarized or recycled text, but we can’t identify stolen data. All agreed on the need for an iThenticate-type product for data sets. As publishers attempt to convince researchers to make their data widely available, federal and private funding agencies increasingly require open availability of data, which helps motivate researchers to comply with publishers’ requests. The question period wrapped up with the observation that, as editors and publishers, we have to think about how best to serve readers and what tools we can offer them. Choudhury replied that professional society publishers are uniquely positioned to provide services to researchers because of their relationships with and in-depth knowledge of specific scientific communities.

The major takeaway of this talk was that “one person’s noise is another person’s signal.” Careful data preservation and management are essential primarily because it’s impossible to anticipate how existing data might be used in the future. Data that are properly stored, archived (protected), preserved, and curated today will be available to answer the questions and solve the problems of tomorrow.