Well, I did a great job of blogging the conference for a couple of days, but then I was hit by the bug that’s been going round and didn’t have a lot of energy for anything other than paying attention and making notes during the day! I’ve now got round to reviewing my notes so here are a few reflections on day 2.
Day 2 was the day of many parallel talks! So many great and inspiring ideas to take in! Here are a few of my take-home points.
Big science and the long tail
The first parallel session had examples of practical data management in the real world. Jian Qin & Brian Dobreski (School of Information Studies, Syracuse University) worked on reproducibility with one of the research groups involved with the recent gravitational wave discovery. “Reproducibility” for this work (as with much of physics) mostly equates to computational reproducibility: tracking the provenance of the code and its input and output is key. They also found that in practice the scientists’ focus was on making the big discovery, and ensuring reproducibility was seen as secondary. This goes some way to explaining why current workflows and tools don’t really capture enough metadata.
Milena Golshan & Ashley Sands (Center for Knowledge Infrastructures, UCLA) investigated the use of Software-as-a-Service (SaaS, such as Google Drive, Dropbox or more specialised tools) as a way of meeting the needs of long-tail science research such as ocean science. This research is characterised by small teams, diverse data, dynamic local development of tools, local practices and difficulty disseminating data. This results in a need for researchers to be generalists, as opposed to “big science” research areas, where they can afford to specialise much more deeply. Such generalists tend to develop their own isolated workflows, which can differ greatly even within a single lab. Long-tail research also often struggles from a lack of dedicated IT support. They found that use of SaaS could help to meet these challenges, but with a high cost required to cover the needed guarantees of security and stability.
Education & training
This session focussed on the professional development of library staff. Eleanor Mattern (University of Pittsburgh) described the immersive training introduced to improve librarians’ understanding of the data needs of their subject areas in delivering their RDM service delivery model. The participants each conducted a “disciplinary deep dive”, shadowing researchers and then reporting back to the group on their discoveries with a presentation and discussion.
Liz Lyon (also University of Pittsburgh, formerly UKOLN/DCC) gave a systematic breakdown of the skills, knowledge and experience required in different data-related roles, obtained from an analysis of job adverts. She identified distinct roles of data analyst, data engineer and data journalist, and as well as each role’s distinctive skills, pinpointed common requirements of all three: Python, R, SQL and Excel. This work follows on from an earlier phase which identified an allied set of roles: data archivist, data librarian and data steward.
Data sharing and reuse
This session gave an overview of several specific workflow tools designed for researchers. Marisa Strong (University of California Curation Centre/California Digital Libraries) presented Dash, a highly modular tool for manual data curation and deposit by researchers. It’s built on their flexible backend, Stash, and though it’s currently optimised to deposit in their Merritt data repository it could easily be hooked up to other repositories. It captures DataCite metadata and a few other fields, and is integrated with ORCID to uniquely identify people.
In a different vein, Eleni Castro (Institute for Quantitative Social Science, Harvard University) discussed some of the ways that Harvard’s Dataverse repository is streamlining deposit by enabling automation. It provides a number of standardised endpoints such as OAI-PMH for metadata harvest and SWORD for deposit, as well as custom APIs for discovery and deposit. Interesting use cases include:
- An addon for the Open Science Framework to deposit in Dataverse via SWORD
- An R package to enable automatic deposit of simulation and analysis results
- Integration with publisher workflows Open Journal Systems
- A growing set of visualisations for deposited data
In the future they’re also looking to integrate with DMPtool to capture data management plans and with Archivematica for digital preservation.
Andrew Treloar (Australian National Data Service) gave us some reflections on the ANDS “applications programme”, a series of 25 small funded projects intended to address the fourth of their strategic transformations, single use → reusable. He observed that essentially these projects worked because they were able to throw money at a problem until they found a solution: not very sustainable. Some of them stuck to a traditional “waterfall” approach to project management, resulting in “the right solution 2 years late”. Every researcher’s needs are “special” and communities are still constrained by old ways of working. The conclusions from this programme were that:
- “Good enough” is fine most of the time
- Adopt/Adapt/Augment is better than Build
- Existing toolkits let you focus on the 10% functionality that’s missing
- Succussful projects involved research champions who can: 1) articulate their community’s requirements; and 2) promote project outcomes
All in all, it was a really exciting conference, and I’ve come home with loads of new ideas and plans to develop our services at Sheffield. I noticed a continuation of some of the trends I spotted at last year’s IDCC, especially an increasing focus on “second-order” problems: we’re no longer spending most of our energy just convincing researchers to take data management seriously and are able to spend more time helping them to do it better and get value out of it. There’s also a shift in emphasis (identified by closing speaker Cliff Lynch) from sharing to reuse, and making sure that data is not just available but valuable.