What I'm doing these days: research data management

So, it’s been a while since I’ve properly updated this blog, and since I seem to be having another try, I thought it would be useful to give a brief overview of what I’m doing these days, so that some of the other stuff I have in the pipeline makes a bit more sense.

My current work focus is research data management: helping university researchers to look after their data in ways that let them and the community get the most out of it. Data is the bedrock of most (all?) research: the evidence on which all the arguments and conclusions and new ideas are based. In the past, this data has been managed well (generally speaking) by and for the researchers collecting and using it, and this situation could have continued indefinitely.

Technology, however, has caused two fundamental changes to this position. First, we’re able to measure more and more about more and more, creating what has been termed a “data deluge”. It’s now possible for one researcher to generate, in the normal course of their work, far more data than they could possibly analyse themselves in a lifetime. For example, the development of polymerase chain reaction (PCR) techniques has enabled the fast, cheap sequencing of entire genomes: for some conditions, patients’ genomes are now routinely sequenced for future study. A typical human genome sequence occupies 8TB (about 1700 DVDs), and after processing and compression, this shrinks to around 100GB (21 DVDs). This covers approximately 23,000 genes, of which any one researcher may only be interested in a handful.

Second, the combination of the internet and cheap availability of computing power means that it has never been easier to share, combine and process this data on a huge scale. To continue our example, it’s possible to study genetic variations across hundreds or thousands of individuals to get new insights into how the body works. The 100,000 Genomes Project (“100KGP”) is an ambitious endeavour to establish a database of such genomes and, crucially, develop the infrastructure to allow researchers to access and analyse it at scale.

In order to make this work, there are plenty of barriers to overcome. The practices that kept data in line long enough to publish the next paper are no longer good enough: the organisation and documentation must be made explicit and consistent so that others can make sense of it. It also needs to be protected better from loss and corruption. Obviously, this takes more work than just dumping it on a laptop, so most people want some reassurance that this extra work will pay off.

Sharing has risks too. Identifiable patient data cannot be shared without the patients’ consent; indeed, doing so would be a criminal offence in Europe. Similar rules apply to sensitive commercial information. Even if there aren’t legal restrictions, most researchers have a reasonable expectation (albeit developed before the “data deluge”) that they will be able to reap the reputational rewards of their own hard work by publishing papers based on it.

There is therefore a great deal of resistance to these changes. But there can be benefits too. For society, there is the possibility of advancing knowledge in directions that would never have been possible even ten years ago. But there are practical benefits to individuals too: every PhD supervisor and most PhD students know the frustration of trying to continue a student’s poorly-documented work after they’ve graduated.

For funders, the need for change is particularly acute. Budgets are being squeezed, and with the best will in the world there is less money to go around, so there is pressure to ensure the best possible return on investment. This means that it’s no longer acceptable, for example, for several labs in the country to be running identical experiments just to do different things with the results. It’s more important than ever to make more data available to and reusable by more people.

So the funders (in the UK, particularly the government-funded research councils) are introducing requirements on the researchers they fund to move along this path more quickly than they might feel comfortable with. It therefore seems reasonable to offer these hard-working people some support, and that’s where I come in.

I’m currently spending my time providing training and advice, bringing people together to solve problems and trying to convince a lot of researchers to fix what, in many cases, they didn’t think was broken! They are subject to conflicting expectations and need help navigating this maze so that they can do what they do best: discover amazing new stuff and change the world.

For the last 6ish months I’ve been doing this at Imperial College (my alma mater, no less) and loving it. It’s a fascinating area for me, and I’m really excited to see where it will lead me next!

If you have time, here’s a (slightly tongue-in-cheek) take on the problem from the perspective of a researcher trying to reuse someone else’s data:

