Emacs org-mode is a powerful tool, as I’ve written about before. As well as being good at project and task management, it also has features for writing documents:

  • Export in a variety of formats, including HTML, OpenDocument and LaTeX
  • Embed snippets of code (in your favourite language), execute them and include the results (and optionally the code itself) in the exported document

I’ll come back to why that’s useful in a moment.

I’ve had a presentation (a position paper on preserving software) accepted for the LIBER conference next month. I may choose to subsequently submit it as an article to LIBER Quarterly (this is a relatively common pattern) so I thought I’d try writing the article and the presentation together in a single document, and see how it worked. If nothing else, writing the paper will help me structure my ideas for the presentation, even if it never gets published.

There are probably other ways of doing it, but I’ve set it up so that exporting in different formats gives me the two different versions of the document:

  • Export as LaTeX produces a PDF version of the presentation slides, using the Beamer package.
  • Export as OpenDocument text produces the article, ready for submission.

It took a little sleight of hand, but I’m putting my Beamer slides in #+BEGIN_LaTeX blocks, which org-mode will include in the LaTeX export but not any other format, and I’ve configured Beamer to ignore any text outside frames, which forms the body of the article.

I don’t want the abstract and other metadata cluttering up the document itself, so I’ve pushed those out to a separate file and used an #+INCLUDE statement to pull it into the main document at export time.

The one thing missing is a built-in way of integrating with Zotero, which I use as my bibliographic database, to format my references. However, Zotero has a very functional API, so I’ve put together a short Ruby script that grabs a Zotero collection and tweaks the formatting. Whenever I export the document from org-mode, the code is run and the result (a formatted bibliography) is embedded in the finished version.

require 'open-uri'
require 'rexml/document'

# group_id, coll_id and style are passed in from the org-mode document
url = "https://api.zotero.org/groups/#{group_id}/collections/#{coll_id}/items/top?format=bib&style=#{style}"
REXML::Document.new(open(url)).elements.each('//div[@class="csl-entry"]') do |entry|
  puts '- ' + entry.children
              .collect{|c| if c.instance_of? REXML::Text then c.value else c.text end}
              .join.gsub(%r{</?i>}, '/').gsub(%r{/\(}, '/ (')
end

This formats a Zotero collection, converting HTML <i> tags to the equivalent org-mode markup. It requires three parameters, group_id, coll_id and style, which can be configured in the org-mode document and passed through when emacs executes the code. The same snippet could thus be used in multiple documents, just varying the parameters to format a different set of references.

Clearly, embedding executable source code in a document has a lot more potential than I’ve used here. It allows data analysis and visualisation code to be embedded directly in a document, even with the option of processing data from tables that are also in the document. You can also use it to write entire programs in the Literate Programming style, formatting them in the way that makes most narrative sense but exporting (“tangling”) pure executable source code.

Once I’m sure it’s not a violation of the journal’s policy, I’ll push the source of the two documents up to github in case anyone wants to see how I’ve done it.


When is the right time to curate?

One of the things that I’ve been thinking about quite a bit since IDCC 2015 is this: exactly when should curation take place in a digital workflow?

There seem to be two main camps here, though of course it’s more of a spectrum than a simple dichotomy. These two views can be described as “sheer curation” and “just-in-time curation”.

Sheer curation

Sheer curation involves completely and seamlessly integrating curation of data into the workflow itself. That’s “sheer” as in tights: it’s such a thin layer that you can barely tell it’s there. The argument here is that the only way to properly capture the context of an artefact is to document it while that context still exists, and this makes a lot of sense. If you wait until later, the danger is that you won’t remember exactly what the experimental conditions for that observation were. Worse, if you wait long enough you’ll forget about the whole thing entirely until it comes time to make use of the data. Then you run the danger of having to repeat the experiment entirely because you can’t remember enough about it for it to be useful.

For this to work, though, you really need it to be as effortless as possible so that it doesn’t interrupt the research process. You also need researchers to have some curation skills themselves, and to minimise the effort required those skills need to be at the stage of unconscious competence. Finally you need a set of tools to support the process. These do exist, but in most cases they’re just not ready for widespread use yet.

Just-in-time curation

The other extreme is to just do the absolute minimum, and apply the majority of curation effort at the point where someone has requested access. This is the just-in-time approach: literally making the effort just in time for the data to be delivered. The major advantage is that there is no wasted effort curating things that don’t turn out to be useful. The alternative is “just-in-case”, where you curate before you know what will or won’t be useful.

The key downside is the high risk of vital context being lost. If a dataset is valuable but its value doesn’t become apparent for a long time, the researchers who created it may well have forgotten or misplaced key details of how it was collected or processed. You also need good, flexible tools that don’t complain if you leave big holes in your metadata for a long time.


When might each be useful?

I can see sheer-mode curation being most useful where standards and procedures are well established, especially if the value of data can easily be judged up front and disposal integrated into the process. In particular, this would work well if data capture methods can be automated and instrumented, so that metadata about the context is recorded accurately, consistently and without intervention by the researcher.

Right now this is the case in well-developed data-intensive fields, such as astrophysics and high-energy physics, and newer areas like bioinformatics are getting there too. In the future, it would be great if this could also apply to any data coming out of shared research facilities (such as chemical characterisation and microscopy). Electronic lab notebooks could play a big part for observational research, too.

Just-in-time-mode curation seems to make sense where the overheads of curating are high and only a small fraction of collected data is ever reused, so that the return on investment for curation is very low. It might sometimes be necessary also, if the resources needed for curation aren’t actually made available until someone wants to reuse the data.

Could they be combined?

As I mentioned at the start, these are just two ends of a spectrum of possibilities, and for most situations the ideal solution will lie somewhere in between. A pragmatic approach would be to capture as much context as is available transparently and up-front (sheer) and then defer any further curation until it is justified. This would allow the existence of the data to be advertised up-front through its metadata (as required by e.g. the EPSRC expectations), while minimising the amount of effort required. The clear downside is the potential for delays fulfilling the first request for the data, if such ever comes.


The sharp-eyed amongst you will have noticed I’ve recently ended a bit of a break in service on this blog. I’ve been doing that thing of half-writing posts and then never finishing them, so I’ve decided to clear out the pipeline and see what’s still worth publishing. This is a slightly-longer-than-usual piece I started writing about 9 months ago, still in my previous job. It still seems relevant, so here you go. You’re welcome.

What is an electronic lab notebook?

For the last little while at work, I’ve been investigating the possibility of implementing an electronic lab notebook (ELN) system, so here are a few of my thoughts on what an ELN actually is.

What is a lab notebook?

All science is built on data¹. Definitions of “data” vary, but they mostly boil down to this: data is evidence gathered through observation (direct or via instruments) of real-world phenomena.

A lab notebook is the traditional device for recording scientific data before it can be processed, analysed and turned into conclusions. It is typically a hardback notebook, A4 size (in the UK at least) with sequentially numbered pages recording the method and conditions of each experiment along with any measurements and observations taken during that experiment.

In industrial contexts, where patent law is king, all entries must be in indelible ink and various arcane procedures followed, such as the daily signing of pages by researcher and supervisor; even in academia some of these precautions can be sensible.

But the most important thing about a lab notebook is that it records absolutely everything about your research project for future reference. If any item of data is missing, the scientific record is incomplete and may be called into question. At best this is frustrating, as time-consuming and costly work must be repeated; at worst, it leaves you open to accusations of scientific misconduct of the type perpetrated by Diederik Stapel.

So what’s an electronic lab notebook?

An ELN, then, is some (any?) system that gives you the affordances I’ve described above while being digital in nature. In practice, this means a notebook that’s accessed via a computer (or, increasingly, a mobile device such as a tablet or smartphone), and stores information in digital form.

This might be a dedicated native app (this is the route taken by most industrial ELN options), giving you a lot of functionality right on your own desktop. Alternatively, it might be web-based, accessed using your choice of browser without any new software to be installed at all.

It might be standalone, existing entirely on a single computer/device with no need for network access. Alternatively it might operate in a client-server configuration, with a central database providing all the storage and processing power, and your own device just providing a window onto that.

These are all implementation details though. The important thing is that you can record your research using it. But why? What’s the point?

What’s wrong with paper?

Paper lab notebooks work perfectly well already, don’t they? We’ve been using them for hundreds of years.

While paper has a lot going for it (it’s cheap and requires no electricity or special training to use), it has its disadvantages too. It’s all too easy to lose it (maybe on a train) or accidentally destroy it (by spilling nasty organic solvents on it, or just getting caught out in the rain).

At the same time, it’s very difficult to safeguard in any meaningful sense, short of scanning or photocopying each individual page.

It’s hard to share: an increasingly important factor when collaborative, multidisciplinary research is on the rise. If I want to share my notes with you, I either have to post you the original (risky) or make a physical or digital copy and send that.

Of more immediate relevance to most researchers, it’s also difficult to interrogate unless you’re some kind of indexing ninja. When you can’t remember exactly which page recorded that experiment nine months ago, you’re in for a dull few hours searching through it page by page.

What can a good ELN give us?

The most obvious benefit is that all of your data is now digital, and can therefore be backed up to your heart’s content. Ideally, all data is stored on a safe, remote server and accessed remotely, but even if it’s stored directly on your laptop/tablet you now have the option of backing it up. Of course, the corollary is that you have to make sure you are backing up, otherwise you’ll look a bit silly when you drop your laptop in a puddle.

The next benefit of digital is that it can be indexed, categorised and searched in potentially dozens of different dimensions, making it much easier to find what you were looking for and collect together sets of related information.

A good electronic system can do some useful things to support the integrity of the scientific record. Many systems can track all the old versions of an entry. As well as giving you an important safety net in case of mistakes, this also demonstrates the evolution of your ideas. Coupled with some cryptographic magic and digital signatures, it’s even possible to freeze each version to use as evidence in a court of law that you had a given idea on a given day.
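To make that idea concrete, here’s a minimal sketch of version freezing (my own illustration, not how any particular ELN product implements it): chain each version’s hash to the previous one, so that editing an early version changes every later hash. A real system would add digital signatures and trusted timestamps on top.

```python
import hashlib

def freeze(versions):
    """Return a hash chain over successive versions of a notebook entry."""
    chain = []
    prev = ""
    for text in versions:
        # Each digest covers the entry text plus the previous digest,
        # so tampering with any version propagates down the chain
        digest = hashlib.sha256((prev + text).encode()).hexdigest()
        chain.append(digest)
        prev = digest
    return chain

original = freeze(["Mixed reagents at 09:30", "Observed colour change"])
tampered = freeze(["Mixed reagents at 10:30", "Observed colour change"])
print(original[1] != tampered[1])  # True: the edit changes every later hash
```

Because later hashes depend on earlier ones, a signed copy of the final digest is enough to vouch for the whole history.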

Finally, moving notes and data to a digital platform can set them free. Suddenly it becomes trivial to share them with collaborators, whether in the next room or the next continent. While some researchers advocate fully “open notebook science” — where all notes and data are made public as soon as possible after they’re recorded — not everyone is comfortable with that, so some control over exactly who the notebook is shared with is useful too.

What are the potential disadvantages?

The first thing to note is that a poorly implemented ELN will just serve to make life more awkward, adding extra work for no gain. This is to be avoided at all costs — great care must be taken to ensure that the system is appropriate to the people who want to use it.

It’s also true that going digital introduces some potential new risks. We’ve all seen the… My own opinion is that there will always be risks, whether data is stored in the cloud or on dead trees in a filing cabinet. As long as those risks are understood and appropriate measures taken to mitigate them, digital data can be much safer than the average paper notebook.

One big stumbling block that still affects a lot of the ELN options currently available is that they assume that the users will have network access. In the lab, this is unlikely to be a problem, but how about on the train? On a plane or in a foreign country? A lot of researchers will need to get work done in a lot of those places. This isn’t an easy problem to solve fully, though it’s often possible with some forethought to export and save individual entries to support remote working, or to make secure use of mobile data or public wireless networks.


So there you have it. In my humble opinion, a well-implemented ELN provides so many advantages over the paper alternative that it’s a no-brainer, but that’s certainly not true for everyone. Some activities, by their very nature, work better with paper, and either way most people are very comfortable with their current ways of working.

What’s your experience of note-taking, within research or elsewhere? What works for you? Do you prefer paper or bits, or a mixture of the two?

  1. Yes, even theoretical science, in my humble opinion. I know, I know. The comment section is open for debate.


One of the best ways of getting started developing open source software is to “scratch your own itch”: when you have a problem, get coding and solve it. So it is with this little bit of code.

Scroll Back is a very simple Chrome extension that replicates a little-known feature of Firefox: if you hold down the Shift key and use the mouse wheel, you can go forward and backward in your browser history. The idea came from issue 927 on the Chromium bug tracker, which is a request for this very feature.

You can install the extension from the Chrome Web Store if you use Chrome (or Chromium).

The code is so simple I can reproduce it here in full:

document.addEventListener("wheel", function(e) {
  if (e.shiftKey && e.deltaX != 0) {
    e.deltaX > 0 ? history.back() : history.forward();
    e.preventDefault();
  }
});

  • Line 1 adds an event listener which is executed every time the user uses the scroll wheel.
  • If the Shift key is held down and the user has scrolled (line 2), line 3 goes backward or forward in the history according to whether the user scrolled down or up respectively (e.deltaX is positive for down, negative for up).
  • Line 4 prevents any unwanted side-effects of scrolling.

The code is automatically executed every time a page is loaded, so has the effect of enabling this behaviour in all pages.

It’s open source (licensed under the MIT License), so you can check out the full source code on github.


I run Linux on my laptop, and I’ve had some problems with the wifi intermittently dropping out. I think I’ve found the solution to this, so I just wanted to record it here so I don’t forget, and in case anyone else finds it useful.

What I found was that any time the wifi was idle for too long it just stopped working and the connection needed to be manually restarted. Worse, after a while even that didn’t work and I had to reboot to fix it.

The problem seems to be with the power-saving features of the wifi card, which is identified by lspci as:

01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8723BE PCIe Wireless Network Adapter

What appears to happen is that the card goes into power-saving mode, goes to sleep and never wakes up again.

It makes use of the rtl8723be driver, and the solution appears to be to disable the power-saving features by passing some parameters to the relevant kernel module. You can do this by passing the parameters on the command line if manually loading the module with modprobe, but the easiest thing is to create a file in /etc/modprobe.d (which can be called anything) with the following contents:

# Prevents the WiFi card from automatically sleeping and halting connection
options rtl8723be fwlps=0 swlps=0

This seems to be working for me now. It’s possible that only one of the parameters fwlps and swlps is needed, but I haven’t had a chance to test this yet.

The following pages helped me figure this out:


OpenRefine has a pretty cool feature. You can export a project’s entire edit history in JSON format, and subsequently paste it back to exactly repeat what you did. This is great for transparency: if someone asks what you did in cleaning up your data, you can tell them exactly instead of giving them a vague, general description of what you think you remember you did. It also means that if you get a new, slightly-updated version of the raw data, you can clean it up in exactly the same way very quickly.

    "op": "core/column-rename",
    "description": "Rename column Column to Funder",
    "oldColumnName": "Column",
    "newColumnName": "Funder"
    "op": "core/row-removal",
    "description": "Remove rows",
    "engineConfig": {
      "mode": "row-based",
// etc…

Now this is great, but it could be better. I’ve been playing with Python for data wrangling, and it would be amazing if you could load up an OpenRefine history script in Python and execute it over an arbitrary dataset. You’d be able to reproduce the analysis without having to load up a whole Java stack and muck around with a web browser, and you could integrate it much more tightly with any pre- or post-processing.

Going a stage further, it would be even better to be able to convert the OpenRefine history JSON to an actual Python script. That would be a great learning tool for anyone wanting to go from OpenRefine to writing their own code.

import pandas as pd

data = pd.read_csv("funder_info.csv")
data = data.rename(columns = {"Column": "Funder"})
data = data.drop(data.index[6:9])

This seems like it could be fairly straightforward to implement: it just requires a bit of digging to understand the semantics of the JSON that OpenRefine produces, and then the implementation of each operation in Python. The latter part shouldn’t be much of a stretch with so many existing tools like pandas.
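As a rough sketch of the shape such an interpreter might take (this is my own illustration, not part of OpenRefine): map each operation name onto its pandas equivalent and replay the history over a DataFrame. The operation names are real OpenRefine ones; the function itself is hypothetical.

```python
import json
import pandas as pd

def apply_history(df, history):
    """Replay a list of OpenRefine history operations over a DataFrame."""
    for step in history:
        op = step["op"]
        if op == "core/column-rename":
            df = df.rename(columns={step["oldColumnName"]: step["newColumnName"]})
        elif op == "core/column-removal":
            df = df.drop(columns=[step["columnName"]])
        else:
            # Each remaining operation would need its own translation
            raise NotImplementedError("operation %r not implemented" % op)
    return df

history = json.loads("""[
  {"op": "core/column-rename",
   "oldColumnName": "Column", "newColumnName": "Funder"}
]""")
df = pd.DataFrame({"Column": ["AHRC", "BBSRC"]})
print(apply_history(df, history).columns.tolist())  # ['Funder']
```

Going from here to generating a standalone Python script would mostly be a matter of emitting the same pandas calls as text rather than executing them.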

It’s just an idea right now, but I’d be willing to have a crack at implementing something if there was any interest — let me know in the comments or on Twitter if you think it’s worth doing, or if you fancy contributing.


If you’re viewing this on the web, you might notice there have been a few changes round here: I’ve updated my theme to be more responsive and easier to read on different screen sizes. It’s been interesting learning how to use the Bootstrap CSS framework, originally developed by Twitter to make putting together responsive sites with a fixed or fluid grid layout straightforward.

I don’t have the means to test it on every possible combination of browsers and devices, so please let me know if you notice anything weird-looking!


This post is a little bit of an experiment, in two areas:

  1. Playing with some of my own data, using a language that I'm quite familiar with (Python) and a mixture of tools which are old and new to me; and
  2. Publishing that investigation directly on my blog, as a static IPython Notebook export.

The data I'll be using is information about my exercise that I've been tracking over the last few years using RunKeeper. RunKeeper allows you to export and download your full exercise history, including GPX files showing where you went on each activity, and a couple of summary files in CSV format. It's this latter that I'm going to take a look at; just an initial survey to see if there's anything interesting that jumps out.

I'm not expecting any massive insights just yet, but I hope you find this a useful introduction to some very valuable data wrangling and analysis tools.

Looking at the data

First up, we need to do some basic setup:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import display

This imports some Python packages that we'll need:

  • matplotlib: makes pretty plots (the %matplotlib inline bit is some IPython Notebook magic to make plots appear inline)
  • numpy: allows some maths and stats functions
  • pandas: loads and manipulates tabular data

Next, let's check what files we have to work with:

In [2]:
%ls data/*.csv
data/cardioActivities.csv  data/measurements.csv

I'm interested in cardioActivities.csv, which contains a summary of each activity in my RunKeeper history. Loading it up gives us this:

In [3]:
cardio = pd.read_csv('data/cardioActivities.csv',
                     parse_dates=[0, 4, 5],
                     index_col=0)
cardio.head()
Date Type Route Name Distance (mi) Duration Average Pace Average Speed (mph) Calories Burned Climb (ft) Average Heart Rate (bpm) Notes GPX File
2015-03-27 12:26:34 Running NaN 3.31 29:57 9:03 6.63 297 48.45 157 NaN 2015-03-27-1226.gpx
2015-03-21 09:44:25 Running NaN 11.31 2:12:01 11:40 5.14 986 283.41 146 NaN 2015-03-21-0944.gpx
2015-03-19 07:21:36 Running NaN 5.17 52:45 10:12 5.88 423 75.23 150 NaN 2015-03-19-0721.gpx
2015-03-17 06:51:42 Running NaN 1.81 17:10 9:29 6.32 144 23.27 137 NaN 2015-03-17-0651.gpx
2015-03-17 06:21:51 Running NaN 0.94 7:25 7:52 7.64 17 3.65 136 NaN 2015-03-17-0621.gpx

Although my last few activities are runs, there are actually several different possible values for the "Type" column. We can take a look like this:

In [4]:
cardio['Type'] = cardio['Type'].astype('category')
cardio['Type'].cat.categories
Index(['Cycling', 'Hiking', 'Running', 'Walking'], dtype='object')

From this you can see there are four types: Cycling, Hiking, Running and Walking. Right now, I'm only interested in my runs, so let's select those and do an initial plot.

In [5]:
runs = cardio[cardio['Type'] == 'Running']
runs['Distance (mi)'].plot()

We can notice two things straight away:

  • There's a gap at the start of 2014: this is probably where RunKeeper hasn't got information about the distance because my GPS watch didn't work right or something, and I don't want to include these in my analysis.
  • There's a big spike from where I did the 12 Labours of Hercules ultramarathon, which isn't really an ordinary run so I don't want to include that either.

Let's do some filtering (excluding those, and some runs with "unreasonable" speeds that might be mislabelled runs or cycles) and try again.

In [6]:
runs = runs[(runs['Distance (mi)'] <= 15)
            & runs['Average Speed (mph)'].between(3.5, 10)]
runs['Distance (mi)'].plot()

That looks much better. Now we can clearly see the break I took between late 2012 and early 2014 (problems with my iliotibial band), followed by a gradual return to training and an increase in distance leading up to my recent half-marathon.

There are other types of plot we can look at too. How about a histogram of my run distances?

In [7]:
runs['Distance (mi)'].hist(bins=30)

You can clearly see here the divide between my usual weekday runs (usually around 3–5 miles) and my longer weekend runs. I've only been running >7 miles very recently, but I suspect as time goes on this graph will start to show two distinct peaks. There also seem to be peaks around whole numbers of miles: it looks like I have a tendency to finish my runs shortly after the distance readout on my watch ticks over! The smaller peak around 1 mile is where I run to the gym as a warmup before a strength workout.
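That whole-number tendency is easy to test: take the fractional part of each distance and see what proportion of runs end just past a round figure. A sketch, with made-up distances standing in for the real runs data:

```python
import pandas as pd

# Hypothetical values; in the notebook this would be runs['Distance (mi)']
distances = pd.Series([3.02, 4.05, 5.01, 3.98, 6.10, 5.03, 1.02, 3.46])
frac = distances % 1
# Proportion of runs finishing within 0.15 mi past a whole number
print((frac < 0.15).mean())
```

A value much higher than 0.15 (the proportion you’d expect if stopping points were uniform) would confirm the effect.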

How fast do I run? Let's take a look.

In [8]:
runs['Average Speed (mph)'].hist(bins=30)

Looks like another bimodal distribution. There's not really enough data here to be sure, but this could well be a distinction between longer, slower runs and shorter, faster ones. Let's try plotting distance against speed to get a better idea.

In [9]:
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')

Hmm, no clear trend here. Maybe that's because when I first started running I was nowhere near so fit as I am now, so those early runs were both short and slow! What if we restrict it just to this year?

In [10]:
runs = runs.loc[runs.index > '2015-01-01']
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')

That's better: now it's clear to see that, in general, the further I go, the slower I run!

So, as expected, no major insights. Now that I'm over my knee injury and back to training regularly, I'm hoping that I'll be able to collect more data and learn a bit more about how I exercise and maybe even make some improvements.

How about you? What data do you have that might be worth exploring? If you haven't anything of your own, try browsing through one of these:


In a classic example of the human tendency to weave everything we see into our own narrative, I recently found myself looking at the 18th century research data of botanist John Sibthorp, embodied in his Flora Graeca.

Hellebores officinalis

It all came about through a visit to Oxford, organised as part of the CPD programme run by the M25 group of libraries. We first had a tour of the famous Bodleian Library’s reading rooms — quite a revelation for a very STEM-focussed non-librarian like me!

After finishing at the Bodleian, we dutifully trooped up Parks Road to the Department of Botany and its pride and joy the Sherardian Library and Herbaria. The Sherardian includes, alongside many classic botanical reference books, an impressive collection of original botanical sketches and specimens dating back centuries (and still used by researchers today).

John Sibthorp was an English botanist, and was Sherardian Professor of Botany at the University of Oxford, a chair he inherited from his father Dr Humphry Sibthorp in 1784. In the late 1780s he took a botanical tour of Greece and Cyprus to collect material for a flora of the region, eventually published as the Flora Graeca Sibthorpiana.

The lovely staff at the Sherardian had laid out several original volumes of Sibthorp’s Flora Graeca to inspect, alongside the various source materials:

  • Sibthorp’s diary of his trip to the Mediterranean
  • Original pencil sketches of the flora, painstakingly labelled by Sibthorp’s artist, Ferdinand Bauer, to indicate the precise shade of each part (he used only graphite pencil in the field)
  • The actual specimens collected by Sibthorp, carefully pressed and preserved with mercury
  • The watercolours developed by Bauer on their return to Oxford, based only on the sketches, the fast-fading specimens and his memory (he produced around 900 of these at a rate of roughly one every 1 1/4 days!)

What’s interesting about all this is that Sibthorp was, in reality, a lousy field biologist. His diary, while beginning fairly well, became less and less legible as the trip went on. Most of the specimens, along with Bauer’s sketches, were unlabelled. In fact, the vast majority of the material collected by Sibthorp remained only in his head.

Before publishing, Sibthorp felt he had to return to the Mediterranean for a second time to collect more material, which he duly did. He never returned to Oxford: instead he died of consumption in Bath in 1796, and his work was published posthumously by the University of Oxford only because of some clever manoeuvring by his lawyer and a close friend.

Of course, all of that knowledge, much of his “research data”, died with him. The Flora Graeca Sibthorpiana was eventually published, but only after a lot of work to decode his diary and figure out which specimens, sketches and watercolours went together.

There are a number of errors in the final version which would easily have been caught had Sibthorp been alive to edit it. A spider’s web on one of the specimens, lovingly reproduced by Bauer in his watercolour, was misinterpreted by one of the artists producing the plates for printing, and was rendered as fine, downy hairs on the leaf; of course, the actual plant has no such hairs. Reading between the lines, I suspect that the final published work is much poorer for the loss of the information locked up in Sibthorp’s brain.

Would he have been allowed to get away with this in the modern world? Today his trip would have been funded not by the university at the insistence of his professor father, but probably by the BBSRC. That funding would come with a number of conditions, including an expectation that the work be documented, preserved and made available to other researchers to study. Now, though, we’ll never know what we lost when John Sibthorp died.

The Flora Graeca and its associated material still provide valuable information to this day. New analytical techniques allow us to obtain new data from the specimens, many of which are type specimens for their species. All of the associated artwork has been digitised, and low-resolution versions of the watercolours and colour plates are available to use under a Creative Commons license. Although the physical books are no longer routinely used for reference, the high-resolution scans are consulted quite regularly by modern researchers, and work is currently in progress to link together all of the digitised material so it can be searched by species, family, geographical area or a number of other aspects.

It was fascinating to see such rare materials first-hand, and to have them brought to life by such a knowledgeable speaker, and I feel privileged to have had the chance. For anyone interested, you can browse a digital version of the Flora Graeca online.


So, it’s been a while since I’ve properly updated this blog, and since I seem to be having another try, I thought it would be useful to give a brief overview of what I’m doing these days, so that some of the other stuff I have in the pipeline makes a bit more sense.

My current work focus is research data management: helping university researchers to look after their data in ways that let them and the community get the most out of it. Data is the bedrock of most (all?) research: the evidence on which all the arguments and conclusions and new ideas are based. In the past, this data has been managed well (generally speaking) by and for the researchers collecting and using it, and this situation could have continued indefinitely.

Technology, however, has caused two fundamental changes to this position. First, we’re able to measure more and more about more and more, creating what has been termed a “data deluge”. It’s now possible for one researcher to generate, in the normal course of their work, far more data than they could possibly analyse themselves in a lifetime. For example, the development of polymerase chain reaction (PCR) techniques has enabled the fast, cheap sequencing of entire genomes: for some conditions, patients’ genomes are now routinely sequenced for future study. A typical human genome sequence occupies 8TB (about 1700 DVDs), and after processing and compression, this shrinks to around 100GB (21 DVDs). This covers approximately 23,000 genes, of which any one researcher may only be interested in a handful.
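To put those figures in perspective, here’s a quick back-of-the-envelope check in Ruby (my scripting language of choice). It simply assumes a single-layer DVD holds 4.7 GB, which is where the DVD counts above come from:

```ruby
# How many DVDs would a genome sequence fill?
# Assumption: a single-layer DVD holds 4.7 GB.
DVD_GB = 4.7

raw_gb        = 8 * 1000.0  # 8 TB of raw sequence data
compressed_gb = 100.0       # after processing and compression

puts "Raw:        #{(raw_gb / DVD_GB).round} DVDs"        # ~1700
puts "Compressed: #{(compressed_gb / DVD_GB).round} DVDs" # ~21
```

Nothing profound, but it shows how quickly a single patient’s data outgrows everyday storage media.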

Second, the combination of the internet and cheap availability of computing power means that it has never been easier to share, combine and process this data on a huge scale. To continue our example, it’s possible to study genetic variations across hundreds or thousands of individuals to get new insights into how the body works. The 100,000 Genomes Project (“100KGP”) is an ambitious endeavour to establish a database of such genomes and, crucially, develop the infrastructure to allow researchers to access and analyse it at scale.

In order to make this work, there are plenty of barriers to overcome. The practices that kept data in line long enough to publish the next paper are no longer good enough: the organisation and documentation must be made explicit and consistent so that others can make sense of it. It also needs to be protected better from loss and corruption. Obviously, this takes more work than just dumping it on a laptop, so most people want some reassurance that this extra work will pay off.

Sharing has risks too. Identifiable patient data cannot be shared without the patients’ consent; indeed, doing so would be a criminal offence in Europe. Similar rules apply to sensitive commercial information. Even if there aren’t legal restrictions, most researchers have a reasonable expectation (albeit developed before the “data deluge”) that they should be able to reap the reputational rewards of their own hard work by publishing papers based on it.

There is therefore a great deal of resistance to these changes. But there can be benefits too. For society, there is the possibility of advancing knowledge in directions that would never have been possible even ten years ago. But there are practical benefits to individuals too: every PhD supervisor and most PhD students know the frustration of trying to continue a student’s poorly-documented work after they’ve graduated.

For funders the need for change is particularly acute. Budgets are being squeezed, and with the best will in the world there is less money to go around, so there is pressure to ensure the best possible return on investment. This means that it’s no longer acceptable, for example, for several labs around the country to run identical experiments just to do different things with the results. It’s more important than ever to make more data available to, and reusable by, more people.

So the funders (in the UK, particularly the government-funded research councils) are introducing requirements on the researchers they fund to move along this path more quickly than they might feel comfortable with. It therefore seems reasonable to offer these hard-working people some support, and that’s where I come in.

I’m currently spending my time providing training and advice, bringing people together to solve problems and trying to convince a lot of researchers to fix what, in many cases, they didn’t think was broken! They are subject to conflicting expectations and need help navigating this maze so that they can do what they do best: discover amazing new stuff and change the world.

For the last 6ish months I’ve been doing this at Imperial College (my alma mater, no less) and loving it. It’s a fascinating area for me, and I’m really excited to see where it will lead me next!

If you have time, here’s a (slightly tongue-in-cheek) take on the problem from the perspective of a researcher trying to reuse someone else’s data: