OpenRefine has a pretty cool feature. You can export a project’s entire edit history in JSON format, and subsequently paste it back to exactly repeat what you did. This is great for transparency: if someone asks what you did in cleaning up your data, you can tell them exactly instead of giving them a vague, general description of what you think you remember you did. It also means that if you get a new, slightly-updated version of the raw data, you can clean it up in exactly the same way very quickly.

    "op": "core/column-rename",
    "description": "Rename column Column to Funder",
    "oldColumnName": "Column",
    "newColumnName": "Funder"
    "op": "core/row-removal",
    "description": "Remove rows",
    "engineConfig": {
      "mode": "row-based",
// etc…

Now this is great, but it could be better. I’ve been playing with Python for data wrangling, and it would be amazing if you could load up an OpenRefine history script in Python and execute it over an arbitrary dataset. You’d be able to reproduce the analysis without having to load up a whole Java stack and muck around with a web browser, and you could integrate it much more tightly with any pre- or post-processing.
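To make the idea concrete, here's a minimal sketch of what such an interpreter might look like. Everything here is hypothetical — the function name is mine, and only the column-rename operation is handled; a real tool would need a handler for every operation type OpenRefine can emit.

```python
import json

import pandas as pd

def apply_history(df, history_json):
    """Apply an exported OpenRefine operation history to a DataFrame."""
    for op in json.loads(history_json):
        if op["op"] == "core/column-rename":
            # Mirror OpenRefine's column rename using pandas
            df = df.rename(columns={op["oldColumnName"]: op["newColumnName"]})
        else:
            raise NotImplementedError("operation not implemented: " + op["op"])
    return df
```

Each additional operation type would just become another branch (or, more tidily, an entry in a dispatch table mapping operation names to functions).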

Going a stage further, it would be even better to be able to convert the OpenRefine history JSON to an actual Python script. That would be a great learning tool for anyone wanting to go from OpenRefine to writing their own code.

import pandas as pd

data = pd.read_csv("funder_info.csv")
data = data.rename(columns = {"Column": "Funder"})
data = data.drop(data.index[6:9])

This seems like it could be fairly straightforward to implement: it just requires a bit of digging to understand the semantics of the JSON that OpenRefine produces, and then an implementation of each operation in Python. The latter part shouldn’t be much of a stretch with so many existing tools like pandas.

It’s just an idea right now, but I’d be willing to have a crack at implementing something if there was any interest — let me know in the comments or on Twitter if you think it’s worth doing, or if you fancy contributing.


If you’re viewing this on the web, you might notice there have been a few changes round here: I’ve updated my theme to be more responsive and easier to read on different screen sizes. It’s been interesting learning to use the Bootstrap CSS framework, originally developed at Twitter, which makes it straightforward to put together responsive sites with a fixed or fluid grid layout.

I don’t have the means to test it on every possible combination of browsers and devices, so please let me know if you notice anything weird-looking!


This post is a little bit of an experiment, in two areas:

  1. Playing with some of my own data, using a language that I'm quite familiar with (Python) and a mixture of tools which are old and new to me; and
  2. Publishing that investigation directly on my blog, as a static IPython Notebook export.

The data I'll be using is information about my exercise that I've been tracking over the last few years using RunKeeper. RunKeeper allows you to export and download your full exercise history, including GPX files showing where you went on each activity, and a couple of summary files in CSV format. It's the latter that I'm going to take a look at: just an initial survey to see if there's anything interesting that jumps out.

I'm not expecting any massive insights just yet, but I hope you find this a useful introduction to some very valuable data wrangling and analysis tools.

Looking at the data

First up, we need to do some basic setup:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import display

This imports some Python packages that we'll need:

  • matplotlib: makes pretty plots (the %matplotlib inline bit is some IPython Notebook magic to make plots appear inline)
  • numpy: provides maths and stats functions
  • pandas: loads and manipulates tabular data

Next, let's check what files we have to work with:

In [2]:
%ls data/*.csv
data/cardioActivities.csv  data/measurements.csv

I'm interested in cardioActivities.csv, which contains a summary of each activity in my RunKeeper history. Loading it up gives us this:

In [3]:
cardio = pd.read_csv('data/cardioActivities.csv',
                     parse_dates=[0, 4, 5],
                     index_col=0)
cardio.head()
Date Type Route Name Distance (mi) Duration Average Pace Average Speed (mph) Calories Burned Climb (ft) Average Heart Rate (bpm) Notes GPX File
2015-03-27 12:26:34 Running NaN 3.31 29:57 9:03 6.63 297 48.45 157 NaN 2015-03-27-1226.gpx
2015-03-21 09:44:25 Running NaN 11.31 2:12:01 11:40 5.14 986 283.41 146 NaN 2015-03-21-0944.gpx
2015-03-19 07:21:36 Running NaN 5.17 52:45 10:12 5.88 423 75.23 150 NaN 2015-03-19-0721.gpx
2015-03-17 06:51:42 Running NaN 1.81 17:10 9:29 6.32 144 23.27 137 NaN 2015-03-17-0651.gpx
2015-03-17 06:21:51 Running NaN 0.94 7:25 7:52 7.64 17 3.65 136 NaN 2015-03-17-0621.gpx
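One wrinkle worth noting as an aside (this helper isn't part of the original notebook): columns like Duration and Average Pace come through as strings such as "29:57" or "2:12:01", so before doing any arithmetic on them you'd need something like:

```python
def to_seconds(s):
    """Convert a "mm:ss" or "h:mm:ss" string into a number of seconds."""
    total = 0
    for part in s.split(":"):
        # Each colon-separated field shifts the running total by a
        # factor of 60 (hours -> minutes -> seconds)
        total = total * 60 + int(part)
    return total
```

For example, to_seconds("29:57") gives 1797 seconds.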

Although my last few activities are runs, there are actually several different possible values for the "Type" column. We can take a look like this:

In [4]:
cardio['Type'] = cardio['Type'].astype('category')
cardio['Type'].cat.categories
Index(['Cycling', 'Hiking', 'Running', 'Walking'], dtype='object')

From this you can see there are four types: Cycling, Hiking, Running and Walking. Right now, I'm only interested in my runs, so let's select those and do an initial plot.

In [5]:
runs = cardio[cardio['Type'] == 'Running']
runs['Distance (mi)'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d914ac8>

We can notice two things straight away:

  • There's a gap at the start of 2014: this is probably where RunKeeper hasn't got information about the distance because my GPS watch didn't work right or something, and I don't want to include these in my analysis.
  • There's a big spike from where I did the 12 Labours of Hercules ultramarathon, which isn't really an ordinary run so I don't want to include that either.

Let's do some filtering (excluding those, and some runs with "unreasonable" speeds that might be mislabelled runs or cycles) and try again.

In [6]:
runs = runs[(runs['Distance (mi)'] <= 15)
            & runs['Average Speed (mph)'].between(3.5, 10)]
runs['Distance (mi)'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d5f4fd0>

That looks much better. Now we can clearly see the break I took between late 2012 and early 2014 (problems with my iliotibial band), followed by a gradual return to training and an increase in distance leading up to my recent half-marathon.

There are other types of plot we can look at too. How about a histogram of my run distances?

In [7]:
runs['Distance (mi)'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d56fc50>

You can clearly see here the divide between my usual weekday runs (usually around 3–5 miles) and my longer weekend runs. I've only been running >7 miles very recently, but I suspect as time goes on this graph will start to show two distinct peaks. There also seem to be peaks around whole numbers of miles: it looks like I have a tendency to finish my runs shortly after the distance readout on my watch ticks over! The smaller peak around 1 mile is where I run to the gym as a warmup before a strength workout.

How fast do I run? Let's take a look.

In [8]:
runs['Average Speed (mph)'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d4dd518>

Looks like another bimodal distribution. There's not really enough data here to be sure, but this could well be a distinction between longer, slower runs and shorter, faster ones. Let's try plotting distance against speed to get a better idea.

In [9]:
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d43ac18>

Hmm, no clear trend here. Maybe that's because when I first started running I was nowhere near so fit as I am now, so those early runs were both short and slow! What if we restrict it just to this year?

In [10]:
runs = runs.loc[runs.index > '2015-01-01']
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d35e128>

That's better: now it's clear to see that, in general, the further I go, the slower I run!

So, as expected, no major insights. Now that I'm over my knee injury and back to training regularly, I'm hoping that I'll be able to collect more data and learn a bit more about how I exercise and maybe even make some improvements.

How about you? What data do you have that might be worth exploring? If you don't have anything of your own, try browsing through one of these:


In a classic example of the human tendency to weave everything we see into our own narrative, I recently found myself looking at the 18th century research data of botanist John Sibthorp, embodied in his Flora Graeca.

Helleborus officinalis

It all came about through a visit to Oxford, organised as part of the CPD programme run by the M25 group of libraries. We first had a tour of the famous Bodleian Library’s reading rooms — quite a revelation for a very STEM-focussed non-librarian like me!

After finishing at the Bodleian, we dutifully trooped up Parks Road to the Department of Botany and its pride and joy the Sherardian Library and Herbaria. The Sherardian includes, alongside many classic botanical reference books, an impressive collection of original botanical sketches and specimens dating back centuries (and still used by researchers today).

John Sibthorp was an English botanist and Sherardian Professor of Botany at the University of Oxford, a chair he inherited from his father Dr Humphry Sibthorp in 1784. In the late 1780s he took a botanical tour of Greece and Cyprus to collect material for a flora of the region, eventually published as the Flora Graeca Sibthorpiana.

The lovely staff at the Sherardian had laid out several original volumes of Sibthorp’s Flora Graeca to inspect, alongside the various source materials:

  • Sibthorp’s diary of his trip to the Mediterranean
  • Original pencil sketches of the flora, painstakingly labelled by Sibthorp’s artist, Ferdinand Bauer, to indicate the precise shade of each part (he used only graphite pencil in the field)
  • The actual specimens collected by Sibthorp, carefully pressed and preserved with mercury
  • The watercolours developed by Bauer on their return to Oxford, based only on the sketches, the fast-fading specimens and his memory (he produced around 900 of these at a rate of roughly one every 1 1/4 days!)

What’s interesting about all this is that Sibthorp was, in reality, a lousy field biologist. His diary, while beginning fairly well, became less and less legible as the trip went on. Most of the specimens, along with Bauer’s sketches, were unlabelled. In fact, the vast majority of the material collected by Sibthorp remained only in his head.

Before publishing, Sibthorp felt he had to return to the Mediterranean for a second time to collect more material, which he duly did. He never returned to Oxford: instead he died of consumption in Bath in 1796, and his work was published posthumously by the University of Oxford only because of some clever manoeuvring by his lawyer and a close friend.

Of course, all of that knowledge, much of his “research data” died with him. The Flora Graeca Sibthorpiana was eventually published, but only after a lot of work to decode his diary and figure out which specimens, sketches and watercolours went together.

There are a number of errors in the final version which would easily have been caught had Sibthorp been alive to edit it. A spider’s web on one of the specimens, lovingly reproduced by Bauer in his watercolour, was misinterpreted by one of the artists producing the plates for printing, and was rendered as fine, downy hairs on the leaf; of course, the actual plant has no such hairs. Reading between the lines, I suspect that the final published work is much poorer for the loss of the information locked up in Sibthorp’s brain.

Would he have been allowed to get away with this in the modern world? Today his trip would have been funded not by the university at the insistence of his professor father, but probably by the BBSRC. That funding would come with a number of conditions, including an expectation that the work be documented, preserved and made available to other researchers to study. Now, though, we’ll never know what we lost when John Sibthorp died.

The Flora Graeca and its associated material still provide valuable information to this day. New analytical techniques allow us to obtain new data from the specimens, many of which are type specimens for their species. All of the associated artwork has been digitised, and low-resolution versions of the watercolours and colour plates are available to use under a Creative Commons license. Although the physical books are no longer routinely used for reference, the high-resolution scans are consulted quite regularly by modern researchers, and work is currently in progress to link together all of the digitised material so it can be searched by species, family, geographical area or a number of other aspects.

It was fascinating to see such rare materials first-hand, and to have them brought to life by such a knowledgeable speaker, and I feel privileged to have had the chance. For anyone interested, you can browse a digital version of the Flora Graeca online.


So, it’s been a while since I’ve properly updated this blog, and since I seem to be having another try, I thought it would be useful to give a brief overview of what I’m doing these days, so that some of the other stuff I have in the pipeline makes a bit more sense.

My current work focus is research data management: helping university researchers to look after their data in ways that let them and the community get the most out of it. Data is the bedrock of most (all?) research: the evidence on which all the arguments and conclusions and new ideas are based. In the past, this data has been managed well (generally speaking) by and for the researchers collecting and using it, and this situation could have continued indefinitely.

Technology, however, has caused two fundamental changes to this position. First, we’re able to measure more and more about more and more, creating what has been termed a “data deluge”. It’s now possible for one researcher to generate, in the normal course of their work, far more data than they could possibly analyse themselves in a lifetime. For example, the development of polymerase chain reaction (PCR) techniques has enabled the fast, cheap sequencing of entire genomes: for some conditions, patients’ genomes are now routinely sequenced for future study. A typical human genome sequence occupies 8TB (about 1700 DVDs), and after processing and compression, this shrinks to around 100GB (21 DVDs). This covers approximately 23,000 genes, of which any one researcher may only be interested in a handful.
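As a quick sanity check of those DVD comparisons (taking a single-layer DVD as 4.7GB):

```python
# Rough arithmetic behind the DVD comparison: a single-layer DVD holds
# about 4.7 GB, the raw sequence is 8 TB (8000 GB), and the processed,
# compressed version is around 100 GB.
DVD_GB = 4.7
raw_dvds = 8000 / DVD_GB        # comes out around 1700
compressed_dvds = 100 / DVD_GB  # comes out around 21
```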

Second, the combination of the internet and cheap availability of computing power means that it has never been easier to share, combine and process this data on a huge scale. To continue our example, it’s possible to study genetic variations across hundreds or thousands of individuals to get new insights into how the body works. The 100,000 Genomes Project (“100KGP”) is an ambitious endeavour to establish a database of such genomes and, crucially, develop the infrastructure to allow researchers to access and analyse it at scale.

In order to make this work, there are plenty of barriers to overcome. The practices that kept data in line long enough to publish the next paper are no longer good enough: the organisation and documentation must be made explicit and consistent so that others can make sense of it. It also needs to be protected better from loss and corruption. Obviously, this takes more work than just dumping it on a laptop, so most people want some reassurance that this extra work will pay off.

Sharing has risks too. Identifiable patient data cannot be shared without the patients’ consent; indeed, doing so would be a criminal offence in Europe. Similar rules apply to sensitive commercial information. Even if there aren’t legal restrictions, most researchers have a reasonable expectation (albeit developed before the “data deluge”) that they be able to reap the reputational rewards of their own hard work by publishing papers based on it.

There is therefore a great deal of resistance to these changes. But there can be benefits too. For society, there is the possibility of advancing knowledge in directions that would never have been possible even ten years ago. But there are practical benefits to the individuals too: every PhD supervisor and most PhD students know the frustration of trying to continue a student’s poorly-documented work after they’ve graduated.

For funders the need for change is particularly acute. Budgets are being squeezed, and with the best will in the world there is less money to go around, so there is pressure to ensure the best possible return on investment. This means that it’s no longer acceptable, for example, for several labs in the country to be running identical experiments to do different things with the results. It’s more important than ever to make more data available to and reusable by more people.

So the funders (in the UK, particularly the government-funded research councils), are introducing requirements on the researchers they fund to move along this path quicker than they might feel comfortable with. It therefore seems reasonable to offer these hard-working people some support, and that’s where I come in.

I’m currently spending my time providing training and advice, bringing people together to solve problems and trying to convince a lot of researchers to fix what, in many cases, they didn’t think was broken! They are subject to conflicting expectations and need help navigating this maze so that they can do what they do best: discover amazing new stuff and change the world.

For the last 6ish months I’ve been doing this at Imperial College (my alma mater, no less) and loving it. It’s a fascinating area for me, and I’m really excited to see where it will lead me next!

If you have time, here’s a (slightly tongue-in-cheek) take on the problem from the perspective of a researcher trying to reuse someone else’s data:


I recently read an article, by Cox, Pinfield and Smith1, which attempted to analyse research data management (RDM) as a “wicked problem”. The concept of a wicked problem has been around for a little while in the context of social policy, but this is the first I’ve heard of it.

What, then, is a “wicked” problem?

Cox et al invoke a full 16 properties of wicked problems (drawing on the work of earlier authors), but put simply, a problem is wicked if it is so difficult to define that it can only ever admit of an imperfect, compromise solution. Many different, often contradictory, perspectives on such a problem exist, each suggesting its own solution, and any intervention “changes the problem in an irreversible way”, so that gradual learning by trial and error is impossible.

Many truly wicked problems affecting society as a whole, such as child poverty or gender inequality, are unlikely to be satisfactorily solved in our lifetimes (though we should, of course, keep trying). For RDM though, I feel that there is some hope. It does display many aspects of wickedness, but it seems unlikely that it will stay this way forever. Technology is evolving, cultures and attitudes are shifting and the disparate perspectives on research data are gradually coming into alignment.

In the meantime, “wickedness” provides a useful lens through which to view the challenge of RDM. Knowing that it shows many wicked features, we can be more intelligent about the way we go about solving it. For example, Cox et al draw on the literature on wicked problems to suggest some leadership principles that are appropriate, such as:

“Collective intelligence not individual genius – turning to individuals to solve problems through their individual creativity is less successful than people working together to develop ideas.”

Far from getting disheartened about the difficulty of the task ahead, we can accept that it is necessarily complex and act accordingly. I look forward to following up some of the references on strategies for dealing with wickedness — hopefully I’ll be able to share more of what I learn as I go along.

  1. Cox, Andrew M., Stephen Pinfield, and Jennifer Smith. 2014. “Moving a Brick Building: UK Libraries Coping with Research Data Management as a ‘wicked’ Problem.” Journal of Librarianship and Information Science, May, 0961000614533717. doi:10.1177/0961000614533717.


You might have noticed that my latest spurt of blogging has been interrupted again. This time, it’s due to a recurrence of an old enemy of mine, repetitive strain injury (RSI). It’s coming back under control now and I thought I’d share what’s worked for me, but before we go any further, heed my warning:

If you have any sort of pain or numbness at all when using a computer, even if it’s momentary, take the time to evaluate your setup and see if there is anything you can change.

I can’t over-stress how important this is — RSI is a serious medical problem if allowed to get out of hand (potentially as career-ending as a sports injury can be for a professional athlete), and it can be prevented entirely by ensuring your workspace is appropriate to the way you do your work. If in doubt, seek expert and/or medical advice: many larger employers have occupational health advisors, and your GP will be able to advise or refer you to an appropriate specialist.

Like a lot of intensive computer users, I’ve had numerous bouts of computer-related pain over the years, and at one point even had to switch to voice recognition software for several months. If you’ve ever tried to use voice recognition, especially to do any programming, you’ll understand how frustrating that can be.

Previously, I’ve had pain associated with using the mouse, so these days I tend to drive my computer primarily by keyboard. I use apps (like Emacs) that are very keyboard-friendly, and when on Linux I even use a keyboard-friendly window manager to minimise my need to use the mouse at all. I also have a regular mouse at work and a trackball at home, so I’m varying the set of muscles I use when mousing.

As a result, I use a lot of key combinations involving the Control, Alt and Windows/Command keys, and recently I’ve started having pain in my thumbs (particularly the left) from curling them under to hit the Alt and Win keys. Note: Emacs users more commonly suffer from the problem known as “Emacs pinky”, but I headed that one off at the pass early on by remapping my Caps Lock key as another Control key.

I’m very lucky: my workplace has a dedicated assistive technologies advisor, who has a collection of alternative keyboards, mice and other input devices, so I was able to have a chat with him, get some expert advice and try out several possible options before committing to buying anything.

Here’s what I’ve ended up with:

  • Kinesis Freestyle II USB keyboard: This keyboard is split down the middle, and allows the halves to be positioned and tilted independently to reduce the unnatural bend in my wrist, as well as putting less strain on my shoulders. Through some experimentation with this and similar keyboards, I’ve found that having the two halves parallel but about 15–20cm apart is more comfortable for me than having them close together and angled (like the keyboards made by Goldtouch).
  • Programmable USB foot pedals: These are such simple devices they can be picked up for cheap almost anywhere on the net, or even hacked together using the controller from an old USB keyboard. I have three pedals set up, with Control under my left foot, Alt under my right and Windows/Command inside Alt so I can reach it with my right foot easily. Initially, I found that I was getting some back pain after starting to use the pedals, but I’ve since realised that this was because I had them positioned awkwardly — I started over by looking at where my feet fell naturally while working and then moving the pedals into position accordingly.
  • Posturite Penguin mouse: This also helps to reduce the unnatural bend in my wrist, as well as dealing with my tendency to “anchor” or rest my forearm on the desk while moving the mouse only with my fingers and wrist. It comes in three sizes to fit your hand, and has a switch to reverse the scroll wheel direction so you can move it to your other hand from time to time. Plus it’s made by a British company!

This combination straightens my wrists out completely while typing, and is slowly eliminating (as I train myself to use the pedals) my use of the left thumb for anything other than the space bar.

I hope this is of some use to a few people out there suffering needlessly from similar problems.


I use a lot of the ideas of David Allen’s Getting Things Done as the basis for my system of capturing, organising and checking off projects and tasks. I like it; it helps make sure that I’m not missing anything.

I do, however, find it somewhat lacking in the area of giving me a day-to-day tactical feel of what I need to get done. Recently I’ve been trying to fill that gap with a simple tool called personal kanban.

What is personal kanban?

Kanban board
Kanban (看板 — literally “signboard”) is a scheduling system developed by Taiichi Ohno at Toyota to direct the manufacture of vehicles and minimise work-in-progress. Personal Kanban (PK) is an adaptation of the ideas behind the original kanban system to the kind of knowledge work done by most office workers.

PK has two key principles:

  1. Visualise your work
  2. Limit your work in progress

The idea is that if you can see what you’re doing (and not doing yet) you’ll feel more in control, which will give you the confidence to make a conscious decision to focus on just a small number of things at once. That in turn helps to alleviate that nagging feeling of being overloaded, while still letting you get work done.

The implementation involves moving cards or sticky notes between columns on a wall or whiteboard, a concept which is probably easier to understand with an example.

PK and web publishing

A piece of content (blog post, news article, whatever) typically moves through a fairly fixed workflow. It starts life as an idea, then the time comes when it’s ready to write, after which you might outline it, draft it, send it round for review and finally publish it.

On your whiteboard, draw up a column for each of those stages, and assign each article its own sticky note. Then simply move the sticky notes from column to column as you work, and experience the satisfaction of watching the system flow and seeing work get done.
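For the programmatically minded, the mechanics of a board with WIP limits can be sketched in a few lines of Python (the class and column names here are purely illustrative, not any real tool's API):

```python
class Board:
    """A toy kanban board: named columns of cards, with optional WIP limits."""

    def __init__(self, columns, wip_limits=None):
        self.columns = {name: [] for name in columns}
        self.wip_limits = wip_limits or {}

    def add(self, column, card):
        # Refuse to add a card if the column is already at its WIP limit
        limit = self.wip_limits.get(column)
        if limit is not None and len(self.columns[column]) >= limit:
            raise ValueError("WIP limit reached for " + column)
        self.columns[column].append(card)

    def move(self, card, source, destination):
        # Moving a card respects the destination's WIP limit too
        self.columns[source].remove(card)
        self.add(destination, card)
```

The point of raising an error at the limit is the same as running out of space in a whiteboard column: it forces a conscious decision about what to finish before starting anything new.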

It’s a great way to ensure a sensible flow of content without either working yourself to death or running out of things to publish. I’ve used a variation of this system at work for a while now to get news items and blog posts published, and I’m just starting to implement it for this blog too.

It works very well with teams too, as everyone can see the whole team’s workload. I use this to assist in coordinating a small team of PhD students who contribute stuff to our website, using the excellent Trello in place of a physical board.

PK and generic tasks

Once you’ve understood the basic concept, you can basically use it however works for you. You’re encouraged to experiment and adapt the basic idea in whatever way seems to make sense, in a kaizen-like continual improvement fashion.

While it’s useful for sets of similar tasks like blog posts, you can also adapt it to a generic task workflow. I use the following:

  • Ready: tasks which could potentially be done now
  • Doing: tasks actually in progress
  • Waiting: tasks which can’t be acted on yet because they’re waiting for input from someone else
  • Done: completed tasks

I have this up on a whiteboard in my office, with each task on a post-it note, which allows me to see at a glance everything that I’ve got going on at the moment, and thus make sure that I’m balancing my priorities correctly — in accordance with PK principle 1 (“Visualise your work”). I also have a limit on the number of tasks that can be in “Doing” and “Waiting” at any one time (PK principle 2: “Limit your work in progress”), which helps me to make sure I’m not feeling overloaded.

I try to keep this as simple as possible, but occasionally introduce little codes like coloured stickers to help with visualising the balance when I need to. The whole point is to use the basic ideas to make a system that works for you, rather than anything that’s too prescriptive.

Of course, I can’t carry a whiteboard around with me, so when I’m out of the office for a while I’ll transfer everything to Trello, which I can access via the web and on my phone and iPad, or even just take a photo of the board.

Combining PK with GTD

GTD is a great system for making sure you’re capturing all the work that needs to be done, but I’ve always been dissatisfied with its ideas about prioritising, which are based on:

  1. Context (where you are/what facilities you have access to)
  2. Time available
  3. Energy (how tired/refreshed you are)

Organising tasks by context has always felt like unnecessary detail, while worrying too much about time and energy on a task-by-task basis seems like a recipe for procrastination (though managing time/energy on a more general level can be useful).

I’ve ended up with a two-level system. GTD is for strategic purposes: tracking projects, balancing long-term priorities and making sure nothing slips through the cracks. Kanban is a much more tactical tool, to help see what needs to be done right now, this week, or later on.


This year, I’ve decided to make a project of working through the winners of the Hugo Award for Best Novel. The Hugos are one of the biggest English-language science fiction and fantasy awards going. Late last year, I came across the list looking to broaden my reading (I tend to fixate on individual authors and devour entire canons of work before moving on) and realised I’d already read quite a few, and there were quite a few more that I had my eye on to read soon.

I’ve already been distracted by the epic Great Book of Amber (all 10 of Roger Zelazny’s Amber novels in a single mighty volume — I’ve wanted to read it for ages and I got it for Christmas), but I’m enjoying the variety. So far, I’ve read Kim Stanley Robinson’s Green Mars and Blue Mars, having started the Mars trilogy before Christmas, and Zelazny’s classic Lord of Light (watch out for The Pun!), and I’m currently enjoying The Left Hand of Darkness by Ursula K. Le Guin.

Winners I’d already read:

  • Starship Troopers, Robert Heinlein
  • Dune, Frank Herbert
  • Ringworld, Larry Niven
  • Rendezvous with Rama, Arthur C. Clarke
  • The Fountains of Paradise, Arthur C. Clarke
  • Foundation’s Edge, Isaac Asimov
  • Neuromancer, William Gibson
  • Hyperion, Dan Simmons
  • Harry Potter and the Goblet of Fire, J. K. Rowling
  • The Graveyard Book, Neil Gaiman
  • Redshirts, John Scalzi

Open Access Button
Every day people around the world such as doctors, scientists, students and patients are denied access to the research they need. With the power of the internet, the results of academic research should be available to all. It’s time to showcase the impact of paywalls and help people get the research they need. That’s where Open Access Button comes in.

The Open Access Button is a browser plugin that allows people to report when they hit a paywall and cannot access a research article. Sign up for your very own Button and start using it.

I just want to flag up this cool project that’s trying to improve access to scholarly literature for everyone. I’ve been involved with the project from the start, helping to figure out how to tie it in with open access repositories, but it’s medical students David Carroll and Joe McArthur who deserve the credit for coming up with the idea and driving it forward.

To date, more than 5,000 blocked articles have already been logged. It even got mentioned in the Guardian! Take a look, give it a try or even get involved:

How did I get involved?

Last year, I spent some free time creating an experimental web tool to look up Open Access1 versions of scholarly articles from their DOIs2. There is already a system for getting the official version of record for any DOI, but it struck me that where that version is hidden behind a paywall and a free version is available elsewhere, it should be just as easy to find that.

This work got noticed by a group of people at a hack day3, which resulted in my contributing to their project, the Open Access Button. The primary purpose of the OA Button is to allow people to report whenever they hit a paywall while trying to read an article (so that the scale of the problem can be visualised), and as an added bonus, we’re adding functionality to help gain access through other channels, including finding green open access versions and contacting the corresponding author.

  1. “Open Access” refers to content which is freely available for all to read and use (usually referring to scholarly articles that publish the results of academic research), as distinct from content which is only accessible by paying a fee (either per article or as a subscription to a journal).

  2. A Digital Object Identifier (DOI) is a unique string of characters that identifies a published object, such as an article, book or dataset. They look something like this: 10.1000/182.

  3. A hack day is an opportunity for developers and non-developers to get together and prototype projects without any risk of loss or ridicule if things don’t work out — great for getting the creative juices flowing!
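As an aside on footnote 2: DOIs are regular enough that you can sanity-check one with a simple pattern. The regex below is a simplification of the real syntax rules, not an official validator:

```python
import re

# Every DOI starts with the "10." directory indicator, then a numeric
# registrant code, a slash and a suffix. Real DOI syntax is looser than
# this, so treat it as a rough first-pass check only.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(s):
    return bool(DOI_RE.match(s))
```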