
This post is a little bit of an experiment, in two areas:

  1. Playing with some of my own data, using a language that I'm quite familiar with (Python) and a mixture of tools which are old and new to me; and
  2. Publishing that investigation directly on my blog, as a static IPython Notebook export.

The data I'll be using is information about my exercise that I've been tracking over the last few years using RunKeeper. RunKeeper allows you to export and download your full exercise history, including GPX files showing where you went on each activity, and a couple of summary files in CSV format. It's the latter that I'm going to take a look at: just an initial survey to see if anything interesting jumps out.

I'm not expecting any massive insights just yet, but I hope you find this a useful introduction to some very valuable data wrangling and analysis tools.

Looking at the data

First up, we need to do some basic setup:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import display

This imports some Python packages that we'll need:

  • matplotlib: makes pretty plots (the %matplotlib inline bit is some IPython Notebook magic to make plots appear inline)
  • numpy: provides maths and stats functions
  • pandas: loads and manipulates tabular data

Next, let's check what files we have to work with:

In [2]:
%ls data/*.csv
data/cardioActivities.csv  data/measurements.csv

I'm interested in cardioActivities.csv, which contains a summary of each activity in my RunKeeper history. Loading it up gives us this:

In [3]:
cardio = pd.read_csv('data/cardioActivities.csv',
                     parse_dates=[0, 4, 5],
                     index_col=0)
display(cardio.head())
Type Route Name Distance (mi) Duration Average Pace Average Speed (mph) Calories Burned Climb (ft) Average Heart Rate (bpm) Notes GPX File
Date
2015-03-27 12:26:34 Running NaN 3.31 29:57 9:03 6.63 297 48.45 157 NaN 2015-03-27-1226.gpx
2015-03-21 09:44:25 Running NaN 11.31 2:12:01 11:40 5.14 986 283.41 146 NaN 2015-03-21-0944.gpx
2015-03-19 07:21:36 Running NaN 5.17 52:45 10:12 5.88 423 75.23 150 NaN 2015-03-19-0721.gpx
2015-03-17 06:51:42 Running NaN 1.81 17:10 9:29 6.32 144 23.27 137 NaN 2015-03-17-0651.gpx
2015-03-17 06:21:51 Running NaN 0.94 7:25 7:52 7.64 17 3.65 136 NaN 2015-03-17-0621.gpx

Although my last few activities are runs, there are actually several different possible values for the "Type" column. We can take a look like this:

In [4]:
cardio['Type'] = cardio['Type'].astype('category')
print(cardio['Type'].cat.categories)
Index(['Cycling', 'Hiking', 'Running', 'Walking'], dtype='object')

From this you can see there are four types: Cycling, Hiking, Running and Walking. Right now, I'm only interested in my runs, so let's select those and do an initial plot.

In [5]:
runs = cardio[cardio['Type'] == 'Running']
runs['Distance (mi)'].plot()
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d914ac8>

We can notice two things straight away:

  • There's a gap at the start of 2014: this is probably where RunKeeper has no distance information because my GPS watch didn't record properly, and I don't want to include those activities in my analysis.
  • There's a big spike from where I did the 12 Labours of Hercules ultramarathon, which isn't really an ordinary run so I don't want to include that either.

Let's do some filtering (excluding those, and some runs with "unreasonable" speeds that might be mislabelled runs or cycles) and try again.

In [6]:
runs = runs[(runs['Distance (mi)'] <= 15)
            & runs['Average Speed (mph)'].between(3.5, 10)]
runs['Distance (mi)'].plot()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d5f4fd0>

That looks much better. Now we can clearly see the break I took between late 2012 and early 2014 (problems with my iliotibial band), followed by a gradual return to training and an increase in distance leading up to my recent half-marathon.

There are other types of plot we can look at too. How about a histogram of my run distances?

In [7]:
runs['Distance (mi)'].hist(bins=30)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d56fc50>

You can clearly see here the divide between my usual weekday runs (usually around 3–5 miles) and my longer weekend runs. I've only been running >7 miles very recently, but I suspect as time goes on this graph will start to show two distinct peaks. There also seem to be peaks around whole numbers of miles: it looks like I have a tendency to finish my runs shortly after the distance readout on my watch ticks over! The smaller peak around 1 mile is where I run to the gym as a warmup before a strength workout.
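
As a quick check of that whole-mile tendency (just a sketch, reusing the filtered runs DataFrame from above), we could histogram the fractional part of each distance; a pile-up just above zero would support the idea:

(runs['Distance (mi)'] % 1).hist(bins=20)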

How fast do I run? Let's take a look.

In [8]:
runs['Average Speed (mph)'].hist(bins=30)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d4dd518>

Looks like another bimodal distribution. There's not really enough data here to be sure, but this could well be a distinction between longer, slower runs and shorter, faster ones. Let's try plotting distance against speed to get a better idea.

In [9]:
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d43ac18>

Hmm, no clear trend here. Maybe that's because when I first started running I was nowhere near so fit as I am now, so those early runs were both short and slow! What if we restrict it just to this year?

In [10]:
runs = runs.loc[runs.index > '2015-01-01']
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f716d35e128>

That's better: now it's clear to see that, in general, the further I go, the slower I run!
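
To put a rough number on that (again only a sketch, using the same filtered runs DataFrame), we could compute the correlation between the two columns; a clearly negative value would back up the eyeball impression:

runs[['Distance (mi)', 'Average Speed (mph)']].corr()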

So, as expected, no major insights. Now that I'm over my knee injury and back to training regularly, I'm hoping that I'll be able to collect more data and learn a bit more about how I exercise and maybe even make some improvements.

How about you? What data do you have that might be worth exploring? If you haven't anything of your own, try browsing through one of these:

Comments

In a classic example of the human tendency to weave everything we see into our own narrative, I recently found myself looking at the 18th century research data of botanist John Sibthorp, embodied in his Flora Graeca.

Hellebores officinalis

It all came about through a visit to Oxford, organised as part of the CPD programme run by the M25 group of libraries. We first had a tour of the famous Bodleian Library’s reading rooms — quite a revelation for a very STEM-focussed non-librarian like me!

After finishing at the Bodleian, we dutifully trooped up Parks Road to the Department of Botany and its pride and joy the Sherardian Library and Herbaria. The Sherardian includes, alongside many classic botanical reference books, an impressive collection of original botanical sketches and specimens dating back centuries (and still used by researchers today).

John Sibthorp was an English botanist, and was Sherardian Professor of Botany at the University of Oxford, a chair he inherited from his father Dr Humphry Sibthorp in 1784. In the late 1780s he took a botanical tour of Greece and Cyprus to collect material for a flora of the region, eventually published as the Flora Graeca Sibthorpiana.

The lovely staff at the Sherardian had laid out several original volumes of Sibthorp’s Flora Graeca to inspect, alongside the various source materials:

  • Sibthorp’s diary of his trip to the Mediterranean
  • Original pencil sketches of the flora, painstakingly labelled by Sibthorp’s artist, Ferdinand Bauer, to indicate the precise shade of each part (he used only graphite pencil in the field)
  • The actual specimens collected by Sibthorp, carefully pressed and preserved with mercury
  • The watercolours developed by Bauer on their return to Oxford, based only on the sketches, the fast-fading specimens and his memory (he produced around 900 of these at a rate of roughly one every 1 1/4 days!)

What’s interesting about all this is that Sibthorp was, in reality, a lousy field biologist. His diary, while beginning fairly well, became less and less legible as the trip went on. Most of the specimens, along with Bauer’s sketches, were unlabelled. In fact, the vast majority of the material collected by Sibthorp remained only in his head.

Before publishing, Sibthorp felt he had to return to the Mediterranean for a second time to collect more material, which he duly did. He never returned to Oxford: instead he died of consumption in Bath in 1796, and his work was published posthumously by the University of Oxford only because of some clever manoeuvring by his lawyer and a close friend.

Of course, all of that knowledge, much of his “research data”, died with him. The Flora Graeca Sibthorpiana was eventually published, but only after a lot of work to decode his diary and figure out which specimens, sketches and watercolours went together.

There are a number of errors in the final version which would easily have been caught had Sibthorp been alive to edit it. A spider’s web on one of the specimens, lovingly reproduced by Bauer in his watercolour, was misinterpreted by one of the artists producing the plates for printing, and was rendered as fine, downy hairs on the leaf; of course, the actual plant has no such hairs. Reading between the lines, I suspect that the final published work is much poorer for the loss of the information locked up in Sibthorp’s brain.

Would he have been allowed to get away with this in the modern world? Today his trip would have been funded not by the university at the insistence of his professor father, but probably by the BBSRC. That funding would come with a number of conditions, including an expectation that the work be documented, preserved and made available to other researchers to study. Now, though, we’ll never know what we lost when John Sibthorp died.

The Flora Graeca and its associated material still provide valuable information to this day. New analytical techniques allow us to obtain new data from the specimens, many of which are type specimens for their species. All of the associated artwork has been digitised, and low-resolution versions of the watercolours and colour plates are available to use under a Creative Commons license. Although the physical books are no longer routinely used for reference, the high-resolution scans are consulted quite regularly by modern researchers, and work is currently in progress to link together all of the digitised material so it can be searched by species, family, geographical area or a number of other aspects.

It was fascinating to see such rare materials first-hand, and to have them brought to life by such a knowledgeable speaker, and I feel privileged to have had the chance. For anyone interested, you can browse a digital version of the Flora Graeca online.

Comments

So, it’s been a while since I’ve properly updated this blog, and since I seem to be having another try, I thought it would be useful to give a brief overview of what I’m doing these days, so that some of the other stuff I have in the pipeline makes a bit more sense.

My current work focus is research data management: helping university researchers to look after their data in ways that let them and the community get the most out of it. Data is the bedrock of most (all?) research: the evidence on which all the arguments and conclusions and new ideas are based. In the past, this data has been managed well (generally speaking) by and for the researchers collecting and using it, and this situation could have continued indefinitely.

Technology, however, has caused two fundamental changes to this position. First, we’re able to measure more and more about more and more, creating what has been termed a “data deluge”. It’s now possible for one researcher to generate, in the normal course of their work, far more data than they could possibly analyse themselves in a lifetime. For example, the development of polymerase chain reaction (PCR) techniques has enabled the fast, cheap sequencing of entire genomes: for some conditions, patients’ genomes are now routinely sequenced for future study. A typical human genome sequence occupies 8TB (about 1700 DVDs), and after processing and compression, this shrinks to around 100GB (21 DVDs). This covers approximately 23,000 genes, of which any one researcher may only be interested in a handful.

Second, the combination of the internet and cheap availability of computing power means that it has never been easier to share, combine and process this data on a huge scale. To continue our example, it’s possible to study genetic variations across hundreds or thousands of individuals to get new insights into how the body works. The 100,000 Genomes Project (“100KGP”) is an ambitious endeavour to establish a database of such genomes and, crucially, develop the infrastructure to allow researchers to access and analyse it at scale.

In order to make this work, there are plenty of barriers to overcome. The practices that kept data in line long enough to publish the next paper are no longer good enough: the organisation and documentation must be made explicit and consistent so that others can make sense of it. It also needs to be protected better from loss and corruption. Obviously, this takes more work than just dumping it on a laptop, so most people want some reassurance that this extra work will pay off.

Sharing has risks too. Identifiable patient data cannot be shared without the patient’s consent; indeed, doing so would be a criminal offence in Europe. Similar rules apply to sensitive commercial information. Even if there aren’t legal restrictions, most researchers have a reasonable expectation (albeit developed before the “data deluge”) that they be able to reap the reputational rewards of their own hard work by publishing papers based on it.

There is therefore a great deal of resistance to these changes. But there can be benefits too. For society, there is the possibility of advancing knowledge in directions that would never have been possible even ten years ago. But there are practical benefits to the individuals too: every PhD supervisor and most PhD students know the frustration of trying to continue a student’s poorly-documented work after they’ve graduated.

For funders, the need for change is particularly acute. Budgets are being squeezed, and with the best will in the world there is less money to go around, so there is pressure to ensure the best possible return on investment. This means that it’s no longer acceptable, for example, for several labs around the country to run identical experiments just so each can do something different with the results. It’s more important than ever to make more data available to and reusable by more people.

So the funders (in the UK, particularly the government-funded research councils) are introducing requirements on the researchers they fund to move along this path more quickly than they might feel comfortable with. It therefore seems reasonable to offer these hard-working people some support, and that’s where I come in.

I’m currently spending my time providing training and advice, bringing people together to solve problems and trying to convince a lot of researchers to fix what, in many cases, they didn’t think was broken! They are subject to conflicting expectations and need help navigating this maze so that they can do what they do best: discover amazing new stuff and change the world.

For the last 6ish months I’ve been doing this at Imperial College (my alma mater, no less) and loving it. It’s a fascinating area for me, and I’m really excited to see where it will lead me next!

If you have time, here’s a (slightly tongue-in-cheek) take on the problem from the perspective of a researcher trying to reuse someone else’s data:

Comments

I recently read an article, by Cox, Pinfield and Smith [1], which attempted to analyse research data management (RDM) as a “wicked problem”. The concept of a wicked problem has been around for a little while in the context of social policy, but this is the first I’ve heard of it.

What, then, is a “wicked” problem?

Cox et al invoke a full 16 properties of wicked problems (drawing on the work of earlier authors), but put simply, a problem is wicked if it is so difficult to define that it can only ever admit of an imperfect, compromise solution. Many different, often contradictory, perspectives on such a problem exist, each suggesting its own solution, and any intervention “changes the problem in an irreversible way”, so that gradual learning by trial and error is impossible.

Many truly wicked problems affecting society as a whole, such as child poverty or gender inequality, are unlikely to be satisfactorily solved in our lifetimes (though we should, of course, keep trying). For RDM though, I feel that there is some hope. It does display many aspects of wickedness, but it seems unlikely that it will stay this way forever. Technology is evolving, cultures and attitudes are shifting and the disparate perspectives on research data are gradually coming into alignment.

In the meantime, “wickedness” provides a useful lens through which to view the challenge of RDM. Knowing that it shows many wicked features, we can be more intelligent about the way we go about solving it. For example, Cox et al draw on the literature on wicked problems to suggest some leadership principles that are appropriate, such as:

“Collective intelligence not individual genius – turning to individuals to solve problems through their individual creativity is less successful than people working together to develop ideas.”

Far from getting disheartened about the difficulty of the task ahead, we can accept that it is necessarily complex and act accordingly. I look forward to following up some of the references on strategies for dealing with wickedness — hopefully I’ll be able to share more of what I learn as I go along.

  1. Cox, Andrew M., Stephen Pinfield, and Jennifer Smith. 2014. “Moving a Brick Building: UK Libraries Coping with Research Data Management as a ‘wicked’ Problem.” Journal of Librarianship and Information Science, May, 0961000614533717. doi:10.1177/0961000614533717.

Comments

You might have noticed that my latest spurt of blogging has been interrupted again. This time, it’s due to a recurrence of an old enemy of mine, repetitive strain injury (RSI). It’s coming back under control now and I thought I’d share what’s worked for me, but before we go any further, heed my warning:

If you have any sort of pain or numbness at all when using a computer, even if it’s momentary, take the time to evaluate your setup and see if there is anything you can change.

I can’t over-stress how important this is — RSI is a serious medical problem if allowed to get out of hand (potentially as career-ending as a sports injury can be for a professional athlete), and it can be prevented entirely by ensuring your workspace is appropriate to the way you do your work. If in doubt, seek expert and/or medical advice: many larger employers have occupational health advisors, and your GP will be able to advise or refer you to an appropriate specialist.

As a lot of intensive computer-users do, I’ve had numerous bouts of computer-related pain over the years, and at one point even had to switch to voice recognition software for several months. If you’ve ever tried to use voice recognition, especially to do any programming, you’ll understand how frustrating that can be.

Previously, I’ve had pain associated with using the mouse, so these days I tend to drive my computer primarily by keyboard. I use apps (like Emacs) that are very keyboard-friendly, and when on Linux I even use a keyboard friendly window manager to minimise my need to use the mouse at all. I also have a regular mouse at work and a trackball at home, so I’m varying the set of muscles I use to mouse with.

As a result, I use a lot of key combinations involving the Control, Alt and Windows/Command keys, and recently I’ve started having pain in my thumbs (particularly the left) from curling them under to hit the Alt and Win keys. Note: Emacs users more commonly suffer from the problem known as “Emacs pinky”, but I headed that one off at the pass early on by remapping my caps-lock key as another Control key.

I’m very lucky: my workplace has a dedicated assistive technologies advisor, who has a collection of alternative keyboards, mice and other input devices, so I was able to have a chat with him, get some expert advice and try out several possible options before committing to buying anything.

Here’s what I’ve ended up with:

  • Kinesis Freestyle II USB keyboard: This keyboard is split down the middle, and allows the halves to be positioned and tilted independently to reduce the unnatural bend in my wrist, as well as putting less strain on my shoulders. Through some experimentation with this and similar keyboards, I’ve found that having the two halves parallel but about 15–20cm apart is more comfortable for me than having them close together and angled (like the keyboards made by Goldtouch).
  • Programmable USB foot pedals: These are such simple devices they can be picked up for cheap almost anywhere on the net, or even hacked together using the controller from an old USB keyboard. I have three pedals set up, with Control under my left foot, Alt under my right and Windows/Command inside Alt so I can reach it with my right foot easily. Initially, I found that I was getting some back pain after starting to use the pedals, but I’ve since realised that this was because I had them positioned awkwardly — I started over by looking at where my feet fell naturally while working and then moving the pedals into position accordingly.
  • Posturite Penguin mouse: This also helps to reduce the unnatural bend in my wrist, as well as dealing with my tendency to “anchor” or rest my forearm on the desk while moving the mouse only with my fingers and wrist. It comes in three sizes to fit your hand, and has a switch to swap the scroll wheel direction so you can swap it to your other hand from time to time. Plus it’s made by a British company!

This combination straightens my wrists out completely while typing, and is slowly eliminating (as I train myself to use the pedals) my use of the left thumb for anything other than the space bar.

I hope this is of some use to a few people out there suffering needlessly from similar problems.

Comments

I use a lot of the ideas of David Allen’s Getting Things Done as the basis for my system of capturing, organising and checking off projects and tasks. I like it; it helps make sure that I’m not missing anything.

I do, however, find it somewhat lacking when it comes to giving me a day-to-day, tactical view of what I need to get done. Recently I’ve been trying to fill that gap with a simple tool called personal kanban.

What is personal kanban?

Kanban board
Kanban (看板 — literally “billboard”) is a scheduling system developed by Taiichi Ohno at Toyota to direct vehicle manufacturing while minimising work-in-progress. Personal Kanban (PK) is an adaptation of the ideas behind the original kanban system for the kind of knowledge work done by most office workers.

PK has two key principles:

  1. Visualise your work
  2. Limit your work in progress

The idea is that if you can see what you’re doing (and not doing yet) you’ll feel more in control, which will give you the confidence to make a conscious decision to focus on just a small number of things at once. That in turn helps to alleviate that nagging feeling of being overloaded, while still letting you get work done.

The implementation involves moving cards or sticky notes between columns on a wall or whiteboard, a concept which is probably easier to understand with an example.

PK and web publishing

A piece of content (blog post, news article, whatever) typically moves through a fairly fixed workflow. It starts life as an idea, then the time comes when it’s ready to write, after which you might outline it, draft it, send it round for review and finally publish it.

On your whiteboard, draw up a column for each of the stages in the previous paragraph (idea, ready to write, outline, draft, review, publish), and assign each article its own sticky note. Then simply move the sticky notes from column to column as you work and experience the satisfaction of watching the system flow and seeing work get done.

It’s a great way to ensure a sensible flow of content without either working yourself to death or running out of things to publish. I’ve used a variation of this system at work for a while now to get news items and blog posts published, and I’m just starting to implement it for this blog too.

It works very well with teams too, as everyone can see the whole team’s workload. I use this to assist in coordinating a small team of PhD students who contribute stuff to our website, using the excellent Trello in place of a physical board.

PK and generic tasks

Once you’ve understood the basic concept, you can use it however works best for you. You’re encouraged to experiment and adapt the idea in whatever way seems to make sense, in a kaizen-like spirit of continual improvement.

While it’s useful for sets of similar tasks like blog posts, you can also adapt it to a generic task workflow. I use the following:

  • Ready: tasks which could potentially be done now
  • Doing: tasks actually in progress
  • Waiting: tasks which can’t be acted on yet because they’re waiting for input from someone else
  • Done: completed tasks

I have this up on a whiteboard in my office, with each task on a post-it note, which allows me to see at a glance everything that I’ve got going on at the moment, and thus make sure that I’m balancing my priorities correctly — in accordance with PK principle 1 (“Visualise your work”). I also have a limit on the number of tasks that can be in “Doing” and “Waiting” at any one time (PK principle 2: “Limit your work in progress”), which helps me to make sure I’m not feeling overloaded.

I try to keep this as simple as possible, but occasionally introduce little codes like coloured stickers to help with visualising the balance when I need to. The whole point is to use the basic ideas to make a system that works for you, rather than anything that’s too prescriptive.

Of course, I can’t carry a whiteboard around with me, so when I’m out of the office for a while I’ll transfer everything to Trello, which I can access via the web and on my phone and iPad, or even just take a photo of the board.

Combining PK with GTD

GTD is a great system for making sure you’re capturing all the work that needs to be done, but I’ve always been dissatisfied with its ideas about prioritising, which are based on:

  1. Context (where you are/what facilities you have access to)
  2. Time available
  3. Energy (how tired/refreshed you are)

Organising tasks by context has always felt like unnecessary detail, while worrying too much about time and energy on a task-by-task basis seems like a recipe for procrastination (though managing time/energy on a more general level can be useful).

I’ve ended up with a two-level system. GTD is for strategic purposes: tracking projects, balancing long-term priorities and making sure nothing slips through the cracks. Kanban is a much more tactical tool, to help see what needs to be done right now, this week, or later on.

Comments

This year, I’ve decided to make a project of working through the winners of the Hugo Award for Best Novel. The Hugos are one of the biggest English-language science fiction and fantasy awards going. Late last year, I came across the list looking to broaden my reading (I tend to fixate on individual authors and devour entire canons of work before moving on) and realised I’d already read quite a few, and there were quite a few more that I had my eye on to read soon.

I’ve already been distracted by the epic Great Book of Amber (all 10 of Roger Zelazny’s Amber novels in a single mighty volume — I’ve wanted to read it for ages and I got it for Christmas), but I’m enjoying the variety. So far, I’ve read Kim Stanley Robinson’s Green Mars and Blue Mars, having started the Mars trilogy before Christmas, and Zelazny’s classic Lord of Light (watch out for The Pun!), and I’m currently enjoying The Left Hand of Darkness by Ursula K. Le Guin.


Winners I’d already read:

  • Starship Troopers, Robert Heinlein
  • Dune, Frank Herbert
  • Ringworld, Larry Niven
  • Rendezvous with Rama, Arthur C. Clarke
  • The Fountains of Paradise, Arthur C. Clarke
  • Foundation’s Edge, Isaac Asimov
  • Neuromancer, William Gibson
  • Hyperion, Dan Simmons
  • Harry Potter and the Goblet of Fire, J. K. Rowling
  • The Graveyard Book, Neil Gaiman
  • Redshirts, John Scalzi
Comments

Open Access Button
Every day people around the world such as doctors, scientists, students and patients are denied access to the research they need. With the power of the internet, the results of academic research should be available to all. It’s time to showcase the impact of paywalls and help people get the research they need. That’s where Open Access Button comes in.

The Open Access Button is a browser plugin that allows people to report when they hit a paywall and cannot access a research article. Head to openaccessbutton.org to sign up for your very own Button and start using it.

I just want to flag up this cool project that’s trying to improve access to scholarly literature for everyone. I’ve been involved with the project from the start, helping to figure out how to tie it in with open access repositories, but it’s medical students David Carroll and Joe McArthur who deserve the credit for coming up with the idea and driving it forward.

To date, more than 5,000 blocked articles have already been logged. It even got mentioned in the Guardian! Take a look, give it a try or even get involved:


How did I get involved?

Last year, I spent some free time creating an experimental web tool to look up Open Access [1] versions of scholarly articles from their DOIs [2]. There is already a system for getting the official version of record for any DOI, but it struck me that where that version is hidden behind a paywall and a free version is available elsewhere, it should be just as easy to find that.
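
For anyone curious, that official resolution step is simple to sketch in Python: the public doi.org resolver redirects each DOI to its version of record, so following the redirect gives you the publisher’s URL. This is just an illustration using the requests library and the example DOI from the footnotes, not part of the tool itself:

import requests

def resolve_doi(doi):
    # Follow the doi.org redirect to the publisher's version of record
    response = requests.get('https://doi.org/' + doi, allow_redirects=True)
    return response.url

# Example DOI borrowed from the footnotes below
print(resolve_doi('10.1000/182'))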

This work got noticed by a group of people at a hack day [3], which resulted in my contributing to their project, the Open Access Button. The primary purpose of the OA Button is to allow people to report whenever they hit a paywall while trying to read an article (so that the scale of the problem can be visualised), and as an added bonus, we’re adding functionality to help gain access through other channels, including finding green open access versions and contacting the corresponding author.

  1. “Open Access” refers to content which is freely available for all to read and use (usually referring to scholarly articles that publish the results of academic research), as distinct from that which is only accessible by paying a fee (either per-article or as a subscription to a journal).

  2. A Digital Object Identifier (DOI) is a unique string of characters that identifies a published object, such as an article, book or dataset. They look something like this: 10.1000/182.

  3. A hack day is an opportunity for developers and non-developers to get together and prototype projects without any risk of loss or ridicule if things don’t work out — great for getting the creative juices flowing!

Comments


As I’ve mentioned previously, I periodically try out new task management software. The latest in that story is Emacs and Org-mode.

What is Org?

In its creator’s own words, Org is:

“for keeping notes, maintaining TODO lists, planning projects, and authoring documents with a fast and effective plain-text system”

It started as an Emacs extension for authoring documents with some neat outlining features, then went mad with power and became a complete personal information organiser.

But wait, what the **** is Emacs?

Emacs is the mother of all text editors. It’s one of the oldest pieces of free software, having been around since the dawn of time (well, the 1970s), and is still under active development. Being so venerable, it still cleaves to the conventions of the 70s and is entirely keyboard-controllable (though it now has excellent support for your favourite rodent as well).

“Text editor” is actually a pretty loose term in this instance: it’s completely programmable, in a slightly odd language called Elisp (which appeals to my computer scientist side). Because many of the people who use it are programmers, it’s been extended to do almost anything that you might want, from transparently editing encrypted or remote (or both) files to browsing the web and checking your email.

My needs for an organisational system

In my last productivity-related post I mentioned that the key properties of a task management system were:

  • One system for everything
  • Multiple ways of structuring and viewing tasks

I would now probably add a third property: the ability to “shrink-wrap”, or be as simple as possible for the current situation while keeping extra features hidden until needed.

And Org very much fits the bill.

One system for everything

Emacs has been ported to pretty much every operating system under the sun, so I know I can use it on my Linux desktop at work, my iMac at home plus whatever I end up with in the future. Because the files are all plain text, they’re trivial to keep synchronised between multiple machines.

There are also apps for iOS and Android, and while they’re not perfect, they’re good enough for when I want to take my todo list on the road.

Multiple ways of structuring and viewing tasks

Whatever I’m doing in Emacs, an instant agenda with all my current tasks is only two keystrokes away. That’s programmable too, so I have it customised to view my tasks in the way that makes most sense to me.

Shrink wrapping

Org has a lot of very clever features added by its user community over its 10+ years, but you don’t have to use them, or even know they exist, until you need them. As an illustration, a simple task list in Org looks like this:

* TODO Project 1
** TODO Task one
** TODO Task two

* TODO Project 2
** DONE Another task
** TODO A further task

And changing TODO to DONE is a single keystroke. Simplicity itself.

Here’s Carsten Dominik on the subject:

“[Org-mode] is a zero-setup, totally simple TODO manager that works with plain files, files that can be edited on pretty much any system out there, either as plain text in any editor …

Of course, Org-mode allows you to do more, but I would hope in a non-imposing way! It has lots of features under the hood that you can pull in when you are ready, when you find out that there is something more you’d like to do.”

Wow, what else can it do?

“I didn’t know I could do that!”

If that’s not enough, here are a few more reasons:

  • Keyboard shortcuts for quick outline editing
  • Lots of detailed organisational tools (but only when you need them):
    • Schedule and deadline dates for tasks
    • Flexible system for repeating tasks/projects
    • Complete tasks in series or parallel
    • Arbitrary properties, notes and tags for tasks and projects
  • Use the same tools for authoring HTML/LaTeX documents or even literate programming
  • It’s programmable! If it doesn’t have the functionality you want just write it, from adding keyboard shortcuts to whole new use cases (such as a contact manager or habit tracker)

Give it a try

Emacs is worth trying on its own, especially if you do a lot of programming, web design or anything else that involves a lot of time editing text files. A recent version of Org is bundled with the latest GNU Emacs, and can easily be updated to the current version.

Comments

So, as you’ll have seen from my last post, I’ve been putting together an alternative DOI resolver that points to open access copies in institutional repositories. I’m enjoying learning some new tools and the challenge of cleaning up some not-quite-ideal data, but if it’s to grow into a useful service, it needs several things:

A better name

Seriously. “Open Access DOI Resolver” is descriptive but not very distinctive. Sadly, the only name I’ve come up with so far is “Duh-DOI!” (see the YouTube video below), which doesn’t quite convey the right impression.

A new home

I’ve grabbed a list of DOI endpoints for British institutional repositories — well over 100. Having tested the code on my iMac, I can confirm it happily harvests DOIs from most EPrints-based repositories. But I’ve hit 10,000 database rows (the free limit on Heroku, the current host) with just the DOIs from a single repository, which means the public version won’t be able to resolve anything from outside Bath until the situation changes.

Better standards compliance

It’s a fact of life that everyone implements a standard differently. OAI-PMH and Dublin Core are no exception. Some repositories report both the DOI and the open access URL in <dc:identifier> elements; others use <dc:relation> for both while using <dc:identifier> for something totally different, like the title. Some don’t report a URL for the item’s repository entry at all, only the publisher’s (usually paywalled) official URL.

There are efforts under way to improve the situation (like RIOXX), but until then, the best I can do is to implement gradually better heuristics to standardise the diverse data available. To do that, I’m gradually collecting examples of repositories that break my harvesting algorithm and fixing them, but that’s a fairly slow process since I’m only working on this in my free time.
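
To give a flavour of the kind of heuristic involved, here’s a rough sketch (not the actual harvester) that pulls anything DOI-shaped out of a repository’s Dublin Core records, whichever element it turns up in. The endpoint URL is a placeholder and the regular expression is deliberately loose:

import re
import requests
import xml.etree.ElementTree as ET

DC_NS = '{http://purl.org/dc/elements/1.1/}'
DOI_PATTERN = re.compile(r'10\.\d{4,9}/\S+')

def harvest_dois(endpoint):
    # Fetch one page of Dublin Core records over OAI-PMH and yield anything
    # DOI-shaped, whether it was reported in dc:identifier or dc:relation
    response = requests.get(endpoint, params={
        'verb': 'ListRecords',
        'metadataPrefix': 'oai_dc',
    })
    tree = ET.fromstring(response.content)
    for element in tree.iter():
        if element.tag in (DC_NS + 'identifier', DC_NS + 'relation'):
            match = DOI_PATTERN.search(element.text or '')
            if match:
                yield match.group(0)

# Placeholder endpoint -- substitute a real repository's OAI-PMH base URL
for doi in harvest_dois('http://repository.example.ac.uk/cgi/oai2'):
    print(doi)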

xkcd: Standards

Better data

Even with better standards compliance, the tool can only be as good as the available data. I can only resolve a DOI if it’s actually been associated with its article in an institutional repository, but not every record that should have a DOI has one. It’s possible that a side benefit of this tool is that it will flag up the proportion of IR records that have DOIs assigned.

Then there’s the fact that most repository front ends seem not to do any validation on DOIs. As they’re entered by humans, there’s always going to be scope for error, so there should be some validation in place to at least try to detect it. Here are just a few of the “DOIs” from an anonymous sample of British repositories:

  • +10.1063/1.3247966
  • /10.1016/S0921-4526(98)01208-3
  • 0.1111/j.1467-8322.2006.00410.x
  • 07510210.1088/0953-8984/21/7/075102
  • 10.2436 / 20.2500.01.93
  • 235109 10.1103/PhysRevB.71.235109
  • DOI: 10.1109/TSP.2012.2212434
  • ShowEdit 10.1074/jbc.274.22.15678
  • http://doi.acm.org/10.1145/989863.989893
  • http://hdl.handle.net/10.1007/s00191-008-0096-6
  • <U+200B>10.<U+200B>1104/<U+200B>pp.<U+200B>111.<U+200B>186957

In some cases it’s clear what the error is and how to correct it programmatically. In other cases any attempt to correct it is guesswork at best and could introduce as many problems as it solves.

That last one is particularly interesting: the <U+200B> codes are “zero width spaces”. They don’t show on screen but are still there to trip up computers trying to read the DOI. I’m not sure how they would get there other than by a deliberate attempt on the part of the publisher to obfuscate the identifier.
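
As a minimal sketch of the sort of programmatic clean-up I have in mind (the pattern below is only a rough approximation of a modern DOI, not a full validator):

import re

# Roughly DOI-shaped: a "10." prefix, a registrant code, a slash and a suffix
DOI_RE = re.compile(r'10\.\d{4,9}/\S+')

def clean_doi(raw):
    # Strip invisible zero-width spaces and stray labels, then pull out the
    # first DOI-shaped substring (or None if nothing plausible remains)
    text = raw.replace('\u200b', '').replace('DOI:', '').strip()
    match = DOI_RE.search(text)
    return match.group(0) if match else None

# Examples taken from the list above
print(clean_doi('DOI: 10.1109/TSP.2012.2212434'))
print(clean_doi('\u200b10.\u200b1104/\u200bpp.\u200b111.\u200b186957'))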

It’s also only really useful where the repository record we’re pointing to actually has the open access full text, rather than just linking to the publisher version, which many do.

A license

Ok, this one’s pretty easy to solve. I’m releasing the code under the GNU General Public License. It’s on GitHub, so go fork it.

And here’s the video I promised:

Comments