Funders, publishers, research institutions and many other groups are increasingly keen that researchers make more of their data more open. There are some very good reasons for doing this, but many researchers have legitimate concerns that must be dealt with before they can be convinced. This is the first in what I hope will be a series of posts exploring arguments against sharing data.

“We really want to share our data more widely, but we’re worried that it’s going to give the crackpots more opportunity to pick holes in our findings.”

A PhD student asked me something like this recently, and it’s representative of some very real concerns for a lot of researchers. While I answered the question, I didn’t feel satisfied with my response, so I wanted to unpack it a bit more in preparation for next time.

It seems to me that there are three parts to this. No-one likes to:

  • Have their time wasted
  • Be wrongfully and unfairly discredited
  • Have genuine flaws found in their work

Having genuine errors challenged is a very useful thing, but spurious challenges (i.e. those with no valid basis) can be a stressful time-sink. Such challenges may be made by someone with an interest in seeing you (or your results) discredited; they may also come from someone who simply fails to understand a key concept of your research¹. Either way, they’re a nuisance and best avoided.

Perhaps the scariest aspect of this is the possibility that your critics might actually be on to something. No-one really enjoys finding out that they’ve made a mistake, and we naturally tend to avoid situations where an error we didn’t know was there might be brought to light.

If all this is so, why should you share your data? Ultimately, there will always be crackpots, or at least people with an axe to grind. Publishing your data won’t change this, but it will add weight to your own arguments. Firstly, it says that you’re confident enough in your work to put it out there. Secondly, it gives impartial readers the opportunity to verify your claims independently and come to their own judgement about any potential criticism. It’s much harder for the “crackpots” to pick holes in your work when your supporting evidence is available and the validity of your argument can be easily demonstrated.

There’s also a need to accept, and indeed seek out, valid criticism. None of us is perfect and everyone makes mistakes from time to time. When that happens it’s important to find out sooner rather than later and be ready to make corrections, learn and move on.

  1. Don’t forget Hanlon’s razor: “Never attribute to malice that which can adequately be explained by incompetence.”


It’s been a bit hectic lately because I’ve been finishing up my old job (at Imperial College) and getting started on my new one (Research Data Manager at the University of Sheffield), with a bit of a holiday in between. Hopefully things will calm down a bit now and get back to normal (whatever that looks like…).

In the meantime here are three things I will miss from Imperial:

  • Lovely, friendly, supportive, competent and professional colleagues
  • Lunchtime walks in Hyde Park
  • Imperial College Scifi & Fantasy library (part of the Students’ Union)

And three things I won’t miss:

  • Rude people & overcrowding on the tube/bus/etc
  • Masses of air and noise pollution
  • Travelling between Leeds & London all the time

And finally, three things I’m looking forward to in Sheffield:

  • Taking up a new challenge with a new set of disciplines to work with
  • Catching up with old friends and making a few new ones
  • Lunchtime walks in Weston Park, Crookes Valley Park & the Ponderosa

Emacs org-mode is a powerful tool, as I’ve written about before. As well as being good at project and task management, it also has features for writing documents:

  • Export in a variety of formats, including HTML, OpenDocument and LaTeX
  • Embed snippets of code (in your favourite language), execute them and include the results (and optionally the code itself) in the exported document

I’ll come back to why that’s useful in a moment.

I’ve had a presentation (a position paper on preserving software) accepted for the LIBER conference next month. I may choose to subsequently submit it as an article to LIBER Quarterly (this is a relatively common pattern) so I thought I’d try writing the article and the presentation together in a single document, and see how it worked. If nothing else, writing the paper will help me structure my ideas for the presentation, even if it never gets published.

There are probably other ways of doing it, but I’ve set it up so that exporting in different formats gives me the two different versions of the document:

  • Export as LaTeX produces a PDF version of the presentation slides, using the Beamer package.
  • Export as OpenDocument text produces the article, ready for submission.

It took a little sleight of hand: I put my Beamer slides in #+BEGIN_LaTeX blocks, which org-mode includes in the LaTeX export but not in any other format, and I’ve configured Beamer to ignore any text outside frames; that remaining text forms the body of the article.

I don’t want the abstract and other metadata cluttering up the document itself, so I’ve pushed those out to a separate file and used an #+INCLUDE statement to pull them back into the main document at export time.
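To give a flavour of the setup, here’s a minimal sketch of the kind of structure I mean. The file name, title and headline text are placeholders rather than my actual source; ignorenonframetext is the Beamer class option that makes it skip text outside frames:

#+TITLE: A position paper on preserving software
#+LaTeX_CLASS_OPTIONS: [ignorenonframetext]
#+INCLUDE: "metadata.org"

* Introduction

This paragraph is exported to the article, but Beamer skips it
because it falls outside any frame.

#+BEGIN_LaTeX
\begin{frame}{Introduction}
  This slide appears only in the LaTeX/Beamer export.
\end{frame}
#+END_LaTeX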

The one thing missing is a built-in way of integrating with Zotero, which I use as my bibliographic database, to format my references. However, Zotero has a very functional API, so I’ve put together a short Ruby script that grabs a Zotero collection and tweaks the formatting. Whenever I export the document from org-mode, the code is run and the result (a formatted bibliography) is embedded in the finished version.

require 'open-uri'
require 'rexml/document'

# Ask the Zotero API for the collection, pre-formatted in the chosen citation style
url = "https://api.zotero.org/groups/#{group_id}/collections/#{coll_id}/items/top?format=bib&style=#{style}"

# Each reference comes back as an HTML <div class="csl-entry">; print each one
# as an org-mode list item, converting <i> tags to org-mode /italics/
REXML::Document.new(open(url)).elements.each('//div[@class="csl-entry"]') do |entry|
  puts '- ' + entry.children
              .collect{|c| if c.instance_of? REXML::Text then c.value else c.text end}
              .join.gsub(%r{</?i>}, '/').gsub(%r{/\(}, '/ (')
end

This formats a Zotero collection, converting HTML <i> tags to the equivalent org-mode markup. It requires three parameters (group_id, coll_id and style), which can be configured in the org-mode document and passed through when Emacs executes the code. The same snippet can thus be used in multiple documents, just varying the parameters to format a different set of references.
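For reference, wiring the script into the document is just a matter of giving its source block the right header arguments; something like this, where the ID values are placeholders and :results output raw tells org-mode to treat the printed list as raw org markup:

#+NAME: bibliography
#+BEGIN_SRC ruby :var group_id="123456" :var coll_id="ABCD2345" :var style="apa" :results output raw
  # ...the script above...
#+END_SRC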

Clearly, embedding executable source code in a document has a lot more potential than I’ve used here. It allows data analysis and visualisation code to be embedded directly in a document, even with the option of processing data from tables that are also in the document. You can also use it to write entire programs in the Literate Programming style, formatting them in the way that makes most narrative sense but exporting (“tangling”) pure executable source code.

Once I’m sure it’s not a violation of the journal’s policy, I’ll push the source of the two documents up to github in case anyone wants to see how I’ve done it.


When is the right time to curate?

One of the things that I’ve been thinking about quite a bit since IDCC 2015 is this: exactly when should curation take place in a digital workflow?

There seem to be two main camps here, though of course it’s more of a spectrum than a simple dichotomy. These two views can be described as “sheer curation” and “just-in-time curation”.

Sheer curation

Sheer curation involves completely and seamlessly integrating curation of data into the workflow itself. That’s “sheer” as in tights: it’s such a thin layer that you can barely tell it’s there. The argument here is that the only way to properly capture the context of an artefact is to document it while that context still exists, and this makes a lot of sense. If you wait until later, the danger is that you won’t remember exactly what the experimental conditions for that observation were. Worse, if you wait long enough you’ll forget about the whole thing entirely until it comes time to make use of the data. Then you run the danger of having to repeat the experiment entirely because you can’t remember enough about it for it to be useful.

For this to work, though, you really need it to be as effortless as possible so that it doesn’t interrupt the research process. You also need researchers to have some curation skills themselves, and to minimise the effort required those skills need to be at the stage of unconscious competence. Finally you need a set of tools to support the process. These do exist, but in most cases they’re just not ready for widespread use yet.

Just-in-time curation

The other extreme is to just do the absolute minimum, and apply the majority of curation effort at the point where someone has requested access. This is the just-in-time approach: literally making the effort just in time for the data to be delivered. The major advantage is that there is no wasted effort curating things that don’t turn out to be useful. The alternative is “just-in-case”, where you curate before you know what will or won’t be useful.

The key downside is the high risk of vital context being lost. If a dataset is valuable but its value doesn’t become apparent for a long time, the researchers who created it may well have forgotten or misplaced key details of how it was collected or processed. You also need good, flexible tools that don’t complain if you leave big holes in your metadata for a long time.

Comparison

When might each be useful?

I can see sheer-mode curation being most useful where standards and procedures are well established, especially if the value of data can easily be judged up front and disposal integrated into the process. In particular, this would work well if data capture methods can be automated and instrumented, so that metadata about the context is recorded accurately, consistently and without intervention by the researcher.

Right now this is the case in well-developed data-intensive fields, such as astrophysics and high-energy physics, and newer areas like bioinformatics are getting there too. In the future, it would be great if this could also apply to any data coming out of shared research facilities (such as chemical characterisation and microscopy). Electronic lab notebooks could play a big part for observational research, too.

Just-in-time-mode curation seems to make sense where the overheads of curating are high and only a small fraction of collected data is ever reused, so that the return on investment for curation is very low. It might also be necessary where the resources needed for curation aren’t actually made available until someone wants to reuse the data.

Could they be combined?

As I mentioned at the start, these are just two ends of a spectrum of possibilities, and for most situations the ideal solution will lie somewhere in between. A pragmatic approach would be to capture as much context as is available transparently and up-front (sheer) and then defer any further curation until it is justified. This would allow the existence of the data to be advertised up-front through its metadata (as required by e.g. the EPSRC expectations), while minimising the amount of effort required. The clear downside is the potential for delays in fulfilling the first request for the data, if it ever comes.


The sharp-eyed amongst you will have noticed I’ve recently ended a bit of a break in service on this blog. I’ve been doing that thing of half-writing posts and then never finishing them, so I’ve decided to clear out the pipeline and see what’s still worth publishing. This is a slightly-longer-than-usual piece I started writing about 9 months ago, still in my previous job. It still seems relevant, so here you go. You’re welcome.

What is an electronic lab notebook?

For the last little while at work, I’ve been investigating the possibility of implementing an electronic lab notebook (ELN) system, so here are a few of my thoughts on what an ELN actually is.

What is a lab notebook?

All science is built on data¹. Definitions of “data” vary, but they mostly boil down to this: data is evidence gathered through observation (direct or via instruments) of real-world phenomena.

A lab notebook is the traditional device for recording scientific data before it can be processed, analysed and turned into conclusions. It is typically a hardback notebook, A4 size (in the UK at least) with sequentially numbered pages recording the method and conditions of each experiment along with any measurements and observations taken during that experiment.

In industrial contexts, where patent law is king, all entries must be in indelible ink and various arcane procedures followed, such as the daily signing of pages by researcher and supervisor; even in academia some of these precautions can be sensible.

But the most important thing about a lab notebook is that it records absolutely everything about your research project for future reference. If any item of data is missing, the scientific record is incomplete and may be called into question. At best this is frustrating, as time-consuming and costly work must be repeated; at worst, it leaves you open to accusations of scientific misconduct of the type perpetrated by Diederik Stapel.

So what’s an electronic lab notebook?

An ELN, then, is some (any?) system that gives you the affordances I’ve described above while being digital in nature. In practice, this means a notebook that’s accessed via a computer (or, increasingly, a mobile device such as a tablet or smartphone), and stores information in digital form.

This might be a dedicated native app (this is the route taken by most industrial ELN options), giving you a lot of functionality right on your own desktop. Alternatively, it might be web-based, accessed using your choice of browser without any new software to be installed at all.

It might be standalone, existing entirely on a single computer/device with no need for network access. Alternatively it might operate in a client-server configuration, with a central database providing all the storage and processing power, and your own device just providing a window onto that.

These are all implementation details though. The important thing is that you can record your research using it. But why? What’s the point?

What’s wrong with paper?

Paper lab notebooks work perfectly well already, don’t they? We’ve been using them for hundreds of years.

While paper has a lot going for it (it’s cheap and requires no electricity or special training to use), it has its disadvantages too. It’s all too easy to lose it (maybe on a train) or accidentally destroy it (by spilling nasty organic solvents on it, or just getting caught out in the rain).

At the same time, it’s very difficult to safeguard in any meaningful sense, short of scanning or photocopying each individual page.

It’s hard to share: an increasingly important factor when collaborative, multidisciplinary research is on the rise. If I want to share my notes with you, I either have to post you the original (risky) or make a physical or digital copy and send that.

Of more immediate relevance to most researchers, it’s also difficult to interrogate unless you’re some kind of indexing ninja. When you can’t remember exactly which page recorded that experiment nine months ago, you’re in for a dull few hours searching through it page by page.

What can a good ELN give us?

The most obvious benefit is that all of your data is now digital, and can therefore be backed up to your heart’s content. Ideally, all data is stored on a safe, remote server and accessed remotely, but even if it’s stored directly on your laptop/tablet you now have the option of backing it up. Of course, the corollary is that you have to make sure you are backing up, otherwise you’ll look a bit silly when you drop your laptop in a puddle.

The next benefit of digital is that it can be indexed, categorised and searched in potentially dozens of different dimensions, making it much easier to find what you were looking for and collect together sets of related information.

A good electronic system can do some useful things to support the integrity of the scientific record. Many systems can track all the old versions of an entry. As well as giving you an important safety net in case of mistakes, this also demonstrates the evolution of your ideas. Coupled with some cryptographic magic and digital signatures, it’s even possible to freeze each version to use as evidence in a court of law that you had a given idea on a given day.
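As a toy illustration of the principle (and emphatically not how any particular ELN product implements it), here’s a Python sketch that chains each version’s hash to the previous one, so that quietly editing an old entry later would break every subsequent link:

import hashlib
import json
from datetime import datetime, timezone

def freeze_version(content, previous_digest=""):
    """Produce a tamper-evident record of one version of an entry."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "previous": previous_digest,  # chains this version to the last one
        "content": content,
    }
    # Any change to an old record changes its digest, and therefore
    # every digest that comes after it in the chain
    digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return record, digest

In practice you’d also want the digests countersigned by a trusted third party (a timestamping service, say) for them to carry any evidential weight.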

Finally, moving notes and data to a digital platform can set them free. Suddenly it becomes trivial to share them with collaborators, whether in the next room or the next continent. While some researchers advocate fully “open notebook science” — where all notes and data are made public as soon as possible after they’re recorded — not everyone is comfortable with that, so some control over exactly who the notebook is shared with is useful too.

What are the potential disadvantages?

The first thing to note is that a poorly implemented ELN will just serve to make life more awkward, adding extra work for no gain. This is to be avoided at all costs — great care must be taken to ensure that the system is appropriate to the people who want to use it.

It’s also true that going digital introduces some potential new risks: we’ve all seen the headlines about lost laptops and leaky cloud services. My own opinion is that there will always be risks, whether data is stored in the cloud or on dead trees in a filing cabinet. As long as those risks are understood and appropriate measures taken to mitigate them, digital data can be much safer than the average paper notebook.

One big stumbling block that still affects a lot of the ELN options currently available is that they assume that the users will have network access. In the lab, this is unlikely to be a problem, but how about on the train? On a plane or in a foreign country? A lot of researchers will need to get work done in a lot of those places. This isn’t an easy problem to solve fully, though it’s often possible with some forethought to export and save individual entries to support remote working, or to make secure use of mobile data or public wireless networks.

Summary

So there you have it. In my humble opinion, a well-implemented ELN provides so many advantages over the paper alternative that it’s a no-brainer, but that’s certainly not true for everyone. Some activities, by their very nature, work better with paper, and either way most people are very comfortable with their current ways of working.

What’s your experience of note-taking, within research or elsewhere? What works for you? Do you prefer paper or bits, or a mixture of the two?

  1. Yes, even theoretical science, in my humble opinion. I know, I know. The comment section is open for debate.


One of the best ways of getting started developing open source software is to “scratch your own itch”: when you have a problem, get coding and solve it. So it is with this little bit of code.

Scroll Back is a very simple Chrome extension that replicates a little-known feature of Firefox: if you hold down the Shift key and use the mouse wheel, you can go forward and backward in your browser history. The idea came from issue 927 on the Chromium bug tracker, which is a request for this very feature.

You can install the extension from the Chrome Web Store if you use Chrome (or Chromium).

The code is so simple I can reproduce it here in full:

document.addEventListener("wheel", function(e) {
  if (e.shiftKey && e.deltaX != 0) {
    window.history.go(-Math.sign(e.deltaX));
    return e.preventDefault();
  }
});
  • Line 1 adds an event listener which is executed every time the user uses the scroll wheel.
  • If the Shift key is held down and the user has scrolled (line 2), line 3 goes backward or forward in the history according to whether the user scrolled down or up respectively (e.deltaX is positive for down, negative for up).
  • Line 4 prevents any unwanted side-effects of scrolling.

The code is automatically executed every time a page is loaded, so has the effect of enabling this behaviour in all pages.
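For anyone wondering how it gets injected everywhere: the extension’s manifest declares the script as a content script matching all URLs. From memory (so treat the details as approximate, and the file name as a placeholder), it looks something like this:

{
  "manifest_version": 2,
  "name": "Scroll Back",
  "version": "1.0",
  "description": "Shift+scroll to go back and forward in history",
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["scrollback.js"]
    }
  ]
}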

It’s open source (licensed under the MIT License), so you can check out the full source code on github.


I run Linux on my laptop, and I’ve had some problems with the wifi intermittently dropping out. I think I’ve found the solution to this, so I just wanted to record it here so I don’t forget, and in case anyone else finds it useful.

What I found was that any time the wifi was idle for too long it just stopped working and the connection needed to be manually restarted. Worse, after a while even that didn’t work and I had to reboot to fix it.

The problem seems to be with the power-saving features of the wifi card, which is identified by lspci as:

01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8723BE PCIe Wireless Network Adapter

What appears to happen is that the card goes into power-saving mode, goes to sleep and never wakes up again.

It makes use of the rtl8723be driver, and the solution appears to be to disable the power-saving features by passing some parameters to the relevant kernel module. You can do this by passing the parameters on the command line if manually loading the module with modprobe, but the easiest thing is to create a file in /etc/modprobe.d (the name doesn’t matter, as long as it ends in .conf) with the following contents:

# Prevents the WiFi card from automatically sleeping and halting connection
options rtl8723be fwlps=0 swlps=0
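
If you want to test the effect without rebooting, it should be possible to reload the module by hand with the same parameters (assuming nothing else is holding the module in use):

# Unload the module, then reload it with power saving disabled
sudo modprobe -r rtl8723be
sudo modprobe rtl8723be fwlps=0 swlps=0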

This seems to be working for me now. It’s possible that only one of the parameters fwlps and swlps is needed, but I haven’t had a chance to test this yet.

A number of pages and forum threads helped me figure this out.


OpenRefine has a pretty cool feature. You can export a project’s entire edit history in JSON format, and subsequently paste it back to exactly repeat what you did. This is great for transparency: if someone asks what you did in cleaning up your data, you can tell them exactly, instead of giving them a vague, general description of what you think you remember doing. It also means that if you get a new, slightly-updated version of the raw data, you can clean it up in exactly the same way very quickly.

[
  {
    "op": "core/column-rename",
    "description": "Rename column Column to Funder",
    "oldColumnName": "Column",
    "newColumnName": "Funder"
  },
  {
    "op": "core/row-removal",
    "description": "Remove rows",
    "engineConfig": {
      "mode": "row-based",
// etc…

Now this is great, but it could be better. I’ve been playing with Python for data wrangling, and it would be amazing if you could load up an OpenRefine history script in Python and execute it over an arbitrary dataset. You’d be able to reproduce the analysis without having to load up a whole Java stack and muck around with a web browser, and you could integrate it much more tightly with any pre- or post-processing.

Going a stage further, it would be even better to be able to convert the OpenRefine history JSON to an actual Python script. That would be a great learning tool for anyone wanting to go from OpenRefine to writing their own code.

import pandas as pd

data = pd.read_csv("funder_info.csv")
data = data.rename(columns = {"Column": "Funder"})
data = data.drop(data.index[6:9])

This seems like it could be fairly straightforward to implement: it just requires a bit of digging to understand the semantics of the JSON that OpenRefine produces, and then the implementation of each operation in Python. The latter part shouldn’t be much of a stretch with so many existing tools like pandas.
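As a proof of concept, here’s a minimal sketch of the first approach. The operation names and JSON fields are taken from the history excerpt above; everything else (the function names, the stubbed-out row removal) is hypothetical:

import json
import pandas as pd

def rename_column(df, op):
    # "core/column-rename" carries the old and new names directly
    return df.rename(columns={op["oldColumnName"]: op["newColumnName"]})

def remove_rows(df, op):
    # "core/row-removal" describes the affected rows via facets in
    # "engineConfig"; translating those faithfully is the real work,
    # so it's left as a stub here
    raise NotImplementedError("facet translation not implemented yet")

OPERATIONS = {
    "core/column-rename": rename_column,
    "core/row-removal": remove_rows,
}

def apply_history(df, history_path):
    """Replay an exported OpenRefine history over a DataFrame."""
    with open(history_path) as f:
        for op in json.load(f):
            df = OPERATIONS[op["op"]](df, op)
    return df

Supporting a new operation would then just mean adding another entry to the OPERATIONS table.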

It’s just an idea right now, but I’d be willing to have a crack at implementing something if there was any interest — let me know in the comments or on Twitter if you think it’s worth doing, or if you fancy contributing.


If you’re viewing this on the web, you might notice there have been a few changes round here: I’ve updated my theme to be more responsive and easier to read on different screen sizes. It’s been interesting learning how to use the Bootstrap CSS framework, originally developed by Twitter, which makes it straightforward to put together responsive sites with a fixed or fluid grid layout.

I don’t have the means to test it on every possible combination of browsers and devices, so please let me know if you notice anything weird-looking!


This post is a little bit of an experiment, in two areas:

  1. Playing with some of my own data, using a language that I'm quite familiar with (Python) and a mixture of tools which are old and new to me; and
  2. Publishing that investigation directly on my blog, as a static IPython Notebook export.

The data I'll be using is information about my exercise that I've been tracking over the last few years using RunKeeper. RunKeeper allows you to export and download your full exercise history, including GPX files showing where you went on each activity, and a couple of summary files in CSV format. It's the latter that I'm going to take a look at: just an initial survey to see if there's anything interesting that jumps out.

I'm not expecting any massive insights just yet, but I hope you find this a useful introduction to some very valuable data wrangling and analysis tools.

Looking at the data

First up, we need to do some basic setup:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import display

This imports some Python packages that we'll need:

  • matplotlib: makes pretty plots (the %matplotlib inline bit is some IPython Notebook magic to make plots appear inline)
  • numpy: allows some maths and stats functions
  • pandas: loads and manipulates tabular data

Next, let's check what files we have to work with:

In [2]:
%ls data/*.csv
data/cardioActivities.csv  data/measurements.csv

I'm interested in cardioActivities.csv, which contains a summary of each activity in my RunKeeper history. Loading it up gives us this:

In [3]:
cardio = pd.read_csv('data/cardioActivities.csv',
                     parse_dates=[0, 4, 5],
                     index_col=0)
display(cardio.head())
                        Type  Route Name  Distance (mi)  Duration  Average Pace  Average Speed (mph)  Calories Burned  Climb (ft)  Average Heart Rate (bpm)  Notes             GPX File
Date
2015-03-27 12:26:34  Running         NaN           3.31     29:57          9:03                 6.63              297       48.45                       157    NaN  2015-03-27-1226.gpx
2015-03-21 09:44:25  Running         NaN          11.31   2:12:01         11:40                 5.14              986      283.41                       146    NaN  2015-03-21-0944.gpx
2015-03-19 07:21:36  Running         NaN           5.17     52:45         10:12                 5.88              423       75.23                       150    NaN  2015-03-19-0721.gpx
2015-03-17 06:51:42  Running         NaN           1.81     17:10          9:29                 6.32              144       23.27                       137    NaN  2015-03-17-0651.gpx
2015-03-17 06:21:51  Running         NaN           0.94      7:25          7:52                 7.64               17        3.65                       136    NaN  2015-03-17-0621.gpx

Although my last few activities are runs, there are actually several different possible values for the "Type" column. We can take a look like this:

In [4]:
cardio['Type'] = cardio['Type'].astype('category')
print(cardio['Type'].cat.categories)
Index(['Cycling', 'Hiking', 'Running', 'Walking'], dtype='object')

From this you can see there are four types: Cycling, Hiking, Running and Walking. Right now, I'm only interested in my runs, so let's select those and do an initial plot.

In [5]:
runs = cardio[cardio['Type'] == 'Running']
runs['Distance (mi)'].plot()
Out[5]:
[Plot: run distance (mi) over time]

We can notice two things straight away:

  • There's a gap at the start of 2014: this is probably where RunKeeper hasn't got information about the distance because my GPS watch didn't work right or something, and I don't want to include these in my analysis.
  • There's a big spike from where I did the 12 Labours of Hercules ultramarathon, which isn't really an ordinary run so I don't want to include that either.

Let's do some filtering (excluding those, and some runs with "unreasonable" speeds that might be mislabelled runs or cycles) and try again.

In [6]:
runs = runs[(runs['Distance (mi)'] <= 15)
            & runs['Average Speed (mph)'].between(3.5, 10)]
runs['Distance (mi)'].plot()
Out[6]:
[Plot: run distance (mi) over time, after filtering]

That looks much better. Now we can clearly see the break I took between late 2012 and early 2014 (problems with my iliotibial band), followed by a gradual return to training and an increase in distance leading up to my recent half-marathon.

There are other types of plot we can look at too. How about a histogram of my run distances?

In [7]:
runs['Distance (mi)'].hist(bins=30)
Out[7]:
[Plot: histogram of run distances]

You can clearly see here the divide between my usual weekday runs (usually around 3–5 miles) and my longer weekend runs. I've only been running >7 miles very recently, but I suspect as time goes on this graph will start to show two distinct peaks. There also seem to be peaks around whole numbers of miles: it looks like I have a tendency to finish my runs shortly after the distance readout on my watch ticks over! The smaller peak around 1 mile is where I run to the gym as a warmup before a strength workout.

How fast do I run? Let's take a look.

In [8]:
runs['Average Speed (mph)'].hist(bins=30)
Out[8]:
[Plot: histogram of average speeds]

Looks like another bimodal distribution. There's not really enough data here to be sure, but this could well be a distinction between longer, slower runs and shorter, faster ones. Let's try plotting distance against speed to get a better idea.

In [9]:
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
Out[9]:
[Plot: average speed vs. distance, all runs]

Hmm, no clear trend here. Maybe that's because when I first started running I was nowhere near as fit as I am now, so those early runs were both short and slow! What if we restrict it just to this year?

In [10]:
runs = runs.loc[runs.index > '2015-01-01']
runs.plot(kind='scatter', x='Distance (mi)', y='Average Speed (mph)')
Out[10]:
[Plot: average speed vs. distance, 2015 runs only]

That's better: now it's clear to see that, in general, the further I go, the slower I run!

So, as expected, no major insights. Now that I'm over my knee injury and back to training regularly, I'm hoping that I'll be able to collect more data and learn a bit more about how I exercise and maybe even make some improvements.

How about you? What data do you have that might be worth exploring? If you haven't anything of your own, there are plenty of open datasets available online to browse through.
