Scraping for gold at the Olympics

Tuesday 7 August 2012

Tagged with

Olympics
London 2012
Medals
Open data
Ruby
Screen scraping
Coding

What if it wasn’t all about the gold medals? The Olympic medal table is always ranked in order of gold medals first, then silver, then bronze.

That seems reasonable, but if you looked at the table at the end of 6 August, for example, you’d have seen that Germany had an impressive 22 medals, including 5 golds, but ranked one place behind Kazakhstan, who had only 7 medals, 6 of them gold.

So I thought it was time to do a few things I’ve wanted to try for a while: scrape some publicly available data, do something interesting with it, and write and deploy a Ruby webapp beyond my desktop.

Finding the data

It just so happens that the BBC’s medal table is marked up with some nice semantic attributes:

Each <tr> tag has two attributes: data-country-name and data-country-code;
Each <td> tag uses the class gold, silver or bronze and contains only the number of medals of that type for that country.
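
As a rough illustration of how those attributes can be read, here is a minimal Ruby sketch. It is not the scraper described in this post: the sample markup, the `parse_medal_rows` name and the silver/bronze splits are invented for illustration, and it uses REXML on a small well-formed fragment to stay dependency-free. A real page would more likely need a forgiving HTML parser, and cells carrying more than one class would need a looser match.

```ruby
require "rexml/document"

# Hypothetical, well-formed fragment shaped like the markup described above.
# The silver/bronze splits are illustrative, not taken from the real table.
SAMPLE_HTML = <<~HTML
  <table>
    <tr data-country-name="Kazakhstan" data-country-code="KAZ">
      <td class="gold">6</td><td class="silver">0</td><td class="bronze">1</td>
    </tr>
    <tr data-country-name="Germany" data-country-code="GER">
      <td class="gold">5</td><td class="silver">10</td><td class="bronze">7</td>
    </tr>
  </table>
HTML

# Returns one hash per country row: name, code and the three medal counts.
def parse_medal_rows(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//tr[@data-country-code]").map do |row|
    counts = %w[gold silver bronze].map { |medal|
      cell = REXML::XPath.first(row, "td[@class='#{medal}']")
      # A missing cell is treated as zero medals (an assumption).
      [medal.to_sym, cell ? cell.text.to_i : 0]
    }.to_h
    {
      name: row.attributes["data-country-name"],
      code: row.attributes["data-country-code"]
    }.merge(counts)
  end
end

parse_medal_rows(SAMPLE_HTML).each { |row| puts row.inspect }
```

Keying each row on the country code rather than its position in the table means the data still makes sense if the source page reorders its rows.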

Just scraping by

I could have just scraped that data from within the webapp, but I wanted a) to have a bit more robustness if the source page changed format or disappeared; and b) to make the data easily available to others.

So I wrote this London 2012 medal table scraper in ScraperWiki. ScraperWiki lets you write scrapers in Ruby, Python or PHP using their API and some standard parsing modules to scrape data and store it in an SQLite table. The data is then available as JSON via a REST API, and remains so even if the source page vanishes (it just sends you a notification so you can fix your scraper).

Let’s go Camping

I briefly thought about using Ruby on Rails, but that’s a pretty heavy solution to a very small problem, so instead I turned to Camping, a “web framework which consistently stays at less than 4kB of code.”

Camping is very MVC-based, but your whole app can live in a single file, like a simple CGI script.

Putting it all together

So, here’s my alternative Olympic medal table app, and here’s the code on GitHub.

What are the effects? Well, if you sort by total medals, there’s quite a big shake-up. Russia, with 41 medals (only 7 gold), shoot up from 6th to 3rd place, pushing Britain down to 4th. North Korea, on the other hand, drop down from 8th to 24th.

Using a weighted sum of the medals (with a gold worth 3 points, silver 2 and bronze 1) yields a similar but less dramatic upheaval, with Russia still up and North Korea still down, but GB restored to 3rd place.
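
The three orderings can be sketched in a few lines of Ruby. This is an illustration rather than the app's actual code: the `Country` struct, the method names and the tie-breaking rules are assumptions, and the silver/bronze splits are made up so that the totals match the Germany and Kazakhstan figures quoted earlier.

```ruby
# Hypothetical record type for one row of the medal table.
Country = Struct.new(:name, :gold, :silver, :bronze) do
  def total
    gold + silver + bronze
  end

  # Weighted sum: gold worth 3 points, silver 2 and bronze 1 by default.
  def weighted(gold_pts: 3, silver_pts: 2, bronze_pts: 1)
    gold * gold_pts + silver * silver_pts + bronze * bronze_pts
  end
end

# Official ordering: golds first, then silvers, then bronzes.
def rank_by_gold(countries)
  countries.sort_by { |c| [-c.gold, -c.silver, -c.bronze] }
end

# Total medals, with golds as an assumed tie-breaker.
def rank_by_total(countries)
  countries.sort_by { |c| [-c.total, -c.gold] }
end

# Weighted points, with golds as an assumed tie-breaker.
def rank_by_weighted(countries)
  countries.sort_by { |c| [-c.weighted, -c.gold] }
end

# Totals match the figures quoted above; the splits are illustrative.
COUNTRIES = [
  Country.new("Kazakhstan", 6, 0, 1),
  Country.new("Germany", 5, 10, 7)
].freeze

puts "By gold:     #{rank_by_gold(COUNTRIES).map(&:name).join(', ')}"
puts "By total:    #{rank_by_total(COUNTRIES).map(&:name).join(', ')}"
puts "By weighted: #{rank_by_weighted(COUNTRIES).map(&:name).join(', ')}"
```

Negating each key keeps everything in a single `sort_by` call, and the weights are keyword arguments so other scoring schemes are easy to try.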

Can you think of a different way to sort the medals? Stick a feature request on the GitHub tracker, or fork it and have a go yourself.


Open Source #ioe12

Monday 12 March 2012

Tagged with

Copyright
Licensing
Open source
IOE12
Openness

This blog post is part of my contribution to the open online course
Introduction to Openness in Education.

Ok, so the last post was a bit long. Like essay long. I started writing and
then I kept on writing til I’d got it all out. I’m pretty happy with the
content, but it took too long to write and it takes too long to read.

So here’s my pithy(ish) introduction to Open Source.

‘Open’, as you might expect, refers to the free sharing of stuff. The ‘Source’
part refers to source code: the human-readable form in which computer
software is written. So we’re talking about software distributed in
human-modifiable form, not the compiled, click-to-run executable most people
are used to.

There are two key arguments in favour of Open Source: the moral one and the
economic one.

The moral argument goes like this. In the beginning only a few dedicated
hackers had computers. They put their craft first, worked together well and
shared their developments with each other. They were able to learn from and
build on each other’s code, and everyone was happy.

As the computer industry grew, the business types who started up companies to
exploit new developments realised that they could make money by keeping the
source code secret and only releasing the executable code to customers. So they
made non-free software the norm and the world a poorer place for it.

But there are many people who feel this is naive and unrealistic. To convince
them, you also need the economic argument.

Conventional wisdom has it that if you try to build software with a team that’s
too large, you get bogged down in communication between team-members and the
whole enterprise becomes unmanageable.

This is fairly accurate for closed-source software: the nature of commercial
companies is that everything has to be managed in a certain way and everyone
has to be in communication with everyone else.

Mathematicians may recognise this as a complete graph, in which every node is
connected to every other node, and the problem is that the number of links
grows much more quickly than the number of people.
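
To put rough numbers on that: a complete graph of n people has n(n-1)/2 links. A quick Ruby sketch (the method name `complete_graph_links` is just for illustration):

```ruby
# Number of pairwise links when everyone talks to everyone else.
def complete_graph_links(people)
  people * (people - 1) / 2
end

[5, 10, 50, 100].each do |n|
  puts "#{n} people -> #{complete_graph_links(n)} links"
end
# 5 people -> 10 links
# 10 people -> 45 links
# 50 people -> 1225 links
# 100 people -> 4950 links
```

Growing the team twentyfold, from 5 to 100, multiplies the number of links by almost 500.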

Open source projects, like Linux, involve huge numbers of people, so on paper
they shouldn’t work. But on a large open source project, most people contribute
only to a small part of the whole, only communicating with a few others. Only a
small number, by dint of personality type or happenstance, coordinate with many
others to keep the whole thing together.

And because these projects don’t suffer from the communication difficulties,
they can capitalise on the much larger group of minds working on a problem.

Thanks to this effect, hobbyist programmers really can build high quality
software, and that’s why open source projects such as Linux and Apache dominate
the modern web between them.

But why should we use open source software?

As Cory Doctorow points out in his recent talk “The coming war on general
computation”, computers are fully general: there’s no program that they can’t
in theory run.

That scares a lot of people: it means you can run whatever you like, even
software that (shock horror!) makes it possible to break the law. So should
governments or corporations be restricting what we can run?

Cars can be used to commit crime, but only a police state would try to restrict
where you can drive to, or insist on lo-jacking each one. Open source software
is controlled by the community, and so is naturally resistant to this type of
centralised control. You may not agree, but I think that’s worth defending.

And as Benjamin Franklin once wrote, “Those who would give up Essential Liberty
to purchase a little Temporary Safety, deserve neither Liberty nor Safety.”


Sharing and flaky butter buns

Sunday 5 February 2012

Tagged with

Copyright
Recipes
Openness
Sharing
IOE12

You know you’ve made it on the web when you’re asked to take something down.

The story

For Christmas I received, amongst other lovely presents, a copy of Dan Lepard’s
book The Handmade Loaf. I really enjoy breadmaking, with all of the processes
and the minor biological miracle that turns flour and water into a cohesive
loaf.

Since then I’ve been trying out at least one, sometimes several, recipes from
the book each weekend. Two weeks ago I made flaky butter buns, posting a photo
of the result (delicious) on Twitter and Google+.

I was asked for the recipe, but as a) I’m fairly conscientious and b) I’ve been
learning a lot about copyright recently, this raised a question: is it a breach
of copyright to share someone else’s recipe?

A couple of online conversations and one Guardian article later, I had my
answer: recipes are not protected by copyright under either UK or US law. A
recipe is an idea, not the expression of an idea, and is therefore not
copyrightable.

A recipe may be covered by a patent or trade secret, but for a patent to be
granted it would have to differ significantly from any other previous recipe
and, having been published in a book, it clearly cannot be a trade secret.

So conscience satisfied, I went ahead and posted the recipe for flaky butter
buns on my other blog.

Rather than use Lepard’s own words, which would have infringed copyright, I
wrote it in my own style, which tends to skip over steps — you either need to
have some baking knowledge or take the hint and buy the book. I also raved (as
I have done before) about the book itself, including an Amazon link so that
readers could go ahead and buy it for themselves.

I felt in doing so that I behaved appropriately both legally and morally, and
thought no more about it.

Several days later, I got an email notifying me of a comment on the post (this
rarely happens). As it turns out, this comment was from a member of Lepard’s
team accusing me of infringing his copyright.

Now as I’ve said, I don’t believe that I did infringe copyright (if you’re a
lawyer, I’d love to hear a legal opinion on this), but since I respect Lepard
as a professional and small businessman I chose to respect his wishes (or at
least those of his employee) and remove the recipe anyway.

The point

In the end this isn’t a question about the law. It’s about whether sharing (and
letting other people share) your stuff is a good idea or not.

Even the most cursory Google search for recipe titles suggests that, should I
want to, I could recreate the entire collection for free. But one of the
reasons I like this book is that it’s more than a collection of recipes. It’s a
well crafted book about bread. In addition to the recipes it contains both
photographs (by the author) and descriptions of encounters with bakers around
the world.

Yes, you can get most if not all of those recipes for free online and not have
to pay a penny, but anyone who’s going to do that was never going to buy the
book in the first place. In fact you could get the whole experience of the book
for free, just by going down to your local library.

On the other hand, I like to think that a few of my friends might have been
motivated to buy their own copy of the book on the basis of my recommendation —
word of mouth being the best form of advertising and free to boot.

So now there is no recipe, and no endorsement, and no link to buy the book on
Amazon. I’ll probably think twice before recommending the book in the future
(wait, didn’t I just do that again three paragraphs ago?).

I’d be kidding myself if I thought this would make the slightest difference to
the book’s sales, but you have to wonder: if you have a book to sell, is it
worth paying someone to spend time trawling the internet (which is a pretty big
place) just to ensure your book is the only place the contents can be found?

People will still send recipes by email, or photocopy them, or pass them on by
word of mouth. They will clip them out of the paper, note them down in
notebooks and then post the clippings to loved ones.

This has always happened and always will, and though some instances are covered
by copyright law, it’s completely unenforceable in such cases.

The internet makes this sharing more visible, but it presents an opportunity
too. The classic example is YouTube: increasingly rights owners are taking the
option to place ads around potentially infringing videos rather than blindly
demanding takedowns.

By the way, Martin Weller’s made his whole book, The Digital Scholar,
available online for free, and some mugs (me included) still seem to be paying
for it. Perhaps we’re all just idiots.

All I really want to say is this: if you have a book to sell (or any other
creative work), consider carefully the pros and cons of permitting parts to be
shared freely.

Policing takes time and time is money, and even if the pros and cons balance
out, all you’re doing is spending that money to achieve zero result. Perhaps
that time would be better spent engaging with your readers in positive ways.


Open Licensing #ioe12

Sunday 15 January 2012

Tagged with

Copyright
Licensing
Creative Commons
IOE12
Openness

This blog post is part of my contribution to the open online course
Introduction to Openness in Education.

Copyright

At the heart of the various forms of “open” lies the concept of intellectual
property: who owns it, who can use it and for what.

A physical object, such as the computer I’m writing this blog post on, is in
one place at a time, and its ownership is pretty clear cut: I paid for it and
it’s in my house, and if you took it without my permission we’d call that
theft.

Things get trickier when you start talking about creative works. If I write a
piece of music and you make a copy, I still have the piece of music, but so do
you. I can take a photograph of a painting by Degas, and it stays hanging in
the gallery, but in some sense I have a copy that I can enjoy independently of
the original work.

If this situation goes unchecked, then there’s not a lot of incentive to become
an artist, or a composer, or a writer. Even if you charge for your work there’s
nothing to stop me buying one copy and then selling hundreds, for which you
would see no profit whatsoever.

Under most modern legal systems, the concept of copyright exists to right this
imbalance. It does this by allowing the creator of a work the opportunity to
exploit that work in whatever way they see fit, effectively creating a
monopoly.

As the creator of a work, it’s still possible to grant certain rights to third
parties, and this is done by the granting of licenses. This is the mechanism
which allows you to “sell” rights to a work in exchange for money or some other
consideration.

Fair use/fair dealing

If you were to film an interview in the high street of your town, you might
think that it would be difficult to infringe copyright in any way. If you’re
not infringing copyright, you don’t need to pay anyone for a license. Yet if,
say, a TV set in the background was showing reruns of The Simpsons, then you
could well be in for a visit from lawyers representing the Fox Broadcasting
Company.

Some jurisdictions include a concept of “fair use” (or fair dealing in the UK),
which permits such incidental reuses under a specific set of circumstances.
This can make documentary-making, for example, much easier.

However, many organisations (Fox being a common example) are quite happy to
threaten legal action and demand that you pay tens or hundreds of thousands of
pounds(/dollars/euros/etc.) for a license, even if you may in fact be covered
by fair use rules. They are able to do this because most people are unaware of
their legal rights or, even if they are aware, do not have the money to fight
the ensuing lawsuit.

Even if the law gives you a fair use right to use some work or other, other
organisations to which you might sell your own work may not be so forgiving.
Because of the litigation culture surrounding copyright, a lot of organisations
take a very paranoid approach and insist on rights being cleared and licenses
purchased even if they’re not strictly necessary.

Orphaned works

The situation becomes worse when the holder of the rights that must be cleared
cannot be found. This usually happens when no contact details can be found for
the creator of a work, or when those that can be found are out of date. In many
cases, it’s impossible even to know whether the rights holder is still alive,
and works like this are referred to as “orphaned works”.

In the early days of copyright this would not have been a problem: for
copyright to exist it was necessary for the creator to explicitly assert their
rights, and to renew them periodically.

However it is now the case in the US and the UK that copyright automatically
exists for the lifetime of the creator and for 70 years after their death. If
the creator has passed away, their estate still owns the copyright, but may be
impossible to trace until they discover the breach.

For this reason, it is almost impossible to safely use orphaned works — if
you do, you do so at your own risk.

Open licensing

As you can see, copyright creates incentives to create, but the way it’s
currently implemented can also have a chilling effect on certain types of
creation, especially those that involve mashing up existing content.

There’s not a lot most of us can do about the depredations of Fox and their
ilk, other than lobbying our MPs for a change in the law. But thankfully we can
make it easier for others to make use of our own works.

Open licensing gives creators legal tools to relinquish some or all of their
rights over a piece of work, in the interests of supporting the creativity of
others.

Creative Commons was set up to provide a set of open licenses which creators
can use to make it very easy to understand what can and can’t be done with
their work.

The key terms which can be applied by the standard Creative Commons licenses
are:

Attribution: the creator of the work must be acknowledged in any works
which incorporate it;
Share-alike: the work can only be used if the resulting work is
released under the same license;
Non-commercial: the work may only be used if the user doesn’t profit
financially from doing so;
No derivatives: the work may only be redistributed unchanged from its
original form.

By combining these terms, it is possible to specify exactly what rights you
want to retain on each individual work.

In higher education, we often find ourselves needing a photo or video to
illustrate a point in a class or at a conference, or increasingly in a blog
post (like this one). Thanks to Creative Commons, finding content to be used
legally in this way is as easy as doing a simple web
search — no more excuses!

Conclusion

This was intended to be a short blog post, and it’s already longer than I
intended! There is a whole raft of other important issues, such as the
creeping extension of copyright terms, which I haven’t had space to cover, but
hopefully I’ll come back to those some other time.

For now, I hope you’ve got a good idea of why open licensing is necessary and
how you can apply it to your own creative works. It’s worth noting that this
whole blog is released under a CC license — just scroll to the bottom!

In writing this post, I made heavy use of this open licensing material, which
I encourage you to take a look at if you want to learn more.

Photo credit: Ioan Sameli via Flickr


The Research Technologist part 2: research focus

Tuesday 10 January 2012

Tagged with

Research technologist
Research
Job description
Reflection

This is the second part in my exploration of what it means to be a research
technologist. If you haven’t already, check out part 1: proactivity and
innovation.

Research focus

There’s another area where the role diverges from that of the typical member of
IT staff: a focus on the unique needs of researchers. Network infrastructure,
file storage and email are necessary but not sufficient to meet the needs of a
modern researcher.

It’s vitally important to pay close attention to the unique needs of
researchers and to find and adapt appropriate tools and techniques to serve
those needs as well as possible. Research is after all the primary business of
a university, alongside teaching.

So we need to find ways to fulfil the needs not just of an institution’s
researchers, but of a faculty’s researchers, or a department’s or even a single
research group’s.

I actually think that once we start doing this well, there will be a lot more
commonality than there appears to be right now. But first we’ve got to get
there.

Serving the long tail

The much abused Pareto Principle holds that in many circumstances 80% of your
profit comes from 20% of the people/products/whatever. But we’re not looking to
profit from our users, we’re looking to serve them. Questions of how to fund
that notwithstanding, taking this attitude means you’re ignoring 80% of the
people!

If there’s one thing we’ve learned from successes like eBay, Amazon and many
more, it’s that if we’re smart we can use modern technology to efficiently
provide large numbers of niche products and services without drowning in the
overhead traditionally associated with trying to do so.

Research attitude

Again, this can be a problem for centralised IT services, because it’s seen as
inefficient for them to put significant R&D time into things which may only
ever be of use to a minority of their users.

In an academic department, however, the culture is different. Success in
research demands innovation, which requires risk. Scientists and engineers, for
example, intrinsically understand the need to experiment, and no-one questions
the idea that many of those experiments will fail.

Notice that word fail. In this context failure is not a loss, it’s merely a
failure to produce the anticipated results. Most researchers still don’t like
failure — they’re human after all. But they learn not to get so hung up on it,
because if you set up your experiment right (which is really the key to the
whole enterprise) then you learn as much or more from failing as you do from
succeeding.

And that’s really the point. We want to help our researchers to do their jobs
even better than they already do, which means we need to learn, which in turn
means we need to make mistakes. There are no lectures and degree courses to
teach us about ideas which don’t exist yet.

So to steal one of those trite little phrases life coaches and the like love so
much: fail early, fail often, fail smart and learn from it.


My first MOOC — Introduction to Online Education

Tuesday 10 January 2012

Tagged with

MOOC
IOE12
Meta

I’ve decided to sign up and join David Wiley’s MOOC, Introduction to Open
Education 2012. A MOOC (Massive Open Online Course) is an online course,
typically run by a lecturer at a university, which is freely accessible and
built around the ideas of connectivism and social learning.

The content of the course, which is about the various ‘kinds’ of openness
currently practised in higher education, fits nicely with what I’m doing at the
moment so I thought I’d give it a try.

Although I could theoretically find, study and blog about all of the content in
this course on my own, I think that the social aspect and the defined set of
objectives (in the form of “badges”) combined make it more likely that I will
follow through.

Let’s see if that’s actually true…


Amazon Kindle — 12 months on

Monday 2 January 2012

Tagged with

Amazon
Kindle
eBooks
Gadgets

I’ve now had my Kindle for just over 12 months — it
was last year’s Christmas gift from my wonderful wife — and I can quite
honestly say that it’s completely changed the way I read.

I’ve always been a keen reader, but sometimes found it difficult to find time
to read while also having a book available. I also tended only to buy books one
at a time when I was in a physical bookshop. As a consequence, most of my
reading happened at home, either in bed or in the bath, and I would get through
books at around one a month.

Since getting my Kindle (well, since first getting the Kindle app for iPhone 14
months ago) I have read 45 books. I never used to read non-fiction books, but
have just finished my third of the last few months. My decision on what to read
next would generally wait until I’d finished my last book, but now I have 14
books waiting to be read and about another 20 on an Amazon wishlist waiting to
be purchased.

What’s caused this change? As you might guess, it’s a combination of several
things. Compared to a paper book, my Kindle weighs almost nothing, so I can
slip it in a bag or a pocket. I can hold it in one hand while drinking tea, or
lie on my back and read, both of which I found too tiring to do with paper
books.

I also have iPhone and desktop Kindle apps, which are always in sync. I always
have my current book with me, so I have many more opportunities to read.

When I finish a book, I can immediately start the next, whether I have one
already lined up or I need to go online and buy one. I’ve basically turned into
a chain-reader, going from book to book without pause.

Irritatingly, the prices do not reflect the near-zero marginal cost of
distributing digital content — if you shift content in the volume that Amazon
can, your income is almost pure profit.

However, digital books are still cheaper than the print editions. The
difference for popular fiction is pretty small, but I appreciate it
nonetheless. For specialist non-fiction, on the other hand, where low volumes
make print copies prohibitively expensive, digital editions come at a
significant discount — often half price or better in my experience.

I actually wrote the entire first draft of this post without mentioning either
screen quality or battery life. Both are so good that it didn’t even occur to
me to mention them.

There are downsides too. Because I’m locked into Amazon’s infrastructure, I
can’t lend books to friends or family (this feature still hasn’t been enabled
outside the US). I also can’t donate books to charity shops once I’ve finished
them.

Both of these facts still make me uneasy, and I’m not sure that I want all my
books to be controlled by a single company for the rest of my life. And I
haven’t even started on the problem of how many books I need to read on Kindle
to break even on the carbon footprint, or even whether that’s possible.

That said, my pragmatic side is winning at the moment. Reading on Kindle just
works, and it seems to suit my lifestyle much better than books made of dead
tree.

I know a lot of people have been given Kindles this Christmas, so I’d love to
know if any of my readers have thoughts on this.


The Research Technologist part 1: proactivity and innovation

Thursday 15 December 2011

Tagged with

Research technologist
ICT
Job description
Reflection

I began writing this a couple of months ago, shortly after ALT-C, the
Association for Learning Technology Conference. Then it turned into “one of
those posts” that I had to perfect before I could publish it. And that’s silly,
so I’m going to publish it now and continue it in further posts, because this
is a blog, not a thesis.

Anyway, as is often the case at conferences when you meet a lot of people, I
kept having to answer the question “What do you do?”. My actual job title is
“ICT Project Manager”, which, while impressive-sounding, doesn’t go any way
towards explaining what I do. In the end, I came up with the following stock
response:

“I’m a research technologist: I have a very similar role to learning
technologists, except that I support academics as researchers instead of
as teachers.”

There are a few roles out there which sound similar, or which have similar
names, so I thought I’d mention a few things that set this role apart from
other similar sounding jobs. This post is the first part of a series exploring
those aspects.

First, a disclaimer

I’m not foolish enough to think I’m the only person doing this type of job, or
to pick out these features as important, or even to come up with that name. I’m
quite certain there are people doing this in IT departments, in research
development departments, certainly in academic departments and quite possibly
in e-learning departments too. It’s more that there seems to be no standard
position for this role (except where institutions have dedicated e-research
teams) and I’m setting out to find other people in similar roles to share ideas
with.

Proactivity and innovation

Although part of my role is to support existing systems and respond to queries
from users, that’s not the whole of it. I feel it’s important to keep abreast
of the latest technology innovations and explore how they can be used to
support research. This contrasts with the typical approach of central
university IT services, which generally have a core set of “supported” software
and services with rigorous procedures and checks in place to control changes to
that set.

I don’t wish to suggest that this centralised model is inappropriate: on the
contrary it’s absolutely necessary. University IT services have the very
challenging job of providing an acceptable and consistent standard of service
to a huge and diverse user base. To do this efficiently it’s necessary to make
sure that all IT staff have a reasonable understanding of every supported
service, which just can’t happen if that set of services is too large.

The trouble is that as well as providing users with a very stable, high level
of support for essential services (networking, email, payroll and so on), it
also tends to stifle innovation. If a new service is to be offered, a lot of
time and resources must be invested in doing so at the level of existing
services; quite a risk if there’s no guarantee that the new service will
succeed. That means that there’s no scope to start something small, with the
option of either growing it organically if it takes off or letting it die
peacefully if it’s not right.

I’ll be exploring this further soon, but for now I’d be interested in your
take, especially if you disagree or recognise some of what I say in your own
role.


Nose to the blogstone

Saturday 3 December 2011

Tagged with

Meta
Blogging
Web design
Admin

Well, I’m just back from the launch meeting of the JISC Managing Research Data
programme, of which our Research360 project at Bath is a part, and coming
to terms with the fact that blogging is now an inescapable part of my job.

Looks like it’s time to get back into my blogging rhythm once more. Time to
make a few tweaks that I’ve been planning to the layout too. Let me know what
you think.
