Lessons from the library: Behind the UK’s web archive

Dale Vile, originally published on Computer Weekly

Published/updated: March 2011

After all, the role of libraries in general has been increasingly threatened by the march of online media. Why drive to a physical building to look something up when we have facilities like Google and Wikipedia a couple of mouse clicks away?

The truth is that the British Library has some of the most experienced and talented technologists in the IT and communications space, applying cutting-edge technology to solve some pretty tough and very interesting problems.

And their mission is not to bring a traditional institution into the modern age, because it’s already there. Indeed, in areas such as digitisation and information storage, archiving and retrieval, it would put many big corporate IT departments to shame. A quick browse of www.bl.uk will provide a flavour of how some of this manifests itself on the Web, though a lot more goes on behind the scenes in support of academic institutions and researchers around the world.

I got a first-hand glimpse of all this when I visited the library’s facility in Boston Spa a few months ago and was hosted by Nicki Clegg, who manages the technical architecture group. Nicki oversees the evolution of the library’s information systems’ architecture, and leads a team that provides technical architectural expertise to programmes and projects.

One of these is the Web Archiving programme, which has been selectively preserving UK websites through a permission-based process since 2004 and making them accessible through the UK Web Archive. The programme acknowledges that a lot of the UK’s history now plays out on the web. It also works on the premise that website content is very often transient in nature. As any website designer or online media strategist will tell you, the key to a successful site is keeping the content fresh, current and relevant to your audience.

The upshot is that the way a site looks today, and the content it contains, is likely to be quite different compared to, say, a year ago. Some sites may even re-interpret past events based on current knowledge and circumstances. The volatile nature of a lot of online content means that if you want to see how life and current affairs were reflected on the Web at certain points in time, or indeed how popular websites were impacting our lives, then you must take periodic snapshots of what was published.

This is exactly what the UK Web Archive does. So in politics, for example, if you want to compare how Conservative Party policy was presented on www.conservatives.com in the lead-up to the last general election, a month afterwards, and today, you can do that. A similar snapshotting process is currently applied to around 9,000 websites on topics ranging from the arts, through conservation, to men’s and women’s health. The idea is to capture a cross-section of key sites and other content that is representative of the broader UK web.

That’s a lot of information, which raises the subject of navigation. At the moment, content can either be searched by website title or browsed by special collection. Some of the collections relate to ongoing themes such as ‘Live Art’ and ‘Energy’, while others correspond to key historical events such as the London terrorist bombings of 2005, the Swine Flu outbreak of 2009, or the General Election in 2010. In each collection, you’ll find snapshots of sites run by public and private organisations, and even blogs that are relevant to the collection topic.

So what goes on behind the scenes to drive a project of this scale and complexity?

Lewis Crawford, technical lead of the web archiving team, doesn’t pretend it’s easy. First, some basic logical design challenges need to be dealt with, such as where to start when building a snapshot (given that some sites have multiple entry points) and how deep to go when following links (ie, how many levels of a site to pull out). Related to this is the issue of "live leakage", which basically means that snapshots will invariably contain links that resolve to the current (as opposed to historical) version of external information. Such problems, which are related to the web-like and dynamic nature of the internet, keep Crawford and the team quite busy.
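To make those design choices concrete, here is a minimal sketch in Python. It is not the library's crawler: the link graph, the function names and the `max_depth` parameter are all hypothetical, and the site is a hard-coded dictionary rather than anything fetched from the web. It walks a site breadth-first from more than one entry point, stops at a fixed depth, and then lists the links that point outside the snapshot, which are the ones that would "leak" to the live web on replay.

```python
from collections import deque


def plan_snapshot(links, seeds, max_depth):
    """Breadth-first walk from several entry points, stopping at max_depth.

    Returns a dict mapping each captured page to the depth it was found at.
    Pages that are not keys of `links` (external pages) are never captured.
    """
    captured = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in captured or url not in links:
            continue
        captured[url] = depth
        if depth < max_depth:
            for target in links[url]:
                queue.append((target, depth + 1))
    return captured


def leaking_links(links, captured):
    """Links from captured pages to pages that are not in the snapshot."""
    return sorted(
        (source, target)
        for source in captured
        for target in links[source]
        if target not in captured
    )


# Hypothetical site: two entry points, one external link.
SITE = {
    "example.org/": ["example.org/news", "example.org/about"],
    "example.org/shop": ["example.org/news"],
    "example.org/news": ["example.org/news/2010", "other.example/story"],
    "example.org/about": [],
    "example.org/news/2010": ["example.org/news/2010/may"],
    "example.org/news/2010/may": [],
}

snapshot = plan_snapshot(SITE, seeds=["example.org/", "example.org/shop"], max_depth=1)
leaks = leaking_links(SITE, snapshot)
```

With a depth limit of one, the news page is captured but the pages it links to are not, so both of its outbound links are reported. Raising `max_depth` shrinks the list of leaks at the cost of a bigger snapshot, which is the trade-off described above.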

There’s then the issue of permissions, given that many of the sites archived contain intellectual property that is not in the public domain. The library works on a permission basis at present, but the whole area of Legal Deposit, mandating that copies of certain types of information are submitted for storage and reference, is evolving to embrace electronic as well as printed output. The scope of the UK Web Archive programme is therefore destined to become significantly larger in the future.

The current and future scale of the problem clearly puts a sharp focus on technical architecture. As you can imagine, Crawford and his team have investigated (and continue to investigate) all relevant options, including cloud computing. Initial trials with Amazon EC2, for example, threw up problems with data transport, not just the practicalities of moving large amounts of information into and out of the cloud, but also high service provider charges.

The current architecture is therefore a hybrid one, with local Tomcat servers running a version of the Wayback open source archiving software, working in tandem with very selective cloud-based processing. The clever piece (or one of them) is hosting of the archive index on EC2. The index is a critical component from a performance and scalability perspective, yet is very compressed and therefore easier to move around. Sending index requests to the cloud for fast resolution, while keeping the heavy-lifting content-serving mechanism local, is a good compromise.
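The split between a small index and bulky content can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the class names, the date format and the storage locations are invented, and it does not use the Wayback software or any Amazon API. The index only maps a URL and a date to a storage location, while a separate store holds the page content itself.

```python
import bisect


class CaptureIndex:
    """Stand-in for the compact index (hypothetical layout, held in memory)."""

    def __init__(self, entries):
        # entries: iterable of (url, ISO date string, storage location)
        self._timestamps = {}
        self._locations = {}
        for url, timestamp, location in sorted(entries):
            self._timestamps.setdefault(url, []).append(timestamp)
            self._locations.setdefault(url, []).append(location)

    def lookup(self, url, timestamp):
        """Return (capture date, location) of the latest capture on or before timestamp."""
        timestamps = self._timestamps.get(url, [])
        position = bisect.bisect_right(timestamps, timestamp)
        if position == 0:
            return None
        return timestamps[position - 1], self._locations[url][position - 1]


class LocalStore:
    """Stand-in for the locally hosted content store."""

    def __init__(self, blobs):
        self._blobs = blobs

    def read(self, location):
        return self._blobs[location]


def replay(index, store, url, timestamp):
    """Resolve the capture via the index, then fetch the content from the store."""
    hit = index.lookup(url, timestamp)
    if hit is None:
        return None
    captured_at, location = hit
    return captured_at, store.read(location)


# Hypothetical captures of one page at three points in time.
index = CaptureIndex([
    ("example.org/", "2010-04-01", "store-a:0"),
    ("example.org/", "2010-06-06", "store-a:1"),
    ("example.org/", "2011-03-01", "store-b:0"),
])
store = LocalStore({
    "store-a:0": "pre-election page",
    "store-a:1": "post-election page",
    "store-b:0": "current page",
})

result = replay(index, store, "example.org/", "2010-12-31")
```

The point of the design is that `lookup` only ever handles short keys and locations, so the index can live wherever lookups are fastest, while `read` stays next to the storage that holds the heavy content.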

But having so much of the UK’s web held in one place also provides opportunities for advanced analytics. What, for instance, were the most frequently spoken-about topics in the 2010 election? How has the emphasis on different aspects of health changed over the years? In order to explore questions such as these, Crawford and the team have been experimenting with Hadoop, an open source project, together with the IBM BigSheets business intelligence solution.
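As a rough illustration of the kind of question involved, the Python sketch below counts terms across a handful of pages using a map step and a reduce step, which is the general pattern Hadoop applies at scale. It is a toy under stated assumptions: the page texts and the stopword list are invented, nothing here calls Hadoop or BigSheets, and real topic analysis would need far more than word counts.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stopword list, just enough for the sample text below.
STOPWORDS = {"the", "and", "to", "in"}


def map_page(text):
    """Map step: turn one page into a Counter of its non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical page texts standing in for archived snapshots.
PAGES = [
    "The economy and the deficit dominate the debate",
    "Debate turns to the economy and to schools",
    "Schools, hospitals and the economy in focus",
]

totals = reduce(reduce_counts, map(map_page, PAGES), Counter())
top_term = totals.most_common(1)
```

Because each page is counted independently before the partial results are merged, the map step can be spread across many machines, which is what makes the approach attractive for an archive of this size.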

As Crawford points out, they are only just scratching the surface of what’s possible with such technologies at the moment, but the potential to analyse the way the nation is evolving through the highly dynamic and increasingly democratised medium of the web is very exciting.

We’ll be looking at the technology aspects of some of the other programmes being run by the library in future articles. In the meantime, the key takeaway is that whether it’s cloud computing or ‘big data’, the British Library is demonstrating that the value is there. But the devil is in the detail when it comes to dealing with bigger problems. The other thing, of course, is that the role for experienced and talented IT professionals becomes even more pronounced as technology advances.

