Text Mining #4: My Wish List to the GLAM: Providing access to data for text mining

Guest Blog Post by Beatrice Alex, Research Fellow at the School of Informatics at the University of Edinburgh and a member of the Edinburgh Language Technology Group.

I specialise in text mining, and over the last eight years obtaining access to data has been crucial to my work. More recently, I have been part of many digital humanities projects, including Trading Consequences and Palimpsest, where we mined a series of different textual sources from archives and libraries.

In the case of Trading Consequences, a project on mining information on nineteenth-century commodity trading in the British Empire, we were able to text-mine, among other collections, summaries from the Kew Gardens Directors' Correspondence archive. In Palimpsest, a project on mining Edinburgh's literary landscape, the largest datasets we obtained access to were a collection of books from HathiTrust and the Nineteenth Century Books Collection from the British Library Labs.

Getting access to such datasets required us to know about them to begin with. In many cases, our project ideas are inspired and enabled by relevant data. Applications for research funding are always much stronger if we can provide evidence for being able to work with a given dataset. If galleries, libraries, archives and museums (GLAMs) are interested in sharing their available datasets widely for text mining and other research purposes, then they need to be proactive in publicising and explaining how to get hold of them.

One institution which is already doing this very well is the HathiTrust. Their website contains detailed information on available documents and metadata, explains how to access subsets of data via their API and gives clear guidance on how to get hold of their entire collection. If the data is not already accessible online then it is also useful to know details of the content of a collection, any other useful metadata, what the format is, how big it is in terms of storage space and who to contact.
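As a rough illustration, the details listed above could be captured in a small machine-readable description published alongside a collection. The sketch below shows only one possible shape. Every field name and value in it is a hypothetical placeholder, not a description of any real collection or an existing metadata standard.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class CollectionDescription:
    """Hypothetical summary a data holder might publish for a collection."""
    title: str
    content_summary: str
    data_format: str            # e.g. "XML", "plain text", "Excel"
    size_gb: float              # approximate storage space required
    has_full_text: bool         # True if OCRed text is available, not only images
    metadata_fields: List[str]  # document-level metadata that is provided
    contact: str                # person or address to approach for access


def to_json(description: CollectionDescription) -> str:
    """Serialise the description so that it can be posted on a website."""
    return json.dumps(asdict(description), indent=2, sort_keys=True)


# Placeholder values for illustration only.
example = CollectionDescription(
    title="Example Letters Collection",
    content_summary="Summaries of nineteenth-century correspondence",
    data_format="XML",
    size_gb=12.5,
    has_full_text=True,
    metadata_fields=["author", "date", "location"],
    contact="data-enquiries@example.org",
)

if __name__ == "__main__":
    print(to_json(example))
```

Even a short description like this would let a research team judge, before making contact, whether a collection fits a project and how much storage it needs.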

Although GLAM institutions may be interested in making their data available for text mining, they might be worried that the data they carefully curated over many years will be copied and misused for things that they did not anticipate. Assuring them that their data is safe and that we will not release it to the public is very important. Our aim is merely to identify patterns in the data, usually ones related to a particular hypothesis and domain in question.

In both Palimpsest and Trading Consequences, wherever possible, we provided links to the original source documents held by the data providers, thereby encouraging access to the original content. Having URLs available for the source is very important.

Once we are aware of a dataset and know its holder is happy to share it, then there need to be mechanisms in place to transfer this data. This applies whether an individual scholar or a research group approaches a GLAM for data. The transfer can be handled in different ways and often depends on the size of the data, the support available in the data holder institution and their technical expertise. With the copyright exceptions for text mining research put in place in the UK in October 2014, and lobbying to extend these exceptions Europe-wide, more and more data holders with copyrighted material will be approached to get access to their data.

If an agreement needs to be signed then it's preferable if a sample agreement already exists, which can be modified if necessary. This will avoid unnecessary delays in sorting out the legalities of sharing and using the data. In my research group, this has been more of an issue where a commercial entity owns the copyright of a digitised dataset than where a GLAM institution itself is keen to share its data as open data or under a permissive licence.

In the past, we have received data in many different ways: as downloads, via authorised scraping, by email, via APIs, on disk, by rsync (a Unix program used for file transfer and synchronisation), you name it. For example, we received the Directors' Correspondence archive from Kew Gardens in several Excel files as email attachments. In that case, Helen Hartley, the Project Digitisation Manager at Kew's Herbarium, Library, Art & Archives, provided very helpful support over email to explain the content and format of the data. For larger datasets, email transfer and support might not be an option.

In the case of the HathiTrust data, a couple of hundred thousand world public domain documents, we simply had to run an rsync command, which gave us instant computational access to their data. The advantage of rsync is that it can be rerun when new documents are added to the collection or when the copyright status of existing documents changes, thereby keeping a changing collection up to date. Some institutions may not have the resources to support automatic syncing of data. However, if their data is in high demand, then putting some type of automatic sharing mechanism in place (e.g. via the cloud or downloads) will be more cost-effective in the long term.
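To give a flavour of what such a repeatable sync might look like, here is a minimal sketch that builds and runs an rsync command from Python. It is not the command we ran for HathiTrust: the server address, module name and local directory are hypothetical placeholders, and the choice of options is an assumption about what a mirroring setup would typically need.

```python
import subprocess
from typing import List


def build_rsync_command(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build an rsync command that mirrors `source` into `destination`."""
    command = [
        "rsync",
        "--archive",   # recurse and preserve timestamps and permissions
        "--compress",  # compress data during transfer
        "--delete",    # remove local files that no longer exist at the source
    ]
    if dry_run:
        command.append("--dry-run")  # only report what would change
    command += [source, destination]
    return command


def sync_collection(source: str, destination: str, dry_run: bool = True) -> int:
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(source, destination, dry_run)).returncode


if __name__ == "__main__":
    # Hypothetical rsync module and local directory, for illustration only.
    cmd = build_rsync_command("rsync.example.org::public-domain/", "./collection/")
    print(" ".join(cmd))
```

Rerunning the same command later transfers only what has changed. The `--delete` option is what would remove local copies of documents withdrawn at the source, so a dry run first is a sensible precaution.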

In the case of the British Library Nineteenth Century Books data, we received the entire collection, including all the images and the full text, on several hard disks. For text mining purposes, we require access to the full text and not only the digitised images. So if a collection has already been optically character recognised (OCRed) in-house, then giving us access to the full electronic text will save us and other teams the valuable time needed to re-OCR it. This might be an obvious point to make, but from conversations with data holders it does not always seem to be apparent, and online search interfaces often only present the images to users. Also, any available document-level metadata can be very useful. For example, knowing the location that a text is about would help us to run the Edinburgh Geoparser with a locality setting and restrict place name disambiguation to the specified area.

Figure 1: Delivery of the British Library Nineteenth Century Books Collection on hard disk for Palimpsest.

In terms of data formats, our text mining tools work with XML, so that format is our preference. However, we have the expertise to convert other formats into XML first. In fact, we usually spend time in the initial phase of any project doing data preparation, which includes format conversion. While there are already efforts to use standards in different domains (TEI, MARC etc.), each dataset comes with its own individual characteristics and we can rarely re-use existing conversion scripts without making at least some small adjustments. We have a lot of experience in data preparation, which GLAM institutions might not necessarily have, so I believe that making the data available is more vital than determining the optimal format it should be in. That said, having a consistent, well-formed format definitely helps us to get started more quickly.
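To illustrate the kind of format conversion meant here, the following sketch turns tabular data, such as a spreadsheet exported as CSV, into simple XML. It is not one of our actual conversion scripts. The element names (`collection`, `record`, `field`) and the sample row are made up for the example, and a real dataset would almost certainly need its own adjustments.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text: str, root_tag: str = "collection") -> str:
    """Convert CSV text with a header row into a simple XML document."""
    root = ET.Element(root_tag)
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        record = ET.SubElement(root, "record", id=str(number))
        for column, value in row.items():
            if column is None:
                continue  # skip surplus cells that have no header
            # Store the column name as an attribute so that headers with
            # spaces or punctuation do not have to be valid XML tag names.
            field = ET.SubElement(record, "field", name=column)
            field.text = (value or "").strip()
    # ElementTree escapes special characters such as "&" when serialising.
    return ET.tostring(root, encoding="unicode")


# Made-up sample row, for illustration only.
sample = "title,summary\nLetter 1,Request for seeds & plants\n"

if __name__ == "__main__":
    print(rows_to_xml(sample))
```

Building the output with an XML library rather than by string concatenation means the result stays well-formed even when the source text contains characters that need escaping.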

So in short, my wish list to data providers from the GLAM sector who are interested in sharing their data is:

publicise the data you would like to be used for text mining purposes
give us information on what a collection or dataset contains (metadata, content, size, format)
tell us how we can get hold of it
find a mechanism to share the data easily
if there are copyright issues, draw up a template agreement (which can be modified if necessary)
provide the full text if you have it (not just the images), ideally in a consistent and well-formed format
provide document-level metadata, if possible
provide a URL for each source document, if possible, so that we can link back to you

Merry data sharing.