Using computer vision tools for historical newspaper analysis: SIAMESE and Europeana Newspapers
An interview with Melvin Wevers, Digital Humanities researcher at KNAW, and Clemens Neudecker, coordinator of Europeana Newspapers
Melvin Wevers, a researcher in the Digital Humanities Group of the KNAW Humanities Cluster, created —a tool that analyses adverts in historical newspapers— together with Juliette Lonij, as a Researcher-in-Residence at the Royal Library of the Netherlands. Clemens Neudecker, a researcher at the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, is working on making Europeana Newspapers —a huge set of historical newspapers— available through Europeana Collections. Here, they tell us more about how their areas of expertise come together....
Tell us about SIAMESE - the tool you’ve developed to analyse images in newspapers.
Melvin: SIAMESE uses machine-learning techniques (like Convolutional Neural Networks and Approximate Nearest Neighbour algorithms) to detect shapes and objects, and search for similar images. For instance, it can recognise visual elements representing a car in an advert, and come up with similar car ads. This technology isn’t new in the world of computer vision, but applying it to historical objects brings new challenges and opportunities.
Why are machine learning algorithms useful for Europeana Newspapers?
Clemens: Firstly, Europeana not only stores the metadata about the newspapers, but also the images and the OCR-ed full text. The second big reason is that throughout the project we have arranged for the majority of the content to be licensed as public domain. Both the images and text are openly licensed, and all the metadata is licensed as CC0. Thirdly, the newspapers have a wide variety of content.
Melvin: I see opportunities in the international scope of Europeana Newspapers. For instance, multinational companies created big advertising campaigns in the past, including newspaper ads for different countries. Looking at how they compare and differ based on their local context is interesting to me.
Clemens: Researchers are keen to use the Europeana Newspapers corpus to conduct transnational comparison work. They find it natural to go to Europeana first instead of going to the national libraries since the data in Europeana is harmonised in a common format and available through a single API. There is a constant influx of new projects, proposals and ideas. Now is the time for Newspapers!
What opportunities are there for computer vision in the research of historical sources in general?
Melvin:. Newspaper ads are often combinations of different objects, and being able to identify and classify them would make a large-scale study of historical visual imagery possible. The most important development would be the creation of a set of historical images on which the tools can be trained. To create these complex algorithms, you need a big dataset where all the objects are already labelled by humans: this is a big and complex task.
Clemens: Image-based methods create a lot of possibilities for the analysis of historical newspapers. SIAMESE looks at adverts, but there are a lot of other graphical elements which might facilitate or enable more advanced newspaper analysis. Obituaries, for instance, are graphically unique items in newspapers and are very interesting to the whole family history community.
What interesting research have you come across recently that uses computer vision analysis on historical sources?
Melvin: In Switzerland, he Replica project, for instance, analyses historical artworks. There’s also a lot of other interesting research that isn’t historical, but this is mostly due to the fact that we don’t yet have a good training set for analysis of historical data.
Clemens: Oceanic Exchanges, is a transatlantic ‘digging into data’ project with partners from the US, Mexico and some European partners including Germany, the Netherlands, Finland and the UK. One of the tools used aims to detect reuse of newspaper articles. The impresso project, a Swiss-based research project with partners in Luxembourg, enables critical text mining of historical newspaper archives, taking into account multilingual perspectives.
What are your next projects?
Melvin: I’m working as a postdoctoral researcher and I would like to keep on looking at how ideas and concepts shift through time, and how I can detect those in text and images.
Clemens: I’m currently working on making Europeana Newspapers available in Europeana Collections. All the content will be available through IIIF which I’m excited about, and we should be able to launch the Europeana Newspapers thematic collection in the course of 2018.
More information
The SIAMESE code is on GitHub, and you can test the tool on the KB labs site. The dataset of advertisements that was used to train the SIAMESE tool is now also available for reuse by researchers, on request access from the Royal Library of the Netherlands.