Close Encounters with AI: an interview on automatic semantic enrichment

Marco Rendina: Let’s start from the basics. What is semantic enrichment?

Eirini Kaldeli: Semantic enrichment is the process of adding new semantics to unstructured data, such as free text, so that machines can make sense of it and build connections to it. In the case of textual metadata that describe cultural heritage items, these can be analysed and augmented with controlled terms from Linked Open datasets or vocabularies, such as Wikidata or the Getty Art & Architecture Thesaurus (AAT). These terms are commonly referred to as annotations and can represent concepts and attributes (such as ‘Costume’ or ‘Renaissance’), persons, locations, organisations or chronological periods. For example, the strings ‘Leonardo da Vinci’ and ‘da Vinci, Leonardo’ can both be linked to the Wikidata item representing the Italian Renaissance polymath.

MR: Why is it important to enrich metadata with terms from Linked Open datasets or vocabularies?

EK: Semantic enrichment adds meaning and context to digital collections and makes them more easily discoverable. Given its importance, it has been a main concern and focus of efforts by the Europeana Initiative as well as individual aggregators and data providers.

Firstly, linked data makes textual metadata unambiguous. For example, the string ‘Leonardo da Vinci’ may also refer, depending on the context, to the Italian airport or a battleship with the same name. Each of these concepts is represented by a dedicated URI (Uniform Resource Identifier) in Wikidata, so linking the text to the correct URI makes clear what the text refers to.

Secondly, linked data allows us to retrieve additional information about a certain entity, build connections between different resources and contextualise them. For example, it allows us to link items tagged with the term ‘ring’ with the broader concept ‘jewellery’ and interconnect them with items enriched with the term ‘bracelet’, which is also an instance of ‘jewellery’.

Finally, linked data usually comes with translations, improving the capabilities for multilingual search. This enables those using online repositories to browse and search collections at the so-called ‘semantic layer’: someone who searches for ‘κόσμημα’ (the Greek word for ‘jewellery’) will be able to discover items described as rings as well as bracelets.
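As a rough illustration of searching at the ‘semantic layer’, here is a minimal Python sketch. It assumes a tiny in-memory vocabulary in which each concept has multilingual labels and an optional broader concept. The identifiers (`ex:ring` and so on), the item names and the function names are all hypothetical; they are not real Wikidata URIs or part of any Europeana API.

```python
# Toy vocabulary. Identifiers and structure are hypothetical, not real Wikidata data.
VOCAB = {
    "ex:jewellery": {"labels": {"en": "jewellery", "el": "κόσμημα"}, "broader": None},
    "ex:ring": {"labels": {"en": "ring", "el": "δαχτυλίδι"}, "broader": "ex:jewellery"},
    "ex:bracelet": {"labels": {"en": "bracelet", "el": "βραχιόλι"}, "broader": "ex:jewellery"},
    "ex:costume": {"labels": {"en": "costume"}, "broader": None},
}

# Items and the concept identifiers they have been enriched with (made-up examples).
ITEMS = {
    "item-1": {"ex:ring"},
    "item-2": {"ex:bracelet"},
    "item-3": {"ex:costume"},
}


def concepts_for_label(query):
    """Return the concepts that carry the query as a label in any language."""
    q = query.casefold()
    return {
        concept_id
        for concept_id, concept in VOCAB.items()
        if q in (label.casefold() for label in concept["labels"].values())
    }


def is_within(concept_id, ancestor_id):
    """True if concept_id equals ancestor_id or sits below it via 'broader' links."""
    current = concept_id
    while current is not None:
        if current == ancestor_id:
            return True
        current = VOCAB[current]["broader"]
    return False


def semantic_search(query):
    """Find items annotated with the queried concept or any narrower concept."""
    targets = concepts_for_label(query)
    return sorted(
        item_id
        for item_id, annotations in ITEMS.items()
        if any(is_within(a, t) for a in annotations for t in targets)
    )
```

With this toy data, a search for ‘κόσμημα’ resolves to the hypothetical jewellery concept and returns both the ring and the bracelet items, even though neither record contains the Greek word.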

MR: Alexandros, enriching metadata requires effort and resources that cultural heritage institutions often lack. How can digital technologies help address this challenge?

Alexandros Chortaras: Cultural heritage institutions can use state-of-the-art technologies to automate the manual, time-consuming, and often mundane process of metadata enrichment. Natural language processing tools can be used to analyse textual metadata and detect and classify named entities, such as person or location names, mentioned in unstructured text. Machine learning approaches are extensively used for the task of named entity disambiguation, which decides whether, for example, the reference to ‘Leonardo da Vinci’ in the text refers to the Italian polymath or to the battleship. Depending on the text characteristics, such as its length and language, the vocabulary that we wish to link it to, and the type of entities we wish to detect, one has to combine the tools that are most appropriate for the specific task. For example, from our experience with previous projects such as CRAFTED, for certain tasks with a well-defined restricted context, even a simple lemmatization and string matching approach may be more appropriate than complex ML-based algorithms.
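To give a feel for the simple end of that spectrum, here is a minimal Python sketch of normalisation followed by exact string matching against a small vocabulary. It is not the approach used in CRAFTED or SAGE: the vocabulary, the `ex:` identifiers and the function names are hypothetical, and the crude plural stripping is only a stand-in for a real lemmatizer.

```python
import re
import unicodedata

# Hypothetical vocabulary: normalised label -> made-up concept identifier.
VOCAB_LABELS = {
    "leonardo da vinci": "ex:leonardo-da-vinci",
    "ring": "ex:ring",
    "bracelet": "ex:bracelet",
}


def normalise(text):
    """Lower-case, reorder 'Last, First' names, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", text).casefold().strip()
    if text.count(",") == 1:
        last, first = (part.strip() for part in text.split(","))
        text = f"{first} {last}"
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def naive_singular(token):
    """Crude stand-in for lemmatization: strip a trailing English plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def match(text):
    """Return the concept identifier for the text, or None if nothing matches."""
    norm = normalise(text)
    if norm in VOCAB_LABELS:
        return VOCAB_LABELS[norm]
    lemma = " ".join(naive_singular(token) for token in norm.split())
    return VOCAB_LABELS.get(lemma)
```

In this sketch both ‘Leonardo da Vinci’ and ‘da Vinci, Leonardo’ resolve to the same hypothetical identifier, and ‘Rings’ matches the entry for ‘ring’. It does no disambiguation at all, which is why it only suits tasks where the context is restricted enough that a label has one plausible meaning.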

MR: But can I fully trust the results of an automatic algorithm? What if it makes mistakes?

AC: Indeed, automatic algorithms that analyse free text for named entity recognition and disambiguation make mistakes. The accuracy depends on the task at hand and the algorithm applied. For example, short textual descriptions that are common in metadata lack context and thus ML algorithms trained on Wikipedia articles may result in incorrect matches.

What’s more, even if the automatically detected links are correct, they may be considered undesirable in a certain context. For example, linking metadata records with terms representing colours may be important for a fashion collection, but it may be undesirable for describing a manuscript that happens to mention a certain colour. Thus, human inspection and validation of automatic annotations are indispensable. However, since there are often thousands of automatic annotations, manual validation can be a very resource-intensive process. On a practical level, humans should review a selected sample of the annotations and, depending on the results and the objective, decide on appropriate filtering criteria.
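One way to turn a reviewed sample into a filtering criterion is sketched below in Python. It assumes, purely for illustration, that each automatic annotation carries a numeric confidence `score` and that a reviewer has marked each sampled annotation as accepted or rejected. The field names, function names and the idea of filtering by a score threshold are assumptions, not a description of how SAGE works.

```python
import random


def draw_sample(annotations, size, seed=0):
    """Draw a reproducible random sample of annotations for human review."""
    rng = random.Random(seed)
    return rng.sample(annotations, min(size, len(annotations)))


def precision_at(reviewed, threshold):
    """Share of accepted annotations among those scoring at or above the threshold.

    Returns None when no reviewed annotation reaches the threshold.
    """
    kept = [r for r in reviewed if r["score"] >= threshold]
    if not kept:
        return None
    return sum(1 for r in kept if r["accepted"]) / len(kept)


def choose_threshold(reviewed, candidates, target_precision):
    """Pick the lowest candidate threshold whose sample precision meets the target.

    Returns None if no candidate reaches the target on the reviewed sample.
    """
    for threshold in sorted(candidates):
        precision = precision_at(reviewed, threshold)
        if precision is not None and precision >= target_precision:
            return threshold
    return None
```

Choosing the lowest threshold that meets the target keeps as many annotations as possible while respecting the quality bar set for the collection. The estimate is only as reliable as the sample, so a small or unrepresentative sample can give a misleading threshold.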

MR: A final question for Eirini. There are many algorithms and libraries out there, but it seems that considerable technical knowledge is required to set them up. How does AI4Culture help cultural heritage institutions to take advantage of those technologies?

EK: In the context of the AI4Culture project, we are working on a platform, called SAGE, developed by the National Technical University of Athens. SAGE facilitates the semantic enrichment of cultural heritage metadata by offering a suite of established annotators (enrichment templates) configured to serve the needs of the sector. The platform supports the whole enrichment workflow, from data import and automatic production of semantic annotations to human validation and data publication in the format expected by Europeana. The tool has been successfully used to enrich cultural heritage metadata in several applications (including through the CRAFTED and Europeana XX projects). In the context of AI4Culture, it has been extended to hide the technical complexity of automatic semantic enrichment algorithms and to support seamless interoperability with the common European data space for cultural heritage. To this end, the platform supports formats relevant to cultural heritage metadata, such as EDM (Europeana Data Model), and facilitates the direct import of metadata from cultural heritage related sources such as Europeana.eu or the MINT tool used by several Europeana aggregators.

For now, interested people can try out SAGE here. The source code is available on GitHub (frontend, backend). You can learn how to use SAGE by following a series of video tutorials and reading the Wiki instructions.

Find out more

In September 2024, the AI4Culture project will launch a platform where open tools, like the SAGE tool for semantic enrichment presented above, will be made available online, together with related documentation and training materials. Keep an eye on the project page on Europeana Pro for more details and stay tuned on the project’s LinkedIn and X accounts!