Going the 'eXXtra' mile: new technologies for enriching cultural heritage data

Enabling automatic translation for enrichment

An API tool developed by project partner Pangeanic detects the language used in Europeana metadata, and allows it to be machine translated. Named the Heritage Metadata Automatic Translation Service (HM ATS), this tool is part of a suite of semantic enrichment tools developed by Europeana XX.

To create the tool, Pangeanic built 10 neural machine translation engines (translating Italian, German, Czech, Greek, French, Swedish, Catalan, Dutch, Polish, and Spanish to English). They used training data from Pangeanic’s own repositories and open data on the internet. Pangeanic also employed translators to translate a limited number of records from Europeana repositories in order to have Europeana-specific training data for several languages.

The tool was used to translate and enrich approximately two and a half million records during the project. Pangeanic successfully extended and fine-tuned the tool to meet the performance requirements of such a massive volume of data. The API code is available for you to use yourself.

To evaluate and validate the quality of machine translation, partners also set up a translation validation system (based on LabelStudio). Cultural heritage professionals and native speakers of relevant languages have validated more than 2,700 translations using this system. The feedback was overwhelmingly positive, confirming the high quality of the neural machine translation and that it works well for the domain of digital cultural heritage.

Validated translations will be used to further improve machine translation engines in the Europeana Translate project, in which Pangeanic is also involved. The goal of this project is to help Europeana progress on the implementation of its multilingual strategy, by providing metadata translations that will enable better search and display of its collections across their native languages and the users' languages.

Enrichment for datasets

SAGE, a web-based tool for producing, enriching, publishing, accessing and managing RDF datasets, was developed by the National Technical University of Athens (NTUA) for Europeana XX. RDF (Resource Description Framework) is a standard model for representing the content of a dataset. RDF data can be directly imported or generated from diverse data sources and formats, organised in datasets, and enriched using annotators. These enrichments can then be manually validated. All datasets, including any annotations, can be published in RDF stores, indexed and accessed through API calls.
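To make the idea of RDF data and enrichment more concrete, here is a minimal sketch in Python. It is not SAGE code: the record URI, the place URI and the helper names `term` and `to_ntriples` are hypothetical, chosen only for illustration. It shows a record described as subject-predicate-object statements, with one extra statement standing in for an enrichment that links the record to a place.

```python
def term(value, literal=False):
    """Format one RDF term in N-Triples syntax (URIs in <>, literals in quotes)."""
    if literal:
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        return f'"{escaped}"'
    return f"<{value}>"


def to_ntriples(triples):
    """Serialise (subject, predicate, object, object_is_literal) tuples."""
    return "\n".join(
        f"{term(s)} {term(p)} {term(o, is_literal)} ."
        for s, p, o, is_literal in triples
    )


# Hypothetical record and place URIs; the predicates are Dublin Core terms.
record = "http://example.org/record/1"
triples = [
    # Original metadata: a title stored as a plain literal.
    (record, "http://purl.org/dc/elements/1.1/title", "Poster, 1925", True),
    # Enrichment: a link from the record to a place resource.
    (record, "http://purl.org/dc/terms/spatial",
     "http://example.org/place/athens", False),
]

print(to_ntriples(triples))
```

Each printed line is one statement; an enrichment simply adds statements about the same record, which is why annotations can be published alongside the original data.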

Thanks to SAGE, selected parts of published datasets can now also be annotated and enriched through external API services, such as tools linking data to relevant Wikidata, DBpedia, GeoNames and other resources, or tools that detect occurrences of vocabulary terms in the data. Once enrichments are made in SAGE, they are then manually validated through a system that allows bulk validations using text grouping and text frequency sorting, assignment of validation tasks to multiple users, and close monitoring of the overall validation process.
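Bulk validation by text grouping and frequency sorting can be pictured with a short Python sketch. This is an illustration under stated assumptions, not how SAGE implements it: the function name `group_for_bulk_validation`, the tuple layout and the example URIs are all hypothetical. The idea is that identical text values with the same proposed link are grouped, so a validator can accept or reject the most frequent ones in a single step.

```python
from collections import defaultdict


def group_for_bulk_validation(annotations):
    """Group annotations by (normalised text, proposed target) and
    return the groups ordered from most to least frequent."""
    groups = defaultdict(list)
    for record_id, text, target in annotations:
        # Normalise whitespace and case so trivially different
        # spellings fall into the same group.
        key = (text.strip().lower(), target)
        groups[key].append(record_id)
    # Sort by descending group size, then by key for a stable order.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical annotations: (record id, matched text, proposed link).
annotations = [
    ("rec1", "Athens", "http://example.org/place/athens"),
    ("rec2", "athens ", "http://example.org/place/athens"),
    ("rec3", "Paris", "http://example.org/place/paris"),
    ("rec4", "Athens", "http://example.org/place/athens"),
]

for (text, target), record_ids in group_for_bulk_validation(annotations):
    print(f"{len(record_ids)} x {text!r} -> {target}: {record_ids}")
```

Reviewing the largest groups first means a single decision covers many records, which is what makes manual validation feasible at the scale of tens of thousands of records.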

The SAGE tool was also used in the Pagode project to automatically enrich more than 20,000 records. It will also be used in the CRAFTED project to analyse metadata fields and text extracted by artificial intelligence content analysis tools in order to identify named entities and reduce uncertainty about them. The ultimate aim is to enrich more than 100,000 records and enable user validation and assessment of automatically extracted entities.

Find out more

You can explore all of the tools developed under the Europeana XX project (and other Generic Services projects) on the Europeana Services and Tools page.