And the quest for better metadata quality goes on: an update from the DQC

16 January 2017

In March 2016 we announced the creation of a Europeana Data Quality Committee (DQC) to work on the various facets of data quality. Data quality is a Hydra that is difficult to defeat, but the Committee is certainly challenging the beast, as the work accomplished over the past year shows.

Hercule tuant l'Hydre de Lerne | Cort Cornelis, 1533 environ-1578 (graveur) ; Floris, Frans, 1516-1570 (d'après). Image: Bibliothèque Municipale de Lyon, Public Domain

Supporting user needs with quality metadata - The need for high-quality metadata in Europeana is motivated by its impact on search performance, on the overall Europeana user experience and on the re-use of the data. Rather than treating data quality as a theoretical exercise, the DQC decided to focus on it from the perspective of the data's intended use. This resulted in a series of usage scenarios reflecting information retrieval requirements. These scenarios provide context and guiding principles for the work of the DQC.

Defining clearer requirements - One way for Europeana to improve data quality has been to define mandatory elements as part of the Europeana Data Model (EDM). However, making an element mandatory does not always lead to good data quality: the required elements are not necessarily in the source metadata and are often created by providers on the fly, which sometimes results in duplicated values. Although our first impulse was to make more elements mandatory, we had to resist it!

The DQC agreed on two new categories for the current EDM elements: mandatory elements and enabling elements. Mandatory elements are required as a fundamental minimum for all metadata descriptions, whereas enabling elements support specific desirable (but optional) functionalities for a given usage scenario or set of usage scenarios. The DQC has formulated a list of recommendations for mandatory elements, which will be incorporated into the EDM mapping guidelines. These recommendations confirm some existing EDM elements as mandatory, define new ones as recommended, and remove the constraints for others.

The next step is to define a list of enabling elements. This new set of recommendations should highlight the importance of some elements in the realisation of the usage scenarios defined earlier. In the meantime, we also suggested clarifications in the definitions of some elements.

Measuring metadata quality - Assessing the quality of its data is crucial for Europeana. The DQC has started work on a completeness rating that identifies the most complete metadata descriptions. This work was largely motivated by DQC member Péter Király’s individual effort as part of his Metadata Quality Assurance Framework. Completeness is measured along different profiles, such as the presence of mandatory fields or the appropriate tagging of field values (including translations) within a record with language tags (e.g., 'en' for English and 'fr' for French). Once complete, this work is expected to yield a metric for metadata completeness that can be used to inform data reusers and Europeana officers of a dataset's quality, and to improve search and ranking on Europeana Collections. The DQC is also discussing ways of visualising these calculations in charts or other graphic representations. Examples can be found here.
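To make the idea of measuring completeness along profiles more concrete, here is a minimal Python sketch of two such profiles: the share of mandatory fields that are filled in, and the share of values that carry a language tag. It is an illustration only, not the DQC's or the Metadata Quality Assurance Framework's implementation. The record layout, the function names, the sample record and the short list of fields treated as mandatory are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: a record maps a field name to a list of
# (value, language_tag) pairs; the tag is None when absent.
Record = Dict[str, List[Tuple[str, Optional[str]]]]

# Illustrative subset only; NOT the DQC's actual list of mandatory elements.
MANDATORY_FIELDS = ["dc:title", "edm:type", "edm:rights", "edm:dataProvider"]


def has_value(record: Record, field: str) -> bool:
    """True if the field holds at least one non-blank value."""
    return any(value.strip() for value, _ in record.get(field, []))


def mandatory_completeness(record: Record) -> float:
    """Share of the assumed mandatory fields that are present and non-empty."""
    present = sum(has_value(record, f) for f in MANDATORY_FIELDS)
    return present / len(MANDATORY_FIELDS)


def language_tag_coverage(record: Record) -> float:
    """Share of non-blank values that carry a language tag."""
    values = [
        (value, lang)
        for pairs in record.values()
        for value, lang in pairs
        if value.strip()
    ]
    if not values:
        return 0.0
    tagged = sum(1 for _, lang in values if lang)
    return tagged / len(values)


# Made-up sample record, for illustration only.
example: Record = {
    "dc:title": [("Hercules slaying the Hydra", "en"),
                 ("Hercule tuant l'Hydre", "fr")],
    "edm:type": [("IMAGE", None)],
    "edm:rights": [("", None)],  # present but blank, so it does not count
}

print(mandatory_completeness(example))
print(language_tag_coverage(example))
```

Keeping each profile as a separate function mirrors the idea that completeness is measured along several dimensions; how the individual scores would be combined into a single rating is left open here.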

Identifying errors in metadata values - While the completeness measure focuses on the presence of a given field in a metadata record, it does not indicate the quality of the value that field holds. The Committee has gathered a list of problem patterns for metadata values (e.g., data normalisation issues) that affect search and interfere with ranking algorithms. The next step will consist of testing new technologies to (1) identify the problem patterns in the data and (2) solve some of these issues (via normalisation, for instance). We encourage data providers working with Europeana to help us flag more problems so that they can be fixed as early as possible in the delivery of the data!
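As a rough illustration of what identifying problem patterns and normalising values could look like, the Python sketch below flags a few patterns in a record. The patterns themselves (blank values, placeholder strings, stray whitespace, a title repeated as the description) are assumptions chosen for the example and do not reproduce the Committee's list; the function names and the sample record are likewise hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical placeholder strings; the DQC's own pattern list is not reproduced here.
PLACEHOLDERS = {"unknown", "n/a", "untitled", "-"}


def normalise(value: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def find_problems(record: Dict[str, List[str]]) -> List[str]:
    """Return human-readable flags for the assumed problem patterns."""
    problems = []
    for field, values in record.items():
        for value in values:
            clean = normalise(value)
            if not clean:
                problems.append(f"{field}: empty value")
            elif clean.lower() in PLACEHOLDERS:
                problems.append(f"{field}: placeholder value {clean!r}")
            elif clean != value:
                problems.append(f"{field}: needs whitespace normalisation")
    # Flag a title that is merely repeated as the description.
    titles = {normalise(v).lower() for v in record.get("dc:title", [])}
    descriptions = {normalise(v).lower() for v in record.get("dc:description", [])}
    if (titles & descriptions) - {""}:
        problems.append("dc:title and dc:description hold the same value")
    return problems


# Made-up sample record, for illustration only.
record = {
    "dc:title": ["  Portrait  of a lady "],
    "dc:description": ["Portrait of a lady"],
    "dc:creator": ["Unknown"],
}

flags = find_problems(record)
for flag in flags:
    print(flag)
```

Comparing normalised values rather than raw ones means the duplicate check still fires when the two fields differ only in spacing or letter case, which is the kind of issue normalisation is meant to absorb.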

Discussions were not limited to these topics. The relation between data quality and the representation of events, dates and language normalisation, as well as the distinction between type and genre information, also attracted attention and may become topics for further work in the near future.

You can follow the work of the Data Quality Committee here.