Training our image classification model

In a previous post, we introduced you to the Europeana Foundation’s image classification pilot, for which we had selected a target vocabulary and gathered a dataset to train a model to classify images from digitised cultural heritage. In this post, we analyse the results of the training, identify some challenges with our current approach, and propose future lines of work!

A model for single label classification

The dataset we gathered for our image classification pilot was suitable for training a single label classification model - that is, a model that outputs a single category per image. The labels or categories from the training dataset are also known as the ‘ground truth’, meaning that those are the true or correct labels for the given images.

We used a type of convolutional neural network as our classifier for the images: a mathematical model with a layered structure loosely inspired by the functioning of the brain. Convolutional neural networks are deep learning models designed to extract relevant information from images, and they are the usual choice for computer vision applications.

In our case, the input of the model was an image, and the output was a probability distribution over all the categories of the target vocabulary. It gave each category a number between 0 and 1 that is often interpreted as a confidence score. The model was then trained by iteratively making predictions on images from the dataset, comparing those predictions with the ground truth, and adjusting the model to reduce the difference between the two.
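To make the idea of a probability distribution over categories more concrete, here is a minimal sketch in plain Python. It assumes the model's raw scores are turned into confidences with a softmax and compared with the ground truth using a cross-entropy loss, which is the usual setup for single label classifiers but is not confirmed as the pilot's exact configuration. The category names and raw scores are hypothetical.

```python
import math

def softmax(raw_scores):
    """Turn raw model scores into confidences that sum to one."""
    shift = max(raw_scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Loss is small when the true category gets a high confidence."""
    return -math.log(probs[true_index])

# Hypothetical categories and raw scores for one image (assumptions)
categories = ["building", "sculpture", "photograph"]
raw_scores = [2.0, 0.5, 1.0]

probs = softmax(raw_scores)
# The ground truth label for this image is assumed to be "building"
loss = cross_entropy(probs, categories.index("building"))

for name, p in zip(categories, probs):
    print(f"{name}: {p:.3f}")
print(f"loss: {loss:.3f}")
```

During training, a loss like this is computed over many images and the model's parameters are adjusted step by step to make it smaller.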

Once the model was trained, we assessed its performance by testing it on unseen images and checking whether the prediction made by the model corresponded to the concept depicted in the image. We also employed an Explainable AI algorithm that helped us to understand the output of the model by visualising the regions of interest for each of the output categories. This showed us which areas of the image were most relevant for each category, which provided clues about the inner workings of the model.
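One simple way to summarise this kind of check is top-1 accuracy: the fraction of unseen images for which the highest-scoring category matches the ground truth. The sketch below is an illustration only; the metric, the confidence scores and the labels are assumptions, not the pilot's reported evaluation.

```python
def top1_accuracy(predictions, ground_truth):
    """Fraction of images whose predicted label matches the true label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical confidence scores for four held-out images (assumptions)
test_scores = [
    {"building": 0.7, "sculpture": 0.2, "photograph": 0.1},
    {"building": 0.1, "sculpture": 0.3, "photograph": 0.6},
    {"building": 0.5, "sculpture": 0.4, "photograph": 0.1},
    {"building": 0.2, "sculpture": 0.7, "photograph": 0.1},
]
test_labels = ["building", "photograph", "sculpture", "sculpture"]

# The single label prediction is the category with the highest confidence
predicted_labels = [max(scores, key=scores.get) for scores in test_scores]
accuracy = top1_accuracy(predicted_labels, test_labels)
print(predicted_labels, accuracy)
```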

Below, you can see several examples of predictions on samples obtained using the Search API, along with the confidence scores and the explainability maps. The model uses the following images: aanzicht, Beeldbank van de Rijksdienst voor het Cultureel Erfgoed, Netherlands, G.Th. Delemarre, 1965-03, CC-BY-SA. Lerkärl, kärl, vessel@eng, Vasija, Världskulturmuseet, Sweden, CC-BY. Esimene rohelus, Eesti Sõjamuuseum - Kindral Laidoneri Muuseum, Estonia, Genin, CC0.

Our learnings

From the previous results, we can see that the model was able to successfully capture the most relevant concepts of the vocabulary for the given images. While it is far from perfect, the model can learn from our enriched collections, and can be applied to new images to generate potentially useful metadata.

The main limitation of our approach is that the concepts of the vocabulary are not mutually exclusive, and this doesn’t align well with a single class per image. For example, an image can be a photograph and contain both a building and a sculpture, but due to the single label approach we can only train and evaluate our model to identify one of these aspects.

This gives us a model that often outputs a high confidence score for only one of the categories, with low confidence for the rest. By setting a low threshold for the confidence scores of the output, we can get more than one label as the output. However, this approach is not ideal: all the confidence scores need to add up to one (as in any valid probability distribution), so the model cannot give high confidence values to several categories at the same time.
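The effect of the threshold can be sketched in a few lines of Python. The scores below are hypothetical and the helper name `labels_above` is our own; the point is only that, because the scores share a total of one, a second label appears only once the threshold is lowered well below typical "high confidence" values.

```python
def labels_above(scores, threshold):
    """Return the categories whose confidence reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

# Hypothetical single label output for an image that is a photograph
# of a building: the two relevant categories have to share the total of 1.0
scores = {"photograph": 0.48, "building": 0.41, "sculpture": 0.11}

strict = labels_above(scores, 0.9)    # no category can reach 0.9 here
lenient = labels_above(scores, 0.3)   # a low threshold recovers both labels

# Since the scores sum to one, at most one category can ever exceed 0.5
print(strict, lenient)
```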

Ideally, our model would be a multilabel classifier - a model that is trained with more than one label per image and that is able to output high confidence scores for several categories.
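A common way to build such a classifier, shown here as an assumption rather than as the design chosen for the pilot, is to score each category independently with a sigmoid and to train against a target that marks every correct label. The categories, raw scores and targets below are hypothetical.

```python
import math

def sigmoid(x):
    """Map a raw score to a confidence between 0 and 1, independently per category."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(scores, targets):
    """Average per-category loss against a multi-hot target (1 = label applies)."""
    losses = [
        -(t * math.log(s) + (1 - t) * math.log(1 - s))
        for s, t in zip(scores, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical categories and raw model scores for one image (assumptions)
categories = ["photograph", "building", "sculpture"]
raw_scores = [3.0, 2.5, -2.0]
targets = [1, 1, 0]  # the image is a photograph and shows a building

scores = [sigmoid(x) for x in raw_scores]
predicted = [c for c, s in zip(categories, scores) if s >= 0.5]
loss = binary_cross_entropy(scores, targets)

# The confidences no longer have to add up to one
print(predicted, [round(s, 3) for s in scores], round(loss, 3))
```

Because each category is scored on its own, two or more categories can receive high confidence for the same image.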

It is also worth mentioning that our dataset was assembled without human supervision (we didn’t review the images obtained or check whether they were indeed aligned with the categories). This means that the quality of the dataset depends on the metadata associated with the cultural heritage objects and on previous automatic enrichments based on that metadata. In practice, not all the images from the training dataset were aligned with the correct categories.

Next steps

We are currently assembling a training dataset for multilabel classification, and will share our work and approach in a future Pro news post - stay tuned! In the meantime, you can explore our GitHub repository for the pilot, and this Colab notebook, where you can make your own queries to the Europeana Search API and apply the single label classification model.

Feel free to contact us at rd@europeana.eu if you have any questions or ideas!