SHIFT is a project funded by the EU’s Horizon Europe programme to develop specialised toolkits tailored to different aspects of cultural heritage engagement. The first to be unveiled is the SHIFT Audio Toolkit, which pioneers AI-powered voice synthesis, emotion-driven speech, multilingual accessibility and immersive soundscapes.
Created by the SHIFT consortium with audEERING taking the lead, this Toolkit will harness the power of text-to-speech (TTS), emotion recognition, video dubbing and soundscape generation to create immersive experiences that bring heritage to life.
SHIFT TTS: an affective, multilingual text-to-speech system
SHIFT TTS is a text-to-speech system (a technology that reads written text aloud) that generates high-quality, emotionally expressive speech in multiple languages. Unlike traditional TTS tools, which often sound robotic, SHIFT TTS incorporates affective speech synthesis, meaning it can express emotions such as excitement, calmness or solemnity based on the content it narrates.
The SHIFT TTS tool supports multiple languages, including Albanian, Hungarian, Romanian, Serbian, German, Greek and English, and offers over 200 affective English voices covering native and non-native accents. The toolkit works with subtitles or plain text as input and offers voice personalisation, allowing users either to clone voices for unique, customised narration or to choose from the voices already available. Watch the tool in use.
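As a rough illustration of how such a system is typically driven, the sketch below sends text together with language and emotion settings to a synthesis endpoint. The URL, field names and voice identifier are hypothetical placeholders, not the SHIFT toolkit's documented interface.

```python
# Hypothetical sketch only: the endpoint, field names and voice id are assumptions,
# not the SHIFT TTS toolkit's actual API.
import requests

payload = {
    "text": "Welcome to the exhibition on Byzantine iconography.",
    "language": "el",             # Greek; Albanian, Hungarian, Romanian, Serbian, German and English are also listed
    "voice": "affective_en_042",  # one of the affective voices, or a cloned voice id
    "emotion": "calm",            # affective rendering, e.g. excitement, calmness, solemnity
}

response = requests.post("https://example.org/shift-tts/synthesize", json=payload, timeout=60)
response.raise_for_status()

with open("narration.wav", "wb") as f:
    f.write(response.content)     # the generated audio narration
```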
The TTS tool can enhance the accessibility and engagement of cultural heritage content, for example by adding multilingual, emotionally rich narrations that make exhibitions more appealing to diverse audiences. It also supports visually impaired visitors by providing detailed audio descriptions, and it can transform historical documents into immersive audio storytelling experiences.
Video dubbing and image-to-speech narration
Museums and cultural institutions often rely on videos to educate and engage visitors. However, creating multilingual versions or narrating silent images can be challenging. The SHIFT TTS system offers seamless functionality for video dubbing and the generation of narrated videos from images.
A key feature of the video dubbing facility is the ability to replace the original voice in a video with AI-generated speech (even cloning the voices of historical figures) while carefully preserving the emotional tone of the content. The system also excels at silent image vocalisation, turning still images and their text descriptions into narrated videos and making visual content more accessible and engaging for diverse audiences. See an example.
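The toolkit handles these steps internally; purely as a simplified illustration of the final assembly, the snippet below combines a still image with a generated narration track using the standard ffmpeg tool (file names are placeholders).

```python
# Simplified illustration (not the SHIFT pipeline): mux a still image with a
# TTS-generated narration into a short video using ffmpeg. File names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg",
    "-loop", "1", "-i", "artefact.jpg",   # repeat the still image as the video track
    "-i", "narration.wav",                # narration produced by the TTS step
    "-c:v", "libx264", "-tune", "stillimage",
    "-c:a", "aac",
    "-shortest",                          # stop when the narration ends
    "narrated_artefact.mp4",
], check=True)
```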
The video dubbing and image-to-speech narration features allow museums to create multilingual versions of their video content, broadening accessibility for international audiences. These tools can be used to add narration to artworks and historical artefacts in digital exhibits, providing richer, more engaging storytelling that enhances the visitor experience. By integrating AI-generated voiceovers, the SHIFT TTS tool can make online museum experiences more interactive and accessible, particularly for visitors with visual impairments or those who prefer audio-based content.
Voice cloning for personalised narration
One of the most innovative features of SHIFT TTS is its voice cloning capability, allowing users to replicate a speaker’s voice for narration. This feature is particularly useful for preserving the voices of historical figures or narrators, offering a unique and authentic way to bring history to life.
Users can upload a short audio sample, and SHIFT TTS will generate speech that mimics the person’s voice. This ensures that the cloned voice retains authentic emotions and speech characteristics, creating a more realistic and engaging experience. The voice cloning feature opens up possibilities for personalised storytelling, particularly for historical exhibitions, where figures such as Andy Warhol or Salvador Dalí could have their voices cloned to narrate their own stories, offering a deeper connection to the content.
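SHIFT's own cloning pipeline is not detailed here, but the same zero-shot workflow (a short reference recording in, cloned narration out) can be sketched with the open-source Coqui XTTS model. This is an analogous illustration, not the SHIFT implementation, and the file names are placeholders.

```python
# Analogous illustration using the open-source Coqui XTTS model, not the SHIFT
# implementation: clone a voice from a short reference sample and narrate new text.
from TTS.api import TTS  # pip install TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="This self-portrait was painted in the artist's final years.",
    speaker_wav="curator_sample.wav",   # short recording of the voice to clone (placeholder file)
    language="en",
    file_path="cloned_narration.wav",
)
```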
Voice cloning also enables the recreation of lost or incomplete historical recordings, bringing long-gone voices back to life for audiences to experience. Museum curators, content creators and others can benefit from this feature by using their own voices for narrations, giving a consistent and personal touch to audio guides, exhibitions and other types of content.
AI-generated soundscapes for immersive storytelling
To create fully immersive experiences, SHIFT has tested integrating AudioGen, an AI tool that generates realistic soundscapes from text descriptions. This feature allows environmental or ambient background sounds to be added to exhibitions, enriching them with customised soundscapes that match the time period or setting described in the exhibit. Ancient markets, battlefields or sacred spaces can be brought to life with authentic, era-appropriate background sounds, adding a sensory layer to the storytelling that deepens visitors' connection with history. The tool works in multiple languages, making it accessible to international audiences and enhancing cross-cultural engagement. Watch it in action with the lead image of this piece below!
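For readers who want to experiment, AudioGen is available through Meta's open-source AudioCraft library; a minimal sketch of prompt-to-soundscape generation looks like the following, where the model choice, duration and prompts are illustrative rather than the settings SHIFT used.

```python
# Minimal sketch of text-to-soundscape generation with AudioGen via Meta's
# AudioCraft library (pip install audiocraft). Prompts and duration are illustrative.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=10)  # seconds of audio per prompt

prompts = [
    "crowded ancient market, merchants calling out, footsteps on stone",
    "quiet stone chapel, distant choir, soft echoing footsteps",
]
wavs = model.generate(prompts)  # tensor of shape [batch, channels, samples]

for i, wav in enumerate(wavs):
    # audio_write normalises loudness and saves each clip as a .wav file
    audio_write(f"soundscape_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```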