Artificial intelligence (AI) has reached a turning point, with voice cloning technology advancing in leaps and bounds. Much of this is owed to innovations from companies such as Camb AI, a Dubai-based startup that has unveiled a powerful new AI model for voice cloning called Mars5. With its high degree of realism and broad linguistic coverage, Mars5 raises the bar for industry leader ElevenLabs and its counterparts, while also widening access to content the world over.
The strength of Mars5, however, lies less in its ability to reproduce the timbre of the original speaker’s voice than in its success at replicating complex prosodic features, including rhythm, emotion and intonation. It’s an impressive feat for a model with an open-source release, and marks an important step towards more lifelike digital voice replicas. Mars5’s ability to capture subtle emotional nuances makes it especially well suited to content that previously resisted convincing synthesis, such as sports and cinema.
Most notably, Mars5 supports more than 140 languages, nearly four times as many as ElevenLabs (which supports 36). It’s particularly impressive to see support for languages that are rarely, if ever, covered by other cloning services, such as Icelandic and Swahili. Camb AI’s model is also available as an open-source, English-specific version on GitHub. This helps to build a community around the work and should enable improvements over time.
Under the hood, Mars5 pairs a Mistral-style autoregressive model with a non-autoregressive multinomial diffusion model. The autoregressive component generates its output sequentially, one step at a time, while the diffusion component processes multiclass (discrete, binned) data in parallel, which allows a more efficient search of the complex probability space spanning those bins. This architecture allows Mars5 to achieve a sound quality that stands out in the field of voice cloning and perhaps rivals the naturalness of human speech.
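The difference between a sequential autoregressive stage and a parallel non-autoregressive stage can be illustrated with a toy sketch. This is not Mars5’s implementation: the bin count, the transition rule and the `predict_fine` function are hypothetical stand-ins for learned components, and the refinement loop is only a simplified imitation of discrete denoising.

```python
import random

NUM_BINS = 8  # hypothetical number of discrete output bins


def autoregressive_pass(length, rng):
    """Produce tokens one at a time; each token depends on the previous one."""
    tokens, prev = [], 0
    for _ in range(length):
        # Toy transition rule standing in for a learned next-token distribution.
        prev = (prev + rng.choice([0, 1, 2])) % NUM_BINS
        tokens.append(prev)
    return tokens


def predict_fine(coarse_token):
    """Hypothetical stand-in for a learned predictor of a fine token."""
    return (3 * coarse_token + 1) % NUM_BINS


def parallel_refine(coarse, steps, rng):
    """Start from uniform noise over the bins and denoise all positions at once."""
    fine = [rng.randrange(NUM_BINS) for _ in coarse]
    for step in range(1, steps + 1):
        commit_prob = step / steps  # reaches 1.0 on the final step
        # Every position is updated in the same step, using only the coarse
        # tokens and the previous step's values, so the work could run in parallel.
        fine = [
            predict_fine(c) if rng.random() < commit_prob else f
            for c, f in zip(coarse, fine)
        ]
    return fine


rng = random.Random(0)
coarse = autoregressive_pass(10, rng)             # sequential stage
fine = parallel_refine(coarse, steps=4, rng=rng)  # parallel stage
```

The first stage cannot be parallelised across positions because each token needs its predecessor. The second stage touches every position in each step, which is the property that makes non-autoregressive refinement attractive.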
Unlike most voice cloning models, Mars5 integrates voice cloning and text-to-speech conversion into a single system: users supply an audio file and a piece of text, and the model generates synthetic speech that not only sounds like the real speaker but also retains the speaker’s emotion. Combining these functions saves time and effort when creating synthetic speech.
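The shape of such a combined interface, reference audio plus text in and speech samples out, might look like the sketch below. Everything here is an assumption made for illustration: `CloneRequest`, `VoiceCloner`, `SilentCloner`, the 24 kHz sample rate and the seconds-per-word estimate are hypothetical and are not part of Mars5’s actual API. The placeholder class returns silence where a real model would return synthesised audio.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class CloneRequest:
    reference_audio: List[float]  # mono samples of the speaker to imitate
    text: str                     # what the cloned voice should say
    sample_rate: int = 24000      # assumed value, not taken from Mars5


class VoiceCloner(Protocol):
    """Hypothetical interface: one call covers cloning and text-to-speech."""

    def synthesize(self, request: CloneRequest) -> List[float]: ...


class SilentCloner:
    """Placeholder that validates input and returns silence of a plausible length."""

    SECONDS_PER_WORD = 0.4  # rough assumption used only to size the output

    def synthesize(self, request: CloneRequest) -> List[float]:
        if not request.reference_audio:
            raise ValueError("reference audio is empty")
        if not request.text.strip():
            raise ValueError("text is empty")
        n_words = len(request.text.split())
        n_samples = round(n_words * self.SECONDS_PER_WORD * request.sample_rate)
        return [0.0] * n_samples


cloner: VoiceCloner = SilentCloner()
request = CloneRequest(reference_audio=[0.0, 0.1, -0.1], text="Goal for the home side")
speech = cloner.synthesize(request)
```

Keeping both inputs in one request object mirrors the single-step workflow described above: there is no separate enrolment call before synthesis.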
Initial benchmarks and comparison tests show that Mars5 beats both open- and closed-source alternatives, including MetaVoice and ElevenLabs, on accuracy and quality. Contributions from the open-source community are expected to improve Mars5 further, and the open release demonstrates Camb AI’s commitment to ongoing improvement and innovation.
Going forward, Camb AI has no plans to rest on its laurels. An open-source release of another model, called Boli, is also in the pipeline. Boli promises to shake up translation with its grasp of context, grammar and slang across the world’s languages. With Mars5 and Boli, Camb AI is ready to usher in a new generation of voice cloning as well as other AI-powered localisation and communication technologies.
Camb AI has already put some of its technology to marquee real-world use, such as live-dubbing Major League Soccer games and producing subtitles in real time for international film and music releases. These audio technologies have enormous untapped potential for overcoming the cultural and linguistic barriers around real-time content.
Openness is also central to the way Camb AI develops its technology. The code for Mars5’s English version is available on GitHub, where developers, linguists and enthusiasts the world over can contribute to it and develop it further, bringing the technology closer to broader applications across a range of industries.
To conclude, Camb AI’s Mars5 is a clear leap forward for voice cloning, combining natural-sounding voices, extensive linguistic coverage and a user-friendly design. The evolution and expansion of this technology is likely to shape the future of global communication and content creation, and the current call for open-source contributions should move it further towards that goal. It is exciting to think about how AI-powered voice cloning and localisation technologies will develop in the near future and beyond.
© 2024 UC Technology Inc. All Rights Reserved.