ALTEN Group Case Studies Sharing

A Japanese metaverse startup focused on virtual influencers wanted to develop advanced AI models for generating real-time talking videos. It turned to VMO Holdings, a brand of the ALTEN Group specialized in software product development, for help creating the avatars. Thanks to VMO's innovations, high-quality, real-time talking AI avatars now populate the startup's metaverse.

The innovative Japanese startup sought to generate high-quality video tailored to Japanese metaverse audiences. The scarcity of Japanese training datasets in existing commercial application programming interfaces (APIs) posed a significant obstacle to achieving the accurate speech-to-text capabilities the startup was looking for. In addition, the high cost of third-party commercial products made developing an in-house AI product the more economical path.


Challenge: To develop cutting-edge AI technologies for real-time talking video generation, tailored to the needs of a Japanese metaverse audience focused on virtual influencers

Solution: Innovative research on AI models to generate high-quality, real-time talking AI avatars for virtual applications, coupled with custom finetuning of speech recognition models for better performance in Japanese

Benefits:

  • Real-time, high-quality talking AI characters
  • State-of-the-art solution
  • Enhanced AI capabilities
  • Superior performance in Japanese speech recognition
  • Improved efficiency in video generation
  • Faster processing times and higher resolution outputs

Key performance indicators:

Speech-To-Text:

  • 75% reduction in processing time
  • 15% enhancement in Japanese speech-to-text accuracy compared to market standards

Video Generation: 

  • 100% improvement in alignment between voice and lip movements
  • Increased resolution, at equivalent GPU usage, from 96×96 to 256×256
  • Decreased real-time latency between input and output from 45 seconds to under 3 seconds
  • >90% reduction in the cost of video creation compared to traditional tools


Person-talking video

Together with VMO, the startup set out to revolutionize virtual communication for the metaverse platform.

The solution included two main components. The first involved research on and development of AI models for generating high-quality, real-time talking avatars. The models needed to support the creation of dynamic video with realistic lip-sync and facial expressions, optimized for low usage of graphics processing units (GPUs). This was complemented by finetuning of the speech-to-text model on Japanese data to compensate for the lack of suitable commercial alternatives. The finetuning yielded significant improvements in Japanese speech recognition accuracy, while the more efficient video generation enabled faster processing times and higher-resolution outputs.
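As an illustration, a minimal sketch of this kind of Japanese finetuning is shown below, using Hugging Face Transformers with openai/whisper-small and the Common Voice Japanese corpus. The model size, dataset, and hyperparameters are illustrative assumptions, not VMO's actual configuration.

    # Minimal sketch: finetuning a Whisper-style speech-to-text model on
    # Japanese data with Hugging Face Transformers. Dataset, model size, and
    # hyperparameters are illustrative, not VMO's production setup.
    from dataclasses import dataclass
    from datasets import load_dataset, Audio
    from transformers import (
        WhisperProcessor,
        WhisperForConditionalGeneration,
        Seq2SeqTrainingArguments,
        Seq2SeqTrainer,
    )

    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="japanese", task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Example corpus: Common Voice Japanese, resampled to Whisper's 16 kHz input.
    data = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="train")
    data = data.cast_column("audio", Audio(sampling_rate=16_000))

    def prepare(batch):
        audio = batch["audio"]
        # Log-Mel features for the encoder, token ids for the decoder labels.
        batch["input_features"] = processor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
        return batch

    data = data.map(prepare, remove_columns=data.column_names)

    @dataclass
    class Collator:
        # Pads audio features and label token ids to a uniform length per batch.
        def __call__(self, features):
            inputs = [{"input_features": f["input_features"]} for f in features]
            batch = processor.feature_extractor.pad(inputs, return_tensors="pt")
            labels = processor.tokenizer.pad(
                [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
            )
            # Replace padding with -100 so those positions are ignored by the loss.
            batch["labels"] = labels["input_ids"].masked_fill(
                labels["attention_mask"].ne(1), -100
            )
            return batch

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="whisper-ja",         # checkpoints land here
            per_device_train_batch_size=16,
            learning_rate=1e-5,
            max_steps=4000,
            fp16=True,                       # requires a CUDA GPU
        ),
        train_dataset=data,
        data_collator=Collator(),
    )
    trainer.train()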

The first AI lip-sync model entered production in September 2023 with simulated lip-sync; the current version, with real lip-sync, went into production in January 2024.

The tools

  • Model training: Nvidia H100 GPUs were used for intensive model training tasks.
  • Speech-to-text benchmarking: conducted with the Whisper API and various pretrained models, for comparison against industry benchmarks (OpenAI and Google ASR).
  • SadTalker: high-quality, GPU-intensive video generation.
  • ER-NeRF: static-body avatars in the style of D-ID, transforming still photos into personalized streaming AI videos optimized for low GPU usage.
  • Wav2Lip: efficient full-body motion with good quality.
  • Custom Japanese datasets: used for finetuning, leveraging internal preprocessing and optimization techniques.

These state-of-the-art solutions achieved near real-time video generation, reducing processing time from 45 seconds to just 5 seconds at a resolution of 96×96.
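For illustration, the snippet below drives the open-source Wav2Lip inference script to lip-sync a portrait to a speech track. The file paths and checkpoint name are placeholders, and the flags follow the public Wav2Lip README rather than VMO's internal pipeline.

    # Hypothetical sketch: calling the public Wav2Lip inference script from
    # Python. Run from a checkout of the Wav2Lip repository; all paths below
    # are placeholders.
    import subprocess

    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained weights
            "--face", "avatar.png",        # source face image or video
            "--audio", "speech_ja.wav",    # driving audio track
            "--outfile", "results/avatar_talking.mp4",
        ],
        check=True,  # raise if the script exits with an error
    )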

VMO's AI delivered enhanced video quality, achieving a resolution of 256×256 pixels with only three seconds of generation time. The finetuned model also ensured superior performance in Japanese speech recognition, with a word error rate (WER) of 18.01 that far outperforms OpenAI's Whisper (21.11) and Google ASR (27.74), providing a significant edge in Japanese language support.
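For context on the metric, a minimal sketch of such a WER comparison with the open-source jiwer library is shown below. The transcripts are invented examples, and the Japanese text is assumed to be pre-segmented into words (e.g. with MeCab), since WER requires word boundaries.

    # Minimal sketch: comparing word error rates with jiwer. Transcripts are
    # invented; Japanese text is pre-tokenized because WER counts word-level
    # substitutions, deletions, and insertions against a reference.
    import jiwer

    references = [
        "今日 は いい 天気 です ね",
        "仮想 インフルエンサー が 話し ます",
    ]
    hypotheses = [
        "今日 は いい 天気 です ね",
        "仮想 インフルエンサー が 離し ます",  # homophone substitution (話し → 離し)
    ]

    wer = jiwer.wer(references, hypotheses)
    print(f"WER: {wer:.2%}")  # fraction of word errors across the test set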

New horizons in the metaverse

This innovative application of AI technologies brings Japanese audiences unprecedented advances in avatar generation, opening up new territory in the metaverse tailored to their needs and preferences.