The 8 NFE Inference Speed Powering LongCat-Video-Avatar 1.5

The digital human industry has long been haunted by the uncanny valley, that visceral sense of discomfort when a synthetic face looks almost human but fails in the subtle synchronization of speech and movement. For years, the gap between audio input and lip movement remained a stubborn technical hurdle, often resulting in robotic expressions that broke the user's immersion. This week, the release of LongCat-Video-Avatar 1.5 on Hugging Face suggests that the industry is finally moving past these aesthetic failures toward a production-ready standard for audio-driven human video generation.

Technical Architecture and Inference Optimization

LongCat-Video-Avatar 1.5 is built upon the LongCat-Video foundation, designed to handle a wide array of generative modalities. The framework supports two primary workflows: Audio-Text-to-Video (AT2V), which transforms audio and text prompts into video, and Audio-Text-Image-to-Video (ATI2V), which incorporates a reference image to maintain visual identity. A critical architectural shift in this version is the replacement of the traditional Wav2Vec2 audio encoder with OpenAI's Whisper-Large. By leveraging the superior speech recognition capabilities of Whisper-Large, the model achieves a more nuanced understanding of phonetic dynamics, resulting in lip movements that feel fluid and naturally aligned with the spoken word.
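
To make the audio-to-lip alignment concrete, the sketch below maps a video frame to the span of audio encoder positions that covers it. It relies on one known property of Whisper's encoder, which emits 1500 positions per 30-second window (50 per second). The frame rate of 25 fps and the helper name `audio_window_for_frame` are assumptions for illustration; the source does not describe how LongCat-Video-Avatar 1.5 actually windows its audio features.

```python
from typing import Tuple

# Whisper's encoder produces 1500 positions for a 30 s window, i.e. 50 per second.
WHISPER_FEATURES_PER_SECOND = 50


def audio_window_for_frame(frame_idx: int, video_fps: float) -> Tuple[int, int]:
    """Return the half-open range [start, end) of encoder positions for one video frame.

    Hypothetical helper: the real model may use wider, overlapping windows.
    """
    if frame_idx < 0 or video_fps <= 0:
        raise ValueError("frame_idx must be >= 0 and video_fps must be > 0")
    start = int(frame_idx * WHISPER_FEATURES_PER_SECOND / video_fps)
    end = int((frame_idx + 1) * WHISPER_FEATURES_PER_SECOND / video_fps)
    # Guarantee at least one audio position per frame, even at high frame rates.
    return start, max(end, start + 1)


# Assumed 25 fps video: each frame spans two encoder positions.
print(audio_window_for_frame(10, video_fps=25))
```

At a frame rate that does not divide 50 evenly, such as 30 fps, the windows alternate between one and two positions, which is one reason real systems often use overlapping context windows instead.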

Beyond visual fidelity, the model addresses the primary barrier to commercial adoption: computational cost. The development team applied a step-distillation technique based on DMD2 (Improved Distribution Matching Distillation), which sharply accelerates generation. This optimization cuts the inference requirement to just 8 NFE (number of function evaluations), meaning eight forward passes of the denoising network per sampling run. In a production environment, fewer network evaluations per generated clip translate directly into lower GPU cost and faster response times for end users. The system also supports multi-stream audio inputs, so the avatar remains stable even with complex or layered audio.
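
The sketch below illustrates what an NFE budget measures, using a toy sampler rather than the model itself. It wraps a stand-in denoiser in a call counter and runs a fixed-step Euler loop. The 50-step baseline with two passes per step (as in classifier-free guidance) is an assumed point of comparison, not a figure from the release, and the names `CountingDenoiser`, `euler_sample`, and `toy_velocity` are hypothetical.

```python
from typing import Callable


class CountingDenoiser:
    """Wraps a denoiser function and counts how many times it is evaluated."""

    def __init__(self, fn: Callable[[float, float], float]) -> None:
        self.fn = fn
        self.nfe = 0

    def __call__(self, x: float, t: float) -> float:
        self.nfe += 1
        return self.fn(x, t)


def euler_sample(denoiser: Callable[[float, float], float], x: float,
                 num_steps: int, passes_per_step: int = 1) -> float:
    """Fixed-step Euler integration from t=1 down to t=0.

    passes_per_step=2 mimics guidance schemes that call the network twice per step.
    """
    if num_steps < 1 or passes_per_step < 1:
        raise ValueError("num_steps and passes_per_step must be >= 1")
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        # Average the passes; a real sampler would combine them with a guidance weight.
        v = sum(denoiser(x, t) for _ in range(passes_per_step)) / passes_per_step
        x = x - dt * v
    return x


def toy_velocity(x: float, t: float) -> float:
    """Stand-in for the video denoiser; a real one is a large neural network."""
    return x


baseline = CountingDenoiser(toy_velocity)   # assumed undistilled setup
euler_sample(baseline, 1.0, num_steps=50, passes_per_step=2)

distilled = CountingDenoiser(toy_velocity)  # one way to spend an 8 NFE budget
euler_sample(distilled, 1.0, num_steps=8, passes_per_step=1)

print(baseline.nfe, distilled.nfe)
```

Because each function evaluation is a full forward pass of the network, sampling cost scales roughly linearly with NFE, which is why the step count dominates inference latency.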

To set up the environment, developers can run the following commands:

bash

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

conda create -n longcat-video python=3.10
conda activate longcat-video

pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1

pip install -r requirements.txt

conda install -c conda-forge librosa
conda install -c conda-forge ffmpeg
pip install -r requirements_avatar.txt
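
After installation, a quick check that the pinned versions actually landed in the active environment can save debugging time. The following is a minimal sketch using only the Python standard library. The pins are copied from the commands above, while the helper name `check_pins` and the report format are hypothetical.

```python
from importlib import metadata
from typing import Dict

# Version pins taken from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "flash_attn": "2.7.4.post1",
}


def check_pins(expected: Dict[str, str]) -> Dict[str, str]:
    """Compare installed package versions against expected pins."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop local build suffixes such as "+cu124" before comparing.
        base = installed.split("+", 1)[0]
        report[name] = "ok" if base == pin else f"mismatch ({installed})"
    return report


for package, status in check_pins(EXPECTED).items():
    print(f"{package}: {status}")
```

Run inside the `longcat-video` conda environment, each line should read `ok`. A `missing` or `mismatch` entry points to the install step worth repeating.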

From Benchmarks to Domain Generalization

While technical specifications provide the foundation, the practical value of LongCat-Video-Avatar 1.5 shows in its performance across diverse real-world scenarios. The model was put through a human evaluation benchmark spanning six domains: news broadcasting, educational content, daily life interactions, entertainment, singing, and commercial promotions. The test set consisted of 508 image-audio pairs, covering English and Chinese speech and both photorealistic and animated visual styles.

The validation effort was substantial, involving 770 crowd-sourced evaluators who provided a total of 13,240 individual judgments. In addition, 10 domain experts assessed outputs on four criteria: physical rationality, audio-video harmony, temporal stability, and identity consistency. The results indicate that LongCat-Video-Avatar 1.5 performs at a level comparable to leading commercial models and largely avoids the jitter and identity drift that typically plague long-form AI video.
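
For a sense of scale, the reported totals can be turned into simple averages. The short calculation below uses only the three figures quoted above. It assumes judgments were spread evenly across evaluators and pairs, which the source does not state, so the results are rough density estimates and say nothing about the actual study design.

```python
# Figures reported for the human evaluation.
IMAGE_AUDIO_PAIRS = 508
CROWD_EVALUATORS = 770
TOTAL_JUDGMENTS = 13_240


def mean_per(total: int, count: int) -> float:
    """Average number of items per unit, e.g. judgments per evaluator."""
    if count <= 0:
        raise ValueError("count must be positive")
    return total / count


per_evaluator = mean_per(TOTAL_JUDGMENTS, CROWD_EVALUATORS)
per_pair = mean_per(TOTAL_JUDGMENTS, IMAGE_AUDIO_PAIRS)

print(f"judgments per evaluator: {per_evaluator:.2f}")
print(f"judgments per image-audio pair: {per_pair:.2f}")
```

Under the even-spread assumption, that works out to roughly 17 judgments per evaluator and about 26 per image-audio pair.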

The most surprising insight from these tests is the model's capacity for domain generalization. It does not merely excel at generating human faces; it maintains robustness when applied to animated characters, animals, and complex scenes involving multiple people interacting or manipulating objects. This versatility suggests a shift in application from simple talking heads to full-scale virtual actors and AI anchors. To further optimize performance based on available hardware, the model supports FlashAttention-2 by default, while offering optional integration with FlashAttention-3 or xformers to maximize memory efficiency and throughput.
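
Choosing among these attention backends usually comes down to what is installed on the target machine. The sketch below shows one way to probe for them. It is not the repository's own selection mechanism, which the source does not describe. The helper `pick_attention_backend`, the backend labels, and the assumption that FlashAttention-3 is importable as `flash_attn_interface` are all illustrative.

```python
from importlib.util import find_spec
from typing import Callable, Optional, Sequence

# Backend label -> importable module name. The FlashAttention-3 module name is an
# assumption and may differ depending on how it was built and installed.
BACKEND_MODULES = {
    "flash_attn_2": "flash_attn",
    "flash_attn_3": "flash_attn_interface",
    "xformers": "xformers",
}


def _module_available(module: str) -> bool:
    """True if the top-level module can be located without importing it."""
    return find_spec(module) is not None


def pick_attention_backend(
    preference: Sequence[str] = ("flash_attn_2", "flash_attn_3", "xformers"),
    is_importable: Callable[[str], bool] = _module_available,
) -> Optional[str]:
    """Return the first backend in `preference` whose module is available, else None."""
    for name in preference:
        if is_importable(BACKEND_MODULES[name]):
            return name
    return None


print(pick_attention_backend())
```

The default order mirrors the article, with FlashAttention-2 first. On hardware that benefits from FlashAttention-3, passing a different `preference` tuple moves it to the front.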

LongCat-Video-Avatar 1.5 transforms the digital human from a research curiosity into a scalable tool for the global content economy.
