#vision-enhanced comprehension

MarkTechPost voice cloning alibaba qwen qwen3.5-livetranslate-flash

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Alibaba's Qwen team has released Qwen3.5-LiveTranslate-Flash, a real-time multimodal translation model that processes audio and video simultaneously. The model accepts input in 60 languages and produces speech output in 29 languages with a latency of 2.8 seconds. Key additions over the previous Qwen3 version include real-time speaker voice cloning, vision-enhanced comprehension via lip movements and on-screen text, and dynamic keyword configuration for domain-specific terminology. On the FLEURS and CoVoST2 benchmarks, the model outperforms major commercial alternatives. It is available as an API-only model through Alibaba Cloud Model Studio using a WebSocket-based protocol.

May 20, 8:09 AM

Mentions — May 14, 2026 – May 20, 2026
