An undergraduate researcher recently put a new AI model to the test with a simple, high-stakes request: correct my pronunciation the moment you hear a mistake. In a typical AI interaction, the user speaks, pauses, and waits for the system to process the audio before receiving a critique. But this interaction was different. The model did not wait for the silence. It intervened mid-sentence, pinpointing errors in real time as the student spoke. For a brief moment, the mechanical lag of human-computer interaction vanished, and the AI synchronized perfectly with the natural rhythm of human speech.
The Architecture of the 200ms Micro-Turn
Thinking Machines Lab has officially unveiled TML-Interaction-Small, a model engineered specifically to bridge the gap between asynchronous processing and fluid conversation. At the heart of this system is a 200ms micro-turn design. Rather than treating a conversation as a series of discrete turns—where one party finishes and the other begins—TML-Interaction-Small processes input and output as a continuous stream. This allows the model to support seamless interruptions and simultaneous speech, mimicking the overlapping nature of organic human dialogue.
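In rough pseudocode, the micro-turn idea can be pictured as a loop that ticks every 200ms regardless of who is talking. The sketch below is purely illustrative; `capture_audio_chunk`, `model_step`, and `play_audio_chunk` are hypothetical placeholders, not anything Thinking Machines Lab has published.

```python
import time

MICRO_TURN_MS = 200  # the 200ms micro-turn window described above

def run_micro_turn_loop(capture_audio_chunk, model_step, play_audio_chunk):
    """Drive a full-duplex conversation as a stream of fixed-size micro-turns.

    Every 200ms the model receives whatever audio arrived in that window --
    silence, a mid-word fragment, or overlapping speech -- and emits whatever
    it wants to say next, which may be nothing. Interruptions fall out
    naturally because listening and speaking share the same clock.
    """
    while True:
        start = time.monotonic()

        user_chunk = capture_audio_chunk(MICRO_TURN_MS)   # may be silence
        reply_chunk = model_step(user_chunk)               # may be empty
        if reply_chunk:
            play_audio_chunk(reply_chunk)

        # Sleep off whatever is left of the 200ms budget to stay on the clock.
        elapsed_ms = (time.monotonic() - start) * 1000
        time.sleep(max(0.0, (MICRO_TURN_MS - elapsed_ms) / 1000))
```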
To achieve this without sacrificing cognitive depth, the system employs a bifurcated architecture. It splits responsibilities between an Interaction Model and a Background Model. The Interaction Model serves as the immediate interface, handling the rapid-fire demands of real-time response. Meanwhile, the Background Model manages long-term reasoning and complex computations. These two entities operate asynchronously but share a unified context. When the Interaction Model encounters a query that requires deep reasoning beyond its immediate capacity, it delegates the task to the Background Model. Crucially, the Interaction Model remains active and present for the user during this delegation, maintaining the conversational thread and answering follow-up questions while the Background Model works in the periphery.
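One way to picture the delegation pattern is the sketch below, assuming an async event loop; every name in it (`fast_reply`, `deep_reason`, `needs_deep_reasoning`, `shared_context`) is an illustrative stand-in rather than the lab's actual interface.

```python
import asyncio

async def interaction_loop(user_turns, fast_reply, deep_reason, shared_context):
    """Hypothetical split between an Interaction Model and a Background Model.

    fast_reply() plays the low-latency Interaction Model; deep_reason() plays
    the slower Background Model. Both read and write the same shared_context,
    so the fast path keeps talking while deep work runs behind it.
    user_turns is assumed to be an async iterator of user utterances.
    """
    background_task = None

    async for turn in user_turns:
        shared_context.append(("user", turn))

        if background_task and background_task.done():
            # Fold the Background Model's finished answer into the context.
            shared_context.append(("background", background_task.result()))
            background_task = None

        if needs_deep_reasoning(turn) and background_task is None:
            # Delegate without blocking: the user keeps a responsive partner.
            background_task = asyncio.create_task(deep_reason(shared_context))

        reply = await fast_reply(shared_context)  # always answers quickly
        shared_context.append(("assistant", reply))
        yield reply

def needs_deep_reasoning(turn: str) -> bool:
    # Placeholder heuristic; the real model presumably decides this internally.
    return len(turn.split()) > 40
```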
From External Harnesses to Native Interaction
For years, the industry has attempted to simulate real-time interaction using external harnesses. The most common method involves Voice Activity Detection (VAD), a separate layer of software that monitors audio to decide when a user has stopped speaking. This approach creates a fundamental tension: if the VAD is too sensitive, the AI interrupts too early; if it is too conservative, the conversation feels sluggish. TML-Interaction-Small discards this external framework entirely, embedding interaction capabilities directly into the model's internal weights.
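To see where that tension comes from, here is a minimal sketch of a conventional VAD endpointing harness. The thresholds are arbitrary illustrative values, not taken from any particular product.

```python
import numpy as np

def vad_endpoint(frames, energy_threshold=0.01, silence_ms_to_commit=600,
                 frame_ms=20):
    """Classic VAD-style endpointing harness, external to the model.

    Returns the index of the frame at which the harness decides the user has
    finished speaking. The tension lives in silence_ms_to_commit: a low value
    cuts the user off mid-thought, a high value adds the same amount of dead
    air before every model response.
    """
    silent_frames_needed = silence_ms_to_commit // frame_ms
    silent_run = 0
    heard_speech = False

    for i, frame in enumerate(frames):
        energy = float(np.mean(np.square(frame)))
        if energy > energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silent_frames_needed:
                return i  # harness declares the turn over here
    return len(frames)
```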
This native approach is powered by an early fusion structure that processes text, audio, and video inputs simultaneously. Audio signals are ingested as dMel (discretized mel-spectrograms), which a lightweight embedding layer maps into the model's token space. Visual data is handled by dividing images into 40x40 patches, which are encoded by a hierarchical multilayer perceptron (hMLP) that compresses them efficiently. Because every component was jointly trained with the transformer from the ground up, the model does not just react to signals; it understands the multimodal context of the interaction in real time.
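As a mental model, an early fusion front end can be sketched as a module that projects every modality into one shared embedding space before the transformer runs. The dimensions and layer shapes below are made up for illustration and are not the published architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionEmbedder(nn.Module):
    """Sketch of an early-fusion front end: every modality becomes tokens in
    one shared embedding space, so the transformer sees a single sequence.
    All sizes here are illustrative placeholders.
    """
    def __init__(self, d_model=1024, n_mel=80, patch=40, vocab=64_000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        # Audio: dMel spectrogram frames -> lightweight linear embedding.
        self.audio_embed = nn.Linear(n_mel, d_model)
        # Vision: 40x40 RGB patches -> small MLP encoder (stand-in for hMLP).
        self.patch_embed = nn.Sequential(
            nn.Linear(patch * patch * 3, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, text_ids, mel_frames, patches):
        # text_ids: (T,), mel_frames: (A, n_mel), patches: (P, 40*40*3)
        tokens = torch.cat([
            self.text_embed(text_ids),
            self.audio_embed(mel_frames),
            self.patch_embed(patches),
        ], dim=0)
        return tokens  # one interleavable sequence for the joint transformer
```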
This architectural shift translates into significant performance gains. On the FD-bench v1 turn-taking latency test, TML-Interaction-Small recorded a response time of 0.40 seconds. In the FD-bench v1.5 evaluation, the model achieved an average score of 77.8, surpassing both GPT-realtime-2.0 and Gemini-3.1-flash-live. When tested on FD-bench v3 Audio+Tools with the Background Agent active, the model maintained a response quality of 82.8%.
To sustain these speeds in production, Thinking Machines Lab optimized the inference pipeline through SGLang. They upstreamed a streaming session feature to the SGLang library, which allows the system to append chunks to persistent sequences in GPU memory. This eliminates the overhead associated with frequent memory reallocation. Furthermore, the team implemented low-latency communication kernels for the NVIDIA Blackwell architecture using NVLS (NVLink SHARP), ensuring that inter-GPU communication does not become a bottleneck. Safety was not overlooked in the pursuit of speed: the model achieved a 99.0% text refusal rate on the HarmBench safety benchmark.
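Conceptually, and without reproducing SGLang's actual API, the streaming-session idea amounts to pre-allocating one persistent per-conversation buffer on the GPU and appending each new chunk in place, rather than re-allocating or rebuilding the sequence every turn. A minimal sketch of that concept:

```python
import torch

class StreamingSession:
    """Conceptual sketch of a streaming session (not SGLang's API):
    one persistent sequence per conversation, with new chunks appended
    into a pre-allocated GPU buffer instead of triggering reallocation.
    """
    def __init__(self, d_model=1024, max_len=65_536, device="cuda"):
        # Allocate once; appends just advance a cursor.
        self.kv_cache = torch.empty(max_len, 2, d_model, device=device)
        self.length = 0

    def append_chunk(self, chunk_kv: torch.Tensor) -> None:
        """chunk_kv: (chunk_len, 2, d_model) keys/values for the new tokens."""
        n = chunk_kv.shape[0]
        if self.length + n > self.kv_cache.shape[0]:
            raise RuntimeError("session exceeded its pre-allocated window")
        self.kv_cache[self.length:self.length + n].copy_(chunk_kv)
        self.length += n

    def view(self) -> torch.Tensor:
        # Zero-copy view over everything appended so far.
        return self.kv_cache[: self.length]
```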
The benchmark shift suggests that the next frontier of AI competition is no longer about the raw ceiling of intelligence, but the density of interaction and the ability of a model to breathe in sync with its user.