For years, developers attempting to build video-intelligent applications have faced a frustrating compromise. To analyze a video clip, most vision-language models essentially treat the footage as a slideshow, sampling a few static frames and hoping the critical action occurs within those snapshots. This approach is not only computationally expensive but fundamentally blind to the fluid nature of time and motion. As the industry pivots toward Physical AI—systems that actually understand the laws of the physical world—the bottleneck has remained the prohibitive cost of processing high-frequency visual data at scale.

The Efficiency Frontier of Mk1

Perceptron has stepped into this gap with Mk1, a native video inference model designed to collapse the cost of high-performance video analysis. The pricing is aggressive and positioned to undercut the current market leaders: $0.15 per million input tokens and $1.50 per million output tokens. Compared with frontier models such as Anthropic's Claude Sonnet 4.5, OpenAI's GPT-5, and Google's Gemini 3.1 Pro, that represents a cost reduction of 80 to 90 percent.

This pricing is not a marketing tactic but the result of a 16-month development cycle led by CEO Armen Aghajanyan, a veteran of Meta FAIR and Microsoft, whose team focused on a multimodal recipe tailored to the complexities of the physical world. In the efficiency-frontier charts provided by the development team, the disparity in mixed cost (a blended price per million tokens across input and output) is stark: Mk1 operates at approximately $0.30, while GPT-5 sits at roughly $2.00 and Gemini 3.1 Pro reaches approximately $3.00.
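The mixed-cost figure is easy to sanity-check with a back-of-the-envelope calculation. The sketch below assumes a heavily input-weighted workload, roughly an 8:1 input-to-output token ratio; that ratio is our assumption rather than a published figure, though video analysis is typically dominated by ingested frames. Under that assumption, Mk1's list prices reproduce the roughly $0.30 blended figure from the charts.

```python
def blended_cost(input_price: float, output_price: float, input_share: float) -> float:
    """Blended price per million tokens, weighted by the share of input tokens."""
    return input_price * input_share + output_price * (1 - input_share)

# Mk1 list prices in USD per million tokens.
MK1_INPUT, MK1_OUTPUT = 0.15, 1.50

# Assumed 8:1 input-to-output mix (~89% input tokens); video workloads
# are input-heavy because most tokens come from ingested frames.
share = 8 / 9

print(f"Mk1 blended cost: ${blended_cost(MK1_INPUT, MK1_OUTPUT, share):.2f} per million tokens")
# -> Mk1 blended cost: $0.30 per million tokens
```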

Technically, Mk1 is built for native video processing rather than image-sequence approximation. It processes footage at up to 2 frames per second (FPS) and provides a 32K-token context window, letting the model hold a significant amount of visual information in active memory. To ease integration, Perceptron provides a Python-based SDK that exposes specialized capabilities such as Focus, Counting, and In-Context Learning.
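Perceptron's SDK documentation is not reproduced here, so the following is only a minimal sketch of what integration could look like. The perceptron package name, the Client class, and the focus and count methods are illustrative assumptions, not the published API.

```python
# Hypothetical usage sketch; package, class, and method names are
# assumptions for illustration, not Perceptron's documented API.
from perceptron import Client

client = Client(api_key="YOUR_API_KEY")

# Native video ingestion: Mk1 samples the stream at up to 2 FPS and
# keeps visual state within its 32K-token context window.
video = client.upload("factory_floor.mp4")

# Focus: constrain the model's attention to a named region or entity.
report = client.focus(
    video,
    target="conveyor belt 3",
    query="Flag any jam or stalled item.",
)

# Counting: enumerate distinct object instances across the clip.
pallets = client.count(video, label="pallet")
print(report.text, pallets.total)
```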

Redefining Precision Through Temporal Continuity

The true distinction of Mk1 lies in its departure from the frame-by-frame paradigm. While traditional vision-language models (VLMs) struggle with temporal consistency, Mk1 is engineered to maintain the identity of objects even when they are occluded or move rapidly across the screen. This ability to track continuity is the cornerstone of Physical AI, marking the shift from simple pattern recognition to a genuine understanding of spatiotemporal dynamics.

This architectural shift is reflected in the benchmark data. On EmbSpatialBench, which measures spatial reasoning, Mk1 achieved 85.1, outperforming both Alibaba's Q3.5-27B at 84.5 and Google's Robotics-ER 1.5 at 78.4. The gap becomes even more pronounced on RefSpatialBench, which tests the understanding of referential expressions: Mk1 recorded 72.4, dwarfing the 9.0 achieved by GPT-5m and the 2.2 recorded by Sonnet 4.5.

Temporal reasoning benchmarks further validate this approach. On the EgoSchema hard subset, a test that cannot be solved by sampling only the first and last frames, Mk1 scored 41.4, well above Gemini 3.1 Flash-Lite's 25.0. Mk1 also posted the highest score among the compared models on VSI-Bench, which specifically measures temporal reasoning, at 88.5.

These numbers translate into capabilities that were previously computationally impractical. Mk1 can analyze a basketball game in real-time, simultaneously tracking the ball's trajectory in the air and the countdown of the shot clock to determine if a shot was a buzzer-beater. It demonstrates high reliability in reading analog gauges and clock hands, and it can perform pixel-level pointing and count hundreds of distinct objects within a scene. In one practical test, the model accurately described historical footage of New York City skyscrapers being built in 1906, identifying specific details like workers hanging from ropes and correctly deducing the early 20th-century era based solely on visual cues.
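Reusing the hypothetical client from the earlier sketch, the buzzer-beater scenario might be phrased as a single temporal query; again, the query method and the fields on the returned object are assumptions for illustration.

```python
from perceptron import Client  # hypothetical package, as in the earlier sketch

client = Client(api_key="YOUR_API_KEY")

# Temporal query requiring joint tracking of the ball and the shot clock.
clip = client.upload("final_possession.mp4")
answer = client.query(
    clip,
    "Did the ball leave the shooter's hands before the shot clock "
    "reached zero? Answer yes or no, with timestamps for both events.",
)
print(answer.text)
```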

By removing the financial barrier to high-fidelity video analysis, Mk1 pushes Physical AI out of the research lab and into the infrastructure of the real world, from automated factory floors to city-wide surveillance networks.