The modern developer experience with AI agents is defined by a frustrating paradox. The underlying models are capable of complex reasoning, yet execution feels sluggish because agents do not simply answer a question; they enter an iterative loop of reviewing, planning, editing, testing, and refining. In this sequential workflow, the bottleneck is not how many requests a server can handle at once, but how fast a single stream of tokens can be generated. When an agent must generate 50,000 tokens to complete a complex coding task, the difference between 100 tokens per second and 3,000 tokens per second is the difference between a wait of more than eight minutes (500 seconds) and a response in about 17 seconds. This latency is the primary barrier preventing AI agents from moving from experimental chatbots to real-time autonomous coworkers.
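The arithmetic behind that comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is illustrative, and it assumes a constant decode rate with no time spent on tool calls or prompt processing.

```python
def generation_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to decode total_tokens one after another at a fixed rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


TASK_TOKENS = 50_000  # task size used in the text

slow = generation_seconds(TASK_TOKENS, 100)
fast = generation_seconds(TASK_TOKENS, 3_000)

print(f"100 tok/s:   {slow:.0f} s ({slow / 60:.1f} min)")
print(f"3,000 tok/s: {fast:.1f} s")
```

Because an agent's steps run sequentially, this per-stream time adds up directly across every iteration of its loop.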
The Performance Gap in Agentic Workflows
Kog has addressed this latency crisis by redesigning the software stack to maximize the raw potential of standard data center GPUs. In recent benchmarks, Kog achieved a staggering inference speed of 3,000 tokens per second on a node equipped with 8x AMD MI300X GPUs. When deployed on 8x NVIDIA H200 GPUs, the system recorded 2,100 tokens per second. These figures are particularly significant because they were achieved using Laneformer 2B, a model pre-trained on 6T tokens that maintains a 50% score on the HumanEval benchmark, making it highly competent for coding tasks.
Crucially, these speeds were not reached through common shortcuts. Kog did not employ quantization, speculative decoding, or pruning, techniques that typically trade model precision or added complexity for speed. Instead, the performance gain comes from a holistic optimization of the model architecture, runtime, and low-level GPU code. By eliminating the traditional bottlenecks of inference engines, Kog allows agents to iterate through their internal thought loops almost instantaneously. Developers can test this performance firsthand at playground.kog.ai, where the removal of standard framework overhead is immediately apparent.
This leap in decode speed fundamentally alters the productivity of AI agents. When the cost of generation drops, the agent can afford to call more tools, run more tests, and perform more self-corrections within the same time window. The result is not just a faster user interface, but a more intelligent agent that can autonomously catch and fix its own errors more frequently before presenting a final result to the user.
Engineering the Microsecond: Beyond High-Level Frameworks
To understand how Kog reached 3,000 tokens per second, one must look at the hidden costs of modern AI frameworks. On an AMD MI300X, the overhead for launching and cleaning up a single kernel is approximately 4.5µs. For a model with 25 layers, where each layer executes 10 kernels, the total overhead per token reaches 1,125µs. However, to hit a target of 3,000 tokens per second, the total time budget per token is only 333µs. In a standard environment, the kernel launch overhead alone is more than three times the total time budget, effectively capping the speed at around 890 tokens per second regardless of the hardware's raw power.
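The budget arithmetic above can be reproduced with a short sketch. The figures (4.5 µs per kernel, 25 layers, 10 kernels per layer, a 3,000 tokens-per-second target) come from the text; the function names are illustrative, and the model assumes kernel launches are strictly serialized with no overlap.

```python
US_PER_SECOND = 1_000_000


def launch_overhead_us(layers: int, kernels_per_layer: int,
                       overhead_per_kernel_us: float) -> float:
    """Total launch/cleanup overhead paid for one generated token."""
    return layers * kernels_per_layer * overhead_per_kernel_us


def token_budget_us(target_tokens_per_second: float) -> float:
    """Time available per token at the target decode speed."""
    return US_PER_SECOND / target_tokens_per_second


def max_tokens_per_second(per_token_us: float) -> float:
    """Decode speed if each token costs per_token_us microseconds."""
    return US_PER_SECOND / per_token_us


overhead = launch_overhead_us(layers=25, kernels_per_layer=10,
                              overhead_per_kernel_us=4.5)
budget = token_budget_us(3_000)
cap = max_tokens_per_second(overhead)
launches_that_fit = int(budget // 4.5)  # launches affordable with zero compute

print(f"overhead per token: {overhead:.0f} us")
print(f"budget per token:   {budget:.0f} us")
print(f"overhead / budget:  {overhead / budget:.2f}x")
print(f"speed cap:          {cap:.0f} tok/s")
print(f"launches in budget: {launches_that_fit}")
```

Under these assumptions, the budget covers only a few dozen launches per token even before any computation runs, so the number of kernel launches per token has to fall well below the 250 in the example.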
Kog solves this by shifting the optimization metric from Model FLOPs Utilization (MFU) to Memory Bandwidth Utilization (MBU). In autoregressive decoding with a batch size of one, the primary operation is matrix-vector multiplication. In this scenario, the bottleneck is not the compute power of the GPU cores, but the speed at which weights are moved from High Bandwidth Memory (HBM) to the processors. Each weight is read once per token and used in a single multiply-add, so at FP16 precision only one multiply-add (two FLOPs) is performed for every two bytes of weight data. At such low arithmetic intensity, the upper limit of performance is dictated almost entirely by how efficiently the system utilizes memory bandwidth.
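A simplified bandwidth-bound model makes this concrete. The sketch below is an assumption-laden illustration, not Kog's method: it counts a multiply-add as two FLOPs, assumes every weight is streamed from memory exactly once per token, and ignores KV-cache reads, activations, and inter-GPU communication. The parameter count and bandwidth in the usage lines are placeholder values, not measurements of any real model or GPU.

```python
def matvec_flops_per_byte(bytes_per_param: float) -> float:
    """Arithmetic intensity of a matrix-vector product.

    Each weight is read once and used in one multiply-add (2 FLOPs).
    """
    return 2 / bytes_per_param


def weight_bytes(param_count: int, bytes_per_param: int) -> int:
    """Bytes of weights that must be streamed for every generated token."""
    return param_count * bytes_per_param


def bandwidth_bound_tokens_per_second(param_count: int, bytes_per_param: int,
                                      bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-size-one decode speed set by memory bandwidth."""
    return bandwidth_bytes_per_s / weight_bytes(param_count, bytes_per_param)


def memory_bandwidth_utilization(measured_tokens_per_second: float,
                                 param_count: int, bytes_per_param: int,
                                 bandwidth_bytes_per_s: float) -> float:
    """Fraction of peak bandwidth implied by a measured decode speed."""
    achieved = measured_tokens_per_second * weight_bytes(param_count, bytes_per_param)
    return achieved / bandwidth_bytes_per_s


# Placeholder values for illustration only.
PARAMS = 1_000_000_000           # hypothetical 1B-parameter model
FP16_BYTES = 2
BANDWIDTH = 1_000_000_000_000    # hypothetical 1 TB/s aggregate bandwidth

print(matvec_flops_per_byte(FP16_BYTES))
print(bandwidth_bound_tokens_per_second(PARAMS, FP16_BYTES, BANDWIDTH))
print(memory_bandwidth_utilization(250, PARAMS, FP16_BYTES, BANDWIDTH))
```

To apply the model to a real system, substitute the model's actual parameter count and the vendor-published peak bandwidth summed over the GPUs that hold the weights.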
To reclaim these lost microseconds, Kog completely stripped away high-level abstraction layers like PyTorch and Triton. The team bypassed general-purpose libraries such as CUTLASS, NCCL, and ROCm CK, opting instead to manually implement GPU code using PTX inline assembly for CUDA and CDNA ISA inline assembly for HIP. To further eliminate latency, they developed KCCL, a custom communication function that removes the scheduling delays inherent in framework-level communication libraries.
This low-level co-design extends to the physical architecture of the hardware. Kog treats the AMD MI300X not as a generic processor, but as a specific physical system with a unique chiplet topology. By aligning the software execution sequence with the physical layout of the chiplets, Kog optimizes the data movement paths and ensures continuous memory streaming. This approach converts idle hardware cycles into active computation, pushing memory bandwidth utilization toward its theoretical limit.
By demonstrating that 3,000 tokens per second is possible on standard GPUs through software optimization alone, Kog removes the necessity for expensive, proprietary NPU accelerators. The focus shifts from the hardware arms race to the efficiency of the software stack, enabling a new generation of agents that can think and act in real time.




