This week, a video search team hits the same wall again: intent routing slows down, and users see results only after a 2–4 second delay.

Amazon Bedrock moves routing intelligence from Nova Premier to Nova Micro via distillation

The approach described here does not center on OpenAI models at all. Instead, it uses Amazon Bedrock’s Model Distillation to shift routing intelligence inside the Bedrock stack.

In the teacher–student setup, the teacher model is Amazon Nova Premier, and the student model is Amazon Nova Micro. The claim is straightforward: keep routing quality while cutting inference cost by 95% or more and reducing latency by 50%.

The write-up also ties the performance problem to a specific part of the pipeline. In Part 1, the team uses Claude Haiku (from Anthropic) to build a multimodal video semantic search system. In that configuration, end-to-end search time lands in the 2–4 second range, and the authors state that 75% of the total delay comes from that intent routing stage.

So the engineering question becomes: if routing is the dominant contributor to delay, can you compress the routing step into a smaller model without losing the ability to interpret user intent from multimodal signals?

Routing logic gets heavier as enterprise metadata grows, and that drives prompt and response costs

The paper’s second section explains why routing gets slower and more expensive as systems mature.

In earlier versions of the pipeline, routing metadata needed for intent classification could be as simple as five attributes: title, captions, people, genre, and timestamps. With that limited context, the team says intent classification could be handled with relatively simple prompts.

But the enterprise reality is different. As the metadata surface expands, routing conditions start to include more complex signals such as camera angle, mood and sentiment, licensing and rights windows, and even domain-specific taxonomies. The more elaborate the routing logic becomes, the heavier the prompt needs to be.

And heavier prompts do not come for free. The authors connect the dots: more complex routing requires a larger prompt, a larger prompt increases cost, and the increased cost shows up as slower responses.

The tension is not just “accuracy versus speed.” It is also “accuracy versus cost versus latency,” all moving together. Instead of forcing a single trade-off between a fast but simple model and a slower but more accurate model, the paper proposes training a smaller model that can hit all three targets at once: accuracy, cost, and delay.

Distillation replaces SFT by synthesizing up to 15,000 prompt–response pairs

The third section lays out why the team chooses distillation rather than supervised fine-tuning (SFT).

SFT typically requires fully labeled examples, meaning a human provides the correct response for each training input. Distillation, by contrast, can start from prompts alone. The paper frames the key advantage as “no need for complete labeled data.”

In practice, Amazon Bedrock calls the teacher model automatically to generate high-quality responses. Then the system uses data synthesis and augmentation to create as many as 15,000 prompt–response pairs.

The authors also note that labeled data can optionally be added. For the data format, they specify that JSONL records follow the `bedrock-conversation-2024` schema.

In that schema, the `user` role (the input prompt) is required, while the `assistant` role (the desired response) is optional. That matters because the distillation pipeline can generate the assistant responses from the teacher model during dataset creation, rather than requiring a fully curated labeled dataset up front.
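To make the schema concrete, here is a minimal sketch of one training record in the `bedrock-conversation-2024` format. The field names follow AWS's published schema; the system prompt and routing query are hypothetical examples, not taken from the paper.

```python
import json

# One training record in the bedrock-conversation-2024 format.
# The prompt text is a made-up routing query for illustration.
record = {
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{"text": "You are an intent router for video search. "
                        "Classify the query as visual, audio, transcript, or metadata."}],
    "messages": [
        {
            "role": "user",  # required: the input prompt
            "content": [{"text": "Find clips where the crowd cheers after the goal"}],
        },
        # The assistant turn is optional: when it is omitted, the distillation
        # job asks the teacher (Nova Premier) to generate the response.
        {
            "role": "assistant",
            "content": [{"text": "audio"}],
        },
    ],
}

# Each record becomes one line of the JSONL training file.
line = json.dumps(record)
```

Dropping the `assistant` entry entirely yields the prompts-only variant the paper emphasizes, where the teacher fills in responses during dataset creation.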

The result is a training workflow that can scale the routing capability without scaling human annotation.

Train Nova Micro with 10,000 synthetic examples, then deploy on demand and pay per token

The final section gets operational, describing the experiment configuration and how the team deploys the distilled router.

The experiment begins by generating 10,000 synthetic labeled examples using Nova Premier as the teacher. The team says it balances the data across visual, audio, transcript, and metadata query signals.

The synthetic examples are also designed to cover the full expected range of search inputs, with difficulty levels divided into tiers. To avoid overfitting to specific query patterns, the authors say the dataset includes edge cases and variations.

If additional data is needed, the paper points to a script named `generate_training_data.py` that uses Nova Premier to generate more synthetic training data.
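The paper names the script but does not show its contents, so the following is only a hypothetical sketch of how such a generator might balance prompts across the four signal types and the difficulty tiers. The templates, tier names, and helper function are assumptions; the real script would additionally call Nova Premier, which is stubbed out here.

```python
import itertools

# Hypothetical balanced-prompt generator; not the actual
# generate_training_data.py, whose contents are not published.
SIGNALS = ["visual", "audio", "transcript", "metadata"]
TIERS = ["easy", "medium", "hard"]  # assumed tier names

TEMPLATES = {
    "visual": "Find scenes that look like: {q}",
    "audio": "Find moments that sound like: {q}",
    "transcript": "Find where someone says: {q}",
    "metadata": "Find videos whose metadata matches: {q}",
}

def make_prompts(seed_queries, per_combo=1):
    """Return (signal, tier, prompt) records, evenly spread across
    every signal/tier combination to avoid overfitting to one pattern."""
    prompts = []
    for signal, tier in itertools.product(SIGNALS, TIERS):
        for q in seed_queries[:per_combo]:
            prompts.append({
                "signal": signal,
                "tier": tier,
                "text": TEMPLATES[signal].format(q=q),
            })
    return prompts

# 4 signals x 3 tiers x 1 seed query = 12 balanced prompts
prompts = make_prompts(["a sunset over the stadium"])
```

Scaling the seed queries and per-tier variations up to the 10,000-example budget would then be a matter of widening the seed set, with the teacher model producing the paired responses.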

From there, the workflow looks like this: upload the training data to Amazon S3, submit the distillation job, and let Bedrock generate teacher responses from prompts and fine-tune the student model using the resulting prompt–response pairs.

A key operational claim is that Bedrock handles orchestration and infrastructure automatically. The authors say you do not need to manage cluster provisioning, hyperparameter tuning, or the teacher–student pipeline configuration yourself.

They also describe job execution as asynchronous. To check progress, you can look in the Amazon Bedrock console under Foundation models > Custom models, or monitor the job programmatically.
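A boto3 sketch of that submit-and-monitor flow might look like the following. The job name, model identifiers, role ARN, and S3 URIs are placeholders rather than the paper's actual values, and the API calls are left commented so the focus stays on the request shape.

```python
# Request body for a Bedrock distillation job. All ARNs, model IDs, and
# S3 URIs below are placeholders, not values from the write-up.
job_args = {
    "jobName": "video-intent-router-distillation",
    "customModelName": "intent-router-nova-micro",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockDistillationRole",
    "baseModelIdentifier": "amazon.nova-micro-v1:0",  # student model
    "customizationType": "DISTILLATION",
    "customizationConfig": {
        "distillationConfig": {
            "teacherModelConfig": {
                "teacherModelIdentifier": "amazon.nova-premier-v1:0",
                "maxResponseLengthForInference": 512,
            }
        }
    },
    "trainingDataConfig": {"s3Uri": "s3://my-bucket/train/routing.jsonl"},
    "outputDataConfig": {"s3Uri": "s3://my-bucket/output/"},
}

# Submission and polling (requires AWS credentials, so commented out):
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_model_customization_job(**job_args)
# status = bedrock.get_model_customization_job(
#     jobIdentifier=response["jobArn"])["status"]
```

Because the job runs asynchronously, the `get_model_customization_job` status check is what you would poll (or watch in the console) until training finishes.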

For the specific scale used here (10,000 labeled examples with Nova Micro as the student), the paper claims training completes "within a few hours."

Deployment then splits into two options: Provisioned Throughput, which targets predictable high-volume traffic, and On-Demand Inference, which charges based on usage without requiring upfront commitments. For teams starting out, the authors recommend On-Demand.

They list concrete On-Demand advantages: no endpoint pre-allocation, no hourly commitment, and no minimum usage requirement.

Once the model reaches the InService state, the paper says you can invoke it like other base models using either `InvokeModel` or the `Converse` API.
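Calling the distilled router through the `Converse` API could then look like the sketch below. The model ARN is a placeholder, the routing query is invented, and the runtime call is commented out since it requires credentials and a deployed model.

```python
# Converse request for the distilled router. The modelId ARN is a
# placeholder; substitute the ARN shown for your custom model deployment.
request = {
    "modelId": "arn:aws:bedrock:us-east-1:123456789012:"
               "custom-model-deployment/example-id",
    "messages": [
        {"role": "user",
         "content": [{"text": "Find clips where the narrator mentions solar panels"}]},
    ],
    # Routing labels are short, so keep the output budget small and
    # the sampling deterministic.
    "inferenceConfig": {"maxTokens": 32, "temperature": 0.0},
}

# import boto3
# runtime = boto3.client("bedrock-runtime")
# reply = runtime.converse(**request)
# intent = reply["output"]["message"]["content"][0]["text"]
```

The same request body works against base models, which is the point the paper makes: once InService, the distilled model is invoked like any other.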

Finally, the pricing model is spelled out in token terms for Nova Micro inference. The authors state that you pay only the Nova Micro inference rate for tokens, with input priced at $0.000035 per 1,000 tokens and output priced at $0.000140 per 1,000 tokens.
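At those quoted rates, a quick worked example shows what a routing call costs. The per-request token counts below are illustrative assumptions, not figures from the paper.

```python
# Quoted Nova Micro on-demand rates, converted to dollars per token.
INPUT_RATE = 0.000035 / 1000   # $0.000035 per 1,000 input tokens
OUTPUT_RATE = 0.000140 / 1000  # $0.000140 per 1,000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single inference call at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Assumed sizes: a 1,200-token routing prompt and a 10-token intent label.
per_request = request_cost(1200, 10)
per_million = per_request * 1_000_000
print(f"${per_request:.8f} per request, ${per_million:.2f} per million")
```

Under those assumptions each routed query costs a few thousandths of a cent, which is what makes the "95% or more" cost reduction versus a larger router plausible at high query volumes.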

The practical shift developers feel, the paper argues, is that distillation breaks the coupling between routing complexity and routing delay. As routing logic becomes more complex, latency and cost normally rise together; the distilled router aims to sever that relationship by baking the routing behavior into a smaller model.

In video semantic search, that means you can keep the routing step as a model rather than an ever-growing prompt-and-metadata exercise, and you can better control operating costs even as an organization’s metadata landscape expands.