The modern job hunt has devolved into a grueling exercise in pattern matching. Candidates spend hours scanning dense job descriptions, manually cross-referencing their resumes against a checklist of required skills, only to wonder why a specific role is a better fit than another. While large language models can automate this, the latency and cost of running frontier-scale models for every single resume-to-job comparison make them impractical for real-time, scalable services. The industry is now shifting toward a more surgical approach: taking the sophisticated reasoning of a giant and compressing it into a lean, specialized engine.
The Architecture of Knowledge Distillation
To bridge the gap between massive reasoning capabilities and deployment efficiency, the development team built Job Searcher using a teacher-student distillation framework. The teacher model, DeepSeek V4 Pro, was selected for its superior structural reasoning and its ability to adhere to strict output schemas. Rather than using DeepSeek V4 Pro for real-time inference, the team utilized it as an offline label generator. This process involved creating a closed-loop corpus specifically designed for resume recognition, with all configuration files managed within the `build-small-hackathon/job-search-distill` repository.
The student model, Qwen3-8B, was tasked with absorbing these high-fidelity reasoning patterns. To ensure the model could run on limited hardware, the team applied Q4_K_M quantization. This specific quantization level allowed the 8B model to fit within a single ZeroGPU slice, maintaining a balance between memory efficiency and predictive accuracy. By training on the structured judgments produced by DeepSeek V4 Pro, Qwen3-8B evolved from a general-purpose small model into a specialized agent capable of producing a shortlist of jobs backed by logical, defensible evidence.
Infrastructure played a critical role in this optimization. The team leveraged the Modal platform, utilizing a single A100 GPU to perform two distinct rounds of Low-Rank Adaptation Supervised Fine-Tuning (LoRA SFT). These rounds focused on two core competencies: query generation and suitability evaluation. For deployment, the team integrated `llama-cpp-python` with pre-built CUDA wheels on a HuggingFace ZeroGPU Space to maximize hardware acceleration. To ensure a seamless user experience, they implemented the `create_chat_completion(stream=True)` function, which aligns with OpenAI API specifications and allows the model's reasoning process to stream to the UI token by token. The live implementation is available at huggingface.co/spaces/build-small-hackathon/job-search-assistant.
Solving Format Leakage via LoRA Hot-Swapping
During the initial development phase, the team attempted to consolidate both query generation and suitability evaluation into a single LoRA adapter. This approach led to a critical failure known as format leakage. The model struggled to maintain a boundary between two fundamentally different output styles. When the system needed a strict JSON format for query generation, it would occasionally bleed in prose-style sentences from the evaluation task. Conversely, during the suitability assessment, the model would unexpectedly inject JSON structures into its natural language explanations, breaking the user interface and degrading the quality of the reasoning.
The solution was a structural pivot to a hot-swap strategy. Instead of one multipurpose adapter, the team deployed two separate LoRA adapters on the same base Qwen3-8B model: one dedicated exclusively to query generation and another to suitability evaluation. The system now dynamically swaps the active adapter based on the current task. This separation eliminates weight interference, ensuring that the model adheres to the required schema without cross-contamination. This architectural choice allowed a small 8B model to achieve the level of schema discipline typically reserved for models ten times its size.
This precision was further enhanced by refining the teacher model's prompts. The team moved away from ambiguous instructions like strong technical matching and instead forced DeepSeek V4 Pro to use concrete evidence, such as noting that a candidate has 4 years of Rust experience while the role requires 5. This habit of granular contrast was distilled directly into Qwen3-8B, enabling the student model to provide specific, evidence-based justifications for its rankings. The entire build process was orchestrated using the AI coding agent Claude Code, and the resulting raw JSONL events and trial-and-error logs have been released as the HuggingFace agent-traces dataset.
This project demonstrates that the perceived intelligence of a model is often less about the number of parameters and more about the quality of the distillation pipeline. By combining strategic quantization, adapter hot-swapping, and high-fidelity labeling, it is possible to migrate frontier-level reasoning into a low-cost, high-speed production environment.




