Gemma 4 12B OBLITERATED Hits 0% Refusal Rate via Weight Surgery

The modern developer experience with large language models is often defined by a frustrating tug-of-war between utility and safety. For those building autonomous agents or running complex red-teaming exercises, the dreaded refusal message (the AI's insistence that it cannot fulfill a request due to safety guidelines) acts as a hard wall. Historically, the community has tried to bypass these walls through uncensored fine-tuning, but such efforts frequently cause cognitive decline: the model loses reasoning capability in exchange for its lack of inhibitions. This trade-off has long been accepted as an inevitable cost of removing guardrails.

The Metrics of Total Unrestriction

Into this tension steps Gemma 4 12B OBLITERATED, a specialized iteration of the Gemma 4 architecture released on Hugging Face by the developer OBLITERATUS. The primary objective of this release was not merely to create a model that says yes, but to create one that says yes without becoming less intelligent. To verify this, the developer ran the model against a battery of 842 test prompts specifically designed to trigger safety refusals. The reported result was a 0% refusal rate (0 of 842), meaning the model provided a response to every single prompt regardless of the restrictions imposed by the original safety training.
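A refusal rate of this kind is simply the share of responses classified as refusals. The article does not say how the developer classified responses, so the following is only a minimal sketch that assumes a keyword heuristic; `REFUSAL_MARKERS`, `is_refusal`, and `refusal_rate` are hypothetical names, and the sample responses are invented for illustration.

```python
# Hypothetical refusal markers (an assumption, not the developer's method).
# A real evaluation would likely use a stronger classifier or human review.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """Return True if a known refusal phrase appears near the start."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals, between 0.0 and 1.0."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


# Invented sample responses, two of which are refusals.
sample = [
    "Sure, here is an overview of the topic.",
    "I'm sorry, but I can't help with that.",
    "Step 1: gather the materials.",
    "I cannot assist with this request.",
]
print(refusal_rate(sample))  # 0.5
```

A keyword check like this can miss soft refusals or flag harmless apologies, so a 0% figure is only as strong as the classifier behind it.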

Crucially, this lack of censorship did not come at the expense of the model's intellectual capacity. The developer utilized the MMLU-Pro benchmark, a challenging evaluation of massive multitask language understanding, to ensure the model's reasoning remained intact. Gemma 4 12B OBLITERATED scored 46/70, or 65.7%, which is identical to the performance of the original, censored base model. This parity suggests that the knowledge and logic encoded in the weights were not degraded during the uncensoring process. For developers looking to deploy the model locally, it is available via the following command:

bash

huggingface-cli download elder-plinius/Gemma-4-12B-OBLITERATED

The Mechanics of Weight Surgery

What separates this model from previous uncensored attempts is the abandonment of traditional retraining. Many uncensored models rely on further training, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), to override safety behavior. These methods essentially try to teach the model new behaviors, which can overwrite the existing neural pathways responsible for high-level reasoning, leading to the aforementioned cognitive decline. Instead, OBLITERATUS employed a technique known as weight surgery, which treats the model's internal parameters as a geometric space to be edited rather than as something to be retrained on a dataset.

The process followed a two-stage pipeline. The first stage involved the application of SOM, a technique designed to identify and remove the specific geometric directions within the model's weights that trigger refusal responses. This surgery was targeted specifically at layers 12 through 21. By confining the edit to these layers, the developer was able to maintain a KL divergence (Kullback-Leibler divergence) of 0.094, indicating that the model's overall output probability distribution remained close to the original despite the intervention.
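The article does not spell out what SOM does internally, so the sketch below is not the developer's method. It shows a generic form of directional ablation, assuming the refusal behavior is captured by a single direction that can be projected out of each weight matrix, along with the standard formula for KL divergence. The direction, the toy 16-dimensional matrices, and the names `ablate_direction` and `kl_divergence` are all hypothetical.

```python
import numpy as np


def ablate_direction(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of the output space of `weight`.

    weight: (d_out, d_in) matrix; direction: (d_out,) vector.
    Returns W - r r^T W, where r is the unit-normalized direction.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete probability distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
d_model = 16  # toy size, far smaller than a real model
refusal_dir = rng.normal(size=d_model)  # stand-in for a measured direction

# Toy weights for layers 12 through 21, the range named in the article.
original = {layer: rng.normal(size=(d_model, d_model)) for layer in range(12, 22)}
edited = {layer: ablate_direction(w, refusal_dir) for layer, w in original.items()}

# After ablation, no edited matrix can write along the removed direction.
unit = refusal_dir / np.linalg.norm(refusal_dir)
residual = max(np.abs(unit @ w).max() for w in edited.values())
print(f"largest remaining component along the direction: {residual:.2e}")

# KL divergence between two made-up next-token distributions.
p = np.array([0.70, 0.20, 0.10])
q = np.array([0.60, 0.25, 0.15])
print(f"KL(p || q) = {kl_divergence(p, q):.4f}")
```

In a real evaluation, `p` and `q` would be the original and edited models' next-token distributions averaged over many prompts. A value near zero means the edit barely changed what the model predicts.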

However, the initial removal of these directions caused a slight dip in MMLU-Pro scores, revealing a hidden dependency between the refusal mechanism and certain reasoning capabilities. To resolve this, the second stage introduced ASPA, or gradual gradient source tethering. This process focused on layers 22 through 46, where the modified weights were subtly blended back toward their original directions. Through a sweep of gamma values, the developer found a setting where the refusal behavior remained suppressed but benchmark performance was restored. To check that this wasn't a fluke, a Z-test was performed on the sample means. Because both models scored 46/70, such a test yields a p-value well above 0.05, so it detects no significant performance gap between the original model and the OBLITERATED version. With only 70 questions, however, it cannot rule out small differences. Furthermore, the model achieved perfect scores across six different consistency checks, suggesting that the surgery did not introduce logical instability or hallucinations.
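Two ideas in this paragraph can be made concrete. The sketch below assumes that "blending back" means linear interpolation controlled by gamma, which the article does not confirm, and it uses a pooled two-proportion z-test as one plausible reading of the Z-test mentioned. The function names and toy matrices are hypothetical. Only the 46/70 scores come from the article.

```python
import math

import numpy as np


def blend_toward_original(edited: np.ndarray, original: np.ndarray,
                          gamma: float) -> np.ndarray:
    """Linear blend: gamma=0 keeps the edit, gamma=1 restores the original."""
    return (1.0 - gamma) * edited + gamma * original


def two_proportion_z_test(correct_a: int, correct_b: int,
                          n_a: int, n_b: int) -> tuple[float, float]:
    """Pooled two-sided z-test for a difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p_value


# Toy gamma sweep: distance from the original shrinks as gamma grows.
rng = np.random.default_rng(0)
w_original = rng.normal(size=(4, 4))
w_edited = w_original + 0.1 * rng.normal(size=(4, 4))
for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    w = blend_toward_original(w_edited, w_original, gamma)
    print(f"gamma={gamma:.2f}  distance={np.linalg.norm(w - w_original):.4f}")

# Scores reported in the article: 46/70 for both models.
z, p_value = two_proportion_z_test(46, 46, 70, 70)
print(f"accuracy = {46 / 70:.1%}, z = {z:.2f}, p = {p_value:.3f}")
# accuracy = 65.7%, z = 0.00, p = 1.000
```

Identical scores give z = 0 and a two-sided p-value of 1.0, which means the test finds no difference. That is weaker than proving the two models are equivalent, especially with only 70 questions.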

This model is not intended for general consumer application or public-facing services. Instead, it serves as a critical instrument for mechanistic interpretability research, allowing scientists to analyze exactly how safety guardrails are encoded within a neural network. It also provides a baseline for red-teaming efforts, where security researchers can test the absolute limits of a model's capabilities without the interference of built-in filters. For those implementing the model in a local Python environment, the following configuration is used:

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "elder-plinius/Gemma-4-12B-OBLITERATED"

# Load the tokenizer and model; device_map="auto" requires the accelerate package
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Move inputs to the model's device instead of hard-coding "cuda"
inputs = tokenizer("Your unrestricted prompt here", return_tensors="pt").to(model.device)

# Generate up to 256 new tokens and decode them to text
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

By demonstrating that refusal mechanisms can be surgically removed without damaging the underlying intelligence, this work shifts the conversation from how to train safer models to how to precisely map the internal geography of AI alignment.
