
H-Neurons: The Sparse Circuitry Behind LLM Hallucinations

In December 2025, a team of researchers from the Institute for Artificial Intelligence at Tsinghua published a finding that reframes how we understand hallucinations in large language models. The paper demonstrates that an exceptionally sparse subset of neurons reliably predicts when an LLM will hallucinate: less than 0.1% of total neurons, and in Mistral-7B between 0.01‰ and 0.35‰. The implication is immediate: hallucinations are not uniformly distributed statistical noise; they are a localizable, intervenable phenomenon.

Main Finding — Tsinghua IAI, December 2025

"A remarkably sparse subset of neurons — less than 0.1% of total — can reliably predict hallucination occurrences with strong cross-scenario generalization." H-Neurons are not a Mistral or Llama artifact: they appear in all evaluated transformer families, from 4B to 70B parameters.

  • <0.1% · neurons predicting hallucinations
  • 6 · model families evaluated
  • 78.4% · Mistral-7B accuracy on TriviaQA
  • +16.7pp · over random baseline (Mistral-7B)

What Are H-Neurons and How Are They Identified

The paper defines H-Neurons as neurons in transformer feedforward networks (FFN) whose activation systematically predicts the occurrence of hallucinations. Identification combines three stages: construction of a deterministic dataset, a normalized contribution metric (CETT), and sparse classification via L1 logistic regression.

H-Neuron Identification Pipeline

Stage 1 · Dataset: TriviaQA, 1,000 correct + 1,000 incorrect examples with deterministic behavior (consistent across 10 samples per question). The deterministic filter ensures the signal is not stochastic noise.

Stage 2 · CETT metric: CETT (Contribution of Each neuron to the Total output) normalizes the magnitude of each neuron's projected output against the layer's total output vector. It is embedding-dimension agnostic, enabling cross-layer and cross-model comparison.

Stage 3 · Classification: L1-regularized logistic regression over FFN neurons; positive weights identify H-Neurons. The L1 penalty forces maximum sparsity, isolating the minimal subset with the highest predictive power (0.01‰–0.35‰ of total, i.e. <0.1%).
CETT Metric — relative contribution of neuron i in layer l
$$\mathrm{CETT}(i,\,l) = \frac{\bigl\|\mathbf{W}_{\mathrm{out}}[i]\cdot h_i\bigr\|}{\sum_j \bigl\|\mathbf{W}_{\mathrm{out}}[j]\cdot h_j\bigr\|}$$

$\mathbf{W}_{\mathrm{out}}[i]$ = column $i$ of the output projection matrix; $h_i$ = activation of neuron $i$. CETT captures relative influence on output direction, not absolute magnitude.
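As a concrete illustration, here is a minimal PyTorch sketch of the CETT computation for a single FFN layer; tensor shapes follow the definitions above, and the function and variable names are illustrative rather than the paper's code:

import torch

def cett_scores(h, W_out):
    """CETT per neuron: norm of each neuron's projected contribution,
    normalized by the sum over all neurons in the layer.

    h:     (batch, d_ff) intermediate FFN activations
    W_out: (d, d_ff)     output (down) projection matrix
    """
    projected = h.unsqueeze(-1) * W_out.T           # (batch, d_ff, d)
    norms = projected.norm(dim=-1)                  # (batch, d_ff)
    return norms / norms.sum(dim=-1, keepdim=True)  # rows sum to 1

h = torch.randn(2, 8)      # toy layer: 8 intermediate neurons
W_out = torch.randn(4, 8)  # embedding dimension d = 4
print(cett_scores(h, W_out).sum(dim=-1))  # tensor([1., 1.])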
L1 Objective Function for H-Neuron Identification
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left[\; -\frac{1}{N}\sum_{n=1}^N \Bigl(y_n \ln\sigma(\mathbf{w}^\top \mathbf{c}_n) + (1-y_n)\ln\bigl(1-\sigma(\mathbf{w}^\top \mathbf{c}_n)\bigr)\Bigr) + \lambda\|\mathbf{w}\|_1 \;\right]$$

$y_n \in \{0,1\}$ = label (correct / hallucination); $\mathbf{c}_n$ = CETT vector for sample $n$; $\lambda$ = L1 regularization strength. The $\|\mathbf{w}\|_1$ penalty forces most weights to exactly zero; only neurons with genuine predictive power retain $w_i > 0$, and these are the H-Neurons.
Transformer FFN Architecture — H-Neuron Location
[Figure: transformer block. Input embedding $x \in \mathbb{R}^d$ → multi-head self-attention → Add & LayerNorm → FFN block ($W_{\mathrm{up}} \cdot \sigma \cdot W_{\mathrm{down}}$) → Add & LayerNorm → next layer. H-Neurons sit among the intermediate FFN neurons ($d_{\mathrm{ff}} = 4d$); the 5/32 = 15.6% share drawn is illustrative, the actual fraction is <0.1%.]
H-Neuron Density per Layer (Mistral-7B-v0.3, 32 FFN layers)
[Figure: H-Neuron density per layer, ranging from 0.01‰ to 0.35‰ across the 32 FFN layers; layers 11–16 show the peak concentration.]

H-Neurons concentrate in the upper-middle layers — precisely where prior mechanistic interpretability research identifies "knowledge retrieval" and "fact composition" circuits.

CETT Distribution: H-Neurons vs. Regular Neurons (Mistral-7B)
[Figure: CETT score densities. Regular neurons show a single overlapping distribution across all samples; H-Neurons show separable distributions for hallucinated vs. correct responses.]

H-Neurons exhibit statistically different CETT distributions between correct and incorrect responses. This separability is the basis of their predictive power. Regular neurons show overlapping distributions — they cannot distinguish hallucinations.


Results: Universal Generalization Across Families and Scales

| Model | Parameters | H-Neurons (fraction of total) | TriviaQA Accuracy | vs. random baseline |
|---|---|---|---|---|
| Mistral-7B-v0.3 | 7B | 0.01‰–0.35‰ | 78.4% | +16.7pp |
| Mistral-Small-3.1 | 24B | <0.1% | High | ~+10pp |
| Gemma-3-4B | 4B | <0.1% | Consistent | ~+10pp |
| Gemma-3-27B | 27B | <0.1% | Consistent | ~+10pp |
| Llama-3.1-8B | 8B | <0.1% | Consistent | ~+10pp |
| Llama-3.3-70B | 70B | <0.1% | Consistent | ~+10pp |

The consistency across Mistral, Gemma, and Llama — and across scales from 4B to 70B — is the paper's most robust result. H-Neurons are not an artifact of a specific model family: they are a universal emergent property of feedforward transformers. The paper also demonstrates cross-scenario generalization: H-Neurons identified on TriviaQA predict hallucinations in completely different domains — confirming they capture a general over-compliance mechanism, not a factual domain signal.

Cross-Scenario Generalization: Detailed AUROC

The paper's most provocative finding is that H-Neurons identified in one scenario (TriviaQA — factual hallucination) predict problematic behaviors in semantically disjoint scenarios. The following table shows cross-scenario transfer AUROC scores for Mistral-7B-v0.3:

| Source → Target | AUROC (H-Neurons) | AUROC (Random) | Δ | Significance |
|---|---|---|---|---|
| TriviaQA → TriviaQA | 0.784 | 0.617 | +0.167 | p < 0.001 |
| TriviaQA → FalseQA | 0.721 | 0.523 | +0.198 | p < 0.001 |
| TriviaQA → FaithEval | 0.693 | 0.510 | +0.183 | p < 0.001 |
| TriviaQA → Sycophancy | 0.667 | 0.498 | +0.169 | p < 0.001 |
| TriviaQA → Jailbreak | 0.651 | 0.505 | +0.146 | p < 0.01 |
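For readers who want to reproduce this kind of transfer measurement, a minimal sketch follows, assuming per-sample CETT features restricted to H-Neurons and binary hallucination labels for each scenario (all names are illustrative, not the paper's code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def transfer_auroc(X_src, y_src, X_tgt, y_tgt):
    """Fit a detector on the source scenario, score it on the target.

    X_*: (N, n_h_neurons) CETT features restricted to H-Neurons
    y_*: (N,) binary labels, 1 = hallucination / over-compliance
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_src, y_src)
    return roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1])

# e.g. transfer_auroc(X_trivia, y_trivia, X_falseqa, y_falseqa) ≈ 0.72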
Theoretical Implication

Cross-scenario generalization is evidence that H-Neurons encode a general "over-compliance" mechanism — not a domain-specific factual signal. This connects hallucination, sycophancy, and jailbreak compliance as manifestations of the same underlying circuit. It is the first individual-neuron-level evidence that these phenomena share computational substrate.

AUROC by Model Family and Evaluation Scenario
[Figure: AUROC by model family (Mistral-7B, Llama-3.3-70B, Gemma-3-27B) and scenario (TriviaQA, FalseQA, FaithEval, Sycophancy, Jailbreak), spanning roughly 0.5–0.8.]
Mutual Information Between H-Neuron Activation and Hallucination Event
$$I(H;\,Y) = \sum_{h \in \{0,1\}} \sum_{y \in \{0,1\}} p(h,y) \log \frac{p(h,y)}{p(h)\,p(y)} \;\gg\; I(R;\,Y)$$

$H$ = H-Neuron high-activation indicator; $Y$ = hallucination indicator; $R$ = random neuron (baseline). The inequality holds across all layers containing H-Neurons, for all 6 evaluated models.
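A direct way to check this inequality is to estimate both sides from binarized activation and hallucination indicators. A minimal NumPy sketch, with illustrative names:

import numpy as np

def binary_mutual_information(h, y):
    """I(H; Y) in nats for two binary 0/1 indicator arrays."""
    h, y = np.asarray(h), np.asarray(y)
    mi = 0.0
    for hv in (0, 1):
        for yv in (0, 1):
            p_hy = np.mean((h == hv) & (y == yv))
            p_h, p_y = np.mean(h == hv), np.mean(y == yv)
            if p_hy > 0:
                mi += p_hy * np.log(p_hy / (p_h * p_y))
    return mi

# h: 1 if an H-Neuron's activation exceeds its threshold, else 0
# y: 1 if the response was judged a hallucination, else 0
# Compare against a random neuron's indicator to verify I(H;Y) >> I(R;Y).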

Four Dimensions of Over-Compliance Induced by α-Scaling

The central experiment of the paper is direct intervention: scaling H-Neuron activations by a factor α ∈ [0, 3]. The result is unambiguous: amplifying H-Neurons (α > 1) systematically increases problematic behavior rates across four independent dimensions.

α-Scaling Experiment on H-Neurons
  • α = 0: full suppression · α = 1: baseline · α = 2: amplification · α = 3: maximum
  • Steeper slope (≈ 3.03): Mistral-7B, Gemma-3-4B, Llama-3.1-8B; higher sensitivity to amplification
  • Shallower slope (≈ 2.40): Mistral-Small-24B, Gemma-3-27B, Llama-3.3-70B; some robustness, but not immunity
α-Scaling Intervention on H-Neurons
$$\tilde{h}_i = \alpha \cdot h_i, \quad \alpha \in [0,\,3]$$

$\alpha < 1$ → suppression (reduces over-compliance); $\alpha = 1$ → baseline; $\alpha > 1$ → amplification (induces hallucination). The model weights $\theta$ are never modified.
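A minimal sketch of this intervention as a PyTorch forward pre-hook on the FFN down-projection, so the intermediate activations are rescaled before being projected back to the residual stream. The Hugging Face-style module path is an assumption; a fuller implementation appears in the production section below:

import torch

def make_alpha_hook(h_indices, alpha: float):
    """Pre-hook for down_proj: rescales H-Neuron activations by alpha."""
    def pre_hook(module, inputs):
        h = inputs[0].clone()      # intermediate activation (batch, seq, d_ff)
        h[..., h_indices] *= alpha
        return (h,)                # replaces the down-projection's input
    return pre_hook

# Assumed Hugging Face layout: model.model.layers[l].mlp.down_proj
# handle = model.model.layers[14].mlp.down_proj.register_forward_pre_hook(
#     make_alpha_hook(h_indices, alpha=0.5))  # α < 1 → suppression
# ... run generation ...
# handle.remove()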
FalseQA

Invalid premises

When H-Neurons are amplified, the model increasingly accepts factually incorrect claims present in the prompt. H-Neuron activation predicts when the model will override its own knowledge to comply with the question's premise.

FaithEval

Misleading context

When context contradicts the model's knowledge, amplification increases the rate of misleading context adoption. High H-Neurons = higher probability the model will "believe" the context over its training.

Sycophancy

Sycophantic tendency

With α > 1, the model tends to validate user-expressed preferences even when incorrect. The correlation with H-Neurons suggests sycophancy and factual hallucination share an underlying mechanism.

Jailbreak

Harmful instructions

Amplification increases compliance rates against jailbreak attempts. H-Neurons appear to be the general "over-compliance" mechanism — of which factual hallucinations are one specific manifestation.

Over-Compliance Rate vs. α Factor — All Families
[Figure: over-compliance rate (0–100%) vs. scaling factor α ∈ [0, 3], with α = 1 as baseline. Approximately linear for all families: Mistral-7B slope ≈ 3.03, Gemma-3-4B ≈ 3.15, Llama-3.3-70B ≈ 2.40; suppression zone at α < 1, amplification zone at α > 1.]

Smaller models exhibit steeper slopes: they are more sensitive to H-Neuron amplification. The relationship is approximately linear with R² > 0.94 across all models. The suppression zone (α < 1) consistently reduces over-compliance rates without degrading general model capabilities.

Suppression Effectiveness: Relative Over-Compliance Reduction
$$\Delta_{\mathrm{OC}}(\alpha) = \frac{\mathrm{OC}(\alpha{=}1) - \mathrm{OC}(\alpha)}{\mathrm{OC}(\alpha{=}1)} \times 100\%, \quad \alpha \in [0.3,\,0.8]$$

$\mathrm{OC}(\alpha)$ = over-compliance rate at factor $\alpha$. In Mistral-7B, $\Delta_{\mathrm{OC}}(0.5) \approx 38\%$: more than a one-third reduction in over-compliance with no retraining, via inference-time intervention alone.
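As a worked example with hypothetical numbers (only the ≈38% figure is from the paper; the individual rates are illustrative): if $\mathrm{OC}(\alpha{=}1) = 0.42$ and $\mathrm{OC}(\alpha{=}0.5) = 0.26$, then

$$\Delta_{\mathrm{OC}}(0.5) = \frac{0.42 - 0.26}{0.42} \times 100\% \approx 38\%$$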

Pre-Training Origin: RLHF Does Not Eliminate the Mechanism

AUROC Transferability — H-Neurons from tuned model → base model
$$\mathrm{AUROC}\bigl(\mathcal{H}_{\mathrm{tuned}},\; \mathcal{D}_{\mathrm{base}}\bigr) \;\gg\; \mathrm{AUROC}\bigl(\mathcal{H}_{\mathrm{rand}},\; \mathcal{D}_{\mathrm{base}}\bigr)$$

$\mathcal{H}_{\mathrm{tuned}}$ = H-Neurons identified in the instruction-tuned model; $\mathcal{D}_{\mathrm{base}}$ = base-model evaluation dataset; $\mathcal{H}_{\mathrm{rand}}$ = random neurons (baseline). The inequality holds across all 6 evaluated models.
Pre-training
  • H-Neurons emerge here
  • AUROC exceeds baseline in base models
  • Low normalized rank → minimal post-training modification
  • The mechanism is fixed in base weights

Fine-tuning (RLHF / SFT)
  • Does not eliminate H-Neurons
  • Does not substantially modify their influence
  • Mitigates behavioral expression
  • Does not touch the underlying mechanism

The AUROC transferability analysis is the most important piece of evidence: the authors take H-Neurons identified in instruction-tuned models and verify their predictive power in the corresponding base models (before RLHF). AUROC scores consistently exceed random baselines, proving H-Neurons are not created by fine-tuning: they were already there. Parameter analysis confirms it: H-Neurons concentrate in the low normalized-rank region, indicating their values change minimally during RLHF and SFT. RLHF and Constitutional AI can suppress the expression of hallucinations, but they leave the mechanism intact.

Three Production Intervention Vectors

Vector 1 · No weight modification

Real-time detection

Monitor H-Neuron activations during inference. When they exceed the threshold, emit a low confidence score or block the response. Implementable today with access to model intermediate states — no retraining.

Vector 2 · At inference

α-Scaling suppression

Apply α < 1 to identified H-Neuron activations during the forward pass. Reduces over-compliance rate without retraining. Preserves general model capability — only attenuates the hallucination circuit.

Vector 3 · Directed fine-tuning

Localized regularization

Fine-tuning with specific regularization over H-Neurons: penalize high activations in over-compliance contexts. More efficient than full RLHF — works on the mechanism, not just the behavioral expression.
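As a sketch of what such a localized penalty could look like (an assumption about the implementation, not the paper's recipe): the task loss is augmented with an L2 penalty on H-Neuron activations, captured via a pre-hook on the down-projection. The penalty weight and module path are illustrative:

import torch

class HNeuronPenalty:
    """Collects H-Neuron activations during the forward pass so an L2
    penalty can be added to the fine-tuning loss."""

    def __init__(self, model, h_indices, layer_idx, weight=0.01):
        self.h_indices = h_indices          # neuron indices within this layer
        self.weight = weight                # penalty strength (illustrative)
        self.penalty = torch.tensor(0.0)
        down_proj = model.model.layers[layer_idx].mlp.down_proj  # assumed layout
        down_proj.register_forward_pre_hook(self._pre_hook)

    def _pre_hook(self, module, inputs):
        h = inputs[0]                       # (batch, seq, d_ff), in the graph
        self.penalty = h[..., self.h_indices].pow(2).mean()

    def loss(self, task_loss):
        """Task loss plus the localized H-Neuron activation penalty."""
        return task_loss + self.weight * self.penalty

# Training-step sketch:
# out = model(**batch)              # forward pass fills reg.penalty
# loss = reg.loss(out.loss)         # penalize high H-Neuron activations
# loss.backward(); optimizer.step()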

Orthogonal combination

LLM governance stack

All three vectors are orthogonal: they can be combined. Real-time detection for alerts, α-scaling for immediate suppression, directed fine-tuning for permanent reduction. Defense-in-depth architecture against hallucinations.

LLM Governance Stack with H-Neurons
  • Offline identification: deterministic dataset (TriviaQA-style) → CETT computation per layer and neuron → L1 regression → H-Neuron indices
  • Inference monitoring: forward-pass hook on H-Neuron activations → anti-hallucination confidence score → additional latency <5ms
  • Active suppression: α-scaling in the forward pass (α ∈ [0.3, 0.8]) → domain-configurable threshold → no retraining, runtime only
  • Permanent improvement: fine-tuning with H-Neuron regularization → localized adversarial training → lower cost than full RLHF
  • Traceability: H-Neuron activation log per response → post-incident audit with neural evidence
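One way to make the domain-configurable pieces of this stack concrete is a small per-deployment config object; every field name and default below is illustrative:

from dataclasses import dataclass

@dataclass
class HNeuronStackConfig:
    """Per-domain settings for the governance stack (all values illustrative)."""
    h_neuron_index_path: str = "h_neurons_mistral7b.npy"  # Phase 1 output
    monitor_layers: tuple = (11, 12, 13, 14, 15, 16)      # peak-density layers
    alert_threshold: float = 0.5   # confidence below this triggers an alert
    alpha: float = 0.7             # suppression factor, kept in [0.3, 0.8]
    log_activations: bool = True   # traceability: per-response H-Neuron log

# A stricter profile for a hypothetical high-stakes domain:
medical = HNeuronStackConfig(alert_threshold=0.3, alpha=0.5)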
Normalized Rank Distribution: H-Neurons vs. Random Neurons (base vs. instruction-tuned)
[Figure: normalized-rank distributions (0 = least RLHF change, 1 = most). H-Neurons concentrate at low rank, i.e. RLHF barely modifies them; random neurons spread across the full range.]

H-Neurons concentrate in the low normalized-rank region: their weights barely change during RLHF. This demonstrates the hallucination mechanism is fixed during pre-training and is resilient to post-training alignment.

Production Implementation: Pseudo-Code and Architecture

For engineers looking to implement H-Neuron detection and suppression in production, the pipeline has three phases: offline identification, inference monitoring, and active suppression. The following pseudo-code shows the PyTorch implementation with forward pass hooks:

Implementation Pipeline — Three Phases
  • Phase 1 · Offline: deterministic dataset → full forward pass → compute CETT per neuron → L1 logistic regression → H-Neuron indices
  • Phase 2 · Runtime: hook FFN layers → read H-Neuron activations → confidence score → log + alert
  • Phase 3 · Suppression: α-scaling hook (h_i ← α · h_i) → validate output → serve response
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression

# ── Phase 1: Offline H-Neuron Identification ──
def compute_cett(model, dataloader, layer_indices):
    """Compute CETT scores for all FFN intermediate neurons.

    Hooks each FFN down-projection so its *input*, the intermediate
    activation of shape (batch, seq, d_ff), is captured. Assumes a
    Hugging Face-style layout (model.model.layers[i].mlp.down_proj).
    """
    cett_all = []   # final shape: (N_samples, N_layers * d_ff)
    labels = []     # 1 = hallucination, 0 = correct

    for batch in dataloader:
        activations = {}
        hooks = []
        for l_idx in layer_indices:
            down_proj = model.model.layers[l_idx].mlp.down_proj
            def hook_fn(module, inputs, output, l=l_idx):
                activations[l] = inputs[0]  # intermediate: (batch, seq, d_ff)
            hooks.append(down_proj.register_forward_hook(hook_fn))

        with torch.no_grad():
            model(**batch['input'])

        # CETT per neuron per layer, for the last token of each sample
        batch_cett = []
        for l_idx in layer_indices:
            h = activations[l_idx][:, -1, :]                        # (batch, d_ff)
            W_out = model.model.layers[l_idx].mlp.down_proj.weight  # (d, d_ff)
            projected = h.unsqueeze(-1) * W_out.T                   # (batch, d_ff, d)
            norms = projected.norm(dim=-1)                          # (batch, d_ff)
            cett = norms / norms.sum(dim=-1, keepdim=True)
            batch_cett.append(cett.float().cpu().numpy())

        cett_all.append(np.concatenate(batch_cett, axis=1))  # layers as features
        labels.extend(batch['labels'].tolist())               # assumed batch format

        for hk in hooks:
            hk.remove()

    return np.concatenate(cett_all, axis=0), np.array(labels)

# L1 logistic regression → H-Neuron indices
def identify_h_neurons(cett_matrix, labels, C=0.01):
    """Sparse classifier over CETT features; positive weights mark H-Neurons."""
    clf = LogisticRegression(
        penalty='l1', C=C, solver='saga', max_iter=5000
    )
    clf.fit(cett_matrix, labels)
    h_indices = np.where(clf.coef_[0] > 0)[0]  # flat (layer, neuron) indices
    return h_indices  # typically <0.1% of total neurons

# ── Phase 2: Runtime Monitoring (+ Phase 3: α-Scaling Suppression) ──
class HNeuronMonitor:
    """Pre-hook on down_proj: reads (and optionally rescales) the d_ff-dim
    intermediate activations before they are projected back to d."""

    def __init__(self, model, h_indices, layer_idx, alpha=1.0):
        self.h_indices = h_indices  # neuron indices *within this layer*
        self.alpha = alpha
        self.scores = []
        down_proj = model.model.layers[layer_idx].mlp.down_proj
        self.handle = down_proj.register_forward_pre_hook(self._pre_hook)

    def _pre_hook(self, module, inputs):
        h = inputs[0]  # (batch, seq, d_ff)
        self.scores.append(h[:, -1, self.h_indices].abs().mean().item())

        # Phase 3: α-scaling, suppression (α < 1) or amplification (α > 1)
        if self.alpha != 1.0:
            h = h.clone()
            h[..., self.h_indices] *= self.alpha
            return (h,)  # replaces down_proj's input

    def confidence(self, threshold=0.5):
        """Anti-hallucination confidence: 1.0 = safe, 0.0 = likely hallucination."""
        return 1.0 - min(self.scores[-1] / threshold, 1.0)
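A hedged end-to-end usage sketch under the same assumptions (Hugging Face-style module paths, a dataloader yielding batch['input'] and batch['labels']); flag_for_review is a hypothetical handler:

# identify_h_neurons returns flat indices over the concatenated
# (layer, neuron) feature axis, so map them back per layer.
layer_indices = list(range(8, 20))                 # upper-middle layers
cett, labels = compute_cett(model, dataloader, layer_indices)
h_flat = identify_h_neurons(cett, labels, C=0.01)  # Phase 1: offline

d_ff = model.config.intermediate_size              # HF config attribute
h_per_layer = {l: np.array([i % d_ff for i in h_flat
                            if layer_indices[i // d_ff] == l])
               for l in layer_indices}

# Phases 2+3: monitor (and mildly suppress) one peak-density layer
monitor = HNeuronMonitor(model, h_per_layer[14], layer_idx=14, alpha=0.7)
output = model.generate(**inputs)
if monitor.confidence(threshold=0.5) < 0.5:
    flag_for_review(output)                        # hypothetical handler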

Comparison with Prior Hallucination Detection Approaches

H-Neurons are not the first attempt to detect or mitigate LLM hallucinations, but the approach differs radically from existing ones in resolution, computational cost, and intervention capability:

| Approach | Analysis Level | Requires | Intervention Possible | Latency |
|---|---|---|---|---|
| CoT verification | Output text | Multiple forward passes | Post-hoc (detect) | ~3–5x |
| SelfCheckGPT | Output distribution | N samples (N ≥ 5) | Post-hoc (consensus) | ~Nx |
| Token entropy | Logit distribution | 1 forward pass + logits | Post-hoc (threshold) | ~1.05x |
| ITI / SAPLMA | Internal representations | Probe training | Real-time detection | ~1.1x |
| H-Neurons | Individual FFN neuron | One-time offline ID | Detection + suppression + FT | <1.01x |

The fundamental advantage of H-Neurons is direct intervention: the method not only detects, it suppresses. And unlike SelfCheckGPT, it does not require multiple forward passes; detection occurs within a single pass with negligible additional latency (<5ms).

Evolution of LLM Hallucination Detection (2022–2025)
[Timeline 2022–2025: SelfCheckGPT (sampling) → HaluEval (benchmark) → ITI (probing) → SAPLMA (internal probes) → H-Neurons (individual neuron + intervention). Resolution increases from text to distribution to representation to individual neuron.]

Open Questions and Limitations

Limitation 1

Temporal stability

Do H-Neurons identified at time t remain predictive at t+Δt? The paper does not evaluate H-Neuron drift over evolving distributions. In production, this requires periodic re-evaluation of the H-Neuron set.

Limitation 2

Mixture of Experts (MoE)

All evaluated models are dense transformers. Does the phenomenon hold in Mixtral, Switch Transformer, DeepSeek-V3? Expert routing may distribute the over-compliance mechanism differently across expert FFN blocks.

Limitation 3

Multi-step hallucinations

The paper evaluates hallucinations in short responses (single entity). Complex multi-step reasoning hallucinations — incorrect intermediate premises that accumulate — may require circuit-level analysis beyond the individual neuron level.

Limitation 4

Quantization

Does quantization (INT8, INT4, GPTQ, AWQ) preserve H-Neuron activations? If aggressive quantization modifies specific neuron activation distributions, predictive power could degrade. Critical for edge deployment.

Formal Definition — H-Neuron Property
$$\mathcal{N}_i \in \mathcal{H} \iff \hat{w}_i > 0 \;\land\; \mathrm{AUROC}\bigl(\mathrm{CETT}_i,\; \mathbf{y}\bigr) > \tau_{\mathrm{AUROC}} \;\land\; \mathrm{AUROC}_{\mathrm{cross}} > \tau_{\mathrm{cross}}$$

$\mathcal{N}_i$ = neuron $i$; $\hat{w}_i$ = estimated L1 weight; $\tau_{\mathrm{AUROC}}$ = in-domain AUROC threshold; $\tau_{\mathrm{cross}}$ = cross-scenario threshold. All three conditions are necessary: positive L1 weight, in-domain predictive power, and cross-scenario generalization.
Knowledge-Compliance Tradeoff: The H-Neuron Theoretical Model
[Figure: knowledge-compliance tradeoff. Axes: instruction compliance vs. factual fidelity, with a Pareto frontier. α = 0.5 (suppression) sits in the ideal zone of high knowledge and moderate compliance; α = 1.0 is the baseline; α = 2.0 (amplification) lands in the risk zone where active H-Neurons drive hallucination.]

H-Neurons encode the over-compliance mechanism. Higher activation means the model prioritizes instruction compliance over factual knowledge. Suppression (α < 1) moves the model toward the ideal zone. Amplification (α > 1) pushes it toward hallucinations, sycophancy, and jailbreak compliance.

Key Takeaways

  • Less than 0.1% of an LLM's FFN neurons (0.01‰–0.35‰ in Mistral-7B) predict when the model will hallucinate, with robust cross-domain and cross-family generalization (Mistral / Gemma / Llama, 4B–70B). H-Neurons are a universal transformer property, not an architectural artifact.
  • The CETT metric normalizes each neuron's relative influence on its layer's output direction — not absolute magnitude. This enables embedding-dimension-agnostic H-Neuron identification and cross-scale model comparison.
  • Amplifying H-Neurons (α > 1) systematically increases over-compliance across four dimensions: invalid premises (FalseQA), misleading context (FaithEval), sycophancy, and jailbreak. Smaller models are more sensitive (slope ≈ 3.03 vs ≈ 2.40 for larger models).
  • H-Neurons emerge in pre-training: AUROC scores in base models exceed random baselines. RLHF and Constitutional AI mitigate the behavioral expression of hallucinations but do not modify the underlying mechanism encoded in the base weights.
  • H-Neurons enable three orthogonal production intervention vectors: (1) real-time detection via activation monitoring (no weight modification, latency <5ms), (2) suppression via α-scaling at inference (no retraining), (3) directed fine-tuning with localized regularization (lower cost than full RLHF). All three can be combined into a defense-in-depth stack.
  • The paper's most disruptive finding: H-Neurons connect factual hallucination, sycophancy, and jailbreak compliance as manifestations of the same over-compliance circuit. This suggests defense against hallucinations and defense against jailbreaks can be addressed with the same neuron-level mechanism.