ICML 2026

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Arda Uzunoglu*, Alvin Zhang*, Daniel Khashabi

* Equal contribution.

Paper Code BibTeX

TLDR: Learning-to-Trust identifies which weak labels are reliable enough to supervise stronger students, recovering near ground-truth performance through trust-filtered weak supervision.

Approach

Our method treats weak-to-strong generalization primarily as a data selection problem. Rather than training a strong student on all weak labels equally, we introduce a trust function that assigns each weak label a scalar trust score reflecting its estimated reliability as supervision. In our instantiation, this trust function is a neural network that is trained on a labeled source distribution to predict whether a weak teacher’s output is correct from the teacher’s hidden states.

At deployment, the weak teacher generates labels on unlabeled target data, the neural trust function scores those labels, and only the highest-trust examples are retained for student training. This yields a filtered supervision set that preserves the benefit of weak labels while reducing harmful or misleading ones. Crucially, the same trust-filtering procedure can be applied iteratively: once a student is trained, it can be reused as the next teacher, producing a weak-to-strong chain that compounds gains across generations and often approaches ground-truth supervision with little loss.

Results

World knowledge 95.9%–113.9% recovery

Quant reasoning 89.2%–92.5% recovery

Decision making 76.1%–110.5% recovery

World Knowledge

OBQA, ARC-C, ARC-E, SciQ, and SIQA

Method	OLMo2-1B	OLMo2-7B	OLMo2-13B	Qwen3-0.6B	Qwen3-1.7B	Qwen3-4B	Qwen3-8B	Qwen3-14B
Teacher Performance		43.1				52.6
No-SFT	43.1	65.1	73.8	52.6	69.1	72.0	77.3	79.4
Naive	49.0	69.3	74.7	59.2	74.0	78.5	83.8	86.0
I-Confidence	49.2	69.2	75.1	59.4	74.3	78.8	83.8	85.7
ICL + I-Confidence	50.0	72.0	77.9	61.8	74.4	79.4	84.8	86.5
Ensemble	49.2	70.9	76.7	59.7	74.1	77.2	82.9	85.1
RM	46.1	68.9	78.4	59.0	71.7	77.0	82.6	86.1
NTF (Ours)	50.5	73.7	80.9	61.6	75.0	80.0	84.9	87.1
Ground Truth	49.6	73.8	81.2	61.9	74.8	80.6	85.2	87.0
Recovery (%)	113.9	98.9	95.9	96.8	103.5	93.0	96.2	101.3

Dark green: super-recovery. Light green: near-lossless. Red: GT better.

Quantitative Reasoning

AIME

Method	Qwen3-4B	Qwen3-8B	Qwen3-8B	Qwen3-8B
Teacher Performance	4.27	4.27	10.91	18.88
No-GRPO	10.9	18.9	18.9	18.9
Naive	19.6	26.0	27.2	26.8
I-Confidence	20.7	25.6	26.7	27.3
V-Confidence	18.7	25.9	27.1	26.5
NTF (Ours)	22.0	26.6	27.9	27.4
Ground Truth	22.9	27.4	28.7	28.4
Recovery (%)	92.5	91.0	91.7	89.2

Decision Making

Lichess Puzzles

Method	OLMo2-1B	OLMo2-7B	OLMo2-13B	Qwen3-0.6B	Qwen3-1.7B	Qwen3-4B	Qwen3-8B	Qwen3-14B
Teacher Performance		28.5				6.9
Naive	34.7	37.0	38.3	10.6	22.1	27.0	33.7	38.1
I-Confidence	37.1	39.9	40.7	11.5	23.0	36.9	36.3	37.5
V-Confidence	35.2	37.7	38.7	7.3	17.9	31.8	31.8	33.4
NTF (Ours)	37.4	41.2	41.5	15.5	25.4	35.4	38.0	44.1
Ground Truth	37.7	52.9	54.5	14.9	23.0	36.3	37.0	39.9
Recovery (%)	99.2	77.9	76.1	104.4	110.5	97.6	102.9	110.4

Mechanisms

Easy-First Curriculum

NTF shifts selection toward easier puzzles, acting like an implicit curriculum.

Optimal Alternatives

Some labels marked incorrect are still strong alternatives under engine evaluation.

Coherent Gradients

Selected examples create more aligned, lower-rank update directions.

Finding 1 tests whether NTF's advantage can be explained simply by a curriculum effect. By comparing NTF to a difficulty-matched naive baseline, the table isolates how much of the gain comes from selecting easier examples versus from trust-based filtering itself.

Finding 1

Difficulty-matched ablation

Method	Qwen3-1.7B	Qwen3-4B	Qwen3-8B	Qwen3-14B
NTF (Ours)	25.4	35.4	38.0	44.1
Naive	22.1	27.0	33.7	38.1
Naive-DM	24.8	29.7	32.7	35.9

Finding 2 asks whether NTF works only because it selects better training examples, or also because some labels marked incorrect by the dataset are still strong alternatives. Replacing NTF-selected teacher labels with ground truth helps separate the value of selection from the value of retaining these alternative labels.

Finding 2

Ground-truth relabeling on the NTF subset

Method	Qwen3-1.7B	Qwen3-4B	Qwen3-8B	Qwen3-14B
Ground Truth	23.0	36.3	37.0	39.9
NTF (Ours)	25.4	35.4	38.0	44.1
NTF-GT	24.8	35.3	36.2	43.6

BibTeX

@misc{uzunoglu2026trustfunctionsnearlosslessweaktostrong,
      title={Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher}, 
      author={Arda Uzunoglu and Alvin Zhang and Daniel Khashabi},
      year={2026},
      eprint={2606.01000},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.01000}, 
}