
Helping Language Models Reason to Remember the Knowledge They Already Have

TL;DR: we don't usually think of RLVR as useful for knowledge recall from parameters, but it helps a lot.

Motivation

Reasoning language models trained with Reinforcement Learning from Verifiable Rewards (RLVR) have achieved remarkable performance on mathematics and coding benchmarks. For example, frontier models achieve near-perfect accuracy on AIME and around 80% resolution rate on SWE-bench. These models learn to produce step-by-step reasoning traces, which are essential for strong performance in these domains.

But what about knowledge recall? Consider the question "What is the capital of Australia?" A human might think through major cities—Sydney...Melbourne—before recalling the concept of purpose-built capitals and arriving at Canberra. This kind of reasoning, known as spreading activation in cognitive science, is qualitatively different from the chain-of-thought reasoning used in mathematics. It remains unclear how well reasoning models use their reasoning traces to access their own parametric memory.

Language models can reason better on knowledge recall tasks, but they are not doing it yet.

We run a simple experiment: take four reasoning models, evaluate them on two closed-book QA datasets (TriviaQA and Natural Questions) and one math benchmark (MATH), with and without a "think step-by-step" cue appended to the prompt. If models already reason well for recall, the cue shouldn't matter.

But it does. Across all four models, the cue consistently improves knowledge recall—for example, +1.1% and +1.3% Ex-Recall (recall after using an LLM to extract a single best answer, preventing models from gaming the metric by hedging with multiple guesses) on TriviaQA and Natural Questions for GPT-OSS-20B. But the same cue doesn't help on MATH, where accuracy generally decreases. This asymmetry suggests that models are not performing their best reasoning for knowledge recall.
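The Ex-Recall metric can be sketched as extract-then-match. In the sketch below, `extract` is a hypothetical hook standing in for the LLM that pulls a single best answer out of each free-form output; the normalization (strip and lowercase) is our assumption:

```python
def ex_recall(outputs, golds, extract):
    """Ex-Recall sketch: `extract` (a hypothetical LLM-based hook) pulls one
    best answer out of each free-form output; we count exact matches against
    the gold answers after simple normalization."""
    hits = sum(
        1 for out, gold in zip(outputs, golds)
        if extract(out).strip().lower() == gold.strip().lower()
    )
    return hits / len(golds)
```

Forcing a single extracted answer per question is what prevents a model from hedging with several candidate guesses and getting credit for any of them.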

We also verify that reasoning helps knowledge access compared to direct answering: forcing the model to emit an empty <think> span causes an average 6.9% drop across models, confirming that reasoning traces genuinely contribute to knowledge recall.

Table 1. Performance with and without the think step-by-step cue. Ex-Recall for TriviaQA/NQ; accuracy for MATH. * = statistically significant at 95% level by McNemar's test (McNemar, 1947).

| Model | Cue? | Reasoning? | TriviaQA | NQ | MATH |
|---|---|---|---|---|---|
| GPT-OSS-20B | − | − | 45.2% | 24.6% | 25.9% |
| GPT-OSS-20B | − | + | 60.1% | 30.7% | 80.9% |
| GPT-OSS-20B | + | + | 61.2%* | 32.0% | 80.4% |
| OLMo-3-7B-Think | − | − | 41.4% | 23.1% | 71.6% |
| OLMo-3-7B-Think | − | + | 55.1% | 28.0% | 85.8% |
| OLMo-3-7B-Think | + | + | 56.1%* | 28.7% | 84.8%* |
| R1-Distill-Qwen-1.5B | − | − | 11.8% | 6.8% | 65.5% |
| R1-Distill-Qwen-1.5B | − | + | 12.7% | 7.4% | 64.9% |
| R1-Distill-Qwen-1.5B | + | + | 12.8% | 7.6% | 64.3% |
| GPT-5.2 | − | + | 90.8% | 57.0% | 90.4% |
| GPT-5.2 | + | + | 91.5%* | 58.1%* | 91.6%* |
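The significance stars come from McNemar's test, which compares two systems only on the questions where they disagree. A minimal exact version, using the binomial tail over discordant pairs (doubling the smaller tail for a two-sided p-value is one common convention):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs:
    b = items system A got right and system B got wrong, c = the reverse.
    Under H0 the discordant outcomes are Binomial(n, 0.5); we double the
    smaller tail for a two-sided p-value."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if the cue fixes 8 answers and breaks none, `mcnemar_exact_p(0, 8)` gives p = 2/256 ≈ 0.0078, significant at the 95% level.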

Reinforcement Learning for Knowledge Recall

If a simple cue helps, can we do better by actually training models to reason for knowledge recall? We apply the same RLVR framework used for math—but targeting closed-book knowledge recall QA.

Setup

We train GPT-OSS-20B on TriviaQA using online RL with LoRA adaptation. Given an input $x$, the model generates a reasoning trace $\hat{c} \sim p_\theta(\cdot \mid x)$, then a final answer $\hat{y} \sim p_\theta(\cdot \mid x, \hat{c})$. We train with a scalar reward $r(\hat{y}, y)$ that depends only on the final answer, maximizing the expected reward:

$$\max_\theta \; \mathbb{E}_{\substack{x,\, y \\ \hat{c}, \hat{y} \sim p_\theta(\cdot \mid x)}} \left[ r(\hat{y}, y) \right]$$

The reward is defined as:

$$r(\hat{y}, y) = r_{\text{answer}}(\hat{y}, y) + 0.1 \cdot (f_{\text{format}}(\hat{y}) - 1)$$

where $r_{\text{answer}}$ is 1.0 for an exact match, 0.5 if the correct answer appears somewhere in the output (recall), and 0 otherwise. The format term penalizes outputs missing <answer></answer> tags.
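A minimal sketch of this reward. We assume exact match is checked on the text inside the `<answer>` tags and recall by substring match; the lowercasing/stripping normalization is our assumption, not a detail from the training setup:

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)

def answer_reward(pred: str, gold: str) -> float:
    """1.0 for an exact match inside the <answer> tags, 0.5 if the gold
    answer appears anywhere in the output (recall), 0.0 otherwise."""
    gold_norm = gold.strip().lower()
    m = ANSWER_RE.search(pred)
    extracted = m.group(1).strip().lower() if m else pred.strip().lower()
    if extracted == gold_norm:
        return 1.0
    if gold_norm in pred.lower():
        return 0.5
    return 0.0

def total_reward(pred: str, gold: str) -> float:
    """r = r_answer + 0.1 * (f_format - 1): the format term subtracts 0.1
    whenever the <answer></answer> tags are missing, and is zero otherwise."""
    f_format = 1.0 if ANSWER_RE.search(pred) else 0.0
    return answer_reward(pred, gold) + 0.1 * (f_format - 1.0)
```

Note the format term can only penalize: a well-formed output gets exactly its answer reward, while a malformed one loses 0.1.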

We optimize using a GRPO-style importance-sampling policy gradient. For each input $x$, we sample a group of $K$ trajectories $\{(\hat{c}_k, \hat{y}_k)\}_{k=1}^K$ and compute advantages relative to the group-average reward, $A_k = r_k - \frac{1}{K}\sum_j r_j$:

$$\nabla_\theta \mathcal{L} = \mathbb{E}\!\left[\sum_{k} A_k \sum_{i \in \tau_k} \frac{\pi_\theta(t_i \mid t_{\lt i}, x)}{\pi_{\theta_{\text{old}}}(t_i \mid t_{\lt i}, x)} \, \nabla_\theta \log \pi_\theta(t_i \mid t_{\lt i}, x)\right]$$
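The group-relative advantage and a per-trajectory surrogate whose gradient matches this estimator can be sketched in plain Python (a real trainer would batch this in a tensor library, with $\pi_{\theta_{\text{old}}}$ held fixed):

```python
import math

def grpo_advantages(group_rewards):
    """A_k = r_k - (1/K) * sum_j r_j: reward relative to the group mean."""
    mean_r = sum(group_rewards) / len(group_rewards)
    return [r - mean_r for r in group_rewards]

def grpo_surrogate(token_logps, old_token_logps, advantage):
    """Per-trajectory surrogate A_k * sum_i pi(t_i)/pi_old(t_i).
    Differentiating each ratio gives ratio * grad(log pi), so the gradient
    of this scalar matches the estimator in the text when pi_old is fixed."""
    ratios = (math.exp(lp - old_lp)
              for lp, old_lp in zip(token_logps, old_token_logps))
    return advantage * sum(ratios)
```

Because the advantages are centered within each group, they sum to zero: trajectories that beat their group mean are reinforced and the rest are suppressed, with no learned value function required.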

Baselines

To isolate the contribution of RL, we compare against two baselines. Reasoning-SFT generates reasoning traces from the base model, filters to correct ones, and fine-tunes on them—removing the on-policy adaptation of RLVR but keeping model-generated reasoning chains:

$$\min_\theta \; \mathbb{E}_{\substack{x \\ \hat{c}, \hat{y} \sim p_{\theta_0}(\cdot \mid x)}} \left[ -\log p_\theta(\hat{y}, \hat{c} \mid x) \right]$$

Standard SFT removes reasoning traces entirely and just trains on question→answer pairs, as in Roberts et al. (2020):

$$\min_\theta \; \mathbb{E}_{x, y} \left[ -\log p_\theta(y \mid x) \right]$$
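The Reasoning-SFT data construction amounts to rejection sampling from the frozen base model. A sketch, where `generate` and `is_correct` are hypothetical hooks for the base-model sampler and the answer checker, and `k` samples per question is an illustrative choice:

```python
def build_reasoning_sft_data(examples, generate, is_correct, k=4):
    """Rejection-sampling SFT data: sample k reasoning traces per question
    from the frozen base model, keep only those whose final answer is
    correct, and fine-tune on the survivors."""
    data = []
    for question, gold in examples:
        for _ in range(k):
            trace, answer = generate(question)
            if is_correct(answer, gold):
                data.append((question, trace, answer))
    return data
```

Unlike RL, the sampling distribution here is the fixed base model, so the trained model never learns from traces drawn from its own updated policy; this is exactly the on-policy adaptation that the comparison below isolates.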

Results

RL training shows improvements that transfer beyond the training distribution. After training only on TriviaQA, the model also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA. Notably, HotpotQA is a multi-hop QA dataset; StrategyQA tests implicit reasoning strategies with only T/F answers.

Table 2. GPT-OSS-20B performance before and after RL training. EM = Exact Match; Ex-Recall = Extracted Recall. * = improvement over base is significant at 95% level by McNemar's test.

| Model | Metric | TriviaQA | NQ | HotpotQA | SimpleQA | StrategyQA |
|---|---|---|---|---|---|---|
| Base | EM | 36.5 | 6.0 | 7.5 | 1.8 | 71.5 |
| Base | Ex-Recall | 60.1 | 30.7 | 25.5 | 3.5 | – |
| Reasoning-SFT | EM | 38.8 | 6.1 | 7.3 | 1.9 | 72.5 |
| Reasoning-SFT | Ex-Recall | 63.8 | 31.5 | 26.0 | 4.0 | – |
| RL-trained (ours) | EM | 63.6* | 18.2* | 17.0* | 3.3* | 74.5* |
| RL-trained (ours) | Ex-Recall | 70.0* | 34.9* | 27.6* | 4.1 | – |

The RL-trained model substantially outperforms both baselines. Reasoning-SFT shows only modest gains, confirming that RL's on-policy adaptation—not just exposure to correct traces—drives the improvement.

Interestingly, even after RL training, the think step-by-step cue yields additional gains (e.g., +1.6% Ex-Recall on NQ), indicating room for further improvement. The trained model also improves on MATH in the no-cue setting, a transfer result we leave to future study.

So, what do the reasoning traces look like?

The RL-trained model generates consistently longer reasoning traces, but the improvements aren't always what we'd expect. In many cases, it simply outputs the answer earlier with minimal reasoning.

In the examples below, Thinking refers to content generated in the model's internal reasoning tokens (hidden from the user), and Answer refers to content in the model's visible response tokens.

Q: Who was Canada's first woman Prime Minister?
Base (GPT-OSS-20B)
Thinking: This is false: Canada has no woman PM. So answer: none.
Answer: Canada has never had a woman Prime Minister. ✗
RL-trained
Thinking: The answer: Kim Campbell? Actually first female PM of Canada: Kim Campbell served in 1993. The question refers to Kim Campbell. Provide answer.
Answer: Kim Campbell ✓

Q: The Manuel M. Torres Dam is the largest hydro plant in which country?
Base (GPT-OSS-20B)
Thinking: Look up: Manuel M. Torres Dam is in South Africa, near the Vaal River. So largest hydro plant in Africa? Wait it's not largest in Africa. It is in South Africa. So answer: South Africa.
Answer: South Africa ✗
RL-trained
Thinking: We need answer: Manuel M. Torres Dam. It's in Mexico? Wait, Manuel M. Torres is dam in Yucatán? Actually the largest hydroelectric plant in Mexico is the Manuel Moreno Torres Dam on the Grijalva River? … It's located on the Grijalva River in Chiapas. So answer Mexico.
Answer: Mexico ✓

Q: Which writer's first works were gathered into "Plain Tales From The Hills" in the 1880s?
Base (GPT-OSS-20B)
Thinking: We recall Plain Tales from the Hills is by Rabindranath Tagore? Wait … Plain Tales from the Hills is by "Abdul Hamid"?? Actually it is by "Abdul Hamid". Yes.
Answer: Abdul Hamid ✗
RL-trained
Thinking: Answer: Rudyard Kipling.
Answer: Rudyard Kipling ✓

The third example is particularly interesting: the RL-trained model produces the correct answer with almost no reasoning, while the base model confabulates an elaborate but wrong answer. We view improved answer calibration and improved reasoning as points on a spectrum rather than cleanly separable phenomena: even what looks like confident guessing can reflect the model correctly recognizing that extended reasoning is unnecessary. Nonetheless, eliciting qualitatively richer reasoning, such as spreading activation–style knowledge recall, remains an open problem.

Takeaways

Current reasoning language models do not perform their best reasoning on knowledge recall tasks. A simple prompt cue ("think step-by-step") helps on knowledge recall tasks but not math, revealing a gap left by RLVR training. RL with answer correctness as a verifiable reward teaches models to access their parametric knowledge better, with gains transferring to unseen QA benchmarks.

References