
RL Theory

Agentic RL and Tool Use

LLMs as multi-step policies: observations, tool calls, environment feedback, sparse rewards, credit assignment, and why agent training differs from single-turn RLHF.

Advanced · Tier 2 · Frontier · Frontier watch · ~60 min

Why This Matters

Standard RLHF trains a model to produce a good response to a prompt. Tool-using agents create a harder problem: the model chooses actions over many steps, observes external feedback, and decides when the task is complete.

A chat model is close to a single-turn function: $\text{prompt} \to \text{response}$. A tool-using agent is better modeled as a multi-step policy: $(\text{history}_t, \text{observation}_t) \to \text{action}_t$, where actions can include code execution, web navigation, file edits, API calls, text responses, and termination.

The RL challenges are harder: horizons are longer, rewards are sparser, actions have real consequences (a wrong API call cannot be undone), and the state space includes the external world. Understanding the formal RL framework for agents explains why building reliable agents is much harder than building good chatbots.

Mental Model

Consider the difference between:

  1. Chat model: "Explain why this test failed." → One response. Done.
  2. Agent: "Fix the failing test suite." → Read the failure. Inspect files. Edit code. Run tests. Interpret the next failure. Repeat until the local evidence says the task is complete.

The agent must decide what to do next at every step, handle failures (a search returns no results, an API errors out), and manage a growing context of past actions and observations. This is a sequential decision problem: the setting RL was built to analyze.
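
The loop structure is easy to see in code. Below is a minimal sketch of the chat-vs-agent contrast; `llm` and `env` are hypothetical stand-ins for the model and the environment, and the dict-shaped actions are an illustrative convention, not a specific API.

```python
# Minimal sketch: single-turn chat vs. a multi-step agent loop.
# `llm` and `env` are hypothetical stand-ins, not a specific API.

def chat(llm, prompt: str) -> str:
    # Single turn: prompt in, response out, no feedback loop.
    return llm(prompt)

def run_agent(llm, env, task: str, max_steps: int = 20) -> list:
    # Multi-step: at every step the model decides what to do next,
    # conditioned on everything it has seen so far.
    history = [("task", task)]
    for _ in range(max_steps):
        action = llm(history)               # next tool call, text, or stop
        if action.get("type") == "stop":    # the agent decides it is done
            break
        observation = env.step(action)      # tool output, error message, ...
        history.append(("action", action))
        history.append(("observation", observation))
    return history
```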

The Agent as an MDP

Proposition

LLM Agent as a Markov Decision Process

Statement

An LLM agent can be formulated as a partially observable MDP (POMDP):

  • State $s_t$: The full environment state (file system contents, web page state, conversation history, tool outputs). Typically not fully observable.
  • Observation $o_t$: What the agent sees. The text representation of the current state (tool output, error message, retrieved content).
  • Action $a_t$: The agent's next output. A tool call (code execution, web search, API request), a text response, or a decision to terminate.
  • Transition $P(s_{t+1} | s_t, a_t)$: The environment dynamics (code execution results, web page responses). Stochastic and partially known.
  • Reward $R(s_t, a_t)$: Typically sparse. A final reward at task completion (did the agent solve the problem?) with zero intermediate reward.

The agent's policy $\pi_\theta(a_t | o_1, \ldots, o_t)$ is the LLM itself: given the history of observations, it generates the next action as a text string.

The objective is to maximize the expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$$

where $T$ is the (variable) episode length and $\gamma$ is the discount factor.
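
As a concrete reading of the objective, here is a minimal Monte Carlo sketch: sample episodes, compute each discounted return, and average. `rollout` is a hypothetical function that runs the policy in the environment and returns one episode's per-step rewards.

```python
# Minimal sketch of the objective above: a Monte Carlo estimate of
# J(theta) as the average discounted return over sampled episodes.
# `rollout` is a hypothetical function returning one episode's rewards.

def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    # sum_t gamma^t * R(s_t, a_t) for a single episode
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(rollout, num_episodes: int = 32, gamma: float = 0.99) -> float:
    # Average return across episodes sampled from the current policy.
    returns = [discounted_return(rollout(), gamma) for _ in range(num_episodes)]
    return sum(returns) / len(returns)
```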

Intuition

The LLM is the policy component. It processes observations (text from the environment), chooses actions (which tool to call, what arguments to use), and updates its next decision based on results. The context window is the agent's working memory: it contains the visible history of actions and observations. When the context window fills up, the agent must compress or summarize, which introduces information loss.

Why It Matters

This formulation connects LLM agents to the vast RL theory literature. Concepts like exploration-exploitation tradeoff, credit assignment, temporal abstraction, and reward shaping all apply directly. The formulation also reveals why agents are hard: long horizons, sparse rewards, and partial observability are exactly the settings where RL struggles most.

Failure Mode

The POMDP formulation assumes the agent's policy is Markov given the full observation history. In practice, the LLM has a finite context window, so it cannot condition on arbitrarily long histories. When episodes exceed the context length, the agent loses access to early observations and actions. This is not just a technical limitation; it means agents cannot be truly Markov for long tasks, introducing a systematic source of error.
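
A minimal sketch of what this failure mode forces in practice: the agent conditions only on the slice of history that fits its token budget, with older steps collapsed into a lossy summary. `count_tokens` and `summarize` are hypothetical helpers.

```python
# Minimal sketch of context management under a finite token budget:
# keep recent steps verbatim, collapse older ones into a lossy summary.
# `count_tokens` and `summarize` are hypothetical helpers.

def visible_history(history: list[str], count_tokens, summarize,
                    budget: int = 8000) -> list[str]:
    kept, used = [], 0
    for entry in reversed(history):           # walk from most recent backwards
        cost = count_tokens(entry)
        if used + cost > budget:
            older = history[: len(history) - len(kept)]
            return [summarize(older)] + kept  # information loss happens here
        kept.insert(0, entry)
        used += cost
    return kept                               # everything fit; no loss
```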

Tool Use as Actions

Definition

Tool-Augmented LLM

A tool-augmented LLM has access to a set of tools $\mathcal{T} = \{t_1, \ldots, t_K\}$, each with a typed interface (input schema, output schema). At each step, the agent either:

  1. Calls a tool: Generates a structured tool call $(t_k, \text{args})$, receives the tool output $o_{t+1} = t_k(\text{args})$
  2. Generates text: Produces a text response (reasoning, answer, etc.)
  3. Terminates: Signals task completion

The action space is $\mathcal{A} = \mathcal{T} \times \text{Args} \cup \text{Text} \cup \{\text{stop}\}$.

Common tool categories:

  • Code execution: Python interpreter, shell commands
  • Web search: Search engine queries, page retrieval
  • API calls: Database queries, external service requests
  • File operations: Read, write, edit files
  • UI interaction: Browser navigation, click, type

The tool set defines the agent's action space. A model with code execution can test hypotheses that would otherwise remain verbal. A model with browser access can retrieve information outside its weights. A model with file editing can change an artifact and observe the consequences. The choice of tools is a product and safety decision, not just an API detail.
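
A minimal sketch of what a typed tool interface can look like; the tool names and stub implementations below are illustrative, not a specific framework's API. The registry plays the role of the tool set $\mathcal{T}$, and dispatching a structured call produces the next observation (including error text when the call fails).

```python
# Minimal sketch of a typed tool registry and dispatcher. Tool names and
# stub bodies are illustrative; real tools would run in a sandbox.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]   # typed arguments in, text observation out

def make_registry() -> dict[str, Tool]:
    return {
        "run_python": Tool("run_python", "Execute a Python snippet",
                           lambda code: f"(stub) ran: {code}"),
        "web_search": Tool("web_search", "Query a search engine",
                           lambda query: f"(stub) results for: {query}"),
        "read_file": Tool("read_file", "Read a file",
                          lambda path: f"(stub) contents of: {path}"),
    }

def dispatch(registry: dict[str, Tool], call: dict[str, Any]) -> str:
    # A structured call is (tool name, args); its output is the next observation.
    tool = registry.get(call["name"])
    if tool is None:
        return f"error: unknown tool {call['name']!r}"   # feedback, not a crash
    try:
        return tool.run(**call.get("args", {}))
    except Exception as exc:
        return f"error: {exc}"                           # tool failures become observations
```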

The ReAct Pattern

Definition

ReAct (Reasoning + Acting)

The ReAct pattern interleaves reasoning (chain-of-thought) with acting (tool calls) in an explicit loop:

Thought: The failing test names parseInterval, so I should inspect that code.
Action: read_file("src/time/parse-interval.ts")
Observation: The parser handles hours and minutes but not mixed units.
Thought: The failing case probably needs a mixed-unit branch plus a test.
Action: edit_file("src/time/parse-interval.ts", patch)
Observation: File updated.
Thought: Now I should run the narrow test before touching anything else.
Action: run_tests("parse-interval.test.ts")
Observation: The failing test now passes.
Thought: I can stop after checking the relevant suite.
Action: respond("Fixed parseInterval and verified the targeted tests.")

Each cycle consists of: Thought (reasoning about what to do), Action (tool call or response), Observation (result from the environment). The LLM generates the Thought and Action; the environment generates the Observation.
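
A minimal sketch of this loop under stated assumptions: a hypothetical `llm` emits Thought/Action text, a deliberately naive parser extracts a single-string-argument tool call, and the environment's output is appended as the next Observation.

```python
# Minimal sketch of a ReAct loop. The hypothetical `llm` emits text
# containing "Thought: ..." and "Action: tool_name(\"arg\")" lines;
# parsing is deliberately naive and assumes one string argument.

def parse_action(step_text: str) -> tuple[str, str]:
    line = next(l for l in step_text.splitlines() if l.startswith("Action:"))
    call = line.removeprefix("Action:").strip()
    name, _, rest = call.partition("(")
    return name.strip(), rest.rstrip(")").strip().strip('"')

def react_loop(llm, tools: dict, task: str, max_steps: int = 15) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # model writes Thought + Action
        transcript += step + "\n"
        name, arg = parse_action(step)
        if name == "respond":                  # terminal action: final answer
            return arg
        observation = tools[name](arg)         # environment produces Observation
        transcript += f"Observation: {observation}\n"
    return "stopped: step budget exhausted"
```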

ReAct remains a common baseline pattern because it:

  • Makes reasoning explicit and inspectable
  • Allows the agent to plan before acting
  • Provides a natural structure for multi-step problem solving
  • Exposes the plan and action trace for debugging

The limitation: explicit reasoning traces consume context and can become unreliable evidence about the model's internal computation. For long tasks, the growing history of thoughts, actions, and observations must be compressed, summarized, or replaced by external memory.

Training Agentic Policies

Proposition

Policy Gradient for Tool-Augmented Agents

Statement

For an agent executing a trajectory $\tau = (o_0, a_0, o_1, a_1, \ldots, o_T, a_T)$ with episode reward $R(\tau)$, the policy gradient is:

$$\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | o_{\leq t})\right]$$

Each action $a_t$ is a sequence of tokens (the tool call or text output), so:

$$\nabla_\theta \log \pi_\theta(a_t | o_{\leq t}) = \sum_{k=1}^{|a_t|} \nabla_\theta \log \pi_\theta(a_t^{(k)} | o_{\leq t}, a_t^{(<k)})$$

The gradient reinforces entire action sequences (tool calls with arguments) that led to successful episodes.
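
A minimal sketch of the corresponding loss under these assumptions, using PyTorch: the episode reward scales the summed log-probabilities of every token of every action in the trajectory.

```python
# Minimal PyTorch sketch of the estimator above. `logprobs` is assumed
# to hold one tensor of per-token log-probabilities per action a_t,
# as produced by the policy when it generated that tool call or text.
import torch

def reinforce_loss(logprobs: list[torch.Tensor], episode_reward: float) -> torch.Tensor:
    # R(tau) * sum over timesteps t and tokens k of log pi(a_t^(k) | ...)
    total_logprob = torch.stack([lp.sum() for lp in logprobs]).sum()
    return -episode_reward * total_logprob    # negated: optimizers minimize

# Calling loss.backward() reinforces every token of every action with the
# same episode-level weight, which is the credit assignment issue below.
```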

Intuition

The policy gradient pushes the agent to repeat actions that led to high reward and avoid actions that led to low reward. But the reward comes only at the end of a long episode. Which of the 20 actions was responsible for success? This is the credit assignment problem: the fundamental difficulty of RL with sparse rewards over long horizons.

Why It Matters

This is the mathematical framework for training agents with RL. It shows why agentic RL is harder than chat RLHF: the sum over $T$ timesteps introduces high variance, the sparse reward $R(\tau)$ provides weak signal per action, and the combinatorial action space (all possible tool calls with all possible arguments) is enormous.

Failure Mode

With sparse rewards and long horizons, the REINFORCE estimator has extremely high variance. A 20-step episode with binary reward gives each action a gradient proportional to the same episode-level reward, regardless of whether that specific action contributed to success. Variance reduction techniques (baselines, advantage estimation) help but do not fully solve the problem. This is why most agentic RL systems supplement the sparse final reward with shaped intermediate rewards (e.g., partial credit for measurable progress toward the goal).
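
One common mitigation, sketched under the assumption that several episodes are sampled for the same task: use the group's mean reward as a baseline, so only better-than-average episodes are reinforced. This reduces variance; it does not localize credit within an episode.

```python
# Minimal sketch of one mitigation, assuming several episodes are sampled
# for the same task: subtract the group's mean reward as a baseline, so
# only better-than-average episodes are reinforced. This lowers variance
# but still does not localize credit to individual actions.
import torch

def group_baseline_loss(episode_logprobs: list[torch.Tensor],
                        episode_rewards: list[float]) -> torch.Tensor:
    rewards = torch.tensor(episode_rewards)
    advantages = rewards - rewards.mean()            # baseline = group mean
    losses = [-(adv * lp.sum())
              for adv, lp in zip(advantages, episode_logprobs)]
    return torch.stack(losses).mean()
```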

How Agentic RL Differs from Chat RLHF

Property           | Chat RLHF                    | Agentic RL
Horizon            | 1 turn (single response)     | 5-100+ turns
Reward             | Dense (reward per response)  | Sparse (reward at task completion)
Actions            | Text generation              | Tool calls + text
State              | Fixed prompt                 | Evolving environment
Consequences       | None (just text)             | Real (code runs, files change)
Failure recovery   | N/A                          | Must handle errors and retry
Credit assignment  | Trivial (one action)         | Hard (many actions)

The key distinction: single-turn chat RLHF is closer to a contextual bandit, while agentic RL is a sequential decision problem with exploration, credit assignment, partial observability, and long horizons.

Multimodal Agents

Agent evaluation increasingly extends beyond text-only prompts to multimodal and interactive environments:

  • UI agents: Navigate graphical interfaces by observing screenshots or DOM state and producing clicks, typing actions, and browser navigation
  • Embodied agents: Interact with physical or simulated environments
  • Multi-tool agents: Combine code execution, web browsing, file editing, and API calls in a single episode

Multimodal agents process visual observations (screenshots) alongside text, expanding the observation space and adding new action types (click at coordinates, scroll, type into a field).
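
A minimal sketch of how the action space grows: each UI action type is a small structured record the policy must emit alongside tool calls and text. The names and fields below are illustrative.

```python
# Minimal sketch of UI action types as structured records; the names and
# fields are illustrative, not a specific benchmark's schema.
from dataclasses import dataclass

@dataclass
class Click:
    x: int          # pixel coordinates on the screenshot
    y: int

@dataclass
class TypeText:
    text: str       # characters to type into the focused field

@dataclass
class Scroll:
    dy: int         # vertical scroll amount

UIAction = Click | TypeText | Scroll   # one such action per step, alongside tool calls and text
```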

Common Fake Understanding

"Agents" does not mean AGI. In this page, an LLM agent means a language model used as a policy in a multi-step decision loop with tools and feedback. That may be prompted, supervised, preference-trained, or RL-trained. It is not a claim about consciousness or human-like autonomy. Treating "agentic" as synonymous with "autonomous in the human sense" leads to confused safety analysis and inflated capability claims.

Training Infrastructure

Training agentic policies requires infrastructure beyond standard LLM training:

  • Environments: Sandboxed execution environments for code, browsers, APIs. Each training episode requires spinning up and tearing down an environment instance.
  • Trajectory collection: Episodes are collected by running the agent in the environment, which is much slower than sampling text (tool calls have latency, code execution takes time).
  • Reward functions: Task-specific reward functions that check whether the agent completed the objective. Often hand-crafted per task category.
  • Safety constraints: The agent must not perform irreversible harmful actions during training (delete important files, send unauthorized emails). Sandboxing is essential.
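
A minimal sketch of the episode-collection loop these pieces imply; `make_sandbox`, `policy`, and `reward_fn` are hypothetical interfaces, and actions follow the dict convention used in the earlier sketches.

```python
# Minimal sketch of episode collection for training. `make_sandbox`,
# `policy`, and `reward_fn` are hypothetical interfaces.

def collect_episode(policy, make_sandbox, task, reward_fn, max_steps: int = 30):
    trajectory = []
    with make_sandbox(task) as env:              # fresh, isolated environment
        observation = env.reset()
        for _ in range(max_steps):
            action = policy(trajectory, observation)
            trajectory.append((observation, action))
            if action.get("type") == "stop":
                break
            observation = env.step(action)       # tool latency lives here
        reward = reward_fn(env)                  # task-specific check, e.g. tests pass
    return trajectory, reward                    # sandbox is torn down on exit
```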

April 2026 Review: What Is Established

The reliable center of the literature is not that "agents" are one settled architecture. It is that several older ideas now meet in the same system: interactive environments, tool calling, language-conditioned policies, trajectory evaluation, and delayed rewards.

The strongest evidence comes from task suites where success can be checked: web navigation environments, software-issue repair, coding tests, question answering with references, and constrained interactive benchmarks. Those benchmarks are useful because they force the model to act, observe, and revise. They are not universal measures of autonomy. Each benchmark bakes in a tool set, a time budget, an evaluator, and a definition of success.

The practical standard for this topic should be: report the environment, tool budget, scaffold, evaluator, and failure recovery policy. Without those details, an "agent" result is hard to interpret.

Common Confusions

Watch Out

Tool prompting is not the same as agentic RL

Prompting a model with tool descriptions and examples is not RL. It is in-context learning. The model uses its pretrained knowledge to guess how to use tools. Agentic RL actually updates the model's weights based on success or failure in the environment. The distinction matters because an impressive tool transcript is not evidence that the policy was trained from environment rollouts.

Watch Out

Function calling is not the same as agentic reasoning

Function calling (structured tool invocation) is a single action. Agentic reasoning is the ability to plan a sequence of actions, observe results, adapt the plan, handle failures, and decide when to stop. A model that can call functions is not necessarily an agent. It may just be a better-formatted chatbot. The "agentic" property is about multi-step sequential decision-making, not single-step tool invocation.

Watch Out

Longer context does not solve the horizon problem

A longer context window helps the agent remember more of its history, but it does not solve the RL challenges of credit assignment and exploration. Even with infinite context, the agent still needs to figure out which of its many actions was responsible for success (credit assignment) and decide whether to try new strategies versus exploit known ones (exploration). These are fundamental RL problems, not context length problems.

Summary

  • LLM agents can be modeled as policies: observation in, action out, multi-step episodes
  • Agent MDP: state = environment, action = tool call or text, reward = task completion
  • Tool use defines the action space: code execution, web search, APIs, UI
  • ReAct pattern: interleave reasoning (Thought) with acting (Action) and observing (Observation)
  • Agentic RL is harder than chat RLHF: longer horizons, sparser rewards, real consequences
  • Credit assignment over long episodes is the core difficulty
  • Policy gradient for agents: REINFORCE over multi-step trajectories with high variance
  • "Agent" means a model in an action-feedback loop, not AGI or consciousness
  • Training requires sandboxed environments and task-specific reward functions

Exercises

ExerciseCore

Problem

An LLM agent solves a coding task in 10 steps: 8 actions are code edits and 2 are test executions. The final test passes (reward = 1). Under REINFORCE without a baseline, what gradient does each action receive? Why is this problematic?

ExerciseAdvanced

Problem

Compare the effective action space of a chat model (single-turn RLHF) versus an agentic model with 5 tools, each taking a string argument of up to 100 tokens. Assuming a vocabulary of 50,000 tokens, estimate the action space sizes and explain the implications for exploration.

ExerciseResearch

Problem

The credit assignment problem in agentic RL can be partially addressed by hindsight analysis: after a successful episode, identify which actions were critical by counterfactual reasoning. Formalize this: define a "criticality score" for action $a_t$ in a successful trajectory, and describe how you would estimate it using the model itself.

References

Pre-canonical:

  • Sutton, "Temporal Credit Assignment in Reinforcement Learning", PhD thesis, UMass Amherst (1984). The credit-assignment problem that tool-using agents inherit across long multi-step trajectories.
  • Harutyunyan et al., "Hindsight Credit Assignment" (NeurIPS 2019). Counterfactual credit reweighting relevant to sparse-reward tool-use trajectories.



Last reviewed: April 22, 2026
