Tool Use and Function Calling: Language Models Invoking External Functions

Category: agents-applications Updated: 2026-02-27

Toolformer (Schick et al., NeurIPS 2023) trains a model to self-supervise the insertion of API calls, reducing perplexity across 5 tools versus the baseline. ReAct (Yao et al., ICLR 2023) interleaves reasoning and actions, raising HotpotQA exact match from 29.0% to 35.1% and ALFWorld success from 25% to 71%.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| ReAct HotpotQA Exact Match | 35.1% | Exact Match | Yao et al. (2022): ReAct (reason+act) vs 29.0% standard prompting; +6.1 points absolute on multi-hop QA |
| ReAct ALFWorld success rate | 71% | % success | Yao et al. (2022): ReAct 71% vs 25% standard prompting; +46 points on embodied task completion |
| Toolformer tools | 5 tools | tool types | Schick et al. (2023): calculator, calendar, Wikipedia search, machine translation, QA system |
| Function call JSON format | `{"name": "tool_name", "arguments": {"arg": "val"}}` | — | Standard structured output; parsed by executor; result appended as observation in context |

Tool use (function calling) enables language models to invoke external functions — calculators, search engines, code interpreters, databases, and APIs — extending beyond the limitations of parametric knowledge stored in weights. The model generates a structured description of the function call; an external executor runs the function and returns the result, which is appended to the context for the next generation step.

The Tool Use Loop

  1. User query
  2. LM generates: `{"name": "calculator", "arguments": {"expr": "24 * 365"}}`
  3. Executor runs `calculator("24 * 365")` → `8760`
  4. Context append: `[TOOL_RESULT]: 8760`
  5. LM continues: "There are 8760 hours in a year."
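The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real model API: `run_tool_call` and the `TOOLS` registry are hypothetical names, and the "model output" is hard-coded.

```python
import json

# Illustrative tool registry. eval is restricted here for the demo only;
# a real executor should never eval untrusted model output directly.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tool_call(call_json: str) -> str:
    """Parse a model-generated function call and execute it."""
    call = json.loads(call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# The model emits a structured call...
call = '{"name": "calculator", "arguments": {"expr": "24 * 365"}}'
result = run_tool_call(call)

# ...and the result is appended to the context as an observation.
print(f"[TOOL_RESULT]: {result}")  # [TOOL_RESULT]: 8760
```

The key design point is the separation of roles: the model only *describes* the call as data; the executor decides whether and how to run it.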

Toolformer: Self-Supervised Tool Learning (Schick et al., 2023)

Toolformer trains a model to self-supervise when and how to insert tool calls:

  1. Sample positions in training text where a tool call might reduce prediction loss
  2. Generate candidate API calls via few-shot prompting
  3. Filter: keep only calls where executing the tool and inserting the result reduces loss on the following text
  4. Fine-tune on the filtered dataset with API calls embedded inline
| Tool | Example Use Case | Perplexity Benefit |
| --- | --- | --- |
| Calculator | Arithmetic problems in training text | Reduces loss on following numbers |
| Calendar | Date arithmetic and temporal reasoning | Correct date computations |
| Wikipedia search | Factual entity lookups | Grounded factual claims |
| Machine translation | Non-English text processing | Correct multilingual handling |
| QA system | Knowledge retrieval | Factual question answering |
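The filtering criterion in step 3 can be sketched as follows. This is a hedged reconstruction, not Toolformer's actual code: `loss_fn` stands in for the LM's negative log-likelihood over the following text, `tau` is a hypothetical margin, and `toy_loss` is a toy stand-in for demonstration.

```python
# A candidate API call is kept only if inserting its executed result lowers
# the loss on the following text by more than a threshold, compared to both
# no call and the call without its result.
def keep_call(loss_fn, prefix, call_with_result, call_no_result, suffix, tau=0.5):
    loss_with = loss_fn(prefix + call_with_result, suffix)
    loss_without = min(loss_fn(prefix, suffix),
                       loss_fn(prefix + call_no_result, suffix))
    return loss_without - loss_with >= tau

# Toy loss: pretend the suffix "8760" is much easier to predict when the
# calculator result already appears in the prefix.
def toy_loss(prefix, suffix):
    return 0.1 if suffix in prefix else 2.0

print(keep_call(toy_loss,
                "There are ",
                "[Calculator(24*365) -> 8760] ",
                "[Calculator(24*365)] ",
                "8760"))  # True
```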

ReAct vs Direct Tool Calling (Yao et al., 2022)

| Approach | HotpotQA EM | ALFWorld Success |
| --- | --- | --- |
| Standard prompting (no tools) | 29.0% | 25% |
| Act-only (tool calls, no reasoning) | 28.7% | 45% |
| CoT-only (reasoning, no tools) | 28.7% | — |
| ReAct (reasoning + tool calls) | 35.1% | 71% |

The act-only baseline (tool calls without reasoning) performs similarly to no-tools prompting on multi-hop QA, confirming that reasoning traces are essential for effective tool selection and sequencing.
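A ReAct-style control loop can be sketched as below. Everything here is illustrative: `model_step` is a scripted stand-in for a real LM, and the `Action: name|argument` step format is an assumption for the demo, not the paper's exact syntax.

```python
# Alternate model steps ("Thought"/"Action") with tool execution, feeding
# each tool result back into the transcript as an "Observation".
def react_loop(model_step, tools, query, max_steps=5):
    transcript = [f"Question: {query}"]
    for _ in range(max_steps):
        step = model_step(transcript)          # model emits Action or Finish
        transcript.append(step)
        if step.startswith("Finish:"):
            return step.removeprefix("Finish: "), transcript
        if step.startswith("Action:"):
            name, arg = step.removeprefix("Action: ").split("|", 1)
            obs = tools[name](arg)             # execute the tool call
            transcript.append(f"Observation: {obs}")
    return None, transcript

# Scripted "model": search first, then answer from the last observation.
def scripted_model(transcript):
    observations = [t for t in transcript if t.startswith("Observation:")]
    if not observations:
        return "Action: search|hours in a year"
    return f"Finish: {observations[-1].removeprefix('Observation: ')}"

tools = {"search": lambda q: "8760 hours"}
answer, _ = react_loop(scripted_model, tools, "How many hours in a year?")
print(answer)  # 8760 hours
```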

Structured Output Formats

Function calls require well-formed JSON within the text generation stream:

| Format | Mechanism | Parsing |
| --- | --- | --- |
| Inline (Toolformer) | Special tokens wrap API call syntax | Token-level detection |
| Dedicated turn | Entire output is a JSON object | Message-level parsing |
| JSON schema constrained | Constrained decoding to valid JSON | Grammar-based sampling |
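Message-level parsing for the "dedicated turn" format might look like the sketch below. The hand-rolled `SCHEMA` dict is purely illustrative; production systems typically validate against JSON Schema or avoid the problem entirely with grammar-constrained decoding.

```python
import json

# Illustrative type map, not a standard schema format.
SCHEMA = {"name": str, "arguments": dict}

def parse_function_call(output: str) -> dict:
    """Treat the entire model output as one JSON function call."""
    call = json.loads(output)               # raises on malformed JSON
    for key, typ in SCHEMA.items():
        if not isinstance(call.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return call

call = parse_function_call('{"name": "calculator", "arguments": {"expr": "24 * 365"}}')
print(call["name"])  # calculator
```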

Trust and Safety Boundaries

Tool use introduces a trust boundary: the model-generated function call is untrusted input to the executor. Sandbox requirements depend on tool capabilities:

  • Read-only tools (search, calculator): low risk; broad access acceptable
  • Write tools (database, email): require explicit user authorization per call
  • Code execution: requires full sandboxing; never run directly on host
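The permission tiers above can be enforced in the executor rather than trusted to the model. This is a minimal sketch under assumed names: the tool sets, the `authorize` callback, and `execute` are all hypothetical.

```python
# The executor, not the model, decides whether a call may run.
READ_ONLY = {"search", "calculator"}          # low risk: run directly
WRITE = {"send_email", "db_update"}           # require per-call approval

def execute(call, tools, authorize=lambda c: False):
    name = call["name"]
    if name in READ_ONLY:
        return tools[name](**call["arguments"])
    if name in WRITE:
        if not authorize(call):               # explicit user authorization
            raise PermissionError(f"{name} requires user authorization")
        return tools[name](**call["arguments"])
    # Anything else (e.g. code execution) is refused here; it belongs in a sandbox.
    raise PermissionError(f"unknown or sandboxed tool: {name}")

tools = {"calculator": lambda expr: 8760}
print(execute({"name": "calculator", "arguments": {"expr": "24 * 365"}}, tools))  # 8760
```

Defaulting `authorize` to deny means write tools fail closed unless the caller explicitly wires in an approval step.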

See chain-of-thought for reasoning traces that improve tool selection quality, rag for read-only retrieval-based augmentation, and context-window for how tool results consume the available token budget.



Frequently Asked Questions

How does function calling differ from RAG?

RAG retrieves documents from a static vector index and prepends them to context before generation — it is read-only retrieval over a pre-indexed corpus. Function calling executes arbitrary code or API endpoints and returns structured results: computations (calculator, code interpreter), real-time lookups (live data, current prices), stateful writes (database updates, email sending), or multi-step workflows. Function calling is more general but requires a trust boundary — the executor must validate and sandbox what the model is permitted to invoke.

What is the ReAct prompting framework?

ReAct (Yao et al., 2022) interleaves reasoning traces with action steps: Thought → Action → Observation → Thought → Action → .... The model generates a natural language reasoning step explaining what information it needs, then a structured tool call, then incorporates the result as an observation before reasoning again. This explicit reasoning-before-action reduces errors compared to direct tool-call generation, as the model plans which tool to use and why before committing to a call.
