Home

Tools

Why Tools Matter

Tools are what differentiate an agent from a bare LLM conversation. Without tools, a model can only generate text. It cannot read your files, run your tests, or check whether its suggestions actually work. Tools close the loop between reasoning and action.

Stencila agents operate in an agentic loop: the model reasons about a task, requests one or more tool calls, Stencila executes them and returns the results, and the model reasons again based on those results. This cycle repeats until the task is complete or a limit is reached. It is this loop, not the model alone, that makes agents capable of multi-step coding and research tasks like debugging across files, running experiments, or iterating on analysis code until tests pass.

Provider-Aligned Toolsets

Different model families are trained and optimized for different tool interfaces. OpenAI models produce better edits using their native apply_patch format (a v4a diff format). Anthropic models work best with edit_file's old_string/new_string exact-match replacement. Gemini models expect additional tools like read_many_files and list_dir that match their reference CLI.

Stencila handles this automatically. When an agent session starts, Stencila selects the tool definitions and system prompts that match the model's provider. Users do not need to configure this; the right tools are provided to the right models.

A core set of tools is shared across all providers. Provider-specific tools extend this base:

ToolProvidersDescription
ShellAllExecute shell commands
File (read_file, write_file, grep, glob)AllRead, write, and search files
Web (web_fetch)AllFetch web pages and save content locally
edit_fileAnthropic, GeminiApply exact-match old_string/new_string replacements
apply_patchOpenAIApply multi-file edits in v4a diff format
read_many_filesGeminiBatch-read multiple files in a single call
list_dirGeminiList directory contents with depth options

Tool Guards

Giving an agent access to tools means giving it the ability to modify files, execute arbitrary commands, and make network requests. A single hallucinated or misguided tool call can cause real damage — deleting files, force-pushing over git history, or leaking credentials into the model's context window.

Tool guards are a friction layer that sits between the agent's tool calls and their execution. Every tool call passes through a guard before it runs. The guard inspects the call parameters — the shell command, file path, or URL — and returns a verdict:

VerdictEffect
AllowThe tool call proceeds normally.
WarnThe tool call proceeds, but the verdict is logged and the agent is informed of the risk.
DenyThe tool call is blocked. The agent receives the reason and a suggestion for an alternative approach.

Each tool has its own guard with domain-specific rules:

  • Shell guard — pattern-based command evaluation using safe and destructive regex catalogs

  • File guard — path-based risk checks for read, write, edit, patch, and search operations

  • Web guard — URL and domain safety checks preventing SSRF, credential exposure, and protocol abuse

When a tool call is denied, the agent sees the reason and suggestion as tool output. This means the model can adjust its approach without user intervention (e.g using a safer command, or narrowing a file path).

Guards and Sandboxes

Tool guards are complementary to OS-level security boundaries such as sandboxes (bubblewrap, Landlock, containers). The two layers address different threat surfaces:

A sandbox constrains what the agent can do to the local environment; it cannot write outside a directory, cannot access the network, cannot read sensitive files. But many destructive operations are perfectly valid local commands that cause damage elsewhere: git push --force destroys history on the remote, uv publish pushes a package to PyPI, aws s3 rm --recursive deletes cloud storage. These commands can still execute successfully inside sandboxes that allow outbound network access, because the damage happens on a remote system.

Guards address this gap. They distinguish destructive from non-destructive use of tools like git, uv, docker, and cloud CLIs regardless of where the agent runs. A sandboxed agent still needs guards to prevent it from force-pushing a branch or publishing a package by mistake.

Trust Levels

Rather than requiring users to approve every individual tool call, guards use trust levels to control enforcement automatically. Each agent can be configured with a trust level that determines how strictly its tool calls are guarded:

LevelDescription
LowMost restrictive. Shell commands default to deny unless they match a known-safe pattern. File and web rules apply strictest verdicts. Appropriate for untested agents or sensitive environments.
Medium (default)Default-allow with destructive behavior blocking. Known-destructive commands are denied; everything else is allowed. A reasonable default for most use.
HighDefault-allow with relaxed blocking. Some rules that deny at medium trust downgrade to warnings. Appropriate for trusted agents on non-critical workspaces.

This lets agents work autonomously at a level of freedom appropriate to the context, without requiring a human in the loop for every shell command or file write.

Audit

All Warn and Deny verdicts are recorded to the workspace's Stencila database (<workspace>/.stencila/db.sqlite3) for post-hoc review. Allow verdicts are never recorded.

Each audit event captures:

FieldDescription
session_idUnique identifier for the agent session
agent_nameName of the agent that made the tool call
trust_levelTrust level of the session (low, medium, high)
tool_nameName of the tool that was called (shell, read_file, web_fetch, etc.)
inputThe tool input that was evaluated (command string, file path, URL)
matched_segmentThe specific segment that triggered the rule (normalized path, decisive command segment)
verdictThe verdict (Warn or Deny)
rule_idThe rule that fired (e.g., shell.core.filesystem.recursive_delete_root)
reasonHuman-readable explanation of why the rule fired
suggestionActionable suggestion for what to do instead

Auditing is best-effort and non-blocking: events are sent through a bounded channel to a background task. If the database cannot be opened or the channel is full, events are silently dropped — guard enforcement continues regardless.

© 2026 Stencila