Harbor Integration for Training LLMs with Environments

Harbor is a framework for running agentic tasks in sandboxes. It decouples a task (instruction + sandbox image + verifier), a harness/agent (the tool surface + loop), and a sandbox (docker, e2b, daytona, gke, …) so they can be mixed freely. This makes it a natural fit for RL: the same task suite can be trained with different tool surfaces, on whichever sandbox backend you prefer.

This guide covers how to integrate Harbor with TRL. For Harbor itself, see the Harbor docs.

The integration lives at trl.experimental.harbor and is gated behind the trl[harbor] extra (lazy-imported — non-users pay nothing).

When to use Harbor environments

GRPOTrainer supports environment-based training via the environment_factory slot — see OpenEnv for the general contract. Use Harbor when you want to train against a Harbor task suite: a directory tree of tasks, each a self-contained sandbox + verifier (for example, a data-analysis agent suite where the model explores files in a sandbox and writes an answer that a grader checks).

Installation

pip install "trl[harbor]"

Harbor drives generation through vLLM and uses environment_factory. Together these require vllm>=0.22.0 and transformers>=5.2.0:
pip install 'vllm>=0.22.0'

The trl[harbor] extra installs the harbor framework, which requires Python >= 3.12. The integration imports harbor lazily and runs it in-process, so users who don’t touch trl.experimental.harbor aren’t affected.

A sandbox backend must also be installed and reachable at train time. Harbor keeps cloud backends behind its own extras, so install the one you intend to use and provide its credentials:

pip install "harbor[e2b]"      # E2B cloud sandbox  -> environment_type="e2b",  needs E2B_API_KEY
# docker backend (environment_type="docker", Harbor's default) just needs a reachable Docker daemon
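
Before launching a long run, it can help to fail fast when a backend's prerequisites are missing. The following is a minimal sketch, not part of TRL or Harbor: `check_backend` is a hypothetical helper. It only looks for the `E2B_API_KEY` variable and for a `docker` executable on `PATH`, and does not verify that the Docker daemon is actually reachable.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def check_backend(
    environment_type: str,
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of problems for the chosen backend (empty if none found).

    Hypothetical helper for illustration only.
    """
    problems = []
    if environment_type == "e2b" and not env.get("E2B_API_KEY"):
        problems.append("E2B_API_KEY is not set")
    if environment_type == "docker" and which("docker") is None:
        # Checks that the CLI exists, not that the daemon is reachable.
        problems.append("docker executable not found on PATH")
    return problems
```

Calling `check_backend("e2b")` at the top of a training script would surface a missing key before any model is loaded.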

Quick start

HarborSpec wires a single Harbor task suite into the three TRL trainer slots — train_dataset, environment_factory, reward_funcs — by exposing properties that map 1:1 to those kwarg names:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(
        num_generations=8,
        max_steps=50,
        max_tool_calling_iterations=25,
        log_completions=True,
    ),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()

Under the hood HarborSpec does three things, lazily on first access:

spec.train_dataset: resolves the task suite to local task directories (downloading the HF dataset if needed) and builds a datasets.Dataset with prompt (empty — the env’s instruction is appended at reset), task_dir, task_index, plus per-task task.toml metadata columns.
spec.environment_factory: returns a zero-arg callable producing a fresh per-rollout HarborEnv. On reset(task_dir) it starts the task’s Harbor sandbox and returns its instruction; tool methods exec into the sandbox; env.reward runs the verifier once after the rollout.
spec.reward_funcs: an outcome reward that reads the Harbor verifier’s scalar per rollout.
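
The per-rollout lifecycle described above (a factory produces a fresh env, reset returns the instruction, tool methods act on the sandbox, and the reward is computed once after the rollout) can be sketched with a stand-in class. `ToyEnv` is hypothetical and keeps an in-memory dict where the real HarborEnv would use a sandbox. The caching shown is one plausible way to get "verifier runs once" behaviour, not necessarily how TRL implements it.

```python
class ToyEnv:
    """Stand-in for a per-rollout environment; not the real HarborEnv."""

    def __init__(self):
        self.files = {}
        self.verifier_runs = 0
        self._reward = None

    def reset(self, task_dir: str) -> str:
        # The real env would start the task's sandbox here.
        self.files = {}
        self._reward = None
        return f"Write 42 to answer.txt (task: {task_dir})"

    def write_file(self, name: str, content: str) -> str:
        """Tool method; the real env would exec into the sandbox."""
        self.files[name] = content
        return f"wrote {name}"

    @property
    def reward(self) -> float:
        # Run the toy verifier on first access only, then reuse the result.
        if self._reward is None:
            self.verifier_runs += 1
            self._reward = 1.0 if self.files.get("answer.txt") == "42" else 0.0
        return self._reward


def environment_factory() -> ToyEnv:
    # Zero-arg callable producing a fresh env for each rollout.
    return ToyEnv()


env = environment_factory()
instruction = env.reset("tasks/demo")
env.write_file("answer.txt", "42")
rewards = [env.reward, env.reward]  # second access hits the cache
```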

The dataset

dataset is either a Hugging Face dataset repo id holding a Harbor task tree, or a local path containing a tasks/ subtree. Each task is a directory:

tasks/<task_id>/
├── instruction.md          # the task prompt (returned by reset)
├── task.toml               # config + metadata (gold answer, difficulty, ...)
├── environment/            # Dockerfile (+ any pre-agent data hooks)
└── tests/                  # test.sh / grader → writes the reward
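
To sanity-check a local task tree before training, something like the following sketch can list the directories that match the layout above. `find_tasks` and the toy file contents are hypothetical (the `task.toml` line in particular is a placeholder, not Harbor's schema), and the real loader may apply different rules.

```python
import tempfile
from pathlib import Path

REQUIRED = ("instruction.md", "task.toml", "environment", "tests")


def find_tasks(root: Path) -> list[Path]:
    """Return task dirs under root/tasks, sorted, that contain every required entry."""
    tasks_root = root / "tasks"
    return sorted(
        d for d in tasks_root.iterdir()
        if d.is_dir() and all((d / name).exists() for name in REQUIRED)
    )


# Build a toy tree with two complete tasks and one incomplete directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for task_id in ("task_b", "task_a"):
        task = root / "tasks" / task_id
        (task / "environment").mkdir(parents=True)
        (task / "tests").mkdir()
        (task / "instruction.md").write_text("Count the rows in data.csv.")
        (task / "task.toml").write_text('difficulty = "easy"\n')  # placeholder content
    (root / "tasks" / "incomplete").mkdir()
    found = [p.name for p in find_tasks(root)]
```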

Select a subset with num_tasks or indices (mutually exclusive):

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", num_tasks=10)        # first 10
spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", indices=[0, 5, 13])  # specific
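
The selection semantics (first N tasks, or explicit indices into the sorted task list, but not both) can be illustrated with a small stand-alone function. `select_tasks` is a hypothetical name and this is a sketch of the documented behaviour, not the code HarborSpec runs.

```python
from typing import Optional, Sequence


def select_tasks(
    tasks: Sequence[str],
    num_tasks: Optional[int] = None,
    indices: Optional[Sequence[int]] = None,
) -> list[str]:
    """Pick a subset of tasks by count or by index into the sorted list."""
    if num_tasks is not None and indices is not None:
        raise ValueError("num_tasks and indices are mutually exclusive")
    ordered = sorted(tasks)
    if indices is not None:
        return [ordered[i] for i in indices]
    if num_tasks is not None:
        return ordered[:num_tasks]
    return ordered  # neither given: use every task


tasks = ["t3", "t1", "t2", "t0"]
first_two = select_tasks(tasks, num_tasks=2)
picked = select_tasks(tasks, indices=[0, 3])
```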

Agents: external vs installed

Harbor supports two ways an agent drives a task, and the distinction determines what can be trained with RL:

External agents run outside the sandbox and drive the loop themselves, issuing commands into the container through Harbor’s environment interface (“typically by executing bash commands via the exec method”). The agent decides each action and interprets each result; the sandbox only executes.
Installed agents are installed into the container image and run there as a headless subprocess (extending BaseInstalledAgent). Harbor launches the agent inside the sandbox and parses its trajectory file afterward (populate_context_post_run); the agent runs autonomously with its own inference.

TRL’s integration follows the external-agent pattern, and only that pattern is supported for now. RL training requires the trainer to drive the rollout turn by turn: the policy model being trained generates each turn, and TRL captures its tokens and log-probs and applies the environment mask. This is the advantage environment_factory has over a black-box rollout_func. An installed agent is opaque to this process: it runs inside the container with its own model and only emits a trajectory after the fact, so there are no policy tokens or log-probs for the trainer to optimize, and the model under training is never invoked. A HarborEnv is therefore an external agent: its tool methods exec into the sandbox, but the loop, and the model under training, stay in TRL.
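
A toy version of such a trainer-driven loop makes the environment mask concrete. Everything here is a stand-in: `toy_policy` replaces the model under training, `toy_tool` replaces a sandbox exec, and whitespace-separated words play the role of tokens. The parameter name `max_tool_calling_iterations` is borrowed from GRPOConfig only to show a turn cap, and the real trainer logic is more involved.

```python
def toy_policy(turn: int) -> str:
    # Stand-in for the model under training; a real policy would be sampled.
    return ["bash ls", "bash cat data.txt", "done"][turn]


def toy_tool(command: str) -> str:
    # Stand-in for a tool method that would exec into the sandbox.
    return {"ls": "data.txt", "cat data.txt": "42"}.get(command, "")


def rollout(max_tool_calling_iterations: int = 25):
    tokens, mask = [], []
    for turn in range(max_tool_calling_iterations):
        action = toy_policy(turn)
        action_tokens = action.split()
        tokens += action_tokens
        mask += [1] * len(action_tokens)  # policy tokens: contribute to the loss
        if action == "done":
            break
        result_tokens = toy_tool(action.removeprefix("bash ")).split()
        tokens += result_tokens
        mask += [0] * len(result_tokens)  # environment tokens: masked out
    return tokens, mask


tokens, mask = rollout()
```

An installed agent would hand back only the finished trajectory, so there would be no per-turn point at which to record policy tokens or build this mask.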

Selecting the base agent (harness)

The base agent is the harness — which tool methods the env exposes and how it submits. Select it with agent=:

HarborSpec(dataset, agent="bash")                          # built-in single-bash-tool harness
HarborSpec(dataset, agent="my_pkg.harnesses:JupyterEnv")   # import path to your HarborEnv subclass
HarborSpec(dataset, agent="path/to/harness.py:JupyterEnv") # file path to your HarborEnv subclass
HarborSpec(dataset, agent=MyHarborEnv)                     # a HarborEnv subclass directly
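
For intuition about the string forms, here is a rough sketch of how a "module:Class" or "file.py:Class" spec could be resolved to a class with the standard library. `resolve_agent` is hypothetical and is not TRL's resolver. It also omits built-in names such as "bash", which would need a separate lookup table.

```python
import importlib
import importlib.util
import tempfile
from pathlib import Path


def resolve_agent(agent):
    """Resolve an agent spec to a class (illustrative only)."""
    if isinstance(agent, type):
        return agent  # already a class
    target, sep, class_name = agent.rpartition(":")
    if not sep:
        raise ValueError(f"expected 'module:Class' or 'file.py:Class', got {agent!r}")
    if target.endswith(".py"):
        # File path: load the module directly from the file.
        spec = importlib.util.spec_from_file_location(Path(target).stem, target)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Dotted import path.
        module = importlib.import_module(target)
    return getattr(module, class_name)


with tempfile.TemporaryDirectory() as tmp:
    harness_file = Path(tmp) / "harness.py"
    harness_file.write_text("class JupyterEnv:\n    pass\n")
    from_file = resolve_agent(f"{harness_file}:JupyterEnv")

from_module = resolve_agent("collections:OrderedDict")
```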

The built-in "bash" harness (HarborBashEnv) exposes one bash tool and submits by writing /workdir/answer.txt. Two richer harnesses ship as examples — each in its own folder with a README listing its tools — under examples/scripts/harbor/harnesses/:

jupyter/ (JupyterEnv) — a stateful Python kernel (variables persist across cells) + a shell tool.
terminal_notes/ (TerminalNotesEnv) — 6 shell tools (incl. background processes) + a 4-tool persistent note toolkit.

HarborSpec(dataset, agent="examples/scripts/harbor/harnesses/jupyter/env.py:JupyterEnv")
HarborSpec(dataset, agent="examples/scripts/harbor/harnesses/terminal_notes/env.py:TerminalNotesEnv")

To write your own harness, subclass HarborEnv and add tool methods — every public method becomes a tool (TRL discovers them with inspect.getmembers), so give each a typed signature and a docstring (used to build the tool schema). Keep helpers underscore-prefixed. Use self._exec(cmd) to run shell commands in the sandbox, and set PROMPT_SUFFIX to append harness guidance to the task instruction:

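import shlex

from trl.experimental.harbor import HarborEnv

class GrepEnv(HarborEnv):
    PROMPT_SUFFIX = "\n\nUse `grep` and `read_file`. Submit by writing /workdir/answer.txt."

    def grep(self, pattern: str, path: str) -> str:
        """Search for `pattern` under `path`.

        Args:
            pattern: The regex to search for.
            path: The file or directory to search.
        """
        # shlex.quote guards against shell metacharacters in model-supplied arguments.
        return self._exec(f"grep -rn -e {shlex.quote(pattern)} -- {shlex.quote(path)}")

    def read_file(self, path: str) -> str:
        """Print the contents of the file at `path`.

        Args:
            path: The file to read.
        """
        return self._exec(f"cat -- {shlex.quote(path)}")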
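
The "every public method becomes a tool" rule can be checked on a class with `inspect.getmembers`. The sketch below uses a plain stand-in class so it runs without TRL installed. `discover_tools` is a hypothetical helper, and excluding `reset` by name is an assumption about how lifecycle methods are kept out of the tool list.

```python
import inspect


class ToyEnv:
    def reset(self, task_dir: str) -> str:
        return "instruction"

    def grep(self, pattern: str, path: str) -> str:
        """Search for `pattern` under `path`."""
        return ""

    def _exec(self, cmd: str) -> str:
        # Underscore-prefixed helper: should not be exposed as a tool.
        return ""


def discover_tools(env, exclude=("reset",)):
    """Map tool name -> bound method for every public method on env."""
    return {
        name: method
        for name, method in inspect.getmembers(env, predicate=inspect.ismethod)
        if not name.startswith("_") and name not in exclude
    }


tools = discover_tools(ToyEnv())
tool_names = sorted(tools)
tool_doc = inspect.getdoc(tools["grep"])  # docstring that a schema builder could use
```

Running a check like this on a new harness is a quick way to spot a helper that was accidentally left public.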

The sandbox backend

environment_type is passed straight through to Harbor (not validated by TRL):

HarborSpec(dataset, environment_type="e2b")   # cloud sandbox (offloads provisioning), needs E2B_API_KEY
HarborSpec(dataset, environment_type="docker")  # default; needs a local Docker daemon

e2b is recommended for cluster training: only environment.exec crosses into the cloud sandbox, so the GPUs stay dedicated to the policy and you can run many rollouts concurrently.

Reward functions

spec.reward_funcs defaults to an outcome reward — per rollout it reads the Harbor verifier’s scalar (env.reward), computed once after the rollout by running the task’s tests/ verifier in the sandbox. For a custom reward, write a regular TRL reward function:

def my_reward(environments, **kwargs) -> list[float]:
    return [env.reward for env in environments]
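
As a slightly less trivial variant, a reward function might binarize the verifier's scalar. This is only a sketch: `thresholded_reward` and its `threshold` argument are hypothetical, the `SimpleNamespace` objects stand in for real envs, and treating a missing score as failure is an assumption.

```python
from types import SimpleNamespace


def thresholded_reward(environments, threshold: float = 0.5, **kwargs) -> list[float]:
    """Return 1.0 for rollouts whose verifier score meets the threshold, else 0.0."""
    rewards = []
    for env in environments:
        score = env.reward
        # Assumption: a missing score counts as a failed rollout.
        rewards.append(1.0 if score is not None and score >= threshold else 0.0)
    return rewards


# Stand-in envs exposing only a `reward` attribute.
fake_envs = [SimpleNamespace(reward=r) for r in (0.9, 0.2, None)]
binary = thresholded_reward(fake_envs)
```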

API

class trl.experimental.harbor.HarborSpec

( dataset: str, agent: str | type[HarborEnv] = 'bash', environment_type: str = 'docker', num_tasks: int | None = None, indices: list[int] | None = None, include_metadata: bool = True )

Parameters

dataset (str) — A Hugging Face dataset repo id holding a Harbor task tree (e.g. "AdithyaSK/data_agent_rl_environment_train"), or a local path to a directory containing a tasks/ subtree. Each task is a dir with instruction.md / task.toml / environment/ / tests/.
agent (str or type, optional, defaults to "bash") — The base agent / harness — i.e. the tool surface the env exposes. One of: a built-in name ("bash"), an import path "package.module:ClassName", a file path "path/to/file.py:ClassName", or a HarborEnv subclass directly.
environment_type (str, optional, defaults to "docker") — Harbor sandbox backend, passed through to Harbor (whatever it supports — "docker", "e2b", "daytona", "gke", "modal", "runloop", …). Not validated here; Harbor validates. "docker" is Harbor’s own default; pick "e2b" to offload sandboxing to the cloud.
num_tasks (int, optional) — Cap on the number of tasks pulled into the dataset. None uses every task in the tree.
indices (list[int], optional) — Specific task indices (into the sorted task list). Mutually exclusive with num_tasks.
include_metadata (bool, optional, defaults to True) — Fold per-task task.toml metadata (gold_answer, difficulty, …) into the dataset rows.

Single spec object that wires a Harbor task suite into a TRL trainer.

class trl.experimental.harbor.HarborEnv

( environment_type: str = 'docker' )

Parameters

environment_type (str, optional, defaults to "docker") — Harbor sandbox backend, passed through to Harbor ("docker", "e2b", "daytona", …).

Base TRL environment backed by a Harbor sandbox + verifier.

Subclasses define the tool methods (the harness). The lifecycle TRL drives per rollout: reset(task_dir) (start the task’s sandbox, return its instruction) -> tool methods (exec into the sandbox) -> reward (run the verifier once, lazily, after the rollout).

class trl.experimental.harbor.HarborBashEnv

( environment_type: str = 'docker' )

Single-bash-tool harness; submit by writing /workdir/answer.txt.

bash

( command: str )

Parameters

command — The shell command to run.

Run a shell command in the sandbox and return its combined stdout+stderr. The shell is stateless between calls, so state such as the working directory or exported variables should not be expected to persist. Use it to explore files (ls, head, cat), run Python (python3 -c "..."), and submit the answer (echo -n "<value>" > /workdir/answer.txt).
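
The practical consequence of a shell that keeps no state between calls can be reproduced locally. In this sketch `bash` is a local stand-in that spawns a fresh `sh` per call through `subprocess` (it does not touch a Harbor sandbox), and `DEMO_STATE` is an arbitrary variable name assumed not to be set in the surrounding environment.

```python
import subprocess


def bash(command: str) -> str:
    """Run command in a fresh shell and return combined stdout+stderr."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    return result.stdout


first = bash("DEMO_STATE=set; echo $DEMO_STATE").strip()
second = bash("echo ${DEMO_STATE:-unset}").strip()  # new shell: variable is gone
chained = bash("cd / && pwd").strip()  # chain steps that depend on each other
```

When one step depends on another (a `cd`, an exported variable), chaining them in a single command with `&&` keeps them in the same shell.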

Limitations

The integration is in trl.experimental — APIs may change. Set TRL_EXPERIMENTAL_SILENCE=1 to silence the warning in CI logs.
Harbor’s async sandbox client is bound to one event loop, so each env drives start/exec/verify synchronously on its own loop; sandbox provisioning is therefore sequential across the generation batch (cloud backends like e2b mitigate the per-sandbox cost).
A single HarborSpec covers one task suite + one harness; multi-suite training is not supported yet.
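
The "async client driven synchronously on its own loop" arrangement can be pictured with a small sketch. `FakeAsyncSandbox` and `SyncDriver` are hypothetical stand-ins, not Harbor or TRL classes. The point is only that each driver owns a private event loop and blocks on every call.

```python
import asyncio


class FakeAsyncSandbox:
    """Hypothetical async client standing in for a sandbox backend."""

    async def start(self) -> str:
        await asyncio.sleep(0)
        return "started"

    async def exec(self, cmd: str) -> str:
        await asyncio.sleep(0)
        return f"ran: {cmd}"


class SyncDriver:
    """Drives one async sandbox synchronously on a private event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._sandbox = FakeAsyncSandbox()

    def _run(self, coro):
        # Block until the coroutine finishes on this driver's own loop.
        return self._loop.run_until_complete(coro)

    def start(self) -> str:
        return self._run(self._sandbox.start())

    def exec(self, cmd: str) -> str:
        return self._run(self._sandbox.exec(cmd))

    def close(self) -> None:
        self._loop.close()


driver = SyncDriver()
started = driver.start()
output = driver.exec("ls")
driver.close()
```

Because each call blocks until it completes, starting several such drivers one after another provisions their sandboxes sequentially, which matches the limitation noted above.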
