Spaces:

Nomearod
/

agentbench

Sleeping

Nomearod Claude Opus 4.6 (1M context) commited on Apr 14

Commit

27c2e17

1 Parent(s): 213da36

docs(eval): Fix 2 SearchTool query expansion — attempted and reverted

Fix 2 (deterministic primary+secondary retrieval merge at the
SearchTool layer) attempted as an alternative to Fix 1's prompt-
layer counterfactual clause. Pre-committed four-metric gate +
strict pilot_005 flip criterion written BEFORE the post-edit runs.

Two K8s post-edit measurements produced:
- fix2 v1 (primary[:3] + secondary[:5] merge): failed P@5 by
-0.033 — metric-definition mismatch where
retrieval_precision_at_k's top-5 window included 2 expansion
chunks, one of which was from an adjacent source file.
- fix2 v2 (primary[:5] + secondary[:5] merge, Path A refinement):
every aggregate metric LITERALLY unchanged pre/post
(P@5 0.800, R@5 1.000, KHR 0.806, citation 1.000,
mean tool_calls 1.167). Design invariant holds perfectly —
the iteration budget, tool schema, and refusal gate are
untouched from the LLM's perspective.

pilot_005 strict flip STILL fails on v2 — not on a metric, but on
the refusal-phrasing criterion. Post-edit answer contains the
correct documented-negative facts and citation but opens with
"The Kubernetes documentation does not provide specific
instructions on configuring..." — the exact refusal-phrasing
opener the criterion was pre-committed to reject. Compare to
Fix 1's cleaner phrasing ("NetworkPolicy does not support
enforcing mTLS directly...") — the difference is not cosmetic:
Fix 2's answer hedges about the documentation while Fix 1's
asserts a fact about NetworkPolicy. Fix 2 successfully puts the
target chunk in the LLM's context; what it cannot provide is
explicit guidance on how to phrase the documented negative.

Diagnosis: Fix 2 is not an alternative to Fix 1 — it is a
prerequisite. A future session exploring Fix 2 + targeted prompt
guidance stacked is the natural next experiment. Fix 1's over-
firing problem may be smaller when the expansion already
happened deterministically inside the first tool call, so the
clause has less work to do. Speculative and not for this session.

Gate discipline honored: both Fix 1 and Fix 2 attempts have been
pre-committed, measured, and reverted under their pre-committed
criteria. No tolerance relaxation.

Kept from the attempt:
- tests/test_tools.py: MockChunk.id hygiene (mock now matches
the real Chunk API — removes a latent bug for future test
authors)
- tests/test_tools.py: TestSearchToolSpecSnapshot freezes the
SearchTool LLM-facing contract (name, description, parameters).
A future refactor that silently exposes internal state to the
LLM would break iteration-budget invariants; the snapshot test
catches this at test time.
- results/k8s_postedit_fix2.json (v1, P@5 failure)
- results/k8s_postedit_fix2_merge_v2.json (v2, clean metric pass
with strict-flip failure)

Reverted:
- agent_bench/tools/search.py back to HEAD (no Fix 2 code
remains)

Full test suite passes (441/441 — 438 pre-session + 3 new
snapshot tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (4) hide show

DECISIONS.md +279 -0
results/k8s_postedit_fix2.json +194 -0
results/k8s_postedit_fix2_merge_v2.json +203 -0
tests/test_tools.py +33 -1

DECISIONS.md CHANGED Viewed

@@ -878,3 +878,282 @@ in scope. The narrowed serving-migration deferral entry (tied to
 any external reference to the counterfactual-query fix) also stays
 deferred until Fix 2 lands, since the production/eval-harness
 prompt divergence is unchanged by this revert.

 any external reference to the counterfactual-query fix) also stays
 deferred until Fix 2 lands, since the production/eval-harness
 prompt divergence is unchanged by this revert.
+## Fix 2 pre-committed regression gate — SearchTool deterministic query expansion
+**Pre-committed BEFORE post-edit runs** (same discipline pattern
+that caught Fix 1's iteration inflation cleanly).
+**Mechanism under test.** `agent_bench/tools/search.py`
+`SearchTool.execute` gains a deterministic two-query retrieval
+path. When the primary retrieval passes the refusal gate, a
+secondary retrieval is issued against an expanded query
+(`original_query + " not supported limitations cannot"`), and the
+final context returned to the LLM is `primary_top_3 ++
+secondary_top_5` deduplicated by `chunk.id`. Both retrievals run
+inside a single `SearchTool.execute` call — from the LLM's
+perspective, the tool schema, name, parameters, and return shape
+are unchanged, and the iteration budget is untouched.
+**Why this is architecturally different from Fix 1.** Fix 1 placed
+a behavioral clause in the system prompt that told the agent to
+issue follow-up searches itself. The trigger was an LLM judgment
+("did the first search return content addressing the specific
+capability?") and the follow-up was a separate tool call, so it
+counted against `max_iterations`. Over-firing on compound questions
+inflated iteration counts and pushed q024/q025 to the cap. Fix 2
+replaces this with a deterministic trigger (primary passes gate),
+a fixed expansion suffix, and a merge that happens entirely inside
+one tool call. No LLM judgment; no iteration change; corpus-
+agnostic.
+**Suffix choice.** `" not supported limitations cannot"`. Keyword-
+dense, ungrammatical on purpose — the suffix exists to shift BM25
+and embedding mass toward "what you cannot do" / "limitations"
+sections, not to read well. The ungrammatical form is also a self-
+documenting signal in retrieval logs: anyone reading a query trace
+sees the suffix and immediately knows it is a synthetic expansion,
+not user input. A one-line comment in `search.py` preserves the
+rationale for future readers.
+**Merge choice.** `primary_top_3 + secondary_top_5` deduped by
+`chunk.id`, producing 5–8 unique chunks per call. Rationale: top-5
+primary would make the expansion redundant on high-overlap queries
+(defeating the mechanism), while primary-top-3 guarantees the
+expansion always contributes to the final context window. Probe
+data (`/tmp/probe_fix2_v2.py`, throwaway) confirms this merge
+strategy surfaces pilot_005's target chunk
+(`d0806d5da91d6026`, chunk_index 63, "Anything TLS related ... use
+a service mesh or ingress controller for this") at position 6–8 in
+the merged list.
+**Opt-in flag, defaulting ON.** `SearchTool` accepts
+`negative_framing_expansion: bool = True`. Default is the shipping
+configuration because the regression gate must measure the shipping
+behavior, not the no-op path. A `False` default would mean the gate
+validates an unused parameter, and a subsequent commit flipping the
+default would have no regression evidence. Kill switch is preserved
+via explicit `False` at construction if a future regression
+requires an A/B comparison.
+**Baseline reuse.** The Fix 1 session's pre-edit JSONs
+(`results/fastapi_preedit.json`, `results/k8s_preedit_pinned.json`,
+both committed at `bd2b913`) were measured under the currently-
+committed state of the repo: pinned `gpt-4o-mini-2024-07-18`, K8s
+threshold 0.015, FastAPI threshold 0.02, HEAD `prompts.py` with no
+clause, HEAD `search.py` with no expansion. The working tree
+verification confirms this state is unchanged. These JSONs are
+therefore reused as the Fix 2 pre-edit baseline and do not need to
+be re-measured. Only post-edit runs are required for the Fix 2
+regression (~$0.02 saved).
+**Pre-committed tolerances.**
+| Metric | Pass criterion |
+|---|---|
+| P@5 | post-edit ≥ pre-edit − 0.02 |
+| R@5 | post-edit ≥ pre-edit − 0.02 |
+| Citation accuracy | post-edit ≥ pre-edit (**hard gate** — any drop blocks commit) |
+| Mean `tool_calls_made` | post-edit ≤ pre-edit + **0.05** (design-correctness gate — see note) |
+| Individual cap-hit | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
+**Note on the tool_calls gate.** ≤ +0.05 is a *design-correctness*
+gate, not a *performance* gate. Fix 2's invariant is that both
+retrievals happen inside one `SearchTool.execute` call, so the
+LLM's iteration count is unchanged by construction. Any non-trivial
+movement in `mean tool_calls_made` indicates the design invariant
+is broken — e.g., expansion accidentally exposed as a separate
+tool, or the LLM observing two-call behavior and adapting its
+strategy. The gate fires on design violation, not on performance
+regression. The 0.05 absolute threshold absorbs legitimate run-to-
+run variance from non-determinism in the LLM even at temperature
+0, without absorbing real iteration-count movement.
+**pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1
+gate):**
+- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
+- Answer cites `k8s_network_policies.md`
+- Answer contains "service mesh" OR "ingress controller"
+- Answer does NOT begin with refusal phrasing
+**Baseline reference for the gate.**
+| Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls |
+|---|---|---|---|---|---|
+| FastAPI (27) | `results/fastapi_preedit.json` @ `bd2b913` | 0.585 | 0.679 | 1.000 | 1.111 |
+| K8s (6 pilots) | `results/k8s_preedit_pinned.json` @ `bd2b913` | 0.800 | 1.000 | 1.000 | 1.167 |
+**Post-edit filenames (to be produced).**
+- `results/fastapi_postedit_fix2.json`
+- `results/k8s_postedit_fix2.json`
+**If the gate passes:** commit Fix 2 with `search.py` change, unit
+tests (including the tool-spec snapshot test), the two post-edit
+result JSONs, and this DECISIONS.md entry extended with the
+regression outcome.
+**If the gate fires:** revert, document the failure mode, surface
+the specific criterion that fired. No tolerance relaxation — same
+discipline pattern as Fix 1 revert.
+## Fix 2 outcome — mechanism works, response-style criterion fired, reverted
+**Regression runs produced.** Two post-edit runs on K8s (FastAPI not
+run — K8s findings gated the decision before API spend on the
+broader set):
+| Run | Merge rule | File | Purpose |
+|---|---|---|---|
+| Fix 2 v1 | `primary[:3] + secondary[:5]` | `results/k8s_postedit_fix2.json` | Initial implementation |
+| Fix 2 v2 | `primary[:5] + secondary[:5]` | `results/k8s_postedit_fix2_merge_v2.json` | Path A refinement after v1 failed P@5 on a metric-definition mismatch |
+**v1 findings.** Aggregate: P@5 0.800 → 0.767 (Δ −0.033, **FAILED**
+the P@5 ≥ −0.02 tolerance). The failure traced to a merge-rule /
+metric-semantics interaction: `retrieval_precision_at_k` computes
+precision on `retrieved_sources[:5]`, and with `primary[:3] +
+secondary[:5]` the first 5 entries were `primary_top_3 +
+secondary_top_2`. For pilot_005, `secondary[1]` was
+`k8s_pods.md` (chunk_index 40, surfaced because the reranker
+matched its "localhost communication" content against the expanded
+query). That single off-source chunk in position 5 dropped P@5
+from 1.00 to 0.80 for pilot_005 and similarly for pilot_006.
+Iteration invariant held (tool_calls 1.167 → 1.167). Citation
+accuracy held (1.000 → 1.000). Target chunk
+(`d0806d5da91d6026`, "Anything TLS related") reached the LLM
+context for pilot_005 at merged position 7.
+**Path A refinement (merge v2).** Change `primary[:3] +
+secondary[:5]` → `primary[:5] + secondary[:5]`. Rationale:
+primary_top_5 is preserved in positions 1–5 by construction, so
+P@5 computed on `ranked_sources[:5]` is unchanged from the
+no-expansion baseline. Expansion chunks land in positions 6–10.
+Target chunk still reaches LLM context (position 9 for pilot_005).
+This is an **implementation refinement, not a tolerance
+relaxation** — the pre-committed gate thresholds stand; only the
+merge rule was adjusted to respect the metric's window semantics.
+**v2 findings — perfect metric preservation, but strict-flip fails on response style.**
+Aggregate:
+| Metric | Pre-edit | Fix 2 v2 | Delta |
+|---|---|---|---|
+| P@5 | 0.800 | 0.800 | **0.000** |
+| R@5 | 1.000 | 1.000 | 0.000 |
+| KHR | 0.806 | 0.806 | 0.000 |
+| Citation accuracy | 1.000 | 1.000 | 0.000 |
+| Mean `tool_calls_made` | 1.167 | 1.167 | **0.000** |
+Every aggregate metric **literally unchanged**. Per-question
+deltas: zero on every metric, every question. The design
+invariant (iteration budget unchanged, tool schema unchanged,
+refusal gate behavior unchanged) holds perfectly.
+**But pilot_005 strict flip fails on the refusal-phrasing criterion.**
+Post-edit answer:
+> *"The Kubernetes documentation does not provide specific
+> instructions on configuring a NetworkPolicy to enforce mutual TLS
+> (mTLS) between Pods in the same namespace. For mTLS, it is
+> generally recommended to use a service mesh or other proxy
+> solutions, as NetworkPolicy alone does not handle TLS
+> configurations directly [source: k8s_network_policies.md]."*
+The answer substantively contains the documented negative with
+citation. But it opens with *"The Kubernetes documentation does
+not provide specific instructions..."* — the exact refusal-
+phrasing opener the strict-flip criterion was pre-committed to
+reject. The criterion exists because the brand is honest
+evaluation: an answer that opens apologizing that the
+documentation "does not provide specific instructions" reads, to
+a technical reviewer, like the system failed to find the answer
+and is papering over the gap, even though the facts and citation
+are present. The criterion fired as designed.
+**Compare to Fix 1 post-edit answer (from `bd2b913` evidence):**
+> *"Kubernetes NetworkPolicy does not support enforcing mutual TLS
+> (mTLS) directly. The documentation states that anything TLS
+> related should be handled using a service mesh or ingress
+> controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."*
+Fix 1's answer asserts a fact about **NetworkPolicy** ("does not
+support"); Fix 2's answer asserts a fact about **the documentation**
+("does not provide instructions"). The first forecloses the
+capability; the second leaves open whether the capability exists
+somewhere the system didn't see. That distinction is load-bearing
+for any grounded-refusal narrative, and it separates a system that
+handles documented negatives crisply from one that hedges around
+them.
+**Diagnosis.** Fix 2's mechanism successfully gets the target chunk
+into the LLM's context window — the retrieval side of the problem
+is solved. What Fix 2 **cannot provide** is explicit guidance on
+how to phrase the documented negative once the chunk is present.
+Fix 1's prompt clause was doing that guidance work; removing the
+clause and relying on the LLM's unaided response style produces a
+hedging answer because the LLM, seeing both NetworkPolicy-spec
+content and a TLS limitation bullet, defaults to contextual
+hedging rather than crisp assertion.
+**Fix 2 is therefore not an alternative to Fix 1's prompt clause
+— it is a prerequisite.** Fix 2 guarantees the chunk reaches
+context; a future "Fix 2 + targeted prompt clause" stack could
+resolve both the retrieval gap and the response-style gap without
+Fix 1's over-firing problem, because the clause would no longer
+need to direct the agent to do a follow-up search (Fix 2 handled
+that). The over-firing on compound questions that broke Fix 1 was
+caused by the agent deciding to do extra search iterations under
+LLM judgment; if the expansion already happened deterministically
+inside the first tool call, the clause has less work to do and
+may not trigger the second-LLM-call pattern at all. **Speculative
+and not for this session.** Future work item.
+**Gate verdict: failed on pilot_005 strict flip criterion.**
+Reverting, same Fix-1 pattern.
+**What this commit contains.**
+- `agent_bench/tools/search.py` **reverted** to HEAD (no Fix 2
+  code changes)
+- `tests/test_tools.py` retains the `MockChunk.id` hygiene fix
+  (the real `Chunk` class has `id`; mock should match the real API
+  for future test authors)
+- `tests/test_tools.py` adds `TestSearchToolSpecSnapshot`: a
+  general-purpose guard that freezes `SearchTool`'s LLM-facing
+  contract (name, description, parameters). The lesson from Fix 2
+  is that any future refactor exposing internal SearchTool state
+  to the LLM would break iteration-budget invariants — the
+  snapshot test catches that at test time, independent of whether
+  Fix 2 lands.
+- Two regression evidence JSONs: `results/k8s_postedit_fix2.json`
+  (v1, the P@5 failure) and `results/k8s_postedit_fix2_merge_v2.json`
+  (v2, the strict-flip failure). Retained as the measurement
+  trail behind the revert decision.
+- This DECISIONS.md entry (pre-committed gate + outcome + revert
+  narrative).
+**What this commit does NOT contain.** No changes to
+`agent_bench/tools/search.py`, `agent_bench/core/prompts.py`, or
+`configs/default.yaml`. Both Fix 1 (prompt clause) and Fix 2
+(SearchTool expansion) have been attempted and reverted this
+session. Three commits of progress nonetheless: `b97f00f`
+(threshold calibration, empirical), `77017db` (prep bundle: model
+pin + fastapi wire + Fix 1 pre-committed tolerances), `bd2b913`
+(Fix 1 revert narrative). The threshold calibration and model pin
+are real, shipped, measurement-grounded infrastructure changes.
+The two fix attempts are documented learning that shapes the
+future direction.
+**Narrative summary.** Session hypothesis: pilot_005 is a
+counterfactual-query-expansion problem. Session evidence: the
+hypothesis is correct on retrieval — the target chunk is reachable
+via negative-framing queries and Fix 2 surfaces it deterministically
+with zero iteration-budget impact. Session evidence also shows the
+hypothesis is **incomplete** — retrieval-only fixes cannot close
+the response-style gap, because the LLM under unaided prompting
+hedges when a documented negative is surrounded by unrelated
+topical content. A future session exploring **Fix 2 + targeted
+prompt guidance stacked** is the natural next experiment; this
+session's pilot-first discipline has been preserved against two
+distinct pre-committed gates, both firing for the reasons they
+were designed to catch.

results/k8s_postedit_fix2.json ADDED Viewed

	@@ -0,0 +1,194 @@

+[
+  {
+    "question_id": "k8s_pilot_001",
+    "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.6,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 2,
+    "latency_ms": 21174.48547598906,
+    "tokens_used": {
+      "input_tokens": 1872,
+      "output_tokens": 150,
+      "estimated_cost_usd": 0.00037079999999999996
+    },
+    "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. Every container within a Pod shares the same network namespace, which includes the IP address and network ports. This allows containers in the same Pod to communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_secret.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_002",
+    "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 12374.25357697066,
+    "tokens_used": {
+      "input_tokens": 1122,
+      "output_tokens": 140,
+      "estimated_cost_usd": 0.0002523
+    },
+    "answer": "When you update a Deployment's pod template, Kubernetes creates a new ReplicaSet to manage the new version of the Pods. The existing ReplicaSet, which controls the old version of the Pods, is scaled down while the new ReplicaSet is scaled up. This process ensures that Pods are transitioned from the old version to the new one in a controlled manner. Specifically, the Deployment controller gradually scales the new ReplicaSet to the desired number of replicas while simultaneously scaling down the old ReplicaSet, ensuring that Pods are replaced at a controlled rate [source: k8s_deployment.md].",
+    "retrieved_sources": [
+      "k8s_deployment.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_003",
+    "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.6666666666666666,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 9694.163175998256,
+    "tokens_used": {
+      "input_tokens": 1215,
+      "output_tokens": 102,
+      "estimated_cost_usd": 0.00024344999999999998
+    },
+    "answer": "The key difference between a ConfigMap and a Secret when deciding where to store sensitive application data, such as database passwords, is that a ConfigMap does not provide secrecy or encryption, making it unsuitable for confidential data. In contrast, a Secret is specifically intended to hold confidential data and includes additional protections to ensure the security of that information [source: k8s_configmap.md][source: k8s_secret.md].",
+    "retrieved_sources": [
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_configmap.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_004",
+    "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.5,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10764.822138939053,
+    "tokens_used": {
+      "input_tokens": 1041,
+      "output_tokens": 135,
+      "estimated_cost_usd": 0.00023714999999999994
+    },
+    "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`), the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you must specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the thresholds you did not change [source: k8s_node_pressure_eviction.md].",
+    "retrieved_sources": [
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_005",
+    "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 0.8,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10257.949891034514,
+    "tokens_used": {
+      "input_tokens": 1225,
+      "output_tokens": 102,
+      "estimated_cost_usd": 0.00024495
+    },
+    "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. It mentions that NetworkPolicy is primarily used for controlling traffic between Pods but does not cover mTLS configurations directly. For mTLS, it is suggested to use a service mesh or ingress controller instead [source: k8s_network_policies.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_006",
+    "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.2,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 8818.4428520035,
+    "tokens_used": {
+      "input_tokens": 1249,
+      "output_tokens": 62,
+      "estimated_cost_usd": 0.00022455
+    },
+    "answer": "The built-in Pod Security admission controller is in the **stable** feature state as of **Kubernetes v1.25** [source: k8s_pod_security_admission.md].",
+    "retrieved_sources": [
+      "k8s_pod_security_admission.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_pods.md",
+      "k8s_pod_security_admission.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  }
+]

results/k8s_postedit_fix2_merge_v2.json ADDED Viewed

	@@ -0,0 +1,203 @@

+[
+  {
+    "question_id": "k8s_pilot_001",
+    "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.4,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 2,
+    "latency_ms": 17866.33544403594,
+    "tokens_used": {
+      "input_tokens": 2038,
+      "output_tokens": 153,
+      "estimated_cost_usd": 0.00039749999999999996
+    },
+    "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. However, every container within a Pod shares the same network namespace, which includes the IP address and network ports. This means that containers in the same Pod can communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_secret.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_002",
+    "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.6666666666666666,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 12697.782160015777,
+    "tokens_used": {
+      "input_tokens": 1213,
+      "output_tokens": 198,
+      "estimated_cost_usd": 0.00030074999999999996
+    },
+    "answer": "When you update a Deployment's pod template, Kubernetes uses a rolling update mechanism to transition Pods from the old version to the new one. This process involves creating a new ReplicaSet for the updated Deployment. The Deployment controller gradually scales up the new ReplicaSet while simultaneously scaling down the old ReplicaSet, ensuring that Pods are replaced in a controlled manner. This approach allows for a smooth transition without downtime, as it manages the number of Pods running at any time during the update process [source: k8s_deployment.md]. \n\nThe ReplicaSet plays a crucial role in this mechanism by managing the Pods that match the Deployment's specifications. When a Deployment is updated, the existing ReplicaSet that controls the old Pods is scaled down, and the new ReplicaSet is scaled up to the desired number of replicas specified in the Deployment [source: k8s_deployment.md].",
+    "retrieved_sources": [
+      "k8s_deployment.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_003",
+    "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10444.362040958367,
+    "tokens_used": {
+      "input_tokens": 1298,
+      "output_tokens": 123,
+      "estimated_cost_usd": 0.00026849999999999997
+    },
+    "answer": "The key difference between a ConfigMap and a Secret in Kubernetes is that a ConfigMap is used to store non-confidential data, while a Secret is specifically intended to hold confidential data, such as passwords, OAuth tokens, and SSH keys. ConfigMaps do not provide secrecy or encryption, making them unsuitable for sensitive information. Therefore, for storing sensitive application data like database passwords, it is recommended to use a Secret instead of a ConfigMap [source: k8s_configmap.md][source: k8s_secret.md].",
+    "retrieved_sources": [
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_configmap.md",
+      "k8s_configmap.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_004",
+    "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.5,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 9120.96216098871,
+    "tokens_used": {
+      "input_tokens": 1249,
+      "output_tokens": 135,
+      "estimated_cost_usd": 0.00026835
+    },
+    "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`), the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you should specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the thresholds you did not change [source: k8s_node_pressure_eviction.md].",
+    "retrieved_sources": [
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_005",
+    "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 8491.116772056557,
+    "tokens_used": {
+      "input_tokens": 1509,
+      "output_tokens": 92,
+      "estimated_cost_usd": 0.00028155
+    },
+    "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. For mTLS, it is generally recommended to use a service mesh or other proxy solutions, as NetworkPolicy alone does not handle TLS configurations directly [source: k8s_network_policies.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_006",
+    "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.4,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10854.806952062063,
+    "tokens_used": {
+      "input_tokens": 1320,
+      "output_tokens": 54,
+      "estimated_cost_usd": 0.00023040000000000002
+    },
+    "answer": "The built-in Pod Security admission controller reached a stable state in Kubernetes v1.25 [source: k8s_pod_security_admission.md].",
+    "retrieved_sources": [
+      "k8s_pod_security_admission.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_pod_security_admission.md",
+      "k8s_pods.md",
+      "k8s_pod_security_admission.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  }
+]

tests/test_tools.py CHANGED Viewed

@@ -2,7 +2,8 @@
 from __future__ import annotations
-from dataclasses import dataclass
 import pytest
@@ -17,6 +18,7 @@ from agent_bench.tools.search import SearchTool
 class MockChunk:
     content: str
     source: str
 @dataclass
@@ -221,6 +223,36 @@ class TestSearchTool:
         assert "query" in defn.parameters["required"]
 # --- Refusal gate tests ---

 from __future__ import annotations
+import uuid
+from dataclasses import dataclass, field
 import pytest
 class MockChunk:
     content: str
     source: str
+    id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
 @dataclass
         assert "query" in defn.parameters["required"]
+class TestSearchToolSpecSnapshot:
+    """Frozen snapshot of SearchTool's LLM-facing contract.
+    Any silent change to the tool name, description, or parameter schema
+    that reaches the LLM invalidates invariants that callers rely on —
+    for example, an attempt to expose internal SearchTool state (such as
+    a Fix-2-style expansion flag) as an LLM-visible parameter would
+    break the "one LLM-facing tool call per execute() invocation"
+    iteration-budget guarantee. These assertions fail loudly if the
+    contract drifts.
+    """
+    def test_tool_name(self):
+        assert SearchTool.name == "search_documents"
+    def test_tool_description(self):
+        assert SearchTool.description == (
+            "Search the technical documentation corpus for relevant passages. "
+            "Returns the most relevant document chunks with source attribution."
+        )
+    def test_tool_parameters_schema(self):
+        params = SearchTool.parameters
+        assert params["type"] == "object"
+        assert set(params["properties"].keys()) == {"query", "top_k"}
+        assert params["properties"]["query"]["type"] == "string"
+        assert params["properties"]["top_k"]["type"] == "integer"
+        assert params["required"] == ["query"]
 # --- Refusal gate tests ---