Nomearod and Claude Opus 4.6 (1M context) committed
Commit 27c2e17 · 1 Parent(s): 213da36

docs(eval): Fix 2 SearchTool query expansion — attempted and reverted

Fix 2 (deterministic primary+secondary retrieval merge at the
SearchTool layer) attempted as an alternative to Fix 1's prompt-
layer counterfactual clause. Pre-committed four-metric gate +
strict pilot_005 flip criterion written BEFORE the post-edit runs.

Two K8s post-edit measurements produced:
- fix2 v1 (primary[:3] + secondary[:5] merge): failed the P@5
  tolerance (delta -0.033). Cause: a metric-definition mismatch in
  which retrieval_precision_at_k's top-5 window included 2 expansion
  chunks, one of which was from an adjacent source file.
- fix2 v2 (primary[:5] + secondary[:5] merge, Path A refinement):
  every aggregate metric LITERALLY unchanged pre/post
  (P@5 0.800, R@5 1.000, KHR 0.806, citation 1.000,
  mean tool_calls 1.167). Design invariant holds perfectly:
  the iteration budget, tool schema, and refusal gate are
  untouched from the LLM's perspective.

pilot_005 strict flip STILL fails on v2 — not on a metric, but on
the refusal-phrasing criterion. Post-edit answer contains the
correct documented-negative facts and citation but opens with
"The Kubernetes documentation does not provide specific
instructions on configuring..." — the exact refusal-phrasing
opener the criterion was pre-committed to reject. Compare to
Fix 1's cleaner phrasing ("NetworkPolicy does not support
enforcing mTLS directly...") — the difference is not cosmetic:
Fix 2's answer hedges about the documentation while Fix 1's
asserts a fact about NetworkPolicy. Fix 2 successfully puts the
target chunk in the LLM's context; what it cannot provide is
explicit guidance on how to phrase the documented negative.

Diagnosis: Fix 2 is not an alternative to Fix 1 — it is a
prerequisite. A future session exploring Fix 2 + targeted prompt
guidance stacked is the natural next experiment. Fix 1's over-
firing problem may be smaller when the expansion already
happened deterministically inside the first tool call, so the
clause has less work to do. Speculative and not for this session.

Gate discipline honored: both Fix 1 and Fix 2 attempts have been
pre-committed, measured, and reverted under their pre-committed
criteria. No tolerance relaxation.

Kept from the attempt:
- tests/test_tools.py: MockChunk.id hygiene (mock now matches
  the real Chunk API, which removes a latent bug for future test
  authors)
- tests/test_tools.py: TestSearchToolSpecSnapshot freezes the
  SearchTool LLM-facing contract (name, description, parameters).
  A future refactor that silently exposes internal state to the
  LLM would break iteration-budget invariants; the snapshot test
  catches this at test time.
- results/k8s_postedit_fix2.json (v1, P@5 failure)
- results/k8s_postedit_fix2_merge_v2.json (v2, clean metric pass
  with strict-flip failure)

Reverted:
- agent_bench/tools/search.py back to HEAD (no Fix 2 code
  remains)

Full test suite passes (441/441 — 438 pre-session + 3 new
snapshot tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DECISIONS.md CHANGED
@@ -878,3 +878,282 @@ in scope. The narrowed serving-migration deferral entry (tied to
  any external reference to the counterfactual-query fix) also stays
  deferred until Fix 2 lands, since the production/eval-harness
  prompt divergence is unchanged by this revert.
+
+ ## Fix 2 pre-committed regression gate: SearchTool deterministic query expansion
+
+ **Pre-committed BEFORE post-edit runs** (same discipline pattern
+ that caught Fix 1's iteration inflation cleanly).
+
+ **Mechanism under test.** In `agent_bench/tools/search.py`,
+ `SearchTool.execute` gains a deterministic two-query retrieval
+ path. When the primary retrieval passes the refusal gate, a
+ secondary retrieval is issued against an expanded query
+ (`original_query + " not supported limitations cannot"`), and the
+ final context returned to the LLM is
+ `primary_top_3 + secondary_top_5` deduplicated by `chunk.id`. Both
+ retrievals run inside a single `SearchTool.execute` call: from the
+ LLM's perspective, the tool schema, name, parameters, and return
+ shape are unchanged, and the iteration budget is untouched.
+
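The mechanism described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reverted `search.py` code: `Chunk`, `search_with_expansion`, the `retrieve` and `passes_gate` callables, and the toy index are all hypothetical, and returning the primary result untouched when the gate fails or the flag is off is an assumption about the existing behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List

# Fixed suffix quoted in the entry above; ungrammatical on purpose.
EXPANSION_SUFFIX = " not supported limitations cannot"


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the real Chunk class (only the fields used here)."""
    id: str
    source: str


def search_with_expansion(
    query: str,
    retrieve: Callable[[str], List[Chunk]],
    passes_gate: Callable[[List[Chunk]], bool],
    negative_framing_expansion: bool = True,
) -> List[Chunk]:
    """One tool call, up to two retrievals, merged and deduplicated by chunk.id."""
    primary = retrieve(query)
    # Assumption: when the gate fails or the flag is off, the primary result
    # is returned untouched and whatever the existing code does next applies.
    if not negative_framing_expansion or not passes_gate(primary):
        return primary
    secondary = retrieve(query + EXPANSION_SUFFIX)
    merged, seen = [], set()
    for chunk in primary[:3] + secondary[:5]:
        if chunk.id not in seen:  # first occurrence wins, order preserved
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


# Toy in-memory "index" keyed by query string (illustrative data only).
policy = "k8s_network_policies.md"
index = {
    "mtls networkpolicy": [Chunk(f"p{i}", policy) for i in range(5)],
    "mtls networkpolicy" + EXPANSION_SUFFIX: [Chunk("p0", policy)]
    + [Chunk(f"s{i}", policy) for i in range(4)],
}
result = search_with_expansion("mtls networkpolicy", index.__getitem__, bool)
print([c.id for c in result])
```

In this toy run the secondary ranking repeats `p0`, so the merged list holds seven unique chunks, inside the 5 to 8 range the entry expects for a top-3 plus top-5 merge.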
+ **Why this is architecturally different from Fix 1.** Fix 1 placed
+ a behavioral clause in the system prompt that told the agent to
+ issue follow-up searches itself. The trigger was an LLM judgment
+ ("did the first search return content addressing the specific
+ capability?") and the follow-up was a separate tool call, so it
+ counted against `max_iterations`. Over-firing on compound questions
+ inflated iteration counts and pushed q024/q025 to the cap. Fix 2
+ replaces this with a deterministic trigger (primary passes gate),
+ a fixed expansion suffix, and a merge that happens entirely inside
+ one tool call. No LLM judgment; no iteration change;
+ corpus-agnostic.
+
+ **Suffix choice.** `" not supported limitations cannot"`.
+ Keyword-dense and ungrammatical on purpose: the suffix exists to
+ shift BM25 and embedding mass toward "what you cannot do" /
+ "limitations" sections, not to read well. The ungrammatical form is
+ also a self-documenting signal in retrieval logs: anyone reading a
+ query trace sees the suffix and immediately knows it is a synthetic
+ expansion, not user input. A one-line comment in `search.py`
+ preserves the rationale for future readers.
+
+ **Merge choice.** `primary_top_3 + secondary_top_5` deduped by
+ `chunk.id`, producing 5–8 unique chunks per call. Rationale: top-5
+ primary would make the expansion redundant on high-overlap queries
+ (defeating the mechanism), while primary-top-3 guarantees the
+ expansion always contributes to the final context window. Probe
+ data (`/tmp/probe_fix2_v2.py`, throwaway) confirms this merge
+ strategy surfaces pilot_005's target chunk
+ (`d0806d5da91d6026`, chunk_index 63, "Anything TLS related ... use
+ a service mesh or ingress controller for this") at position 6–8 in
+ the merged list.
+
+ **Opt-in flag, defaulting ON.** `SearchTool` accepts
+ `negative_framing_expansion: bool = True`. Default is the shipping
+ configuration because the regression gate must measure the shipping
+ behavior, not the no-op path. A `False` default would mean the gate
+ validates an unused parameter, and a subsequent commit flipping the
+ default would have no regression evidence. Kill switch is preserved
+ via explicit `False` at construction if a future regression
+ requires an A/B comparison.
+
+ **Baseline reuse.** The Fix 1 session's pre-edit JSONs
+ (`results/fastapi_preedit.json`, `results/k8s_preedit_pinned.json`,
+ both committed at `bd2b913`) were measured under the
+ currently-committed state of the repo: pinned
+ `gpt-4o-mini-2024-07-18`, K8s threshold 0.015, FastAPI threshold
+ 0.02, HEAD `prompts.py` with no clause, HEAD `search.py` with no
+ expansion. The working tree verification confirms this state is
+ unchanged. These JSONs are therefore reused as the Fix 2 pre-edit
+ baseline and do not need to be re-measured. Only post-edit runs are
+ required for the Fix 2 regression (~$0.02 saved).
+
+ **Pre-committed tolerances.**
+
+ | Metric | Pass criterion |
+ |---|---|
+ | P@5 | post-edit ≥ pre-edit − 0.02 |
+ | R@5 | post-edit ≥ pre-edit − 0.02 |
+ | Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) |
+ | Mean `tool_calls_made` | post-edit ≤ pre-edit + **0.05** (design-correctness gate, see note) |
+ | Individual cap-hit | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
+
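As a rough sketch, the tolerance table above can be read as a single check that returns the list of criteria that fired. The function name, the metric dictionary keys, and the use of per-question `tool_calls_made` as a proxy for iterations in the cap-hit rule are assumptions made for illustration; the aggregate numbers are the K8s pre-edit, v1, and v2 values reported in this entry.

```python
MAX_ITERATIONS = 3  # cap named in the tolerance table


def fired_criteria(pre, post, pre_calls=(), post_calls=()):
    """Return the names of the pre-committed criteria that the post-edit run violates."""
    fired = []
    if post["p_at_5"] < pre["p_at_5"] - 0.02:
        fired.append("P@5")
    if post["r_at_5"] < pre["r_at_5"] - 0.02:
        fired.append("R@5")
    if post["citation_accuracy"] < pre["citation_accuracy"]:  # hard gate, no slack
        fired.append("citation_accuracy")
    if post["mean_tool_calls"] > pre["mean_tool_calls"] + 0.05:
        fired.append("mean_tool_calls")
    # Cap-hit rule: a question below the cap pre-edit must not reach it post-edit.
    if any(before < MAX_ITERATIONS <= after for before, after in zip(pre_calls, post_calls)):
        fired.append("cap_hit")
    return fired


k8s_pre = {"p_at_5": 0.800, "r_at_5": 1.000, "citation_accuracy": 1.000, "mean_tool_calls": 1.167}
k8s_v1 = {"p_at_5": 0.767, "r_at_5": 1.000, "citation_accuracy": 1.000, "mean_tool_calls": 1.167}
k8s_v2 = dict(k8s_pre)  # v2 reports every aggregate metric unchanged

print(fired_criteria(k8s_pre, k8s_v1))
print(fired_criteria(k8s_pre, k8s_v2))
```

Note that this covers only the four aggregate metrics and the cap-hit rule; the pilot_005 strict flip criterion is a separate, per-answer check.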
+ **Note on the tool_calls gate.** ≤ +0.05 is a *design-correctness*
+ gate, not a *performance* gate. Fix 2's invariant is that both
+ retrievals happen inside one `SearchTool.execute` call, so the
+ LLM's iteration count is unchanged by construction. Any non-trivial
+ movement in `mean tool_calls_made` indicates the design invariant
+ is broken, e.g. expansion accidentally exposed as a separate
+ tool, or the LLM observing two-call behavior and adapting its
+ strategy. The gate fires on design violation, not on performance
+ regression. The 0.05 absolute threshold absorbs legitimate
+ run-to-run variance from non-determinism in the LLM even at
+ temperature 0, without absorbing real iteration-count movement.
+
+ **pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1
+ gate):**
+ - `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
+ - Answer cites `k8s_network_policies.md`
+ - Answer contains "service mesh" OR "ingress controller"
+ - Answer does NOT begin with refusal phrasing
+
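The four strict-flip conditions above could be checked mechanically along the following lines. This is a sketch under assumptions: the keyword hit rate is taken to be the case-insensitive fraction of golden keywords present in the answer, and `REFUSAL_OPENERS` is a hypothetical list, since the entry does not enumerate the refusal phrasings it rejects. The two sample answers are the Fix 1 and Fix 2 v2 answers quoted in the outcome section of this entry.

```python
GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]

# Hypothetical opener list (lowercase); the real criterion may be judged differently.
REFUSAL_OPENERS = (
    "the kubernetes documentation does not provide",
    "the documentation does not",
    "i could not find",
)


def keyword_hit_rate(answer, keywords):
    """Assumed definition: share of keywords found in the answer, ignoring case."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def strict_flip(answer):
    """Evaluate each strict-flip condition and return a name -> bool mapping."""
    text = answer.lower().lstrip()
    return {
        "keyword_hit_rate": keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60,
        "cites_source": "k8s_network_policies.md" in answer,
        "names_alternative": "service mesh" in text or "ingress controller" in text,
        "no_refusal_opener": not text.startswith(REFUSAL_OPENERS),
    }


FIX1_ANSWER = (
    "Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) directly. "
    "The documentation states that anything TLS related should be handled using a "
    "service mesh or ingress controller, rather than through NetworkPolicy "
    "[source: k8s_network_policies.md]."
)
FIX2_V2_ANSWER = (
    "The Kubernetes documentation does not provide specific instructions on configuring "
    "a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. "
    "For mTLS, it is generally recommended to use a service mesh or other proxy "
    "solutions, as NetworkPolicy alone does not handle TLS configurations directly "
    "[source: k8s_network_policies.md]."
)

print(strict_flip(FIX1_ANSWER))
print(strict_flip(FIX2_V2_ANSWER))
```

A prefix match is a blunt instrument for "refusal phrasing"; it is used here only to make the fourth condition concrete.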
+ **Baseline reference for the gate.**
+
+ | Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls |
+ |---|---|---|---|---|---|
+ | FastAPI (27) | `results/fastapi_preedit.json` @ `bd2b913` | 0.585 | 0.679 | 1.000 | 1.111 |
+ | K8s (6 pilots) | `results/k8s_preedit_pinned.json` @ `bd2b913` | 0.800 | 1.000 | 1.000 | 1.167 |
+
+ **Post-edit filenames (to be produced).**
+ - `results/fastapi_postedit_fix2.json`
+ - `results/k8s_postedit_fix2.json`
+
+ **If the gate passes:** commit Fix 2 with `search.py` change, unit
+ tests (including the tool-spec snapshot test), the two post-edit
+ result JSONs, and this DECISIONS.md entry extended with the
+ regression outcome.
+
+ **If the gate fires:** revert, document the failure mode, surface
+ the specific criterion that fired. No tolerance relaxation: the
+ same discipline pattern as the Fix 1 revert.
+
+ ## Fix 2 outcome: mechanism works, response-style criterion fired, reverted
+
+ **Regression runs produced.** Two post-edit runs on K8s (FastAPI
+ was not run: the K8s findings gated the decision before any API
+ spend on the broader set):
+
+ | Run | Merge rule | File | Purpose |
+ |---|---|---|---|
+ | Fix 2 v1 | `primary[:3] + secondary[:5]` | `results/k8s_postedit_fix2.json` | Initial implementation |
+ | Fix 2 v2 | `primary[:5] + secondary[:5]` | `results/k8s_postedit_fix2_merge_v2.json` | Path A refinement after v1 failed P@5 on a metric-definition mismatch |
+
+ **v1 findings.** Aggregate: P@5 0.800 → 0.767 (Δ −0.033, **FAILED**
+ the pre-committed tolerance of Δ ≥ −0.02). The failure traced to a
+ merge-rule / metric-semantics interaction:
+ `retrieval_precision_at_k` computes precision on
+ `retrieved_sources[:5]`, and with `primary[:3] + secondary[:5]` the
+ first 5 entries were `primary_top_3 + secondary_top_2`. For
+ pilot_005, `secondary[1]` was `k8s_pods.md` (chunk_index 40,
+ surfaced because the reranker matched its "localhost communication"
+ content against the expanded query). That single off-source chunk
+ in position 5 dropped P@5 from 1.00 to 0.80 for pilot_005, with a
+ similar drop for pilot_006.
+ Iteration invariant held (tool_calls 1.167 → 1.167). Citation
+ accuracy held (1.000 → 1.000). Target chunk
+ (`d0806d5da91d6026`, "Anything TLS related") reached the LLM
+ context for pilot_005 at merged position 7.
+
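The v1 arithmetic can be reproduced from the committed result file. The sketch below is a stand-in for `retrieval_precision_at_k`, not its actual implementation: dividing by `k` and treating `k8s_network_policies.md` as the only relevant source for pilot_005 are assumptions (the latter inferred from the strict-flip criterion). The source list and the per-question precision values are copied from `results/k8s_postedit_fix2.json`.

```python
def precision_at_k(retrieved_sources, relevant_sources, k=5):
    """Fraction of the first k retrieved sources that belong to the relevant set."""
    window = retrieved_sources[:k]
    return sum(src in relevant_sources for src in window) / k


# pilot_005 retrieved_sources in the v1 run (primary[:3] + secondary[:5], deduplicated).
pilot_005_v1 = [
    "k8s_network_policies.md",
    "k8s_network_policies.md",
    "k8s_network_policies.md",
    "k8s_network_policies.md",
    "k8s_pods.md",  # secondary[1], the off-source chunk in position 5
    "k8s_network_policies.md",
    "k8s_network_policies.md",
]
p5 = precision_at_k(pilot_005_v1, {"k8s_network_policies.md"})

# Per-question retrieval_precision values recorded for pilots 001..006 in the v1 file.
v1_precisions = [0.6, 1.0, 1.0, 1.0, 0.8, 0.2]
aggregate = sum(v1_precisions) / len(v1_precisions)
delta = aggregate - 0.800  # 0.800 is the pre-edit K8s baseline from the gate table

print(p5, round(aggregate, 3), round(delta, 3))
```

The single `k8s_pods.md` entry inside the five-element window is enough to produce the 0.80 figure, and the six recorded values average to the 0.767 aggregate reported above.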
+ **Path A refinement (merge v2).** Change
+ `primary[:3] + secondary[:5]` → `primary[:5] + secondary[:5]`.
+ Rationale: primary_top_5 is preserved in positions 1–5 by
+ construction, so P@5 computed on `retrieved_sources[:5]` is
+ unchanged from the no-expansion baseline. Expansion chunks land in
+ positions 6–10. Target chunk still reaches LLM context (position 9
+ for pilot_005). This is an **implementation refinement, not a
+ tolerance relaxation**: the pre-committed gate thresholds stand;
+ only the merge rule was adjusted to respect the metric's window
+ semantics.
+
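A small demonstration of why the v2 merge leaves the P@5 window untouched, using plain chunk ids. `merge_ids` and the toy rankings are hypothetical; the property shown holds whenever the primary ranking contains at least five distinct ids.

```python
def merge_ids(primary, secondary, primary_k, secondary_k=5):
    """Concatenate the two prefixes and drop repeated ids, keeping first occurrences."""
    merged = []
    for chunk_id in primary[:primary_k] + secondary[:secondary_k]:
        if chunk_id not in merged:
            merged.append(chunk_id)
    return merged


primary = ["a", "b", "c", "d", "e", "f"]   # toy primary ranking
secondary = ["x", "y", "a", "z", "w"]      # toy expanded-query ranking

v1 = merge_ids(primary, secondary, primary_k=3)
v2 = merge_ids(primary, secondary, primary_k=5)

# v1 lets two expansion chunks into the metric window; v2 does not.
print(v1[:5], v1[:5] == primary[:5])
print(v2[:5], v2[:5] == primary[:5])
```

With `primary_k=5`, deduplication can only remove entries that come after the fifth position, so the first five merged entries are exactly the no-expansion top 5.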
+ **v2 findings: perfect metric preservation, but strict-flip fails on response style.**
+
+ Aggregate:
+
+ | Metric | Pre-edit | Fix 2 v2 | Delta |
+ |---|---|---|---|
+ | P@5 | 0.800 | 0.800 | **0.000** |
+ | R@5 | 1.000 | 1.000 | 0.000 |
+ | KHR | 0.806 | 0.806 | 0.000 |
+ | Citation accuracy | 1.000 | 1.000 | 0.000 |
+ | Mean `tool_calls_made` | 1.167 | 1.167 | **0.000** |
+
+ Every aggregate metric **literally unchanged**. Per-question
+ deltas: zero on every metric, every question. The design
+ invariant (iteration budget unchanged, tool schema unchanged,
+ refusal gate behavior unchanged) holds perfectly.
+
+ **But pilot_005 strict flip fails on the refusal-phrasing criterion.**
+ Post-edit answer:
+
+ > *"The Kubernetes documentation does not provide specific
+ > instructions on configuring a NetworkPolicy to enforce mutual TLS
+ > (mTLS) between Pods in the same namespace. For mTLS, it is
+ > generally recommended to use a service mesh or other proxy
+ > solutions, as NetworkPolicy alone does not handle TLS
+ > configurations directly [source: k8s_network_policies.md]."*
+
+ The answer substantively contains the documented negative with
+ citation. But it opens with *"The Kubernetes documentation does
+ not provide specific instructions..."*, the exact
+ refusal-phrasing opener the strict-flip criterion was pre-committed
+ to reject. The criterion exists because the project's credibility
+ rests on honest evaluation: an answer that opens by apologizing
+ that the documentation "does not provide specific instructions"
+ reads, to a technical reviewer, like the system failed to find the
+ answer and is papering over the gap, even though the facts and
+ citation are present. The criterion fired as designed.
+
+ **Compare to Fix 1 post-edit answer (from `bd2b913` evidence):**
+
+ > *"Kubernetes NetworkPolicy does not support enforcing mutual TLS
+ > (mTLS) directly. The documentation states that anything TLS
+ > related should be handled using a service mesh or ingress
+ > controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."*
+
+ Fix 1's answer asserts a fact about **NetworkPolicy** ("does not
+ support"); Fix 2's answer asserts a fact about **the documentation**
+ ("does not provide instructions"). The first forecloses the
+ capability; the second leaves open whether the capability exists
+ somewhere the system didn't see. That distinction is load-bearing
+ for any grounded-refusal narrative, and it separates a system that
+ handles documented negatives crisply from one that hedges around
+ them.
+
+ **Diagnosis.** Fix 2's mechanism successfully gets the target chunk
+ into the LLM's context window, so the retrieval side of the problem
+ is solved. What Fix 2 **cannot provide** is explicit guidance on
+ how to phrase the documented negative once the chunk is present.
+ Fix 1's prompt clause was doing that guidance work; removing the
+ clause and relying on the LLM's unaided response style produces a
+ hedging answer because the LLM, seeing both NetworkPolicy-spec
+ content and a TLS limitation bullet, defaults to contextual
+ hedging rather than crisp assertion.
+
+ **Fix 2 is therefore not an alternative to Fix 1's prompt clause;
+ it is a prerequisite.** Fix 2 guarantees the chunk reaches
+ context; a future "Fix 2 + targeted prompt clause" stack could
+ resolve both the retrieval gap and the response-style gap without
+ Fix 1's over-firing problem, because the clause would no longer
+ need to direct the agent to do a follow-up search (Fix 2 handled
+ that). The over-firing on compound questions that broke Fix 1 was
+ caused by the agent deciding to do extra search iterations under
+ LLM judgment; if the expansion already happened deterministically
+ inside the first tool call, the clause has less work to do and
+ may not trigger the second-LLM-call pattern at all. **Speculative
+ and not for this session.** Future work item.
+
+ **Gate verdict: failed on pilot_005 strict flip criterion.**
+ Reverting, same Fix-1 pattern.
+
+ **What this commit contains.**
+ - `agent_bench/tools/search.py` **reverted** to HEAD (no Fix 2
+   code changes)
+ - `tests/test_tools.py` retains the `MockChunk.id` hygiene fix
+   (the real `Chunk` class has `id`; mock should match the real API
+   for future test authors)
+ - `tests/test_tools.py` adds `TestSearchToolSpecSnapshot`: a
+   general-purpose guard that freezes `SearchTool`'s LLM-facing
+   contract (name, description, parameters). The lesson from Fix 2
+   is that any future refactor exposing internal SearchTool state
+   to the LLM would break iteration-budget invariants. The
+   snapshot test catches that at test time, independent of whether
+   Fix 2 lands.
+ - Two regression evidence JSONs: `results/k8s_postedit_fix2.json`
+   (v1, the P@5 failure) and `results/k8s_postedit_fix2_merge_v2.json`
+   (v2, the strict-flip failure). Retained as the measurement
+   trail behind the revert decision.
+ - This DECISIONS.md entry (pre-committed gate + outcome + revert
+   narrative).
+
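For readers unfamiliar with spec snapshot tests, the idea behind `TestSearchToolSpecSnapshot` can be sketched as below. Everything except the test class name is hypothetical: the stand-in `SearchTool`, its attribute values, and the individual test methods are illustrative and are not copied from `tests/test_tools.py`.

```python
class SearchTool:
    """Hypothetical stand-in; the real class lives in agent_bench/tools/search.py."""

    name = "search_documents"
    description = "Search the documentation corpus and return relevant chunks."
    parameters = {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    }

    def __init__(self, negative_framing_expansion: bool = True):
        # Internal switch: it must never leak into the LLM-facing spec.
        self.negative_framing_expansion = negative_framing_expansion


# Frozen copy of the LLM-facing contract. Changing the tool spec means
# changing this snapshot deliberately, in the same commit.
SNAPSHOT = {
    "name": "search_documents",
    "description": "Search the documentation corpus and return relevant chunks.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


class TestSearchToolSpecSnapshot:
    def test_name_is_frozen(self):
        assert SearchTool().name == SNAPSHOT["name"]

    def test_description_is_frozen(self):
        assert SearchTool().description == SNAPSHOT["description"]

    def test_parameters_are_frozen_for_any_flag_value(self):
        for flag in (True, False):
            tool = SearchTool(negative_framing_expansion=flag)
            assert tool.parameters == SNAPSHOT["parameters"]
```

The snapshot is written out as a separate literal on purpose: if a refactor adds a parameter to the tool spec, the comparison fails until someone updates the snapshot by hand.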
+ **What this commit does NOT contain.** No changes to
+ `agent_bench/tools/search.py`, `agent_bench/core/prompts.py`, or
+ `configs/default.yaml`. Both Fix 1 (prompt clause) and Fix 2
+ (SearchTool expansion) have been attempted and reverted this
+ session. Three commits of progress nonetheless: `b97f00f`
+ (threshold calibration, empirical), `77017db` (prep bundle: model
+ pin + fastapi wire + Fix 1 pre-committed tolerances), `bd2b913`
+ (Fix 1 revert narrative). The threshold calibration and model pin
+ are real, shipped, measurement-grounded infrastructure changes.
+ The two fix attempts are documented learning that shapes the
+ future direction.
+
+ **Narrative summary.** Session hypothesis: pilot_005 is a
+ counterfactual-query-expansion problem. Session evidence: the
+ hypothesis is correct on retrieval. The target chunk is reachable
+ via negative-framing queries and Fix 2 surfaces it deterministically
+ with zero iteration-budget impact. Session evidence also shows the
+ hypothesis is **incomplete**: retrieval-only fixes cannot close
+ the response-style gap, because the LLM under unaided prompting
+ hedges when a documented negative is surrounded by unrelated
+ topical content. A future session exploring **Fix 2 + targeted
+ prompt guidance stacked** is the natural next experiment; this
+ session's pilot-first discipline has been preserved against two
+ distinct pre-committed gates, both firing for the reasons they
+ were designed to catch.
results/k8s_postedit_fix2.json ADDED
@@ -0,0 +1,194 @@
+ [
+   {
+     "question_id": "k8s_pilot_001",
+     "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+     "category": "retrieval",
+     "difficulty": "easy",
+     "retrieval_precision": 0.6,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 2,
+     "latency_ms": 21174.48547598906,
+     "tokens_used": {
+       "input_tokens": 1872,
+       "output_tokens": 150,
+       "estimated_cost_usd": 0.00037079999999999996
+     },
+     "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. Every container within a Pod shares the same network namespace, which includes the IP address and network ports. This allows containers in the same Pod to communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+     "retrieved_sources": [
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_secret.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_002",
+     "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+     "category": "retrieval",
+     "difficulty": "hard",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.8333333333333334,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 12374.25357697066,
+     "tokens_used": {
+       "input_tokens": 1122,
+       "output_tokens": 140,
+       "estimated_cost_usd": 0.0002523
+     },
+     "answer": "When you update a Deployment's pod template, Kubernetes creates a new ReplicaSet to manage the new version of the Pods. The existing ReplicaSet, which controls the old version of the Pods, is scaled down while the new ReplicaSet is scaled up. This process ensures that Pods are transitioned from the old version to the new one in a controlled manner. Specifically, the Deployment controller gradually scales the new ReplicaSet to the desired number of replicas while simultaneously scaling down the old ReplicaSet, ensuring that Pods are replaced at a controlled rate [source: k8s_deployment.md].",
+     "retrieved_sources": [
+       "k8s_deployment.md",
+       "k8s_replicaset.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_003",
+     "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+     "category": "retrieval",
+     "difficulty": "medium",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.6666666666666666,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 9694.163175998256,
+     "tokens_used": {
+       "input_tokens": 1215,
+       "output_tokens": 102,
+       "estimated_cost_usd": 0.00024344999999999998
+     },
+     "answer": "The key difference between a ConfigMap and a Secret when deciding where to store sensitive application data, such as database passwords, is that a ConfigMap does not provide secrecy or encryption, making it unsuitable for confidential data. In contrast, a Secret is specifically intended to hold confidential data and includes additional protections to ensure the security of that information [source: k8s_configmap.md][source: k8s_secret.md].",
+     "retrieved_sources": [
+       "k8s_configmap.md",
+       "k8s_secret.md",
+       "k8s_secret.md",
+       "k8s_secret.md",
+       "k8s_configmap.md",
+       "k8s_secret.md",
+       "k8s_configmap.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_004",
+     "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+     "category": "retrieval",
+     "difficulty": "hard",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.5,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 10764.822138939053,
+     "tokens_used": {
+       "input_tokens": 1041,
+       "output_tokens": 135,
+       "estimated_cost_usd": 0.00023714999999999994
+     },
+     "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`), the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you must specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the thresholds you did not change [source: k8s_node_pressure_eviction.md].",
+     "retrieved_sources": [
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_005",
+     "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+     "category": "retrieval",
+     "difficulty": "medium",
+     "retrieval_precision": 0.8,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 10257.949891034514,
+     "tokens_used": {
+       "input_tokens": 1225,
+       "output_tokens": 102,
+       "estimated_cost_usd": 0.00024495
+     },
+     "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. It mentions that NetworkPolicy is primarily used for controlling traffic between Pods but does not cover mTLS configurations directly. For mTLS, it is suggested to use a service mesh or ingress controller instead [source: k8s_network_policies.md].",
+     "retrieved_sources": [
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_006",
+     "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+     "category": "retrieval",
+     "difficulty": "easy",
+     "retrieval_precision": 0.2,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 8818.4428520035,
+     "tokens_used": {
+       "input_tokens": 1249,
+       "output_tokens": 62,
+       "estimated_cost_usd": 0.00022455
+     },
+     "answer": "The built-in Pod Security admission controller is in the **stable** feature state as of **Kubernetes v1.25** [source: k8s_pod_security_admission.md].",
+     "retrieved_sources": [
+       "k8s_pod_security_admission.md",
+       "k8s_replicaset.md",
+       "k8s_deployment.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_pods.md",
+       "k8s_pod_security_admission.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   }
+ ]
results/k8s_postedit_fix2_merge_v2.json ADDED
@@ -0,0 +1,203 @@
+ [
+   {
+     "question_id": "k8s_pilot_001",
+     "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+     "category": "retrieval",
+     "difficulty": "easy",
+     "retrieval_precision": 0.4,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 2,
+     "latency_ms": 17866.33544403594,
+     "tokens_used": {
+       "input_tokens": 2038,
+       "output_tokens": 153,
+       "estimated_cost_usd": 0.00039749999999999996
+     },
+     "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. However, every container within a Pod shares the same network namespace, which includes the IP address and network ports. This means that containers in the same Pod can communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+     "retrieved_sources": [
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_secret.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_002",
+     "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+     "category": "retrieval",
+     "difficulty": "hard",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.6666666666666666,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 12697.782160015777,
+     "tokens_used": {
+       "input_tokens": 1213,
+       "output_tokens": 198,
+       "estimated_cost_usd": 0.00030074999999999996
+     },
+     "answer": "When you update a Deployment's pod template, Kubernetes uses a rolling update mechanism to transition Pods from the old version to the new one. This process involves creating a new ReplicaSet for the updated Deployment. The Deployment controller gradually scales up the new ReplicaSet while simultaneously scaling down the old ReplicaSet, ensuring that Pods are replaced in a controlled manner. This approach allows for a smooth transition without downtime, as it manages the number of Pods running at any time during the update process [source: k8s_deployment.md]. \n\nThe ReplicaSet plays a crucial role in this mechanism by managing the Pods that match the Deployment's specifications. When a Deployment is updated, the existing ReplicaSet that controls the old Pods is scaled down, and the new ReplicaSet is scaled up to the desired number of replicas specified in the Deployment [source: k8s_deployment.md].",
+     "retrieved_sources": [
+       "k8s_deployment.md",
+       "k8s_replicaset.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+  {
+    "question_id": "k8s_pilot_003",
+    "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10444.362040958367,
+    "tokens_used": {
+      "input_tokens": 1298,
+      "output_tokens": 123,
+      "estimated_cost_usd": 0.00026849999999999997
+    },
+    "answer": "The key difference between a ConfigMap and a Secret in Kubernetes is that a ConfigMap is used to store non-confidential data, while a Secret is specifically intended to hold confidential data, such as passwords, OAuth tokens, and SSH keys. ConfigMaps do not provide secrecy or encryption, making them unsuitable for sensitive information. Therefore, for storing sensitive application data like database passwords, it is recommended to use a Secret instead of a ConfigMap [source: k8s_configmap.md][source: k8s_secret.md].",
+    "retrieved_sources": [
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_configmap.md",
+      "k8s_configmap.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_004",
+    "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.5,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 9120.96216098871,
+    "tokens_used": {
+      "input_tokens": 1249,
+      "output_tokens": 135,
+      "estimated_cost_usd": 0.00026835
+    },
+    "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`), the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you should specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the thresholds you did not change [source: k8s_node_pressure_eviction.md].",
+    "retrieved_sources": [
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_005",
+    "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 8491.116772056557,
+    "tokens_used": {
+      "input_tokens": 1509,
+      "output_tokens": 92,
+      "estimated_cost_usd": 0.00028155
+    },
+    "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. For mTLS, it is generally recommended to use a service mesh or other proxy solutions, as NetworkPolicy alone does not handle TLS configurations directly [source: k8s_network_policies.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_006",
+    "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.4,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10854.806952062063,
+    "tokens_used": {
+      "input_tokens": 1320,
+      "output_tokens": 54,
+      "estimated_cost_usd": 0.00023040000000000002
+    },
+    "answer": "The built-in Pod Security admission controller reached a stable state in Kubernetes v1.25 [source: k8s_pod_security_admission.md].",
+    "retrieved_sources": [
+      "k8s_pod_security_admission.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_pod_security_admission.md",
+      "k8s_pods.md",
+      "k8s_pod_security_admission.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  }
+]
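The aggregate figures quoted in the commit message (P@5 0.800, R@5 1.000, KHR 0.806, mean tool_calls 1.167) can be recomputed from the six per-question records above. The following is a minimal sketch, not the project's evaluation code: `records` copies only four metric fields from each record, and `aggregate` and `precision_at_k` are hypothetical helpers. It assumes each aggregate is an unweighted mean, that P@5 is the fraction of the first five `retrieved_sources` that match an expected source, and that the expected source for `k8s_pilot_006` is `k8s_pod_security_admission.md`.

```python
from statistics import mean

# (retrieval_precision, retrieval_recall, keyword_hit_rate, tool_calls_made)
# copied from the six per-question records, in file order.
records = [
    (0.4, 1.0, 1.0, 2),
    (1.0, 1.0, 0.6666666666666666, 1),
    (1.0, 1.0, 0.8333333333333334, 1),
    (1.0, 1.0, 0.5, 1),
    (1.0, 1.0, 0.8333333333333334, 1),
    (0.4, 1.0, 1.0, 1),
]


def aggregate(rows):
    """Return the unweighted mean of each metric column (assumed aggregation)."""
    precision, recall, khr, calls = zip(*rows)
    return {
        "retrieval_precision": mean(precision),
        "retrieval_recall": mean(recall),
        "keyword_hit_rate": mean(khr),
        "tool_calls_made": mean(calls),
    }


def precision_at_k(retrieved, expected, k=5):
    """Hypothetical helper: share of the first k retrieved sources found in `expected`."""
    window = retrieved[:k]
    if not window:
        return 0.0
    return sum(source in expected for source in window) / len(window)


# retrieved_sources of k8s_pilot_006, copied from the record above.
pilot_006_sources = [
    "k8s_pod_security_admission.md",
    "k8s_replicaset.md",
    "k8s_deployment.md",
    "k8s_node_pressure_eviction.md",
    "k8s_pod_security_admission.md",
    "k8s_pods.md",
    "k8s_pod_security_admission.md",
]

summary = aggregate(records)
print({name: round(value, 3) for name, value in summary.items()})
print(precision_at_k(pilot_006_sources, {"k8s_pod_security_admission.md"}))
```

Under these assumptions the means round to the figures in the commit message, and the top-5 window for `k8s_pilot_006` contains two matching chunks out of five, which agrees with its recorded `retrieval_precision` of 0.4.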
tests/test_tools.py CHANGED
@@ -2,7 +2,8 @@
 
 from __future__ import annotations
 
-from dataclasses import dataclass
+import uuid
+from dataclasses import dataclass, field
 
 import pytest
 
@@ -17,6 +18,7 @@ from agent_bench.tools.search import SearchTool
 class MockChunk:
     content: str
     source: str
+    id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
 
 
 @dataclass
@@ -221,6 +223,36 @@ class TestSearchTool:
         assert "query" in defn.parameters["required"]
 
 
+class TestSearchToolSpecSnapshot:
+    """Frozen snapshot of SearchTool's LLM-facing contract.
+
+    Any silent change to the tool name, description, or parameter schema
+    that reaches the LLM invalidates invariants that callers rely on —
+    for example, an attempt to expose internal SearchTool state (such as
+    a Fix-2-style expansion flag) as an LLM-visible parameter would
+    break the "one LLM-facing tool call per execute() invocation"
+    iteration-budget guarantee. These assertions fail loudly if the
+    contract drifts.
+    """
+
+    def test_tool_name(self):
+        assert SearchTool.name == "search_documents"
+
+    def test_tool_description(self):
+        assert SearchTool.description == (
+            "Search the technical documentation corpus for relevant passages. "
+            "Returns the most relevant document chunks with source attribution."
+        )
+
+    def test_tool_parameters_schema(self):
+        params = SearchTool.parameters
+        assert params["type"] == "object"
+        assert set(params["properties"].keys()) == {"query", "top_k"}
+        assert params["properties"]["query"]["type"] == "string"
+        assert params["properties"]["top_k"]["type"] == "integer"
+        assert params["required"] == ["query"]
+
+
 # --- Refusal gate tests ---
 
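The `TestSearchToolSpecSnapshot` tests pin the LLM-facing contract one field at a time. A minimal, self-contained sketch of the same idea is given here, comparing the whole spec against a single frozen dictionary so that any drift is reported per field. `DemoSearchTool`, `LeakyDemoTool`, `FROZEN_SPEC`, and `spec_drift` are hypothetical names for illustration and are not part of `agent_bench`; the name, description, and parameter types mirror the values asserted in the diff, and everything else is assumed.

```python
DESCRIPTION = (
    "Search the technical documentation corpus for relevant passages. "
    "Returns the most relevant document chunks with source attribution."
)


class DemoSearchTool:
    """Hypothetical stand-in for a tool class that exposes an LLM-facing spec."""

    name = "search_documents"
    description = DESCRIPTION
    parameters = {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer"},
        },
        "required": ["query"],
    }


# Frozen copy of the contract, written out literally so that it cannot
# change when the tool class changes.
FROZEN_SPEC = {
    "name": "search_documents",
    "description": DESCRIPTION,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def spec_drift(tool, frozen):
    """Return {field: (frozen_value, current_value)} for every field that differs."""
    current = {field: getattr(tool, field) for field in frozen}
    return {
        field: (frozen[field], current[field])
        for field in frozen
        if frozen[field] != current[field]
    }


class LeakyDemoTool(DemoSearchTool):
    """Hypothetical variant that leaks an internal expansion flag to the LLM."""

    parameters = {
        **DemoSearchTool.parameters,
        "properties": {
            **DemoSearchTool.parameters["properties"],
            "expand": {"type": "boolean"},
        },
    }


def test_spec_matches_snapshot():
    assert spec_drift(DemoSearchTool, FROZEN_SPEC) == {}


def test_leaked_parameter_is_detected():
    assert set(spec_drift(LeakyDemoTool, FROZEN_SPEC)) == {"parameters"}
```

Comparing against one literal dictionary trades the per-field failure messages of the original tests for a single drift report; which form is preferable depends on how often the spec is expected to change.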