Nomearod Claude Opus 4.7 (1M context) commited on
Commit
6409a40
Β·
1 Parent(s): 1001208

docs(decisions): promote cold-start falsified-assumption and audit-path incident entries, add three-regimes latency refinement

Browse files

Two entries previously on hf-deploy only:
- "Cold-start gate fired β€” assumption falsified, fix deferred to v1.1"
documents the N=3 cold-start measurement (median 113s), the falsified
corpus-loading assumption, and the re-pre-committed v1.1 contingency
- "Audit-path bug β€” streams masked a request-crashing failure (2026-04-15)"
documents the streaming-endpoint failure-masking finding and the fix

Also folds in a post-hoc refinement (2026-04-22): three latency regimes
observed (cold-start 89-129s, wake-from-idle ~6.7s, steady-state warm ~5s),
tightening the README's "~2 min cold / ~5s warm" phrasing and narrowing
the v1.1 first-request-penalty projection to +6-13s over current behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1) hide show
  1. DECISIONS.md +262 -0
DECISIONS.md CHANGED
@@ -620,6 +620,147 @@ numbers needed to make the decision; the decision is documented here
620
  so future-me remembers the threshold and doesn't optimize prematurely
621
  on a hunch.
622
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
623
  ## False-premise questions come in two flavors
624
 
625
  When authoring golden-dataset questions whose premise is wrong, the
@@ -1854,3 +1995,124 @@ threshold, or (c) mark the file explicitly in a
1854
  option (a) as the follow-up, because it preserves the production
1855
  detector regex without weakening and keeps the test's fidelity to
1856
  the actual attack surface.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
620
  so future-me remembers the threshold and doesn't optimize prematurely
621
  on a hunch.
622
 
623
+ ## Cold-start gate fired β€” assumption falsified, fix deferred to v1.1 at the right cause
624
+
625
+ The preceding "Cold-start contingency" entry pre-committed a lazy-load
626
+ fix (FastAPI eager, K8s lazy on first request) if the measured cold
627
+ start exceeded 60 seconds. Measurement falsified the entry's core
628
+ assumption: **corpus loading is not the dominant cold-start cost**.
629
+ The committed fix addresses ~1 % of the observed overshoot. Executing
630
+ it verbatim would honor the gate's letter but not its intent β€” theater
631
+ dressed as discipline. This entry documents the measurement, the
632
+ falsified assumption, and the new contingency pre-committed at the
633
+ actual cause.
634
+
635
+ **Measurement (N=3, 2026-04-15, HF Spaces target deployment):**
636
+
637
+ | Sample | Cold start | Silent Python init | Visible phase |
638
+ |---|---|---|---|
639
+ | N=1 | 113 s | ~101 s | ~12 s |
640
+ | N=2 | 89 s | ~70 s | ~19 s |
641
+ | N=3 | 129 s | ~115 s | ~14 s |
642
+
643
+ - Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
644
+ - **Gate fire is unambiguous at both tails.** Even the fastest sample
645
+ (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is
646
+ ~115 % over. No boundary ambiguity.
647
+ - **Sample-size justification.** N=3 is acknowledged as a small sample.
648
+ It is adequate here because (a) the gate-fire conclusion is stable
649
+ across both tails, (b) the "silent Python init dominates variance"
650
+ finding is stable across all three samples (silent phase varies
651
+ 70 β†’ 115 s across runs; visible phase varies only 12 β†’ 19 s), and
652
+ (c) the cost of additional samples (manual HF Space restart + ~2 min
653
+ wait + log extraction per sample) exceeds the marginal information
654
+ gain once both tails fire the gate and the variance pattern is stable.
655
+ N=4 would tighten the confidence interval on the median but does not
656
+ change either the gate-fire conclusion or the falsified-assumption
657
+ finding.
658
+ - **Variance source named.** HF Spaces shared-infrastructure CPU / IO
659
+ contention during Python module imports. The silent-init phase
660
+ varies 45 s across samples (70 β†’ 115 s); the visible phase is stable
661
+ (12–19 s). That is the signature of host-level contention on a
662
+ shared physical node, not code-level variability. An
663
+ exclusively-owned container would plausibly show a tighter bound.
664
+ - **Raw log captures** (preserved so this entry can be cross-checked
665
+ against the underlying evidence without re-running the measurement):
666
+ `measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`.
667
+
668
+ **Where the cost lives.** At the median (113 s):
669
+
670
+ - **Silent Python init phase β€” ~90 s (β‰ˆ 80 % of total):** interpreter
671
+ start, module imports (`torch`, `transformers`, `langchain`, `faiss`,
672
+ `fastapi`, `httpx`, the full dependency closure), and initial model
673
+ weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder
674
+ reranker). Not logged β€” no observability inside the import chain.
675
+ - **Visible startup phase β€” ~15 s (β‰ˆ 15 % of total):** injection
676
+ classifier init (~10 s, includes the "classifier skipped" warning),
677
+ FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s,
678
+ +25.8 MB RSS), reranker warmup (~2 s).
679
+
680
+ **The K8s corpus load β€” which the pre-committed fix was designed to
681
+ defer β€” contributes under 1 second of the 113-second median.**
682
+ Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load
683
+ is the same order of magnitude. Corpus loading is simply not where the
684
+ cost lives on this deployment.
685
+
686
+ **Why we are not executing the pre-committed fix.** The preceding
687
+ contingency was written under an empirical assumption about cost
688
+ attribution (corpus loading is the dominant cost). Measurement
689
+ falsified the assumption. Implementing the fix anyway would be a
690
+ mechanical execution of a recipe whose premise has been disproven β€”
691
+ it checks the gate-honoring box while failing to address the cause.
692
+ That is structurally identical to relaxing-by-redefinition ("60 s was
693
+ too tight"), just in the opposite direction: **relaxing by execution**.
694
+ The pre-commitment rule's purpose is to prevent motivated reasoning
695
+ about the gate, not to mandate mechanical compliance with a recipe
696
+ whose empirical foundation has collapsed.
697
+
698
+ The honest action is (1) accept the measurement as the v1 baseline,
699
+ (2) document the falsified assumption explicitly (this entry),
700
+ (3) re-pre-commit a new contingency at the actual dominant cost with
701
+ an explicit trigger condition so the decision is not relitigated at
702
+ review time, and (4) update the user-facing README surface to reflect
703
+ the measured cold-wake number rather than the optimistic pre-deploy
704
+ estimate.
705
+
706
+ **v1.1 contingency β€” pre-committed:**
707
+
708
+ > **If HF Spaces traffic produces more than N cold wakes per day**
709
+ > (N to be determined from observed usage patterns after launch, **not
710
+ > estimated in advance**), defer eager loading of (a) the cross-encoder
711
+ > reranker, (b) the sentence-transformers embedder, and (c) the
712
+ > injection classifier tier to first-relevant-request.
713
+ >
714
+ > **Estimated work:** 4–6 hours (lazy-init wrappers + first-request
715
+ > caching + integration tests for the warm/cold transition).
716
+ >
717
+ > **Expected tradeoff:** cold wake ~113 s β†’ ~50–60 s (approaches the
718
+ > original 60 s target); **first request after any cold wake incurs
719
+ > +8–15 s** additional latency (model weights load synchronously in
720
+ > the request path), after which subsequent warm requests return to
721
+ > normal ~5 s latency.
722
+ >
723
+ > **Trigger is usage-justified, not estimate-justified.** Until real
724
+ > traffic data justifies the work, there is nothing to optimize β€” a
725
+ > recruiter demo that gets one cold wake per day does not pay for
726
+ > 4–6 hours of engineering plus the new first-request-latency failure
727
+ > mode. The trigger threshold N is left unnamed deliberately: naming a
728
+ > number in advance would invite the same falsification pattern this
729
+ > entry is documenting.
730
+
731
+ **Methodology lesson.** When a pre-committed contingency is written
732
+ under an empirical assumption, the contingency only holds if the
733
+ assumption survives measurement. If measurement falsifies the
734
+ assumption, the correct action is to document the falsification,
735
+ accept the observed baseline, and re-pre-commit at the actual cause.
736
+ The wrong action is to execute the original recipe anyway, which
737
+ trades one form of motivated reasoning (threshold relaxation) for
738
+ another (recipe compliance). The underlying discipline β€” "pre-commit
739
+ your gates and honor them" β€” does not mean "mechanically run the
740
+ pre-committed fix regardless of what it addresses." It means "honor
741
+ the gate's *intent*, which is to prevent motivated reasoning about
742
+ pass/fail."
743
+
744
+ **Post-hoc refinement (2026-04-22) β€” three latency regimes observed.**
745
+ Follow-up warm-latency measurements after the initial cold-start
746
+ characterization distinguish a middle regime that the README's
747
+ "~2 min cold / ~5 s warm" phrasing collapses:
748
+
749
+ | Regime | Latency | Cause |
750
+ |---|---|---|
751
+ | Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) |
752
+ | Wake-from-idle (first `/ask` after container is up) | ~6.7 s (β‰ˆ +2 s over warm) | Residual lazy init on first request |
753
+ | Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval |
754
+
755
+ Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state
756
+ warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim
757
+ captures regimes 1 and 3; the ~2 s first-hit penalty between them is
758
+ a refinement of the warm-cold boundary, not a contradiction. The
759
+ v1.1 projection of "+8–15 s first request after any cold wake"
760
+ measures the delta from this ~2 s current baseline, not from a
761
+ zero-penalty starting point β€” the true v1.1 UX cost over current
762
+ behavior is closer to +6–13 s.
763
+
764
  ## False-premise questions come in two flavors
765
 
766
  When authoring golden-dataset questions whose premise is wrong, the
 
1995
  option (a) as the follow-up, because it preserves the production
1996
  detector regex without weakening and keeps the test's fidelity to
1997
  the actual attack surface.
1998
+
1999
+ ## Audit-path bug β€” streams masked a request-crashing failure (2026-04-15)
2000
+
2001
+ During v1 deploy smoke testing, every non-stream `POST /ask` request
2002
+ and every injection-blocked request on the HF Space returned
2003
+ `{"detail": "Internal server error"}` (HTTP 500) instead of the
2004
+ intended 200 / 403. Normal queries via `POST /ask/stream` appeared to
2005
+ work correctly from the dashboard, so the bug was invisible until a
2006
+ direct curl hit a non-stream endpoint. Discovered and fixed same-day.
2007
+
2008
+ **Root cause.** `agent_bench/security/audit_logger.py:60` called
2009
+ `self.path.parent.mkdir(parents=True, exist_ok=True)` the first time
2010
+ a request wrote an audit record. The default path `logs/audit.jsonl`
2011
+ resolved to `/home/user/app/logs/audit.jsonl` at runtime. The
2012
+ Dockerfile's `WORKDIR /home/user/app` creates that directory as
2013
+ **root:root mode 0755** because Docker's WORKDIR directive does not
2014
+ honor `--chown`. Subsequent `COPY --chown=user` lines only change
2015
+ ownership of the copied files, not the directory itself. At runtime
2016
+ under `USER user` (uid 1000), the process had `r-x` on WORKDIR β€” it
2017
+ could read and execute, but not create new subdirectories. First
2018
+ audit write β†’ `PermissionError: [Errno 13] Permission denied: 'logs'`.
2019
+
2020
+ **Why `.cache/` worked and `logs/` didn't.** `.cache/` is created at
2021
+ build time by the two `RUN python scripts/ingest.py ...` steps
2022
+ (running as root) and then explicitly chowned by
2023
+ `RUN chown -R user:user .cache/`. `logs/` was created lazily at
2024
+ runtime by the audit logger, as a non-root user, in a directory owned
2025
+ by root. Directory-creation permission wall.
2026
+
2027
+ **Why the streaming endpoint masked it.** `/ask/stream` calls
2028
+ `_write_audit` at the *end* of the event generator
2029
+ (`routes.py:438`), after all stage events and the final answer have
2030
+ already been yielded over SSE. A failure there is already too late
2031
+ to affect the client's view of the response β€” the client sees the
2032
+ answer, the stream ends, and the audit entry is silently missing.
2033
+ Non-stream `/ask` calls `_write_audit` synchronously before `return`
2034
+ (line 263), so the failure propagates up through the middleware
2035
+ exception handler and becomes a visible 500. Injection-blocked
2036
+ requests on both endpoints also audit synchronously before returning
2037
+ (lines 193, 302) and produce the same visible 500. Normal dashboard
2038
+ use hits `/ask/stream` β†’ symptom invisible; smoke testing
2039
+ non-stream `/ask` β†’ symptom obvious.
2040
+
2041
+ **Traceback confirmed from the HF Space runtime log:**
2042
+
2043
+ ```
2044
+ File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream
2045
+ _write_audit(
2046
+ File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit
2047
+ audit_logger.log(record)
2048
+ File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log
2049
+ self.path.parent.mkdir(parents=True, exist_ok=True)
2050
+ PermissionError: [Errno 13] Permission denied: 'logs'
2051
+ ```
2052
+
2053
+ **Two-fix patch.** Both applied in the same deploy cycle:
2054
+
2055
+ 1. **Dockerfile (proximate).** Commit `ca34ccb`:
2056
+ `RUN mkdir -p logs && chown -R user:user .cache/ logs/` at build
2057
+ time, before `USER user`. Ensures the runtime user owns the
2058
+ audit-log directory from container start.
2059
+ 2. **`AuditLogger.log()` (architectural, defense-in-depth).** Commit
2060
+ `25e0f1b`: wrap the entire write body in `try/except Exception`,
2061
+ log the failure via structlog as `audit_write_failed` with the
2062
+ exception type and path, return normally.
2063
+
2064
+ **Why both, not either alone.**
2065
+
2066
+ - **Only fix 1:** unblocks the Space but leaves the architectural
2067
+ bug β€” an audit logger that can crash the app is misdesigned
2068
+ regardless of filesystem permissions. The SECURITY.md OWASP LLM10
2069
+ writeup would be structurally weak.
2070
+ - **Only fix 2:** stops the crashes, but audit writes would still
2071
+ silently fail on HF Spaces because `logs/` remains unwritable. The
2072
+ OWASP LLM10 claim would be *live-untrue* on the demo surface.
2073
+ - **Both:** demo is honest (audit log actually written on HF Space),
2074
+ architecture is sound (audit failures degrade gracefully), and the
2075
+ OWASP LLM10 writeup has live evidence backing both claims.
2076
+
2077
+ **Verification on the live Space after the patch.** `GET /health`
2078
+ flipped from `"degraded"` to `"healthy"` with `provider_available:
2079
+ true`; `POST /ask` normal returned 200 with grounded answers citing
2080
+ real sources from both corpora (was 500); `POST /ask` with the
2081
+ `ignore_previous` injection pattern returned 403
2082
+ `"Request blocked: potential prompt injection detected"` (was 500);
2083
+ the runtime log shows `injection_detected pattern=ignore_previous
2084
+ tier=heuristic` followed by a 403 status and no `audit_write_failed`
2085
+ entries, confirming the audit write succeeded end-to-end.
2086
+
2087
+ **Methodology lesson β€” streaming endpoints can hide backend failures.**
2088
+ Any request-path side effect (audit logging, metrics, cache write,
2089
+ persistence) that runs at the end of a streaming generator will fail
2090
+ silently from the client's perspective, because content is already
2091
+ committed to the wire before the failure fires. Testing such side
2092
+ effects against their failure modes *through* a streaming happy path
2093
+ is not sufficient β€” the happy path masks the failure. Every
2094
+ request-path side effect needs at least one test that verifies it
2095
+ fails loudly (or, where the design says graceful, degrades loudly)
2096
+ when its underlying dependency is broken, **exercised synchronously,
2097
+ not via the streaming code path.**
2098
+
2099
+ For this project, `AuditLogger.log()` should gain a unit test that
2100
+ asserts it returns normally (not raises) when its target directory is
2101
+ unwritable β€” i.e., a regression test for the `try/except` fix above.
2102
+ And the dependency injection of the audit logger in the route
2103
+ handlers should get an integration test that exercises both the
2104
+ streaming and non-streaming paths against the same failure injection,
2105
+ to assert the streaming path does not hide what the non-streaming
2106
+ path surfaces. Tracked as a v1.1 test-coverage follow-up.
2107
+
2108
+ **Why this wasn't caught locally.** Existing
2109
+ `tests/test_audit_logger.py` tests use pytest's `tmp_path` fixture,
2110
+ which is always writable by the test process. The failure mode only
2111
+ manifests when the runtime user cannot create the log directory, a
2112
+ deployment-environment condition not reproducible in a unit test
2113
+ against a local filesystem owned by the developer. An integration
2114
+ test that runs the Docker image end-to-end and hits `/ask` against
2115
+ the actual container filesystem would have caught it pre-deploy.
2116
+ Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
2117
+ build infrastructure) but is the right long-term mitigation for this
2118
+ class of bug.