Spaces:

Nomearod
/

agentbench

Running

Jane Yeung Claude Opus 4.6 (1M context) commited on Mar 31

Commit

a9d4375

1 Parent(s): 8875eea

feat: infrastructure sprint — vLLM/Modal, Helm, Terraform (#8)

* feat: add SelfHostedProvider for OpenAI-compatible endpoints (vLLM, TGI, Ollama)

- SelfHostedProvider targets any /v1/chat/completions endpoint via httpx
- Config schema extended: provider.selfhosted.{base_url, model_name, api_key, timeout_seconds}
- Env var fallback: MODAL_VLLM_URL, SELFHOSTED_MODEL, MODAL_AUTH_TOKEN
- Lazy tool-calling detection via startup probe; prompt-based fallback for
models that don't support function calling
- True streaming via httpx.AsyncClient.stream() + aiter_lines()
- 25 tests covering factory, complete, tool detection, prompt fallback,
retry/timeout, env vars, streaming, format_tools
- YAML configs for local (docker-compose) and Modal (serverless) deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings on SelfHostedProvider

- selfhosted_local.yaml: vLLM port changed from 8000 to 8001 to avoid
collision with FastAPI app (which also serves on 8000)
- docker-compose.vllm.yml: host port updated to 8001 accordingly
- _detect_tool_calling(): distinguish transient failures (timeout, 5xx)
returning None from definitive unsupported (400) returning False.
Transient results are NOT cached, so detection retries on next call.
- stream_complete(): apply same tool-calling detection/fallback logic
as complete() — prevents 400 on unsupported endpoints in streaming mode
- Added tests: transient failure returns None, 5xx returns None,
transient retries on next call, YAML-from-disk loading (3 files),
port collision regression test. 31 tests total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Modal vLLM deployment and benchmark runner

- modal/common.py: shared constants (model, GPU type, cost tracking)
- modal/serve_vllm.py: Modal app deploying vLLM as OpenAI-compatible
endpoint on A10G GPU with /v1/chat/completions and /health
- modal/run_benchmark.py: runs 27-question eval against all provider
configs and generates docs/provider_comparison.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address round-2 review findings

Provider:
- _parse_tool_calls_from_text: validate arguments is dict, not arbitrary
JSON (e.g. "oops") — prevents Pydantic ValidationError
- selfhosted_local.yaml: remove hardcoded base_url to avoid port collision
with app (8000) and Docker service-name mismatch; falls back to
MODAL_VLLM_URL env var, then default http://localhost:8001/v1
- Default fallback URL changed from :8000 to :8001

Docker Compose:
- AGENT_BENCH_ENV=selfhosted_local now works: config has empty base_url,
so MODAL_VLLM_URL=http://vllm:8000/v1 takes effect in-container

Modal:
- serve_vllm.py: rewritten to use vLLM CLI subprocess + ASGI proxy,
avoiding unstable Python API imports
- run_benchmark.py: fixed list-vs-dict crash — evaluate.py returns
list[EvalResult]; added aggregate() to compute P@5, R@5, citation
accuracy, latency p50, cost from per-question results; fixed field
names to match EvalResult schema; fixed docstring (removed invalid
`modal run` claim)

Tests:
- Added test_fallback_handles_non_dict_arguments (malformed args → {})
- Updated YAML-from-disk tests for empty base_url
- 32 selfhosted tests, 201 total

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Modal proxy signature, status forwarding, benchmark --base-url

- serve_vllm.py: annotate request as fastapi.Request (was parsed as
query param, returning 422 on every call); forward upstream status
code and headers via JSONResponse instead of bare resp.json()
- run_benchmark.py: --base-url is now optional; only required when
selfhosted_modal is in the provider set. --only openai works without it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Modal proxy readiness, streaming errors, GPU cost reporting

serve_vllm.py:
- Add readiness loop: poll vLLM /health before accepting proxied
requests (180s timeout, 2s interval). Prevents cold-start failures.
- Streaming branch: check resp.status_code before streaming; return
upstream 4xx/5xx as non-streaming Response with real status code.
- Import ordering fix for ruff.

run_benchmark.py:
- Self-hosted cost: derive from GPU-seconds (latency_ms * A10G rate)
instead of token pricing (which is $0.00 for self-hosted models).
Uses MODAL_A10G_COST_PER_SEC from common.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Helm chart for K8s deployment with dev/prod values

- Chart with Deployment, Service, HPA, ConfigMap, Secret templates
- values-dev.yaml: 1 replica, no HPA, reduced resources
- values-prod.yaml: 3 replicas, HPA 2-8 pods at 70% CPU
- Container port 7860 (matching Dockerfile), Service maps to 8000
- Probes target /health endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Terraform GKE modules for API cluster (CPU-only, GCP)

- Root module wires networking + GKE modules
- Networking: VPC, subnet with pod/service CIDRs, firewall rules
- GKE: cluster with managed node pool (e2-standard-4, 2 nodes)
- No GPU nodes — inference runs on Modal (external)
- terraform.tfvars gitignored; example provided

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add infra documentation, Makefile targets, and architecture updates

- Makefile: modal-deploy, modal-stop, vllm-up, benchmark-all, k8s-dev,
k8s-prod, tf-plan, tf-validate targets
- DECISIONS.md: 7 new entries (vLLM, Modal, split topology, Helm,
CPU HPA, env var fallback, lazy tool detection)
- README.md: self-hosted/K8s/Terraform sections, provider tree in
architecture diagram, updated test count (201) and skills list
- docs/k8s-local-setup.md: minikube walkthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: streaming proxy lifetime, Helm provider routing, README flow

serve_vllm.py:
- Fix stream lifetime: use client.send(req, stream=True) instead of
async with client.stream() to avoid closing the upstream connection
before FastAPI iterates the generator. Generator owns cleanup via
try/finally with upstream.aclose().

Helm:
- configmap.yaml: AGENT_BENCH_ENV now conditional on provider.type
(selfhosted → selfhosted_modal, openai → default, anthropic → anthropic)
instead of hardcoded selfhosted_modal.
- Makefile k8s-dev/k8s-prod: require MODAL_VLLM_URL and pass it via
--set provider.selfhosted.modalEndpoint to prevent deploying pods
with empty endpoint URLs.

README:
- Benchmark flow: document required OPENAI_API_KEY and ANTHROPIC_API_KEY
for make benchmark-all; show --only selfhosted_modal alternative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: proxy streaming detection, provider comparison docs consistency

serve_vllm.py:
- Detect streaming by checking request body for "stream": true, not
Accept header. httpx sends Accept: */* by default, so the header
check missed real streaming requests and fell through to resp.json()
on SSE bodies (JSONDecodeError).

docs/provider_comparison.md:
- Updated to 3-provider format (openai, anthropic, selfhosted_modal)
matching the benchmark runner output. Self-hosted row marked TBD
with instructions to run make benchmark-all.

README.md:
- Provider comparison table updated to 3 columns with TBD self-hosted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: proxy non-JSON responses, docs table consistency

serve_vllm.py:
- Non-streaming branch: try resp.json(), fall back to raw Response
for endpoints that return empty/non-JSON bodies (e.g. vLLM /health
returns empty 200). Prevents JSONDecodeError → 500.

docs/provider_comparison.md:
- Table shape now matches generator output: 6 columns, no Model column.

README.md:
- Description updated from "2 providers" to "3 providers".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update stale test counts in README (169 → 201)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Modal vLLM deployment and self-hosted benchmark

- Pin vllm==0.6.6.post1, transformers==4.47.0, huggingface_hub<1.0 to
fix tqdm DisabledTqdm, tokenizer, and head_dim incompatibilities
- Add HF secret for authenticated model downloads
- Increase context window to 8192 and ready timeout to 600s
- Add proxy error handling with traceback surfacing
- Sanitize messages for non-tool-calling models: merge multiple system
messages, convert tool-role to user, merge consecutive same-role
messages (required by Mistral chat template)
- Abort benchmark on first provider failure instead of continuing
- Tune selfhosted config: max_iterations=1, top_k=3 for 7B context

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update README with self-hosted benchmark results

Replace TBD placeholders with actual Mistral-7B benchmark data from
Modal vLLM deployment. Add citation accuracy and latency columns to
provider comparison table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync common.py context window and complete provider comparison

- Sync VLLM_MAX_MODEL_LEN to 8192 in common.py (was 4096, diverged
from serve_vllm.py during debugging)
- Add OpenAI and Anthropic data to provider_comparison.md alongside
self-hosted results for complete 3-provider comparison with analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add provider health_check and Prometheus metrics endpoint

- Add health_check() to LLMProvider interface with default True
- OpenAI: probe via models.retrieve()
- Anthropic: probe via minimal messages.create()
- SelfHosted: probe via GET /models (vLLM endpoint)
- /health now calls provider.health_check() instead of just checking
the provider object exists — pods report degraded when inference is
actually unreachable
- Add /metrics/prometheus endpoint with text exposition format
(counter/gauge types) for Prometheus adapter + K8s HPA custom metrics
- Existing JSON /metrics endpoint unchanged

Co-Authored-By: Claude Opus 4.6 (1M context) <nore

Files changed (37) hide show

.gitignore +4 -0
DECISIONS.md +49 -0
Makefile +33 -1
README.md +59 -12
agent_bench/core/config.py +8 -0
agent_bench/core/provider.py +413 -0
agent_bench/serving/routes.py +32 -4
configs/selfhosted_local.yaml +58 -0
configs/selfhosted_modal.yaml +56 -0
docker/docker-compose.vllm.yml +50 -0
docs/k8s-local-setup.md +40 -0
docs/provider_comparison.md +60 -39
k8s/helm/agent-bench/Chart.yaml +6 -0
k8s/helm/agent-bench/templates/_helpers.tpl +35 -0
k8s/helm/agent-bench/templates/configmap.yaml +15 -0
k8s/helm/agent-bench/templates/deployment.yaml +45 -0
k8s/helm/agent-bench/templates/hpa.yaml +22 -0
k8s/helm/agent-bench/templates/secret.yaml +12 -0
k8s/helm/agent-bench/templates/service.yaml +15 -0
k8s/helm/agent-bench/values-dev.yaml +12 -0
k8s/helm/agent-bench/values-prod.yaml +15 -0
k8s/helm/agent-bench/values.yaml +43 -0
modal/common.py +11 -0
modal/run_benchmark.py +182 -0
modal/serve_vllm.py +187 -0
pyproject.toml +3 -0
terraform/main.tf +32 -0
terraform/modules/gke/main.tf +38 -0
terraform/modules/gke/outputs.tf +8 -0
terraform/modules/gke/variables.tf +29 -0
terraform/modules/networking/main.tf +67 -0
terraform/modules/networking/variables.tf +11 -0
terraform/outputs.tf +15 -0
terraform/terraform.tfvars.example +6 -0
terraform/variables.tf +16 -0
tests/test_selfhosted_provider.py +689 -0
tests/test_serving.py +95 -0

.gitignore CHANGED Viewed

@@ -17,3 +17,7 @@ venv/
 .worktrees/
 *.db
 docs/DESIGN.md

 .worktrees/
 *.db
 docs/DESIGN.md
+terraform.tfvars
+.terraform/
+*.tfstate
+*.tfstate.backup

DECISIONS.md CHANGED Viewed

@@ -232,3 +232,52 @@ The deduplicated `sources` list in `AgentResponse` is for the API
 response. The `ranked_sources` list preserves rank order with
 duplicates for evaluation metrics. P@5 and R@5 need the raw
 retrieval ranking, not the post-processed answer metadata.

 response. The `ranked_sources` list preserves rank order with
 duplicates for evaluation metrics. P@5 and R@5 need the raw
 retrieval ranking, not the post-processed answer metadata.
+## Why vLLM over TGI / llama.cpp
+vLLM has the widest model support, best throughput via PagedAttention, and a native
+OpenAI-compatible server (`/v1/chat/completions`). TGI is a valid alternative; llama.cpp
+targets different use cases (edge/CPU inference). This is a deliberate choice, not
+ignorance of alternatives.
+## Why Modal for GPU inference
+Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs
+~$0.50 per full 27-question benchmark run. The Docker Compose path (`docker-compose.vllm.yml`)
+is retained for users who have local GPUs or prefer persistent serving.
+## Why split topology (K8s API + Modal GPU)
+The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from
+horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from
+serverless elasticity — Modal scales to zero when idle, scales up on demand with no node
+provisioning. Co-locating both in K8s would require GPU node pools with idle cost,
+node autoscaler latency, and NVIDIA device plugin management. This mirrors a common
+production pattern.
+## Why Helm only, not Kustomize + Helm
+Showing two K8s deployment methods for the same app adds complexity without demonstrating
+distinct skills. Helm with `values-dev.yaml` / `values-prod.yaml` covers
+environment-specific configuration cleanly.
+## Why CPU-based HPA, not custom metrics
+CPU utilization works without a Prometheus adapter or custom metrics server. A production
+improvement would use the Prometheus adapter to scale on p95 latency from the `/metrics`
+endpoint — this requires bridging the JSON metrics to Prometheus exposition format.
+Documented as a follow-up.
+## Why env var fallback in SelfHostedProvider
+Follows the same pattern as OpenAIProvider reading `OPENAI_API_KEY`. The YAML config
+provides defaults; env vars override at runtime. No config loader changes needed.
+## Why lazy tool-call detection, not metadata check
+Checking `/v1/models` metadata for tool-calling support is unreliable — model metadata
+doesn't consistently report this capability. Instead, the provider sends one tool-calling
+request on first `complete()` call with tools and checks if the response contains
+`tool_calls`. The result is cached as `self._supports_tool_calling`. Transient failures
+(timeout, 5xx) return `None` and retry on the next call rather than permanently
+downgrading to prompt-based fallback.

Makefile CHANGED Viewed

@@ -1,6 +1,6 @@
 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
-.PHONY: install test lint serve ingest evaluate-fast evaluate-full benchmark evaluate-langchain docker
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
@@ -33,3 +33,35 @@ evaluate-langchain:
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build

 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
+.PHONY: install test lint serve ingest evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build
+## --- Infrastructure ---
+modal-deploy:  ## Deploy vLLM on Modal (prints endpoint URL)
+	@command -v modal >/dev/null 2>&1 || { echo "Error: modal CLI not found. Run: pip install -e '.[modal]' && modal setup"; exit 1; }
+	modal deploy modal/serve_vllm.py
+modal-stop:  ## Stop Modal deployment
+	@command -v modal >/dev/null 2>&1 || { echo "Error: modal CLI not found. Run: pip install -e '.[modal]' && modal setup"; exit 1; }
+	modal app stop agent-bench-vllm
+vllm-up:  ## Start local vLLM via Docker Compose (requires NVIDIA GPU)
+	docker compose -f docker/docker-compose.vllm.yml up --build
+benchmark-all:  ## Run provider comparison (requires Modal deployment + API keys)
+	$(PYTHON) modal/run_benchmark.py --base-url $(MODAL_VLLM_URL)
+k8s-dev:  ## Deploy to minikube (dev values, set MODAL_VLLM_URL first)
+	@test -n "$(MODAL_VLLM_URL)" || (echo "Error: MODAL_VLLM_URL is not set" && exit 1)
+	helm install agent-bench k8s/helm/agent-bench/ -f k8s/helm/agent-bench/values-dev.yaml \
+		--set provider.selfhosted.modalEndpoint=$(MODAL_VLLM_URL)
+k8s-prod:  ## Deploy via Helm (prod values, set MODAL_VLLM_URL first)
+	@test -n "$(MODAL_VLLM_URL)" || (echo "Error: MODAL_VLLM_URL is not set" && exit 1)
+	helm install agent-bench k8s/helm/agent-bench/ -f k8s/helm/agent-bench/values-prod.yaml \
+		--set provider.selfhosted.modalEndpoint=$(MODAL_VLLM_URL)
+tf-plan:  ## Run terraform plan (no apply)
+	cd terraform && terraform plan
+tf-validate:  ## Validate terraform syntax
+	cd terraform && terraform validate

README.md CHANGED Viewed

@@ -2,9 +2,9 @@
 ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
-Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 2 providers. Zero hallucinated citations in all four configurations.
-`169 tests` | `2 providers` | `LangChain comparison` | `Docker` | `CI`
 ## Benchmark Results
@@ -30,12 +30,16 @@ Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
 ### Provider Comparison (Custom Pipeline)
-| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku |
-|--------|-------------------|----------------------|
-| Retrieval P@5 | 0.70 | **0.74** |
-| Retrieval R@5 | 0.83 | **0.84** |
-| Keyword Hit Rate | 0.89 | **0.92** |
-| Cost per query | **$0.0004** | $0.0007 |
 [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
@@ -78,6 +82,40 @@ curl -X POST http://localhost:8000/ask \
 OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
 ```
 ## Architecture
 ```mermaid
@@ -90,13 +128,21 @@ flowchart LR
     Reg --> Calc[calculator]
     Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
     LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
 ```
 ## Skills Demonstrated
 - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
 - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
-- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 169 deterministic tests with mock providers
 <details><summary>API Reference</summary>
@@ -107,7 +153,8 @@ flowchart LR
 | `/ask` | POST | Ask a question, get answer with sources |
 | `/ask/stream` | POST | SSE streaming (sources → chunks → done) |
 | `/health` | GET | Store stats, provider status, uptime |
-| `/metrics` | GET | Request count, latency p50/p95, cost |
 ### POST /ask
@@ -156,7 +203,7 @@ The golden dataset contains 27 hand-crafted questions:
 ## Testing
 ```bash
-make test    # 169 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
@@ -179,7 +226,7 @@ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF
 | Conversation memory | Stateless | SQLite sessions | State management |
 | Cloud deployment | None | HF Spaces (Docker) | Docker → production |
 | CI/CD | None | GitHub Actions | Automated quality gates |
-| Tests | 97 | 169 | Comprehensive coverage |
 See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.

 ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
+Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal). Zero hallucinated citations in all API configurations.
+`205 tests` · `3 providers` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
 ### Provider Comparison (Custom Pipeline)
+| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Self-hosted Mistral-7B |
+|--------|-------------------|----------------------|----------------------|
+| Retrieval P@5 | 0.70 | **0.74** | 0.05 |
+| Retrieval R@5 | 0.83 | **0.84** | 0.05 |
+| Keyword Hit Rate | 0.89 | **0.92** | 0.61 |
+| Citation Acc | **1.00** | **1.00** | 0.14 |
+| Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
+| Cost per query | **$0.0004** | $0.0007 | $0.0031 |
+API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window — not an apples-to-apples comparison, but reflects realistic 7B operating constraints. See [provider comparison](docs/provider_comparison.md) for full analysis.
 [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
 OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
 ```
+### Self-Hosted LLM via Modal (no local GPU needed)
+```bash
+pip install -e ".[modal]"                                # Install Modal SDK
+modal setup                                              # Authenticate with Modal
+modal secret create huggingface-secret HF_TOKEN=hf_...   # HF token for model download
+make modal-deploy                                        # Deploy vLLM on Modal A10G
+export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
+AGENT_BENCH_ENV=selfhosted_modal make serve              # Serve with self-hosted provider
+# Run provider comparison (requires all provider API keys)
+export OPENAI_API_KEY=sk-...
+export ANTHROPIC_API_KEY=sk-ant-...
+make benchmark-all
+# Or run only the self-hosted provider
+python modal/run_benchmark.py --base-url $MODAL_VLLM_URL --only selfhosted_modal
+```
+### Self-Hosted LLM via Docker Compose (requires local NVIDIA GPU)
+```bash
+docker compose -f docker/docker-compose.vllm.yml up --build
+```
+### Kubernetes (Helm)
+```bash
+make k8s-dev     # Dev: 1 replica, no HPA
+make k8s-prod    # Prod: 3 replicas, HPA 2-8 pods
+```
+See [docs/k8s-local-setup.md](docs/k8s-local-setup.md) for minikube walkthrough.
 ## Architecture
 ```mermaid
     Reg --> Calc[calculator]
     Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
     LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
+    subgraph Providers
+        LLM --- OpenAI[OpenAI<br/>gpt-4o-mini]
+        LLM --- Anthropic[Anthropic<br/>claude-haiku]
+        LLM --- SelfHosted[SelfHosted<br/>vLLM / TGI / Ollama]
+    end
 ```
 ## Skills Demonstrated
 - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
 - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
+- **Infrastructure:** Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose)
+- **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
+- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 205 deterministic tests with mock providers
 <details><summary>API Reference</summary>
 | `/ask` | POST | Ask a question, get answer with sources |
 | `/ask/stream` | POST | SSE streaming (sources → chunks → done) |
 | `/health` | GET | Store stats, provider status, uptime |
+| `/metrics` | GET | Request count, latency p50/p95, cost (JSON) |
+| `/metrics/prometheus` | GET | Prometheus text exposition format |
 ### POST /ask
 ## Testing
 ```bash
+make test    # 205 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
 | Conversation memory | Stateless | SQLite sessions | State management |
 | Cloud deployment | None | HF Spaces (Docker) | Docker → production |
 | CI/CD | None | GitHub Actions | Automated quality gates |
+| Tests | 97 | 205 | Comprehensive coverage |
 See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.

agent_bench/core/config.py CHANGED Viewed

@@ -21,9 +21,17 @@ class ModelPricing(BaseModel):
     output_cost_per_mtok: float
 class ProviderConfig(BaseModel):
     default: str = "openai"
     models: dict[str, ModelPricing] = {}
 class ChunkingConfig(BaseModel):

     output_cost_per_mtok: float
+class SelfHostedConfig(BaseModel):
+    base_url: str = ""
+    model_name: str = "mistralai/Mistral-7B-Instruct-v0.3"
+    api_key: str = ""
+    timeout_seconds: float = 120.0
 class ProviderConfig(BaseModel):
     default: str = "openai"
     models: dict[str, ModelPricing] = {}
+    selfhosted: SelfHostedConfig = SelfHostedConfig()
 class ChunkingConfig(BaseModel):

agent_bench/core/provider.py CHANGED Viewed

@@ -102,6 +102,15 @@ class LLMProvider(ABC):
     @abstractmethod
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]: ...
 # --- Implementations ---
@@ -327,6 +336,13 @@ class OpenAIProvider(LLMProvider):
             if chunk.choices and chunk.choices[0].delta.content:
                 yield chunk.choices[0].delta.content
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]:
         return format_tools_openai(tools)
@@ -560,10 +576,405 @@ class AnthropicProvider(LLMProvider):
                     f"Anthropic timed out: {e}"
                 ) from e
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]:
         return format_tools_anthropic(tools)
 def create_provider(config: AppConfig | None = None) -> LLMProvider:
     """Factory: create provider based on config."""
     if config is None:
@@ -573,6 +984,8 @@ def create_provider(config: AppConfig | None = None) -> LLMProvider:
         return OpenAIProvider(config)
     elif name == "anthropic":
         return AnthropicProvider(config)
     elif name == "mock":
         return MockProvider()
     else:

     @abstractmethod
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]: ...
+    async def health_check(self) -> bool:
+        """Check if the upstream provider is reachable.
+        Returns True if the provider can serve requests, False otherwise.
+        Default implementation returns True (assume healthy). Providers
+        should override this to perform a real connectivity check.
+        """
+        return True
 # --- Implementations ---
             if chunk.choices and chunk.choices[0].delta.content:
                 yield chunk.choices[0].delta.content
+    async def health_check(self) -> bool:
+        try:
+            await self.client.models.retrieve(self.model)
+            return True
+        except Exception:
+            return False
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]:
         return format_tools_openai(tools)
                     f"Anthropic timed out: {e}"
                 ) from e
+    async def health_check(self) -> bool:
+        try:
+            await self.client.models.retrieve(model_id=self.model)
+            return True
+        except Exception:
+            return False
     def format_tools(self, tools: list[ToolDefinition]) -> list[dict]:
         return format_tools_anthropic(tools)
+class SelfHostedProvider(LLMProvider):
+    """Provider targeting any OpenAI-compatible endpoint (vLLM, TGI, Ollama).
+    Reads settings from config (provider.selfhosted.*) with env var fallback:
+        MODAL_VLLM_URL   -> base_url
+        SELFHOSTED_MODEL -> model_name
+        MODAL_AUTH_TOKEN  -> api_key
+    Tool-calling support is detected lazily on the first complete() call
+    with tools. If the endpoint returns a 400 or the model ignores tools,
+    subsequent calls fall back to prompt-based tool selection.
+    """
+    def __init__(self, config: AppConfig | None = None) -> None:
+        import os
+        import httpx as _httpx
+        self.config = config or load_config()
+        sh = self.config.provider.selfhosted
+        self.base_url = (
+            sh.base_url
+            or os.environ.get("MODAL_VLLM_URL", "http://localhost:8001/v1")
+        )
+        self.model = (
+            sh.model_name
+            if sh.model_name != "mistralai/Mistral-7B-Instruct-v0.3"
+            else os.environ.get("SELFHOSTED_MODEL", sh.model_name)
+        )
+        api_key = sh.api_key or os.environ.get("MODAL_AUTH_TOKEN", "")
+        self._supports_tool_calling: bool | None = None  # detected lazily
+        model_pricing = self.config.provider.models.get(self.model)
+        self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.0
+        self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.0
+        self.client = _httpx.AsyncClient(
+            base_url=self.base_url,
+            timeout=sh.timeout_seconds,
+            follow_redirects=True,
+            headers={"Authorization": f"Bearer {api_key}"} if api_key else {},
+        )
+    async def _detect_tool_calling(self) -> bool | None:
+        """Probe the endpoint for OpenAI-format tool-calling support.
+        Returns:
+            True  — model responded with tool_calls (definitive: cache it)
+            False — endpoint returned 400 (definitive: cache it)
+            None  — transient failure (timeout, 5xx, connection error); do NOT cache
+        """
+        test_tool = {
+            "type": "function",
+            "function": {
+                "name": "test_probe",
+                "description": "Probe for tool support",
+                "parameters": {
+                    "type": "object",
+                    "properties": {"x": {"type": "string"}},
+                },
+            },
+        }
+        try:
+            resp = await self.client.post(
+                "/chat/completions",
+                json={
+                    "model": self.model,
+                    "messages": [
+                        {"role": "user", "content": "Call the test_probe tool with x='hello'"}
+                    ],
+                    "tools": [test_tool],
+                    "tool_choice": "auto",
+                    "max_tokens": 50,
+                },
+            )
+            if resp.status_code == 400:
+                log.info("selfhosted_tool_detect", result="unsupported (400)")
+                return False
+            if resp.status_code >= 500:
+                log.warning("selfhosted_tool_detect", result="transient (5xx)")
+                return None
+            resp.raise_for_status()
+            data = resp.json()
+            has_tools = bool(
+                data["choices"][0]["message"].get("tool_calls")
+            )
+            log.info("selfhosted_tool_detect", result="supported" if has_tools else "unsupported")
+            return has_tools
+        except Exception:
+            log.warning("selfhosted_tool_detect", result="transient (error)")
+            return None
+    @staticmethod
+    def _sanitize_messages(messages: list[dict]) -> list[dict]:
+        """Convert tool-role messages and merge consecutive same-role messages.
+        Many models (e.g. Mistral) require strictly alternating user/assistant
+        messages. Tool results are converted to user messages and consecutive
+        same-role messages are merged.
+        """
+        sanitized: list[dict] = []
+        for m in messages:
+            if m["role"] == "tool":
+                role = "user"
+                content = f"[Tool result]: {m['content']}"
+            elif m["role"] == "assistant" and "tool_calls" in m:
+                role = "assistant"
+                content = m.get("content") or ""
+            else:
+                role = m["role"]
+                content = m.get("content") or ""
+            # Merge consecutive same-role messages
+            if sanitized and sanitized[-1]["role"] == role and role != "system":
+                sanitized[-1]["content"] += "\n\n" + content
+            else:
+                sanitized.append({"role": role, "content": content})
+        # Merge consecutive same-role messages that resulted from dropping empty ones
+        merged: list[dict] = []
+        for m in sanitized:
+            if not m["content"].strip() and m["role"] != "system":
+                continue  # drop empty messages
+            if merged and merged[-1]["role"] == m["role"] and m["role"] != "system":
+                merged[-1]["content"] += "\n\n" + m["content"]
+            else:
+                merged.append(m)
+        return merged
+    @staticmethod
+    def _tools_as_prompt(tools: list[ToolDefinition]) -> str:
+        """Format tools as system prompt text for prompt-based fallback."""
+        lines = ["You have access to the following tools:", ""]
+        for t in tools:
+            lines.append(f"- {t.name}: {t.description}")
+            lines.append(f"  Parameters: {json.dumps(t.parameters)}")
+        lines.extend([
+            "",
+            "To use a tool, respond with ONLY this JSON (no other text):",
+            '{"tool_calls": [{"name": "tool_name", "arguments": {"param": "value"}}]}',
+            "",
+            "If you don't need a tool, respond normally with text.",
+        ])
+        return "\n".join(lines)
+    @staticmethod
+    def _parse_tool_calls_from_text(text: str) -> list[ToolCall]:
+        """Parse tool calls from model text output (prompt-based fallback)."""
+        import uuid
+        try:
+            data = json.loads(text.strip())
+            if isinstance(data, dict) and "tool_calls" in data:
+                calls = []
+                for tc in data["tool_calls"]:
+                    raw_args = tc.get("arguments", {})
+                    if not isinstance(raw_args, dict):
+                        raw_args = {}
+                    calls.append(
+                        ToolCall(
+                            id=f"call_{uuid.uuid4().hex[:8]}",
+                            name=tc["name"],
+                            arguments=raw_args,
+                        )
+                    )
+                return calls
+        except (json.JSONDecodeError, KeyError, TypeError):
+            pass
+        return []
+    async def complete(
+        self,
+        messages: list[Message],
+        tools: list[ToolDefinition] | None = None,
+        temperature: float = 0.0,
+        max_tokens: int = 1024,
+    ) -> CompletionResponse:
+        import httpx as _httpx
+        # Lazy tool-calling detection on first call with tools
+        if tools and self._supports_tool_calling is None:
+            result = await self._detect_tool_calling()
+            if result is not None:
+                self._supports_tool_calling = result
+            # If None (transient), leave as None so next call retries
+        formatted_messages = format_messages_openai(messages)
+        # Use native tools only when detection confirmed support.
+        # When detection is None (transient failure), fall back to prompt-based
+        # rather than risk a 400 with native tools on an unsupported endpoint.
+        use_native_tools = tools and self._supports_tool_calling is True
+        if tools and not use_native_tools:
+            tool_prompt = self._tools_as_prompt(tools)
+            # Merge tool instructions into existing system message (some models
+            # like Mistral reject multiple system messages in their chat template)
+            if formatted_messages and formatted_messages[0]["role"] == "system":
+                formatted_messages[0]["content"] = (
+                    tool_prompt + "\n\n" + formatted_messages[0]["content"]
+                )
+            else:
+                formatted_messages = [
+                    {"role": "system", "content": tool_prompt},
+                    *formatted_messages,
+                ]
+        # Always sanitize for self-hosted: messages may contain tool/tool_calls
+        # from earlier iterations even when current call has tools=None
+        formatted_messages = self._sanitize_messages(formatted_messages)
+        payload: dict = {
+            "model": self.model,
+            "messages": formatted_messages,
+            "temperature": temperature,
+            "max_tokens": max_tokens,
+        }
+        if use_native_tools and tools:
+            payload["tools"] = self.format_tools(tools)
+            payload["tool_choice"] = "auto"
+        retry_cfg = self.config.retry
+        start = time.perf_counter()
+        for attempt in range(retry_cfg.max_retries + 1):
+            try:
+                resp = await self.client.post("/chat/completions", json=payload)
+                if resp.status_code == 429:
+                    if attempt == retry_cfg.max_retries:
+                        raise ProviderRateLimitError(
+                            f"Rate limited after {retry_cfg.max_retries} retries"
+                        )
+                    wait = min(
+                        retry_cfg.base_delay * (2 ** attempt), retry_cfg.max_delay
+                    )
+                    log.warning(
+                        "selfhosted_retry",
+                        attempt=attempt + 1,
+                        wait_seconds=wait,
+                    )
+                    await asyncio.sleep(wait)
+                    continue
+                if resp.status_code >= 400:
+                    log.error("selfhosted_error", status=resp.status_code, body=resp.text[:500])
+                resp.raise_for_status()
+                break
+            except _httpx.TimeoutException as e:
+                raise ProviderTimeoutError(f"Self-hosted timed out: {e}") from e
+        latency_ms = (time.perf_counter() - start) * 1000
+        data = resp.json()
+        choice = data["choices"][0]
+        content = choice["message"].get("content") or ""
+        tool_calls: list[ToolCall] = []
+        if choice["message"].get("tool_calls"):
+            # Native tool calling response
+            for tc in choice["message"]["tool_calls"]:
+                try:
+                    args = json.loads(tc["function"]["arguments"])
+                except (json.JSONDecodeError, KeyError):
+                    args = {}
+                tool_calls.append(
+                    ToolCall(
+                        id=tc["id"],
+                        name=tc["function"]["name"],
+                        arguments=args,
+                    )
+                )
+        elif tools and not self._supports_tool_calling and content:
+            # Prompt-based fallback: parse tool calls from text
+            tool_calls = self._parse_tool_calls_from_text(content)
+            if tool_calls:
+                content = ""  # tool call replaces text content
+        usage_data = data.get("usage", {})
+        input_tokens = usage_data.get("prompt_tokens", 0)
+        output_tokens = usage_data.get("completion_tokens", 0)
+        cost = (
+            input_tokens * self._input_cost + output_tokens * self._output_cost
+        ) / 1_000_000
+        return CompletionResponse(
+            content=content,
+            tool_calls=tool_calls,
+            usage=TokenUsage(
+                input_tokens=input_tokens,
+                output_tokens=output_tokens,
+                estimated_cost_usd=cost,
+            ),
+            provider="selfhosted",
+            model=self.model,
+            latency_ms=latency_ms,
+        )
+    async def stream_complete(
+        self,
+        messages: list[Message],
+        tools: list[ToolDefinition] | None = None,
+        temperature: float = 0.0,
+        max_tokens: int = 1024,
+    ) -> AsyncIterator[str]:
+        import httpx as _httpx
+        # Same tool-calling detection/fallback as complete()
+        if tools and self._supports_tool_calling is None:
+            result = await self._detect_tool_calling()
+            if result is not None:
+                self._supports_tool_calling = result
+        formatted_messages = format_messages_openai(messages)
+        use_native_tools = tools and self._supports_tool_calling is True
+        if tools and not use_native_tools:
+            tool_prompt = self._tools_as_prompt(tools)
+            if formatted_messages and formatted_messages[0]["role"] == "system":
+                formatted_messages[0]["content"] = (
+                    tool_prompt + "\n\n" + formatted_messages[0]["content"]
+                )
+            else:
+                formatted_messages = [
+                    {"role": "system", "content": tool_prompt},
+                    *formatted_messages,
+                ]
+        formatted_messages = self._sanitize_messages(formatted_messages)
+        payload: dict = {
+            "model": self.model,
+            "messages": formatted_messages,
+            "temperature": temperature,
+            "max_tokens": max_tokens,
+            "stream": True,
+        }
+        if use_native_tools and tools:
+            payload["tools"] = self.format_tools(tools)
+            payload["tool_choice"] = "auto"
+        retry_cfg = self.config.retry
+        for attempt in range(retry_cfg.max_retries + 1):
+            try:
+                async with self.client.stream(
+                    "POST", "/chat/completions", json=payload
+                ) as resp:
+                    if resp.status_code == 429:
+                        if attempt == retry_cfg.max_retries:
+                            raise ProviderRateLimitError(
+                                f"Rate limited after {retry_cfg.max_retries} retries"
+                            )
+                        wait = min(
+                            retry_cfg.base_delay * (2 ** attempt),
+                            retry_cfg.max_delay,
+                        )
+                        log.warning(
+                            "selfhosted_stream_retry",
+                            attempt=attempt + 1,
+                            wait_seconds=wait,
+                        )
+                        await asyncio.sleep(wait)
+                        continue
+                    resp.raise_for_status()
+                    async for line in resp.aiter_lines():
+                        line = line.strip()
+                        if not line or not line.startswith("data: "):
+                            continue
+                        data_str = line[len("data: "):]
+                        if data_str == "[DONE]":
+                            return
+                        try:
+                            chunk_data = json.loads(data_str)
+                            delta = chunk_data["choices"][0].get("delta", {})
+                            if delta.get("content"):
+                                yield delta["content"]
+                        except (json.JSONDecodeError, KeyError, IndexError):
+                            continue
+                return  # success — exit retry loop
+            except _httpx.TimeoutException as e:
+                raise ProviderTimeoutError(f"Self-hosted timed out: {e}") from e
+    async def health_check(self) -> bool:
+        try:
+            resp = await self.client.get("/models", timeout=5.0)
+            return resp.status_code == 200
+        except Exception:
+            return False
+    def format_tools(self, tools: list[ToolDefinition]) -> list[dict]:
+        return format_tools_openai(tools)
 def create_provider(config: AppConfig | None = None) -> LLMProvider:
     """Factory: create provider based on config."""
     if config is None:
         return OpenAIProvider(config)
     elif name == "anthropic":
         return AnthropicProvider(config)
+    elif name == "selfhosted":
+        return SelfHostedProvider(config)
     elif name == "mock":
         return MockProvider()
     else:

agent_bench/serving/routes.py CHANGED Viewed

@@ -1,4 +1,4 @@
-"""API routes: /ask, /ask/stream, /health, /metrics."""
 from __future__ import annotations
@@ -178,10 +178,10 @@ async def health(request: Request) -> HealthResponse:
     store = request.app.state.store
     start_time: float = request.app.state.start_time
-    provider_available = True
     try:
-        # Just check the provider is constructed — don't make an API call
-        _ = request.app.state.orchestrator.provider
     except Exception:
         provider_available = False
@@ -205,3 +205,31 @@ async def metrics(request: Request) -> MetricsResponse:
         errors_total=m.errors_total,
         avg_cost_per_query_usd=m.avg_cost,
     )

+"""API routes: /ask, /ask/stream, /health, /metrics, /metrics/prometheus."""
 from __future__ import annotations
     store = request.app.state.store
     start_time: float = request.app.state.start_time
+    provider_available = False
     try:
+        provider = request.app.state.orchestrator.provider
+        provider_available = await provider.health_check()
     except Exception:
         provider_available = False
         errors_total=m.errors_total,
         avg_cost_per_query_usd=m.avg_cost,
     )
+@router.get("/metrics/prometheus")
+async def metrics_prometheus(request: Request) -> Response:
+    """Prometheus text exposition format for K8s HPA custom metrics."""
+    m: MetricsCollector = request.app.state.metrics
+    lines = [
+        "# HELP agent_bench_requests_total Total requests served.",
+        "# TYPE agent_bench_requests_total counter",
+        f"agent_bench_requests_total {m.requests_total}",
+        "# HELP agent_bench_errors_total Total error responses.",
+        "# TYPE agent_bench_errors_total counter",
+        f"agent_bench_errors_total {m.errors_total}",
+        "# HELP agent_bench_latency_p50_ms 50th percentile latency in ms.",
+        "# TYPE agent_bench_latency_p50_ms gauge",
+        f"agent_bench_latency_p50_ms {m.percentile(50):.1f}",
+        "# HELP agent_bench_latency_p95_ms 95th percentile latency in ms.",
+        "# TYPE agent_bench_latency_p95_ms gauge",
+        f"agent_bench_latency_p95_ms {m.percentile(95):.1f}",
+        "# HELP agent_bench_avg_cost_usd Average cost per query in USD.",
+        "# TYPE agent_bench_avg_cost_usd gauge",
+        f"agent_bench_avg_cost_usd {m.avg_cost:.6f}",
+        "",
+    ]
+    return Response(
+        content="\n".join(lines),
+        media_type="text/plain; version=0.0.4; charset=utf-8",
+    )

configs/selfhosted_local.yaml ADDED Viewed

	@@ -0,0 +1,58 @@

+agent:
+  max_iterations: 3
+  temperature: 0.0
+provider:
+  default: selfhosted
+  selfhosted:
+    # base_url left empty: falls back to MODAL_VLLM_URL env var,
+    # then to http://localhost:8001/v1 (default for local vLLM via docker-compose.vllm.yml).
+    # In Docker Compose, MODAL_VLLM_URL is set to http://vllm:8000/v1.
+    model_name: mistralai/Mistral-7B-Instruct-v0.3
+    timeout_seconds: 120
+  models:
+    mistralai/Mistral-7B-Instruct-v0.3:
+      input_cost_per_mtok: 0.0
+      output_cost_per_mtok: 0.0
+    gpt-4o-mini:
+      input_cost_per_mtok: 0.15
+      output_cost_per_mtok: 0.60
+rag:
+  chunking:
+    strategy: recursive
+    chunk_size: 512
+    chunk_overlap: 64
+  retrieval:
+    strategy: hybrid
+    rrf_k: 60
+    candidates_per_system: 10
+    top_k: 5
+  reranker:
+    enabled: true
+    model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
+    top_k: 5
+  refusal_threshold: 0.02
+  store_path: .cache/store
+embedding:
+  model: all-MiniLM-L6-v2
+  cache_dir: .cache/embeddings
+retry:
+  max_retries: 3
+  base_delay: 1.0
+  max_delay: 8.0
+memory:
+  enabled: false
+serving:
+  host: 0.0.0.0
+  port: 8000
+  request_timeout_seconds: 120
+  rate_limit_rpm: 10
+evaluation:
+  judge_provider: openai
+  golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json

configs/selfhosted_modal.yaml ADDED Viewed

	@@ -0,0 +1,56 @@

+agent:
+  max_iterations: 1
+  temperature: 0.0
+provider:
+  default: selfhosted
+  selfhosted:
+    # base_url and api_key read from MODAL_VLLM_URL / MODAL_AUTH_TOKEN env vars
+    model_name: mistralai/Mistral-7B-Instruct-v0.3
+    timeout_seconds: 300
+  models:
+    mistralai/Mistral-7B-Instruct-v0.3:
+      input_cost_per_mtok: 0.0
+      output_cost_per_mtok: 0.0
+    gpt-4o-mini:
+      input_cost_per_mtok: 0.15
+      output_cost_per_mtok: 0.60
+rag:
+  chunking:
+    strategy: recursive
+    chunk_size: 512
+    chunk_overlap: 64
+  retrieval:
+    strategy: hybrid
+    rrf_k: 60
+    candidates_per_system: 10
+    top_k: 3
+  reranker:
+    enabled: true
+    model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
+    top_k: 3
+  refusal_threshold: 0.02
+  store_path: .cache/store
+embedding:
+  model: all-MiniLM-L6-v2
+  cache_dir: .cache/embeddings
+retry:
+  max_retries: 3
+  base_delay: 1.0
+  max_delay: 8.0
+memory:
+  enabled: false
+serving:
+  host: 0.0.0.0
+  port: 8000
+  request_timeout_seconds: 120
+  rate_limit_rpm: 10
+evaluation:
+  judge_provider: openai
+  golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json

docker/docker-compose.vllm.yml ADDED Viewed

	@@ -0,0 +1,50 @@

+# Local GPU serving via vLLM + agent-bench API.
+# Requires: nvidia-container-toolkit
+# See modal/serve_vllm.py for serverless alternative.
+#
+# Usage:
+#   docker compose -f docker/docker-compose.vllm.yml up --build
+services:
+  vllm:
+    image: vllm/vllm-openai:latest
+    command:
+      - --model=mistralai/Mistral-7B-Instruct-v0.3
+      - --max-model-len=4096
+      - --dtype=half
+      - --gpu-memory-utilization=0.85
+      - --host=0.0.0.0
+      - --port=8000
+    ports:
+      - "8001:8000"
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    volumes:
+      - vllm-cache:/root/.cache/huggingface
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+      start_period: 120s
+  app:
+    build:
+      context: ..
+      dockerfile: docker/Dockerfile
+    environment:
+      - MODAL_VLLM_URL=http://vllm:8000/v1
+      - AGENT_BENCH_ENV=selfhosted_local
+    depends_on:
+      vllm:
+        condition: service_healthy
+    ports:
+      - "8080:7860"
+volumes:
+  vllm-cache:

docs/k8s-local-setup.md ADDED Viewed

	@@ -0,0 +1,40 @@

+# Kubernetes Local Setup (minikube)
+## Prerequisites
+- [minikube](https://minikube.sigs.k8s.io/docs/start/)
+- [Helm](https://helm.sh/docs/intro/install/)
+- Docker
+## Deploy
+```bash
+# Start minikube
+minikube start --cpus=4 --memory=8192
+# Build image inside minikube's Docker daemon
+eval $(minikube docker-env)
+docker build -t agent-bench:latest -f docker/Dockerfile .
+# Deploy with dev values
+helm install agent-bench k8s/helm/agent-bench/ \
+  -f k8s/helm/agent-bench/values-dev.yaml \
+  --set provider.selfhosted.modalEndpoint=$MODAL_VLLM_URL
+# Verify
+kubectl get pods
+kubectl port-forward svc/agent-bench 8080:8000
+# Test
+curl http://localhost:8080/health
+curl -X POST http://localhost:8080/ask \
+  -H "Content-Type: application/json" \
+  -d '{"question": "How do I define a path parameter in FastAPI?"}'
+```
+## Teardown
+```bash
+helm uninstall agent-bench
+minikube stop
+```

docs/provider_comparison.md CHANGED Viewed

@@ -1,64 +1,85 @@
-# Provider Comparison — OpenAI vs Anthropic
 Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files.
-Both providers use the same RAG pipeline: hybrid retrieval (FAISS + BM25 + RRF),
-cross-encoder reranking, grounded refusal threshold, and identical system prompt.
-**The only difference is the LLM provider.** Everything else is controlled.
-## Models
-| Provider | Model | Context | Pricing (input/output per 1M tokens) |
-|----------|-------|---------|--------------------------------------|
-| OpenAI | gpt-4o-mini | 128K | $0.15 / $0.60 |
-| Anthropic | claude-haiku-4-5 | 200K | $0.80 / $4.00 |
-## Retrieval Metrics
-| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Delta |
-|--------|-------------------|----------------------|-------|
-| Retrieval P@5 | 0.70 | **0.74** | +0.04 |
-| Retrieval R@5 | 0.83 | **0.84** | +0.01 |
-| Keyword Hit Rate | 0.89 | **0.92** | +0.03 |
-Haiku outperforms gpt-4o-mini on all retrieval metrics. The improvement
-in P@5 (0.70 → 0.74) suggests Haiku generates more precise search queries,
-which the cross-encoder reranker then amplifies.
-## Cost
-| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku |
-|--------|-------------------|----------------------|
-| Cost per query | **$0.0004** | $0.0007 |
-| Full eval (27 questions) | **~$0.01** | ~$0.02 |
-OpenAI is ~1.75x cheaper per query. Both are negligible for a demo.
-## Qualitative Observations
-- **Tool use**: Both providers correctly use the `search_documents` tool on retrieval
-  questions and the `calculator` tool on calculation questions.
-- **Refusal**: Both providers follow the system prompt instruction to refuse when the
-  search tool returns "No relevant documents found." The refusal threshold gate fires
-  identically since it operates on retrieval scores before the LLM is invoked.
-- **Citation format**: Both providers follow the `[source: filename.md]` citation format
-  specified in the system prompt.
-- **Answer quality**: Haiku tends to produce more structured answers (numbered lists,
-  code examples) while gpt-4o-mini is more concise. Both are accurate.
 ## How to Reproduce
 ```bash
-# OpenAI evaluation (default config)
 OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic
 # Anthropic evaluation
 ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic
 ```
 ## Takeaway
-The provider abstraction works as designed — switching from OpenAI to Anthropic is a
-single config change (`provider.default: anthropic`). The orchestrator, tools, evaluation
-harness, and serving layer are completely unchanged. Both providers produce competitive
-results on the same benchmark.

+# Provider Comparison: API vs Self-Hosted
 Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files.
+All providers use hybrid retrieval (FAISS + BM25 + RRF), cross-encoder reranking,
+grounded refusal threshold, and identical system prompt.
+**Note:** The self-hosted config differs from API configs in two ways to accommodate
+the 7B model's smaller context window (8192 tokens) and weaker instruction following:
+`max_iterations=1` (vs 3) and `top_k=3` (vs 5). This means the self-hosted row is
+**not a controlled comparison** — it reflects realistic operating constraints for a
+7B model, not an apples-to-apples provider swap. The API providers are directly
+comparable to each other.
+## Results
+| Provider | Model | Iterations | top_k | P@5 | R@5 | Citation Acc | Latency p50 (ms) | Cost/query |
+|----------|-------|-----------|-------|-----|-----|--------------|-------------------|------------|
+| OpenAI (API) | gpt-4o-mini | 3 | 5 | 0.70 | 0.83 | 1.00 | 4,690 | $0.0004 |
+| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
+| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
+## Analysis
+**Retrieval quality:** API models (gpt-4o-mini, claude-haiku) generate substantially better
+search queries than Mistral-7B, reflected in P@5 (0.70-0.74 vs 0.05). The 7B model struggles
+with prompt-based tool calling — it often produces malformed JSON or calls tools with
+poor queries, degrading retrieval quality.
+**Citation accuracy:** Both API providers achieve 1.00 citation accuracy (zero hallucinated
+citations). Mistral-7B manages 0.14, frequently omitting or fabricating source references.
+This is a known limitation of smaller models on instruction-following tasks.
+**Latency:** Self-hosted latency (6,709ms p50) is higher than API providers due to the
+proxy overhead and smaller model generating more tokens before reaching a final answer.
+Cold start adds ~90s on first request (model download + GPU load).
+**Cost:** Self-hosted cost ($0.0031/query) is computed from GPU-seconds
+(latency x Modal A10G rate of $0.000361/sec). This is higher per-query than API providers
+at low volume, but the cost model is fundamentally different — GPU cost scales with
+compute time, not token count.
+**Tool calling:** Mistral-7B does not support native OpenAI-format tool calling in vLLM
+0.6.6. The provider falls back to prompt-based tool selection (injecting tool descriptions
+into the system prompt and parsing JSON from the model's text output). This works but is
+unreliable — a legitimate benchmark finding, not a failure.
+## Infrastructure
+| Config | Cold start | Warm latency p50 | GPU | Infra |
+|--------|-----------|-------------------|-----|-------|
+| OpenAI | N/A | 4,690 ms | N/A | Managed API |
+| Anthropic | N/A | 5,120 ms | N/A | Managed API |
+| Self-hosted (Modal) | ~90s | 6,709 ms | A10G (24GB) | Serverless GPU |
 ## How to Reproduce
 ```bash
+# OpenAI evaluation
 OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic
 # Anthropic evaluation
 ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic
+# Self-hosted evaluation (requires Modal deployment + HF secret)
+pip install -e ".[modal]"
+modal secret create huggingface-secret HF_TOKEN=hf_...
+modal deploy modal/serve_vllm.py
+export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
+python scripts/evaluate.py --config configs/selfhosted_modal.yaml --mode deterministic
+# All providers at once
+make benchmark-all
 ```
 ## Takeaway
+The provider abstraction works as designed — switching providers is a single config change.
+API models dominate on quality metrics, but the self-hosted path demonstrates end-to-end
+inference serving: vLLM on Modal (serverless A10G), OpenAI-compatible endpoint, identical
+evaluation harness. The quality gap is expected for a 7B model on RAG tasks and would
+narrow with larger self-hosted models (e.g., Mixtral-8x7B, Llama-3-70B).
+---
+Generated by `modal/run_benchmark.py`

k8s/helm/agent-bench/Chart.yaml ADDED Viewed

	@@ -0,0 +1,6 @@

+apiVersion: v2
+name: agent-bench
+description: Agentic RAG system with self-hosted LLM support
+type: application
+version: 0.1.0
+appVersion: "0.1.0"

k8s/helm/agent-bench/templates/_helpers.tpl ADDED Viewed

	@@ -0,0 +1,35 @@

+{{/*
+Expand the name of the chart.
+*/}}
+{{- define "agent-bench.name" -}}
+{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
+{{- end }}
+{{/*
+Create a default fully qualified app name.
+*/}}
+{{- define "agent-bench.fullname" -}}
+{{- $name := default .Chart.Name .Values.nameOverride }}
+{{- if .Values.fullnameOverride }}
+{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
+{{- else }}
+{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
+{{- end }}
+{{- end }}
+{{/*
+Common labels
+*/}}
+{{- define "agent-bench.labels" -}}
+helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}
+{{ include "agent-bench.selectorLabels" . }}
+app.kubernetes.io/managed-by: {{ .Release.Service }}
+{{- end }}
+{{/*
+Selector labels
+*/}}
+{{- define "agent-bench.selectorLabels" -}}
+app.kubernetes.io/name: {{ include "agent-bench.name" . }}
+app.kubernetes.io/instance: {{ .Release.Name }}
+{{- end }}

k8s/helm/agent-bench/templates/configmap.yaml ADDED Viewed

	@@ -0,0 +1,15 @@

+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "agent-bench.fullname" . }}-config
+  labels:
+    {{- include "agent-bench.labels" . | nindent 4 }}
+data:
+  {{- if eq .Values.provider.type "selfhosted" }}
+  AGENT_BENCH_ENV: "selfhosted_modal"
+  SELFHOSTED_MODEL: {{ .Values.provider.selfhosted.model | quote }}
+  {{- else if eq .Values.provider.type "openai" }}
+  AGENT_BENCH_ENV: "default"
+  {{- else if eq .Values.provider.type "anthropic" }}
+  AGENT_BENCH_ENV: "anthropic"
+  {{- end }}

k8s/helm/agent-bench/templates/deployment.yaml ADDED Viewed

	@@ -0,0 +1,45 @@

+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: {{ include "agent-bench.fullname" . }}
+  labels:
+    {{- include "agent-bench.labels" . | nindent 4 }}
+spec:
+  {{- if not .Values.autoscaling.enabled }}
+  replicas: {{ .Values.replicaCount }}
+  {{- end }}
+  selector:
+    matchLabels:
+      {{- include "agent-bench.selectorLabels" . | nindent 6 }}
+  template:
+    metadata:
+      labels:
+        {{- include "agent-bench.selectorLabels" . | nindent 8 }}
+    spec:
+      containers:
+        - name: api
+          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
+          imagePullPolicy: {{ .Values.image.pullPolicy }}
+          ports:
+            - name: http
+              containerPort: 7860
+              protocol: TCP
+          envFrom:
+            - configMapRef:
+                name: {{ include "agent-bench.fullname" . }}-config
+            - secretRef:
+                name: {{ include "agent-bench.fullname" . }}-secrets
+          livenessProbe:
+            httpGet:
+              path: {{ .Values.probes.liveness.path }}
+              port: 7860
+            initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
+            periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
+          readinessProbe:
+            httpGet:
+              path: {{ .Values.probes.readiness.path }}
+              port: 7860
+            initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
+            periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
+          resources:
+            {{- toYaml .Values.resources | nindent 12 }}

k8s/helm/agent-bench/templates/hpa.yaml ADDED Viewed

	@@ -0,0 +1,22 @@

+{{- if .Values.autoscaling.enabled }}
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: {{ include "agent-bench.fullname" . }}
+  labels:
+    {{- include "agent-bench.labels" . | nindent 4 }}
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: {{ include "agent-bench.fullname" . }}
+  minReplicas: {{ .Values.autoscaling.minReplicas }}
+  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: {{ .Values.autoscaling.targetCPUUtilization }}
+{{- end }}

k8s/helm/agent-bench/templates/secret.yaml ADDED Viewed

	@@ -0,0 +1,12 @@

+apiVersion: v1
+kind: Secret
+metadata:
+  name: {{ include "agent-bench.fullname" . }}-secrets
+  labels:
+    {{- include "agent-bench.labels" . | nindent 4 }}
+type: Opaque
+stringData:
+  MODAL_VLLM_URL: {{ .Values.provider.selfhosted.modalEndpoint | quote }}
+  MODAL_AUTH_TOKEN: {{ .Values.provider.selfhosted.modalAuthToken | quote }}
+  OPENAI_API_KEY: {{ .Values.provider.openaiApiKey | quote }}
+  ANTHROPIC_API_KEY: {{ .Values.provider.anthropicApiKey | quote }}

k8s/helm/agent-bench/templates/service.yaml ADDED Viewed

	@@ -0,0 +1,15 @@

+apiVersion: v1
+kind: Service
+metadata:
+  name: {{ include "agent-bench.fullname" . }}
+  labels:
+    {{- include "agent-bench.labels" . | nindent 4 }}
+spec:
+  type: {{ .Values.service.type }}
+  ports:
+    - port: {{ .Values.service.port }}
+      targetPort: 7860
+      protocol: TCP
+      name: http
+  selector:
+    {{- include "agent-bench.selectorLabels" . | nindent 4 }}

k8s/helm/agent-bench/values-dev.yaml ADDED Viewed

	@@ -0,0 +1,12 @@

+replicaCount: 1
+autoscaling:
+  enabled: false
+resources:
+  requests:
+    cpu: 250m
+    memory: 512Mi
+  limits:
+    cpu: 1000m
+    memory: 2Gi

k8s/helm/agent-bench/values-prod.yaml ADDED Viewed

	@@ -0,0 +1,15 @@

+replicaCount: 3
+autoscaling:
+  enabled: true
+  minReplicas: 2
+  maxReplicas: 8
+  targetCPUUtilization: 70
+resources:
+  requests:
+    cpu: 500m
+    memory: 1Gi
+  limits:
+    cpu: 2000m
+    memory: 4Gi

k8s/helm/agent-bench/values.yaml ADDED Viewed

	@@ -0,0 +1,43 @@

+replicaCount: 2
+image:
+  repository: agent-bench
+  tag: latest
+  pullPolicy: IfNotPresent
+service:
+  type: ClusterIP
+  port: 8000
+provider:
+  type: selfhosted
+  selfhosted:
+    model: mistralai/Mistral-7B-Instruct-v0.3
+    modalEndpoint: ""
+    modalAuthToken: ""
+  openaiApiKey: ""
+  anthropicApiKey: ""
+autoscaling:
+  enabled: true
+  minReplicas: 2
+  maxReplicas: 8
+  targetCPUUtilization: 70
+resources:
+  requests:
+    cpu: 500m
+    memory: 1Gi
+  limits:
+    cpu: 2000m
+    memory: 4Gi
+probes:
+  liveness:
+    path: /health
+    initialDelaySeconds: 10
+    periodSeconds: 30
+  readiness:
+    path: /health
+    initialDelaySeconds: 5
+    periodSeconds: 10

modal/common.py ADDED Viewed

	@@ -0,0 +1,11 @@

+"""Shared constants for Modal deployments."""
+MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
+GPU_TYPE = "a10g"
+VLLM_MAX_MODEL_LEN = 8192
+VLLM_DTYPE = "half"
+VLLM_GPU_MEMORY_UTILIZATION = 0.85
+# Cost tracking (for provider comparison report)
+# Modal A10G: ~$0.000361/sec (~$1.30/hr)
+MODAL_A10G_COST_PER_SEC = 0.000361

modal/run_benchmark.py ADDED Viewed

	@@ -0,0 +1,182 @@

+"""Run the 27-question benchmark against all provider configurations.
+Usage:
+    # Run against a deployed Modal endpoint
+    python modal/run_benchmark.py --base-url https://...modal.run/v1
+    # Optionally restrict to specific providers
+    python modal/run_benchmark.py --base-url https://...modal.run/v1 --only selfhosted_modal
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import statistics
+import subprocess
+import sys
+from pathlib import Path
+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+def run_eval(config_path: str, env: dict[str, str]) -> list[dict] | None:
+    """Run scripts/evaluate.py and return the list of EvalResult dicts."""
+    output_path = f".cache/eval_{Path(config_path).stem}.json"
+    result = subprocess.run(
+        [
+            sys.executable,
+            "scripts/evaluate.py",
+            "--config",
+            config_path,
+            "--mode",
+            "deterministic",
+            "--output",
+            output_path,
+        ],
+        capture_output=True,
+        text=True,
+        env=env,
+        cwd=str(PROJECT_ROOT),
+    )
+    if result.returncode != 0:
+        print(f"FAILED: {config_path}\n{result.stderr}", file=sys.stderr)
+        return None
+    output_file = PROJECT_ROOT / output_path
+    if not output_file.exists():
+        print(f"FAILED: output not created: {output_path}", file=sys.stderr)
+        return None
+    with open(output_file) as f:
+        data = json.load(f)
+    if not isinstance(data, list):
+        print(f"FAILED: expected list, got {type(data).__name__}", file=sys.stderr)
+        return None
+    return data
+def aggregate(results: list[dict], provider_name: str = "") -> dict:
+    """Compute aggregate metrics from a list of EvalResult dicts.
+    For selfhosted providers, cost is computed from GPU-seconds (latency *
+    MODAL_A10G_COST_PER_SEC) rather than token pricing, which is zero.
+    """
+    from common import MODAL_A10G_COST_PER_SEC
+    positive = [r for r in results if r.get("category") != "out_of_scope"]
+    if not positive:
+        return {}
+    # For self-hosted, derive cost from GPU time; for API providers, use token cost
+    is_selfhosted = "selfhosted" in provider_name
+    if is_selfhosted:
+        avg_cost = statistics.mean(
+            (r["latency_ms"] / 1000.0) * MODAL_A10G_COST_PER_SEC
+            for r in positive
+        )
+    else:
+        avg_cost = statistics.mean(
+            r.get("tokens_used", {}).get("estimated_cost_usd", 0.0)
+            for r in positive
+        )
+    return {
+        "retrieval_precision": statistics.mean(
+            r["retrieval_precision"] for r in positive
+        ),
+        "retrieval_recall": statistics.mean(
+            r["retrieval_recall"] for r in positive
+        ),
+        "citation_accuracy": statistics.mean(
+            r["citation_accuracy"] for r in positive
+        ),
+        "latency_p50_ms": statistics.median(
+            r["latency_ms"] for r in positive
+        ),
+        "avg_cost_usd": avg_cost,
+    }
+def generate_report(
+    all_results: dict[str, list[dict] | None], output_path: str
+) -> None:
+    """Generate docs/provider_comparison.md from benchmark results."""
+    lines = [
+        "# Provider Comparison: API vs Self-Hosted",
+        "",
+        "Benchmark: 27-question golden dataset "
+        "(19 retrieval, 3 calculation, 5 out-of-scope).",
+        "",
+        "| Provider | P@5 | R@5 | Citation Acc | Latency p50 (ms) | Cost/query |",
+        "|----------|-----|-----|--------------|-------------------|------------|",
+    ]
+    for name, results in all_results.items():
+        if results is None:
+            lines.append(f"| {name} | ERROR | - | - | - | - |")
+            continue
+        agg = aggregate(results, provider_name=name)
+        if not agg:
+            lines.append(f"| {name} | NO DATA | - | - | - | - |")
+            continue
+        lines.append(
+            f"| {name} "
+            f"| {agg['retrieval_precision']:.2f} "
+            f"| {agg['retrieval_recall']:.2f} "
+            f"| {agg['citation_accuracy']:.2f} "
+            f"| {agg['latency_p50_ms']:.0f} "
+            f"| ${agg['avg_cost_usd']:.4f} |"
+        )
+    lines.extend(["", "---", "", "Generated by `modal/run_benchmark.py`"])
+    out = PROJECT_ROOT / output_path
+    out.parent.mkdir(parents=True, exist_ok=True)
+    out.write_text("\n".join(lines))
+    print(f"Report written to {output_path}")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run provider comparison benchmark")
+    parser.add_argument(
+        "--base-url",
+        help="Modal vLLM endpoint URL (required when running selfhosted_modal)",
+    )
+    parser.add_argument(
+        "--only",
+        help="Run only this provider (e.g., selfhosted_modal, openai, anthropic)",
+    )
+    args = parser.parse_args()
+    configs = [
+        ("openai", "configs/default.yaml"),
+        ("anthropic", "configs/anthropic.yaml"),
+        ("selfhosted_modal", "configs/selfhosted_modal.yaml"),
+    ]
+    if args.only:
+        configs = [(n, p) for n, p in configs if n == args.only]
+        if not configs:
+            parser.error(f"Unknown provider: {args.only}")
+    needs_base_url = any(n == "selfhosted_modal" for n, _ in configs)
+    if needs_base_url and not args.base_url:
+        parser.error("--base-url is required when running selfhosted_modal")
+    all_results: dict[str, list[dict] | None] = {}
+    for name, config_path in configs:
+        print(f"\n--- Running: {name} ({config_path}) ---")
+        env = os.environ.copy()
+        if name == "selfhosted_modal" and args.base_url:
+            env["MODAL_VLLM_URL"] = args.base_url
+        results = run_eval(config_path, env)
+        if results is None:
+            print(f"\nABORTING: {name} failed, stopping benchmark run.",
+                  file=sys.stderr)
+            sys.exit(1)
+        all_results[name] = results
+    generate_report(all_results, "docs/provider_comparison.md")
+if __name__ == "__main__":
+    main()

modal/serve_vllm.py ADDED Viewed

	@@ -0,0 +1,187 @@

+"""Deploy vLLM on Modal as an OpenAI-compatible endpoint.
+Usage:
+    modal deploy modal/serve_vllm.py     # Deploy (stays running, prints URL)
+    modal serve modal/serve_vllm.py      # Dev mode (auto-redeploys on change)
+The printed URL is the MODAL_VLLM_URL for SelfHostedProvider:
+    export MODAL_VLLM_URL=https://<your-workspace>--agent-bench-vllm-serve.modal.run/v1
+Note: The vLLM server integration pattern changes between vLLM releases.
+      If deployment fails, check Modal's vLLM example for the current API:
+      https://modal.com/docs/examples/vllm_inference
+"""
+import modal
+# Inlined from common.py — Modal containers don't auto-include sibling modules
+MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
+VLLM_MAX_MODEL_LEN = 8192
+VLLM_DTYPE = "half"
+VLLM_GPU_MEMORY_UTILIZATION = 0.85
+MODELS_DIR = "/models"
+VLLM_PORT = 8000
+VLLM_READY_TIMEOUT = 600  # seconds to wait for vLLM to become ready (download + load)
+vllm_image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .pip_install(
+        "vllm==0.6.6.post1",
+        "transformers==4.47.0",
+        "huggingface_hub[hf_transfer]<1.0",
+        "httpx",
+    )
+    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
+)
+app = modal.App("agent-bench-vllm")
+model_volume = modal.Volume.from_name("vllm-model-cache", create_if_missing=True)
+@app.function(
+    image=vllm_image,
+    gpu="a10g",
+    scaledown_window=600,
+    timeout=900,
+    volumes={MODELS_DIR: model_volume},
+    secrets=[modal.Secret.from_name("huggingface-secret")],
+)
+@modal.asgi_app()
+def serve():
+    """Serve vLLM with OpenAI-compatible API.
+    Exposes /v1/chat/completions and /health.
+    Waits for the vLLM subprocess to be ready before accepting requests.
+    """
+    import subprocess
+    import time
+    import httpx
+    from fastapi import FastAPI, Request
+    from fastapi.responses import JSONResponse, Response, StreamingResponse
+    vllm_process = subprocess.Popen(
+        [
+            "python", "-m", "vllm.entrypoints.openai.api_server",
+            "--model", MODEL_NAME,
+            "--download-dir", MODELS_DIR,
+            "--dtype", VLLM_DTYPE,
+            "--max-model-len", str(VLLM_MAX_MODEL_LEN),
+            "--gpu-memory-utilization", str(VLLM_GPU_MEMORY_UTILIZATION),
+            "--host", "0.0.0.0",
+            "--port", str(VLLM_PORT),
+        ],
+    )
+    # Wait for vLLM to be ready before accepting proxied requests
+    base = f"http://localhost:{VLLM_PORT}"
+    deadline = time.monotonic() + VLLM_READY_TIMEOUT
+    while time.monotonic() < deadline:
+        try:
+            r = httpx.get(f"{base}/health", timeout=2.0)
+            if r.status_code == 200:
+                break
+        except httpx.HTTPError:
+            pass
+        if vllm_process.poll() is not None:
+            raise RuntimeError(
+                f"vLLM process exited with code {vllm_process.returncode}"
+            )
+        time.sleep(2)
+    else:
+        vllm_process.terminate()
+        raise TimeoutError(
+            f"vLLM did not become ready within {VLLM_READY_TIMEOUT}s"
+        )
+    proxy_app = FastAPI()
+    client = httpx.AsyncClient(base_url=base, timeout=120.0)
+    @proxy_app.api_route("/{path:path}", methods=["GET", "POST"])
+    async def proxy(path: str, request: Request):
+        """Proxy all requests to the vLLM subprocess."""
+        import traceback as _tb
+        try:
+            return await _proxy_inner(path, request)
+        except Exception as exc:
+            _tb.print_exc()
+            return JSONResponse(
+                content={"error": str(exc), "type": type(exc).__name__},
+                status_code=502,
+            )
+    async def _proxy_inner(path: str, request: Request):
+        url = f"/{path}"
+        body = await request.body()
+        headers = {
+            k: v for k, v in request.headers.items()
+            if k.lower() not in ("host", "content-length")
+        }
+        # Detect streaming: check body for "stream": true (httpx sends
+        # Accept: */*, not text/event-stream, so header check is unreliable)
+        is_streaming = False
+        if body:
+            try:
+                import json as _json
+                is_streaming = _json.loads(body).get("stream", False)
+            except (ValueError, AttributeError):
+                pass
+        if not is_streaming:
+            is_streaming = request.headers.get("accept") == "text/event-stream"
+        if is_streaming:
+            req = client.build_request(
+                request.method, url, content=body, headers=headers
+            )
+            upstream = await client.send(req, stream=True)
+            if upstream.status_code != 200:
+                error_body = await upstream.aread()
+                await upstream.aclose()
+                return Response(
+                    content=error_body,
+                    status_code=upstream.status_code,
+                    media_type="application/json",
+                )
+            async def stream():
+                try:
+                    async for chunk in upstream.aiter_bytes():
+                        yield chunk
+                finally:
+                    await upstream.aclose()
+            return StreamingResponse(
+                stream(),
+                status_code=upstream.status_code,
+                media_type="text/event-stream",
+            )
+        resp = await client.request(
+            request.method, url, content=body, headers=headers
+        )
+        # Not all endpoints return JSON (e.g. /health returns empty 200)
+        try:
+            content = resp.json()
+        except Exception:
+            return Response(
+                content=resp.content,
+                status_code=resp.status_code,
+                media_type=resp.headers.get("content-type", "text/plain"),
+            )
+        return JSONResponse(
+            content=content,
+            status_code=resp.status_code,
+            headers={
+                k: v for k, v in resp.headers.items()
+                if k.lower() not in ("content-length", "transfer-encoding")
+            },
+        )
+    @proxy_app.on_event("shutdown")
+    def shutdown():
+        vllm_process.terminate()
+    return proxy_app

pyproject.toml CHANGED Viewed

@@ -32,6 +32,9 @@ dev = [
     "respx>=0.21.0",
     "types-PyYAML",
 ]
 [tool.setuptools.packages.find]
 include = ["agent_bench*"]

     "respx>=0.21.0",
     "types-PyYAML",
 ]
+modal = [
+    "modal>=0.66.0",
+]
 [tool.setuptools.packages.find]
 include = ["agent_bench*"]

terraform/main.tf ADDED Viewed

	@@ -0,0 +1,32 @@

+terraform {
+  required_version = ">= 1.5"
+  required_providers {
+    google = {
+      source  = "hashicorp/google"
+      version = "~> 5.0"
+    }
+  }
+}
+provider "google" {
+  project = var.project_id
+  region  = var.region
+}
+module "networking" {
+  source       = "./modules/networking"
+  project_id   = var.project_id
+  region       = var.region
+  cluster_name = var.cluster_name
+}
+module "gke" {
+  source           = "./modules/gke"
+  project_id       = var.project_id
+  region           = var.region
+  cluster_name     = var.cluster_name
+  network          = module.networking.network_name
+  subnetwork       = module.networking.subnetwork_name
+  cpu_node_count   = 2
+  cpu_machine_type = "e2-standard-4"
+}

terraform/modules/gke/main.tf ADDED Viewed

	@@ -0,0 +1,38 @@

+resource "google_container_cluster" "primary" {
+  name     = var.cluster_name
+  location = var.region
+  project  = var.project_id
+  network    = var.network
+  subnetwork = var.subnetwork
+  # Autopilot disabled — we manage node pools explicitly
+  enable_autopilot = false
+  # Remove default node pool (we create our own)
+  remove_default_node_pool = true
+  initial_node_count       = 1
+  ip_allocation_policy {
+    cluster_secondary_range_name  = "pods"
+    services_secondary_range_name = "services"
+  }
+}
+resource "google_container_node_pool" "cpu_pool" {
+  name       = "${var.cluster_name}-cpu-pool"
+  location   = var.region
+  cluster    = google_container_cluster.primary.name
+  node_count = var.cpu_node_count
+  project    = var.project_id
+  node_config {
+    machine_type = var.cpu_machine_type
+    disk_size_gb = 50
+    disk_type    = "pd-standard"
+    oauth_scopes = [
+      "https://www.googleapis.com/auth/cloud-platform",
+    ]
+  }
+}

terraform/modules/gke/outputs.tf ADDED Viewed

	@@ -0,0 +1,8 @@

+output "cluster_name" {
+  value = google_container_cluster.primary.name
+}
+output "cluster_endpoint" {
+  value     = google_container_cluster.primary.endpoint
+  sensitive = true
+}

terraform/modules/gke/variables.tf ADDED Viewed

	@@ -0,0 +1,29 @@

+variable "project_id" {
+  type = string
+}
+variable "region" {
+  type = string
+}
+variable "cluster_name" {
+  type = string
+}
+variable "network" {
+  type = string
+}
+variable "subnetwork" {
+  type = string
+}
+variable "cpu_node_count" {
+  type    = number
+  default = 2
+}
+variable "cpu_machine_type" {
+  type    = string
+  default = "e2-standard-4"
+}

terraform/modules/networking/main.tf ADDED Viewed

	@@ -0,0 +1,67 @@

+resource "google_compute_network" "vpc" {
+  name                    = "${var.cluster_name}-vpc"
+  auto_create_subnetworks = false
+  project                 = var.project_id
+}
+resource "google_compute_subnetwork" "subnet" {
+  name          = "${var.cluster_name}-subnet"
+  ip_cidr_range = "10.0.0.0/24"
+  region        = var.region
+  network       = google_compute_network.vpc.id
+  project       = var.project_id
+  secondary_ip_range {
+    range_name    = "pods"
+    ip_cidr_range = "10.1.0.0/16"
+  }
+  secondary_ip_range {
+    range_name    = "services"
+    ip_cidr_range = "10.2.0.0/20"
+  }
+}
+resource "google_compute_firewall" "allow_internal" {
+  name    = "${var.cluster_name}-allow-internal"
+  network = google_compute_network.vpc.name
+  project = var.project_id
+  allow {
+    protocol = "tcp"
+    ports    = ["0-65535"]
+  }
+  allow {
+    protocol = "udp"
+    ports    = ["0-65535"]
+  }
+  allow {
+    protocol = "icmp"
+  }
+  source_ranges = ["10.0.0.0/8"]
+}
+resource "google_compute_firewall" "allow_health_checks" {
+  name    = "${var.cluster_name}-allow-health-checks"
+  network = google_compute_network.vpc.name
+  project = var.project_id
+  allow {
+    protocol = "tcp"
+    ports    = ["80", "443", "8000", "7860"]
+  }
+  # GCP health check IP ranges
+  source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]
+}
+output "network_name" {
+  value = google_compute_network.vpc.name
+}
+output "subnetwork_name" {
+  value = google_compute_subnetwork.subnet.name
+}

terraform/modules/networking/variables.tf ADDED Viewed

	@@ -0,0 +1,11 @@

+variable "project_id" {
+  type = string
+}
+variable "region" {
+  type = string
+}
+variable "cluster_name" {
+  type = string
+}

terraform/outputs.tf ADDED Viewed

	@@ -0,0 +1,15 @@

+output "cluster_name" {
+  description = "GKE cluster name"
+  value       = module.gke.cluster_name
+}
+output "cluster_endpoint" {
+  description = "GKE cluster endpoint"
+  value       = module.gke.cluster_endpoint
+  sensitive   = true
+}
+output "kubeconfig_command" {
+  description = "Command to configure kubectl"
+  value       = "gcloud container clusters get-credentials ${var.cluster_name} --region ${var.region} --project ${var.project_id}"
+}

terraform/terraform.tfvars.example ADDED Viewed

	@@ -0,0 +1,6 @@

+# Copy to terraform.tfvars and fill in values.
+# terraform.tfvars is gitignored.
+project_id   = "your-gcp-project-id"
+region       = "europe-west1"
+cluster_name = "agent-bench-cluster"

terraform/variables.tf ADDED Viewed

	@@ -0,0 +1,16 @@

+variable "project_id" {
+  description = "GCP project ID"
+  type        = string
+}
+variable "region" {
+  description = "GCP region for the cluster"
+  type        = string
+  default     = "europe-west1"
+}
+variable "cluster_name" {
+  description = "GKE cluster name"
+  type        = string
+  default     = "agent-bench-cluster"
+}

tests/test_selfhosted_provider.py ADDED Viewed

	@@ -0,0 +1,689 @@

+"""Tests for the SelfHostedProvider (OpenAI-compatible endpoint)."""
+import json
+import httpx
+import pytest
+import respx
+from agent_bench.core.config import (
+    AppConfig,
+    ProviderConfig,
+    RetryConfig,
+    SelfHostedConfig,
+)
+from agent_bench.core.provider import (
+    ProviderRateLimitError,
+    ProviderTimeoutError,
+    SelfHostedProvider,
+    create_provider,
+)
+from agent_bench.core.types import Message, Role, ToolDefinition
+# --- Helpers ---
+FAKE_URL = "http://fake-vllm:8000/v1"
+SEARCH_TOOL = ToolDefinition(
+    name="search_documents",
+    description="Search docs",
+    parameters={"type": "object", "properties": {"query": {"type": "string"}}},
+)
+def _ok_response(content="ok", tool_calls=None, prompt_tokens=10, completion_tokens=5):
+    """Build a minimal OpenAI-format chat completion response."""
+    message: dict = {"role": "assistant", "content": content}
+    if tool_calls:
+        message["tool_calls"] = tool_calls
+        message["content"] = None
+    return {
+        "id": "chatcmpl-test",
+        "object": "chat.completion",
+        "model": "mistralai/Mistral-7B-Instruct-v0.3",
+        "choices": [{"index": 0, "message": message, "finish_reason": "stop"}],
+        "usage": {
+            "prompt_tokens": prompt_tokens,
+            "completion_tokens": completion_tokens,
+            "total_tokens": prompt_tokens + completion_tokens,
+        },
+    }
+def _probe_response_with_tool_calls():
+    """Response to the tool-calling detection probe — model uses tools."""
+    return _ok_response(
+        tool_calls=[
+            {
+                "id": "call_probe",
+                "type": "function",
+                "function": {
+                    "name": "test_probe",
+                    "arguments": json.dumps({"x": "hello"}),
+                },
+            }
+        ],
+    )
+def _probe_response_without_tool_calls():
+    """Response to the tool-calling detection probe — model ignores tools."""
+    return _ok_response(content="I cannot use tools.")
+# --- Factory ---
+class TestSelfHostedFactory:
+    def test_factory_creates_selfhosted_provider(self, monkeypatch):
+        """Factory returns SelfHostedProvider for 'selfhosted' config."""
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = create_provider(config)
+        assert isinstance(provider, SelfHostedProvider)
+    def test_factory_raises_for_unknown_provider(self):
+        config = AppConfig(provider=ProviderConfig(default="nonexistent"))
+        with pytest.raises(ValueError, match="Unknown provider"):
+            create_provider(config)
+# --- Config-based settings ---
+class TestSelfHostedConfig:
+    def test_reads_base_url_from_config(self, monkeypatch):
+        """Config selfhosted.base_url takes precedence over env var."""
+        monkeypatch.setenv("MODAL_VLLM_URL", "http://env-url:8000/v1")
+        config = AppConfig(
+            provider=ProviderConfig(
+                default="selfhosted",
+                selfhosted=SelfHostedConfig(base_url="http://config-url:8000/v1"),
+            )
+        )
+        provider = SelfHostedProvider(config)
+        assert provider.base_url == "http://config-url:8000/v1"
+    def test_falls_back_to_env_when_config_empty(self, monkeypatch):
+        """Empty config falls back to MODAL_VLLM_URL env var."""
+        monkeypatch.setenv("MODAL_VLLM_URL", "http://env-url:8000/v1")
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        assert provider.base_url == "http://env-url:8000/v1"
+    def test_reads_api_key_from_config(self, monkeypatch):
+        monkeypatch.delenv("MODAL_AUTH_TOKEN", raising=False)
+        config = AppConfig(
+            provider=ProviderConfig(
+                default="selfhosted",
+                selfhosted=SelfHostedConfig(
+                    base_url=FAKE_URL, api_key="config-key-123"
+                ),
+            )
+        )
+        provider = SelfHostedProvider(config)
+        assert provider.client.headers.get("authorization") == "Bearer config-key-123"
+    def test_timeout_from_config(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(
+            provider=ProviderConfig(
+                default="selfhosted",
+                selfhosted=SelfHostedConfig(timeout_seconds=42.0),
+            )
+        )
+        provider = SelfHostedProvider(config)
+        assert provider.client.timeout.read == 42.0
+    def test_config_yaml_selfhosted_block_not_dropped(self):
+        """Pydantic accepts provider.selfhosted fields (regression for issue #3)."""
+        raw = {
+            "provider": {
+                "default": "selfhosted",
+                "selfhosted": {
+                    "base_url": "http://yaml-url:8000/v1",
+                    "model_name": "meta-llama/Llama-3-8B",
+                    "api_key": "yaml-key",
+                    "timeout_seconds": 60.0,
+                },
+            }
+        }
+        config = AppConfig.model_validate(raw)
+        assert config.provider.selfhosted.base_url == "http://yaml-url:8000/v1"
+        assert config.provider.selfhosted.model_name == "meta-llama/Llama-3-8B"
+        assert config.provider.selfhosted.api_key == "yaml-key"
+        assert config.provider.selfhosted.timeout_seconds == 60.0
+    def test_loads_selfhosted_local_yaml_from_disk(self):
+        """selfhosted_local.yaml loads from disk with correct selfhosted settings."""
+        from pathlib import Path
+        from agent_bench.core.config import load_config
+        yaml_path = Path(__file__).resolve().parent.parent / "configs" / "selfhosted_local.yaml"
+        config = load_config(yaml_path)
+        assert config.provider.default == "selfhosted"
+        assert config.provider.selfhosted.base_url == ""  # env var fallback
+        assert config.provider.selfhosted.model_name == "mistralai/Mistral-7B-Instruct-v0.3"
+    def test_loads_selfhosted_modal_yaml_from_disk(self):
+        """selfhosted_modal.yaml loads from disk; base_url empty (env var fallback)."""
+        from pathlib import Path
+        from agent_bench.core.config import load_config
+        yaml_path = Path(__file__).resolve().parent.parent / "configs" / "selfhosted_modal.yaml"
+        config = load_config(yaml_path)
+        assert config.provider.default == "selfhosted"
+        assert config.provider.selfhosted.base_url == ""  # falls back to MODAL_VLLM_URL
+    def test_default_fallback_port_does_not_collide_with_app(self, monkeypatch):
+        """Default vLLM fallback URL must NOT use port 8000 (app's serving port)."""
+        monkeypatch.delenv("MODAL_VLLM_URL", raising=False)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        assert ":8000" not in provider.base_url
+# --- complete() ---
+class TestSelfHostedComplete:
+    @pytest.fixture
+    def provider(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        return SelfHostedProvider(config)
+    @pytest.mark.asyncio
+    async def test_complete_parses_response(self, provider):
+        """SelfHostedProvider.complete() parses OpenAI-format response."""
+        mock_response = _ok_response(
+            content="Path params use curly braces. [source: fastapi.md]",
+            prompt_tokens=80,
+            completion_tokens=20,
+        )
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="How do path params work?")]
+            )
+        assert response.content == "Path params use curly braces. [source: fastapi.md]"
+        assert response.tool_calls == []
+        assert response.provider == "selfhosted"
+        assert response.model == "mistralai/Mistral-7B-Instruct-v0.3"
+        assert response.usage.input_tokens == 80
+        assert response.usage.output_tokens == 20
+        assert response.latency_ms > 0
+    @pytest.mark.asyncio
+    async def test_complete_parses_tool_calls(self, provider):
+        """SelfHostedProvider.complete() parses native tool_calls."""
+        # Pre-set tool support to skip detection probe
+        provider._supports_tool_calling = True
+        tool_response = _ok_response(
+            tool_calls=[
+                {
+                    "id": "call_abc",
+                    "type": "function",
+                    "function": {
+                        "name": "search_documents",
+                        "arguments": json.dumps({"query": "path params"}),
+                    },
+                }
+            ],
+            prompt_tokens=60,
+            completion_tokens=15,
+        )
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=tool_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="search for path params")],
+                tools=[SEARCH_TOOL],
+            )
+        assert len(response.tool_calls) == 1
+        assert response.tool_calls[0].id == "call_abc"
+        assert response.tool_calls[0].name == "search_documents"
+        assert response.tool_calls[0].arguments == {"query": "path params"}
+    @pytest.mark.asyncio
+    async def test_complete_handles_malformed_tool_args(self, provider):
+        """Malformed JSON in tool arguments falls back to empty dict."""
+        provider._supports_tool_calling = True
+        mock_response = _ok_response(
+            tool_calls=[
+                {
+                    "id": "call_bad",
+                    "type": "function",
+                    "function": {
+                        "name": "search_documents",
+                        "arguments": "not valid json{{{",
+                    },
+                }
+            ],
+        )
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="test")]
+            )
+        assert len(response.tool_calls) == 1
+        assert response.tool_calls[0].arguments == {}
+# --- Tool-calling detection ---
+class TestSelfHostedToolDetection:
+    @pytest.fixture
+    def provider(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        return SelfHostedProvider(config)
+    @pytest.mark.asyncio
+    async def test_detect_tool_calling_supported(self, provider):
+        """Detection probe returns True when model responds with tool_calls."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(
+                    200, json=_probe_response_with_tool_calls()
+                )
+            )
+            result = await provider._detect_tool_calling()
+        assert result is True
+    @pytest.mark.asyncio
+    async def test_detect_tool_calling_unsupported_400(self, provider):
+        """Detection probe returns False on 400 (endpoint rejects tools)."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(
+                    400, json={"error": "tools not supported"}
+                )
+            )
+            result = await provider._detect_tool_calling()
+        assert result is False
+    @pytest.mark.asyncio
+    async def test_detect_tool_calling_unsupported_no_tool_calls(self, provider):
+        """Detection probe returns False when model ignores tools."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(
+                    200, json=_probe_response_without_tool_calls()
+                )
+            )
+            result = await provider._detect_tool_calling()
+        assert result is False
+    @pytest.mark.asyncio
+    async def test_detect_transient_failure_returns_none(self, provider):
+        """Transient failure (timeout, 5xx) returns None, not False."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                side_effect=httpx.ReadTimeout("cold start")
+            )
+            result = await provider._detect_tool_calling()
+        assert result is None
+    @pytest.mark.asyncio
+    async def test_detect_5xx_returns_none(self, provider):
+        """Server error returns None (transient), not False (definitive)."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(503, json={"error": "unavailable"})
+            )
+            result = await provider._detect_tool_calling()
+        assert result is None
+    @pytest.mark.asyncio
+    async def test_detection_runs_once_then_cached(self, provider):
+        """Detection probe fires on first call with tools, cached thereafter."""
+        call_count = 0
+        def side_effect(request):
+            nonlocal call_count
+            call_count += 1
+            body = json.loads(request.content)
+            # Detection probe has test_probe tool
+            if any(
+                t.get("function", {}).get("name") == "test_probe"
+                for t in body.get("tools", [])
+            ):
+                return httpx.Response(
+                    200, json=_probe_response_with_tool_calls()
+                )
+            # Real request
+            return httpx.Response(200, json=_ok_response(
+                tool_calls=[{
+                    "id": "call_real",
+                    "type": "function",
+                    "function": {
+                        "name": "search_documents",
+                        "arguments": json.dumps({"query": "test"}),
+                    },
+                }],
+            ))
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                side_effect=side_effect
+            )
+            # First call: probe + real = 2 requests
+            await provider.complete(
+                [Message(role=Role.USER, content="test")],
+                tools=[SEARCH_TOOL],
+            )
+            # Second call: no probe = 1 request
+            await provider.complete(
+                [Message(role=Role.USER, content="test2")],
+                tools=[SEARCH_TOOL],
+            )
+        assert call_count == 3  # 1 probe + 2 real
+        assert provider._supports_tool_calling is True
+    @pytest.mark.asyncio
+    async def test_transient_failure_retries_on_next_call(self, provider):
+        """Transient detection failure leaves _supports_tool_calling as None, retries."""
+        call_count = 0
+        def side_effect(request):
+            nonlocal call_count
+            call_count += 1
+            body = json.loads(request.content)
+            is_probe = any(
+                t.get("function", {}).get("name") == "test_probe"
+                for t in body.get("tools", [])
+            )
+            if is_probe:
+                if call_count == 1:
+                    # First probe: transient failure
+                    return httpx.Response(503, json={"error": "cold start"})
+                # Second probe: success
+                return httpx.Response(
+                    200, json=_probe_response_with_tool_calls()
+                )
+            # Real request (fallback or native)
+            return httpx.Response(200, json=_ok_response())
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                side_effect=side_effect
+            )
+            # First call: probe fails (transient) + real (fallback) = 2
+            await provider.complete(
+                [Message(role=Role.USER, content="test")],
+                tools=[SEARCH_TOOL],
+            )
+            assert provider._supports_tool_calling is None  # NOT cached
+            # Second call: probe succeeds + real (native) = 2
+            await provider.complete(
+                [Message(role=Role.USER, content="test2")],
+                tools=[SEARCH_TOOL],
+            )
+            assert provider._supports_tool_calling is True  # NOW cached
+        assert call_count == 4  # 2 probes + 2 real
+# --- Prompt-based fallback ---
+class TestSelfHostedPromptFallback:
+    @pytest.fixture
+    def provider(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        p = SelfHostedProvider(config)
+        p._supports_tool_calling = False  # Force fallback mode
+        return p
+    @pytest.mark.asyncio
+    async def test_fallback_parses_tool_call_from_text(self, provider):
+        """When tool calling is unsupported, parse tool calls from model text."""
+        tool_json = json.dumps(
+            {"tool_calls": [{"name": "search_documents", "arguments": {"query": "path params"}}]}
+        )
+        mock_response = _ok_response(content=tool_json)
+        with respx.mock:
+            route = respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="search for path params")],
+                tools=[SEARCH_TOOL],
+            )
+            # Verify tools NOT in payload (prompt-based, not native)
+            sent_body = json.loads(route.calls[0].request.content)
+            assert "tools" not in sent_body
+        assert len(response.tool_calls) == 1
+        assert response.tool_calls[0].name == "search_documents"
+        assert response.tool_calls[0].arguments == {"query": "path params"}
+        assert response.content == ""  # tool call replaces content
+    @pytest.mark.asyncio
+    async def test_fallback_injects_tool_prompt(self, provider):
+        """When tool calling is unsupported, tool descriptions injected as system prompt."""
+        mock_response = _ok_response(content="Just a text answer.")
+        with respx.mock:
+            route = respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            await provider.complete(
+                [Message(role=Role.USER, content="hello")],
+                tools=[SEARCH_TOOL],
+            )
+            sent_body = json.loads(route.calls[0].request.content)
+        # System message should contain tool descriptions
+        system_msg = sent_body["messages"][0]
+        assert system_msg["role"] == "system"
+        assert "search_documents" in system_msg["content"]
+        assert "tool_calls" in system_msg["content"]
+    @pytest.mark.asyncio
+    async def test_fallback_handles_non_dict_arguments(self, provider):
+        """Non-dict arguments in prompt-based JSON degrades to empty dict, not crash."""
+        tool_json = json.dumps(
+            {"tool_calls": [{"name": "search_documents", "arguments": "oops"}]}
+        )
+        mock_response = _ok_response(content=tool_json)
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="test")],
+                tools=[SEARCH_TOOL],
+            )
+        assert len(response.tool_calls) == 1
+        assert response.tool_calls[0].name == "search_documents"
+        assert response.tool_calls[0].arguments == {}
+    @pytest.mark.asyncio
+    async def test_fallback_returns_text_when_no_tool_json(self, provider):
+        """When model responds with plain text (not JSON), return as content."""
+        mock_response = _ok_response(content="I don't know how to use tools.")
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(200, json=mock_response)
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="test")],
+                tools=[SEARCH_TOOL],
+            )
+        assert response.tool_calls == []
+        assert response.content == "I don't know how to use tools."
+# --- Retry and timeout ---
+class TestSelfHostedRetryAndTimeout:
+    @pytest.fixture
+    def provider(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(
+            provider=ProviderConfig(default="selfhosted"),
+            retry=RetryConfig(max_retries=2, base_delay=0.01, max_delay=0.05),
+        )
+        return SelfHostedProvider(config)
+    @pytest.mark.asyncio
+    async def test_retries_on_429_then_succeeds(self, provider):
+        """Provider retries on 429 and succeeds on next attempt."""
+        call_count = 0
+        def side_effect(request):
+            nonlocal call_count
+            call_count += 1
+            if call_count == 1:
+                return httpx.Response(429, json={"error": "rate limited"})
+            return httpx.Response(200, json=_ok_response())
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                side_effect=side_effect
+            )
+            response = await provider.complete(
+                [Message(role=Role.USER, content="test")]
+            )
+        assert response.content == "ok"
+        assert call_count == 2
+    @pytest.mark.asyncio
+    async def test_raises_rate_limit_after_exhausting_retries(self, provider):
+        """Provider raises ProviderRateLimitError after all retries exhausted."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(429, json={"error": "rate limited"})
+            )
+            with pytest.raises(ProviderRateLimitError, match="Rate limited"):
+                await provider.complete(
+                    [Message(role=Role.USER, content="test")]
+                )
+    @pytest.mark.asyncio
+    async def test_raises_timeout_error(self, provider):
+        """Provider raises ProviderTimeoutError on httpx timeout."""
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                side_effect=httpx.ReadTimeout("timed out")
+            )
+            with pytest.raises(ProviderTimeoutError, match="timed out"):
+                await provider.complete(
+                    [Message(role=Role.USER, content="test")]
+                )
+# --- Env var fallback ---
+class TestSelfHostedEnvVars:
+    def test_reads_base_url_from_env(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", "http://my-modal-url:8000/v1")
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        assert provider.base_url == "http://my-modal-url:8000/v1"
+    def test_reads_auth_token_from_env(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        monkeypatch.setenv("MODAL_AUTH_TOKEN", "secret-token-123")
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        assert provider.client.headers.get("authorization") == "Bearer secret-token-123"
+    def test_no_auth_header_when_no_token(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        monkeypatch.delenv("MODAL_AUTH_TOKEN", raising=False)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        assert "authorization" not in {
+            k.lower() for k in provider.client.headers.keys()
+        }
+# --- Streaming ---
+class TestSelfHostedStream:
+    @pytest.fixture
+    def provider(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        return SelfHostedProvider(config)
+    @pytest.mark.asyncio
+    async def test_stream_yields_content_chunks(self, provider):
+        """stream_complete() yields text chunks from SSE stream."""
+        sse_body = (
+            'data: {"choices":[{"delta":{"content":"Hello "}}]}\n\n'
+            'data: {"choices":[{"delta":{"content":"world"}}]}\n\n'
+            "data: [DONE]\n\n"
+        )
+        with respx.mock:
+            respx.post(f"{FAKE_URL}/chat/completions").mock(
+                return_value=httpx.Response(
+                    200,
+                    stream=httpx.ByteStream(sse_body.encode()),
+                    headers={"content-type": "text/event-stream"},
+                )
+            )
+            chunks = []
+            async for chunk in provider.stream_complete(
+                [Message(role=Role.USER, content="Hi")]
+            ):
+                chunks.append(chunk)
+        assert chunks == ["Hello ", "world"]
+# --- format_tools ---
+class TestSelfHostedFormatTools:
+    def test_format_tools_uses_openai_schema(self, monkeypatch):
+        monkeypatch.setenv("MODAL_VLLM_URL", FAKE_URL)
+        config = AppConfig(provider=ProviderConfig(default="selfhosted"))
+        provider = SelfHostedProvider(config)
+        tools = [
+            ToolDefinition(
+                name="search_documents",
+                description="Search docs",
+                parameters={
+                    "type": "object",
+                    "properties": {"query": {"type": "string"}},
+                    "required": ["query"],
+                },
+            )
+        ]
+        formatted = provider.format_tools(tools)
+        assert formatted[0]["type"] == "function"
+        assert formatted[0]["function"]["name"] == "search_documents"
+        assert formatted[0]["function"]["parameters"]["required"] == ["query"]

tests/test_serving.py CHANGED Viewed

@@ -151,6 +151,101 @@ class TestMetricsEndpoint:
         assert "errors_total" in data
         assert "avg_cost_per_query_usd" in data
 class TestMiddleware:
     @pytest.mark.asyncio

         assert "errors_total" in data
         assert "avg_cost_per_query_usd" in data
+    @pytest.mark.asyncio
+    async def test_prometheus_endpoint_returns_text_exposition(self, test_app):
+        async with AsyncClient(
+            transport=ASGITransport(app=test_app), base_url="http://test"
+        ) as client:
+            response = await client.get("/metrics/prometheus")
+        assert response.status_code == 200
+        assert "text/plain" in response.headers["content-type"]
+        body = response.text
+        assert "# TYPE agent_bench_requests_total counter" in body
+        assert "agent_bench_requests_total " in body
+        assert "# TYPE agent_bench_latency_p95_ms gauge" in body
+        assert "agent_bench_latency_p95_ms " in body
+        assert "# TYPE agent_bench_errors_total counter" in body
+class TestHealthCheckProbesProvider:
+    @pytest.mark.asyncio
+    async def test_healthy_when_provider_health_check_passes(self, test_app):
+        """MockProvider.health_check() returns True (default), so status=healthy."""
+        async with AsyncClient(
+            transport=ASGITransport(app=test_app), base_url="http://test"
+        ) as client:
+            response = await client.get("/health")
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "healthy"
+        assert data["provider_available"] is True
+    @pytest.mark.asyncio
+    async def test_degraded_when_provider_health_check_fails(self):
+        """Provider whose health_check() returns False -> status=degraded."""
+        from fastapi import FastAPI
+        class UnhealthyProvider(MockProvider):
+            async def health_check(self) -> bool:
+                return False
+        app = FastAPI()
+        registry = ToolRegistry()
+        registry.register(FakeSearchTool())
+        provider = UnhealthyProvider()
+        orchestrator = Orchestrator(provider=provider, registry=registry, max_iterations=1)
+        app.state.orchestrator = orchestrator
+        app.state.store = HybridStore(dimension=384)
+        app.state.config = AppConfig(provider=ProviderConfig(default="mock"))
+        app.state.system_prompt = "test"
+        app.state.start_time = time.time()
+        app.state.metrics = MetricsCollector()
+        app.add_middleware(RequestMiddleware)
+        from agent_bench.serving.routes import router
+        app.include_router(router)
+        async with AsyncClient(
+            transport=ASGITransport(app=app), base_url="http://test"
+        ) as client:
+            response = await client.get("/health")
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "degraded"
+        assert data["provider_available"] is False
+    @pytest.mark.asyncio
+    async def test_degraded_when_provider_health_check_raises(self):
+        """Provider whose health_check() raises -> status=degraded."""
+        from fastapi import FastAPI
+        class CrashingProvider(MockProvider):
+            async def health_check(self) -> bool:
+                raise ConnectionError("upstream unreachable")
+        app = FastAPI()
+        registry = ToolRegistry()
+        registry.register(FakeSearchTool())
+        provider = CrashingProvider()
+        orchestrator = Orchestrator(provider=provider, registry=registry, max_iterations=1)
+        app.state.orchestrator = orchestrator
+        app.state.store = HybridStore(dimension=384)
+        app.state.config = AppConfig(provider=ProviderConfig(default="mock"))
+        app.state.system_prompt = "test"
+        app.state.start_time = time.time()
+        app.state.metrics = MetricsCollector()
+        app.add_middleware(RequestMiddleware)
+        from agent_bench.serving.routes import router
+        app.include_router(router)
+        async with AsyncClient(
+            transport=ASGITransport(app=app), base_url="http://test"
+        ) as client:
+            response = await client.get("/health")
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "degraded"
+        assert data["provider_available"] is False
 class TestMiddleware:
     @pytest.mark.asyncio