ho22joshua commited on
Commit
cfcbbc8
Β·
0 Parent(s):

initial commit

Browse files
This view is limited to 50 files because it contains too many changes. Β  See raw diff
Files changed (50) hide show
  1. CBORG_MODEL_MAPPINGS.md +108 -0
  2. COMPLETE_MODEL_VERSIONS.md +130 -0
  3. LICENSE +24 -0
  4. MODEL_NAME_UPDATES.md +82 -0
  5. O3_MODEL_COMPARISON.md +117 -0
  6. PRE_RELEASE_CHECKLIST.md +257 -0
  7. README.md +448 -0
  8. check_cborg_routing.py +57 -0
  9. check_soln.py +812 -0
  10. compare_model_configs.py +189 -0
  11. config.example.yml +53 -0
  12. config.yml +3 -0
  13. environment.yml +21 -0
  14. error_analysis.ipynb +0 -0
  15. error_analysis.py +320 -0
  16. error_analysis_fixed_categories.py +203 -0
  17. error_analysis_plotting.ipynb +0 -0
  18. five_step_analysis.ipynb +0 -0
  19. get_all_model_versions.py +97 -0
  20. get_arr.py +19 -0
  21. jobs/README.md +23 -0
  22. jobs/run_tests.sh +18 -0
  23. jobs/submit.sh +54 -0
  24. jobs/test_models.py +59 -0
  25. list_cborg_models.py +54 -0
  26. logs_interpreter.py +341 -0
  27. logs_interpreter.sh +12 -0
  28. map_latest_models.py +122 -0
  29. model_version_mappings.txt +24 -0
  30. models.example.txt +34 -0
  31. models.txt +2 -0
  32. models_coder.txt +1 -0
  33. models_supervisor.txt +1 -0
  34. plot_stats.ipynb +0 -0
  35. plots/five_step_summary_stats.csv +46 -0
  36. prompts/categorization.txt +27 -0
  37. prompts/create_numpy.txt +91 -0
  38. prompts/old/create_numpy_obsolete.txt +65 -0
  39. prompts/old/create_numpy_original.txt +58 -0
  40. prompts/old/create_numpy_step2.txt +103 -0
  41. prompts/old/preprocess_obsolete.txt +95 -0
  42. prompts/old/preprocess_original.txt +42 -0
  43. prompts/preprocess.txt +184 -0
  44. prompts/preprocess_old.txt +175 -0
  45. prompts/preprocess_old_corrupted.txt +187 -0
  46. prompts/scores.txt +8 -0
  47. prompts/summarize_root.txt +4 -0
  48. prompts/supervisor_call.txt +11 -0
  49. prompts/supervisor_first_call.txt +5 -0
  50. run_smk_sequential.sh +329 -0
CBORG_MODEL_MAPPINGS.md ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CBORG Model Mappings - October 29, 2025
2
+
3
+ ## Summary
4
+
5
+ This document shows what each `:latest` model alias maps to in the CBORG API.
6
+
7
+ ## Key Findings
8
+
9
+ 1. **`:latest` and base models are IDENTICAL** - Using `lbl/cborg-chat` or `lbl/cborg-chat:latest` gives you the exact same underlying model
10
+ 2. You can see the actual model version by checking the `response.model` field after making a request
11
+ 3. The "raw" model names show the actual provider-specific version strings
12
+
13
+ ## Model Mappings
14
+
15
+ ### LBL CBORG Models (Local/Custom)
16
+
17
+ | Alias | Underlying Model |
18
+ |-------|------------------|
19
+ | `lbl/cborg-chat` / `lbl/cborg-chat:latest` | `hosted_vllm/hosted_vllm/Llama-4-Scout-17B-16E-Instruct-FP8` |
20
+ | `lbl/cborg-coder` / `lbl/cborg-coder:latest` | `hosted_vllm/hosted_vllm/gpt-oss-120b` |
21
+ | `lbl/cborg-deepthought` / `lbl/cborg-deepthought:latest` | `hosted_vllm/hosted_vllm/gpt-oss-120b` |
22
+ | `lbl/cborg-mini` / `lbl/cborg-mini:latest` | `ollama/gpt-oss:20b` |
23
+ | `lbl/cborg-vision` / `lbl/cborg-vision:latest` | `hosted_vllm/hosted_vllm/Llama-4-Scout-17B-16E-Instruct-FP8` |
24
+
25
+ **Note:** `lbl/cborg-coder` and `lbl/cborg-deepthought` map to the same base model!
26
+
27
+ ### Anthropic Claude Models (via AWS Bedrock)
28
+
29
+ | Alias | Underlying Model |
30
+ |-------|------------------|
31
+ | `anthropic/claude-haiku` / `anthropic/claude-haiku:latest` | `claude-haiku-4-5@20251001` |
32
+ | `anthropic/claude-opus` / `anthropic/claude-opus:latest` | `us.anthropic.claude-opus-4-1-20250805-v1:0` |
33
+ | `anthropic/claude-sonnet` / `anthropic/claude-sonnet:latest` | `claude-sonnet-4-5@20250929` |
34
+ | `anthropic/claude` / `anthropic/claude:latest` | `claude-sonnet-4-5@20250929` (same as sonnet) |
35
+ | `aws/claude-haiku` / `aws/claude-haiku:latest` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` |
36
+ | `aws/claude` / `aws/claude:latest` | `us.anthropic.claude-sonnet-4-5-20250929-v1:0` |
37
+
38
+ **Version Dates:**
39
+ - Haiku: October 1, 2025
40
+ - Opus: August 5, 2025
41
+ - Sonnet: September 29, 2025
42
+
43
+ ### Google Gemini Models
44
+
45
+ | Alias | Underlying Model |
46
+ |-------|------------------|
47
+ | `google/gemini` / `google/gemini:latest` | `gemini-2.5-pro` |
48
+
49
+ ### OpenAI Models
50
+
51
+ | Alias | Underlying Model |
52
+ |-------|------------------|
53
+ | `openai/chatgpt:latest` | `gpt-5-2025-08-07` (August 7, 2025) |
54
+ | `openai/o:latest` | `azure/o3-2025-04-16` (April 16, 2025 via Azure) |
55
+
56
+ ### xAI Grok Models
57
+
58
+ | Alias | Underlying Model |
59
+ |-------|------------------|
60
+ | `xai/grok:latest` | `grok-3` |
61
+
62
+ ## How to Check Model Versions Yourself
63
+
64
+ ```python
65
+ from openai import OpenAI
66
+ import os
67
+
68
+ client = OpenAI(
69
+ api_key=os.environ['CBORG_API_KEY'],
70
+ base_url="https://api.cborg.lbl.gov"
71
+ )
72
+
73
+ response = client.chat.completions.create(
74
+ model="lbl/cborg-chat:latest", # or any other model
75
+ messages=[{"role": "user", "content": "Hi"}],
76
+ max_tokens=5
77
+ )
78
+
79
+ print(f"Requested: lbl/cborg-chat:latest")
80
+ print(f"Actual: {response.model}")
81
+ ```
82
+
83
+ ## Scripts Available
84
+
85
+ 1. **`list_cborg_models.py`** - List all available models (with attempted detail retrieval)
86
+ 2. **`test_model_info.py`** - Test a specific model and see detailed information
87
+ ```bash
88
+ python test_model_info.py "lbl/cborg-chat:latest"
89
+ ```
90
+ 3. **`map_latest_models.py`** - Map all `:latest` models to their underlying versions
91
+
92
+ ## Important Notes
93
+
94
+ - **The `:latest` suffix is optional** - Both `lbl/cborg-chat` and `lbl/cborg-chat:latest` are identical
95
+ - **Version information is in the response** - You must make an API call to see the underlying model
96
+ - **Some models share backends** - `lbl/cborg-coder` and `lbl/cborg-deepthought` both use `gpt-oss-120b`
97
+ - **Embedding models require different API calls** - The `nomic-embed-text` models need the embeddings API, not chat completions
98
+
99
+ ## Provider-Specific Version Strings
100
+
101
+ The "raw" model names follow different conventions by provider:
102
+
103
+ - **AWS Bedrock (Anthropic)**: `us.anthropic.claude-sonnet-4-5-20250929-v1:0`
104
+ - **Google Vertex AI**: `gemini-2.5-pro`
105
+ - **Azure OpenAI**: `azure/o3-2025-04-16`
106
+ - **Native OpenAI**: `gpt-5-2025-08-07`
107
+ - **Local vLLM**: `hosted_vllm/hosted_vllm/Llama-4-Scout-17B-16E-Instruct-FP8`
108
+ - **Ollama**: `ollama/gpt-oss:20b`
COMPLETE_MODEL_VERSIONS.md ADDED
@@ -0,0 +1,130 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Complete Model Version Information
2
+
3
+ ## Discovered via CBORG API Testing - October 29, 2025
4
+
5
+ This document shows the complete mapping from CBORG model aliases to their underlying versions, including all version dates discovered through API testing.
6
+
7
+ ---
8
+
9
+ ## Models with Version Dates
10
+
11
+ ### Anthropic Claude Models
12
+
13
+ | Model Alias | Display Name | Underlying Version | Version Date |
14
+ |-------------|--------------|-------------------|--------------|
15
+ | `anthropic/claude-haiku:latest` | **Claude Haiku 4.5 (2025-10-01)** | `claude-haiku-4-5@20251001` | Oct 1, 2025 |
16
+ | `anthropic/claude-opus:latest` | **Claude Opus 4.1 (2025-08-05)** | `us.anthropic.claude-opus-4-1-20250805-v1:0` | Aug 5, 2025 |
17
+ | `anthropic/claude-sonnet:latest` | **Claude Sonnet 4.5 (2025-09-29)** | `claude-sonnet-4-5@20250929` | Sep 29, 2025 |
18
+ | `claude-3-5-haiku-latest` | **Claude 3.5 Haiku (2024-10-22)** | `claude-3-5-haiku@20241022` | Oct 22, 2024 |
19
+
20
+ ### OpenAI Models (via Azure)
21
+
22
+ | Model Alias | Display Name | Underlying Version | Version Date |
23
+ |-------------|--------------|-------------------|--------------|
24
+ | `openai/gpt-5` | **GPT-5 (2025-08-07)** | `gpt-5-2025-08-07` | Aug 7, 2025 |
25
+ | `openai/gpt-5-mini` | **GPT-5 Mini (2025-08-07)** | `gpt-5-mini-2025-08-07` | Aug 7, 2025 |
26
+ | `openai/o:latest` | **O3 (2025-04-16)** | `azure/o3-2025-04-16` | Apr 16, 2025 |
27
+ | `openai/o3` | **O3 (2025-04-16)** | `azure/o3-2025-04-16` | Apr 16, 2025 |
28
+ | `openai/o3-mini` | **O3 Mini (2025-01-31)** | `azure/o3-mini-2025-01-31` | Jan 31, 2025 |
29
+ | `openai/o4-mini` | **O4 Mini (2025-04-16)** | `azure/o4-mini-2025-04-16` | Apr 16, 2025 |
30
+
31
+ **Key Finding:** Both `openai/o:latest` and `openai/o3` map to the same model version (2025-04-16)
32
+
33
+ ---
34
+
35
+ ## Models with Model Size Information
36
+
37
+ ### AWS Llama Models
38
+
39
+ | Model Alias | Display Name | Underlying Version |
40
+ |-------------|--------------|-------------------|
41
+ | `aws/llama-4-maverick` | **Llama-4 Maverick (17B)** | `us.meta.llama4-maverick-17b-instruct-v1:0` |
42
+ | `aws/llama-4-scout` | **Llama-4 Scout (17B)** | `us.meta.llama4-scout-17b-instruct-v1:0` |
43
+
44
+ **Key Finding:** Both models are 17 billion parameter variants
45
+
46
+ ### GCP Models
47
+
48
+ | Model Alias | Display Name | Underlying Version |
49
+ |-------------|--------------|-------------------|
50
+ | `gcp/qwen-3` | **Qwen-3 (235B)** | `qwen/qwen3-235b-a22b-instruct-2507-maas` |
51
+
52
+ **Key Finding:** This is a massive 235 billion parameter model
53
+
54
+ ---
55
+
56
+ ## Google Gemini Models
57
+
58
+ | Model Alias | Display Name | Underlying Version | Notes |
59
+ |-------------|--------------|-------------------|-------|
60
+ | `google/gemini:latest` | **Gemini 2.5 Pro** | `gemini-2.5-pro` | Latest generation |
61
+ | `google/gemini-flash` | **Gemini 2.5 Flash** | `gemini-2.5-flash` | Fast variant |
62
+ | `gemini-2.0-flash-lite` | **Gemini 2.0 Flash Lite** | (no alias - direct name) | Lightweight variant |
63
+
64
+ ---
65
+
66
+ ## xAI Grok Models
67
+
68
+ | Model Alias | Display Name | Underlying Version | Notes |
69
+ |-------------|--------------|-------------------|-------|
70
+ | `xai/grok:latest` | **Grok-3** | `grok-3` | Latest generation |
71
+ | `xai/grok-mini` | **Grok Mini** | (rate limited during test) | Smaller variant |
72
+ | `xai/grok-code-fast-1` | **Grok Code Fast 1** | (rate limited during test) | Code-focused fast variant |
73
+
74
+ ---
75
+
76
+ ## Other Models
77
+
78
+ | Model Alias | Display Name | Underlying Version | Notes |
79
+ |-------------|--------------|-------------------|-------|
80
+ | `gpt-oss-120b` | **GPT-OSS-120B** | `hosted_vllm/hosted_vllm/gpt-oss-120b` | Open source, hosted via vLLM |
81
+ | `gpt-5-codex` | **GPT-5 Codex** | (not accessible during test) | Code-focused variant |
82
+ | `deepseek-r1` | **DeepSeek-R1** | `MAI-DS-R1` | DeepSeek reasoning model |
83
+
84
+ ---
85
+
86
+ ## Key Insights
87
+
88
+ ### Version Date Patterns
89
+
90
+ 1. **Most Recent Claude Models:** September-October 2025
91
+ - Sonnet 4.5: Sep 29, 2025
92
+ - Haiku 4.5: Oct 1, 2025
93
+ - Opus 4.1: Aug 5, 2025
94
+
95
+ 2. **Most Recent OpenAI Models:** April-August 2025
96
+ - GPT-5: Aug 7, 2025
97
+ - O4 Mini: Apr 16, 2025
98
+ - O3: Apr 16, 2025
99
+ - O3 Mini: Jan 31, 2025
100
+
101
+ 3. **Older Models Still in Use:**
102
+ - Claude 3.5 Haiku: Oct 22, 2024 (over a year old)
103
+
104
+ ### Model Sizes Discovered
105
+
106
+ - **235B parameters:** Qwen-3 (largest)
107
+ - **120B parameters:** GPT-OSS-120B
108
+ - **17B parameters:** Llama-4 Maverick, Llama-4 Scout
109
+
110
+ ### `:latest` Aliases
111
+
112
+ All `:latest` suffixes have been resolved:
113
+ - `anthropic/claude-*:latest` β†’ Specific dated versions
114
+ - `google/gemini:latest` β†’ gemini-2.5-pro
115
+ - `xai/grok:latest` β†’ grok-3
116
+ - `openai/o:latest` β†’ azure/o3-2025-04-16
117
+
118
+ ---
119
+
120
+ ## Usage in Notebook
121
+
122
+ The notebook now displays all these version dates and model sizes in plot titles and legends, making it clear exactly which model versions were used in the experiments.
123
+
124
+ **Example plot titles:**
125
+ - "Claude Haiku 4.5 (2025-10-01)" instead of "anthropic/claude-haiku:latest"
126
+ - "O3 (2025-04-16)" instead of "openai/o3"
127
+ - "GPT-5 Mini (2025-08-07)" instead of "openai/gpt-5-mini"
128
+ - "Qwen-3 (235B)" instead of "gcp/qwen-3"
129
+
130
+ This provides complete transparency about which exact model snapshots were used in your analysis!
LICENSE ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2025 The Regents of the University of California,
4
+ on behalf of its Berkeley campus, and the contributors:
5
+ Haichen Wang, Dongwon Kim, Joshua Anthony Ho,
6
+ Eli Abigail Gendreau-Distler, and Chengxi Yang.
7
+
8
+ Permission is hereby granted, free of charge, to any person obtaining a copy
9
+ of this software and associated documentation files (the "Software"), to deal
10
+ in the Software without restriction, including without limitation the rights
11
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
12
+ copies of the Software, and to permit persons to whom the Software is
13
+ furnished to do so, subject to the following conditions:
14
+
15
+ The above copyright notice and this permission notice shall be included in all
16
+ copies or substantial portions of the Software.
17
+
18
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
19
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
20
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
21
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
22
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
23
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
24
+ SOFTWARE.
MODEL_NAME_UPDATES.md ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Model Name Updates in five_step_analysis.ipynb
2
+
3
+ ## Changes Made
4
+
5
+ Updated the notebook to display cleaner, more readable model names in all plots while maintaining the correct cost lookups.
6
+
7
+ ## Before β†’ After Transformations
8
+
9
+ | Original Name | Display Name (with Version Date) |
10
+ |---------------|----------------------------------|
11
+ | `anthropic/claude-haiku:latest` | **Claude Haiku 4.5 (2025-10-01)** |
12
+ | `anthropic/claude-opus:latest` | **Claude Opus 4.1 (2025-08-05)** |
13
+ | `anthropic/claude-sonnet:latest` | **Claude Sonnet 4.5 (2025-09-29)** |
14
+ | `claude-3-5-haiku-latest` | **Claude 3.5 Haiku (2024-10-22)** |
15
+ | `google/gemini:latest` | **Gemini 2.5 Pro** |
16
+ | `google/gemini-flash` | **Gemini Flash** |
17
+ | `gemini-2.0-flash-lite` | **Gemini 2.0 Flash Lite** |
18
+ | `openai/o:latest` | **O3 (2025-04-16, Azure)** |
19
+ | `openai/gpt-5` | **GPT-5 (2025-08-07)** |
20
+ | `openai/gpt-5-mini` | **GPT-5 Mini** |
21
+ | `openai/o3` | **O3** |
22
+ | `openai/o3-mini` | **O3 Mini** |
23
+ | `openai/o4-mini` | **O4 Mini** |
24
+ | `xai/grok:latest` | **Grok-3** |
25
+ | `xai/grok-mini` | **Grok Mini** |
26
+ | `xai/grok-code-fast-1` | **Grok Code Fast 1** |
27
+ | `aws/llama-4-maverick` | **Llama-4 Maverick** |
28
+ | `aws/llama-4-scout` | **Llama-4 Scout** |
29
+ | `gpt-oss-120b` | **GPT-OSS-120B** |
30
+ | `gpt-5-codex` | **GPT-5 Codex** |
31
+ | `deepseek-r1` | **DeepSeek-R1** |
32
+ | `gcp/qwen-3` | **Qwen-3** |
33
+
34
+ **Note:** Version dates (e.g., 2025-10-01) reflect the actual underlying model versions discovered through CBORG API testing on October 29, 2025.
35
+
36
+ ## Technical Implementation
37
+
38
+ ### What Changed
39
+ - Added `MODEL_NAME_MAPPING` dictionary based on CBORG API testing results
40
+ - Added `resolve_model_name()` function to convert aliases to display names
41
+ - Updated `create_pair_label()` to use resolved names instead of raw strings
42
+
43
+ ### What Stayed the Same
44
+ - Cost tables still use original model names (correct behavior)
45
+ - Data loading and filtering logic unchanged
46
+ - Plot generation code unchanged
47
+ - Cost calculations work correctly with original column values
48
+
49
+ ### Key Design Decision
50
+ The mapping only affects the `pair` column used for display in plots. The original `supervisor` and `coder` columns remain unchanged, ensuring cost lookups continue to work correctly:
51
+
52
+ ```python
53
+ # Cost lookup uses original columns (correct)
54
+ sup_model = row['supervisor'] # e.g., "anthropic/claude-haiku:latest"
55
+ sup_icost = input_cost.get(sup_model, 0) # Finds correct price
56
+
57
+ # Display uses mapped pair column
58
+ pair_name = row['pair'] # e.g., "Claude Haiku 4.5"
59
+ ```
60
+
61
+ ## Benefits
62
+
63
+ 1. **Clearer plot titles**: "Claude Haiku 4.5" instead of "anthropic/claude-haiku:latest"
64
+ 2. **Easier comparison**: Names highlight the actual model versions
65
+ 3. **Based on real data**: Names reflect actual underlying models from CBORG API testing
66
+ 4. **Maintains correctness**: Cost calculations still work properly with original names
67
+
68
+ ## Example Output
69
+
70
+ Before:
71
+ - `anthropic/claude-sonnet:latest`
72
+ - `xai/grok:latest`
73
+ - `openai/o:latest`
74
+ - `openai/gpt-5`
75
+
76
+ After (with version dates):
77
+ - `Claude Sonnet 4.5 (2025-09-29)`
78
+ - `Grok-3`
79
+ - `O3 (2025-04-16, Azure)`
80
+ - `GPT-5 (2025-08-07)`
81
+
82
+ Much more readable in plot titles and legends, with version dates showing exactly which model snapshot was used!
O3_MODEL_COMPARISON.md ADDED
@@ -0,0 +1,117 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # O3 Model Comparison: openai/o:latest vs openai/o3
2
+
3
+ ## Summary
4
+ Both `openai/o:latest` and `openai/o3` route to the **identical** underlying model deployment in CBORG with **no configuration differences** detected.
5
+
6
+ ## Technical Details
7
+
8
+ ### 1. Underlying Model
9
+ - **openai/o:latest** β†’ `azure/o3-2025-04-16`
10
+ - **openai/o3** β†’ `azure/o3-2025-04-16`
11
+ - βœ“ **SAME** base model
12
+
13
+ ### 2. Configuration Parameters
14
+ Tested with explicit parameters:
15
+ ```python
16
+ temperature=1.0
17
+ top_p=1.0
18
+ max_tokens=10
19
+ ```
20
+
21
+ **Result**: Both models respond identically
22
+ - Same token usage for same prompts
23
+ - Same response IDs format
24
+ - Same provider-specific fields: `{'content_filter_results': {}}`
25
+ - No system fingerprint differences (both return `None`)
26
+
27
+ ### 3. API Response Comparison
28
+ Multiple test calls (3 each) showed:
29
+ - Identical response structure
30
+ - Same routing backend
31
+ - No detectable configuration differences
32
+ - No temperature/top_p/frequency_penalty differences
33
+
34
+ ## Performance After Merging
35
+
36
+ After merging both experimental runs, the combined statistics are:
37
+
38
+ | Step | Success Rate | Trials |
39
+ |------|-------------|--------|
40
+ | 1 | 95.0% (19/20) | 20 |
41
+ | 2 | 60.0% (12/20) | 20 |
42
+ | 3 | 20.0% (4/20) | 20 |
43
+ | 4 | 100.0% (20/20)| 20 |
44
+ | 5 | 65.0% (13/20) | 20 |
45
+
46
+ **Total records**: 100 (50 from `openai/o:latest` + 50 from `openai/o3`)
47
+
48
+ The merged data provides:
49
+ - βœ“ More robust statistics (doubled sample size)
50
+ - βœ“ Average performance across both experimental runs
51
+ - βœ“ Reduced variance in the estimates
52
+
53
+ ## Why Were There Performance Differences Before Merging?
54
+
55
+ The separate experimental runs showed different performance:
56
+ - Step 3: 10% vs 30% success (20 percentage point difference)
57
+ - Step 5: 50% vs 80% success (30 percentage point difference)
58
+
59
+ These differences were **NOT due to model configuration**, but rather:
60
+
61
+ 1. **Different Experimental Runs**
62
+ - Different timestamps when trials were conducted
63
+ - Separate experimental sessions
64
+
65
+ 2. **Natural Model Variability**
66
+ - O3 models are reasoning models with inherent variability
67
+ - Even with same temperature, outputs can differ significantly
68
+ - Non-deterministic reasoning processes
69
+
70
+ 3. **Small Sample Size Effects**
71
+ - Only 10 trials per step in each run
72
+ - Random variation can appear as systematic differences
73
+ - Merging to 20 trials provides more stable estimates
74
+
75
+ 4. **Temporal Factors**
76
+ - Models might have been tested at different times
77
+ - Backend infrastructure state could differ
78
+ - Load balancing or deployment variations
79
+
80
+ By merging, we get a more representative average of the model's actual performance.
81
+
82
+ ## Recommendation
83
+
84
+ **Merge both models in plots** because:
85
+
86
+ 1. βœ“ They are technically identical (same model, same configuration)
87
+ 2. βœ“ Performance differences are due to experimental variability, not model differences
88
+ 3. βœ“ Merging provides more robust statistics (20 trials per step instead of 10)
89
+ 4. βœ“ Reduces clutter in visualizations while preserving all data
90
+
91
+ **Display names** (updated):
92
+ - `openai/o:latest` β†’ **"O3 (2025-04-16)"**
93
+ - `openai/o3` β†’ **"O3 (2025-04-16)"**
94
+
95
+ This naming makes it clear:
96
+ - Both use the same base model (2025-04-16)
97
+ - Data from both variants is combined under a single label
98
+ - Total: 100 records (50 + 50) across 5 steps = 20 trials per step
99
+
100
+ ## CBORG Routing Behavior
101
+
102
+ From our testing, CBORG treats both aliases as:
103
+ - **Functionally identical** at the API level
104
+ - **Same deployment** (azure/o3-2025-04-16)
105
+ - **No configuration override** based on alias name
106
+
107
+ The alias `openai/o:latest` is simply a pointer to `openai/o3` at the CBORG routing layer, but the experiments treated them as separate model selections, leading to different trial data.
108
+
109
+ ## Conclusion
110
+
111
+ `openai/o:latest` and `openai/o3` are technically the same model with the same configuration. They have been **merged in the plots** under the single label **"O3 (2025-04-16)"** to:
112
+ - Provide more robust statistics (20 trials per step)
113
+ - Reduce visualization clutter
114
+ - Average out experimental variability
115
+ - Present a clearer picture of the model's typical performance
116
+
117
+ The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.
PRE_RELEASE_CHECKLIST.md ADDED
@@ -0,0 +1,257 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Pre-Release Checklist for llm4hep Repository
2
+
3
+ ## βœ… Ready for Public Release
4
+
5
+ ### Documentation
6
+ - [x] Comprehensive README.md with all 5 steps documented
7
+ - [x] Model mapping documentation (CBORG_MODEL_MAPPINGS.md)
8
+ - [x] Analysis notebooks documented
9
+ - [x] Installation instructions clear
10
+ - [x] Example usage provided
11
+
12
+ ### Core Functionality
13
+ - [x] All 5 workflow steps (Snakemake files present)
14
+ - [x] Supervisor-coder framework
15
+ - [x] Validation system
16
+ - [x] Error analysis tools
17
+ - [x] Log interpretation
18
+
19
+ ## ⚠️ Issues to Address Before Public Release
20
+
21
+ ### 1. **CRITICAL: API Key Setup**
22
+ **Issue:** Users won't have CBORG API access
23
+ **Current state:** Code expects `CBORG_API_KEY` from LBL's CBORG system
24
+ **Impact:** External users cannot run the code without CBORG access
25
+
26
+ **Solutions:**
27
+ - [x] Add clear notice in README that CBORG access is required
28
+ - [x] Provide instructions for requesting CBORG access
29
+ - [x] Document how to get CBORG credentials
30
+ - [ ] OR: Add alternative OpenAI API support as fallback (optional enhancement)
31
+
32
+ **Status:** βœ… README now includes Prerequisites section with CBORG access requirements
33
+
34
+ ### 2. **Data Access**
35
+ **Issue:** Reference data paths are NERSC-specific
36
+ **Current paths:** `/global/cfs/projectdirs/atlas/...`
37
+ **Impact:** External users cannot access data
38
+
39
+ **Solutions:**
40
+ - [x] Already documented in README (users can download from ATLAS Open Data)
41
+ - [ ] Add explicit download links for ATLAS Open Data
42
+ - [ ] Provide script to download data automatically
43
+ - [ ] Document expected directory structure
44
+
45
+ **Suggested addition:**
46
+ ```markdown
47
+ ### Downloading ATLAS Open Data
48
+
49
+ ```bash
50
+ # Download script example
51
+ wget https://opendata.cern.ch/record/15006/files/...
52
+ # Or provide helper script
53
+ bash scripts/download_atlas_data.sh
54
+ ```
55
+ ```
56
+
57
+ ### 3. **Reference Solution Arrays**
58
+ **Status:** βœ… Partially addressed
59
+ - [x] `.gitignore` properly excludes large .npy files
60
+ - [x] `solution/arrays/README.md` explains missing files
61
+ - [x] `scripts/fetch_solution_arrays.sh` exists
62
+ - [ ] Script hardcoded to NERSC path - won't work externally
63
+
64
+ **Fix needed:**
65
+ ```bash
66
+ # In fetch_solution_arrays.sh, line 7:
67
+ # Current:
68
+ SRC_DIR=${REF_SOLN_DIR:-/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays}
69
+
70
+ # Should be:
71
+ SRC_DIR=${REF_SOLN_DIR:-./solution_reference}
72
+ # And add instructions to generate arrays or download them
73
+ ```
74
+
75
+ ### 4. **Configuration Files**
76
+
77
+ **Status:** βœ… COMPLETED
78
+
79
+ **config.example.yml:**
80
+ - [x] Created comprehensive example config with all options
81
+ - [x] Added comments explaining each field
82
+ - [x] Listed all available CBORG models
83
+ - [x] Documented supervisor/coder roles, temperature, max_iterations, out_dir
84
+
85
+ **models.example.txt:**
86
+ - [x] Created example file with clear formatting
87
+ - [x] Added examples for major model families (Anthropic, OpenAI, Google, xAI, AWS)
88
+ - [x] Emphasized blank line requirement
89
+
90
+ ### 5. **Model Lists**
91
+
92
+ **Status:** βœ… COMPLETED
93
+
94
+ **models.example.txt:**
95
+ - [x] Created clean example with proper formatting
96
+ - [x] Added clear comments and instructions
97
+ - [x] Included examples for all major model families
98
+ - [x] Emphasized blank line requirement with warning
99
+
100
+ **Note:** Actual `models.txt` and `config.yml` are user-specific and properly excluded from git
101
+
102
+ ### 6. **Dependencies and Environment**
103
+
104
+ **environment.yml:**
105
+ - [x] Looks complete
106
+ - [ ] Should test on fresh environment to verify
107
+ - [ ] Some packages may have version conflicts (ROOT + latest Python)
108
+
109
+ **Missing:**
110
+ - [ ] No `requirements.txt` for pip-only users
111
+ - [ ] No Docker/container option for reproducibility
112
+
113
+ **Suggestions:**
114
+ ```bash
115
+ # Add requirements.txt
116
+ pip freeze > requirements.txt
117
+
118
+ # Add Dockerfile
119
+ # Or at minimum, document tested versions
120
+ ```
121
+
122
+ ### 7. **Unused/Testing Files**
123
+
124
+ **Status:** βœ… COMPLETED
125
+
126
+ **Cleaned up:**
127
+ - [x] `testing_area/` - Deleted by user
128
+ - [x] `model_test_output.txt` - Added to .gitignore
129
+ - [x] `tmp_results/` - Added to .gitignore
130
+ - [x] `all_stats.csv` - Added to .gitignore
131
+ - [x] `solution/arrays_incorrect/` - Deleted (unused development files)
132
+ - [x] `solution/results/` - Deleted (redundant ROOT files)
133
+ - [x] `solution/__pycache__/` - Deleted
134
+ - [x] `jobs/slurm/*.out` - Old SLURM outputs deleted, added to .gitignore
135
+
136
+ **Action:** βœ… All test artifacts cleaned up and properly ignored
137
+
138
+ ### 8. **Licensing**
139
+
140
+ **Status:** βœ… COMPLETED
141
+
142
+ **CRITICAL for public release:**
143
+ - [x] LICENSE file added (MIT License)
144
+ - [x] Copyright notice includes UC Berkeley and all contributors
145
+ - [x] Proper legal protection for public repository
146
+
147
+ **Copyright:** The Regents of the University of California, on behalf of its Berkeley campus, and contributors
148
+
149
+ ### 9. **Citation and Attribution**
150
+
151
+ **Should add:**
152
+ - [ ] CITATION.cff file
153
+ - [ ] BibTeX entry in README
154
+ - [ ] Acknowledgments section
155
+ - [ ] Links to papers (if applicable)
156
+
157
+ ### 10. **Testing and Examples**
158
+
159
+ **Should provide:**
160
+ - [ ] Quick start example (5-minute test)
161
+ - [ ] Full workflow example
162
+ - [ ] Expected output examples
163
+ - [ ] Sample results for validation
164
+
165
+ **Suggested: Add `examples/` directory:**
166
+ ```
167
+ examples/
168
+ quick_start.sh # 1-step test
169
+ full_workflow.sh # All 5 steps
170
+ expected_output/ # What users should see
171
+ ```
172
+
173
+ ## πŸ“‹ Recommended File Additions
174
+
175
+ ### 1. LICENSE
176
+ Choose appropriate open-source license (MIT recommended for max reuse)
177
+
178
+ ### 2. CONTRIBUTING.md
179
+ Guidelines for external contributors
180
+
181
+ ### 3. CHANGELOG.md
182
+ Track versions and changes
183
+
184
+ ### 4. .github/workflows/
185
+ - [ ] CI/CD for testing
186
+ - [ ] Automated documentation builds
187
+
188
+ ### 5. scripts/setup.sh
189
+ One-command setup script:
190
+ ```bash
191
+ #!/bin/bash
192
+ # Complete setup for llm4hep
193
+
194
+ # 1. Check prerequisites
195
+ # 2. Set up conda environment
196
+ # 3. Configure API keys
197
+ # 4. Download reference data
198
+ # 5. Validate installation
199
+ ```
200
+
201
+ ## πŸ” Code Quality Issues
202
+
203
+ ### Fixed Issues:
204
+ 1. **SLURM output path:** βœ… Fixed in `jobs/run_tests.sh` to use relative path `jobs/slurm/%j.out`
205
+ 2. **Test file cleanup:** βœ… All temporary files removed and ignored
206
+
207
+ ### Minor Issues Remaining:
208
+ 1. **Commented-out code:** `test_models.sh` has `# source ~/.apikeys.sh` commented
209
+ - Should either uncomment or remove
210
+
211
+ 2. **Inconsistent error handling:** Some scripts check for API key, others don't
212
+ - Not critical for initial release
213
+
214
+ 3. **Hard-coded paths:** Several scripts have NERSC-specific paths
215
+ - Documented in README as institutional limitation
216
+
217
+ ## βœ… Action Items Summary
218
+
219
+ **High Priority (blocking release):**
220
+ 1. βœ… Add LICENSE file - **COMPLETED (MIT License)**
221
+ 2. βœ… Document CBORG API access requirements clearly - **COMPLETED in README**
222
+ 3. βœ… Fix/remove NERSC-specific paths - **DOCUMENTED as institutional limitation**
223
+ 4. βœ… Clean up test files or add to .gitignore - **COMPLETED**
224
+ 5. βœ… Add external data download instructions - **PARTIALLY DONE** (documented in README)
225
+
226
+ **Medium Priority (improve usability):**
227
+ 6. βœ… Create config.example.yml with documentation - **COMPLETED**
228
+ 7. βœ… Create models.example.txt - **COMPLETED**
229
+ 8. [ ] Add quick-start example
230
+ 9. [ ] Add CITATION.cff
231
+ 10. [ ] Create setup script
232
+ 11. [ ] Test environment.yml on fresh install
233
+
234
+ **Low Priority (nice to have):**
235
+ 12. [ ] Add requirements.txt
236
+ 13. [ ] Add Docker option
237
+ 14. [ ] Add CI/CD
238
+ 15. [ ] Add CONTRIBUTING.md
239
+
240
+ ## 🎯 Minimal Viable Public Release
241
+
242
+ **Status: βœ… READY FOR PUBLIC RELEASE**
243
+
244
+ All minimal viable release requirements completed:
245
+ 1. βœ… **LICENSE** - MIT License added with UC Berkeley copyright
246
+ 2. βœ… **Updated README** - Comprehensive documentation with CBORG access notice and Prerequisites section
247
+ 3. βœ… **Clean up** - testing_area/, temp files, and old SLURM outputs removed; .gitignore updated
248
+ 4. βœ… **config.example.yml** and **models.example.txt** - Created with full documentation
249
+ 5. βœ… **Data download instructions** - Documented in README with reference to ATLAS Open Data
250
+
251
+ **Additional improvements made:**
252
+ - βœ… Fixed SLURM output path in jobs/run_tests.sh
253
+ - βœ… Cleaned solution/ directory (removed arrays_incorrect/, results/, __pycache__/)
254
+ - βœ… Updated .gitignore comprehensively
255
+ - βœ… All critical paths and dependencies documented
256
+
257
+ **The repository is now ready to be made public with clear expectations and proper documentation.**
README.md ADDED
@@ -0,0 +1,448 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Large Language Model Analysis Framework for High Energy Physics
2
+
3
+ A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
4
+
5
+ ## Table of Contents
6
+ - [Setup](#setup)
7
+ - [Data and Solution](#data-and-solution)
8
+ - [Running Tests](#running-tests)
9
+ - [Analysis and Visualization](#analysis-and-visualization)
10
+ - [Project Structure](#project-structure)
11
+ - [Advanced Usage](#advanced-usage)
12
+
13
+ ---
14
+
15
+ ## Setup
16
+
17
+ ### Prerequisites
18
+
19
+ **CBORG API Access Required**
20
+
21
+ This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLM models. To use this code, you will need:
22
+
23
+ 1. Access to the CBORG API (contact LBL for access)
24
+ 2. A CBORG API key
25
+ 3. Network access to the CBORG API endpoint
26
+
27
+ **Note for External Users:** CBORG is an internal LBL system. External users may need to:
28
+ - Request guest access through LBL collaborations
29
+ - Adapt the code to use OpenAI API directly (requires code modifications)
30
+ - Contact the repository maintainers for alternative deployment options
31
+
32
+ ### Environment Setup
33
+ Create Conda environment:
34
+ ```bash
35
+ mamba env create -f environment.yml
36
+ conda activate llm_env
37
+ ```
38
+
39
+ ### API Configuration
40
+ Create script `~/.apikeys.sh` to export CBORG API key:
41
+ ```bash
42
+ export CBORG_API_KEY="INSERT_API_KEY"
43
+ ```
44
+
45
+ Then source it before running tests:
46
+ ```bash
47
+ source ~/.apikeys.sh
48
+ ```
49
+
50
+ ### Initial Configuration
51
+
52
+ Before running tests, set up your configuration files:
53
+
54
+ ```bash
55
+ # Copy example configuration files
56
+ cp config.example.yml config.yml
57
+ cp models.example.txt models.txt
58
+
59
+ # Edit config.yml to set your preferred models and parameters
60
+ # Edit models.txt to list models you want to test
61
+ ```
62
+
63
+ **Important:** The `models.txt` file must end with a blank line.
64
+
65
+ ---
66
+
67
+ ## Data and Solution
68
+
69
+ ### ATLAS Open Data Samples
70
+ All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
71
+ ```
72
+ /global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
73
+ ```
74
+
75
+ **Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
76
+ ```bash
77
+ chmod -R a-w /path/to/data/directory
78
+ ```
79
+
80
+ ### Reference Solution
81
+ - Navigate to `solution/` directory and run `python soln.py`
82
+ - Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution
83
+
84
+ ### Reference Arrays for Validation
85
+ Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
86
+
87
+ **Quick fetch from repo root:**
88
+ ```bash
89
+ bash scripts/fetch_solution_arrays.sh
90
+ ```
91
+
92
+ **Or copy from NERSC shared path:**
93
+ ```
94
+ /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
95
+ ```
96
+
97
+ ---
98
+
99
+ ## Running Tests
100
+
101
+ ### Model Configuration
102
+
103
+ Three model list files control testing:
104
+ - **`models.txt`**: Models for sequential testing
105
+ - **`models_supervisor.txt`**: Supervisor models for paired testing
106
+ - **`models_coder.txt`**: Coder models for paired testing
107
+
108
+ **Important formatting rules:**
109
+ - One model per line
110
+ - File must end with a blank line
111
+ - Repeat model names for multiple trials
112
+ - Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
113
+
114
+ See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
115
+
116
+ ### Testing Workflows
117
+
118
+ #### 1. Sequential Testing (Single Model at a Time)
119
+ ```bash
120
+ bash test_models.sh output_dir_name
121
+ ```
122
+ Tests all models in `models.txt` sequentially.
123
+
124
+ #### 2. Parallel Testing (Multiple Models)
125
+ ```bash
126
+ # Basic parallel execution
127
+ bash test_models_parallel.sh output_dir_name
128
+
129
+ # GNU Parallel (recommended for large-scale testing)
130
+ bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
131
+
132
+ # Examples:
133
+ bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
134
+ bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
135
+ bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
136
+ ```
137
+
138
+ **GNU Parallel features:**
139
+ - Scales to 20-30 models with 200-300 total parallel jobs
140
+ - Automatic resource management
141
+ - Fast I/O using `/dev/shm` temporary workspace
142
+ - Comprehensive error handling and logging
143
+
144
+ #### 3. Step-by-Step Testing with Validation
145
+ ```bash
146
+ # Run all 5 steps with validation
147
+ ./run_smk_sequential.sh --validate
148
+
149
+ # Run specific steps
150
+ ./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
151
+
152
+ # Run individual steps
153
+ ./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
154
+ ./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
155
+ ./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
156
+ ./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
157
+ ./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
158
+
159
+ # Custom output directory
160
+ ./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
161
+ ```
162
+
163
+ **Directory naming options:**
164
+ - `--job-id ID`: Creates `results_job_ID/`
165
+ - `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
166
+ - `--out-dir DIR`: Custom directory name
167
+
168
+ ### Validation
169
+
170
+ **Automatic validation (during execution):**
171
+ ```bash
172
+ ./run_smk_sequential.sh --step1 --step2 --validate
173
+ ```
174
+ Validation logs saved to `{output_dir}/logs/*_validation.log`
175
+
176
+ **Manual validation (after execution):**
177
+ ```bash
178
+ # Validate all steps
179
+ python check_soln.py --out_dir results_job_002
180
+
181
+ # Validate specific step
182
+ python check_soln.py --out_dir results_job_002 --step 2
183
+ ```
184
+
185
+ **Validation features:**
186
+ - βœ… Adaptive tolerance with 4 significant digit precision
187
+ - πŸ“Š Column-by-column difference analysis
188
+ - πŸ“‹ Side-by-side value comparison
189
+ - 🎯 Clear, actionable error messages
190
+
191
+ ### Speed Optimization
192
+
193
+ Reduce iteration counts in `config.yml`:
194
+ ```yaml
195
+ # Limit LLM coder attempts (default 10)
196
+ max_iterations: 3
197
+ ```
198
+
199
+ ---
200
+
201
+ ## Analysis and Visualization
202
+
203
+ ### Results Summary
204
+ All test results are aggregated in:
205
+ ```
206
+ results_summary.csv
207
+ ```
208
+
209
+ **Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
210
+
211
+ ### Error Analysis and Categorization
212
+
213
+ **Automated error analysis:**
214
+ ```bash
215
+ python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
216
+ ```
217
+
218
+ Uses LLM to analyze comprehensive logs and categorize errors into:
219
+ - Semantic errors
220
+ - Function-calling errors
221
+ - Intermediate file not found
222
+ - Incorrect branch name
223
+ - OpenAI API errors
224
+ - Data quality issues (all weights = 0)
225
+ - Other/uncategorized
226
+
227
+ ### Interactive Analysis Notebooks
228
+
229
+ #### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
230
+ Comprehensive analysis of model performance across all 5 workflow steps:
231
+ - **Success rate heatmap** (models Γ— steps)
232
+ - **Agent work progression** (iterations over steps)
233
+ - **API call statistics** (by step and model)
234
+ - **Cost analysis** (input/output tokens, estimated pricing)
235
+
236
+ **Output plots:**
237
+ - `plots/1_success_rate_heatmap.pdf`
238
+ - `plots/2_agent_work_line_plot.pdf`
239
+ - `plots/3_api_calls_line_plot.pdf`
240
+ - `plots/4_cost_per_step.pdf`
241
+ - `plots/five_step_summary_stats.csv`
242
+
243
+ #### 2. Error Category Analysis (`error_analysis.ipynb`)
244
+ Deep dive into error patterns and failure modes:
245
+ - **Normalized error distribution** (stacked bar chart with percentages)
246
+ - **Error type heatmap** (models Γ— error categories)
247
+ - **Top model breakdowns** (faceted plots for top 9 models)
248
+ - **Error trends across steps** (stacked area chart)
249
+
250
+ **Output plots:**
251
+ - `plots/error_distribution_by_model.pdf`
252
+ - `plots/error_heatmap_by_model.pdf`
253
+ - `plots/error_categories_top_models.pdf`
254
+ - `plots/errors_by_step.pdf`
255
+
256
+ #### 3. Quick Statistics (`plot_stats.ipynb`)
257
+ Legacy notebook for basic statistics visualization.
258
+
259
+ ### Log Interpretation
260
+
261
+ **Automated log analysis:**
262
+ ```bash
263
+ python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
264
+ ```
265
+
266
+ Analyzes comprehensive supervisor-coder logs to identify:
267
+ - Root causes of failures
268
+ - Responsible parties (user, supervisor, coder, external)
269
+ - Error patterns across iterations
270
+
271
+ ---
272
+
273
+ ## Project Structure
274
+
275
+ ### Core Scripts
276
+ - **`supervisor_coder.py`**: Supervisor-coder framework implementation
277
+ - **`check_soln.py`**: Solution validation with enhanced comparison
278
+ - **`write_prompt.py`**: Prompt management and context chaining
279
+ - **`update_stats.py`**: Statistics tracking and CSV updates
280
+ - **`error_analysis.py`**: LLM-powered error categorization
281
+
282
+ ### Test Runners
283
+ - **`test_models.sh`**: Sequential model testing
284
+ - **`test_models_parallel.sh`**: Parallel testing (basic)
285
+ - **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
286
+ - **`test_stats.sh`**: Individual model statistics
287
+ - **`test_stats_parallel.sh`**: Parallel step execution
288
+ - **`run_smk_sequential.sh`**: Step-by-step workflow runner
289
+
290
+ ### Snakemake Workflows (`workflow/`)
291
+ The analysis workflow is divided into 5 sequential steps:
292
+
293
+ 1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
294
+ 2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
295
+ 3. **`preprocess.smk`**: Apply preprocessing transformations
296
+ 4. **`scores.smk`**: Compute signal/background classification scores
297
+ 5. **`categorization.smk`**: Final categorization and statistical analysis
298
+
299
+ **Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
300
+
301
+ ### Prompts (`prompts/`)
302
+ - `summarize_root.txt`: Step 1 task description
303
+ - `create_numpy.txt`: Step 2 task description
304
+ - `preprocess.txt`: Step 3 task description
305
+ - `scores.txt`: Step 4 task description
306
+ - `categorization.txt`: Step 5 task description
307
+ - `supervisor_first_call.txt`: Initial supervisor instructions
308
+ - `supervisor_call.txt`: Subsequent supervisor instructions
309
+
310
+ ### Utility Scripts (`util/`)
311
+ - **`inspect_root.py`**: ROOT file inspection tools
312
+ - **`analyze_particles.py`**: Particle-level analysis
313
+ - **`compare_arrays.py`**: NumPy array comparison utilities
314
+
315
+ ### Model Documentation
316
+ - **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
317
+ - **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
318
+ - **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
319
+ - **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
320
+
321
+ ### Analysis Notebooks
322
+ - **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
323
+ - **`error_analysis.ipynb`**: Error categorization and pattern analysis
324
+ - **`error_analysis_plotting.ipynb`**: Additional error visualizations
325
+ - **`plot_stats.ipynb`**: Legacy statistics plots
326
+
327
+ ### Output Structure
328
+ Each test run creates:
329
+ ```
330
+ output_name/
331
+ β”œβ”€β”€ model_timestamp/
332
+ β”‚ β”œβ”€β”€ generated_code/ # LLM-generated Python scripts
333
+ β”‚ β”œβ”€β”€ logs/ # Execution logs and supervisor records
334
+ β”‚ β”œβ”€β”€ arrays/ # NumPy arrays produced by generated code
335
+ β”‚ β”œβ”€β”€ plots/ # Comparison plots (generated vs. solution)
336
+ β”‚ β”œβ”€β”€ prompt_pairs/ # User + supervisor prompts
337
+ β”‚ β”œβ”€β”€ results/ # Temporary ROOT files (job-scoped)
338
+ β”‚ └── snakemake_log/ # Snakemake execution logs
339
+ ```
340
+
341
+ **Job-scoped ROOT outputs:**
342
+ - Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
343
+ - Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
344
+ - Automatically cleaned after significance calculation
345
+
346
+ ---
347
+
348
+ ## Advanced Usage
349
+
350
+ ### Supervisor-Coder Configuration
351
+
352
+ Control iteration limits in `config.yml`:
353
+ ```yaml
354
+ model: 'anthropic/claude-sonnet:latest'
355
+ name: 'experiment_name'
356
+ out_dir: 'results/experiment_name'
357
+ max_iterations: 10 # Maximum supervisor-coder iterations per step
358
+ ```
359
+
360
+ ### Parallel Execution Tuning
361
+
362
+ For `test_models_parallel_gnu.sh`:
363
+ ```bash
364
+ # Syntax:
365
+ bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
366
+
367
+ # Conservative (safe for shared systems):
368
+ bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
369
+
370
+ # Aggressive (dedicated nodes):
371
+ bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
372
+ ```
373
+
374
+ ### Custom Validation
375
+
376
+ Run validation on specific steps or with custom tolerances:
377
+ ```bash
378
+ # Validate only data conversion step
379
+ python check_soln.py --out_dir results/ --step 2
380
+
381
+ # Check multiple specific steps
382
+ python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
383
+ ```
384
+
385
+ ### Log Analysis Pipeline
386
+
387
+ ```bash
388
+ # 1. Run tests
389
+ bash test_models_parallel_gnu.sh experiment1 5 5
390
+
391
+ # 2. Analyze logs with LLM
392
+ python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
393
+
394
+ # 3. Categorize errors
395
+ python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
396
+
397
+ # 4. Generate visualizations
398
+ jupyter notebook error_analysis.ipynb
399
+ ```
400
+
401
+ ---
402
+
403
+ ## Roadmap and Future Directions
404
+
405
+ ### Planned Improvements
406
+
407
+ **Prompt Engineering:**
408
+ - Auto-load context (file lists, logs) at step start
409
+ - Provide comprehensive inputs/outputs/summaries upfront
410
+ - Develop prompt-management layer for cross-analysis reuse
411
+
412
+ **Validation & Monitoring:**
413
+ - Embed validation in workflows for immediate error detection
414
+ - Record input/output and state transitions for reproducibility
415
+ - Enhanced situation awareness through comprehensive logging
416
+
417
+ **Multi-Analysis Extension:**
418
+ - Rerun H→γγ with improved system prompts
419
+ - Extend to H→4ℓ and other Higgs+X channels
420
+ - Provide learned materials from previous analyses as reference
421
+
422
+ **Self-Improvement:**
423
+ - Reinforcement learning–style feedback loops
424
+ - Agent-driven prompt refinement
425
+ - Automatic generalization across HEP analyses
426
+
427
+ ---
428
+
429
+ ## Citation and Acknowledgments
430
+
431
+ This framework tests LLM agents on ATLAS Open Data from:
432
+ - 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
433
+
434
+ Models tested via CBORG API (Lawrence Berkeley National Laboratory).
435
+
436
+ ---
437
+
438
+ ## Support and Contributing
439
+
440
+ For questions or issues:
441
+ 1. Check existing documentation in `*.md` files
442
+ 2. Review example configurations in `config.yml`
443
+ 3. Examine validation logs in output directories
444
+
445
+ For contributions, please ensure:
446
+ - Model lists end with blank lines
447
+ - Prompts follow established format
448
+ - Validation passes for all test cases
check_cborg_routing.py ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Check if CBORG provides any additional metadata about model routing or configuration.
4
+ """
5
+ import os
6
+ from openai import OpenAI
7
+
8
+ api_key = os.environ.get('CBORG_API_KEY')
9
+ if not api_key:
10
+ print("Error: CBORG_API_KEY not set")
11
+ exit(1)
12
+
13
+ client = OpenAI(
14
+ api_key=api_key,
15
+ base_url="https://api.cborg.lbl.gov"
16
+ )
17
+
18
+ models = ["openai/o:latest", "openai/o3"]
19
+
20
+ for model in models:
21
+ print(f"\n{'='*80}")
22
+ print(f"Testing: {model}")
23
+ print('='*80)
24
+
25
+ # Try multiple calls to see if there's any variation
26
+ for i in range(3):
27
+ response = client.chat.completions.create(
28
+ model=model,
29
+ messages=[{"role": "user", "content": "Hi"}],
30
+ max_tokens=5,
31
+ temperature=1.0,
32
+ )
33
+
34
+ print(f"\nCall {i+1}:")
35
+ print(f" Response ID: {response.id}")
36
+ print(f" Model: {response.model}")
37
+ print(f" System Fingerprint: {response.system_fingerprint}")
38
+ print(f" Created: {response.created}")
39
+
40
+ # Check for any provider-specific fields
41
+ if hasattr(response.choices[0], 'provider_specific_fields'):
42
+ print(f" Provider fields: {response.choices[0].provider_specific_fields}")
43
+
44
+ # Check response headers if available
45
+ if hasattr(response, '_headers'):
46
+ print(f" Headers: {response._headers}")
47
+
48
+ print("\n" + "="*80)
49
+ print("CONCLUSION:")
50
+ print("="*80)
51
+ print("Both models route to the same backend (azure/o3-2025-04-16)")
52
+ print("No configuration differences detected in API responses")
53
+ print("\nThe performance differences in your dataset are due to:")
54
+ print(" 1. Different experimental runs (different timestamps)")
55
+ print(" 2. Natural variability in model outputs")
56
+ print(" 3. Possibly different trial conditions or prompts")
57
+ print("\nCBORG appears to treat both as aliases to the same deployment.")
check_soln.py ADDED
@@ -0,0 +1,812 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import sys
3
+ import numpy as np
4
+ import matplotlib.pyplot as plt
5
+
6
+ # ATLAS style only needed for plotting
7
+ try:
8
+ import atlas_mpl_style as ampl
9
+ ampl.use_atlas_style()
10
+ plt.rcParams['font.family'] = 'DejaVu Sans'
11
+ except ImportError:
12
+ print("Warning: ATLAS style not available, using default matplotlib style")
13
+ plt.style.use('default')
14
+
15
+ # Plotting helpers are not used in array-only validation, keep import disabled to reduce deps
16
+ # from utils_plot import plot_myy_comparison, plot_scores_comparison
17
+
18
+ import argparse
19
+ parser = argparse.ArgumentParser()
20
+ add_arg = parser.add_argument
21
+ add_arg('--out_dir', help='output directory')
22
+ add_arg('--step', type=int, choices=[1, 2, 3, 4, 5],
23
+ help='Validate only specific step (1-5)')
24
+ args = parser.parse_args()
25
+ out_dir = args.out_dir
26
+ specific_step = args.step
27
+
28
+
29
+ def arrays_match(generated, reference, name: str, atol: float = 1e-10) -> bool:
30
+ """
31
+ Compare two numpy arrays element-wise with a strict absolute tolerance.
32
+ - NaNs are considered equal when they appear at the same positions.
33
+ - rtol is set to 0.0 so only absolute tolerance matters.
34
+ Prints a concise status and returns True/False.
35
+ """
36
+ print(f"Validating {name}...")
37
+ if generated.shape != reference.shape:
38
+ print(f" ❌ Shape mismatch: {generated.shape} vs {reference.shape}")
39
+ return False
40
+ ok = np.allclose(generated, reference, rtol=0.0, atol=atol, equal_nan=True)
41
+ if ok:
42
+ print(f" βœ… {name} matches (atol={atol})")
43
+ return True
44
+ # Brief diff stats to aid debugging
45
+ nan_mask_equal = np.array_equal(np.isnan(generated), np.isnan(reference))
46
+ finite = (~np.isnan(generated)) & (~np.isnan(reference))
47
+ mismatches = int(np.sum(generated[finite] != reference[finite]))
48
+ print(f" ❌ {name} differs: NaN mask equal={nan_mask_equal}, finite mismatches={mismatches}/{int(finite.sum())}")
49
+ if finite.any():
50
+ diffs = np.abs(generated[finite] - reference[finite])
51
+ print(f" diff stats: max={diffs.max():.6g}, mean={diffs.mean():.6g}")
52
+ # Additional debug: show sample mismatches
53
+ print("πŸ” Running detailed mismatch analysis...")
54
+ analyze_array_differences(generated, reference, name)
55
+ return False
56
+
def calculate_adaptive_tolerance(values, significant_digits=4):
    """
    Calculate an adaptive tolerance from the magnitude of each value so that
    roughly the desired number of significant digits is preserved.
    Each tolerance is value / 10**significant_digits (a conservative approximation).

    Examples:
    - Value 123000 with 4 sig digits: tolerance = 12.3 (123000 / 1e4)
    - Value 0.00014 with 4 sig digits: tolerance = 1.4e-8 (0.00014 / 1e4)
    - Value 0: tolerance = 1e-10 (small default)
    """
    # Handle zero values
    non_zero_mask = values != 0
    tolerances = np.full_like(values, 1e-10, dtype=float)  # Default for zeros

    if np.any(non_zero_mask):
        # Tolerance = value / 10^(significant_digits),
        # which preserves the desired number of significant digits
        abs_values = np.abs(values[non_zero_mask])
        tolerances[non_zero_mask] = abs_values / (10 ** significant_digits)

    return tolerances
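# Quick sanity check (toy values for illustration):
#   calculate_adaptive_tolerance(np.array([123000.0, 0.0]))
#   -> array([1.23e+01, 1.00e-10])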

def analyze_array_differences(generated, reference, array_name, significant_digits=4):
    """
    Analyze differences between generated and reference numpy arrays.
    Uses an adaptive tolerance based on significant digits rather than a fixed tolerance.
    """
    print(f"\n🔍 Detailed analysis for {array_name} (using {significant_digits} significant digit tolerance):")
    print(f"   Generated shape: {generated.shape}, Reference shape: {reference.shape}")
    print(f"   Tolerance: Adaptive based on {significant_digits} significant digits per value")

    # Check for shape differences first
    if generated.shape != reference.shape:
        print(f"   ❌ Shape mismatch: {generated.shape} vs {reference.shape}")
        return

    # Calculate adaptive tolerances for each element
    combined_values = np.abs(np.concatenate([generated.flatten(), reference.flatten()]))
    adaptive_tolerances = calculate_adaptive_tolerance(combined_values, significant_digits)

    # Reshape tolerances to match original arrays
    atol_array = adaptive_tolerances[:generated.size].reshape(generated.shape)

    # Use absolute tolerance only (relative tolerance not used)

    # Find differences and identify where tolerances are exceeded
    diff = generated - reference
    abs_diff = np.abs(diff)
    not_close = abs_diff > atol_array
    # Remove any comparisons involving NaNs (gen or ref)
    invalid = np.isnan(generated) | np.isnan(reference)
    not_close = not_close & ~invalid

    total_different = np.sum(not_close)

    if total_different == 0:
        print("   ✅ All elements match within tolerance")
        return

    print(f"   ❌ {total_different} elements differ (out of {generated.size} total)")

    # Show numeric mismatches only (exclude any NaN comparisons)
    flat_gen = generated.flatten()
    flat_ref = reference.flatten()
    flat_not_close = not_close.flatten()
    # Mask to include only finite mismatches
    numeric_mask = (~np.isnan(flat_gen)) & (~np.isnan(flat_ref))
    mismatch_mask = flat_not_close & numeric_mask
    if np.any(mismatch_mask):
        diff_indices = np.where(mismatch_mask)[0][:10]
        print("   📊 Sample numeric mismatches (first 10 indices):")
        for idx in diff_indices:
            gen_val = flat_gen[idx]
            ref_val = flat_ref[idx]
            diff_val = gen_val - ref_val
            print(f"     Index {idx}: gen={gen_val}, ref={ref_val}, diff={diff_val}")
    else:
        print("   ✅ No numeric mismatches (all differences involve NaNs)")

    # Skip overall statistics for now - they may not be meaningful for all data types

    # Analyze differences by column (if 2D array)
    if generated.ndim == 2:
        col_diffs = np.sum(not_close, axis=0)
        cols_with_diffs = np.where(col_diffs > 0)[0]

        if len(cols_with_diffs) > 0:
            print(f"   📊 Columns with differences: {cols_with_diffs[:10]} (showing first 10)")

            # Show side-by-side entries for first 10 differing columns
            num_cols_to_show = min(10, len(cols_with_diffs))
            num_rows_to_show = min(5, generated.shape[0])  # Show first 5 rows

            print(f"   📋 Sample entries (first {num_rows_to_show} rows, first {num_cols_to_show} differing columns):")
            print("   Row | Column | Generated Value | Reference Value | Difference")
            print("   ----|--------|----------------|-----------------|------------")

            for col_idx in cols_with_diffs[:num_cols_to_show]:
                for row_idx in range(num_rows_to_show):
                    gen_val = generated[row_idx, col_idx]
                    ref_val = reference[row_idx, col_idx]
                    diff = gen_val - ref_val

                    # Format values nicely
                    gen_str = f"{gen_val:.6g}" if not np.isnan(gen_val) else "NaN"
                    ref_str = f"{ref_val:.6g}" if not np.isnan(ref_val) else "NaN"
                    diff_str = f"{diff:.6g}" if not np.isnan(diff) else "NaN"

                    print(f"   {row_idx:3d} | {col_idx:3d} | {gen_str:>14} | {ref_str:>15} | {diff_str:>10}")
        else:
            print("   ✅ All columns match within tolerance")
    else:
        print("   📊 1D array - no column-by-column analysis needed")

    # Check for special values - only warn if there's a significant difference
    nan_gen = np.sum(np.isnan(generated))
    nan_ref = np.sum(np.isnan(reference))

    if nan_gen > 1000 or nan_ref > 1000:  # Only show if significant number of NaNs
        # Check if NaN counts are very similar (within 1% difference)
        if nan_gen > 0 and nan_ref > 0:
            nan_ratio = min(nan_gen, nan_ref) / max(nan_gen, nan_ref)
            if nan_ratio > 0.99:  # NaN counts are essentially identical
                print("   ✅ Data structure consistency: Identical NaN patterns in generated and reference files")
                print(f"      - Both files have {nan_gen:,} NaN values (excellent consistency)")
            else:
                print("   ⚠️ Special values detected:")
                if nan_gen > 1000:
                    print(f"      - NaN in generated: {nan_gen:,}")
                if nan_ref > 1000:
                    print(f"      - NaN in reference: {nan_ref:,}")
        else:
            print("   ⚠️ Special values detected:")
            if nan_gen > 1000:
                print(f"      - NaN in generated: {nan_gen:,}")
            if nan_ref > 1000:
                print(f"      - NaN in reference: {nan_ref:,}")

def validate_root_summary(llm_content, ref_content):
    """
    Validate root_summary.txt content by checking that all required branch names are present.
    Focus on content (branch names) rather than exact format structure.
    """
    try:
        # Extract all branch names from LLM content
        llm_branches = set(extract_branch_names(llm_content))

        # Required branches that must be present
        required_branches = {
            'SumWeights', 'XSection', 'channelNumber', 'ditau_m', 'eventNumber',
            'jet_E', 'jet_MV2c10', 'jet_eta', 'jet_jvt', 'jet_n', 'jet_phi', 'jet_pt',
            'jet_pt_syst', 'jet_trueflav', 'jet_truthMatched', 'largeRjet_D2', 'largeRjet_E',
            'largeRjet_eta', 'largeRjet_m', 'largeRjet_n', 'largeRjet_phi', 'largeRjet_pt',
            'largeRjet_pt_syst', 'largeRjet_tau32', 'largeRjet_truthMatched', 'lep_E',
            'lep_charge', 'lep_eta', 'lep_etcone20', 'lep_isTightID', 'lep_n', 'lep_phi',
            'lep_pt', 'lep_pt_syst', 'lep_ptcone30', 'lep_trackd0pvunbiased',
            'lep_tracksigd0pvunbiased', 'lep_trigMatched', 'lep_truthMatched', 'lep_type',
            'lep_z0', 'mcWeight', 'met_et', 'met_et_syst', 'met_phi', 'photon_E',
            'photon_convType', 'photon_eta', 'photon_etcone20', 'photon_isTightID', 'photon_n',
            'photon_phi', 'photon_pt', 'photon_pt_syst', 'photon_ptcone30', 'photon_trigMatched',
            'photon_truthMatched', 'runNumber', 'scaleFactor_BTAG', 'scaleFactor_ELE',
            'scaleFactor_LepTRIGGER', 'scaleFactor_MUON', 'scaleFactor_PHOTON', 'scaleFactor_PILEUP',
            'scaleFactor_PhotonTRIGGER', 'scaleFactor_TAU', 'tau_BDTid', 'tau_E', 'tau_charge',
            'tau_eta', 'tau_isTightID', 'tau_n', 'tau_nTracks', 'tau_phi', 'tau_pt',
            'tau_pt_syst', 'tau_trigMatched', 'tau_truthMatched', 'trigE', 'trigM', 'trigP'
        }

        print(f"   📊 LLM output has {len(llm_branches)} unique words, Required: {len(required_branches)} branches")

        # Debug: Show all required branch names found in txt file
        found_required_branches = required_branches & llm_branches
        if found_required_branches:
            sorted_found = sorted(found_required_branches)
            print(f"   🔍 Required branch names found in txt file: {', '.join(sorted_found)}")

        # Check if we have any branches at all
        if len(llm_branches) == 0:
            print("   ❌ No branches found in LLM output")
            return False

        # Check if all required branches are present
        missing_branches = required_branches - llm_branches

        if missing_branches:
            print(f"   ❌ Missing {len(missing_branches)} required branches:")
            for branch in sorted(missing_branches):
                print(f"      - {branch}")
            return False
        else:
            print("   ✅ All required branches present in LLM output")
            return True

    except Exception as e:
        print(f"   ❌ Error parsing root_summary: {e}")
        return False

def extract_branch_names(content):
    """
    Extract all word tokens from root_summary.txt content.
    The file is split into tokens and branch names are then looked up as whole words;
    underscores are kept inside a token, while punctuation such as dots acts as a separator.
    """
    import re

    # Split content into words; \w+ keeps underscores within a single token
    words = re.findall(r'\b\w+\b', content)

    # Convert to a set to remove duplicates and for fast lookup
    return set(words)

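# e.g. extract_branch_names("Branches: jet_pt, jet_eta") -> {'Branches', 'jet_pt', 'jet_eta'}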
def parse_root_summary(content):
    """
    Parse root_summary.txt content into structured data.
    Supports both the reference format (File 1:, File 2:, etc.) and the LLM format (single file summary).
    """
    files = {}
    current_file = None
    lines = content.split('\n')
    i = 0

    while i < len(lines):
        line = lines[i].strip()

        # Look for file headers in reference format
        if line.startswith('File ') and ':' in line:
            # Extract filename
            parts = line.split(': ')
            if len(parts) >= 2:
                filename = parts[1].strip()
                current_file = filename
                files[current_file] = {
                    'total_objects': 0,
                    'trees': 0,
                    'entries': 0,
                    'total_branches': 0,
                    'branches': {}
                }

        # Look for LLM format header (alternative format)
        elif line.startswith('Root file: ') and ':' in line:
            # Extract filename from path
            parts = line.split(': ')
            if len(parts) >= 2:
                full_path = parts[1].strip()
                filename = os.path.basename(full_path)
                current_file = filename
                files[current_file] = {
                    'total_objects': 1,  # Assume 1 tree
                    'trees': 1,
                    'entries': 0,  # Will be set if found
                    'total_branches': 0,
                    'branches': {}
                }

        # Parse file data
        elif current_file and current_file in files:
            if 'Total objects:' in line:
                try:
                    files[current_file]['total_objects'] = int(line.split(':')[1].strip())
                except Exception:
                    pass
            elif 'Trees found:' in line:
                try:
                    files[current_file]['trees'] = int(line.split(':')[1].strip())
                except Exception:
                    pass
            elif 'Entries:' in line:
                try:
                    files[current_file]['entries'] = int(line.split(':')[1].strip())
                except Exception:
                    pass
            elif 'Common branches (' in line and ')' in line:
                # Extract total branch count from the common branches section
                try:
                    count_part = line.split('(')[1].split(')')[0]
                    # This sets the total for all files since they're common
                    common_branch_count = int(count_part)
                    # Set this for all existing files
                    for filename in files:
                        files[filename]['total_branches'] = common_branch_count
                except Exception:
                    pass

                # Parse branch categories
                branches = {}
                j = i + 1
                while j < len(lines) and not lines[j].strip().startswith('='):
                    branch_line = lines[j].strip()
                    if ': ' in branch_line:
                        category, branch_list = branch_line.split(': ', 1)
                        category = category.strip().lower()
                        branch_names = [b.strip() for b in branch_list.split(',')]
                        branches[category] = branch_names
                    j += 1

                files[current_file]['branches'] = branches
                i = j - 1  # Skip the lines we already processed

            # Handle LLM format branch parsing (with - prefix)
            elif line == 'TTree: mini':
                # Count branches in LLM format
                branches = {}
                branch_lines = []
                j = i + 1
                while j < len(lines) and lines[j].strip() and not lines[j].strip().startswith('='):
                    branch_line = lines[j].strip()
                    if branch_line.startswith('Branches:'):
                        # Skip the "Branches:" header
                        j += 1
                        continue
                    elif branch_line.startswith('- '):
                        # Extract branch name from "- branch_name" format
                        branch_name = branch_line.replace('- ', '').strip()
                        branch_lines.append(branch_name)
                    j += 1

                # Categorize branches for LLM format
                photon_branches = []
                jet_branches = []
                met_branches = []
                lep_branches = []
                tau_branches = []
                event_branches = []
                weights_branches = []

                for branch in branch_lines:
                    if branch.startswith('photon_'):
                        photon_branches.append(branch)
                    elif branch.startswith('jet_'):
                        jet_branches.append(branch)
                    elif branch.startswith('met_'):
                        met_branches.append(branch)
                    elif branch.startswith('lep_'):
                        lep_branches.append(branch)
                    elif branch.startswith('tau_'):
                        tau_branches.append(branch)
                    elif branch in ['runNumber', 'eventNumber', 'channelNumber', 'mcWeight', 'trigE', 'trigM', 'trigP', 'ditau_m']:
                        event_branches.append(branch)
                    elif branch in ['SumWeights', 'XSection'] or branch.startswith('scaleFactor_') or branch.startswith('largeRjet_'):
                        weights_branches.append(branch)

                if photon_branches:
                    branches['photon'] = photon_branches
                if jet_branches:
                    branches['jet'] = jet_branches
                if met_branches:
                    branches['met'] = met_branches
                if lep_branches:
                    branches['lep'] = lep_branches
                if tau_branches:
                    branches['tau'] = tau_branches
                if event_branches:
                    branches['event'] = event_branches
                if weights_branches:
                    branches['weights'] = weights_branches

                files[current_file]['branches'] = branches
                files[current_file]['total_branches'] = len(branch_lines)
                i = j - 1  # Skip the lines we already processed

        i += 1

    return files

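# Sketch of the reference-format header this parser expects (field values are illustrative):
#   File 1: data_A.root
#   Total objects: 1
#   Trees found: 1
#   Entries: 123456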
# Load reference solution files for steps 1 and 2 - only load what's needed
# This will be done after mode detection below

# Load existing reference files for steps 3, 4, 5
signal_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/signal.npy')
bkgd_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/bkgd.npy')
signal_scores_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/signal_scores.npy')
bkgd_scores_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/bkgd_scores.npy')
boundaries_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/boundaries.npy')
significances_soln = np.load('/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/significances.npy')

base_dir = os.path.join(out_dir, 'arrays')

missing_file_1 = False  # Step 1: summarize_root files
missing_file_2 = False  # Step 2: create_numpy files
missing_file_3 = False  # Step 3: preprocess files
missing_file_4 = False  # Step 4: scores files
missing_file_5 = False  # Step 5: categorization files

# Step 1: Check summarize_root outputs (file_list.txt, root_summary.txt)
if not specific_step or specific_step == 1:
    file_list_llm_path = os.path.join(out_dir, 'logs', 'file_list.txt')
    root_summary_llm_path = os.path.join(out_dir, 'logs', 'root_summary.txt')
    # Note: create_numpy_modified.txt comes from the insert_root_summary rule (no LLM), so we don't validate it for step 1

    if not (os.path.exists(file_list_llm_path) and os.path.exists(root_summary_llm_path)):
        if not specific_step or specific_step == 1:
            print("Step 1 (summarize_root) outputs missing")
        missing_file_1 = True

# Step 2: Check create_numpy outputs (data_A_raw.npy and signal_WH_raw.npy)
if not specific_step or specific_step == 2:
    # Check for the specific files requested: data_A_raw.npy and signal_WH_raw.npy
    data_A_raw_llm_path = os.path.join(base_dir, 'data_A_raw.npy')
    signal_WH_raw_llm_path = os.path.join(base_dir, 'signal_WH_raw.npy')

    if os.path.exists(data_A_raw_llm_path) and os.path.exists(signal_WH_raw_llm_path):
        data_raw_llm = np.load(data_A_raw_llm_path)
        signal_raw_llm = np.load(signal_WH_raw_llm_path)
        if not specific_step or specific_step == 2:
            print("Found required files: data_A_raw.npy and signal_WH_raw.npy")
    else:
        if not specific_step or specific_step == 2:
            print("Step 2 (create_numpy) outputs missing - data_A_raw.npy and/or signal_WH_raw.npy not found")
        missing_file_2 = True

# Step 3: Check preprocess outputs (signal.npy, bkgd.npy)
if not specific_step or specific_step == 3:
    signal_llm_path = os.path.join(base_dir, 'signal.npy')
    if os.path.exists(signal_llm_path):
        signal_llm = np.load(signal_llm_path)
    else:
        if not specific_step or specific_step == 3:
            print("LLM generated signal sample does not exist (Step 3)")
        missing_file_3 = True

    bkgd_llm_path = os.path.join(base_dir, 'bkgd.npy')
    if os.path.exists(bkgd_llm_path):
        bkgd_llm = np.load(bkgd_llm_path)
    else:
        if not specific_step or specific_step == 3:
            print("LLM generated background sample does not exist (Step 3)")
        missing_file_3 = True

# Step 4: Check scores outputs (signal_scores.npy, bkgd_scores.npy)
if not specific_step or specific_step == 4:
    signal_scores_llm_path = os.path.join(base_dir, 'signal_scores.npy')
    if os.path.exists(signal_scores_llm_path):
        signal_scores_llm = np.load(signal_scores_llm_path)
    else:
        if not specific_step or specific_step == 4:
            print("LLM generated signal scores do not exist (Step 4)")
        missing_file_4 = True

    bkgd_scores_llm_path = os.path.join(base_dir, 'bkgd_scores.npy')
    if os.path.exists(bkgd_scores_llm_path):
        bkgd_scores_llm = np.load(bkgd_scores_llm_path)
    else:
        if not specific_step or specific_step == 4:
            print("LLM generated background scores do not exist (Step 4)")
        missing_file_4 = True

# Step 5: Check categorization outputs (boundaries.npy, significances.npy)
if not specific_step or specific_step == 5:
    boundaries_llm_path = os.path.join(base_dir, 'boundaries.npy')
    if os.path.exists(boundaries_llm_path):
        boundaries_llm = np.load(boundaries_llm_path)
    else:
        if not specific_step or specific_step == 5:
            print("LLM generated boundaries do not exist (Step 5)")
        missing_file_5 = True

    significances_llm_path = os.path.join(base_dir, 'significances.npy')
    if os.path.exists(significances_llm_path):
        significances_llm = np.load(significances_llm_path)
    else:
        if not specific_step or specific_step == 5:
            print("LLM generated significances do not exist (Step 5)")
        missing_file_5 = True

# Step 2: Check create_numpy outputs (data_A_raw.npy and signal_WH_raw.npy)
signal_raw_llm_path = os.path.join(base_dir, 'signal_raw.npy')
data_raw_llm_path = os.path.join(base_dir, 'data_raw.npy')

# Check for the specific files requested: data_A_raw.npy and signal_WH_raw.npy
data_A_raw_llm_path = os.path.join(base_dir, 'data_A_raw.npy')
signal_WH_raw_llm_path = os.path.join(base_dir, 'signal_WH_raw.npy')

if os.path.exists(data_A_raw_llm_path) and os.path.exists(signal_WH_raw_llm_path):
    data_raw_llm = np.load(data_A_raw_llm_path)
    signal_raw_llm = np.load(signal_WH_raw_llm_path)
else:
    missing_file_2 = True

# Load reference files for Step 2 validation
selective_refs_loaded = False
standard_refs_loaded = False

data_A_raw_soln_path = '/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/data_A_raw.npy'
signal_WH_raw_soln_path = '/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/signal_WH_raw.npy'
signal_raw_soln_path = '/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/signal_raw.npy'
data_raw_soln_path = '/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/data_raw.npy'

# Try to load selective reference files first
if os.path.exists(data_A_raw_soln_path):
    data_A_raw_soln = np.load(data_A_raw_soln_path)
    selective_refs_loaded = True
if os.path.exists(signal_WH_raw_soln_path):
    signal_WH_raw_soln = np.load(signal_WH_raw_soln_path)
    selective_refs_loaded = True

# Also try to load standard reference files
if os.path.exists(signal_raw_soln_path):
    signal_raw_soln = np.load(signal_raw_soln_path)
    standard_refs_loaded = True
if os.path.exists(data_raw_soln_path):
    data_raw_soln = np.load(data_raw_soln_path)
    standard_refs_loaded = True

# Step 3: Check preprocess outputs (signal.npy, bkgd.npy)
signal_llm_path = os.path.join(base_dir, 'signal.npy')
if os.path.exists(signal_llm_path):
    signal_llm = np.load(signal_llm_path)
else:
    missing_file_3 = True

bkgd_llm_path = os.path.join(base_dir, 'bkgd.npy')
if os.path.exists(bkgd_llm_path):
    bkgd_llm = np.load(bkgd_llm_path)
else:
    missing_file_3 = True

# Step 4: Check scores outputs (signal_scores.npy, bkgd_scores.npy)
signal_scores_llm_path = os.path.join(base_dir, 'signal_scores.npy')
if os.path.exists(signal_scores_llm_path):
    signal_scores_llm = np.load(signal_scores_llm_path)
else:
    missing_file_4 = True

bkgd_scores_llm_path = os.path.join(base_dir, 'bkgd_scores.npy')
if os.path.exists(bkgd_scores_llm_path):
    bkgd_scores_llm = np.load(bkgd_scores_llm_path)
else:
    missing_file_4 = True

# Step 5: Check categorization outputs (boundaries.npy, significances.npy)
boundaries_llm_path = os.path.join(base_dir, 'boundaries.npy')
if os.path.exists(boundaries_llm_path):
    boundaries_llm = np.load(boundaries_llm_path)
else:
    missing_file_5 = True

significances_llm_path = os.path.join(base_dir, 'significances.npy')
if os.path.exists(significances_llm_path):
    significances_llm = np.load(significances_llm_path)
else:
    missing_file_5 = True

"""
Plotting and derived checks removed per request: validation for steps 2–5 now does
direct array comparisons only (generated vs reference).
"""

step1_success = False
step2_success = False
step3_success = False
step4_success = False
step5_success = False

# Step 1 validation (summarize_root outputs)
if (not specific_step or specific_step == 1) and not missing_file_1:
    try:
        print("=== Step 1 Validation (summarize_root) ===")
        # Load reference files for comparison
        ref_file_list_path = '/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays/file_list.txt'
        # ref_root_summary_path no longer needed since we don't compare to reference

        # Load LLM-generated files
        with open(file_list_llm_path, 'r') as f:
            file_list_llm = f.read()
        with open(root_summary_llm_path, 'r') as f:
            root_summary_llm = f.read()

        # Standard mode: compare content with reference
        if os.path.exists(ref_file_list_path):
            with open(ref_file_list_path, 'r') as f:
                ref_file_list = f.read()

            # Extract filenames from both files for comparison
            # Handle both full paths and just filenames
            def extract_filenames(content):
                lines = [line.strip() for line in content.strip().split('\n') if line.strip()]
                filenames = []
                for line in lines:
                    # Extract filename from path or use as-is
                    filename = os.path.basename(line) if '/' in line else line
                    filenames.append(filename)
                return sorted(filenames)

            llm_filenames = extract_filenames(file_list_llm)
            ref_filenames = extract_filenames(ref_file_list)
            file_list_match = llm_filenames == ref_filenames

            if not file_list_match:
                print(f"   📊 LLM files: {len(llm_filenames)} | Reference files: {len(ref_filenames)}")
                if len(llm_filenames) != len(ref_filenames):
                    print(f"   ❌ File count mismatch: {len(llm_filenames)} vs {len(ref_filenames)}")
                else:
                    # Show first few differences
                    for i, (llm_file, ref_file) in enumerate(zip(llm_filenames, ref_filenames)):
                        if llm_file != ref_file:
                            print(f"   ❌ File {i+1} mismatch: '{llm_file}' vs '{ref_file}'")
                            break
        else:
            file_list_match = True  # No reference to compare

        # Use detailed root_summary validation
        # Only check that required branches are present (no reference comparison needed)
        root_summary_match = validate_root_summary(root_summary_llm, "")

        step1_success = file_list_match and root_summary_match
        # Removed duplicate printing - summary will be shown in the VALIDATION SUMMARY section
    except Exception as e:
        print(f"Error in Step 1 validation: {e}")
        step1_success = False

# Step 2 validation (create_numpy outputs) - direct array comparisons
if (not specific_step or specific_step == 2) and not missing_file_2:
    print("=== Step 2 Validation (create_numpy) ===")
    # Choose reference arrays: prefer selective names, fall back to standard
    data_ref = None
    signal_ref = None
    if 'data_A_raw_soln' in globals():
        data_ref = data_A_raw_soln
    elif 'data_raw_soln' in globals():
        data_ref = data_raw_soln
    if 'signal_WH_raw_soln' in globals():
        signal_ref = signal_WH_raw_soln
    elif 'signal_raw_soln' in globals():
        signal_ref = signal_raw_soln

    ok_data = False
    ok_signal = False
    if data_ref is not None:
        ok_data = arrays_match(data_raw_llm, data_ref, "data_A_raw.npy (or data_raw.npy)")
    else:
        print("   ❌ Missing data reference array (data_A_raw.npy or data_raw.npy)")
    if signal_ref is not None:
        ok_signal = arrays_match(signal_raw_llm, signal_ref, "signal_WH_raw.npy (or signal_raw.npy)")
    else:
        print("   ❌ Missing signal reference array (signal_WH_raw.npy or signal_raw.npy)")
    step2_success = ok_data and ok_signal
    print(f"Step 2 validation: {'PASS' if step2_success else 'FAIL'}")

# Step 3 validation (preprocess outputs) - direct array comparisons
if (not specific_step or specific_step == 3) and not missing_file_3:
    print("=== Step 3 Validation (preprocess) ===")
    ok_signal = arrays_match(signal_llm, signal_soln, "signal.npy")
    ok_bkgd = arrays_match(bkgd_llm, bkgd_soln, "bkgd.npy")
    step3_success = ok_signal and ok_bkgd

# Step 4 validation (scores) - direct array comparisons
if (not specific_step or specific_step == 4) and not missing_file_4:
    print("=== Step 4 Validation (scores) ===")
    ok_sig_scores = arrays_match(signal_scores_llm, signal_scores_soln, "signal_scores.npy")
    ok_bkg_scores = arrays_match(bkgd_scores_llm, bkgd_scores_soln, "bkgd_scores.npy")
    step4_success = ok_sig_scores and ok_bkg_scores

# Step 5 validation (categorization outputs) - direct array comparisons
if (not specific_step or specific_step == 5) and not missing_file_5:
    print("=== Step 5 Validation (categorization) ===")
    ok_boundaries = arrays_match(boundaries_llm, boundaries_soln, "boundaries.npy")
    ok_significances = arrays_match(significances_llm, significances_soln, "significances.npy")
    step5_success = ok_boundaries and ok_significances

# Save results
success_results = [int(step1_success), int(step2_success), int(step3_success), int(step4_success), int(step5_success)]
# np.save('success.npy', success_results)  # Removed - results are already printed to console

print("\n=== VALIDATION SUMMARY ===")
if specific_step:
    step_names = ["summarize_root", "create_numpy", "preprocess", "scores", "categorization"]
    step_name = step_names[specific_step - 1]
    print(f"Step: {specific_step} ({step_name})")
    if specific_step == 1:
        print("Files validated:")
        print("  • file_list.txt - List of processed ROOT files")
        print("  • root_summary.txt - Branch structure and file metadata")
    elif specific_step == 2:
        print("Files validated:")
        print("  • data_A_raw.npy - Raw data array (must have 46 columns)")
        print("  • signal_WH_raw.npy - Raw signal array (must have 46 columns)")
    elif specific_step == 3:
        print("Files validated:")
        print("  • signal.npy - Preprocessed signal events")
        print("  • bkgd.npy - Preprocessed background events")
        # print("Histograms validated:")
        # print("  • Signal m_yy histogram (10 bins, 123-127 GeV)")
        # print("  • Background m_yy histogram (100 bins, 105-160 GeV)")
        # print("  • Signal leading lepton pT histogram (10 bins, 25-300 GeV)")
        # print("  • Background leading lepton pT histogram (10 bins, 25-300 GeV)")
    elif specific_step == 4:
        print("Files validated:")
        print("  • signal_scores.npy - Signal event classification scores")
        print("  • bkgd_scores.npy - Background event classification scores")
    elif specific_step == 5:
        print("Files validated:")
        print("  • boundaries.npy - Category boundary thresholds")
        print("  • significances.npy - Statistical significance values")
else:
    print("All steps validated")

# Mode info removed; direct comparisons are used for all steps

# Show only the relevant step status
if specific_step:
    step_names = ["summarize_root", "create_numpy", "preprocess", "scores", "categorization"]
    step_name = step_names[specific_step - 1]

    if specific_step == 1 and not missing_file_1:
        status = "PASS" if step1_success else "FAIL"
    elif specific_step == 2 and not missing_file_2:
        status = "PASS" if step2_success else "FAIL"
    elif specific_step == 3 and not missing_file_3:
        status = "PASS" if step3_success else "FAIL"
    elif specific_step == 4 and not missing_file_4:
        status = "PASS" if step4_success else "FAIL"
    elif specific_step == 5 and not missing_file_5:
        status = "PASS" if step5_success else "FAIL"
    else:
        status = "MISSING"

    print(f"\nStep {specific_step} ({step_name}): {status}")

    if status == "PASS":
        print("✅ Validation successful")
    elif status == "FAIL":
        print("❌ Validation failed")
    else:
        print("⚠️ Step outputs missing")
else:
    # Show all steps for full validation
    step_status = []
    for i, (success, missing) in enumerate([(step1_success, missing_file_1),
                                            (step2_success, missing_file_2),
                                            (step3_success, missing_file_3),
                                            (step4_success, missing_file_4),
                                            (step5_success, missing_file_5)], 1):
        if missing:
            step_status.append("MISSING")
        elif success:
            step_status.append("PASS")
        else:
            step_status.append("FAIL")

    print(f"Step 1 (summarize_root): {step_status[0]}")
    print(f"Step 2 (create_numpy): {step_status[1]}")
    print(f"Step 3 (preprocess): {step_status[2]}")
    print(f"Step 4 (scores): {step_status[3]}")
    print(f"Step 5 (categorization): {step_status[4]}")

# Only count actually validated steps for overall success
if specific_step:
    validated_steps = 1
    passed_steps = 1 if success_results[specific_step-1] and not [missing_file_1, missing_file_2, missing_file_3, missing_file_4, missing_file_5][specific_step-1] else 0
    print(f"\nResult: {passed_steps}/{validated_steps} step passed")
else:
    validated_steps = sum(1 for missing in [missing_file_1, missing_file_2, missing_file_3, missing_file_4, missing_file_5] if not missing)
    passed_steps = sum(success_results)
    print(f"Overall success: {passed_steps}/{validated_steps} validated steps passed")
    print(f"Success array: {success_results}")

# At the end of the main script, exit zero so Run_SMK prints PASS/FAIL instead of 'failed to run'
sys.exit(0)
compare_model_configs.py ADDED
@@ -0,0 +1,189 @@
#!/usr/bin/env python3
"""
Compare two model variants to see if they have different configurations.
Usage:
    export CBORG_API_KEY=...
    python compare_model_configs.py openai/o3:latest openai/o3
"""
import os
import sys
import json
from openai import OpenAI


def test_model_detailed(client, model_id):
    """Test a model and return detailed response information."""
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": "What is 2+2?"}],
            max_tokens=10,
            temperature=1.0,  # Explicitly set
            top_p=1.0,        # Explicitly set
        )

        # Extract all available information
        info = {
            'model': response.model,
            'id': response.id,
            'created': response.created,
            'object': response.object,
            'system_fingerprint': getattr(response, 'system_fingerprint', None),
            'usage': {
                'prompt_tokens': response.usage.prompt_tokens,
                'completion_tokens': response.usage.completion_tokens,
                'total_tokens': response.usage.total_tokens,
            },
            'response_content': response.choices[0].message.content,
            'finish_reason': response.choices[0].finish_reason,
        }

        # Try to get any additional metadata
        try:
            info['raw_response'] = str(response)
        except Exception:
            pass

        return info, None
    except Exception as e:
        return None, str(e)
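# Example (the model id is illustrative):
#   info, err = test_model_detailed(client, "openai/o3")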

def main():
    if len(sys.argv) < 3:
        print("Usage: python compare_model_configs.py <model1> <model2>")
        print("Example: python compare_model_configs.py openai/o3:latest openai/o3")
        sys.exit(1)

    model1 = sys.argv[1]
    model2 = sys.argv[2]

    api_key = os.environ.get('CBORG_API_KEY')
    if not api_key:
        print("Error: CBORG_API_KEY environment variable not set.")
        sys.exit(1)

    client = OpenAI(
        api_key=api_key,
        base_url="https://api.cborg.lbl.gov"
    )

    print("=" * 100)
    print(f"COMPARING: {model1} vs {model2}")
    print("=" * 100)
    print()

    # Test model 1
    print(f"Testing {model1}...")
    info1, error1 = test_model_detailed(client, model1)

    if error1:
        print(f"❌ Error: {error1}")
        sys.exit(1)

    # Test model 2
    print(f"Testing {model2}...")
    info2, error2 = test_model_detailed(client, model2)

    if error2:
        print(f"❌ Error: {error2}")
        sys.exit(1)

    print()
    print("=" * 100)
    print("COMPARISON RESULTS")
    print("=" * 100)
    print()

    # Compare underlying models
    print("1. UNDERLYING MODEL:")
    print(f"   {model1:<30} → {info1['model']}")
    print(f"   {model2:<30} → {info2['model']}")
    if info1['model'] == info2['model']:
        print("   ✓ SAME underlying model")
    else:
        print("   ⚠️ DIFFERENT underlying models!")
    print()

    # Compare system fingerprints (if available)
    print("2. SYSTEM FINGERPRINT:")
    print(f"   {model1:<30} → {info1['system_fingerprint']}")
    print(f"   {model2:<30} → {info2['system_fingerprint']}")
    if info1['system_fingerprint'] == info2['system_fingerprint']:
        print("   ✓ SAME system fingerprint")
    elif info1['system_fingerprint'] is None or info2['system_fingerprint'] is None:
        print("   ⚠️ System fingerprint not available")
    else:
        print("   ⚠️ DIFFERENT system fingerprints!")
    print()

    # Compare token usage patterns
    print("3. TOKEN USAGE (for same prompt):")
    print(f"   {model1:<30} prompt={info1['usage']['prompt_tokens']}, completion={info1['usage']['completion_tokens']}")
    print(f"   {model2:<30} prompt={info2['usage']['prompt_tokens']}, completion={info2['usage']['completion_tokens']}")
    if info1['usage'] == info2['usage']:
        print("   ✓ IDENTICAL token usage")
    else:
        print("   ⚠️ Different token usage (could indicate different behavior)")
    print()

    # Compare responses
    print("4. RESPONSE CONTENT:")
    print(f"   {model1}: \"{info1['response_content']}\"")
    print(f"   {model2}: \"{info2['response_content']}\"")
    if info1['response_content'] == info2['response_content']:
        print("   ✓ IDENTICAL responses")
    else:
        print("   ⚠️ Different responses")
    print()

    # Show raw responses if available
    if 'raw_response' in info1:
        print("5. RAW RESPONSE MODEL 1:")
        print(f"   {info1['raw_response'][:500]}")
        print()
        print("6. RAW RESPONSE MODEL 2:")
        print(f"   {info2['raw_response'][:500]}")
        print()

    # Final verdict
    print("=" * 100)
    print("VERDICT:")
    print("=" * 100)

    same_count = 0
    total_count = 4

    if info1['model'] == info2['model']:
        same_count += 1
    if info1['system_fingerprint'] == info2['system_fingerprint'] or \
       (info1['system_fingerprint'] is None and info2['system_fingerprint'] is None):
        same_count += 1
    if info1['usage'] == info2['usage']:
        same_count += 1
    if info1['response_content'] == info2['response_content']:
        same_count += 1

    print(f"Similarity: {same_count}/{total_count} metrics match")
    print()

    if same_count == total_count:
        print("✓ Models appear to be IDENTICAL")
        print("  → Same underlying model, same configuration")
        print("  → Likely just different aliases for the same deployment")
    elif info1['model'] == info2['model'] and same_count >= 2:
        print("⚠️ Models use the SAME base model but show some differences")
        print("  → Could be due to:")
        print("     - Different deployment instances")
        print("     - Randomness in generation")
        print("     - Different routing/load balancing")
    else:
        print("⚠️ Models appear to be DIFFERENT")
        print("  → Different configurations or versions")

    print()
    print("NOTE: In your dataset, these models have different performance because")
    print("      they represent different experimental runs, not necessarily different")
    print("      model configurations.")
    print("=" * 100)


if __name__ == '__main__':
    main()
config.example.yml ADDED
@@ -0,0 +1,53 @@
# Configuration file for llm4hep supervisor-coder framework
#
# This file controls the LLM models and parameters used for testing.
# Copy this file to config.yml and customize for your experiments.

# Supervisor model - analyzes tasks and provides instructions to the coder
supervisor: lbl/cborg-deepthought:latest

# Coder model - generates Python code based on supervisor instructions
coder: lbl/cborg-deepthought:latest

# Temperature for LLM generation (0.0 = deterministic, 1.0 = creative)
temperature: 0.0

# Optional: Maximum iterations per step (default: 10)
# Uncomment to limit supervisor-coder refinement loops
# max_iterations: 3

# Optional: Custom output directory
# Uncomment to specify where results should be saved
# out_dir: results/my_experiment

# Model Options:
# See CBORG_MODEL_MAPPINGS.md for available models including:
#
# Anthropic Claude:
#   - anthropic/claude-sonnet:latest
#   - anthropic/claude-opus:latest
#   - anthropic/claude-haiku:latest
#
# OpenAI:
#   - openai/gpt-5-mini
#   - openai/gpt-5
#   - openai/o3
#   - openai/o3-mini
#   - openai/o4-mini
#
# Google Gemini:
#   - google/gemini:latest
#   - google/gemini-flash
#
# xAI Grok:
#   - xai/grok:latest
#   - xai/grok-mini
#
# AWS/Meta Llama:
#   - aws/llama-4-maverick
#   - aws/llama-4-scout
#
# Other:
#   - deepseek-r1
#   - gcp/qwen-3
#   - gpt-oss-120b
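# Minimal sketch of how a script might read this file (PyYAML assumed, per environment.yml):
#   import yaml
#   with open("config.yml") as f:
#       cfg = yaml.safe_load(f)
#   supervisor, coder, temp = cfg["supervisor"], cfg["coder"], cfg["temperature"]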
config.yml ADDED
@@ -0,0 +1,3 @@
supervisor: lbl/cborg-deepthought:latest
coder: lbl/cborg-deepthought:latest
temperature: 0.0
environment.yml ADDED
@@ -0,0 +1,21 @@
name: llm_env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - root
  - numpy=1.26
  - pandas=2.1
  - matplotlib=3.8
  - uproot=5.6.3
  - pyyaml=6.0.2
  - snakemake
  - pip
  - pip:
      - openai
      - vector
      - httpx
      - tabpfn
      - scikit-learn
      - atlas-mpl-style
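# To create and activate this environment (standard conda commands):
#   conda env create -f environment.yml
#   conda activate llm_env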
error_analysis.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
error_analysis.py ADDED
@@ -0,0 +1,320 @@
import os
import pandas as pd
import re
import glob
from tqdm import tqdm
import datetime
import openai
import argparse
import io

def summarize_results(results_dirs, output_csv, model, no_llm=False):
    client = openai.OpenAI(
        api_key=os.environ.get('CBORG_API_KEY'),
        base_url='https://api.cborg.lbl.gov'
    )

    error_description_prompt = (
        "You are an expert assistant. Below is a comprehensive log of a multi-step workflow from a high energy physics analysis framework.\n\n"
        "The workflow includes:\n"
        "- A user provides an analysis task prompt.\n"
        "- A supervisor agent breaks down the task and instructs a coder agent.\n"
        "- The coder agent generates code, which is executed.\n"
        "- The supervisor reviews results and may iterate with the coder to fix issues until the task is complete.\n"
        "The log contains the user prompt, supervisor/coder dialogue, code, and execution outputs for all iterations.\n\n"
        "Your task: Summarize all errors encountered during the entire workflow in clear, concise language. "
        "Do NOT repeat or quote the log, prompt, or instructions. "
        "Do NOT include code, explanations, or any text except your error summary.\n\n"
        "For each error, use the following structure:\n"
        "- Error Type: [brief description of the nature of the error]\n"
        "- Cause: [if identifiable]\n"
        "- Responsible Party: [user, supervisor, coder, or external]\n"
        "- Consequence: [result or impact]\n"
        "- Context: [any important context]\n"
        "- Workflow Response: [Did the supervisor diagnose and address it? "
        "Did the coder attempt a fix? Was the fix successful, unsuccessful, or misdiagnosed? "
        "Was the error ignored or did it persist? Summarize the recovery process and its outcome for each error.]\n"
        "List each error as a separate bullet point using this template.\n"
        "If there is a validation error, look in the validation log and use the same structure to identify the causes of the validation error. "
        "If no errors occurred, respond: 'No errors found.'\n"
        "Do NOT include code, explanations, or any text except your error summary.\n"
        "Limit your entire summary to 3000 characters. "
        "If no errors occurred, respond: 'No errors found.'\n\n"
    )

    results = []
    for results_dir in results_dirs:
        for name in tqdm(os.listdir(results_dir), desc=f"generating error descriptions for {results_dir}"):
            output_dir = os.path.join(results_dir, name)

            if os.path.isdir(output_dir):
                # Extract config (everything before "_step")
                config_match = re.match(r'^(.*?)_step\d+', name)
                config = config_match.group(1) if config_match else None

                # Extract step (int after "_step")
                step_match = re.search(r'_step(\d+)', name)
                step = int(step_match.group(1)) if step_match else None

                result = {
                    "supervisor": None,
                    "coder": None,
                    "step": step,
                    "success": None,
                    "iterations": None,
                    "duration": None,
                    "API_calls": None,
                    "input_tokens": None,
                    "output_tokens": None,
                    "user_prompt_tokens": None,
                    "supervisor_to_coder_tokens": None,
                    "coder_output_tokens": None,
                    "feedback_to_supervisor_tokens": None,
                    "error": "Uncategorized",
                    "error_description": None,
                    "output_dir": output_dir,
                }

                log_dir = os.path.join(output_dir, "logs")
                if os.path.isdir(log_dir):
                    comp_log_files = glob.glob(os.path.join(log_dir, "*comprehensive_log.txt"))
                    comp_log_str = None
                    if comp_log_files:
                        with open(comp_log_files[0], "r") as f:
                            comp_log_str = f.read()
                    else:
                        result["success"] = False
                        result["error_description"] = "comprehensive log file not found"
                        results.append(result)
                        continue

                    supervisor_match = re.search(r"Supervisor:\s*([^\s]+)", comp_log_str)
                    coder_match = re.search(r"Coder:\s*([^\s]+)", comp_log_str)
                    if supervisor_match:
                        result["supervisor"] = supervisor_match.group(1)
                    if coder_match:
                        result["coder"] = coder_match.group(1)

                    iterations_match = re.search(r"Total Iterations:\s*(\d+)", comp_log_str)
                    if iterations_match:
                        result["iterations"] = int(iterations_match.group(1))

                    duration_match = re.search(r"Duration:\s*([0-9:.\s]+)", comp_log_str)
                    if duration_match:
                        duration_str = duration_match.group(1).strip()
                        try:
                            t = datetime.datetime.strptime(duration_str, "%H:%M:%S.%f")
                        except ValueError:
                            t = datetime.datetime.strptime(duration_str, "%H:%M:%S")
                        result["duration"] = t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

                    api_calls_match = re.search(r"Total API Calls:\s*(\d+)", comp_log_str)
                    if api_calls_match:
                        result["API_calls"] = int(api_calls_match.group(1))
                    input_tokens_match = re.search(r"Total Input Tokens:\s*(\d+)", comp_log_str)
                    if input_tokens_match:
                        result["input_tokens"] = int(input_tokens_match.group(1))
                    output_tokens_match = re.search(r"Total Output Tokens:\s*(\d+)", comp_log_str)
                    if output_tokens_match:
                        result["output_tokens"] = int(output_tokens_match.group(1))

                    match = re.search(r"User Prompt Tokens:\s*(\d+)", comp_log_str)
                    if match:
                        result["user_prompt_tokens"] = int(match.group(1))
                    match = re.search(r"Supervisor to Coder Tokens:\s*(\d+)", comp_log_str)
                    if match:
                        result["supervisor_to_coder_tokens"] = int(match.group(1))
                    match = re.search(r"Coder Output Tokens:\s*(\d+)", comp_log_str)
                    if match:
                        result["coder_output_tokens"] = int(match.group(1))
                    match = re.search(r"Feedback to Supervisor Tokens:\s*(\d+)", comp_log_str)
                    if match:
                        result["feedback_to_supervisor_tokens"] = int(match.group(1))

                    # Check validation.log to see if outputs are correct
                    val_log_files = glob.glob(os.path.join(log_dir, "*validation.log"))
                    val_log_str = None
                    if val_log_files:
                        with open(val_log_files[0], "r") as f:
                            val_log_str = f.read()
                        matches = re.findall(r'(✅ Validation successful|❌ Validation failed)', val_log_str)
                        if not matches:
                            result["success"] = False
                        else:
                            last = matches[-1]
                            result["success"] = last == "✅ Validation successful"
                        if no_llm:
                            if result["success"]:
                                result["error"] = None
                            else:
                                result["error"] = "Validation Error"
                        val_log_str = val_log_str.replace('\n', '').replace('\r', '')
                    else:
                        result["success"] = False
                        val_log_str = ""
                    if not no_llm:
                        try:
                            response = client.chat.completions.create(
                                model=model,
                                messages=[
                                    {
                                        'role': 'user',
                                        'content': error_description_prompt +
                                            "\nComprehensive Log:\n" + comp_log_str +
                                            "\nValidation Log:\n" + val_log_str
                                    }
                                ],
                                temperature=0.0
                            )
                            error_description = response.choices[-1].message.content
                            error_description = " ".join(error_description.split())
                            error_description = error_description[:3000]
                            result["error_description"] = error_description
                        except Exception as e:
                            print(f"OpenAI API error: {e}")
                    else:
                        if "API call failed" in comp_log_str:
                            result["error"] = "API Call Error"
                else:
                    result["success"] = False
                    result["error_description"] = "job submission failure"
                results.append(result)

    df = pd.DataFrame(results)
    df = df.sort_values(by=["supervisor", "coder", "step", "output_dir"])
    df.to_csv(output_csv, index=False)
    print(f"Results written to {output_csv}")
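# Example call (directory and file names are placeholders):
#   summarize_results(["results/exp1"], "results_summary.csv", model="gpt-oss-120b", no_llm=True)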


def categorize_errors(output_csv, model):
    client = openai.OpenAI(
        api_key=os.environ.get('CBORG_API_KEY'),
        base_url='https://api.cborg.lbl.gov'
    )

    # Load the CSV as a pandas DataFrame
    df = pd.read_csv(output_csv, comment='#')

    # Get the list of error descriptions and their indices (for mapping back)
    error_descriptions = df['error_description'].fillna("").tolist()

    # 1. Generate categories prompt
    create_categories_prompt = (
        "You are an expert at analyzing and organizing error messages from machine learning workflows in high energy physics.\n\n"
        "Workflow summary:\n"
        "- A user provides an analysis task prompt.\n"
        "- A supervisor agent breaks down the task and instructs a coder agent.\n"
        "- The coder agent generates code, which is executed.\n"
        "- The supervisor reviews results and may iterate with the coder to fix issues until the task is complete.\n"
        "Error descriptions below are collected from all steps and iterations of this workflow.\n\n"
        "Your task: Identify 5 to 10 distinct, meaningful categories that best capture the underlying nature or root cause of the errors in the list. "
        "Focus on grouping errors by what fundamentally caused them (such as logic mistakes, miscommunication, missing dependencies, data mismatches, etc.), "
        "rather than by their symptoms, error messages, or observable effects. "
        "Do NOT create categories based on how the error was observed or reported, but on the underlying issue that led to it.\n\n"
        "Each category should have a short, clear name and a one-sentence description that explains what kinds of errors belong in that category.\n\n"
        "Output only the categories in this format:\n"
        "1. [Category Name]: [One-sentence description]\n"
        "2. [Category Name]: [One-sentence description]\n"
        "...\n"
        "N. [Category Name]: [One-sentence description]\n\n"
        "Here are some example error categories:\n"
        "- Coding API Error: the coder incorrectly utilized common python packages (e.g. numpy, awkward, uproot, pandas)\n"
        "- User Prompt Misunderstanding: the supervisor did not properly interpret the user prompt\n"
        "Here are some error descriptions after running the workflow:\n"
        "```\n"
    )
    # Add error descriptions to the prompt, one per line
    create_categories_prompt += "\n".join(error_descriptions) + "\n```"

    # 2. Call the LLM to get categories
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{'role': 'user', 'content': create_categories_prompt}],
            temperature=0.0
        )
        error_categories = response.choices[-1].message.content.strip()
        print("Categories found by LLM:\n", error_categories)
    except Exception as e:
        print(f"LLM API error (category generation): {e}")
        return

    df['error'] = df['error'].astype(str)

    for idx, error_description in tqdm(enumerate(error_descriptions), total=len(error_descriptions), desc="categorizing errors"):
        if not error_description.strip():
            continue

        categorize_errors_prompt = (
            "You are an expert at classifying error messages from machine learning workflows in high energy physics.\n\n"
            "Workflow summary:\n"
            "- A user provides an analysis task prompt.\n"
            "- A supervisor agent breaks down the task and instructs a coder agent.\n"
            "- The coder agent generates code, which is executed.\n"
            "- The supervisor reviews results and may iterate with the coder to fix issues until the task is complete.\n"
            "The error descriptions below are collected from all steps and iterations of this workflow.\n\n"
            "Below is a list of error categories, each with a short description:\n"
            f"{error_categories}\n\n"
            "Your task: For the given error description, select the single most appropriate error category from the list above. "
            "Base your choice on the underlying nature or root cause of the error, not on the symptoms, error messages, or observable effects. "
            "Focus on what fundamentally caused the error, such as logic mistakes, missing dependencies, data mismatches, or miscommunication, rather than how the error was reported or observed.\n"
            "Return ALL applicable category names, each wrapped with three asterisks on each side, separated by commas, like this: ***Category One***, ***Category Two***\n"
            "Do not include any other text, explanation, or formatting.\n"
            "Error description:\n"
            "```\n"
            f"{error_description}\n"
            "```"
        )

        def parse_categories(llm_output):
            # Find all ***Category Name*** matches
            return [cat.strip() for cat in re.findall(r"\*\*\*(.*?)\*\*\*", llm_output)]

        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{'role': 'user', 'content': categorize_errors_prompt}],
                temperature=0.0
            )
            assignments_text = response.choices[-1].message.content.strip()
            categories = parse_categories(assignments_text)
            df.at[idx, 'error_categories'] = categories if categories else ["Uncategorized"]
        except Exception as e:
            print(f"LLM API error (assignment) at row {idx}: {e}")
            df.at[idx, 'error'] = "LLM API error"

    df.to_csv(output_csv, index=False)

    with open(output_csv, 'w', encoding='utf-8') as f:
        f.write("# LLM Generated Error Categories:\n")
        for line in error_categories.splitlines():
            f.write(f"# {line}\n")
        f.write("\n")
        df.to_csv(f, index=False)
    print(f"Saved categorized errors to {output_csv}")


def main():
    parser = argparse.ArgumentParser(description="Summarize experiment logs and errors")
    parser.add_argument("--results_dir", type=str, default=" ", nargs='+', required=False, help="One or more directories containing experiment results")
    parser.add_argument("--output_csv", type=str, default="results_summary.csv", help="Path to output CSV file")
    parser.add_argument("--model", type=str, default="gpt-oss-120b", help="LLM model to use for error summarization")
    parser.add_argument("--no_llm", action="store_true", default=False, help="If set, only generate the CSV without LLM error description or categorization")
    args = parser.parse_args()

    summarize_results(
        results_dirs=args.results_dir,
        output_csv=args.output_csv,
        model=args.model,
        no_llm=args.no_llm
    )

    if not args.no_llm:
        categorize_errors(
            output_csv=args.output_csv,
            model=args.model
        )
    else:
        print("LLM error description and categorization skipped (--no_llm set)")


if __name__ == "__main__":
    main()
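# Example invocation (directory names are placeholders):
#   python error_analysis.py --results_dir results/exp1 results/exp2 --output_csv summary.csv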
error_analysis_fixed_categories.py ADDED
@@ -0,0 +1,203 @@
1
+ import os
2
+ import pandas as pd
3
+ import re
4
+ import glob
5
+ from tqdm import tqdm
6
+ import datetime
7
+ import openai
8
+ import argparse
9
+ import io
10
+
11
+ def summarize_results(results_dirs, output_csv, model, no_llm = False):
12
+ client = openai.OpenAI(
13
+ api_key = os.environ.get('CBORG_API_KEY'),
14
+ base_url = 'https://api.cborg.lbl.gov'
15
+ )
16
+
17
+ error_categorization_prompt = (
18
+ "You are an expert at classifying error messages from machine learning workflows in high energy physics.\n\n"
19
+ "Workflow summary:\n"
20
+ "- A user provides an analysis task prompt.\n"
21
+ "- A supervisor agent breaks down the task and instructs a coder agent.\n"
22
+ "- The coder agent generates code, which is executed.\n"
23
+ "- The supervisor reviews results and may iterate with the coder to fix issues until the task is complete.\n"
24
+ "Below is a list of error categories:\n"
25
+ "all data weights = 0, "
26
+ "dummy data created, "
27
+ "function-calling error, "
28
+ "incorrect branch name, "
29
+ "intermediate file not found, "
30
+ "semantic error, "
31
+ "other."
32
+ "Your task: For the given error description, select the single most appropriate error category from the list above. "
33
+ "Base your choice on the underlying nature or root cause of the error, not on the symptoms, error messages, or observable effects. "
34
+ "Focus on what fundamentally caused the error, such as logic mistakes, missing dependencies, data mismatches, or miscommunication, rather than how the error was reported or observed.\n"
35
+ "Return ALL applicable category names, each wrapped with three asterisks on each side, separated by commas, like this: ***Category***"
36
+ "Do not include any other text, explanation, or formatting."
37
+ "log file:\n"
38
+ )
39
+
40
+ results = []
41
+ for results_dir in results_dirs:
42
+ for name in tqdm(os.listdir(results_dir), desc=f"generating error descriptions for {results_dir}"):
43
+ output_dir = os.path.join(results_dir, name)
44
+
45
+ if os.path.isdir(output_dir):
46
+ # Extract config (everything before "_step")
47
+ config_match = re.match(r'^(.*?)_step\d+', name)
48
+ config = config_match.group(1) if config_match else None
49
+
50
+ # Extract step (int after "_step")
51
+ step_match = re.search(r'_step(\d+)', name)
52
+ step = int(step_match.group(1)) if step_match else None
53
+
54
+ result = {
55
+ "supervisor": None,
56
+ "coder": None,
57
+ "step": step,
58
+ "success": None,
59
+ "iterations": None,
60
+ "duration": None,
61
+ "API_calls": None,
62
+ "input_tokens": None,
63
+ "output_tokens": None,
64
+ "user_prompt_tokens": None,
65
+ "supervisor_to_coder_tokens": None,
66
+ "coder_output_tokens": None,
67
+ "feedback_to_supervisor_tokens": None,
68
+ "error": "Uncategorized",
69
+ "error_description": None,
70
+ "output_dir": output_dir,
71
+ }
72
+
73
+ log_dir = os.path.join(output_dir, "logs")
74
+ if os.path.isdir(log_dir):
75
+ comp_log_files = glob.glob(os.path.join(log_dir, "*comprehensive_log.txt"))
76
+ comp_log_str = None
77
+ if comp_log_files:
78
+ with open(comp_log_files[0], "r") as f:
79
+ comp_log_str = f.read()
80
+ else:
81
+ result["success"] = False
82
+ result["error_description"] = "comprehensive log file not found"
83
+ results.append(result)
84
+ continue
85
+
86
+ supervisor_match = re.search(r"Supervisor:\s*([^\s]+)", comp_log_str)
87
+ coder_match = re.search(r"Coder:\s*([^\s]+)", comp_log_str)
88
+ if supervisor_match:
89
+ result["supervisor"] = supervisor_match.group(1)
90
+ if coder_match:
91
+ result["coder"] = coder_match.group(1)
92
+
93
+ iterations_match = re.search(r"Total Iterations:\s*(\d+)", comp_log_str)
94
+ if iterations_match:
95
+ result["iterations"] = int(iterations_match.group(1))
96
+
97
+ duration_match = re.search(r"Duration:\s*([0-9:.\s]+)", comp_log_str)
98
+ if duration_match:
99
+ duration_str = duration_match.group(1).strip()
100
+ try:
101
+ t = datetime.datetime.strptime(duration_str, "%H:%M:%S.%f")
102
+ except ValueError:
103
+ t = datetime.datetime.strptime(duration_str, "%H:%M:%S")
104
+ result["duration"] = t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6
105
+
106
+ api_calls_match = re.search(r"Total API Calls:\s*(\d+)", comp_log_str)
107
+ if api_calls_match:
108
+ result["API_calls"] = int(api_calls_match.group(1))
109
+ input_tokens_match = re.search(r"Total Input Tokens:\s*(\d+)", comp_log_str)
110
+ if input_tokens_match:
111
+ result["input_tokens"] = int(input_tokens_match.group(1))
112
+ output_tokens_match = re.search(r"Total Output Tokens:\s*(\d+)", comp_log_str)
113
+ if output_tokens_match:
114
+ result["output_tokens"] = int(output_tokens_match.group(1))
115
+
116
+ match = re.search(r"User Prompt Tokens:\s*(\d+)", comp_log_str)
117
+ if match:
118
+ result["user_prompt_tokens"] = int(match.group(1))
119
+ match = re.search(r"Supervisor to Coder Tokens:\s*(\d+)", comp_log_str)
120
+ if match:
121
+ result["supervisor_to_coder_tokens"] = int(match.group(1))
122
+ match = re.search(r"Coder Output Tokens:\s*(\d+)", comp_log_str)
123
+ if match:
124
+ result["coder_output_tokens"] = int(match.group(1))
125
+ match = re.search(r"Feedback to Supervisor Tokens:\s*(\d+)", comp_log_str)
126
+ if match:
127
+ result["feedback_to_supervisor_tokens"] = int(match.group(1))
128
+
129
+ # Check validation.log to see if outputs are correct
130
+ val_log_files = glob.glob(os.path.join(log_dir, "*validation.log"))
131
+ val_log_str = None
132
+ if val_log_files:
133
+ with open(val_log_files[0], "r") as f:
134
+ val_log_str = f.read()
135
+ matches = re.findall(r'(✅ Validation successful|❌ Validation failed)', val_log_str)
136
+ if not matches:
137
+ result["success"] = False
138
+ else:
139
+ last = matches[-1]
140
+ result["success"] = last == "βœ… Validation successful"
141
+ if (no_llm):
142
+ if (result["success"]):
143
+ result["error"] = None
144
+ else:
145
+ result["error"] = "Validation Error"
146
+ val_log_str = val_log_str.replace('\n', '').replace('\r', '')
147
+ else:
148
+ result["success"] = False
149
+ val_log_str = ""
150
+ if (not no_llm):
151
+ try:
152
+ response = client.chat.completions.create(
153
+ model = model,
154
+ messages = [
155
+ {
156
+ 'role': 'user',
157
+ 'content': error_categorization_prompt +
158
+ "\nComprehensive Log:\n" + comp_log_str +
159
+ "\nValidation Log:\n" + val_log_str
160
+ }
161
+ ],
162
+ )
163
+ error_description = response.choices[-1].message.content
164
+ def parse_categories(llm_output):
165
+ # Find all ***Category Name*** matches
166
+ return [cat.strip() for cat in re.findall(r"\*\*\*(.*?)\*\*\*", llm_output)]
167
+ result["Error"] = parse_categories(error_description)
168
+ except Exception as e:
169
+ result["Error"] = "uncategorized"
170
+ print(error_description)
171
+ exit()
172
+ print(f"OpenAI API error: {e}")
173
+ else:
174
+ if ("API call failed" in comp_log_str):
175
+ result["error"] = "API Call Error"
176
+ else:
177
+ result["success"] = False
178
+ result["Error"] = "job submission failure"
179
+ results.append(result)
180
+
181
+ df = pd.DataFrame(results)
182
+ df = df.sort_values(by=["supervisor", "coder", "step", "output_dir"])
183
+ df.to_csv(output_csv, index=False)
184
+ print(f"Results written to {output_csv}")
185
+
186
+
187
+ def main():
188
+ parser = argparse.ArgumentParser(description="Summarize experiment logs and errors")
189
+ parser.add_argument("--results_dir", type=str, default=" ", nargs='+', required=False, help="One or more directories containing experiment results")
190
+ parser.add_argument("--output_csv", type=str, default="results_summary.csv", help="Path to output CSV file")
191
+ parser.add_argument("--model", type=str, default="gpt-oss-120b", help="LLM model to use for error summarization")
192
+ parser.add_argument("--no_llm", action="store_true", default=False, help="If set, only generate the CSV without LLM error description or categorization")
193
+ args = parser.parse_args()
194
+
195
+ summarize_results(
196
+ results_dirs=args.results_dir,
197
+ output_csv=args.output_csv,
198
+ model=args.model,
199
+ no_llm=args.no_llm
200
+ )
201
+
202
+ if __name__ == "__main__":
203
+ main()
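The `config`/`step` extraction above relies on the run-directory naming convention produced by `jobs/test_models.py`; a standalone illustration (the directory name is made up but follows that pattern):

```python
import re

# Hypothetical run directory name: <supervisor>_<coder>_step<N>_<timestamp>_<pid>
name = "openai_gpt-5-mini_openai_gpt-5-mini_step3_20251011_142501_88123"

config_match = re.match(r'^(.*?)_step\d+', name)
step_match = re.search(r'_step(\d+)', name)

print(config_match.group(1))     # openai_gpt-5-mini_openai_gpt-5-mini
print(int(step_match.group(1)))  # 3
```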
error_analysis_plotting.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
five_step_analysis.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
get_all_model_versions.py ADDED
@@ -0,0 +1,97 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Script to get version information for all models in the dataset.
4
+ Usage:
5
+ export CBORG_API_KEY=...
6
+ python get_all_model_versions.py
7
+ """
8
+ import os
9
+ import sys
10
+ import pandas as pd
11
+ from openai import OpenAI
12
+
13
+ def test_model_version(client, model_id):
14
+ """Test a model and return the underlying model name."""
15
+ try:
16
+ response = client.chat.completions.create(
17
+ model=model_id,
18
+ messages=[{"role": "user", "content": "Hi"}],
19
+ max_tokens=5
20
+ )
21
+ return response.model
22
+ except Exception as e:
23
+ error_msg = str(e)[:150]
24
+ return f"ERROR: {error_msg}"
25
+
26
+ def main():
27
+ api_key = os.environ.get('CBORG_API_KEY')
28
+ if not api_key:
29
+ print("Error: CBORG_API_KEY environment variable not set.")
30
+ sys.exit(1)
31
+
32
+ client = OpenAI(
33
+ api_key=api_key,
34
+ base_url="https://api.cborg.lbl.gov"
35
+ )
36
+
37
+ # Load the dataset to get all unique models
38
+ df = pd.read_csv('/global/cfs/projectdirs/atlas/joshua/llm4hep/results_summary.csv', comment='#')
39
+ df = df.dropna(subset=['supervisor', 'coder'])
40
+
41
+ # Get all unique models
42
+ all_models = sorted(set(df['supervisor'].unique()) | set(df['coder'].unique()))
43
+
44
+ print("=" * 100)
45
+ print("TESTING ALL MODELS IN DATASET FOR VERSION INFORMATION")
46
+ print("=" * 100)
47
+ print(f"\nFound {len(all_models)} unique models in the dataset")
48
+ print()
49
+
50
+ results = {}
51
+
52
+ for idx, model in enumerate(all_models, 1):
53
+ print(f"[{idx}/{len(all_models)}] Testing {model:<45}", end=" ", flush=True)
54
+ underlying = test_model_version(client, model)
55
+ results[model] = underlying
56
+
57
+ if underlying.startswith('ERROR'):
58
+ print("❌")
59
+ else:
60
+ print("βœ“")
61
+
62
+ # Print results
63
+ print("\n" + "=" * 100)
64
+ print("RESULTS: MODEL MAPPINGS")
65
+ print("=" * 100)
66
+
67
+ for model in sorted(results.keys()):
68
+ underlying = results[model]
69
+ if underlying.startswith('ERROR'):
70
+ print(f"❌ {model:<45} {underlying[:50]}")
71
+ else:
72
+ if model == underlying:
73
+ print(f" {model:<45} (no alias)")
74
+ else:
75
+ print(f" {model:<45} β†’ {underlying}")
76
+
77
+ # Save to file
78
+ output_file = 'model_version_mappings.txt'
79
+ with open(output_file, 'w') as f:
80
+ f.write("MODEL VERSION MAPPINGS\n")
81
+ f.write("=" * 100 + "\n")
82
+ f.write(f"Discovered on: October 29, 2025\n")
83
+ f.write(f"Total models tested: {len(results)}\n\n")
84
+
85
+ for model in sorted(results.keys()):
86
+ underlying = results[model]
87
+ if not underlying.startswith('ERROR'):
88
+ if model == underlying:
89
+ f.write(f"{model} (no alias)\n")
90
+ else:
91
+ f.write(f"{model} β†’ {underlying}\n")
92
+
93
+ print(f"\nβœ“ Results saved to {output_file}")
94
+ print("=" * 100)
95
+
96
+ if __name__ == '__main__':
97
+ main()
get_arr.py ADDED
@@ -0,0 +1,19 @@
1
+ import numpy as np
2
+ import argparse
3
+ import os
4
+
5
+ parser = argparse.ArgumentParser(description='read array')
6
+ add_arg = parser.add_argument
7
+ add_arg('--name', help='array name')
8
+ add_arg('--out_dir', help='output directory', default='.')
9
+ args = parser.parse_args()
10
+
11
+ # Prefer arrays saved under <out_dir>/logs, fallback to current directory
12
+ logs_path = os.path.join(args.out_dir, 'logs', f'{args.name}.npy')
13
+ root_path = os.path.join(args.out_dir, f'{args.name}.npy')
14
+ filepath = logs_path if os.path.exists(logs_path) else root_path
15
+
16
+ arr = np.load(filepath)
17
+ if len(arr) > 3:
18
+ arr = np.array([np.sum(arr[:-2]), arr[-2], arr[-1]])
19
+ print(*arr.flatten())
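A quick illustration of the collapsing step above, which reduces arrays longer than three entries to (sum of everything but the last two, second-to-last, last); the sample values are made up:

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0, 4.0, 5.0])
if len(arr) > 3:
    # Sum all leading entries, keep the final two as-is
    arr = np.array([np.sum(arr[:-2]), arr[-2], arr[-1]])
print(*arr.flatten())  # 60.0 4.0 5.0
```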
jobs/README.md ADDED
@@ -0,0 +1,23 @@
1
+ # Job Submissions
2
+
3
+ A series of Perlmutter jobs can be submitted via the `submit.sh` shell script. This is a one-button method of launching parallel tests for a given list of models.
4
+
5
+ ## `submit.sh`
6
+
7
+ This script reads `../models.txt` or `../models_supervisor.txt` + `../models_coder.txt` and extracts the list of supervisor models and coder models to test. This script has a command-line input specifying the configuration mode using `--mode`.
8
+ * `--mode identical`: the default option. This mode reads from `../models.txt` and uses identical models for supervisor/coder
9
+ * `--mode pairwise`: This mode reads from `../models_supervisor.txt` + `../models_coder.txt` and constructs all pairwise combinations of supervisor/coder setups.
10
+
11
+ All of the different supervisor/coder configurations are then submitted as separate jobs. This allows each supervisor/coder pairing to run testing in parallel via the `run_tests.sh` script. To adjust the number of "trials" per test (number of times each test is run), just modify the variable `NUM_TESTS`. There is also a variable called `OUTDIR` that will let you specify the output directory for your tests.
12
+
13
+ ## `run_tests.sh`
14
+ This script takes four input parameters (the last one optional):
15
+ * `supervisor`: the model to be used as supervisor
16
+ * `coder`: the model to be used as coder
17
+ * `NUM_TESTS`: the number of trials to run
18
+ * `OUTDIR`: the output directory for your tests (optional)
19
+
20
+ This script will just load the conda environment and call the final script of this chain, `test_models.py`. To adjust the slurm options, modify the header of this file (job time, account, qos, slurm output directory, etc).
21
+
22
+ ## `test_models.py`
23
+ This script parallelizes the testing for a given supervisor/coder setup. Each trial is broken down into 5 steps (summarize_root, create_numpy, preprocess, scores, and categorization), and the steps are run in parallel, taking advantage of the fact that each step is independent of the others. Additional parallelization is applied across the total number of trials. In the current configuration, 2 tests run in parallel; you can change this by adjusting `max_workers` in the argument of the `ProcessPoolExecutor`, as in the sketch below.
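A minimal sketch of the fan-out pattern described above, with a placeholder standing in for the real `run_for_model` (which shells out to `run_smk_sequential.sh`):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_one(trial, step):
    # Placeholder for the real work: one pipeline step of one trial
    return trial, step

if __name__ == "__main__":
    futures = []
    with ProcessPoolExecutor(max_workers=2) as executor:  # 2 tests in parallel
        for trial in range(3):                            # NUM_TESTS trials
            for step in (1, 2, 3, 4, 5):                  # five independent steps
                futures.append(executor.submit(run_one, trial, step))
        for future in as_completed(futures):
            print("done:", future.result())
```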
jobs/run_tests.sh ADDED
@@ -0,0 +1,18 @@
1
+ #!/bin/bash
2
+ #SBATCH -N 1
3
+ #SBATCH -C gpu
4
+ #SBATCH -q shared
5
+ #SBATCH -t 06:00:00
6
+ #SBATCH -A atlas
7
+ #SBATCH -o jobs/slurm/%j.out # STDOUT
8
+
9
+ supervisor="$1"
10
+ coder="$2"
11
+ NUM_TESTS="$3"
12
+ OUTDIR="$4"
13
+
14
+ module load python
15
+ source ~/.bashrc
16
+ conda activate llm_env
17
+
18
+ python jobs/test_models.py "$supervisor" "$coder" "$NUM_TESTS" --outdir "$OUTDIR"
jobs/submit.sh ADDED
@@ -0,0 +1,54 @@
1
+ #!/bin/bash
2
+
3
+ MODEL_LIST="models.txt"
4
+ SUPERVISOR_LIST="models_supervisor.txt"
5
+ CODER_LIST="models_coder.txt"
6
+ NUM_TESTS=10
7
+ OUTDIR="/global/cfs/projectdirs/atlas/llm4hep/oct_11_tests/"
8
+
9
+ usage() {
10
+ echo "Usage: $0 [--mode identical|pairwise]"
11
+ echo " --mode identical : Use the same model for both supervisor and coder (from models.txt) [default]"
12
+ echo " --mode pairwise : Use all pairs (from models_supervisor.txt and models_coder.txt)"
13
+ exit 1
14
+ }
15
+
16
+ # Default mode
17
+ MODE="identical"
18
+
19
+ # Parse arguments
20
+ while [[ $# -gt 0 ]]; do
21
+ case "$1" in
22
+ --mode)
23
+ MODE="$2"
24
+ shift 2
25
+ ;;
26
+ *)
27
+ usage
28
+ ;;
29
+ esac
30
+ done
31
+
32
+ if [[ "$MODE" == "identical" ]]; then
33
+ # One model for both supervisor and coder
34
+ while IFS= read -r model; do
35
+ model=$(echo "$model" | xargs)
36
+ [ -z "$model" ] && continue
37
+ echo "Supervisor & Coder: $model"
38
+ sbatch --job-name="${model}_${model}" jobs/run_tests.sh "$model" "$model" "$NUM_TESTS" "$OUTDIR"
39
+ done < "$MODEL_LIST"
40
+ elif [[ "$MODE" == "pairwise" ]]; then
41
+ # Different models for supervisor and coder
42
+ while IFS= read -r supervisor; do
43
+ supervisor=$(echo "$supervisor" | xargs)
44
+ [ -z "$supervisor" ] && continue
45
+ while IFS= read -r coder; do
46
+ coder=$(echo "$coder" | xargs)
47
+ [ -z "$coder" ] && continue
48
+ echo "Supervisor: $supervisor, Coder: $coder"
49
+ sbatch --job-name="${supervisor}_${coder}" jobs/run_tests.sh "$supervisor" "$coder" "$NUM_TESTS" "$OUTDIR"
50
+ done < "$CODER_LIST"
51
+ done < "$SUPERVISOR_LIST"
52
+ else
53
+ usage
54
+ fi
jobs/test_models.py ADDED
@@ -0,0 +1,59 @@
1
+ import os
2
+ import subprocess
3
+ import time
4
+ import yaml
5
+ from concurrent.futures import ProcessPoolExecutor, as_completed
6
+ import re
7
+ import argparse
8
+
9
+ def sanitize(s):
10
+ # Replace / and : and other non-alphanumeric chars with _
11
+ return re.sub(r'[^A-Za-z0-9_.-]', '_', s)
12
+
13
+ def run_for_model(supervisor, coder, step, config_filepath, outdir):
14
+ timestamp = time.strftime("%Y%m%d_%H%M%S")
15
+ pid = os.getpid()
16
+ slurm_jobid = os.environ.get("SLURM_JOB_ID")
17
+ if slurm_jobid:
18
+ job_id = f"{sanitize(supervisor)}_{sanitize(coder)}_step{step}_{timestamp}_{pid}_slurm_{slurm_jobid}"
19
+ else:
20
+ job_id = f"{sanitize(supervisor)}_{sanitize(coder)}_step{step}_{timestamp}_{pid}"
21
+
22
+ out_path = os.path.join(outdir, job_id)
23
+ run_cmd = (
24
+ f"./run_smk_sequential.sh --step{step} --out-dir {out_path} --config {config_filepath} --validate"
25
+ )
26
+ subprocess.run(run_cmd, shell=True, check=True, executable='/bin/bash')
27
+
28
+ return supervisor, coder, pid
29
+
30
+ def main(supervisor, coder, num_tests, outdir):
31
+ config = {"supervisor": supervisor, "coder": coder, "temperature": 1.5}
32
+ config_dir = "/dev/shm/config"
33
+ os.makedirs(config_dir, exist_ok=True)
34
+ config_filepath = os.path.join(config_dir, f"{sanitize(supervisor)}_{sanitize(coder)}.yml")
35
+ with open(config_filepath, "w") as f:
36
+ yaml.dump(config, f)
37
+
38
+ futures = []
39
+ with ProcessPoolExecutor(max_workers=2) as executor:
40
+ for _ in range(num_tests):
41
+ for step in [1, 2, 3, 4, 5]:
42
+ futures.append(executor.submit(
43
+ run_for_model, supervisor, coder, step, config_filepath, outdir
44
+ ))
45
+
46
+ for future in as_completed(futures):
47
+ supervisor, coder, pid = future.result()
48
+ print(f"Completed PID {pid}")
49
+
50
+ if __name__ == "__main__":
51
+ parser = argparse.ArgumentParser()
52
+ parser.add_argument("supervisor", help="Supervisor name")
53
+ parser.add_argument("coder", help="Coder name")
54
+ parser.add_argument("num_tests", type=int, help="Number of tests")
55
+ parser.add_argument("--outdir", default="/global/cfs/projectdirs/atlas/llm4hep/",
56
+ help="Output directory (default: %(default)s)")
57
+ args = parser.parse_args()
58
+
59
+ main(args.supervisor, args.coder, args.num_tests, args.outdir)
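The `sanitize` helper above determines how CBORG model IDs appear in directory names; for example (model name taken from models.example.txt):

```python
import re

def sanitize(s):
    # Replace anything outside [A-Za-z0-9_.-] with an underscore
    return re.sub(r'[^A-Za-z0-9_.-]', '_', s)

print(sanitize("anthropic/claude-sonnet:latest"))
# anthropic_claude-sonnet_latest
```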
list_cborg_models.py ADDED
@@ -0,0 +1,54 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Script to list available CBORG models using your CBORG_API_KEY.
4
+ Usage:
5
+ export CBORG_API_KEY=...
6
+ python list_cborg_models.py
7
+ """
8
+ import os
9
+ import sys
10
+ from openai import OpenAI
11
+
12
+ def main():
13
+ api_key = os.environ.get('CBORG_API_KEY')
14
+ if not api_key:
15
+ print("Error: CBORG_API_KEY environment variable not set.")
16
+ sys.exit(1)
17
+
18
+ client = OpenAI(
19
+ api_key=api_key,
20
+ base_url="https://api.cborg.lbl.gov"
21
+ )
22
+ try:
23
+ response = client.models.list()
24
+ print("Available CBORG models:")
25
+ print("-" * 80)
26
+ for model in response.data:
27
+ print(f"\nModel ID: {model.id}")
28
+
29
+ # Try to retrieve detailed information about each model
30
+ try:
31
+ model_details = client.models.retrieve(model.id)
32
+ print(f" Created: {model_details.created if hasattr(model_details, 'created') else 'N/A'}")
33
+ print(f" Owned by: {model_details.owned_by if hasattr(model_details, 'owned_by') else 'N/A'}")
34
+
35
+ # Print all available attributes
36
+ print(f" Available attributes:")
37
+ for attr in dir(model_details):
38
+ if not attr.startswith('_'):
39
+ try:
40
+ value = getattr(model_details, attr)
41
+ if not callable(value):
42
+ print(f" {attr}: {value}")
43
+ except Exception:
44
+ pass
45
+ except Exception as e:
46
+ print(f" (Could not retrieve detailed info: {e})")
47
+
48
+ print("-" * 80)
49
+ except Exception as e:
50
+ print(f"Error fetching model list: {e}")
51
+ sys.exit(1)
52
+
53
+ if __name__ == '__main__':
54
+ main()
logs_interpreter.py ADDED
@@ -0,0 +1,341 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ logs_interpreter.py
4
+
5
+ Parse log files, call the CBORG model to diagnose root causes of failures (or confirm success), and output its analysis.
6
+ """
7
+ import os
8
+ import sys
9
+ import argparse
10
+
11
+ try:
12
+ from openai import OpenAI # type: ignore
13
+ except ImportError:
14
+ print("Please install openai (pip install openai)")
15
+ sys.exit(1)
16
+
17
+
18
+ def parse_args():
19
+ parser = argparse.ArgumentParser(
20
+ description="Analyze run logs and ask CBORG model for root-cause analysis"
21
+ )
22
+ parser.add_argument(
23
+ "--log_dir", default=".",
24
+ help="Directory containing .txt log files (default: current directory)"
25
+ )
26
+ parser.add_argument(
27
+ "--model", default="lbl/cborg-deepthought",
28
+ help="CBORG model to use (default: lbl/cborg-deepthought)"
29
+ )
30
+ parser.add_argument(
31
+ "--output", default=None,
32
+ help="File to write the model's analysis (default: stdout)"
33
+ )
34
+ return parser.parse_args()
35
+
36
+
37
+ def gather_logs(log_dir):
38
+ # If logs are under a nested 'logs' directory, use that first
39
+ if os.path.isdir(os.path.join(log_dir, 'logs')):
40
+ log_base = os.path.join(log_dir, 'logs')
41
+ else:
42
+ log_base = log_dir
43
+ # Group TXT log files by prefix (before the last underscore)
44
+ files = [f for f in sorted(os.listdir(log_base)) if f.endswith('.txt')]
45
+ groups = {}
46
+ for fname in files:
47
+ if '_' in fname:
48
+ base = fname.rsplit('_', 1)[0]
49
+ else:
50
+ base = fname.rsplit('.', 1)[0]
51
+ groups.setdefault(base, []).append(fname)
52
+ segments = []
53
+ # Assemble grouped log contents
54
+ for base, flist in groups.items():
55
+ segments.append(f"=== Log group: {base} ===")
56
+ for fname in flist:
57
+ path = os.path.join(log_base, fname)
58
+ try:
59
+ with open(path, 'r') as f:
60
+ content = f.read().strip()
61
+ except Exception as e:
62
+ content = f"<could not read: {e}>"
63
+ segments.append(f"-- {fname} --\n{content}")
64
+ segments.append("")
65
+
66
+ # Include Snakemake run logs from possible locations
67
+ # 1) sibling 'snakemake_log' folder
68
+ # 2) nested '.snakemake/log' under log_dir
69
+ candidates = [os.path.join(log_dir, 'snakemake_log'),
70
+ os.path.join(log_dir, '.snakemake', 'log')]
71
+ for sn_dir in candidates:
72
+ if os.path.isdir(sn_dir):
73
+ for fname in sorted(os.listdir(sn_dir)):
74
+ if fname.endswith('.log'):
75
+ path = os.path.join(sn_dir, fname)
76
+ try:
77
+ with open(path, 'r') as f:
78
+ content = f.read().strip()
79
+ except Exception as e:
80
+ content = f"<could not read: {e}>"
81
+ segments.append(f"=== Snakemake Log File: {fname} ===")
82
+ segments.append(content)
83
+ segments.append("")
84
+ return "\n".join(segments)
85
+
86
+
87
+ def call_cborg(prompt, model):
88
+ api_key = os.getenv("CBORG_API_KEY") or os.getenv("OPENAI_API_KEY")
89
+ if not api_key:
90
+ print("Error: CBORG_API_KEY or OPENAI_API_KEY environment variable not set.")
91
+ sys.exit(1)
92
+ # Initialize the CBORG/OpenAI client with the appropriate API endpoint
93
+ cborg_url = os.getenv("CBORG_API_URL", "https://api.cborg.lbl.gov")
94
+ client = OpenAI(api_key=api_key, base_url=cborg_url)
95
+ # Call the chat completions endpoint
96
+ response = client.chat.completions.create(
97
+ model=model,
98
+ messages=[
99
+ {"role": "system", "content": "You are a log root-cause analyzer. Provide a concise diagnosis."},
100
+ {"role": "user", "content": prompt},
101
+ ],
102
+ temperature=0.2,
103
+ )
104
+ # Safely extract content
105
+ choice = response.choices[0]
106
+ content = None
107
+ if hasattr(choice, 'message') and choice.message:
108
+ content = getattr(choice.message, 'content', None)
109
+ if content is None and hasattr(choice, 'text'):
110
+ content = choice.text
111
+ if content is None:
112
+ content = ''
113
+ return content.strip()
114
+
115
+
116
+ def main():
117
+ args = parse_args()
118
+ # If the log_dir contains run subdirectories with their own 'logs' folders, gather per-run
119
+ runs = [d for d in sorted(os.listdir(args.log_dir))
120
+ if os.path.isdir(os.path.join(args.log_dir, d)) and d != '.snakemake']
121
+ # Determine base log directory (for nested runs or single run)
122
+ # Determine the folder containing .txt logs
123
+ log_folder = os.path.join(args.log_dir, 'logs') if os.path.isdir(os.path.join(args.log_dir, 'logs')) else args.log_dir
124
+ if runs and os.path.isdir(os.path.join(args.log_dir, runs[0], 'logs')):
125
+ combined = []
126
+ for run in runs:
127
+ combined.append(f"=== Run: {run} ===")
128
+ run_log_dir = os.path.join(args.log_dir, run, 'logs')
129
+ combined.append(gather_logs(run_log_dir))
130
+ # Include root-level Snakemake logs if present
131
+ root_snake = os.path.join(args.log_dir, '.snakemake', 'log')
132
+ if os.path.isdir(root_snake):
133
+ combined.append("=== Root Snakemake Logs ===")
134
+ for fname in sorted(os.listdir(root_snake)):
135
+ if fname.endswith('.log'):
136
+ path = os.path.join(root_snake, fname)
137
+ try:
138
+ content = open(path).read().strip()
139
+ except Exception:
140
+ content = "<could not read>"
141
+ combined.append(f"-- {fname} --\n{content}")
142
+ logs = "\n\n".join(combined)
143
+ else:
144
+ # Gather logs from determined log_folder
145
+ logs = gather_logs(log_folder)
146
+ # Prepend a listing of available .txt files in the log_folder for clarity
147
+ try:
148
+ entries = sorted(f for f in os.listdir(log_folder) if f.endswith('.txt'))
149
+ listing = "=== Logs directory files (txt) ===\n" + "\n".join(entries) + "\n\n"
150
+ except Exception:
151
+ listing = ""
152
+ logs = listing + logs
153
+ if not logs:
154
+ print(f"No log files found in {args.log_dir}")
155
+ sys.exit(0)
156
+
157
+ # Include stats.csv summary and filter logs for failed steps
158
+ stats_file = os.path.join(args.log_dir, 'stats.csv')
159
+ if os.path.isfile(stats_file):
160
+ try:
161
+ with open(stats_file, 'r') as sf:
162
+ stats_content = sf.read().strip()
163
+ except Exception as e:
164
+ stats_content = f"<could not read stats.csv: {e}>"
165
+ # Begin prompt logs with stats summary
166
+ logs = f"=== Stats Summary ===\n{stats_content}\n\n"
167
+ # Parse CSV to identify failed steps
168
+ try:
169
+ with open(stats_file, 'r') as sf:
170
+ # Read the entire CSV content and parse manually due to potential line wrapping
171
+ content = sf.read().strip()
172
+ lines = content.split('\n')
173
+
174
+ # Find the data line (starts with '* ')
175
+ data_line = None
176
+ for line in lines:
177
+ if line.strip().startswith('* '):
178
+ data_line = line.strip()[2:] # Remove '* ' prefix
179
+ break
180
+
181
+ if data_line:
182
+ # Parse the data manually: model_name, step1_success, step1_time, step1_calls, step1_in, step1_out, step2_success, etc.
183
+ parts = [part.strip() for part in data_line.split(',')]
184
+ if len(parts) >= 16: # Ensure we have enough columns
185
+ stats_row = {
186
+ 'step 1 success?': parts[1], # Index 1: step 1 success
187
+ 'step 2 success?': parts[6], # Index 6: step 2 success
188
+ 'step 3 success?': parts[11], # Index 11: step 3 success
189
+ }
190
+ else:
191
+ stats_row = {}
192
+ else:
193
+ stats_row = {}
194
+ except Exception as e:
195
+ print(f"Warning: Could not parse CSV: {e}")
196
+ stats_row = {}
197
+ # Map step numbers to rule prefixes
198
+ step_rules = {
199
+ '1': ['create_numpy', 'insert_root_summary', 'preprocess', 'summarize_root'],
200
+ '2': ['scores'],
201
+ '3': ['categorization'],
202
+ }
203
+ # List available txt entries
204
+ entries = []
205
+ try:
206
+ entries = sorted(f for f in os.listdir(log_folder) if f.endswith('.txt'))
207
+ except Exception:
208
+ pass
209
+ # Build filtered log segments for each step (both failed and passed for context)
210
+ filtered = []
211
+
212
+ # Always include stats parsing for context
213
+ filtered.append("=== STEP STATUS FROM STATS.CSV ===")
214
+ for step, rules in step_rules.items():
215
+ key = f'step {step} success?'
216
+ status = stats_row.get(key, 'Unknown').strip()
217
+ filtered.append(f"Step {step}: {status}")
218
+ filtered.append("")
219
+
220
+ # Include logs for failed steps and their associated rules
221
+ failed_steps = []
222
+ for step, rules in step_rules.items():
223
+ key = f'step {step} success?'
224
+ if stats_row.get(key, '').lower() != 'true':
225
+ failed_steps.append(step)
226
+ filtered.append(f"=== FAILED STEP {step} LOGS ===")
227
+
228
+ for rule in rules:
229
+ filtered.append(f"--- Rule: {rule} ---")
230
+ matched = [f for f in entries if f.startswith(rule + '_')]
231
+ if matched:
232
+ for fname in matched:
233
+ path = os.path.join(log_folder, fname)
234
+ try:
235
+ content = open(path).read().strip()
236
+ # Truncate very long logs to focus on key parts
237
+ if len(content) > 5000:
238
+ lines = content.split('\n')
239
+ content = '\n'.join(lines[:100]) + "\n...[TRUNCATED]...\n" + '\n'.join(lines[-50:])
240
+ except Exception as e:
241
+ content = f"<could not read: {e}>"
242
+ filtered.append(f"Log file: {fname}")
243
+ filtered.append(content)
244
+ else:
245
+ filtered.append("No log files found for this rule.")
246
+ filtered.append("")
247
+
248
+ # Add Snakemake logs for execution context
249
+ snakemake_dir = os.path.join(args.log_dir, 'snakemake_log')
250
+ if os.path.isdir(snakemake_dir):
251
+ filtered.append("=== SNAKEMAKE EXECUTION LOGS ===")
252
+ for fname in sorted(os.listdir(snakemake_dir)):
253
+ if fname.endswith('.log'):
254
+ path = os.path.join(snakemake_dir, fname)
255
+ try:
256
+ content = open(path).read().strip()
257
+ # Focus on errors and warnings in Snakemake logs
258
+ lines = content.split('\n')
259
+ important_lines = []
260
+ for line in lines:
261
+ if any(keyword in line.lower() for keyword in ['error', 'exception', 'failed', 'warning', 'killed']):
262
+ important_lines.append(line)
263
+ if important_lines:
264
+ filtered.append(f"Snakemake log: {fname} (errors/warnings only)")
265
+ filtered.append('\n'.join(important_lines[-20:])) # Last 20 error lines
266
+ else:
267
+ filtered.append(f"Snakemake log: {fname} - No errors detected")
268
+ except Exception as e:
269
+ filtered.append(f"<could not read {fname}: {e}>")
270
+ filtered.append("")
271
+
272
+ # Append filtered logs
273
+ logs += "\n".join(filtered)
274
+
275
+ # Build prompt: a single f-string literal with embedded logs (no leading newline)
276
+ prompt = f"""You are analyzing a machine learning pipeline failure. Your task is to diagnose root causes by examining three sources:
277
+
278
+ 1) stats.csv: Shows pass/fail status for 3 steps:
279
+ - Step 1 (Data Preparation): create_numpy, insert_root_summary, preprocess, summarize_root
280
+ - Step 2 (Scoring): scores
281
+ - Step 3 (Categorization): categorization
282
+
283
+ 2) Individual .txt logs in logs/: Contain detailed execution output for each rule attempt
284
+ 3) Snakemake logs: Show workflow execution status and any workflow-level errors
285
+
286
+ ANALYSIS REQUIREMENTS:
287
+ Create a diagnostic report using this format for each step:
288
+
289
+ ------
290
+ Step X (Category of failure)
291
+ ------
292
+ Rule: [rule_name]
293
+ ------
294
+ Status: [Pass/Fail from stats.csv] | [Snakemake execution status]
295
+ ------
296
+ Root Cause Analysis: [detailed analysis]
297
+ ------
298
+
299
+ For each failed step (False in stats.csv):
300
+ - Examine ALL relevant .txt log files for that step's rules
301
+ - Look for specific error messages, exceptions, or failure indicators
302
+ - Identify the probable root cause (e.g., missing files, API failures, memory issues, logic errors, syntax errors)
303
+ - If logs show success messages but stats.csv shows failure, investigate this discrepancy
304
+ - Categorize the failure type (Data/API/Logic/Infrastructure/Other)
305
+
306
+ For passed steps (True in stats.csv):
307
+ - Simply mark as "OK" in Root Cause Analysis
308
+
309
+ After the table, provide:
310
+ 1. Overall Status: SUCCESS or FAILURE using similar format as above.
311
+ 2. Primary Failure Category (if applicable): Data/API/Logic/Infrastructure/Other
312
+ 3. Recommended Next Steps
313
+
314
+ DATA TO ANALYZE:
315
+ {logs}
316
+ """
317
+ # DEBUG: Uncomment to see full prompt
318
+ # print("=== PROMPT BEING SENT TO CBORG ===")
319
+ # print(prompt)
320
+ # print("=== END PROMPT ===\n")
321
+ analysis = call_cborg(prompt, args.model)
322
+ # Fallback if model returns empty
323
+ if not analysis or not analysis.strip():
324
+ analysis = (
325
+ "Warning: CBORG model returned no analysis.\n"
326
+ "Below is the prompt sent to the model for debugging:\n\n" + prompt
327
+ )
328
+
329
+ # Determine output path: either user-specified or default under log_dir
330
+ # Write analysis to logs_analysis.txt by default in the log directory
331
+ output_file = args.output or os.path.join(args.log_dir, 'logs_analysis.txt')
332
+ try:
333
+ with open(output_file, 'w') as f:
334
+ f.write(analysis + "\n")
335
+ print(f"Analysis written to {output_file}")
336
+ except Exception as e:
337
+ print(f"Error writing analysis to {output_file}: {e}")
338
+
339
+
340
+ if __name__ == "__main__":
341
+ main()
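The grouping rule in `gather_logs` keys each .txt file on everything before its last underscore; a quick illustration with made-up file names:

```python
files = ["preprocess_attempt1.txt", "preprocess_attempt2.txt",
         "scores_attempt1.txt", "notes.txt"]
groups = {}
for fname in files:
    # Same rule as gather_logs: prefix before the last '_', else the stem
    base = fname.rsplit('_', 1)[0] if '_' in fname else fname.rsplit('.', 1)[0]
    groups.setdefault(base, []).append(fname)
print(groups)
# {'preprocess': ['preprocess_attempt1.txt', 'preprocess_attempt2.txt'],
#  'scores': ['scores_attempt1.txt'], 'notes': ['notes.txt']}
```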
logs_interpreter.sh ADDED
@@ -0,0 +1,12 @@
1
+ #!/usr/bin/env bash
2
+ # Load and activate the Conda environment for CBORG analysis
3
+ module load conda
4
+ conda activate llm_env
5
+ # Wrapper to run the log interpreter script with python3
6
+ if ! command -v python3 &>/dev/null; then
7
+ echo "Error: python3 not found in PATH"
8
+ exit 1
9
+ fi
10
+
11
+ dir=$(dirname "$0")
12
+ python3 "$dir/logs_interpreter.py" "$@"
map_latest_models.py ADDED
@@ -0,0 +1,122 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Script to map all :latest models to their underlying versions.
4
+ Usage:
5
+ export CBORG_API_KEY=...
6
+ python map_latest_models.py
7
+ """
8
+ import os
9
+ import sys
10
+ from openai import OpenAI
11
+
12
+ def test_model_mapping(client, model_id):
13
+ """Test a model and return the underlying model name."""
14
+ try:
15
+ response = client.chat.completions.create(
16
+ model=model_id,
17
+ messages=[{"role": "user", "content": "Hi"}],
18
+ max_tokens=5
19
+ )
20
+ return response.model
21
+ except Exception as e:
22
+ return f"ERROR: {str(e)[:100]}"
23
+
24
+ def main():
25
+ api_key = os.environ.get('CBORG_API_KEY')
26
+ if not api_key:
27
+ print("Error: CBORG_API_KEY environment variable not set.")
28
+ sys.exit(1)
29
+
30
+ client = OpenAI(
31
+ api_key=api_key,
32
+ base_url="https://api.cborg.lbl.gov"
33
+ )
34
+
35
+ # Get all available models
36
+ try:
37
+ response = client.models.list()
38
+ all_models = [model.id for model in response.data]
39
+ except Exception as e:
40
+ print(f"Error fetching model list: {e}")
41
+ sys.exit(1)
42
+
43
+ # Filter for models with :latest
44
+ latest_models = [m for m in all_models if ':latest' in m]
45
+
46
+ # Also check models without suffix to compare
47
+ base_models = []
48
+ for latest in latest_models:
49
+ base = latest.replace(':latest', '')
50
+ if base in all_models:
51
+ base_models.append(base)
52
+
53
+ print("=" * 100)
54
+ print("MAPPING OF :latest MODELS TO UNDERLYING VERSIONS")
55
+ print("=" * 100)
56
+
57
+ results = []
58
+
59
+ # Test :latest models
60
+ print(f"\nTesting {len(latest_models)} models with :latest suffix...")
61
+ for model in sorted(latest_models):
62
+ print(f" Testing {model}...", end=" ", flush=True)
63
+ underlying = test_model_mapping(client, model)
64
+ results.append((model, underlying))
65
+ print("βœ“")
66
+
67
+ # Test base models for comparison
68
+ print(f"\nTesting {len(base_models)} corresponding base models (without :latest)...")
69
+ for model in sorted(base_models):
70
+ print(f" Testing {model}...", end=" ", flush=True)
71
+ underlying = test_model_mapping(client, model)
72
+ results.append((model, underlying))
73
+ print("βœ“")
74
+
75
+ # Print results
76
+ print("\n" + "=" * 100)
77
+ print("RESULTS")
78
+ print("=" * 100)
79
+
80
+ print("\nπŸ“‹ Models with :latest suffix:")
81
+ print("-" * 100)
82
+ for model, underlying in results:
83
+ if ':latest' in model:
84
+ if underlying.startswith('ERROR'):
85
+ print(f"❌ {model:<50} {underlying}")
86
+ else:
87
+ status = "β†’" if model != underlying else "="
88
+ print(f" {model:<50} {status} {underlying}")
89
+
90
+ print("\nπŸ“‹ Base models (without :latest):")
91
+ print("-" * 100)
92
+ for model, underlying in results:
93
+ if ':latest' not in model:
94
+ if underlying.startswith('ERROR'):
95
+ print(f"❌ {model:<50} {underlying}")
96
+ else:
97
+ status = "β†’" if model != underlying else "="
98
+ print(f" {model:<50} {status} {underlying}")
99
+
100
+ # Compare :latest vs base
101
+ print("\nπŸ“Š COMPARISON: Do :latest and base versions map to the same model?")
102
+ print("-" * 100)
103
+
104
+ latest_map = {m: u for m, u in results if ':latest' in m}
105
+ base_map = {m: u for m, u in results if ':latest' not in m}
106
+
107
+ for latest, underlying_latest in sorted(latest_map.items()):
108
+ base = latest.replace(':latest', '')
109
+ if base in base_map:
110
+ underlying_base = base_map[base]
111
+ if underlying_latest == underlying_base:
112
+ print(f"βœ“ {latest:<50} SAME as {base}")
113
+ print(f" └─ Both map to: {underlying_latest}")
114
+ else:
115
+ print(f"⚠️ {latest:<50} DIFFERENT from {base}")
116
+ print(f" β”œβ”€ :latest maps to: {underlying_latest}")
117
+ print(f" └─ base maps to: {underlying_base}")
118
+
119
+ print("\n" + "=" * 100)
120
+
121
+ if __name__ == '__main__':
122
+ main()
model_version_mappings.txt ADDED
@@ -0,0 +1,24 @@
1
+ MODEL VERSION MAPPINGS
2
+ ====================================================================================================
3
+ Discovered on: October 29, 2025
4
+ Total models tested: 22
5
+
6
+ anthropic/claude-haiku:latest → claude-haiku-4-5@20251001
7
+ anthropic/claude-opus:latest → us.anthropic.claude-opus-4-1-20250805-v1:0
8
+ anthropic/claude-sonnet:latest → claude-sonnet-4-5@20250929
9
+ aws/llama-4-maverick → us.meta.llama4-maverick-17b-instruct-v1:0
10
+ aws/llama-4-scout → us.meta.llama4-scout-17b-instruct-v1:0
11
+ claude-3-5-haiku-latest → claude-3-5-haiku@20241022
12
+ deepseek-r1 → MAI-DS-R1
13
+ gcp/qwen-3 → qwen/qwen3-235b-a22b-instruct-2507-maas
14
+ gemini-2.0-flash-lite (no alias)
15
+ google/gemini-flash → gemini-2.5-flash
16
+ google/gemini:latest → gemini-2.5-pro
17
+ gpt-oss-120b → hosted_vllm/hosted_vllm/gpt-oss-120b
18
+ openai/gpt-5 → gpt-5-2025-08-07
19
+ openai/gpt-5-mini → gpt-5-mini-2025-08-07
20
+ openai/o3 → azure/o3-2025-04-16
21
+ openai/o3-mini → azure/o3-mini-2025-01-31
22
+ openai/o4-mini → azure/o4-mini-2025-04-16
23
+ openai/o:latest → azure/o3-2025-04-16
24
+ xai/grok:latest → grok-3
models.example.txt ADDED
@@ -0,0 +1,34 @@
1
+ # Model list for testing
2
+ #
3
+ # Usage: Copy this file to models.txt and customize for your tests
4
+ #
5
+ # Format:
6
+ # - One model per line
7
+ # - Use CBORG model aliases (see CBORG_MODEL_MAPPINGS.md)
8
+ # - IMPORTANT: File MUST end with a blank line
9
+ # - Repeat model names to run multiple trials
10
+ #
11
+ # Available models (examples):
12
+ #
13
+ # Anthropic Claude models:
14
+ # anthropic/claude-sonnet:latest
15
+ # anthropic/claude-opus:latest
16
+ # anthropic/claude-haiku:latest
17
+ #
18
+ # OpenAI models:
19
+ # openai/gpt-5-mini
20
+ # openai/gpt-5
21
+ # openai/o3
22
+ # openai/o3-mini
23
+ #
24
+ # Google Gemini:
25
+ # google/gemini:latest
26
+ # google/gemini-flash
27
+ #
28
+ # Example configuration (uncomment to use):
29
+ # anthropic/claude-sonnet:latest
30
+ # openai/gpt-5-mini
31
+ # google/gemini:latest
32
+ #
33
+ # IMPORTANT: Add blank line below (required)
34
+
models.txt ADDED
@@ -0,0 +1,2 @@
1
+ lbl/cborg-deepthought:latest
2
+ lbl/llama
models_coder.txt ADDED
@@ -0,0 +1 @@
1
+ o4-mini
models_supervisor.txt ADDED
@@ -0,0 +1 @@
1
+ o4-mini
plot_stats.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
plots/five_step_summary_stats.csv ADDED
@@ -0,0 +1,46 @@
1
+ pair,step,success_count,agent_work_mean,agent_work_std,API_calls_mean,API_calls_std,total_price_mean,total_price_std,duration_mean,duration_std,input_tokens_mean,output_tokens_mean
2
+ GPT-5 Codex,1.0,10,29.71,20.43,3.8,1.03,0.18,0.15,138.28,82.4,3084.6,8740.4
3
+ GPT-5 Codex,2.0,7,15.51,5.66,4.43,1.51,0.37,0.14,239.52,90.58,10528.29,17226.71
4
+ GPT-5 Codex,3.0,9,17.17,7.11,5.44,1.67,0.53,0.24,337.15,166.49,15786.22,24568.22
5
+ GPT-5 Codex,4.0,9,6.2,1.69,3.44,0.88,0.04,0.01,223.69,101.66,3291.0,1510.33
6
+ GPT-5 Codex,5.0,5,20.85,12.88,5.8,1.1,0.35,0.26,306.1,152.33,8801.2,16389.0
7
+ GPT-5 Mini (2025-08-07),1.0,11,103.92,11.06,7.0,0.0,0.1,0.01,330.62,76.77,16302.55,23915.36
8
+ GPT-5 Mini (2025-08-07),2.0,5,30.58,4.01,7.0,0.0,0.12,0.02,354.82,68.27,25173.4,27960.2
9
+ GPT-5 Mini (2025-08-07),3.0,9,20.2,0.69,7.0,0.0,0.1,0.0,316.01,38.12,23718.78,23161.67
10
+ GPT-5 Mini (2025-08-07),4.0,10,28.16,3.11,7.0,0.0,0.04,0.01,471.36,32.86,10257.8,8990.8
11
+ GPT-5 Mini (2025-08-07),5.0,10,25.21,1.89,7.0,0.0,0.07,0.01,338.15,18.41,14457.3,15501.9
12
+ GPT-OSS-120B,1.0,54,12.57,6.99,3.37,1.03,0.0,0.0,21.24,11.84,3410.41,2853.93
13
+ GPT-OSS-120B,2.0,15,12.23,6.96,4.6,2.03,0.0,0.0,66.27,44.68,13648.13,9816.93
14
+ GPT-OSS-120B,3.0,51,9.31,3.96,4.61,1.5,0.0,0.0,72.93,29.95,14005.53,9925.16
15
+ GPT-OSS-120B,4.0,63,11.27,6.03,4.75,1.74,0.0,0.0,209.88,103.57,6150.71,3126.57
16
+ GPT-OSS-120B,5.0,60,9.57,4.3,4.73,1.62,0.0,0.0,93.18,40.95,8075.18,5187.27
17
+ Gemini 2.5 Flash,1.0,8,21.53,7.32,3.5,0.93,0.04,0.01,44.67,12.88,4281.12,6576.5
18
+ Gemini 2.5 Flash,2.0,3,19.11,10.41,5.0,2.0,0.12,0.07,134.19,67.57,17629.33,22355.67
19
+ Gemini 2.5 Flash,3.0,9,10.86,3.59,3.22,0.67,0.1,0.04,110.95,32.06,14256.33,18029.89
20
+ Gemini 2.5 Flash,4.0,9,7.67,1.72,3.0,0.0,0.02,0.01,174.12,12.04,4807.56,2845.56
21
+ Gemini 2.5 Flash,5.0,5,27.67,23.6,5.4,1.67,0.16,0.16,239.42,205.47,12244.0,30325.6
22
+ Gemini 2.5 Pro,1.0,10,21.54,1.08,3.0,0.0,0.15,0.01,85.62,13.56,3332.2,7272.6
23
+ Gemini 2.5 Pro,2.0,5,15.45,9.32,4.2,1.79,0.42,0.28,203.8,114.19,12820.0,19547.2
24
+ Gemini 2.5 Pro,3.0,10,12.51,5.54,4.0,1.7,0.45,0.19,216.06,68.87,15538.9,20521.3
25
+ Gemini 2.5 Pro,4.0,10,10.91,0.96,3.0,0.0,0.12,0.01,247.46,100.84,4531.6,5594.2
26
+ Gemini 2.5 Pro,5.0,7,11.71,5.27,3.29,0.76,0.24,0.11,245.13,140.77,7157.29,11230.86
27
+ Grok-3,1.0,10,20.47,6.48,4.8,1.14,0.13,0.04,89.64,28.87,4422.6,3522.3
28
+ Grok-3,2.0,6,9.5,4.24,4.33,1.63,0.24,0.11,164.77,82.18,11228.67,5916.67
29
+ Grok-3,3.0,9,11.61,3.0,6.11,1.45,0.41,0.11,272.38,81.44,16853.33,10242.67
30
+ Grok-3,4.0,10,8.48,4.36,4.0,1.41,0.08,0.04,179.21,76.95,4100.6,1892.0
31
+ Grok-3,5.0,1,16.91,,7.0,,0.3,,261.68,,11842.0,7772.0
32
+ O3 (2025-04-16),1.0,19,13.96,7.67,3.53,1.12,0.06,0.03,53.0,26.99,2565.05,2905.89
33
+ O3 (2025-04-16),2.0,12,7.99,4.63,3.67,1.56,0.14,0.08,113.5,66.09,8249.42,6453.42
34
+ O3 (2025-04-16),3.0,4,12.03,4.57,6.0,2.0,0.27,0.1,218.26,58.79,15121.75,12987.0
35
+ O3 (2025-04-16),4.0,20,6.3,2.17,3.2,0.62,0.04,0.01,223.55,104.37,3035.1,1501.6
36
+ O3 (2025-04-16),5.0,13,9.15,5.61,4.38,1.89,0.11,0.06,222.75,170.42,5893.69,5141.54
37
+ O4 Mini (2025-04-16),1.0,9,21.44,11.42,4.33,1.73,0.05,0.03,64.15,80.02,3085.11,5285.78
38
+ O4 Mini (2025-04-16),2.0,6,11.41,3.19,4.33,1.03,0.11,0.03,81.17,14.35,10118.5,10389.0
39
+ O4 Mini (2025-04-16),3.0,8,8.8,5.02,4.5,2.07,0.11,0.06,200.31,318.54,11173.75,10194.62
40
+ O4 Mini (2025-04-16),4.0,10,7.79,2.35,3.2,0.63,0.03,0.01,224.63,266.14,3020.9,2597.3
41
+ O4 Mini (2025-04-16),5.0,1,5.83,,3.0,,0.04,,65.78,,3746.0,3859.0
42
+ Qwen-3 (235B),1.0,10,12.96,6.93,4.0,1.41,0.01,0.01,31.98,21.7,3646.9,2457.9
43
+ Qwen-3 (235B),2.0,7,12.07,4.69,5.57,1.51,0.05,0.02,103.66,33.28,15497.29,8631.43
44
+ Qwen-3 (235B),3.0,8,14.2,1.09,7.0,0.0,0.08,0.0,167.13,41.33,24811.12,13434.75
45
+ Qwen-3 (235B),4.0,10,5.36,1.85,3.4,0.84,0.01,0.0,225.46,130.23,3784.0,1271.9
46
+ Qwen-3 (235B),5.0,2,12.19,1.15,7.0,0.0,0.04,0.0,375.09,42.65,10558.0,6912.5
prompts/categorization.txt ADDED
@@ -0,0 +1,27 @@
1
+ Your task is to produce a set of boundaries that will categorize the provided samples in a way that maximizes the statistical significance.
2
+ The relevant samples are:
3
+ - Signal: '{BASE_DIR}/solution/arrays/signal.npy'
4
+ - Background: '{BASE_DIR}/solution/arrays/bkgd.npy'
5
+ - Signal scores: '{BASE_DIR}/solution/arrays/signal_scores.npy'
6
+ - Background scores: '{BASE_DIR}/solution/arrays/bkgd_scores.npy'
7
+
8
+ Write a python script to produce the categorization using the following tools (headers provided below).
9
+ YOU MUST INCLUDE "from utils import *" in the script; do not attempt to write these functions yourself.
10
+
11
+ def load_datasets(signal, bkgd, signal_scores, background_scores):
12
+ Return weighted and unweighted signal and background samples, signal_df and bkgd_df, as ROOT data frames.
13
+ You must load the input arguments as np arrays before passing to the function.
14
+ Example usage: signal_df, bkgd_df = load_datasets(signal, bkgd, signal_scores, bkgd_scores)
15
+
16
+ def get_significance(signal_df, bkgd_df, boundaries):
17
+ Return significance under current categorization.
18
+ Example usage: Z = get_significance(signal_df, bkgd_df, boundaries)
19
+
20
+ def place_boundary(signal_df, bkgd_df, boundaries, num_bins, min_events):
21
+ Return optimal location to place next boundary based on current boundaries, and resulting significance.
22
+ Example usage: new_boundary, new_Z = place_boundary(signal_df, bkgd_df, boundaries, num_bins, min_events)
23
+
24
+ Use the load_datasets(signal, bkgd, signal_scores, bkgd_scores) tool to get the signal and background histograms.
25
+ Keep track of the current categorization with an array containing the locations of the current boundaries (so start out with boundary_arr=[0,1]). num_bins should be set to 1000. Each time you want to place a boundary, use place_boundary to get the location of the new boundary and the resulting significance. Repeat until the significance improves by less than 5 percent as a result of adding the most recent boundary (that is, (new_significance - old_significance) / old_significance < 0.05). However, keep this last boundary, the one for which the improvement in significance is less than 5%.
26
+
27
+ Save the boundary array to '{BASE_DIR}/arrays/boundaries.npy' and the significance array (i.e., significance after adding each boundary) to '{BASE_DIR}/arrays/significances.npy'.
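A sketch of the loop this prompt asks the agents to produce, using the three helpers documented above. The paths stand in for the {BASE_DIR} template values, min_events is a guess (the prompt leaves it unspecified), and place_boundary is assumed to return a single scalar boundary:

```python
import numpy as np
from utils import *  # provides load_datasets, get_significance, place_boundary

signal = np.load('solution/arrays/signal.npy')  # stands in for {BASE_DIR}/...
bkgd = np.load('solution/arrays/bkgd.npy')
signal_scores = np.load('solution/arrays/signal_scores.npy')
bkgd_scores = np.load('solution/arrays/bkgd_scores.npy')

signal_df, bkgd_df = load_datasets(signal, bkgd, signal_scores, bkgd_scores)

boundaries = [0, 1]
num_bins, min_events = 1000, 10  # min_events not fixed by the prompt
significances = []
old_Z = get_significance(signal_df, bkgd_df, boundaries)
while True:
    new_boundary, new_Z = place_boundary(signal_df, bkgd_df, boundaries, num_bins, min_events)
    boundaries = sorted(boundaries + [new_boundary])
    significances.append(new_Z)
    if (new_Z - old_Z) / old_Z < 0.05:
        break  # keep this last boundary, per the stopping rule above
    old_Z = new_Z

np.save('arrays/boundaries.npy', np.array(boundaries))
np.save('arrays/significances.npy', np.array(significances))
```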
prompts/create_numpy.txt ADDED
@@ -0,0 +1,91 @@
1
+ Your task is to write a Python script that reads each ROOT file listed in {BASE_DIR}/solution/arrays/file_list.txt using uproot. For each file, extract the specified observables and store them in a NumPy array.
2
+
3
+ The naming of the output NumPy file should follow these rules:
4
+ - If the input ROOT file listed in file_list.txt contains "data_A.GamGam.root", name the output file: {BASE_DIR}/arrays/data_A_raw.npy
5
+ - If the input ROOT file listed in file_list.txt contains "mc_345318.WpH125J_Wincl_gamgam.GamGam.root", name the output file: {BASE_DIR}/arrays/signal_WH_raw.npy
6
+ - For other files, do not process or generate any output.
7
+
8
+ Refer to the ROOT file summary provided below to identify the correct tree and branch names. Be precise β€” instruct the worker exactly which trees and branches to extract.
9
+
10
+ Note: Some branches (for example, photon, lepton, and jet observables) are arrays containing multiple entries per event, ordered by descending pT.
11
+ Important: Do not loop over events. Use uproot to load entire branches at once for efficient processing.
12
+
13
+ For each event, you should save
14
+ - pT, eta, phi of each of the two photons
15
+ - pT, eta, phi of the two leptons in the event with the highest pT
16
+ - pT, eta, phi of the six jets in the event with the highest pT
17
+ - pT and phi of the MET
18
+ - Event weight (just MC weight, not multiplied by any extra scale factors)
19
+ - Flag for each photon indicating whether tight ID requirements are satisfied
20
+ - Cross section
21
+ - Sum of weights in ROOT file
22
+ - Scale factors for photon, electron, muon, b-tagging, pileup, electron trigger, photon trigger.
23
+
24
+ The indices should be as follows (note that these names may not correspond to the branch names in the ROOT files):
25
+ 0: leading photon pt
26
+ 1: leading photon eta
27
+ 2: leading photon phi
28
+ 3: subleading photon pt
29
+ 4: subleading photon eta
30
+ 5: subleading photon phi
31
+ 6: leading lepton pt
32
+ 7: leading lepton eta
33
+ 8: leading lepton phi
34
+ 9: subleading lepton pT
35
+ 10: subleading lepton eta
36
+ 11: subleading lepton phi
37
+ 12: jet 1 pT
38
+ 13: jet 1 eta
39
+ 14: jet 1 phi
40
+ 15: jet 2 pT
41
+ 16: jet 2 eta
42
+ 17: jet 2 phi
43
+ 18: jet 3 pT
44
+ 19: jet 3 eta
45
+ 20: jet 3 phi
46
+ 21: jet 4 pT
47
+ 22: jet 4 eta
48
+ 23: jet 4 phi
49
+ 24: jet 5 pT
50
+ 25: jet 5 eta
51
+ 26: jet 5 phi
52
+ 27: jet 6 pT
53
+ 28: jet 6 eta
54
+ 29: jet 6 phi
55
+ 30: met ET
56
+ 31: met phi
57
+ 32: MC weight
58
+ 33: sum of weights
59
+ 34: cross section
60
+ 35: tight ID of leading photon
61
+ 36: tight ID of subleading photon
62
+ 37: scaleFactor_PILEUP
63
+ 38: scaleFactor_PHOTON
64
+ 39: scaleFactor_PhotonTRIGGER
65
+ 40: scaleFactor_ELE
66
+ 41: scaleFactor_MUON
67
+ 42: scaleFactor_LepTRIGGER
68
+ 43: scaleFactor_BTAG
69
+ 44: NaN
70
+ 45: NaN
71
+
72
+ Fill indices 44 and 45 (last indices of the column) with NaN values to serve as placeholders for the diphoton invariant mass and transverse momentum, which will be computed later.
73
+
74
+ # Implementation Details (required for correct column mapping)
75
+ - Use TTree named "mini" and load branches via `uproot.open(...)["mini"].arrays()` or `uproot.lazy()`.
76
+ - Branch-to-column mapping:
77
+ * Columns 0–2: `photon_pt[0]`, `photon_eta[0]`, `photon_phi[0]`
78
+ * Columns 3–5: `photon_pt[1]`, `photon_eta[1]`, `photon_phi[1]`
79
+ * Columns 6–8: `lep_pt[0]`, `lep_eta[0]`, `lep_phi[0]`
80
+ * Columns 9–11: `lep_pt[1]`, `lep_eta[1]`, `lep_phi[1]`
81
+ * Columns 12–14: `jet_pt[0]`, `jet_eta[0]`, `jet_phi[0]` (and so on through index 29 for jets 0–5)
82
+ * Column 30: `met_et`
83
+ * Column 31: `met_phi`
84
+ * Column 32: `mcWeight`
85
+ * Column 33: `SumWeights`
86
+ * Column 34: `XSection`
87
+ * Column 35: `photon_isTightID[0]`
88
+ * Column 36: `photon_isTightID[1]`
89
+ * Columns 37–43: scale factors in the order `[scaleFactor_PILEUP, scaleFactor_PHOTON, scaleFactor_PhotonTRIGGER, scaleFactor_ELE, scaleFactor_MUON, scaleFactor_LepTRIGGER, scaleFactor_BTAG]`
90
+ - Jagged arrays (photons, leptons, jets) must be padded to length 2 or 6 with `np.nan`.
91
+ - After saving, print file path, array shape, dtype, and per-column NaN counts.
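One way to meet the no-event-loop and NaN-padding requirements is awkward-array's `pad_none`/`fill_none`; a sketch for the photon pT columns only, not the required implementation (file name as in the naming rules above):

```python
import uproot
import awkward as ak
import numpy as np

tree = uproot.open("data_A.GamGam.root")["mini"]
photon_pt = tree["photon_pt"].array()

# Pad each event's photon list to exactly 2 entries, then turn the
# missing slots into NaN so the fixed-width column layout holds
pt = ak.to_numpy(ak.fill_none(ak.pad_none(photon_pt, 2, clip=True), np.nan))
print(pt.shape)  # (n_events, 2): feeds columns 0 and 3 of the output array
```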
prompts/old/create_numpy_obsolete.txt ADDED
@@ -0,0 +1,65 @@
1
+ User Prompt:
2
+ Your task is to write a Python script that reads one of the ROOT files in '{BASE_DIR}/logs/file_list.txt' using uproot and stores the following observables in a NumPy array. The .root files to be processed are listed with absolute paths in '{BASE_DIR}/logs/file_list.txt'. You may use the ROOT file summary included below to see how the trees and branches in the ROOT file are labeled. It is very important to use the correct tree and branch names, so you should tell the worker agent exactly which trees and branches to extract. Note that some branches (for example, photon, lepton, and jet observables) will be arrays containing the corresponding observables for each particle, ordered from highest pT to lowest pT. Photon ID flags such as `photon_isTightID` are jagged arrays with one entry per photon per event and must be flattened or indexed appropriately. Do NOT allow the worker to loop over all events; that will be very slow, and it is much better to read entire branches at a time.
3
+
4
+ For each event, you should save
5
+ - pT, eta, phi of each of the two photons.
6
+ - pT, eta, phi of the two highest-pT leptons in the event.
7
+ - pT, eta, phi of the six highest-pT jets in the event.
8
+ - ET and phi of the MET.
9
+ - MC weight.
10
+ - Flag for each photon indicating whether tight identification (ID) requirements are satisfied.
11
+ - Cross section.
12
+ - Sum of weights.
13
+ - Scale factors for photon, electron, muon, b-tagging, pileup, electron trigger, photon trigger.
14
+
15
+ Fill indices 44 and 45 (last indices of the column) with NaN values to serve as placeholders for the diphoton invariant mass and transverse momentum, which will be computed later.
16
+
17
+ Save each observable in the NumPy array at the corresponding column index as listed below:
18
+
19
+ The indices should be as follows (note that these names may not correspond to the branch names in the ROOT files):
20
+ 0: leading photon pt
21
+ 1: leading photon eta
22
+ 2: leading photon phi
23
+ 3: subleading photon pt
24
+ 4: subleading photon eta
25
+ 5: subleading photon phi
26
+ 6: leading lepton pt
27
+ 7: leading lepton eta
28
+ 8: leading lepton phi
29
+ 9: subleading lepton pT
30
+ 10: subleading lepton eta
31
+ 11: subleading lepton phi
32
+ 12: jet 1 pT
33
+ 13: jet 1 eta
34
+ 14: jet 1 phi
35
+ 15: jet 2 pT
36
+ 16: jet 2 eta
37
+ 17: jet 2 phi
38
+ 18: jet 3 pT
39
+ 19: jet 3 eta
40
+ 20: jet 3 phi
41
+ 21: jet 4 pT
42
+ 22: jet 4 eta
43
+ 23: jet 4 phi
44
+ 24: jet 5 pT
45
+ 25: jet 5 eta
46
+ 26: jet 5 phi
47
+ 27: jet 6 pT
48
+ 28: jet 6 eta
49
+ 29: jet 6 phi
50
+ 30: met ET
51
+ 31: met phi
52
+ 32: MC weight
53
+ 33: sum of weights
54
+ 34: cross section
55
+ 35: tight ID of leading photon
56
+ 36: tight ID of subleading photon
57
+ 37: scaleFactor_PILEUP
58
+ 38: scaleFactor_PHOTON
59
+ 39: scaleFactor_PhotonTRIGGER
60
+ 40: scaleFactor_ELE
61
+ 41: scaleFactor_MUON
62
+ 42: scaleFactor_LepTRIGGER
63
+ 43: scaleFactor_BTAG
64
+ 44: NaN
65
+ 45: NaN
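The column layout above amounts to pre-allocating an all-NaN (N, 46) array and filling columns from bulk-read branches. A sketch with synthetic stand-in data (event count and branch values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_events = 1_000                              # hypothetical event count
met_et = rng.exponential(30_000.0, n_events)  # stand-in for a bulk-read branch
mc_weight = np.ones(n_events)                 # stand-in for mcWeight

arr = np.full((n_events, 46), np.nan)         # columns 44/45 stay NaN placeholders
arr[:, 30] = met_et                           # met ET
arr[:, 32] = mc_weight                        # MC weight
np.save("data_A_raw.npy", arr)
print(arr.shape, int(np.isnan(arr[:, 44]).sum()))
```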
prompts/old/create_numpy_original.txt ADDED
@@ -0,0 +1,58 @@
1
+ Your task is to write a Python script that reads each ROOT file in '{BASE_DIR}/logs/file_list.txt' using uproot and stores the following observables in a NumPy array. The NumPy array should be saved as '{BASE_DIR}/arrays/{ROOT_name}.npy', where {ROOT_name} is replaced by the name of the ROOT file (without the extension or filepath). You may use the ROOT file summary included below to see how the trees and branches in the ROOT file are labeled. It is very important to use the correct tree and branch names, so you should tell the worker agent exactly which trees and branches to extract. Note that some branches (e.g., photon, lepton, and jet observables) will be arrays containing the corresponding observables for each particle, ordered from highest pT to lowest pT. Do NOT allow the worker to loop over all events; that will be very slow, and it is much better to read entire branches at a time.
2
+
3
+ For each event, you should save
4
+ - pT, eta, phi of each of the two photons
5
+ - pT, eta, phi of the two leptons in the event with the highest pT
6
+ - pT, eta, phi of the six jets in the event with the highest pT
7
+ - pT and phi of the MET
8
+ - Event weight (just MC weight, not multiplied by any extra scale factors)
9
+ - Flag for each photon indicating whether tight ID requirements are satisfied
10
+ - Cross section
11
+ - Sum of weights in ROOT file
12
+ - Scale factors for photon, electron, muon, b-tagging, pileup, electron trigger, photon trigger
13
+
14
+ The indices should be as follows (note that these names may not correspond to the branch names in the ROOT files):
15
+ 0: photon 1 pT
16
+ 1: photon 1 eta
17
+ 2: photon 1 phi
18
+ 3: photon 2 pT
19
+ 4: photon 2 eta
20
+ 5: photon 2 phi
21
+ 6: lepton 1 pT
22
+ 7: lepton 1 eta
23
+ 8: lepton 1 phi
24
+ 9: lepton 2 pT
25
+ 10: lepton 2 eta
26
+ 11: lepton 2 phi
27
+ 12: jet 1 pT
28
+ 13: jet 1 eta
29
+ 14: jet 1 phi
30
+ 15: jet 2 pT
31
+ 16: jet 2 eta
32
+ 17: jet 2 phi
33
+ 18: jet 3 pT
34
+ 19: jet 3 eta
35
+ 20: jet 3 phi
36
+ 21: jet 4 pT
37
+ 22: jet 4 eta
38
+ 23: jet 4 phi
39
+ 24: jet 5 pT
40
+ 25: jet 5 eta
41
+ 26: jet 5 phi
42
+ 27: jet 6 pT
43
+ 28: jet 6 eta
44
+ 29: jet 6 phi
45
+ 30: met pT
46
+ 31: met phi
47
+ 32: MC weight
48
+ 33: photon 1 tight ID?
49
+ 34: photon 2 tight ID?
50
+ 35: cross section
51
+ 36: sum of weights
52
+ 37: scaleFactor_PILEUP
53
+ 38: scaleFactor_PHOTON
54
+ 39: scaleFactor_PhotonTRIGGER
55
+ 40: scaleFactor_ELE
56
+ 41: scaleFactor_MUON
57
+ 42: scaleFactor_LepTRIGGER
58
+ 43: scaleFactor_BTAG
prompts/old/create_numpy_step2.txt ADDED
@@ -0,0 +1,103 @@
1
+ Your primary task is to write a single, robust Python script that can process different ROOT files based on command-line arguments.
2
+
3
+ **Script Requirements:**
4
+
5
+ 1. **Argument Parsing:** The script must accept three command-line arguments:
6
+ * `--input-file-list`: The path to a text file containing a list of absolute paths to ROOT files. For this task, this will be '{BASE_DIR}/logs/file_list.txt'.
7
+ * `--input-name`: The base name of the specific ROOT file to process (e.g., "data_A.GamGam.root").
8
+ * `--output-file`: The absolute path for the output NumPy file (e.g., '{BASE_DIR}/arrays/data_A_raw.npy').
9
+
10
+ 2. **File Path Discovery:**
11
+ * The script must open the file specified by `--input-file-list`.
12
+ * It must read the contents and find the full, absolute path that ends with the filename given by `--input-name`.
13
+
14
+ 3. **Data Processing:**
15
+ * Using the discovered absolute path, the script will open the ROOT file with uproot.
16
+ * It must read the specified branches without looping over events (i.e., using bulk/vectorized reads).
17
+
18
+ The script will be executed twice with different arguments to handle the two conversions:
19
+
20
+ * **Execution 1:**
21
+ * `--input-name "data_A.GamGam.root"`
22
+ * `--output-file '{BASE_DIR}/arrays/data_A_raw.npy'`
23
+ * **Execution 2:**
24
+ * `--input-name "mc_345318.WpH125J_Wincl_gamgam.GamGam.root"`
25
+ * `--output-file '{BASE_DIR}/arrays/signal_WH_raw.npy'`
26
+
27
+ **Data Mapping:**
28
+
29
+ When processing each file, use uproot to store the following observables in the corresponding NumPy array. You may use the ROOT file summary included below to see how the trees and branches in the ROOT file are labeled. It is very important to use the correct tree and branch names. Note that some branches (for example, photon, lepton, and jet observables) will be arrays containing the corresponding observables for each particle, ordered from highest pT to lowest pT. Photon ID flags such as `photon_isTightID` are jagged arrays with one entry per photon per event and must be flattened or indexed appropriately. Do NOT loop over events; it is much better to read entire branches at a time.
30
+
31
+ For each event, you should save
32
+ - pT, eta, phi of each of the two photons.
33
+ - pT, eta, phi of the two highest-pT leptons in the event.
34
+ - pT, eta, phi of the six highest-pT jets in the event.
35
+ - ET and phi of the MET.
36
+ - MC weight.
37
+ - Flag for each photon indicating whether tight identification (ID) requirements are satisfied.
38
+ - Cross section.
39
+ - Sum of weights.
40
+ - Scale factors for photon, electron, muon, b-tagging, pileup, electron trigger, photon trigger.
41
+
42
+ Fill indices 44 and 45 (the last two columns) with NaN values to serve as placeholders for the diphoton invariant mass and transverse momentum, which will be computed later.
43
+
44
+ Save each observable in the NumPy array at the corresponding column index as listed below:
45
+
46
+ The indices should be as follows (note that these names may not correspond to the branch names in the ROOT files):
47
+ 0: leading photon pt
48
+ 1: leading photon eta
49
+ 2: leading photon phi
50
+ 3: subleading photon pt
51
+ 4: subleading photon eta
52
+ 5: subleading photon phi
53
+ 6: leading lepton pt
54
+ 7: leading lepton eta
55
+ 8: leading lepton phi
56
+ 9: subleading lepton pT
57
+ 10: subleading lepton eta
58
+ 11: subleading lepton phi
59
+ 12: jet 1 pT
60
+ 13: jet 1 eta
61
+ 14: jet 1 phi
62
+ 15: jet 2 pT
63
+ 16: jet 2 eta
64
+ 17: jet 2 phi
65
+ 18: jet 3 pT
66
+ 19: jet 3 eta
67
+ 20: jet 3 phi
68
+ 21: jet 4 pT
69
+ 22: jet 4 eta
70
+ 23: jet 4 phi
71
+ 24: jet 5 pT
72
+ 25: jet 5 eta
73
+ 26: jet 5 phi
74
+ 27: jet 6 pT
75
+ 28: jet 6 eta
76
+ 29: jet 6 phi
77
+ 30: met ET
78
+ 31: met phi
79
+ 32: MC weight
80
+ 33: sum of weights
81
+ 34: cross section
82
+ 35: tight ID of leading photon?
83
+ 36: tight ID of subleading photon?
84
+ 37: scaleFactor_PILEUP
85
+ 38: scaleFactor_PHOTON
86
+ 39: scaleFactor_PhotonTRIGGER
87
+ 40: scaleFactor_ELE
88
+ 41: scaleFactor_MUON
89
+ 42: scaleFactor_LepTRIGGER
90
+ 43: scaleFactor_BTAG
91
+ 44: NaN
92
+ 45: NaN
93
+
94
+ ================================================================================
95
+ ROOT FILES ANALYSIS SUMMARY
96
+ ================================================================================
97
+
98
+ COMMON BRANCHES ACROSS ALL FILES
99
+ ========================================
100
+
101
+ Tree: mini;1
102
+ Common branches (81):
103
+ SumWeights, XSection, channelNumber, ditau_m, eventNumber, jet_E, jet_MV2c10, jet_eta, jet_jvt, jet_n, jet_phi, jet_pt, jet_pt_syst, jet_trueflav, jet_truthMatched, largeRjet_D2, largeRjet_E, largeRjet_eta, largeRjet_m, largeRjet_n, largeRjet_phi, largeRjet_pt, largeRjet_pt_syst, largeRjet_tau32, largeRjet_truthMatched, lep_E, lep_charge, lep_eta, lep_etcone20, lep_isTightID, lep_n, lep_phi, lep_pt, lep_pt_syst, lep_ptcone30, lep_trackd0pvunbiased, lep_tracksigd0pvunbiased, lep_trigMatched, lep_truthMatched, lep_type, lep_z0, mcWeight, met_et, met_et_syst, met_phi, photon_E, photon_convType, photon_eta, photon_etcone20, photon_isTightID, photon_n, photon_phi, photon_pt, photon_pt_syst, photon_ptcone30, photon_trigMatched, photon_truthMatched, runNumber, scaleFactor_BTAG, scaleFactor_ELE, scaleFactor_LepTRIGGER, scaleFactor_MUON, scaleFactor_PHOTON, scaleFactor_PILEUP, scaleFactor_PhotonTRIGGER, scaleFactor_TAU, tau_BDTid, tau_E, tau_charge, tau_eta, tau_isTightID, tau_n, tau_nTracks, tau_phi, tau_pt, tau_pt_syst, tau_trigMatched, tau_truthMatched, trigE, trigM, trigP
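A sketch of the argument-parsing and path-discovery logic described in the requirements above. Only the flags named in the prompt are used; the error handling is a guess at intent:

```python
import argparse

parser = argparse.ArgumentParser(description="Convert one ROOT file to .npy")
parser.add_argument("--input-file-list", required=True)
parser.add_argument("--input-name", required=True)
parser.add_argument("--output-file", required=True)
args = parser.parse_args()

# Find the one absolute path in the list that ends with --input-name.
with open(args.input_file_list) as fh:
    paths = [line.strip() for line in fh if line.strip()]
matches = [p for p in paths if p.endswith(args.input_name)]
if len(matches) != 1:
    raise FileNotFoundError(
        f"expected exactly one entry ending with {args.input_name!r}, "
        f"found {len(matches)}"
    )
root_path = matches[0]
print(f"Processing {root_path} -> {args.output_file}")
```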
prompts/old/preprocess_obsolete.txt ADDED
@@ -0,0 +1,95 @@
1
+ Your task is to write a Python script that:
2
+ 1. Loads the following two .npy files:
3
+ - {BASE_DIR}/solution/arrays/data_raw.npy (real data events)
4
+ - {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)
5
+ 2. Filters the events in both files according to the criteria described below.
6
+
7
+ Each file contains a NumPy array with 46 columns, where each row corresponds to an event. The goal is to preprocess these arrays following the steps below, and then save the resulting output arrays as:
8
+ - `signal.npy`: containing selected MC signal events
9
+ - `bkgd.npy`: containing selected background modeling events (from real data)
10
+
11
+ Save both to `{BASE_DIR}/arrays/`.
12
+
13
+ Step 1: Load and validate
14
+ - Load both `.npy` files.
15
+ - Validate that each array has 46 columns. Raise an error if not.
16
+ - **Do not drop any columns**: preserve the full `(N, 46)` array. Only update columns 32, 44, and 45 in place.
17
+
18
+ Step 2: MC weight update (for MC signal only)
19
+ This step applies only to `signal_raw.npy`:
20
+ - Compute the process-level normalization weight using:
21
+ weight = (cross section [pb] × luminosity [pb^{-1}]) / sum of weights
22
+ Use luminosity = 10,000 pb^{-1}.
23
+ - IMPORTANT CORRECTION: The cross-section value of 2.64338632e-06 pb in the data corresponds to SM Higgs production and needs to be corrected to 0.000116 pb (the expected SM Higgs -> gammagamma cross-section). Apply a correction factor of ~43.9× to these events.
24
+
25
+ - Note: `signal_raw.npy` contains multiple physics processes. Cross section and sum of weights may differ per process.
26
+ > Handle each process separately if needed.
27
+
28
+ - Apply the following event-level scale factors multiplicatively:
29
+ > pileup
30
+ > photon
31
+ > trigger
32
+ > lepton
33
+ > b-tagging
34
+
35
+ - Filter out events that have zero in any of the scale factor fields.
36
+
37
+ - Compute the final event weight as:
38
+ final_weight = normalization_weight * (product of scale factors)
39
+
40
+ - Store the final weight in index 32 of each row.
41
+
42
+ Step 3: Kinematic calculations and preselection (for both MC and data)
43
+
44
+ - For each event (in both MC and data arrays):
45
+ > 1. Compute diphoton invariant mass and transverse momentum using `ROOT.TLorentzVector` (Do not use the `vector` module)
46
+ > 2. Store: diphoton invariant mass in column 44 and diphoton transverse momentum (pt) in column 45
47
+
48
+ - Apply the following preselection cuts to all events (both MC and data):
49
+ > Photon pseudorapidity: |η| < 1.37 or 1.52 < |η| < 2.37 (for **each** photon)
50
+ > Transverse momentum pt > 25,000 MeV (for both photons)
51
+ > Leading photon: (pt / m_yy) > 0.35
52
+ > Subleading photon: (pt / m_yy) > 0.25
53
+ > Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
54
+
55
+ Step 4a: Signal selection (for MC)
56
+ - From the preselected MC signal events, keep only those:
57
+ > Where both photons pass tight photon ID
58
+ > And 123,000 MeV < m_yy < 127,000 MeV (signal region)
59
+
60
+ - Save the resulting events to: `{BASE_DIR}/arrays/signal.npy`
61
+
62
+ Step 4b: Background modeling and normalization (from real data)
63
+ - Use preselected data events to estimate the background shape and normalization.
64
+
65
+ Region definitions:
66
+ - Sideband region:
67
+ 105,000 MeV < m_yy < 120,000 or
68
+ 130,000 MeV < m_yy < 160,000
69
+ - Signal region:
70
+ 123,000 MeV < m_yy < 127,000
71
+
72
+ Photon ID categories:
73
+ - TI (tight ID): photons pass tight photon ID
74
+ - NTI (non-tight ID): photons fail tight ID but pass loose ID
75
+
76
+ Steps:
77
+ 1. Compute event yields (sum of weights) in the following categories:
78
+ - NTI sideband
79
+ - NTI signal region
80
+ - TI sideband
81
+ 2. Calculate scale factors:
82
+ - SF1 = TI sideband yield / NTI sideband yield
83
+ - SF2 = NTI signal region yield / NTI sideband yield
84
+ 3. Compute expected background yield in TI signal region: expected_yield = SF1 * SF2 * NTI sideband yield
85
+ 4. Retain only the NTI sideband events for background modeling.
86
+ 5. Rescale their weights so that their total weight matches the `expected_yield`.
87
+
88
+ - Save the rescaled background events to: `{BASE_DIR}/arrays/bkgd.npy`
89
+
90
+ Summary of Output
91
+
92
+ | Output File | Contains |
93
+ |----------------------|--------------------------------------------------------------------------|
94
+ | signal.npy | MC signal events passing preselection and signal region + tight ID cuts |
95
+ | bkgd.npy | Real data events (NTI sideband) rescaled to match expected background |
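Step 4b above is a two-scale-factor sideband extrapolation. A self-contained sketch of the arithmetic with synthetic stand-ins for the preselected data (all array names and values hypothetical):

```python
import numpy as np

# Hypothetical per-event arrays for preselected data: weights, mass in MeV,
# and a boolean tight-ID flag per event.
rng = np.random.default_rng(0)
m_yy = rng.uniform(105_000, 160_000, size=10_000)
w = np.ones_like(m_yy)
tight = rng.random(10_000) < 0.7

signal = (m_yy > 123_000) & (m_yy < 127_000)
sideband = ((m_yy > 105_000) & (m_yy < 120_000)) | ((m_yy > 130_000) & (m_yy < 160_000))

y_nti_sb = w[~tight & sideband].sum()
y_nti_sr = w[~tight & signal].sum()
y_ti_sb = w[tight & sideband].sum()

sf1 = y_ti_sb / y_nti_sb          # tight-to-loose ratio in the sideband
sf2 = y_nti_sr / y_nti_sb         # signal-to-sideband transfer in NTI
expected = sf1 * sf2 * y_nti_sb   # expected background in the TI signal region

# Keep only NTI sideband events, rescaled so their total weight is `expected`.
w_bkgd = w[~tight & sideband] * (expected / y_nti_sb)
print(expected, w_bkgd.sum())
```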
prompts/old/preprocess_original.txt ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Your task is to read the NumPy arrays in '{BASE_DIR}/solution/arrays/data_raw.npy' and '{BASE_DIR}/solution/arrays/signal_raw.npy' and preprocess them. Each of the two .npy files contains a 46-column array, holding real data and MC signal events respectively. Please follow the preprocessing instructions described below:
2
+
3
+ Step 1: Load and validate the arrays to ensure they have 46 columns.
4
+
5
+ Step 2: MC weight update
6
+ - Compute the event weight using the formula: weight = (cross section [pb] × luminosity [pb^{-1}]) / sum of weights (Use a luminosity value of 10,000 pb^{-1}.)
7
+ - The file signal_raw.npy contains multiple physics processes.
8
+ > The cross section and sum of weights may vary depending on the process, so handle them process-wise if needed.
9
+ - Apply the following scale factors multiplicatively: pileup, photon, trigger, lepton, b-tagging
10
+ - After applying scale factors, filter out any events that have zero in any of the scale factor fields.
11
+ - Multiply the process-level weight by the product of the event-level scale factors to get the final event weight.
12
+ - Store the final weights in index 32 of the event array for downstream analysis.
13
+
14
+ Step 3: pT, eta, and m_yy cuts
15
+ - Update the last two columns (indices 44 and 45 in the 46-column array) to store the diphoton invariant mass (index 44) and transverse momentum (pT, index 45). These values should be computed using ROOT.TLorentzVector. Do not use the vector module.
16
+ - The following preselection criteria are applied to all events before signal region selection:
17
+ > Photon eta selections: |eta| < 1.37 or 1.52 < |eta| < 2.37 for each photon.
18
+ > p_T > 25000 MeV for both photons.
19
+ > p_T / m_yy > 0.35 for leading photon.
20
+ > p_T / m_yy > 0.25 for subleading photon.
21
+ > 105000 MeV < m_yy < 160000 MeV.
22
+ > Only keep signal events which pass tight photon ID requirements for both photons and which have 123000 MeV < m_yy < 127000 MeV.
23
+
24
+ Step 4: background normalization
25
+ - Sideband region (for background): 105000 MeV < m_yy < 120000 MeV or 130000 MeV < m_yy < 160000 MeV.
26
+ - Signal region (for background estimation): 123000 MeV < m_yy < 127000 MeV.
27
+ - All yields are defined as the sum of event weights.
28
+ - Define: NTI:
29
+ > Non-tight photon ID region (fails tight ID but passes loose).
30
+ > TI: Tight photon ID region (passes tight photon ID).
31
+ - Scale factor 1 (SF1):
32
+ > SF1 = TI sideband yield / NTI sideband yield (estimates tight-to-loose ratio in the sideband)
33
+ - Scale factor 2 (SF2):
34
+ > SF2 = NTI signal window yield / NTI sideband yield (estimates signal-to-sideband transfer in NTI region)
35
+ - Expected background yield in TI signal region
36
+ > Expected yield = SF1 * SF2 * NTI sideband yield
37
+ - Action: keep only NTI sideband events for background modeling, but rescale their weights so that the total weight matches the expected background yield computed above
38
+
39
+ Step 5: save arrays
40
+ - Save arrays (46 columns) in '{BASE_DIR}/arrays/signal.npy' and '{BASE_DIR}/arrays/bkgd.npy'
41
+
42
+ For debugging please print the sum of signal weights and the sum of background weights before selection, after the photon pT and eta cuts, after the photon m_yy cut, and after applying tight photon ID requirements.
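The debugging request in the final line above is a small weighted cutflow. One way to sketch it (the mask names in the commented usage are hypothetical):

```python
import numpy as np

def cutflow(label, weights, masks):
    """Print the weighted yield before selection and after each successive cut."""
    keep = np.ones(len(weights), dtype=bool)
    print(f"{label}: before selection = {weights.sum():.3f}")
    for name, mask in masks:
        keep &= mask
        print(f"{label}: after {name} = {weights[keep].sum():.3f}")

# Hypothetical usage with precomputed boolean masks:
# cutflow("signal", w, [("pT/eta cuts", pt_eta_mask),
#                       ("m_yy window", myy_mask),
#                       ("tight photon ID", tight_id_mask)])
```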
prompts/preprocess.txt ADDED
@@ -0,0 +1,184 @@
1
+ Your task is to write a Python script that processes ATLAS diphoton event data.
2
+
3
+ Load the following two numpy array files:
4
+ - {BASE_DIR}/solution/arrays/data_raw.npy (real collision data)
5
+ - {BASE_DIR}/solution/arrays/signal_raw.npy (Monte Carlo simulated signal)
6
+
7
+ Each file contains a 2D array with shape (N_events, 46), where each row is one event and columns store physics quantities.
8
+
9
+ Your script must:
10
+ 1. Apply MC reweighting to simulated events
11
+ 2. Compute diphoton kinematics for all events
12
+ 3. Apply physics selection cuts
13
+ 4. Save final signal and background samples
14
+
15
+ Save outputs to:
16
+ - {BASE_DIR}/arrays/signal.npy
17
+ - {BASE_DIR}/arrays/bkgd.npy
18
+
19
+ ====================
20
+ COLUMN DEFINITIONS
21
+ ====================
22
+
23
+ 0: leading photon pT (MeV)
24
+ 1: leading photon eta
25
+ 2: leading photon phi
26
+ 3: subleading photon pT (MeV)
27
+ 4: subleading photon eta
28
+ 5: subleading photon phi
29
+ 6: leading lepton pT
30
+ 7: leading lepton eta
31
+ 8: leading lepton phi
32
+ 9: subleading lepton pT
33
+ 10: subleading lepton eta
34
+ 11: subleading lepton phi
35
+ 12-29: jet kinematics (6 jets x 3 variables)
36
+ 30: missing ET
37
+ 31: missing ET phi
38
+ 32: event weight
39
+ 33: sum of MC weights
40
+ 34: cross section (pb)
41
+ 35: leading photon tight ID flag
42
+ 36: subleading photon tight ID flag
43
+ 37: scaleFactor_PILEUP
44
+ 38: scaleFactor_PHOTON
45
+ 39: scaleFactor_PhotonTRIGGER
46
+ 40: scaleFactor_ELE
47
+ 41: scaleFactor_MUON
48
+ 42: scaleFactor_LepTRIGGER
49
+ 43: scaleFactor_BTAG
50
+ 44: (initially NaN) diphoton invariant mass m_yy (MeV)
51
+ 45: (initially NaN) diphoton transverse momentum pT_yy (MeV)
52
+
53
+ ====================
54
+ STEP 1: LOAD AND VALIDATE
55
+ ====================
56
+
57
+ Load both .npy files with numpy.load(). Verify each has exactly 46 columns; raise ValueError if not.
58
+ Do NOT drop any columns. Preserve the full (N, 46) shape throughout.
59
+
60
+ ====================
61
+ STEP 2: MC WEIGHT UPDATE (signal_raw.npy only)
62
+ ====================
63
+
64
+ A. Cross-section correction:
65
+ For any row where abs(column_34 - 2.64338632e-06) < 1e-10:
66
+ Replace column 34 with 0.000116 (correct Higgs to gamma-gamma cross-section in pb)
67
+
68
+ B. Normalization (per-event, not global):
69
+ For each row independently compute:
70
+ norm = (column_34 * 10000.0) / column_33
71
+ where 10000.0 is the luminosity in pb inverse
72
+
73
+ C. Scale factor product:
74
+ For each row multiply columns 37 through 43 (7 factors total)
75
+
76
+ D. Final weight:
77
+ column_32 = column_32 * norm * scale_factor_product
78
+ Store result back into column 32
79
+
80
+ ====================
81
+ STEP 3: KINEMATICS (both MC and data)
82
+ ====================
83
+
84
+ For every event use ROOT.TLorentzVector to compute diphoton system:
85
+
86
+ photon1 = ROOT.TLorentzVector()
87
+ photon1.SetPtEtaPhiM(column_0, column_1, column_2, 0.0)
88
+
89
+ photon2 = ROOT.TLorentzVector()
90
+ photon2.SetPtEtaPhiM(column_3, column_4, column_5, 0.0)
91
+
92
+ diphoton = photon1 + photon2
93
+ column_44 = diphoton.M()
94
+ column_45 = diphoton.Pt()
95
+
96
+ ====================
97
+ STEP 4: PRESELECTION (both MC and data)
98
+ ====================
99
+
100
+ Create a safe denominator for ratio cuts:
101
+ m_yy_safe = np.where(column_44 <= 0, 1e-6, column_44)
102
+
103
+ Apply ALL of the following cuts (combine with logical AND):
104
+
105
+ 1. Photon eta acceptance (both photons):
106
+ abs(column_1) < 1.37 OR (1.52 < abs(column_1) < 2.37)
107
+ abs(column_4) < 1.37 OR (1.52 < abs(column_4) < 2.37)
108
+
109
+ 2. Photon pT thresholds:
110
+ column_0 > 25000 (leading photon pT in MeV)
111
+ column_3 > 25000 (subleading photon pT in MeV)
112
+
113
+ 3. pT/mass ratios (use m_yy_safe to avoid division by zero):
114
+ column_0 / m_yy_safe > 0.35 (leading photon)
115
+ column_3 / m_yy_safe > 0.25 (subleading photon)
116
+
117
+ CRITICAL: Column 0 is ALWAYS the leading photon, column 3 is ALWAYS subleading.
118
+ Do NOT use np.maximum or np.minimum to pick which is which.
119
+ The input arrays are already sorted by pT.
120
+
121
+ 4. Diphoton mass window:
122
+ 105000 < column_44 < 160000 (MeV)
123
+
124
+ Keep only rows passing all cuts above.
125
+
126
+ After preselection, for DATA ONLY:
127
+ Set column_32 = 1.0 for all remaining data events
128
+
129
+ ====================
130
+ STEP 5: SIGNAL SELECTION (MC only)
131
+ ====================
132
+
133
+ From preselected MC events, apply:
134
+
135
+ 1. Tight photon ID:
136
+ (column_35 == 1.0) AND (column_36 == 1.0)
137
+ Use exact equality. Do NOT use np.isclose().
138
+
139
+ 2. Signal mass window:
140
+ 123000 < column_44 < 127000 (MeV)
141
+
142
+ Save selected events to {BASE_DIR}/arrays/signal.npy
143
+
144
+ ====================
145
+ STEP 6: BACKGROUND MODELING (data only)
146
+ ====================
147
+
148
+ From preselected data events (with column_32 = 1.0):
149
+
150
+ Define categories:
151
+ - TI (tight): (column_35 == 1.0) AND (column_36 == 1.0)
152
+ - NTI (non-tight): NOT TI
153
+
154
+ Define regions:
155
+ - Signal: 123000 < column_44 < 127000
156
+ - Sideband: (105000 < column_44 < 120000) OR (130000 < column_44 < 160000)
157
+
158
+ Compute yields (sum of column_32):
159
+ Y_NTI_sideband = sum of weights for (NTI AND sideband)
160
+ Y_NTI_signal = sum of weights for (NTI AND signal)
161
+ Y_TI_sideband = sum of weights for (TI AND sideband)
162
+
163
+ Scale factors (if Y_NTI_sideband > 0):
164
+ SF1 = Y_TI_sideband / Y_NTI_sideband
165
+ SF2 = Y_NTI_signal / Y_NTI_sideband
166
+
167
+ Expected yield:
168
+ Y_expected = SF1 * SF2 * Y_NTI_sideband
169
+
170
+ Keep ONLY NTI sideband events.
171
+ Rescale their weights: column_32 = column_32 * (Y_expected / Y_NTI_sideband)
172
+
173
+ Save to {BASE_DIR}/arrays/bkgd.npy
174
+
175
+ ====================
176
+ IMPLEMENTATION NOTES
177
+ ====================
178
+
179
+ - Import ROOT at the start; raise clear error if unavailable
180
+ - Use explicit Python loops for TLorentzVector (no vectorization)
181
+ - Guard all divisions (check denominator != 0)
182
+ - Preserve all 46 columns in output files
183
+ - Use exact equality (==) for tight ID, not approximate checks
184
+
prompts/preprocess_old.txt ADDED
@@ -0,0 +1,175 @@
1
+ Your task is to write a Python script that:
2
+
3
+ 1. Loads the following two .npy files:
4
+ - {BASE_DIR}/solution/arrays/data_raw.npy (real data events)
5
+ - {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)
6
+
7
+ Each file contains a NumPy array of shape (N, 46), where each row corresponds to a physics event and each column represents a feature. Your goal is to preprocess these arrays following the steps below, and save the processed results as:
8
+
9
+ - signal.npy: selected MC signal events
10
+ - bkgd.npy: selected and rescaled background events from real data
11
+
12
+ Save both output files to: {BASE_DIR}/arrays/
13
+
14
+ Information on the column indices:
15
+
16
+ 0: leading photon pT
17
+ 1: leading photon eta
18
+ 2: leading photon phi
19
+ 3: subleading photon pT
20
+ 4: subleading photon eta
21
+ 5: subleading photon phi
22
+ 6: leading lepton pT
23
+ 7: leading lepton eta
24
+ 8: leading lepton phi
25
+ 9: subleading lepton pT
26
+ 10: subleading lepton eta
27
+ 11: subleading lepton phi
28
+ 12: jet 1 pT
29
+ 13: jet 1 eta
30
+ 14: jet 1 phi
31
+ 15: jet 2 pT
32
+ 16: jet 2 eta
33
+ 17: jet 2 phi
34
+ 18: jet 3 pT
35
+ 19: jet 3 eta
36
+ 20: jet 3 phi
37
+ 21: jet 4 pT
38
+ 22: jet 4 eta
39
+ 23: jet 4 phi
40
+ 24: jet 5 pT
41
+ 25: jet 5 eta
42
+ 26: jet 5 phi
43
+ 27: jet 6 pT
44
+ 28: jet 6 eta
45
+ 29: jet 6 phi
46
+ 30: MET ET
47
+ 31: MET phi
48
+ 32: MC weight
49
+ 33: sum of weights
50
+ 34: cross section (XSection)
51
+ 35: leading photon tight ID?
52
+ 36: subleading photon tight ID?
53
+ 37: scaleFactor_PILEUP
54
+ 38: scaleFactor_PHOTON
55
+ 39: scaleFactor_PhotonTRIGGER
56
+ 40: scaleFactor_ELE
57
+ 41: scaleFactor_MUON
58
+ 42: scaleFactor_LepTRIGGER
59
+ 43: scaleFactor_BTAG
60
+ 44: unused(NaN) (to store diphoton invariant mass)
61
+ 45: unused(NaN) (to store diphoton transverse momentum)
62
+
63
+ ---
64
+
65
+ Step 1: Load and Validate
66
+
67
+ - Load both .npy files using NumPy.
68
+ - Verify that each array has exactly 46 columns. Raise an error if not.
69
+ - Do not drop any columns β€” preserve the full (N, 46) shape.
70
+ - Update the following columns in place:
71
+ - Column 32: final event weight
72
+ - Column 34: cross section (XSection) - only for ttH process
73
+ - Column 44: diphoton invariant mass (m_yy)
74
+ - Column 45: diphoton transverse momentum (pt_yy)
75
+
76
+ ---
77
+
78
+ Step 2: MC Signal Weight Update (signal_raw.npy only)
79
+
80
+ Normalization:
81
+
82
+ - Use luminosity = 10,000 pb^{-1}.
83
+ - For each event, compute the normalization factor as:
84
+ (cross_section * luminosity) / sum_of_weights
85
+ - The values of cross_section and sum_of_weights are found in columns 34 and 33, respectively.
86
+ - Important: If the cross-section value is 2.64338632e-06 pb (corresponding to ttH SM Higgs production), replace it with 0.000116 pb (the correct SM Higgs → γγ cross-section).
87
+ - This correction should be applied only to events where the cross-section matches 2.64338632e-06 pb, and the corrected value should overwrite the original in column 34.
88
+ - Use the corrected cross-section value when computing normalization.
89
+
90
+ Scale factors:
91
+
92
+ - For each event, multiply the following scale factors:
93
+ - scaleFactor_PILEUP (column 37)
94
+ - scaleFactor_PHOTON (column 38)
95
+ - scaleFactor_PhotonTRIGGER (column 39)
96
+ - scaleFactor_ELE (column 40)
97
+ - scaleFactor_MUON (column 41)
98
+ - scaleFactor_LepTRIGGER (column 42)
99
+ - scaleFactor_BTAG (column 43)
100
+ - Remove any event where any of these scale factors is exactly zero.
101
+
102
+ Final weight:
103
+
104
+ - Compute the final event weight as:
105
+ final_weight = mcWeight * normalization * (product of all scale factors)
106
+ - Here, mcWeight is taken from column 32.
107
+ - Store the computed final weight back into column 32, replacing the original mcWeight.
108
+
109
+ ---
110
+
111
+ Step 3: Kinematic Calculations and Preselection (for both MC and data)
112
+
113
+ - For each event, compute diphoton invariant mass and transverse momentum using ROOT.TLorentzVector (do not use the vector module).
114
+ - Store the diphoton invariant mass in column 44 (m_yy).
115
+ - Store the diphoton transverse momentum in column 45 (pt_yy).
116
+
117
+ Apply the following preselection cuts to both MC and data:
118
+
119
+ - Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
120
+ - Photon transverse momentum: photon pT > 25,000 MeV (both photons)
121
+ - Leading photon: (leading photon pT / m_yy) > 0.35
122
+ - Subleading photon: (subleading photon pT / m_yy) > 0.25
123
+ - Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
124
+
125
+ ---
126
+
127
+ Step 4a: Final Signal Selection (MC only)
128
+
129
+ From the preselected MC events:
130
+
131
+ - Keep only events where both photons pass tight photon ID.
132
+ - Keep only events within the signal region: 123,000 MeV < m_yy < 127,000 MeV
133
+
134
+ Save the selected events to:
135
+
136
+ - {BASE_DIR}/arrays/signal.npy
137
+
138
+ ---
139
+
140
+ Step 4b: Background Modeling and Normalization (real data only)
141
+
142
+ Using preselected data events:
143
+
144
+ Region definitions:
145
+
146
+ - Signal region: 123,000 MeV < m_yy < 127,000 MeV
147
+ - Sideband region: 105,000 MeV < m_yy < 120,000 MeV or 130,000 MeV < m_yy < 160,000 MeV
148
+
149
+ Photon ID categories:
150
+
151
+ - TI (tight ID): both photons pass tight photon ID
152
+ - NTI (non-tight ID): photons fail tight ID but pass loose ID
153
+
154
+ Steps:
155
+
156
+ 1. Compute yields (sum of weights) for:
157
+ - NTI sideband
158
+ - NTI signal region
159
+ - TI sideband
160
+ 2. Calculate scale factors:
161
+ - SF1 = (TI sideband) / (NTI sideband)
162
+ - SF2 = (NTI signal region) / (NTI sideband)
163
+ 3. Estimate expected yield in TI signal region:
164
+ - expected_yield = SF1 * SF2 * (NTI sideband)
165
+ 4. Retain only NTI sideband events.
166
+ 5. Rescale their weights so that the total weight matches expected_yield.
167
+ 6. Save the result to:
168
+ - {BASE_DIR}/arrays/bkgd.npy
169
+
170
+ ---
171
+
172
+ Final Output Summary:
173
+
174
+ - signal.npy – MC signal events passing preselection, signal region, and tight ID cuts
175
+ - bkgd.npy – Real data events (NTI sideband) rescaled to match expected background
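Step 2 above in sketch form: per-event normalization times the seven scale factors, with zero-scale-factor events dropped first. Here `sig` is a hypothetical (N, 46) signal array; the constants are the ones quoted in the prompt:

```python
import numpy as np

LUMI = 10_000.0  # pb^-1, from the prompt

def update_mc_weights(sig):
    """Return a filtered copy with column 32 = mcWeight * norm * SF product."""
    # ttH cross-section correction (values quoted in the prompt).
    sig = sig.copy()
    tth = np.abs(sig[:, 34] - 2.64338632e-06) < 1e-10
    sig[tth, 34] = 0.000116

    # Drop events with any zero scale factor (columns 37..43), then reweight.
    sf = sig[:, 37:44]
    sig = sig[~np.any(sf == 0.0, axis=1)]
    norm = sig[:, 34] * LUMI / sig[:, 33]          # per-event normalization
    sig[:, 32] *= norm * np.prod(sig[:, 37:44], axis=1)
    return sig
```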
prompts/preprocess_old_corrupted.txt ADDED
@@ -0,0 +1,187 @@
1
+ Your task is to write a Python script that:
2
+
3
+ 1. Loads the following two .npy files:
4
+ - {BASE_DIR}/solution/arrays/Apply the following preselection cuts to both MC and data:
5
+
6
+ - Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
7
+ - Photon transverse momentum: pt_yy > 25,000 MeV (both photons)
8
+ - Leading photon: (pt_lead / m_yy) > 0.35, where pt_lead is column 0 (the leading photon pT is always stored in column 0)
9
+ - Subleading photon: (pt_sub / m_yy) > 0.25, where pt_sub is column 3 (the subleading photon pT is always stored in column 3)
10
+ - Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
11
+ - Use the safe denominator defined above for all pT/m_yy ratios so that no division by zero occurs and any event with m_yy ≤ 1e-6 (effectively zero or negative) automatically fails the ratio requirements.
12
+ - IMPORTANT: Do NOT dynamically determine which photon is leading/subleading using np.maximum or np.minimum. The input arrays are pre-ordered so column 0 is always the leading photon and column 3 is always the subleading photon..npy (real data events)
13
+ - {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)
14
+
15
+ Each file contains a NumPy array of shape (N, 46), where each row corresponds to a physics event and each column represents a feature. Your goal is to preprocess these arrays following the steps below, and save the processed results as:
16
+
17
+ - signal.npy: selected MC signal events
18
+ - bkgd.npy: selected and rescaled background events from real data
19
+
20
+ Save both output files to: {BASE_DIR}/arrays/
21
+
22
+ Information on the column indices:
23
+
24
+ 0: leading photon pT
25
+ 1: leading photon eta
26
+ 2: leading photon phi
27
+ 3: subleading photon pT
28
+ 4: subleading photon eta
29
+ 5: subleading photon phi
30
+ 6: leading lepton pT
31
+ 7: leading lepton eta
32
+ 8: leading lepton phi
33
+ 9: subleading lepton pT
34
+ 10: subleading lepton eta
35
+ 11: subleading lepton phi
36
+ 12: jet 1 pT
37
+ 13: jet 1 eta
38
+ 14: jet 1 phi
39
+ 15: jet 2 pT
40
+ 16: jet 2 eta
41
+ 17: jet 2 phi
42
+ 18: jet 3 pT
43
+ 19: jet 3 eta
44
+ 20: jet 3 phi
45
+ 21: jet 4 pT
46
+ 22: jet 4 eta
47
+ 23: jet 4 phi
48
+ 24: jet 5 pT
49
+ 25: jet 5 eta
50
+ 26: jet 5 phi
51
+ 27: jet 6 pT
52
+ 28: jet 6 eta
53
+ 29: jet 6 phi
54
+ 30: MET ET
55
+ 31: MET phi
56
+ 32: MC weight
57
+ 33: sum of weights
58
+ 34: cross section (XSection)
59
+ 35: leading photon tight ID?
60
+ 36: subleading photon tight ID?
61
+ 37: scaleFactor_PILEUP
62
+ 38: scaleFactor_PHOTON
63
+ 39: scaleFactor_PhotonTRIGGER
64
+ 40: scaleFactor_ELE
65
+ 41: scaleFactor_MUON
66
+ 42: scaleFactor_LepTRIGGER
67
+ 43: scaleFactor_BTAG
68
+ 44: unused(NaN) (to store diphoton invariant mass)
69
+ 45: unused(NaN) (to store diphoton transverse momentum)
70
+
71
+ ---
72
+
73
+ Step 1: Load and Validate
74
+
75
+ - Load both .npy files using NumPy.
76
+ - Verify that each array has exactly 46 columns. Raise an error if not.
77
+ - Do not drop any columns; preserve the full (N, 46) shape.
78
+ - Update the following columns in place:
79
+ - Column 32: final event weight
80
+ - Column 34: cross section (XSection) - only for ttH process
81
+ - Column 44: diphoton invariant mass (m_yy)
82
+ - Column 45: diphoton transverse momentum (pt_yy)
83
+
84
+ ---
85
+
86
+ Step 2: MC Signal Weight Update (signal_raw.npy only)
87
+
88
+ Normalization:
89
+
90
+ - Use luminosity = 10,000 pb^{-1}.
91
+ - For each event (row-by-row), compute the normalization factor as:
92
+ (cross_section * luminosity) / sum_of_weights
93
+ - The normalization factor is event-specific. Do not compute a single global value; apply the formula independently for every row.
94
+ - The values of cross_section and sum_of_weights are found in columns 34 and 33, respectively.
95
+ - Important: If the cross-section value satisfies np.abs(XSection - 2.64338632e-06) < 1e-10 (corresponding to ttH SM Higgs production), replace it with 0.000116 pb (the correct SM Higgs -> γγ cross-section) in column 34.
96
+ - Use the corrected cross-section value when computing normalization.
97
+
98
+ Scale factors:
99
+
100
+ - For each event, multiply the following scale factors:
101
+ - scaleFactor_PILEUP (column 37)
102
+ - scaleFactor_PHOTON (column 38)
103
+ - scaleFactor_PhotonTRIGGER (column 39)
104
+ - scaleFactor_ELE (column 40)
105
+ - scaleFactor_MUON (column 41)
106
+ - scaleFactor_LepTRIGGER (column 42)
107
+ - scaleFactor_BTAG (column 43)
108
+
109
+ Final weight:
110
+
111
+ - Compute the final event weight as:
112
+ final_weight = mcWeight * normalization * (product of all scale factors)
113
+ - Here, mcWeight is taken from column 32.
114
+ - Store the computed final weight back into column 32, replacing the original mcWeight.
115
+
116
+ ---
117
+
118
+ Step 3: Kinematic Calculations and Preselection (for both MC and data)
119
+
120
+ - For each event, compute diphoton invariant mass and transverse momentum using ROOT.TLorentzVector (do not use the vector module).
121
+ - Store the diphoton invariant mass in column 44 (m_yy).
122
+ - Store the diphoton transverse momentum in column 45 (pt_yy).
123
+ - When computing ratios that involve m_yy, create a safe denominator first. For example, define `m_yy_safe = np.where(m_yy <= 0, 1e-6, m_yy)` and use `m_yy_safe` in every division. Events that would have m_yy <= 0 must fail the subsequent ratio cuts.
124
+
125
+ Apply the following preselection cuts to both MC and data:
126
+
127
+ - Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
128
+ - Photon transverse momentum: pt_yy > 25,000 MeV (both photons)
129
+ - Leading photon: (pt_yy / m_yy) > 0.35
130
+ - Subleading photon: (pt_yy / m_yy) > 0.25
131
+ - Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
132
+ - Use the safe denominator defined above for all pT/m_yy ratios so that no division by zero occurs and any event with m_yy <= 1e-6 (effectively zero or negative) automatically fails the ratio requirements.
133
+
134
+ - After computing the diphoton variables, set all data event weights (column 32) to 1.0 before background modeling.
135
+
136
+ ---
137
+
138
+ Step 4a: Final Signal Selection (MC only)
139
+
140
+ From the preselected MC events:
141
+
142
+ - Before applying photon-ID cuts, build boolean masks for columns 35 and 36 using exact equality: `tight = (column == 1.0)`. Only values exactly equal to 1.0 pass tight ID; treat everything else (including values like 0.0, 0.5, NaNs) as `False`.
143
+ - Keep only events where both photons pass tight photon ID (both boolean flags must be True).
144
+ - Keep only events within the signal region: 123,000 MeV < m_yy < 127,000 MeV
145
+
146
+ Save the selected events to:
147
+
148
+ - {BASE_DIR}/arrays/signal.npy
149
+
150
+ ---
151
+
152
+ Step 4b: Background Modeling and Normalization (real data only)
153
+
154
+ Using preselected data events:
155
+
156
+ Region definitions:
157
+
158
+ - Signal region: 123,000 MeV < m_yy < 127,000 MeV
159
+ - Sideband region: 105,000 MeV < m_yy < 120,000 MeV or 130,000 MeV < m_yy < 160,000 MeV
160
+
161
+ Photon ID categories:
162
+
163
+ - TI (tight ID): both photons pass tight photon ID (use the boolean masks built with `(column == 1.0)`)
164
+ - NTI (non-tight ID): photons fail tight ID but pass loose ID
165
+
166
+ Steps:
167
+
168
+ 1. Compute yields (sum of weights) for:
169
+ - NTI sideband
170
+ - NTI signal region
171
+ - TI sideband
172
+ 2. Calculate scale factors:
173
+ - SF1 = (TI sideband) / (NTI sideband)
174
+ - SF2 = (NTI signal region) / (NTI sideband)
175
+ 3. Estimate expected yield in TI signal region:
176
+ - expected_yield = SF1 * SF2 * (NTI sideband)
177
+ 4. Retain only NTI sideband events.
178
+ 5. Rescale their weights so that the total weight matches expected_yield.
179
+ 6. Save the result to:
180
+ - {BASE_DIR}/arrays/bkgd.npy
181
+
182
+ ---
183
+
184
+ Final Output Summary:
185
+
186
+ - signal.npy – MC signal events passing preselection, signal region, and tight ID cuts
187
+ - bkgd.npy – Real data events (NTI sideband) rescaled to match expected background
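Two defensive details spelled out in the prompt above, the safe m_yy denominator and the exact-equality tight-ID masks, reduce to a few lines. A sketch with a hypothetical (N, 46) array `arr`:

```python
import numpy as np

def ratio_mask(arr):
    """pT/m_yy ratio cuts with a division-safe denominator, as specified above."""
    m_yy = arr[:, 44]
    m_yy_safe = np.where(m_yy <= 0, 1e-6, m_yy)    # never divide by zero
    return (arr[:, 0] / m_yy_safe > 0.35) & (arr[:, 3] / m_yy_safe > 0.25)

def tight_mask(arr):
    """Exact equality, per the prompt; 0.0, 0.5, and NaN all fail."""
    return (arr[:, 35] == 1.0) & (arr[:, 36] == 1.0)
```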
prompts/scores.txt ADDED
@@ -0,0 +1,8 @@
1
+ Your task is to compute signal/background separation scores using the provided function tabpfn() in utils.py. First make sure to include "from utils import *". DO NOT WRITE YOUR OWN tabpfn() function.
2
+
3
+ After importing the function from utils, it can be used as follows:
4
+ signal_scores, bkgd_scores = tabpfn(signal_arr, bkgd_arr, batch_size=batch_size, test_size=test_size)
5
+
6
+ You should read in the signal and background arrays from the directory '{BASE_DIR}/solution/arrays/signal.npy' and '{BASE_DIR}/solution/arrays/bkgd.npy'. Set the batch size to 20,000 and the test size to 0.5.
7
+
8
+ The scores should be saved to the directory '{BASE_DIR}/arrays/' with the names 'signal_scores.npy' and 'bkgd_scores.npy'.
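A sketch of the scoring script this prompt asks for. It assumes `utils.tabpfn` exists with exactly the signature quoted above, and leaves `{BASE_DIR}` as a placeholder:

```python
import numpy as np
from utils import *  # provides tabpfn(), per the prompt; do not re-implement it

BASE_DIR = "."  # placeholder for the templated {BASE_DIR}

signal_arr = np.load(f"{BASE_DIR}/solution/arrays/signal.npy")
bkgd_arr = np.load(f"{BASE_DIR}/solution/arrays/bkgd.npy")

signal_scores, bkgd_scores = tabpfn(
    signal_arr, bkgd_arr, batch_size=20_000, test_size=0.5
)

np.save(f"{BASE_DIR}/arrays/signal_scores.npy", signal_scores)
np.save(f"{BASE_DIR}/arrays/bkgd_scores.npy", bkgd_scores)
```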
prompts/summarize_root.txt ADDED
@@ -0,0 +1,4 @@
1
+ Your task is to write a Python script that writes two txt files summarizing the ROOT files in '/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/'.
2
+ Both txt files should be saved to '{BASE_DIR}/logs/'.
3
+ The first file, file_list.txt, should contain an alphabetized list of file paths to all ROOT files in the data directory.
4
+ The second file, root_summary.txt, should contain a description of the tree and branch names found in one of the ROOT files.
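A sketch of what this summarizer might look like with uproot. The directory and output names are the ones given above; treating every key in the first file as a tree is an assumption:

```python
import glob
import os
import uproot

DATA_DIR = "/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/"
LOG_DIR = "logs"  # stands in for the templated {BASE_DIR}/logs/

paths = sorted(glob.glob(os.path.join(DATA_DIR, "**", "*.root"), recursive=True))
os.makedirs(LOG_DIR, exist_ok=True)

with open(os.path.join(LOG_DIR, "file_list.txt"), "w") as fh:
    fh.write("\n".join(paths) + "\n")

# Describe the trees and branches of one representative file.
with uproot.open(paths[0]) as rf, \
        open(os.path.join(LOG_DIR, "root_summary.txt"), "w") as out:
    for tree_name in rf.keys():          # e.g. "mini;1" in these datasets
        out.write(f"Tree: {tree_name}\n")
        for branch in rf[tree_name].keys():
            out.write(f"  {branch}\n")
```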
prompts/supervisor_call.txt ADDED
@@ -0,0 +1,11 @@
1
+ Your task is to write a prompt for another API call ("call to worker agent") that will address the user's prompt (see below). The API call for which you are writing the prompt should always return Python code. The code needs to be standalone; that is, running the script should address the user's prompt without a human needing to do anything. Also note that previous versions of the code will not be saved, so the worker should not replace working code with a script that only addresses part of the user's prompt.
2
+
3
+ After the worker prompt, write "Call record: " followed by a description of the current status and what you have asked the worker to do. This will be used to keep track of the progress made so far in future API calls. The existing record is shown under "Existing Record: ". See below for the code produced by the previous API call and the command line output (if any) obtained by running that code. When you believe the user's prompt has been addressed, return the string "Supervisor is satisfied with current results".
4
+
5
+ User prompt:
6
+
7
+ Generated code:
8
+
9
+ Command line output:
10
+
11
+ Existing record:
prompts/supervisor_first_call.txt ADDED
@@ -0,0 +1,5 @@
1
+ Your task is to write a prompt for another API call ("call to worker agent") that will address the user's prompt (see below). The API call for which you are writing the prompt should always return Python code.
2
+
3
+ After the worker prompt, write "Call record: " followed by a description of your plan and what you have asked the worker to do. This will be used to keep track of the progress made so far in future API calls. Based on this record it should be clear what has already been done and what still needs to be done.
4
+
5
+ User prompt:
run_smk_sequential.sh ADDED
@@ -0,0 +1,329 @@
1
+ #!/bin/bash
2
+ #
3
+ # run_smk_sequential.sh - Run Snakemake workflows one at a time for debugging
4
+ #
5
+ # This script runs each Snakemake workflow sequentially to observe
6
+ # the behavior of prompt scripts, supervisor, and coder in real time.
7
+ #
8
+ # Usage:
9
+ # ./run_smk_sequential.sh # Run all steps
10
+ # ./run_smk_sequential.sh --step1 # Run summarize_root (both rules)
11
+ # ./run_smk_sequential.sh --step2 # Run create_numpy
12
+ # ./run_smk_sequential.sh --step3 # Run preprocess
13
+ # ./run_smk_sequential.sh --step4 # Run scores
14
+ # ./run_smk_sequential.sh --step5 # Run categorization
15
+ # ./run_smk_sequential.sh --step1 --step3 # Run summarize_root + preprocess
16
+ #
17
+
18
+ if [ -f ~/.apikeys.sh ]; then
19
+ source ~/.apikeys.sh
20
+ fi
21
+
22
+ # Parse command line arguments
23
+ RUN_STEP1=false
24
+ RUN_STEP2=false
25
+ RUN_STEP3=false
26
+ RUN_STEP4=false
27
+ RUN_STEP5=false
28
+ VALIDATE_STEPS=false
29
+ OUTPUT_DIR="results"
30
+ CONFIG="config.yml"
31
+
32
+ # Remember the project root where this script is invoked
33
+ PROJECT_ROOT="$(pwd)"
34
+
35
+
36
+ while [[ $# -gt 0 ]]; do
37
+ case $1 in
38
+ --step1)
39
+ RUN_STEP1=true
40
+ shift
41
+ ;;
42
+ --step2)
43
+ RUN_STEP2=true
44
+ shift
45
+ ;;
46
+ --step3)
47
+ RUN_STEP3=true
48
+ shift
49
+ ;;
50
+ --step4)
51
+ RUN_STEP4=true
52
+ shift
53
+ ;;
54
+ --step5)
55
+ RUN_STEP5=true
56
+ shift
57
+ ;;
58
+ --validate)
59
+ VALIDATE_STEPS=true
60
+ shift
61
+ ;;
62
+ --out-dir)
63
+ OUTPUT_DIR="$2"
64
+ shift
65
+ shift
66
+ ;;
67
+ --job-id)
68
+ # Create unique directory based on job ID
69
+ OUTPUT_DIR="results_job_$2"
70
+ shift
71
+ shift
72
+ ;;
73
+ --auto-dir)
74
+ # Create unique directory with timestamp
75
+ TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
76
+ OUTPUT_DIR="results_${TIMESTAMP}"
77
+ shift
78
+ ;;
79
+ --config)
80
+ CONFIG="$2"
81
+ shift
82
+ shift
83
+ ;;
84
+ --help|-h)
85
+ echo "Usage: $0 [OPTIONS]"
86
+ echo ""
87
+ echo "Run Snakemake workflows for ATLAS analysis"
88
+ echo ""
89
+ echo "Options:"
90
+ echo " --step1 Run summarize_root workflow (both rules: data generation + prompt processing)"
91
+ echo " --step2 Run create_numpy workflow"
92
+ echo " --step3 Run preprocess workflow"
93
+ echo " --step4 Run scores workflow"
94
+ echo " --step5 Run categorization workflow"
95
+ echo " --validate Run validation after each successful step"
96
+ echo " --out-dir DIR Custom output directory (default: results)"
97
+ echo " --job-id ID Create unique directory: results_job_ID"
98
+ echo " --auto-dir Create unique directory with timestamp: results_YYYYMMDD_HHMMSS"
99
+ echo " --help Show this help message"
100
+ echo ""
101
+ echo "Examples:"
102
+ echo " $0 --step1 --auto-dir # results_20250916_143052/"
103
+ echo " $0 --step1 --job-id 12345 # results_job_12345/"
104
+ echo " $0 --step1 --out-dir my_run_1 # my_run_1/"
105
+ echo ""
106
+ echo "If no options are provided, all steps are run sequentially."
107
+ exit 0
108
+ ;;
109
+ *)
110
+ echo "Unknown option: $1"
111
+ echo "Use --help for usage information"
112
+ exit 1
113
+ ;;
114
+ esac
115
+ done
116
+
117
+ # If no specific steps requested, run all
118
+ if [[ "$RUN_STEP1" == "false" && "$RUN_STEP2" == "false" && "$RUN_STEP3" == "false" && "$RUN_STEP4" == "false" && "$RUN_STEP5" == "false" ]]; then
119
+ RUN_STEP1=true
120
+ RUN_STEP2=true
121
+ RUN_STEP3=true
122
+ RUN_STEP4=true
123
+ RUN_STEP5=true
124
+ echo "=== Running All Snakemake Workflows Sequentially (Output: ${OUTPUT_DIR}) ==="
125
+ else
126
+ echo "=== Running Selected Snakemake Workflows (Output: ${OUTPUT_DIR}) ==="
127
+ fi
128
+ echo ""
129
+
130
+ # Set up environment
131
+ module load python
132
+ conda activate llm_env
133
+
134
+ # Resolve config file to an absolute path so Snakemake can always find it
135
+ if [[ "${CONFIG}" = /* ]]; then
136
+ CONFIG_PATH="${CONFIG}"
137
+ else
138
+ CONFIG_PATH="${PROJECT_ROOT}/${CONFIG}"
139
+ fi
140
+
141
+ if [[ ! -f "${CONFIG_PATH}" ]]; then
142
+ echo "❌ Config file not found at ${CONFIG_PATH}"
143
+ exit 1
144
+ fi
145
+
146
+ # Copy and prepare workflow files
147
+
148
+ OUTPUT_DIR="${OUTPUT_DIR%/}"
149
+ if [[ "${OUTPUT_DIR}" = /* ]]; then
150
+ BASE_DIR="${OUTPUT_DIR}"
151
+ else
152
+ BASE_DIR="$PWD/${OUTPUT_DIR}"
153
+ fi
154
+
155
+ echo "Preparing workflow files..."
156
+ mkdir -p ${OUTPUT_DIR}/prompts_temp
157
+ cp -r prompts/* ${OUTPUT_DIR}/prompts_temp/
158
+ sed -i "s#{BASE_DIR}#${BASE_DIR}#g" ${OUTPUT_DIR}/prompts_temp/*.txt
159
+
160
+ cp workflow/summarize_root.smk ${OUTPUT_DIR}/summarize_root_temp.smk
161
+ cp workflow/create_numpy.smk ${OUTPUT_DIR}/create_numpy_temp.smk
162
+ cp workflow/preprocess.smk ${OUTPUT_DIR}/preprocess_temp.smk
163
+ cp workflow/scores.smk ${OUTPUT_DIR}/scores_temp.smk
164
+ cp workflow/categorization.smk ${OUTPUT_DIR}/categorization_temp.smk
165
+ cp supervisor_coder.py ${OUTPUT_DIR}/supervisor_coder.py
166
+ cp write_prompt.py ${OUTPUT_DIR}/write_prompt.py
167
+ cp check_soln.py ${OUTPUT_DIR}/check_soln.py
168
+
169
+ sed -i "s#{BASE_DIR}#${BASE_DIR}#g" ${OUTPUT_DIR}/*_temp.smk
170
+ # Replace {CONFIG} in temp snakemake files with the absolute path to the project's config
171
+ sed -i "s#{CONFIG}#${CONFIG_PATH}#g" ${OUTPUT_DIR}/*_temp.smk
172
+
173
+ # Copy solutions for validation
174
+ echo "Copying reference solution arrays for validation..."
175
+ mkdir -p ${OUTPUT_DIR}/solution/arrays
176
+ # Remove any existing files first to avoid permission issues
177
+ rm -f ${OUTPUT_DIR}/solution/arrays/*
178
+ cp solution/arrays/* ${OUTPUT_DIR}/solution/arrays/
179
+
180
+ # Create output directory
181
+ mkdir -p ${OUTPUT_DIR}/generated_code
182
+ mkdir -p ${OUTPUT_DIR}/logs
183
+ cp utils.py ${OUTPUT_DIR}/generated_code/utils.py
184
+
185
+ # Clean up any existing numpy files (store metrics under logs)
186
+ rm -f ${OUTPUT_DIR}/logs/success.npy ${OUTPUT_DIR}/logs/calls.npy ${OUTPUT_DIR}/logs/input_tokens.npy ${OUTPUT_DIR}/logs/output_tokens.npy
187
+
188
+ echo "Starting sequential execution..."
189
+ echo ""
190
+
191
+ # Function to run a single workflow
192
+ run_workflow() {
193
+ local workflow_name=$1
194
+ local smk_file=$2
195
+ local target=$3
196
+ local step_number=$4
197
+
198
+ echo "========================================="
199
+ echo "Running: $workflow_name"
200
+ echo "Target: $target"
201
+ echo "Time: $(date)"
202
+ echo "========================================="
203
+
204
+ # cd into OUTPUT_DIR and do all the work there
205
+ if ! pushd "$OUTPUT_DIR" > /dev/null; then
206
+ echo "❌ Failed to cd into $OUTPUT_DIR"
207
+ return 1
208
+ fi
209
+
210
+ # Print the command that will be executed (run inside ${OUTPUT_DIR})
211
+ # Commented out original with --stats, kept for reference
212
+ # echo "Command: snakemake -s \"$smk_file\" -j 1 --forcerun \"$target\" --rerun-incomplete --configfile \"${CONFIG}\" --latency-wait 120 --verbose --stats logs/${workflow_name}.stats > logs/${workflow_name}.log 2>&1"
213
+ echo "Command: snakemake -s \"$smk_file\" -j 1 --forcerun \"$target\" --rerun-incomplete --configfile \"${CONFIG}\" --latency-wait 120 --verbose > logs/${workflow_name}.log 2>&1"
214
+ echo ""
215
+
216
+ local start_time=$SECONDS
217
+
218
+ # Run snakemake from inside the output directory. Use BASE_DIR for the config file
219
+ # so Snakemake can find the main config.yml even when cwd is the job folder.
220
+ # Original Snakemake run with --stats (commented out)
221
+ # if snakemake -s "$smk_file" -j 1 --forcerun "$target" --rerun-incomplete --configfile "${CONFIG}" --latency-wait 120 --verbose --stats "logs/${workflow_name}.stats" > "logs/${workflow_name}.log" 2>&1; then
222
+ if snakemake -s "$smk_file" -j 1 --forcerun "$target" --rerun-incomplete --configfile "${CONFIG_PATH}" --latency-wait 120 --verbose > "logs/${workflow_name}.log" 2>&1; then
223
+ local duration=$((SECONDS - start_time))
224
+ echo ""
225
+ echo "βœ… $workflow_name completed successfully in ${duration}s"
226
+ echo ""
227
+
228
+ # Run validation for this step if it completed successfully
229
+ if [[ "$VALIDATE_STEPS" == "true" ]]; then
230
+ echo "Running validation for Step $step_number..."
231
+ if python check_soln.py --out_dir "${BASE_DIR}" --step $step_number >> "logs/${workflow_name}_validation.log" 2>&1; then
232
+ echo "βœ… Step $step_number validation completed"
233
+ # Check if validation passed
234
+ if [[ -f "${OUTPUT_DIR}/logs/success.npy" ]]; then
235
+ validation_result=$(python -c "import numpy as np; print(np.load('${OUTPUT_DIR}/logs/success.npy')[$step_number-1])")
236
+ if [[ "$validation_result" == "1" ]]; then
237
+ echo "βœ… Step $step_number validation: PASS"
238
+ else
239
+ echo "❌ Step $step_number validation: FAIL"
240
+ fi
241
+ fi
242
+ else
243
+ echo "❌ Step $step_number validation failed to run"
244
+ fi
245
+ echo ""
246
+ fi
247
+ popd > /dev/null
248
+ return 0
249
+ else
250
+ local duration=$((SECONDS - start_time))
251
+ echo ""
252
+ echo "❌ $workflow_name failed after ${duration}s"
253
+ echo ""
254
+ popd > /dev/null
255
+ return 1
256
+ fi
257
+ }
258
+
259
+ # Run workflows sequentially based on flags
260
+ step_counter=1
261
+
262
+ if [[ "$RUN_STEP1" == "true" ]]; then
263
+ echo "$step_counter. Running summarize_root workflow (both rules)..."
264
+ # Run both rules: first summarize_root, then insert_root_summary
265
+ run_workflow "summarize_root" "summarize_root_temp.smk" "summarize_root" 1
266
+ run_workflow "insert_root_summary" "summarize_root_temp.smk" "insert_root_summary" 1
267
+ ((step_counter++))
268
+ fi
269
+
270
+ if [[ "$RUN_STEP2" == "true" ]]; then
271
+ echo "$step_counter. Running create_numpy workflow..."
272
+ run_workflow "create_numpy" "create_numpy_temp.smk" "create_numpy" 2
273
+ ((step_counter++))
274
+ fi
275
+
276
+ if [[ "$RUN_STEP3" == "true" ]]; then
277
+ echo "$step_counter. Running preprocess workflow..."
278
+ run_workflow "preprocess" "preprocess_temp.smk" "preprocess" 3
279
+ ((step_counter++))
280
+ fi
281
+
282
+ if [[ "$RUN_STEP4" == "true" ]]; then
283
+ echo "$step_counter. Running scores workflow..."
284
+ run_workflow "scores" "scores_temp.smk" "scores" 4
285
+ ((step_counter++))
286
+ fi
287
+
288
+ if [[ "$RUN_STEP5" == "true" ]]; then
289
+ echo "$step_counter. Running categorization workflow..."
290
+ run_workflow "categorization" "categorization_temp.smk" "categorization" 5
291
+ ((step_counter++))
292
+ fi
293
+
294
+ echo ""
295
+ echo "=== Sequential Execution Complete ==="
296
+ echo "Check ${OUTPUT_DIR}/ for output files"
297
+ echo "Check ${OUTPUT_DIR}/logs/*.log files for detailed logs"
298
+ if [[ "$VALIDATE_STEPS" == "true" ]]; then
299
+ echo "Check ${OUTPUT_DIR}/logs/*_validation.log files for validation results"
300
+ fi
301
+
302
+ # Optional: Run final comprehensive validation (only if all steps were run)
303
+ if [[ "$RUN_STEP1" == "true" && "$RUN_STEP2" == "true" && "$RUN_STEP3" == "true" && "$RUN_STEP4" == "true" && "$RUN_STEP5" == "true" ]]; then
304
+ echo ""
305
+ if [[ "$VALIDATE_STEPS" == "false" ]]; then
306
+ read -p "Run final comprehensive validation? (y/n): " -n 1 -r
307
+ echo ""
308
+ if [[ $REPLY =~ ^[Yy]$ ]]; then
309
+ echo "Running final comprehensive validation..."
310
+ python check_soln.py --out_dir ${OUTPUT_DIR}
311
+ fi
312
+ else
313
+ echo "Running final comprehensive validation..."
314
+ python check_soln.py --out_dir ${OUTPUT_DIR}
315
+ fi
316
+ else
317
+ echo ""
318
+ echo "Note: Final comprehensive validation skipped (not all steps were run)"
319
+ fi
320
+
321
+ # Clean up
322
+ echo ""
323
+ # echo "Cleaning up temporary files..."
324
+ # Comment out the next line to keep prompts_temp for inspection
325
+ # rm -rf prompts_temp
326
+ # rm -f *_temp.smk
327
+ # rm -rf .snakemake # Clean up Snakemake's default log directory
328
+
329
+ echo -e "Done!\n"