Add persistent leaderboard feature - solves GPU timeout issue
MAJOR FEATURE: Results now persist across sessions, enabling incremental model evaluation without hitting 60s GPU timeouts.
Features:
- Persistent results stored in persistent_results.json
- Automatic merge with existing results (keeps best score per model)
- Leaderboard loads on startup and displays historical results
- Clear All Results button to reset leaderboard
- New runs merge seamlessly with previous evaluations
Benefits:
✅ Run 1-2 models at a time without timeouts
✅ Build comprehensive leaderboard incrementally
✅ Perfect for ZeroGPU free tier (60s limit)
✅ Best score per model automatically retained
✅ No need to run all models in one session
UI Changes:
- 'Persistent Leaderboard' header with explanation
- Clear Results button with status message
- Leaderboard auto-loads on app startup
- Results update live after each model
This elegantly solves the timeout issue by allowing users to evaluate
the full model suite across multiple sessions instead of forcing all
models into one 60-second window.
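
As an illustration of what "best score per model automatically retained" means at the storage level: `merge_results()` in the patch below keys entries on `'model'` and compares `'overall_accuracy'`, so `persistent_results.json` ends up holding one entry per model across sessions. A purely illustrative sketch follows; the model names and scores are invented, and real entries carry whatever full result dict `evaluate_single_model` produces.

```python
# Illustrative shape of persistent_results.json after two sessions.
# Only the 'model' and 'overall_accuracy' keys are read by merge_results();
# real entries contain the full result dict from evaluate_single_model.
illustrative_leaderboard = [
    {"model": "org/model-b", "overall_accuracy": 0.71},  # evaluated in session 2
    {"model": "org/model-a", "overall_accuracy": 0.62},  # best of two runs of model-a
]
```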
@@ -119,11 +119,30 @@ Falcon, OpenChat, OpenHermes
 
 ## Usage
 
+### 💾 Persistent Leaderboard Feature
+
+**NEW:** Results now persist across sessions! This solves the GPU timeout issue:
+
+- Run models **one at a time** to avoid timeouts
+- Each run merges with previous results
+- Best score per model is automatically kept
+- Build a comprehensive leaderboard incrementally
+- Perfect for the 60-second free tier limit
+
+**Workflow:**
+1. Select 1-2 models and run evaluation
+2. Results automatically save and merge with leaderboard
+3. Select different models and run again
+4. Leaderboard updates with all results
+5. Use "Clear All Results" button to start fresh
+
+### Standard Usage
+
 1. **Select Models:** Use checkboxes or quick selection buttons
 2. **Configure Settings:** Adjust sample size, quantisation, temperature
 3. **Run Evaluation:** Click "🚀 Run Evaluation"
 4. **Monitor Progress:** Watch real-time progress and intermediate results
-5. **Analyse Results:** Review leaderboard, charts, and category breakdowns
+5. **Analyse Results:** Review persistent leaderboard, charts, and category breakdowns
 6. **Download:** Export results for further analysis
 
 ## Dataset
@@ -75,6 +75,83 @@ ALL_MODELS = [model for category in MODELS_BY_CATEGORY.values() for model in cat
 # Global state
 current_results = []
 dataset_cache = None
+PERSISTENT_RESULTS_FILE = "persistent_results.json"
+
+
+def load_persistent_results():
+    """Load persistent results from disk"""
+    if Path(PERSISTENT_RESULTS_FILE).exists():
+        try:
+            with open(PERSISTENT_RESULTS_FILE, 'r') as f:
+                return json.load(f)
+        except Exception as e:
+            print(f"Error loading persistent results: {e}")
+            return []
+    return []
+
+
+def save_persistent_results(results):
+    """Save results to persistent storage"""
+    try:
+        with open(PERSISTENT_RESULTS_FILE, 'w') as f:
+            json.dump(results, f, indent=2)
+    except Exception as e:
+        print(f"Error saving persistent results: {e}")
+
+
+def merge_results(existing_results, new_results):
+    """Merge new results with existing, keeping best score per model"""
+    # Create dict of existing results keyed by model name
+    results_dict = {r['model']: r for r in existing_results}
+
+    # Update with new results (keep best accuracy)
+    for new_result in new_results:
+        model_name = new_result['model']
+        if model_name in results_dict:
+            # Keep result with higher accuracy
+            existing_acc = results_dict[model_name].get('overall_accuracy', 0)
+            new_acc = new_result.get('overall_accuracy', 0)
+            if new_acc > existing_acc:
+                results_dict[model_name] = new_result
+        else:
+            results_dict[model_name] = new_result
+
+    # Convert back to list and sort by accuracy
+    merged = list(results_dict.values())
+    merged.sort(key=lambda x: x.get('overall_accuracy', 0), reverse=True)
+    return merged
+
+
+def clear_persistent_results():
+    """Clear all persistent results"""
+    try:
+        if Path(PERSISTENT_RESULTS_FILE).exists():
+            Path(PERSISTENT_RESULTS_FILE).unlink()
+        # Return empty displays
+        return (
+            "✅ Persistent results cleared!",
+            pd.DataFrame(),
+            None,
+            None
+        )
+    except Exception as e:
+        return (
+            f"❌ Error clearing results: {e}",
+            pd.DataFrame(),
+            None,
+            None
+        )
+
+
+def load_initial_leaderboard():
+    """Load and display persistent leaderboard on startup"""
+    persistent_results = load_persistent_results()
+    if persistent_results:
+        table = format_results_table(persistent_results)
+        chart = create_comparison_chart(persistent_results)
+        download = create_download_data(persistent_results)
+        return table, chart, download
+    return pd.DataFrame(), None, None
 
 
 def load_benchmark_dataset(subset="australian", num_samples=200):
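
A small usage sketch of the helpers above, to make the keep-best merge concrete. It assumes these functions are in scope (for example, the same module as the patch, with `json`, `Path`, and `pandas` already imported as the app does); the result dicts are invented and only carry the keys `merge_results()` actually reads.

```python
previous = [{"model": "org/model-a", "overall_accuracy": 0.62}]   # from an earlier session
latest = [
    {"model": "org/model-a", "overall_accuracy": 0.58},  # worse re-run: dropped by the merge
    {"model": "org/model-b", "overall_accuracy": 0.71},  # newly evaluated model: added
]

merged = merge_results(previous, latest)
save_persistent_results(merged)                  # writes persistent_results.json
print([r["model"] for r in load_persistent_results()])
# ['org/model-b', 'org/model-a']  (one entry per model, best score kept, sorted by accuracy)
```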
@@ -401,24 +478,34 @@ def run_evaluation(selected_models, num_samples, use_4bit, temperature, max_toke
     if not selected_models:
         return "Please select at least one model to evaluate.", None, None
 
+    # Load existing persistent results
+    persistent_results = load_persistent_results()
+
     # Load dataset
     progress(0, desc="Loading AusCyberBench dataset...")
     tasks = load_benchmark_dataset(num_samples=num_samples)
 
     # Evaluate each model
+    new_results = []
     for i, model_name in enumerate(selected_models):
         progress((i / len(selected_models)), desc=f"Model {i+1}/{len(selected_models)}")
 
         result = evaluate_single_model(
             model_name, tasks, use_4bit, temperature, max_tokens, progress
         )
+        new_results.append(result)
+
+        # Merge with persistent results after each model
+        current_results = merge_results(persistent_results, new_results)
+        save_persistent_results(current_results)
 
-        # Yield intermediate results
+        # Yield intermediate results (showing full leaderboard including historical)
         yield format_results_table(current_results), create_comparison_chart(current_results), None
 
-    # Final results
+    # Final results (merged with historical)
+    current_results = merge_results(persistent_results, new_results)
+    save_persistent_results(current_results)
+
     final_table = format_results_table(current_results)
     final_chart = create_comparison_chart(current_results)
     download_data = create_download_data(current_results)
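
The reason the merge-and-save happens inside the loop rather than only at the end: on the ZeroGPU free tier a run can be cut off after 60 seconds, and checkpointing per model means everything finished so far is already on disk. A stripped-down sketch of that pattern, assuming the persistence helpers from the earlier hunk are in scope; the model IDs and the result dict are stand-ins for `evaluate_single_model()`.

```python
persistent_results = load_persistent_results()
new_results = []

for model_name in ["org/model-a", "org/model-b"]:             # hypothetical model IDs
    result = {"model": model_name, "overall_accuracy": 0.5}   # stand-in for evaluate_single_model()
    new_results.append(result)

    # Checkpoint after every model: if the GPU window expires on a later
    # iteration, only the model currently being evaluated is lost.
    save_persistent_results(merge_results(persistent_results, new_results))
```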
@@ -595,7 +682,17 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
                 run_btn = gr.Button("🚀 Run Evaluation", variant="primary", size="lg")
 
             with gr.Column(scale=2):
-                gr.Markdown("### 🏆
+                gr.Markdown("### 🏆 Persistent Leaderboard")
+                gr.Markdown("""
+                **💾 Results persist across sessions!** Run models one at a time to build up a complete leaderboard.
+
+                - New runs merge with existing results
+                - Best score per model is kept
+                - Perfect for avoiding GPU timeouts
+                """)
+
+                clear_status = gr.Markdown("")
+                clear_btn = gr.Button("🗑️ Clear All Results", size="sm", variant="stop")
 
                 results_table = gr.Dataframe(
                     label="Leaderboard",
@@ -659,6 +756,18 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
         outputs=[results_table, comparison_plot, download_file]
     )
 
+    # Clear results button
+    clear_btn.click(
+        fn=clear_persistent_results,
+        outputs=[clear_status, results_table, comparison_plot, download_file]
+    )
+
+    # Load persistent leaderboard on startup
+    app.load(
+        fn=load_initial_leaderboard,
+        outputs=[results_table, comparison_plot, download_file]
+    )
+
     gr.Markdown("""
     ---
     **Dataset:** [Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench) • 13,449 tasks |
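
For readers less familiar with Gradio, the wiring in the last two hunks relies on two standard `Blocks` mechanisms: an event handler's return values map positionally onto its `outputs` list, and `Blocks.load()` runs a function when the page loads. Below is a self-contained sketch of the same pattern with shortened, hypothetical component names; it is not the app's actual layout.

```python
import gradio as gr

def clear_all():
    # Return values map positionally onto the `outputs` list of the click handler.
    return "✅ Cleared!", None, None, None

def load_board():
    # Runs once when the page loads; here it simply leaves the components empty.
    return None, None, None

with gr.Blocks() as demo:
    status = gr.Markdown("")
    table = gr.Dataframe(label="Leaderboard")
    plot = gr.Plot()
    file_out = gr.File()
    clear = gr.Button("🗑️ Clear All Results", variant="stop")

    clear.click(fn=clear_all, outputs=[status, table, plot, file_out])
    demo.load(fn=load_board, outputs=[table, plot, file_out])

# demo.launch()  # uncomment to try it locally
```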