Zen0 committed on
Commit
2338c46
·
1 Parent(s): f8a48f3

Add persistent leaderboard feature - solves GPU timeout issue


MAJOR FEATURE: Results now persist across sessions, enabling incremental
model evaluation without hitting 60s GPU timeouts.

Features:
- Persistent results stored in persistent_results.json
- Automatic merge with existing results (keeps best score per model)
- Leaderboard loads on startup and displays historical results
- Clear All Results button to reset leaderboard
- New runs merge seamlessly with previous evaluations

Benefits:
✅ Run 1-2 models at a time without timeouts
✅ Build comprehensive leaderboard incrementally
✅ Perfect for ZeroGPU free tier (60s limit)
✅ Best score per model automatically retained
✅ No need to run all models in one session

UI Changes:
- 'Persistent Leaderboard' header with explanation
- Clear Results button with status message
- Leaderboard auto-loads on app startup
- Results update live after each model

This elegantly solves the timeout issue by allowing users to evaluate
the full model suite across multiple sessions instead of forcing all
models into one 60-second window.
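
For illustration, here is a minimal sketch of the keep-best-score merge, calling the `merge_results` helper added in app.py (see the diff below). The model names and accuracy values are hypothetical, and real result dicts carry more fields than the two shown here:

```python
# Hypothetical inputs -- model names and accuracies are illustrative only.
existing = [
    {"model": "org/model-a", "overall_accuracy": 0.62},  # saved from an earlier session
]
new = [
    {"model": "org/model-a", "overall_accuracy": 0.58},  # lower than the saved score, so it is dropped
    {"model": "org/model-b", "overall_accuracy": 0.71},  # not seen before, so it is added
]

merged = merge_results(existing, new)
# merged keeps the best score per model and sorts by accuracy, descending:
# [{"model": "org/model-b", "overall_accuracy": 0.71},
#  {"model": "org/model-a", "overall_accuracy": 0.62}]
```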

Files changed (2)
  1. README.md +20 -1
  2. app.py +114 -5
README.md CHANGED
@@ -119,11 +119,30 @@ Falcon, OpenChat, OpenHermes
 
 ## Usage
 
+### 💾 Persistent Leaderboard Feature
+
+**NEW:** Results now persist across sessions! This solves the GPU timeout issue:
+
+- Run models **one at a time** to avoid timeouts
+- Each run merges with previous results
+- Best score per model is automatically kept
+- Build a comprehensive leaderboard incrementally
+- Perfect for the 60-second free tier limit
+
+**Workflow:**
+1. Select 1-2 models and run evaluation
+2. Results automatically save and merge with leaderboard
+3. Select different models and run again
+4. Leaderboard updates with all results
+5. Use "Clear All Results" button to start fresh
+
+### Standard Usage
+
 1. **Select Models:** Use checkboxes or quick selection buttons
 2. **Configure Settings:** Adjust sample size, quantisation, temperature
 3. **Run Evaluation:** Click "🚀 Run Evaluation"
 4. **Monitor Progress:** Watch real-time progress and intermediate results
-5. **Analyse Results:** Review leaderboard, charts, and category breakdowns
+5. **Analyse Results:** Review persistent leaderboard, charts, and category breakdowns
 6. **Download:** Export results for further analysis
 
 ## Dataset
app.py CHANGED
@@ -75,6 +75,83 @@ ALL_MODELS = [model for category in MODELS_BY_CATEGORY.values() for model in cat
 # Global state
 current_results = []
 dataset_cache = None
+PERSISTENT_RESULTS_FILE = "persistent_results.json"
+
+
+def load_persistent_results():
+    """Load persistent results from disk"""
+    if Path(PERSISTENT_RESULTS_FILE).exists():
+        try:
+            with open(PERSISTENT_RESULTS_FILE, 'r') as f:
+                return json.load(f)
+        except Exception as e:
+            print(f"Error loading persistent results: {e}")
+            return []
+    return []
+
+
+def save_persistent_results(results):
+    """Save results to persistent storage"""
+    try:
+        with open(PERSISTENT_RESULTS_FILE, 'w') as f:
+            json.dump(results, f, indent=2)
+    except Exception as e:
+        print(f"Error saving persistent results: {e}")
+
+
+def merge_results(existing_results, new_results):
+    """Merge new results with existing, keeping best score per model"""
+    # Create dict of existing results keyed by model name
+    results_dict = {r['model']: r for r in existing_results}
+
+    # Update with new results (keep best accuracy)
+    for new_result in new_results:
+        model_name = new_result['model']
+        if model_name in results_dict:
+            # Keep result with higher accuracy
+            existing_acc = results_dict[model_name].get('overall_accuracy', 0)
+            new_acc = new_result.get('overall_accuracy', 0)
+            if new_acc > existing_acc:
+                results_dict[model_name] = new_result
+        else:
+            results_dict[model_name] = new_result
+
+    # Convert back to list and sort by accuracy
+    merged = list(results_dict.values())
+    merged.sort(key=lambda x: x.get('overall_accuracy', 0), reverse=True)
+    return merged
+
+
+def clear_persistent_results():
+    """Clear all persistent results"""
+    try:
+        if Path(PERSISTENT_RESULTS_FILE).exists():
+            Path(PERSISTENT_RESULTS_FILE).unlink()
+        # Return empty displays
+        return (
+            "✅ Persistent results cleared!",
+            pd.DataFrame(),
+            None,
+            None
+        )
+    except Exception as e:
+        return (
+            f"❌ Error clearing results: {e}",
+            pd.DataFrame(),
+            None,
+            None
+        )
+
+
+def load_initial_leaderboard():
+    """Load and display persistent leaderboard on startup"""
+    persistent_results = load_persistent_results()
+    if persistent_results:
+        table = format_results_table(persistent_results)
+        chart = create_comparison_chart(persistent_results)
+        download = create_download_data(persistent_results)
+        return table, chart, download
+    return pd.DataFrame(), None, None
 
 
 def load_benchmark_dataset(subset="australian", num_samples=200):
@@ -401,24 +478,34 @@ def run_evaluation(selected_models, num_samples, use_4bit, temperature, max_toke
     if not selected_models:
         return "Please select at least one model to evaluate.", None, None
 
+    # Load existing persistent results
+    persistent_results = load_persistent_results()
+
     # Load dataset
     progress(0, desc="Loading AusCyberBench dataset...")
     tasks = load_benchmark_dataset(num_samples=num_samples)
 
     # Evaluate each model
-    current_results = []
+    new_results = []
     for i, model_name in enumerate(selected_models):
         progress((i / len(selected_models)), desc=f"Model {i+1}/{len(selected_models)}")
 
         result = evaluate_single_model(
             model_name, tasks, use_4bit, temperature, max_tokens, progress
         )
-        current_results.append(result)
+        new_results.append(result)
+
+        # Merge with persistent results after each model
+        current_results = merge_results(persistent_results, new_results)
+        save_persistent_results(current_results)
 
-        # Yield intermediate results
+        # Yield intermediate results (showing full leaderboard including historical)
         yield format_results_table(current_results), create_comparison_chart(current_results), None
 
-    # Final results
+    # Final results (merged with historical)
+    current_results = merge_results(persistent_results, new_results)
+    save_persistent_results(current_results)
+
     final_table = format_results_table(current_results)
     final_chart = create_comparison_chart(current_results)
     download_data = create_download_data(current_results)
@@ -595,7 +682,17 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
             run_btn = gr.Button("🚀 Run Evaluation", variant="primary", size="lg")
 
         with gr.Column(scale=2):
-            gr.Markdown("### 📊 Results")
+            gr.Markdown("### 📊 Persistent Leaderboard")
+            gr.Markdown("""
+            **💾 Results persist across sessions!** Run models one at a time to build up a complete leaderboard.
+
+            - New runs merge with existing results
+            - Best score per model is kept
+            - Perfect for avoiding GPU timeouts
+            """)
+
+            clear_status = gr.Markdown("")
+            clear_btn = gr.Button("🗑️ Clear All Results", size="sm", variant="stop")
 
             results_table = gr.Dataframe(
                 label="Leaderboard",
@@ -659,6 +756,18 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
         outputs=[results_table, comparison_plot, download_file]
     )
 
+    # Clear results button
+    clear_btn.click(
+        fn=clear_persistent_results,
+        outputs=[clear_status, results_table, comparison_plot, download_file]
+    )
+
+    # Load persistent leaderboard on startup
+    app.load(
+        fn=load_initial_leaderboard,
+        outputs=[results_table, comparison_plot, download_file]
+    )
+
     gr.Markdown("""
     ---
     **Dataset:** [Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench) • 13,449 tasks |