Zen0 committed on
Commit 62d78bf · 1 Parent(s): 2338c46

Improve UX: Move evaluation settings to top of page

Better workflow:
1. Settings at top (no scrolling needed)
2. Big Run button immediately below
3. Model selection below that
4. Results on right side

Benefits:
- Faster to configure and run evaluations
- Settings visible without scrolling
- Cleaner separation of concerns
- More intuitive workflow

Also:
- Condensed GPU warning (less verbose)
- Added persistent results note to model selection
- Better visual hierarchy with divider

Files changed (1)
  1. app.py +23 -17
app.py CHANGED
@@ -637,10 +637,28 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
  ✅ **Recommended models** have been tested: Qwen2.5-3B (55.6%), DeepSeek (55%), TinyLlama (33%)
  """)

+ # Settings section at top for better UX
+ gr.Markdown("## ⚙️ Evaluation Settings")
+ with gr.Row():
+ num_samples = gr.Slider(10, 500, value=10, step=10, label="Number of Tasks (10 recommended)")
+ use_4bit = gr.Checkbox(label="Use 4-bit Quantisation", value=True)
+ with gr.Row():
+ temperature = gr.Slider(0.1, 1.0, value=0.7, step=0.1, label="Temperature")
+ max_tokens = gr.Slider(8, 256, value=32, step=8, label="Max New Tokens")
+
+ run_btn = gr.Button("🚀 Run Evaluation", variant="primary", size="lg")
+
+ gr.Markdown("---")
+
  with gr.Row():
  with gr.Column(scale=1):
  gr.Markdown("### 📋 Model Selection")

+ gr.Markdown("""
+ **💾 Persistent Results:** Run 1-2 models at a time to avoid GPU timeouts.
+ Results merge with the leaderboard automatically!
+ """)
+
  # Quick selection buttons
  with gr.Row():
  btn_recommended = gr.Button("✅ Recommended (6)", size="sm", variant="primary")
@@ -661,26 +679,14 @@ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft(
  cb = gr.Checkbox(label=f"{short_name}", value=False)
  model_checkboxes.append((cb, model))

- gr.Markdown("### ⚡ GPU Limits (Free Tier)")
+ gr.Markdown("### ⚡ GPU Limits")
  gr.Markdown("""
- **⚠️ Important:** ZeroGPU free tier has a **60-second limit per session**.
-
- **Recommendations:**
- - ✅ **1-2 models** with 10-20 tasks: Safe, will complete
- - ⚠️ **3-5 models** with 10 tasks: May timeout midway
- - ❌ **6+ models** or 50+ tasks: Will likely timeout
-
- For testing multiple models, run evaluations separately or use fewer tasks.
+ **Free tier: 60-second limit**
+ - ✅ 1-2 models: Safe
+ - ⚠️ 3-5 models: May timeout
+ - ❌ 6+ models: Will timeout
  """)

- gr.Markdown("### ⚙️ Settings")
- num_samples = gr.Slider(10, 500, value=10, step=10, label="Number of Tasks (10 recommended for multiple models)")
- use_4bit = gr.Checkbox(label="Use 4-bit Quantisation", value=True)
- temperature = gr.Slider(0.1, 1.0, value=0.7, step=0.1, label="Temperature")
- max_tokens = gr.Slider(8, 256, value=32, step=8, label="Max New Tokens")
-
- run_btn = gr.Button("🚀 Run Evaluation", variant="primary", size="lg")
-
  with gr.Column(scale=2):
  gr.Markdown("### 📊 Persistent Leaderboard")
  gr.Markdown("""
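
The condensed GPU Limits note sorts runs into three tiers by the number of selected models. As a rough sketch only, the same tiers could be expressed as a small helper. `gpu_timeout_risk` is a hypothetical name and is not part of app.py; the thresholds are taken from the note text, and the commit does not show the app computing this anywhere.

```python
def gpu_timeout_risk(num_models: int) -> str:
    """Return the risk tier from the GPU Limits note for a model count.

    Hypothetical helper for illustration; thresholds mirror the note:
    1-2 models safe, 3-5 may time out, 6+ will time out.
    """
    if num_models < 1:
        raise ValueError("select at least one model")
    if num_models <= 2:
        return "safe"
    if num_models <= 5:
        return "may timeout"
    return "will timeout"


if __name__ == "__main__":
    # Print the tier for one model count from each band.
    for n in (1, 3, 6):
        print(n, gpu_timeout_risk(n))
```

A check like this could, for instance, drive a warning next to the Run button when many checkboxes are ticked, though that wiring is outside what this commit changes.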