MedGemma 27B β€” Clinical Error Detection (SFT LoRA Adapter)

A LoRA fine-tuned adapter for Google MedGemma 27B-IT trained to detect critical patient safety errors in clinical documentation.

Model Description

This adapter was trained as part of Clinipal β€” an AI-powered clinical error detection system that acts as an automated "second reviewer" of medical reports. The model identifies 6 categories of high-impact safety errors in emergency department and internal medicine documentation.

Error Categories

| Error Type | Description |
|---|---|
| `CONTRAINDICATED_MEDICATION` | Drug dangerous given patient's conditions/allergies |
| `DANGEROUS_DOSAGE` | Dose significantly outside therapeutic range |
| `CLINICAL_SCORE_ERROR` | Miscalculated risk score affecting treatment decisions |
| `MISSING_CRITICAL_TREATMENT` | Life-saving intervention clearly omitted |
| `TREATMENT_LOGIC_FAILURE` | Treatment contradicts the diagnosis |
| `MISSING_CRITICAL_WORKUP` | Essential diagnostic tests not ordered |
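For downstream code that consumes the model's output, the six categories can be captured as an enum. A minimal Python sketch (the enum name and helper function are illustrative, not part of the released adapter):

```python
from enum import Enum

class ClinicalErrorType(str, Enum):
    """The six error categories the adapter is trained to flag."""
    CONTRAINDICATED_MEDICATION = "CONTRAINDICATED_MEDICATION"
    DANGEROUS_DOSAGE = "DANGEROUS_DOSAGE"
    CLINICAL_SCORE_ERROR = "CLINICAL_SCORE_ERROR"
    MISSING_CRITICAL_TREATMENT = "MISSING_CRITICAL_TREATMENT"
    TREATMENT_LOGIC_FAILURE = "TREATMENT_LOGIC_FAILURE"
    MISSING_CRITICAL_WORKUP = "MISSING_CRITICAL_WORKUP"

def is_known_error_type(value: str) -> bool:
    """Membership check for validating an error type emitted by the model."""
    return value in ClinicalErrorType._value2member_map_
```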

Training

Dataset

  • 300 synthetic clinical reports with annotated errors, generated using GPT-5.2, Gemini 3 Flash Preview, and DeepSeek-V3-R1
  • Synthetic reports designed to emulate real-world emergency department documentation
  • 150 real clinical reports from internal medicine, into which 3 physicians inserted and annotated realistic errors (100 reports held out as the test set)

LoRA Configuration

| Parameter | Value |
|---|---|
| Base model | google/medgemma-27b-it |
| PEFT type | LoRA |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task type | CAUSAL_LM |
| Adapter size | ~889 MB |
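These hyperparameters correspond to the fields peft stores in the adapter's `adapter_config.json`. A sketch of that configuration as a plain dict, written out here for reference (the shipped `adapter_config.json` is authoritative):

```python
# Mirror of the LoRA hyperparameters above, in the shape peft's
# adapter_config.json uses. Illustrative only.
lora_config = {
    "peft_type": "LORA",
    "task_type": "CAUSAL_LM",
    "base_model_name_or_path": "google/medgemma-27b-it",
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# LoRA updates are scaled by alpha / r, i.e. 64 / 32 = 2.0 here.
scaling = lora_config["lora_alpha"] / lora_config["r"]
```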

Results

Evaluated on 100 held-out real-world clinical cases with physician-annotated errors:

| Configuration | Accuracy |
|---|---|
| Baseline MedGemma 27B (no fine-tuning) | 22.0% |
| This adapter (single-pass) | 40.5% |
| This adapter (multi-agent pipeline) | 60.4% |
| GPT-OSS-120b | 38.1% |
| Gemini 3 Flash Preview | 35.3% |

The multi-agent pipeline runs 6 sequential inference calls (2 first-pass reviewers + 3 specialist critics + 1 final adjudicator) using the same adapter, a roughly 2.7× improvement over the 22.0% baseline.

Usage

With vLLM (recommended for production)

```bash
# Serve with vLLM + LoRA
python -m vllm.entrypoints.openai.api_server \
  --model google/medgemma-27b-it \
  --port 8000 \
  --enable-lora \
  --lora-modules "sft_adapter=<path-to-this-adapter>" \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.75 \
  --dtype bfloat16
```

Then call the API:

```python
import requests

# SYSTEM_PROMPT is the prompt from the "System Prompt" section below;
# clinical_note holds the report text to review.
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "sft_adapter",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"}
    ],
    "temperature": 0.6,
    "max_tokens": 1024
})
print(response.json()["choices"][0]["message"]["content"])
```

With Transformers + PEFT

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-27b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Vrda/medgemma-27b-clinical-error-sft")
tokenizer = AutoTokenizer.from_pretrained("Vrda/medgemma-27b-clinical-error-sft")

# SYSTEM_PROMPT and clinical_note as in the vLLM example above.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

System Prompt

The model expects this system prompt for optimal performance:

```
You are an emergency medicine clinical safety reviewer analyzing a real patient's
emergency department documentation. Your ONLY task is to identify CRITICAL patient
safety errors — the kind that could cause direct harm if missed.

FOCUS EXCLUSIVELY on these error types:

1. CONTRAINDICATED_MEDICATION
2. DANGEROUS_DOSAGE
3. CLINICAL_SCORE_ERROR
4. MISSING_CRITICAL_TREATMENT
5. TREATMENT_LOGIC_FAILURE
6. MISSING_CRITICAL_WORKUP

STRICT RULES:
- Report AT MOST 3 errors, strictly prioritized by patient safety impact.
- Only report errors you are ≥80% confident about.
- Do NOT report style preferences, minor documentation gaps, or speculative concerns.
- If no critical safety errors exist, return an empty errors array.

IMPORTANT — THINK STEP BY STEP:
For each potential error, include a "reasoning" field with your clinical logic.

Respond with ONLY valid JSON:
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION|DANGEROUS_DOSAGE|CLINICAL_SCORE_ERROR|MISSING_CRITICAL_TREATMENT|TREATMENT_LOGIC_FAILURE|MISSING_CRITICAL_WORKUP",
      "severity": "critical|warning",
      "reasoning": "Step-by-step clinical logic...",
      "problem": "1-2 sentence explanation",
      "recommendation": "1 sentence corrective action",
      "confidence": 0.95
    }
  ],
  "summary": "One-sentence overall safety assessment."
}
```

Output Format

The model outputs structured JSON:

```json
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION",
      "severity": "critical",
      "reasoning": "The patient has documented bilateral renal artery stenosis. Perindopril is an ACE inhibitor, which is strictly contraindicated in this condition as it can precipitate acute kidney injury and hyperkalemia.",
      "problem": "ACE inhibitor (perindopril) prescribed despite bilateral renal artery stenosis.",
      "recommendation": "Immediately discontinue perindopril and monitor renal function and potassium levels.",
      "confidence": 0.99
    }
  ],
  "summary": "Critical medication contraindication identified requiring immediate intervention."
}
```
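Since downstream use depends on this JSON being well-formed, it is worth validating the reply before acting on it. A minimal sketch, enforcing the type/severity vocabulary and the at-most-3 limit from the system prompt (the function name is illustrative):

```python
import json

ALLOWED_TYPES = {
    "CONTRAINDICATED_MEDICATION", "DANGEROUS_DOSAGE", "CLINICAL_SCORE_ERROR",
    "MISSING_CRITICAL_TREATMENT", "TREATMENT_LOGIC_FAILURE", "MISSING_CRITICAL_WORKUP",
}
ALLOWED_SEVERITIES = {"critical", "warning"}

def parse_review(raw: str) -> dict:
    """Parse the model's JSON reply and drop malformed error entries."""
    data = json.loads(raw)
    valid = [
        err for err in data.get("errors", [])
        if err.get("type") in ALLOWED_TYPES
        and err.get("severity") in ALLOWED_SEVERITIES
    ]
    # Keep at most 3 errors, highest confidence first, mirroring the prompt's limit.
    data["errors"] = sorted(
        valid, key=lambda e: e.get("confidence", 0), reverse=True
    )[:3]
    return data
```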

Multi-Agent Pipeline

For best results, use the multi-agent pipeline (6 sequential calls with the same adapter):

  1. First-Pass Conservative (temp=0.6) β€” high-precision scan
  2. First-Pass Exploratory (temp=1.0) β€” high-recall scan
  3. Diagnostics Critic (temp=0.7) β€” diagnostic reasoning focus
  4. Treatment Plan Critic (temp=0.75) β€” medication safety focus
  5. Follow-Up Critic (temp=0.7) β€” disposition safety focus
  6. Final Adjudicator (temp=0.5) β€” synthesizes, de-duplicates, selects top 3
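Assuming the vLLM OpenAI-compatible endpoint shown above, the six stages can be driven by a small loop. The stage names, per-stage prompts, and helper functions below are an illustrative sketch, not the actual pipeline code:

```python
# (stage name, sampling temperature) for the six sequential calls.
PIPELINE_STAGES = [
    ("first_pass_conservative", 0.6),
    ("first_pass_exploratory", 1.0),
    ("diagnostics_critic", 0.7),
    ("treatment_plan_critic", 0.75),
    ("follow_up_critic", 0.7),
    ("final_adjudicator", 0.5),
]

def build_request(system_prompt: str, user_content: str, temperature: float) -> dict:
    """Payload for one stage against the vLLM OpenAI-compatible endpoint."""
    return {
        "model": "sft_adapter",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": temperature,
        "max_tokens": 1024,
    }

def run_pipeline(note, prompts, url="http://localhost:8000/v1/chat/completions"):
    """Run all six stages in order; the adjudicator sees earlier findings."""
    import requests  # imported lazily so the stage table is usable without it
    findings = []
    for stage, temp in PIPELINE_STAGES:
        user = note
        if stage == "final_adjudicator":
            user = note + "\n\nPrior findings:\n" + "\n".join(findings)
        resp = requests.post(url, json=build_request(prompts[stage], user, temp))
        findings.append(resp.json()["choices"][0]["message"]["content"])
    return findings[-1]  # adjudicator output
```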

See the Clinipal repository for the full pipeline implementation.

Limitations

  • Trained primarily on internal medicine and emergency department cases; may underperform on other specialties
  • Accuracy is 40.5% single-pass (60.4% with multi-agent); not suitable as a sole decision-maker
  • May produce false positives; all findings should be reviewed by a qualified clinician
  • Best performance with English-language clinical notes

Citation

If you use this model, please cite:

```bibtex
@misc{clinipal2026,
  title={Clinipal: AI-Powered Clinical Error Detection Using Fine-Tuned MedGemma 27B},
  author={Vrdoljak, J. and Luksic, I. and Baric, D.},
  year={2026},
  url={https://github.com/IvanLuksic/medgemma-next}
}

@article{krabic2026llm,
  title={Large language models as second reviewers for medical errors in real-world internal medicine reports: a prospective comparative study of open- and closed-source models},
  author={Krabic, R. and Viculin, I. and Boban, Z. and Kumric, M. and Vilovic, M. and Vrdoljak, J. and Bozic, J.},
  journal={International Journal of Medical Informatics},
  volume={211},
  pages={106316},
  year={2026},
  doi={10.1016/j.ijmedinf.2026.106316},
  pmid={41655522}
}
```

License

This adapter is released under the Apache 2.0 license. The base model (google/medgemma-27b-it) is subject to Google's Gemma license terms.
