# MedGemma 27B – Clinical Error Detection (SFT LoRA Adapter)
A LoRA fine-tuned adapter for Google MedGemma 27B-IT trained to detect critical patient safety errors in clinical documentation.
## Model Description

This adapter was trained as part of Clinipal, an AI-powered clinical error detection system that acts as an automated "second reviewer" of medical reports. The model identifies six categories of high-impact safety errors in emergency department and internal medicine documentation.
### Error Categories

| Error Type | Description |
|---|---|
| CONTRAINDICATED_MEDICATION | Drug dangerous given the patient's conditions/allergies |
| DANGEROUS_DOSAGE | Dose significantly outside the therapeutic range |
| CLINICAL_SCORE_ERROR | Miscalculated risk score affecting treatment decisions |
| MISSING_CRITICAL_TREATMENT | Life-saving intervention clearly omitted |
| TREATMENT_LOGIC_FAILURE | Treatment contradicts the diagnosis |
| MISSING_CRITICAL_WORKUP | Essential diagnostic tests not ordered |
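Downstream code can validate the model's `type` field against these six categories. A minimal sketch (the `ErrorType` class and helper are illustrative, not part of the released code):

```python
from enum import Enum

class ErrorType(str, Enum):
    """The six error categories the adapter is trained to detect."""
    CONTRAINDICATED_MEDICATION = "CONTRAINDICATED_MEDICATION"
    DANGEROUS_DOSAGE = "DANGEROUS_DOSAGE"
    CLINICAL_SCORE_ERROR = "CLINICAL_SCORE_ERROR"
    MISSING_CRITICAL_TREATMENT = "MISSING_CRITICAL_TREATMENT"
    TREATMENT_LOGIC_FAILURE = "TREATMENT_LOGIC_FAILURE"
    MISSING_CRITICAL_WORKUP = "MISSING_CRITICAL_WORKUP"

def is_valid_error_type(value: str) -> bool:
    """True if `value` is one of the six trained error categories."""
    return value in {e.value for e in ErrorType}
```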
## Training

### Dataset

- 300 synthetic clinical reports with annotated errors, generated using GPT-5.2, Gemini 3 Flash Preview, and DeepSeek-V3-R1
- Synthetic reports designed to emulate real-world emergency department documentation
- 150 real clinical reports from Internal Medicine with realistic inserted errors, annotated by 3 physicians (100 used as the held-out test set)
### LoRA Configuration
| Parameter | Value |
|---|---|
| Base model | google/medgemma-27b-it |
| PEFT type | LoRA |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task type | CAUSAL_LM |
| Adapter size | ~889 MB |
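The table above maps directly onto a PEFT `LoraConfig`. A sketch of the equivalent configuration (training hyperparameters such as learning rate and epochs are not published here and are omitted):

```python
from peft import LoraConfig

# LoRA hyperparameters matching the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```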
## Results
Evaluated on 100 held-out real-world clinical cases with physician-annotated errors:
| Configuration | Accuracy |
|---|---|
| Baseline MedGemma 27B (no fine-tuning) | 22.0% |
| This adapter (single-pass) | 40.5% |
| This adapter (multi-agent pipeline) | 60.4% |
| GPT-OSS-120b | 38.1% |
| Gemini 3 Flash Preview | 35.3% |
The multi-agent pipeline runs 6 sequential inference calls (2 first-pass reviewers + 3 specialist critics + 1 final adjudicator) using the same adapter, nearly tripling baseline accuracy (22.0% → 60.4%).
## Usage

### With vLLM (recommended for production)
```bash
# Serve with vLLM + LoRA
python -m vllm.entrypoints.openai.api_server \
    --model google/medgemma-27b-it \
    --port 8000 \
    --enable-lora \
    --lora-modules "sft_adapter=<path-to-this-adapter>" \
    --max-lora-rank 64 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --dtype bfloat16
```
Then call the API:

```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "sft_adapter",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"},
    ],
    "temperature": 0.6,
    "max_tokens": 1024,
})
```
### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-27b-it",
    torch_dtype="bfloat16",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Vrda/medgemma-27b-clinical-error-sft")
tokenizer = AutoTokenizer.from_pretrained("Vrda/medgemma-27b-clinical-error-sft")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"},
]
# add_generation_prompt=True appends the assistant turn marker so the
# model generates a response rather than continuing the user message
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## System Prompt
The model expects this system prompt for optimal performance:
```
You are an emergency medicine clinical safety reviewer analyzing a real patient's
emergency department documentation. Your ONLY task is to identify CRITICAL patient
safety errors – the kind that could cause direct harm if missed.

FOCUS EXCLUSIVELY on these error types:
1. CONTRAINDICATED_MEDICATION
2. DANGEROUS_DOSAGE
3. CLINICAL_SCORE_ERROR
4. MISSING_CRITICAL_TREATMENT
5. TREATMENT_LOGIC_FAILURE
6. MISSING_CRITICAL_WORKUP

STRICT RULES:
- Report AT MOST 3 errors, strictly prioritized by patient safety impact.
- Only report errors you are ≥80% confident about.
- Do NOT report style preferences, minor documentation gaps, or speculative concerns.
- If no critical safety errors exist, return an empty errors array.

IMPORTANT – THINK STEP BY STEP:
For each potential error, include a "reasoning" field with your clinical logic.

Respond with ONLY valid JSON:
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION|DANGEROUS_DOSAGE|CLINICAL_SCORE_ERROR|MISSING_CRITICAL_TREATMENT|TREATMENT_LOGIC_FAILURE|MISSING_CRITICAL_WORKUP",
      "severity": "critical|warning",
      "reasoning": "Step-by-step clinical logic...",
      "problem": "1-2 sentence explanation",
      "recommendation": "1 sentence corrective action",
      "confidence": 0.95
    }
  ],
  "summary": "One-sentence overall safety assessment."
}
```
## Output Format

The model outputs structured JSON:

```json
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION",
      "severity": "critical",
      "reasoning": "The patient has documented bilateral renal artery stenosis. Perindopril is an ACE inhibitor, which is strictly contraindicated in this condition as it can precipitate acute kidney injury and hyperkalemia.",
      "problem": "ACE inhibitor (perindopril) prescribed despite bilateral renal artery stenosis.",
      "recommendation": "Immediately discontinue perindopril and monitor renal function and potassium levels.",
      "confidence": 0.99
    }
  ],
  "summary": "Critical medication contraindication identified requiring immediate intervention."
}
```
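Models occasionally wrap their JSON in markdown fences or stray text, so a defensive parser is useful before consuming the report. A minimal sketch (the function name is illustrative, not part of the released code):

```python
import json
import re

def parse_error_report(raw: str) -> dict:
    """Extract the JSON error report from raw model output.

    Handles plain JSON as well as output wrapped in markdown fences.
    Raises ValueError if no JSON object can be found.
    """
    # Grab the outermost {...} span, ignoring fences or surrounding prose
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    report = json.loads(match.group(0))
    # Enforce the system prompt's cap of at most 3 reported errors
    report["errors"] = report.get("errors", [])[:3]
    return report
```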
## Multi-Agent Pipeline

For best results, use the multi-agent pipeline (6 sequential calls with the same adapter):

1. First-Pass Conservative (temp=0.6): high-precision scan
2. First-Pass Exploratory (temp=1.0): high-recall scan
3. Diagnostics Critic (temp=0.7): diagnostic reasoning focus
4. Treatment Plan Critic (temp=0.75): medication safety focus
5. Follow-Up Critic (temp=0.7): disposition safety focus
6. Final Adjudicator (temp=0.5): synthesizes, de-duplicates, selects top 3

See the Clinipal repository for the full pipeline implementation.
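The sequential structure can be sketched as follows. The `call_model` helper and the per-stage prompts are illustrative assumptions; the actual stage prompts live in the Clinipal repository:

```python
# Each stage reuses the same adapter; only the role prompt and temperature change.
STAGES = [
    ("first_pass_conservative", 0.6),
    ("first_pass_exploratory", 1.0),
    ("diagnostics_critic", 0.7),
    ("treatment_plan_critic", 0.75),
    ("follow_up_critic", 0.7),
    ("final_adjudicator", 0.5),
]

def run_pipeline(clinical_note: str, call_model, stage_prompts: dict) -> str:
    """Run the 6 sequential calls; each later stage sees earlier findings.

    `call_model(system_prompt, user_content, temperature)` is an assumed
    helper wrapping the chat-completions API that returns the raw text.
    """
    findings = []
    for name, temp in STAGES:
        context = "\n\n".join(findings)
        user = f"Clinical note:\n{clinical_note}\n\nPrior findings:\n{context}"
        findings.append(call_model(stage_prompts[name], user, temp))
    return findings[-1]  # the adjudicator's final top-3 selection
```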
## Limitations
- Trained primarily on internal medicine and emergency department cases; may underperform on other specialties
- Accuracy is 40.5% single-pass (60.4% with multi-agent); not suitable as a sole decision-maker
- May produce false positives; all findings should be reviewed by a qualified clinician
- Best performance with English-language clinical notes
## Citation
If you use this model, please cite:
```bibtex
@misc{clinipal2026,
  title={Clinipal: AI-Powered Clinical Error Detection Using Fine-Tuned MedGemma 27B},
  author={Vrdoljak, J. and Luksic, I. and Baric, D.},
  year={2026},
  url={https://github.com/IvanLuksic/medgemma-next}
}

@article{krabic2026llm,
  title={Large language models as second reviewers for medical errors in real-world internal medicine reports: a prospective comparative study of open- and closed-source models},
  author={Krabic, R. and Viculin, I. and Boban, Z. and Kumric, M. and Vilovic, M. and Vrdoljak, J. and Bozic, J.},
  journal={International Journal of Medical Informatics},
  volume={211},
  pages={106316},
  year={2026},
  doi={10.1016/j.ijmedinf.2026.106316},
  pmid={41655522}
}
```
## License
This adapter is released under the Apache 2.0 license. The base model (google/medgemma-27b-it) is subject to Google's Gemma license terms.