# MedGemma 27B – Clinical Error Detection (SFT LoRA Adapter)
A LoRA fine-tuned adapter for Google MedGemma 27B-IT trained to detect critical patient safety errors in clinical documentation.
## Model Description

This adapter was trained as part of Clinipal, an AI-powered clinical error detection system that acts as an automated "second reviewer" of medical reports. The model identifies six categories of high-impact safety errors in emergency department and internal medicine documentation.
### Error Categories

| Error Type | Description |
|---|---|
| CONTRAINDICATED_MEDICATION | Drug dangerous given the patient's conditions/allergies |
| DANGEROUS_DOSAGE | Dose significantly outside the therapeutic range |
| CLINICAL_SCORE_ERROR | Miscalculated risk score affecting treatment decisions |
| MISSING_CRITICAL_TREATMENT | Life-saving intervention clearly omitted |
| TREATMENT_LOGIC_FAILURE | Treatment contradicts the diagnosis |
| MISSING_CRITICAL_WORKUP | Essential diagnostic tests not ordered |
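Downstream code can validate the model's `type` field against these six categories. A minimal sketch (the `ErrorType` class and helper are illustrative, not part of the released code):

```python
from enum import Enum

class ErrorType(str, Enum):
    """The six error categories the adapter is trained to detect."""
    CONTRAINDICATED_MEDICATION = "CONTRAINDICATED_MEDICATION"
    DANGEROUS_DOSAGE = "DANGEROUS_DOSAGE"
    CLINICAL_SCORE_ERROR = "CLINICAL_SCORE_ERROR"
    MISSING_CRITICAL_TREATMENT = "MISSING_CRITICAL_TREATMENT"
    TREATMENT_LOGIC_FAILURE = "TREATMENT_LOGIC_FAILURE"
    MISSING_CRITICAL_WORKUP = "MISSING_CRITICAL_WORKUP"

def is_valid_error_type(value: str) -> bool:
    """True if `value` is one of the six trained error categories."""
    return value in {e.value for e in ErrorType}
```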
## Training

### Dataset

- 300 synthetic clinical reports with annotated errors, generated using GPT-5.2, Gemini 3 Flash Preview, and DeepSeek-V3-R1
- Synthetic reports designed to emulate real-world emergency department documentation
- 150 real clinical reports from Internal Medicine with realistic inserted errors, annotated by 3 physicians (100 used as the held-out test set)
### LoRA Configuration
| Parameter | Value |
|---|---|
| Base model | google/medgemma-27b-it |
| PEFT type | LoRA |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task type | CAUSAL_LM |
| Adapter size | ~889 MB |
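The table above maps directly onto a PEFT `LoraConfig`. A sketch of the equivalent configuration (training hyperparameters such as learning rate and epochs are not published here and are omitted):

```python
from peft import LoraConfig

# LoRA hyperparameters matching the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```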
## Results
Evaluated on 100 held-out real-world clinical cases with physician-annotated errors:
| Configuration | Accuracy |
|---|---|
| Baseline MedGemma 27B (no fine-tuning) | 22.0% |
| This adapter (single-pass) | 40.5% |
| This adapter (multi-agent pipeline) | 60.4% |
| GPT-OSS-120b | 38.1% |
| Gemini 3 Flash Preview | 35.3% |
The multi-agent pipeline runs 6 sequential inference calls (2 first-pass reviewers + 3 specialist critics + 1 final adjudicator) using the same adapter, nearly tripling baseline accuracy (22.0% → 60.4%).
## Usage

### With vLLM (recommended for production)
```bash
# Serve with vLLM + LoRA
python -m vllm.entrypoints.openai.api_server \
    --model google/medgemma-27b-it \
    --port 8000 \
    --enable-lora \
    --lora-modules "sft_adapter=<path-to-this-adapter>" \
    --max-lora-rank 64 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --dtype bfloat16
```
Then call the API:

```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "sft_adapter",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"},
    ],
    "temperature": 0.6,
    "max_tokens": 1024,
})
```
### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-27b-it",
    torch_dtype="bfloat16",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Vrda/medgemma-27b-clinical-error-sft")
tokenizer = AutoTokenizer.from_pretrained("Vrda/medgemma-27b-clinical-error-sft")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Analyze the following clinical note:\n\n{clinical_note}"},
]
# add_generation_prompt=True appends the assistant turn marker so the
# model generates a response rather than continuing the user message
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## System Prompt
The model expects this system prompt for optimal performance:
```
You are an emergency medicine clinical safety reviewer analyzing a real patient's
emergency department documentation. Your ONLY task is to identify CRITICAL patient
safety errors – the kind that could cause direct harm if missed.

FOCUS EXCLUSIVELY on these error types:
1. CONTRAINDICATED_MEDICATION
2. DANGEROUS_DOSAGE
3. CLINICAL_SCORE_ERROR
4. MISSING_CRITICAL_TREATMENT
5. TREATMENT_LOGIC_FAILURE
6. MISSING_CRITICAL_WORKUP

STRICT RULES:
- Report AT MOST 3 errors, strictly prioritized by patient safety impact.
- Only report errors you are ≥80% confident about.
- Do NOT report style preferences, minor documentation gaps, or speculative concerns.
- If no critical safety errors exist, return an empty errors array.

IMPORTANT – THINK STEP BY STEP:
For each potential error, include a "reasoning" field with your clinical logic.

Respond with ONLY valid JSON:
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION|DANGEROUS_DOSAGE|CLINICAL_SCORE_ERROR|MISSING_CRITICAL_TREATMENT|TREATMENT_LOGIC_FAILURE|MISSING_CRITICAL_WORKUP",
      "severity": "critical|warning",
      "reasoning": "Step-by-step clinical logic...",
      "problem": "1-2 sentence explanation",
      "recommendation": "1 sentence corrective action",
      "confidence": 0.95
    }
  ],
  "summary": "One-sentence overall safety assessment."
}
```
## Output Format

The model outputs structured JSON:

```json
{
  "errors": [
    {
      "type": "CONTRAINDICATED_MEDICATION",
      "severity": "critical",
      "reasoning": "The patient has documented bilateral renal artery stenosis. Perindopril is an ACE inhibitor, which is strictly contraindicated in this condition as it can precipitate acute kidney injury and hyperkalemia.",
      "problem": "ACE inhibitor (perindopril) prescribed despite bilateral renal artery stenosis.",
      "recommendation": "Immediately discontinue perindopril and monitor renal function and potassium levels.",
      "confidence": 0.99
    }
  ],
  "summary": "Critical medication contraindication identified requiring immediate intervention."
}
```
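Models occasionally wrap their JSON in markdown fences or stray text, so a defensive parser is useful before consuming the report. A minimal sketch (the function name is illustrative, not part of the released code):

```python
import json
import re

def parse_error_report(raw: str) -> dict:
    """Extract the JSON error report from raw model output.

    Handles plain JSON as well as output wrapped in markdown fences.
    Raises ValueError if no JSON object can be found.
    """
    # Grab the outermost {...} span, ignoring fences or surrounding prose
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    report = json.loads(match.group(0))
    # Enforce the system prompt's cap of at most 3 reported errors
    report["errors"] = report.get("errors", [])[:3]
    return report
```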
## Multi-Agent Pipeline

For best results, use the multi-agent pipeline (6 sequential calls with the same adapter):

1. First-Pass Conservative (temp=0.6): high-precision scan
2. First-Pass Exploratory (temp=1.0): high-recall scan
3. Diagnostics Critic (temp=0.7): diagnostic reasoning focus
4. Treatment Plan Critic (temp=0.75): medication safety focus
5. Follow-Up Critic (temp=0.7): disposition safety focus
6. Final Adjudicator (temp=0.5): synthesizes, de-duplicates, selects top 3

See the Clinipal repository for the full pipeline implementation.
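The sequential structure can be sketched as follows. The `call_model` helper and the per-stage prompts are illustrative assumptions; the actual stage prompts live in the Clinipal repository:

```python
# Each stage reuses the same adapter; only the role prompt and temperature change.
STAGES = [
    ("first_pass_conservative", 0.6),
    ("first_pass_exploratory", 1.0),
    ("diagnostics_critic", 0.7),
    ("treatment_plan_critic", 0.75),
    ("follow_up_critic", 0.7),
    ("final_adjudicator", 0.5),
]

def run_pipeline(clinical_note: str, call_model, stage_prompts: dict) -> str:
    """Run the 6 sequential calls; each later stage sees earlier findings.

    `call_model(system_prompt, user_content, temperature)` is an assumed
    helper wrapping the chat-completions API that returns the raw text.
    """
    findings = []
    for name, temp in STAGES:
        context = "\n\n".join(findings)
        user = f"Clinical note:\n{clinical_note}\n\nPrior findings:\n{context}"
        findings.append(call_model(stage_prompts[name], user, temp))
    return findings[-1]  # the adjudicator's final top-3 selection
```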
## Limitations
- Trained primarily on internal medicine and emergency department cases; may underperform on other specialties
- Accuracy is 40.5% single-pass (60.4% with multi-agent); not suitable as a sole decision-maker
- May produce false positives; all findings should be reviewed by a qualified clinician
- Best performance with English-language clinical notes
## Citation
If you use this model, please cite:
```bibtex
@misc{clinipal2026,
  title={Clinipal: AI-Powered Clinical Error Detection Using Fine-Tuned MedGemma 27B},
  author={Vrdoljak, J. and Luksic, I. and Baric, D.},
  year={2026},
  url={https://github.com/IvanLuksic/medgemma-next}
}

@article{krabic2026llm,
  title={Large language models as second reviewers for medical errors in real-world internal medicine reports: a prospective comparative study of open- and closed-source models},
  author={Krabic, R. and Viculin, I. and Boban, Z. and Kumric, M. and Vilovic, M. and Vrdoljak, J. and Bozic, J.},
  journal={International Journal of Medical Informatics},
  volume={211},
  pages={106316},
  year={2026},
  doi={10.1016/j.ijmedinf.2026.106316},
  pmid={41655522}
}
```
## License
This adapter is released under the Apache 2.0 license. The base model (google/medgemma-27b-it) is subject to Google's Gemma license terms.