🛡️ GLiGuard UNESCO Ethics — AI Guardrail

An open-source guardrail classifier that operationalises the 2021 UNESCO Recommendation on the Ethics of Artificial Intelligence as a fast, multilingual, schema-driven text classifier.

🎯 TL;DR

A 300M-parameter encoder-classifier, fine-tuned with LoRA on 45,340 records (synthetic UNESCO-anchored + WildGuardMix safety floor, EN/FR/ES/RU), screens text against 12 UNESCO ethics labels for use as a pre/post-processing guardrail inside LLM pipelines. UNESCO macro-F1 = 0.817, safety false-positive rate 0.16 % with the calibrated thresholds shipped alongside the weights.

from gliner2 import GLiNER2
from peft import PeftModel
import json
from huggingface_hub import hf_hub_download

REPO = "UNESCO/gliguard-unesco-ethics"

# 1. Load base + adapter
base = GLiNER2.from_pretrained("fastino/gliguard-LLMGuardrails-300M")
model = PeftModel.from_pretrained(base, REPO, subfolder="best")
model.train(False)  # switch to inference mode (same effect as model.eval())

# 2. Load production thresholds (REQUIRED — see Calibration section)
with open(hf_hub_download(REPO, "calibrated_thresholds.json")) as f:
    thresholds = json.load(f)
labels = list(thresholds.keys())

# 3. Classify
tasks = {"unesco_ethics": {"labels": labels, "multi_label": True, "cls_threshold": 0.0}}
out = model.classify_text(
    "We deploy facial recognition to track all citizens in public spaces.",
    tasks, threshold=0.0, include_confidence=True,
)
scores = {item["label"]: item["confidence"] for item in out["unesco_ethics"]}
fired = [label for label, score in scores.items() if score >= thresholds[label]["best_threshold"]]
print(fired)  # → ['mass_surveillance']

📋 The 12 UNESCO Labels

Each label anchors to one or more paragraphs of the 2021 UNESCO Recommendation on the Ethics of AI and to multilingual concept entries from the UNESCO Thesaurus.

Emoji	Label	Anchors (Recommendation §)
👁️	`mass_surveillance`	§75–§77
🛡️	`privacy_data_exposure`	§72–§74
⚖️	`discrimination_bias`	§28–§30
👩	`gender_harm`	§90–§91 (gender equality)
🧒	`child_vulnerable_harm`	§125–§130
📰	`disinformation`	§80, §117
🌍	`cultural_harm`	§86–§89 (cultural diversity)
🌱	`environmental_harm`	§86 (environmental ethics)
💀	`life_death_automation`	§38 (right to life, autonomy)
🧑‍⚖️	`no_human_oversight`	§32–§37 (human oversight)
🕊️	`human_dignity_violation`	§13, §22
🇺🇳	`un_context_risk`	§1, §10 (UN value alignment)

Multi-label. A single input can trigger any subset of the 12 labels independently.

🌍 Multilingual Coverage

Language	Code	Training records	UNESCO macro-F1
🇬🇧 English	`en`	33,972	0.804
🇫🇷 French	`fr`	3,702	0.840
🇪🇸 Spanish	`es`	3,766	0.822
🇷🇺 Russian	`ru`	3,560	0.779
🇸🇦 Arabic	`ar`	—	planned, v2.0 (multilingual base swap)
🇨🇳 Chinese	`zh`	—	planned, v2.0 (Thesaurus contribution + multilingual base)

All four supported languages (EN/FR/ES/RU) fall within the SPEC §7.3 fairness band. AR / ZH are NOT supported in v1.2 — a v1.3 AR experiment landed at AR F1 = 0.366, well below the safe-deployment threshold; root cause is the English-pretrained DeBERTa base (LoRA can't fix monolingual tokenisation). The v2.0 work-package switches to a multilingual encoder (XLM-RoBERTa / mDeBERTa) and pairs the release with SHS Arabic-speaker review. For AR / ZH content today, do not rely on this model alone — route to human review or use a multilingual baseline (Llama Guard 3, ShieldGemma) as a fallback.
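The routing advice above could be sketched as follows. This is a minimal illustration, not part of the released code: `route_by_language` and the route names are hypothetical, and the language code is assumed to come from the caller or an upstream language detector. Treating every language outside EN/FR/ES/RU like AR/ZH is also an assumption made here, since the card only evaluates those four.

```python
# Hypothetical routing helper; not part of the released package.
SUPPORTED_LANGS = {"en", "fr", "es", "ru"}  # languages evaluated in v1.2


def route_by_language(lang_code: str) -> str:
    """Pick a processing route from an ISO 639-1 language code."""
    code = lang_code.strip().lower()
    if code in SUPPORTED_LANGS:
        return "gliguard"
    # AR, ZH, and (by assumption) any other unevaluated language.
    return "human_review_or_multilingual_fallback"


print(route_by_language("fr"))  # → gliguard
print(route_by_language("ar"))  # → human_review_or_multilingual_fallback
```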

🚀 Quick Start

Installation

pip install gliner2 peft huggingface_hub

Production inference (with calibrated thresholds)

from gliner2 import GLiNER2
from peft import PeftModel
import json
from huggingface_hub import hf_hub_download

REPO = "UNESCO/gliguard-unesco-ethics"

model = PeftModel.from_pretrained(
    GLiNER2.from_pretrained("fastino/gliguard-LLMGuardrails-300M"),
    REPO, subfolder="best",
)
model.train(False)  # switch to inference mode (same effect as model.eval())

with open(hf_hub_download(REPO, "calibrated_thresholds.json")) as f:
    thresholds = json.load(f)
labels = list(thresholds.keys())


def screen(text: str) -> list[str]:
    """Return the UNESCO labels triggered by `text` under production thresholds."""
    tasks = {"unesco_ethics": {"labels": labels, "multi_label": True, "cls_threshold": 0.0}}
    out = model.classify_text(text, tasks, threshold=0.0, include_confidence=True)
    scores = {item["label"]: item["confidence"] for item in out["unesco_ethics"]}
    return [label for label, score in scores.items() if score >= thresholds[label]["best_threshold"]]


# Examples
print(screen("Our HR system rejects all candidates over 50."))
# → ['discrimination_bias']

print(screen("Deploy autonomous weapon systems that select targets without human approval."))
# → ['life_death_automation', 'no_human_oversight']

print(screen("This week's AI summit covered governance frameworks across 50 countries."))
# → []  (benign — no UNESCO violation)
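As one way to wire `screen` into an LLM pipeline as a pre- and post-processing check, the sketch below wraps a generation call. `guarded_generate`, `GuardResult`, and the `generate` callable are hypothetical names introduced for illustration; both callables are injected so the snippet runs without the model loaded. It only records flags and marks the exchange for review, in line with this card's guidance that the classifier should act as a signal rather than an autonomous block.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardResult:
    """Outcome of one guarded generation call (hypothetical structure)."""
    response: str
    prompt_flags: list[str] = field(default_factory=list)
    response_flags: list[str] = field(default_factory=list)

    @property
    def needs_review(self) -> bool:
        # Any fired label on either side sends the exchange to a human.
        return bool(self.prompt_flags or self.response_flags)


def guarded_generate(
    prompt: str,
    screen: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> GuardResult:
    """Screen the prompt, call the LLM, then screen its response."""
    prompt_flags = screen(prompt)
    response = generate(prompt)
    response_flags = screen(response)
    return GuardResult(response, prompt_flags, response_flags)
```

In use, `screen` would be the function defined above and `generate` whatever client call the pipeline already makes; the caller then inspects `needs_review` to decide whether to escalate.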

⚠️ Always use the calibrated thresholds

The default 0.5 threshold produces a 100 % safety false-positive rate — the model fires at least one label on every benign input (a GLiNER2 multi-label-with-low-threshold artefact). The shipped calibrated_thresholds.json cuts the safety FPR to 0.16 % while lifting UNESCO macro-F1 from 0.802 to 0.817. Do not skip this step in production.
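To make the effect of per-label thresholds concrete, the sketch below applies a thresholds mapping to a score dictionary. The `{label: {"best_threshold": ...}}` layout follows the snippets in this card; `fire_labels` is a hypothetical helper, and the numeric scores and thresholds are invented for illustration, not the shipped values.

```python
def fire_labels(
    scores: dict[str, float],
    thresholds: dict[str, dict[str, float]],
) -> list[str]:
    """Return the labels whose score meets their own calibrated cut-off."""
    missing = set(scores) - set(thresholds)
    if missing:
        # Fail loudly rather than silently falling back to a default cut-off.
        raise KeyError(f"no calibrated threshold for: {sorted(missing)}")
    return [
        label
        for label, score in scores.items()
        if score >= thresholds[label]["best_threshold"]
    ]


# Illustrative numbers only (NOT the shipped thresholds).
example_scores = {"mass_surveillance": 0.91, "disinformation": 0.62}
example_thresholds = {
    "mass_surveillance": {"best_threshold": 0.80},
    "disinformation": {"best_threshold": 0.85},
}
print(fire_labels(example_scores, example_thresholds))  # → ['mass_surveillance']
```

With a flat 0.5 cut-off, both labels in this toy example would fire.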

📊 Evaluation

Held-out 10 % stratified validation split (4,500 records), with calibrated thresholds applied.

Headline metrics

Regime	n	Result
UNESCO regime (synthetic positives)	752	macro-F1 0.817 ✅ (target ≥0.80)
Safety regime (benign content)	3,748	FPR 0.16 % 🎯
Base-model delta	—	+53.3 pp macro-F1 vs `fastino/gliguard-LLMGuardrails-300M`
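The safety FPR can be read as the share of benign inputs on which at least one label fires, which matches the description of the default-threshold failure mode earlier in this card. A minimal sketch under that assumption (`safety_fpr` is a hypothetical helper, not the project's evaluation code):

```python
def safety_fpr(fired_per_benign_input: list[list[str]]) -> float:
    """Fraction of benign inputs on which at least one label fired."""
    if not fired_per_benign_input:
        raise ValueError("need at least one benign input")
    false_positives = sum(1 for fired in fired_per_benign_input if fired)
    return false_positives / len(fired_per_benign_input)


# Toy example: 1 of 4 benign inputs triggers a label.
print(safety_fpr([[], [], ["disinformation"], []]))  # → 0.25
```

Under this definition, 0.16 % of 3,748 benign records corresponds to roughly 6 flagged inputs; that count is back-computed here, not reported in the source.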

Per-label F1 (UNESCO regime, calibrated)

Label	precision	recall	F1	support
🌱 `environmental_harm`	0.876	1.000	0.934	71
🧒 `child_vulnerable_harm`	0.906	0.939	0.922	66
📰 `disinformation`	0.857	0.952	0.902	63
🌍 `cultural_harm`	0.838	0.945	0.889	55
🛡️ `privacy_data_exposure`	0.875	0.875	0.875	64
👁️ `mass_surveillance`	0.940	0.794	0.861	68
👩 `gender_harm`	0.788	0.857	0.821	63
⚖️ `discrimination_bias`	0.838	0.765	0.800	68
🇺🇳 `un_context_risk`	0.722	0.825	0.770	57
🧑‍⚖️ `no_human_oversight`	0.875	0.673	0.761	52
🕊️ `human_dignity_violation`	0.857	0.621	0.720	66
💀 `life_death_automation`	0.762	0.610	0.678	59
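Each F1 value in the table is the harmonic mean of the precision and recall in the same row. The short check below recomputes three rows from the table; `f1_score` is a local helper written for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-label table.
rows = {
    "environmental_harm": (0.876, 1.000),
    "mass_surveillance": (0.940, 0.794),
    "life_death_automation": (0.762, 0.610),
}
for label, (p, r) in rows.items():
    print(f"{label}: {f1_score(p, r):.3f}")
# environmental_harm: 0.934
# mass_surveillance: 0.861
# life_death_automation: 0.678
```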

F1   0.00         0.25         0.50         0.75         1.00
     |------------|------------|------------|------------|
🌱   ████████████████████████████████████████████░░░     0.934
🧒   ███████████████████████████████████████████░░░░     0.922
📰   ██████████████████████████████████████████░░░░░     0.902
🌍   █████████████████████████████████████████░░░░░░     0.889
🛡️   ████████████████████████████████████████░░░░░░░     0.875
👁️   ███████████████████████████████████████░░░░░░░░     0.861
👩   █████████████████████████████████████░░░░░░░░░░     0.821
⚖️   ████████████████████████████████████░░░░░░░░░░░     0.800
🇺🇳   ██████████████████████████████████░░░░░░░░░░░░░     0.770
🧑‍⚖️   ██████████████████████████████████░░░░░░░░░░░░░     0.761
🕊️   ████████████████████████████████░░░░░░░░░░░░░░░     0.720
💀   ██████████████████████████████░░░░░░░░░░░░░░░░░     0.678

Per-language macro-F1

EN ████████████████████████████████░░  0.804  (n=212)
FR █████████████████████████████████░  0.840  (n=214)
ES ████████████████████████████████░░  0.822  (n=155)
RU ███████████████████████████████░░░  0.779  (n=171)

📄 Full reproducible recipe — see reports/07_paper_evaluation.md in the project repository.

🧪 Training

Dataset (UNESCO/gliguard-unesco-training-v1, private)

Source	Records	Share	Provenance
WildGuardMix (safety floor)	30,000	66 %	`allenai/wildguardmix` train split
Synthetic UNESCO-aligned	15,340	34 %	Qwen3-32B (Apache 2.0) via HF Inference Providers, KG-conditioned on the Recommendation
UNESCO institutional negatives	0	0 %	Deferred to v1.3

Hyperparameters

Param	Value
Base	`fastino/gliguard-LLMGuardrails-300M`
Method	LoRA (r=16, α=32, dropout=0)
`encoder_lr` / `task_lr`	`2e-5` / `2e-4`
Epochs	3
Batch size	16 (per device)
Max sequence length	512
Seed	42 (reproducible)
Hardware	NVIDIA A100 80GB (HF Jobs, org-billed to UNESCO)
Wall-clock	29 min training + ~6 min Hub push
Total cost	~$1.40 for the v1.2 fine-tune; $6.20 cumulative across all of v1.2 R&D

LR was swept across {1e-5, 2e-5, 5e-5} per SPEC §7.2 — 2e-5 selected on the dev set.
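For readers unfamiliar with the LoRA settings in the table, the standard LoRA formulation (general to the method, not specific to this project) keeps a pretrained weight matrix frozen and learns a low-rank update scaled by α / r. The symbols W_0, A, B, d, and k are generic notation introduced here; which modules receive adapters is not stated in this card.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2 .
```

With r = 16 and α = 32, the learned update BA is therefore applied with a scaling factor of 2.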

Reproducibility. Every commit, evaluation, and inference is logged to RUN_LOG.md with UTC timestamps and SHA-pinned dependencies. Full re-run recipe in reports/07_paper_evaluation.md.

⚠️ Risks & Limitations

Per SPEC §10. Each category corresponds to a deeper section in the project's evaluation reports.

1. Interpretive ambiguity 🤔

The 12 labels are operational distillations of the Recommendation, not legal definitions. Borderline cases (academic discussion of a violation vs the violation itself) are routed to hard-negatives during training; the residual ambiguity is real and stakeholders must retain final say.

2. Coverage gaps 📋

Arabic + Chinese deferred to v2.0.
UNESCO institutional negatives (Gap 2) deferred to v1.3.
The synthetic data is anchored to the 2021 Recommendation; the model may under-fire on emerging risks (generative manipulation, agent autonomy) that the Recommendation does not enumerate by name.

3. Bias & fairness ⚖️

Evaluated on two held-out surfaces:

v1 held-out val (4,500 rows, calibrated): UNESCO macro-F1 0.817; per-language en/fr/es/ru all within ±5 pp; gender slice (n=220) macro-F1 0.795, in band; race (n=27) below the n≥30 evidence threshold; disability (n=38) above that threshold but subject to a methodology caveat.

v1.3 balanced fairness eval (1,901 rows, 192 controlled cells, calibrated): UNESCO macro-F1 0.447; per-attribute slices now defensibly measurable:

Attribute	n	macro-F1	Status
gender	348	0.380	flagged (−6.7 pp vs balanced baseline) — investigate v1.4
race	186	0.479	✅ data-gap RESOLVED
disability	189	0.458	✅ methodology caveat RESOLVED (strict & supported macros converge)

The drop from 0.817 to 0.447 is the model's generalisation gap: the v1 val is a held-out slice of the training corpus (same stylistic surface); the balanced eval is fresh Qwen3-32B generation. Honest read: ~37 pp of v1.2's 0.817 reflects surface-pattern memorisation; ~45 pp generalises. See reports/08_balanced_fairness_eval.md for the full breakdown.
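The percentage-point figures quoted in this section follow directly from the reported macro-F1 values. A small arithmetic check (the constant and function names are ad hoc):

```python
V1_VAL_F1 = 0.817    # v1 held-out validation, calibrated
BALANCED_F1 = 0.447  # v1.3 balanced fairness eval, calibrated
GENDER_F1 = 0.380    # gender slice of the balanced eval


def pp(delta: float) -> float:
    """Convert a difference between two F1 scores into percentage points."""
    return round(delta * 100, 1)


print(pp(V1_VAL_F1 - BALANCED_F1))  # → 37.0  (generalisation gap)
print(pp(GENDER_F1 - BALANCED_F1))  # → -6.7  (gender vs balanced baseline)
```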

4. Deployment risks 🚦

MUST use calibrated thresholds. Default 0.5 → 100 % safety FPR.
Audit-trail required. This is a screening tool; every escalation should be logged and reviewed.
No autonomous block. Plug it as a signal into a system where humans make the final decision.
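One way to satisfy the audit-trail and human-in-the-loop points above is to emit a structured record for every screening call and leave the final decision to a reviewer. The sketch below is illustrative: `audit_record` and its field names are assumptions, not a format defined by the project. Storing a SHA-256 digest instead of the raw text is a design choice made here to avoid copying sensitive content into logs.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(text: str, fired: list[str], model_version: str = "v1.2") -> str:
    """Build one JSON log line for a screening call (hypothetical schema)."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "fired_labels": sorted(fired),
        # The classifier never decides on its own; it only queues for review.
        "status": "pending_human_review" if fired else "no_flag",
    }
    return json.dumps(record, ensure_ascii=False)
```

Each returned string can be appended to a log file or sent to whatever audit store the deployment already uses.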

5. Maintenance commitments 🔧

v2.0 (next major version): multilingual base swap (XLM-RoBERTa-base or mDeBERTa-v3-base); native Arabic + Chinese support; SHS Arabic-speaker + Chinese-speaker review on synthetic data and predictions; UNESCO Thesaurus ZH contribution sub-track. The v1.3 AR experiment surfaced the monolingual base as the bottleneck — see reports/09_v1.3_release.md. ETA: dependent on team funding (paper §7.4).
v1.x patches: tagged on main (e.g., v1.2.1); no public re-release unless macro-F1 changes by ≥1 pp.
Periodic re-anchoring as the Recommendation interpretation evolves.
Open issues + roadmap in the project repository.

📚 Citation

@misc{unesco_gliguard_2026,
  title  = {GLiGuard UNESCO Ethics: An Open-Source Guardrail Classifier for the
            2021 UNESCO Recommendation on the Ethics of Artificial Intelligence},
  author = {UNESCO DBS Data \& AI Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/UNESCO/gliguard-unesco-ethics}},
  note   = {Apache-2.0 fine-tune of fastino/gliguard-LLMGuardrails-300M with LoRA}
}

@misc{unesco_recommendation_2021,
  title  = {Recommendation on the Ethics of Artificial Intelligence},
  author = {{UNESCO}},
  year   = {2021},
  howpublished = {\url{https://unesdoc.unesco.org/ark:/48223/pf0000381137}}
}

@article{zaratiana2025gliner2,
  title   = {GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author  = {Zaratiana, Urchade and others},
  journal = {arXiv preprint arXiv:2507.18546},
  year    = {2025}
}

@article{mo2025kggen,
  title   = {KGGen: Extracting Knowledge Graphs from Plain Text with Language Models},
  author  = {Mo, Belinda and others},
  journal = {arXiv preprint arXiv:2502.09956},
  year    = {2025}
}

🏛️ Acknowledgements

Produced by the UNESCO DBS Data & AI Team (Digital Business Solutions). Aligned with the Social and Human Sciences Sector's stewardship of the 2021 Recommendation on the Ethics of AI.

🤖 Built openly: all code, prompts, training logs, evaluation runs, and decisions are open-sourced in the project repository. Issues + contributions welcome.

📂 Repo Structure

UNESCO/gliguard-unesco-ethics/
├── README.md                       # this card
├── calibrated_thresholds.json      # REQUIRED for production
├── best/                           # 🏆 best-eval-loss LoRA adapter (step 5000)
├── final/                          # final LoRA adapter (step 7590)
├── checkpoint-{6500,7000,7500}/    # rolling 3-checkpoint history
└── training_config.json            # full hyperparameter snapshot

🛡️ Trained transparently. Aligned with the 2021 UNESCO Recommendation on the Ethics of AI.

