Arabic PII Detection Model

This model detects Personally Identifiable Information (PII) in Arabic, English, and French text.

Model Description

A fine-tuned token classification model for detecting PII entities in multilingual text, with a focus on Arabic legal and business documents.

Supported PII Types

The model can detect the following PII types:

Names: FIRSTNAME, MIDDLENAME, LASTNAME
Addresses: STREET, CITY, STATE, ZIPCODE, COUNTRY
Contact: EMAIL, PHONENUMBER, URL
Identity: SSN, IBAN, CREDITCARDNUMBER, DOB
Financial: AMOUNT, ACCOUNTNUMBER, BIC
Professional: COMPANYNAME, JOBTITLE
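The full label set is larger and also covers types such as AGE, BITCOINADDRESS, CURRENCY, DATE, and EYECOLOR; see Label Mappings below.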
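
The categories above can be turned into a lookup table for routing or filtering detections by type. The following is a minimal sketch: the dictionary mirrors the list above, while the names `PII_CATEGORIES`, `LABEL_TO_CATEGORY`, `category_of`, and the fallback category "Other" are illustrative assumptions, not part of the model.

```python
# Category grouping copied from the list above; the variable names are hypothetical.
PII_CATEGORIES = {
    "Names": ["FIRSTNAME", "MIDDLENAME", "LASTNAME"],
    "Addresses": ["STREET", "CITY", "STATE", "ZIPCODE", "COUNTRY"],
    "Contact": ["EMAIL", "PHONENUMBER", "URL"],
    "Identity": ["SSN", "IBAN", "CREDITCARDNUMBER", "DOB"],
    "Financial": ["AMOUNT", "ACCOUNTNUMBER", "BIC"],
    "Professional": ["COMPANYNAME", "JOBTITLE"],
}

# Invert the grouping so each entity type maps to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(entity_type: str) -> str:
    """Return the category for an entity type, or "Other" if it is not listed above."""
    return LABEL_TO_CATEGORY.get(entity_type, "Other")


print(category_of("IBAN"))      # Identity
print(category_of("EYECOLOR"))  # Other
```

Types that the model supports but that are not listed in this section fall into "Other" here; extend the dictionary as needed.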

Usage

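from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Qanoniah/arabic-pii-detector")
model = AutoModelForTokenClassification.from_pretrained("Qanoniah/arabic-pii-detector")

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# "My name is Mohammed Al-Otaibi and my email is ..." (placeholder address).
text = "اسمي محمد العتيبي وبريدي الإلكتروني هو mohammed@example.com"
results = nlp(text)
print(results)  # list of dicts with entity_group, score, word, start, end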
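
A common next step is to mask the detected spans. The sketch below assumes the list-of-dicts shape that the `transformers` pipeline returns with `aggregation_strategy="simple"` (keys `entity_group`, `score`, `word`, `start`, `end`). The `redact` helper and the sample entities are hypothetical and hand-written for illustration; they are not real output from this model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (not real model output).
sample_text = "Contact Sara at sara@example.com"
sample_entities = [
    {"entity_group": "FIRSTNAME", "score": 0.99, "word": "Sara", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "score": 0.98, "word": "sara@example.com", "start": 16, "end": 32},
]

print(redact(sample_text, sample_entities))  # Contact [FIRSTNAME] at [EMAIL]
```

With the real pipeline, the call would be `redact(text, nlp(text))`. Filtering on `score` before redacting is a reasonable addition when false positives matter.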

Training

This model was fine-tuned via transfer learning from the base English PII model (lakshyakh93/deberta_finetuned_pii), with additional training on:

French legal documents translated into Arabic
Arabic data augmented with native names and locations
Synthetic Arabic legal documents (contracts, court rulings)
English retention samples to preserve multilingual capability
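
As an illustration of what name-based augmentation can look like, the sketch below swaps name tokens in a labeled sentence for randomly chosen Arabic names while keeping the labels aligned. This is not the authors' pipeline, which the card does not describe. The name lists, the `augment_names` helper, and the single-token `B-FIRSTNAME` / `B-LASTNAME` tags are all assumptions made for the example.

```python
import random
from typing import List, Tuple

# Hypothetical replacement pools; a real setup would use much larger lists.
FIRST_NAMES = ["محمد", "فاطمة", "خالد"]
LAST_NAMES = ["العتيبي", "القحطاني", "الحربي"]


def augment_names(
    tokens: List[str], labels: List[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Swap single-token names for random ones; labels are left unchanged."""
    new_tokens = []
    for token, label in zip(tokens, labels):
        if label == "B-FIRSTNAME":
            new_tokens.append(rng.choice(FIRST_NAMES))
        elif label == "B-LASTNAME":
            new_tokens.append(rng.choice(LAST_NAMES))
        else:
            new_tokens.append(token)
    return new_tokens, list(labels)


tokens = ["اسمي", "محمد", "العتيبي"]
labels = ["O", "B-FIRSTNAME", "B-LASTNAME"]
new_tokens, new_labels = augment_names(tokens, labels, random.Random(0))
print(new_tokens, new_labels)
```

Because each replacement is a single token, the label sequence stays valid without re-tagging; multi-token names would also need matching `I-` tags.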

Label Mappings

{"label2id": {
  "B-ACCOUNTNAME": 1,
  "B-ACCOUNTNUMBER": 3,
  "B-AGE": 5,
  "B-AMOUNT": 7,
  "B-BIC": 9,
  "B-BITCOINADDRESS": 11,
  "B-BUILDINGNUMBER": 13,
  "B-CITY": 15,
  "B-COMPANYNAME": 17,
  "B-COUNTY": 19,
  "B-CREDITCARDCVV": 21,
  "B-CREDITCARDISSUER": 23,
  "B-CREDITCARDNUMBER": 25,
  "B-CURRENCY": 27,
  "B-CURRENCYCODE": 29,
  "B-CURRENCYNAME": 31,
  "B-CURRENCYSYMBOL": 33,
  "B-DATE": 35,
  "B-DOB": 37,
  "B-EMAIL": 39,
  "B-ETHEREUMADDRESS": 41,
  "B-EYECOLOR": 43,
  "B-FIRSTNAME": 45,
  "B-GE...}

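To turn predicted class ids back into entity types, the mapping can be inverted and the tag prefix stripped. The sketch below uses only the first four entries shown in the mapping above; the `split_tag` helper is hypothetical, and the handling of `O` and `I-` tags assumes the usual BIO convention, which the truncated excerpt does not confirm.

```python
from typing import Optional, Tuple

# First entries of the label2id mapping shown above (the full map is longer).
label2id_excerpt = {
    "B-ACCOUNTNAME": 1,
    "B-ACCOUNTNUMBER": 3,
    "B-AGE": 5,
    "B-AMOUNT": 7,
}

# Invert the mapping to look up a tag from a predicted class id.
id2label_excerpt = {idx: tag for tag, idx in label2id_excerpt.items()}


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a BIO tag such as "B-AMOUNT" into ("B", "AMOUNT")."""
    if tag == "O":  # assumed "outside" tag; not visible in the excerpt
        return "O", None
    prefix, _, entity_type = tag.partition("-")
    return prefix, entity_type


print(id2label_excerpt[7])                # B-AMOUNT
print(split_tag(id2label_excerpt[7]))     # ('B', 'AMOUNT')
```

With the model loaded, the complete mapping is available as `model.config.id2label`, so the excerpt above is only needed for offline illustration.
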
Model Details

Weights format: Safetensors
Model size: 0.6B params
Tensor type: F32
Base model: lakshyakh93/deberta_finetuned_pii