Arabic PII Detection Model

This model detects Personally Identifiable Information (PII) in Arabic, English, and French text.

Model Description

A fine-tuned token classification model for detecting PII entities in multilingual text, with a focus on Arabic legal and business documents.

Supported PII Types

The model can detect the following PII types:

  • Names: FIRSTNAME, MIDDLENAME, LASTNAME
  • Addresses: STREET, CITY, STATE, ZIPCODE, COUNTRY
  • Contact: EMAIL, PHONENUMBER, URL
  • Identity: SSN, IBAN, CREDITCARDNUMBER, DOB
  • Financial: AMOUNT, ACCOUNTNUMBER, BIC
  • Professional: COMPANYNAME, JOBTITLE
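  • Others listed under Label Mappings, such as AGE, DATE, BUILDINGNUMBER, COUNTY, CURRENCY, BITCOINADDRESS, ETHEREUMADDRESS, and EYECOLOR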
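
When downstream code only cares about the broad category rather than the exact label, the groupings above can be turned into a lookup table. The following is a minimal sketch, not part of the model itself: `PII_CATEGORIES`, `LABEL_TO_CATEGORY`, and `category_of` are hypothetical names, and the table covers only the labels listed above.

```python
# Hypothetical lookup table built only from the labels listed above;
# the model's full label set is larger (see Label Mappings).
PII_CATEGORIES = {
    "Names": ["FIRSTNAME", "MIDDLENAME", "LASTNAME"],
    "Addresses": ["STREET", "CITY", "STATE", "ZIPCODE", "COUNTRY"],
    "Contact": ["EMAIL", "PHONENUMBER", "URL"],
    "Identity": ["SSN", "IBAN", "CREDITCARDNUMBER", "DOB"],
    "Financial": ["AMOUNT", "ACCOUNTNUMBER", "BIC"],
    "Professional": ["COMPANYNAME", "JOBTITLE"],
}

# Invert the table once so each label maps straight to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(label: str) -> str:
    """Return the category for a label, or "Other" if it is not in the table."""
    # Accept raw BIO tags such as "B-EMAIL" as well as plain "EMAIL".
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return LABEL_TO_CATEGORY.get(label, "Other")


print(category_of("EMAIL"))   # Contact
print(category_of("B-IBAN"))  # Identity
```

Labels missing from the table fall back to "Other" rather than raising, so the sketch keeps working if the model emits a type that is not listed here.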

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

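# Load the tokenizer and the fine-tuned token classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Qanoniah/arabic-pii-detector")
model = AutoModelForTokenClassification.from_pretrained("Qanoniah/arabic-pii-detector")

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")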

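# Arabic for: "My name is Mohammed Al-Otaibi and my email is [email protected]"
text = "اسمي محمد العتيبي وبريدي الإلكتروني هو [email protected]"
results = nlp(text)
# Each result is a dict with entity_group, score, word, start, and end
print(results)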
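
A common next step is to mask the detected spans. The sketch below assumes the output shape that the `transformers` pipeline produces with `aggregation_strategy="simple"` (dicts with `entity_group`, `score`, `start`, and `end`). The `redact` helper, its `min_score` default of 0.5, and the sample entities are hypothetical and written by hand for illustration; they are not real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    min_score is an illustrative threshold, not a recommended value.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (illustration only).
sample_text = "Contact Sara at the Riyadh office."
sample_entities = [
    {"entity_group": "FIRSTNAME", "score": 0.99, "word": "Sara", "start": 8, "end": 12},
    {"entity_group": "CITY", "score": 0.97, "word": "Riyadh", "start": 20, "end": 26},
]

print(redact(sample_text, sample_entities))
# Contact [FIRSTNAME] at the [CITY] office.
```

With the model loaded, `redact(text, nlp(text))` would apply the same masking to real predictions. Replacing spans from the end of the string backwards avoids recomputing offsets after each substitution.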

Training

This model was trained using transfer learning from the base English PII model, with additional training on:

  • French legal documents translated into Arabic
  • Arabic data augmented with native names and locations
  • Synthetic Arabic legal documents (contracts, court rulings)
  • English retention samples to preserve multilingual capability

Label Mappings

{"label2id": {
  "B-ACCOUNTNAME": 1,
  "B-ACCOUNTNUMBER": 3,
  "B-AGE": 5,
  "B-AMOUNT": 7,
  "B-BIC": 9,
  "B-BITCOINADDRESS": 11,
  "B-BUILDINGNUMBER": 13,
  "B-CITY": 15,
  "B-COMPANYNAME": 17,
  "B-COUNTY": 19,
  "B-CREDITCARDCVV": 21,
  "B-CREDITCARDISSUER": 23,
  "B-CREDITCARDNUMBER": 25,
  "B-CURRENCY": 27,
  "B-CURRENCYCODE": 29,
  "B-CURRENCYNAME": 31,
  "B-CURRENCYSYMBOL": 33,
  "B-DATE": 35,
  "B-DOB": 37,
  "B-EMAIL": 39,
  "B-ETHEREUMADDRESS": 41,
  "B-EYECOLOR": 43,
  "B-FIRSTNAME": 45,
  "B-GE...}
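
The label mapping stores BIO-style tags such as `B-CITY`, so the entity types can be recovered by stripping the prefix. This is a small sketch under that assumption: `entity_types` is a hypothetical helper, and the sample dictionary copies only the first few entries shown in the mapping, since the full listing is truncated.

```python
def entity_types(label2id):
    """Collect the distinct entity types from a BIO-style label2id mapping."""
    types = set()
    for label in label2id:
        # Strip the "B-" / "I-" prefix; labels without one (such as "O") are skipped.
        prefix, sep, name = label.partition("-")
        if sep and prefix in ("B", "I"):
            types.add(name)
    return sorted(types)


# First few entries copied from the mapping; the rest is not reproduced here.
sample = {"B-ACCOUNTNAME": 1, "B-ACCOUNTNUMBER": 3, "B-AGE": 5, "B-AMOUNT": 7}
print(entity_types(sample))
# ['ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT']
```

With the model loaded as in the Usage section, passing `model.config.label2id` instead of `sample` should list every type the checkpoint can emit.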

Model size: 0.6B params (Safetensors, tensor type F32)