Arabic PII Detection Model
This model detects Personally Identifiable Information (PII) in Arabic, English, and French text.
Model Description
A fine-tuned token classification model for detecting PII entities in multilingual text, with a focus on Arabic legal and business documents.
Supported PII Types
The model can detect the following PII types:
- Names: FIRSTNAME, MIDDLENAME, LASTNAME
- Addresses: STREET, CITY, STATE, ZIPCODE, COUNTRY
- Contact: EMAIL, PHONENUMBER, URL
- Identity: SSN, IBAN, CREDITCARDNUMBER, DOB
- Financial: AMOUNT, ACCOUNTNUMBER, BIC
- Professional: COMPANYNAME, JOBTITLE
- Additional types such as AGE, DATE, and BITCOINADDRESS (see Label Mappings)
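The grouping above can be turned into a small lookup for post-processing predictions. The following is a minimal sketch: `PII_CATEGORIES` and `category_of` are hypothetical helpers, not part of the model or of any library, and the mapping covers only the types named in this section.

```python
# Hypothetical helper: groups the entity types listed above into categories.
# Only the types named in this section are covered; anything else maps to "Other".
PII_CATEGORIES = {
    "Names": ["FIRSTNAME", "MIDDLENAME", "LASTNAME"],
    "Addresses": ["STREET", "CITY", "STATE", "ZIPCODE", "COUNTRY"],
    "Contact": ["EMAIL", "PHONENUMBER", "URL"],
    "Identity": ["SSN", "IBAN", "CREDITCARDNUMBER", "DOB"],
    "Financial": ["AMOUNT", "ACCOUNTNUMBER", "BIC"],
    "Professional": ["COMPANYNAME", "JOBTITLE"],
}

# Reverse index: entity type -> category name.
_TYPE_TO_CATEGORY = {t: c for c, types in PII_CATEGORIES.items() for t in types}


def category_of(label: str) -> str:
    """Return the category for a label such as "EMAIL", "B-EMAIL" or "I-EMAIL"."""
    entity_type = label[2:] if label[:2] in ("B-", "I-") else label
    return _TYPE_TO_CATEGORY.get(entity_type, "Other")
```

A lookup like this may be useful when different categories need different handling, for example masking identity numbers while only logging professional details.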
Usage
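from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Qanoniah/arabic-pii-detector")
model = AutoModelForTokenClassification.from_pretrained("Qanoniah/arabic-pii-detector")

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# English: "My name is Mohammed Al-Otaibi and my email address is [email protected]"
text = "اسمي محمد العتيبي وبريدي الإلكتروني هو [email protected]"
results = nlp(text)
print(results)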
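With an aggregation strategy set, the `transformers` NER pipeline returns a list of dictionaries that include `entity_group`, `score`, `word`, `start`, and `end`. One common follow-up step is to mask the detected spans. The sketch below assumes that output shape; `redact` is a hypothetical helper, and the sample entities are written by hand for illustration rather than produced by this model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a placeholder such as [EMAIL]."""
    # Process spans right to left so earlier offsets stay valid after replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (not real model output).
sample = "Contact Sara at sara@example.com"
entities = [
    {"entity_group": "FIRSTNAME", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "start": 16, "end": 32},
]
print(redact(sample, entities))  # Contact [FIRSTNAME] at [EMAIL]
```

In practice, the list returned by `nlp(text)` can be passed to `redact` directly, optionally after filtering out low-`score` entities.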
Training
This model was trained by transfer learning from the base English PII model (lakshyakh93/deberta_finetuned_pii), using the following additional data:
- French legal documents translated into Arabic
- Arabic data augmented with native names and locations
- Synthetic Arabic legal documents (contracts, court rulings)
- English retention samples to preserve multilingual capability
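The actual augmentation pipeline is not described here. As a rough illustration of what augmenting with native names and locations can look like, the sketch below fills a token template with sampled values and emits BIO tags. Everything in it is an assumption: the name and city lists, the template, the `fill_template` helper, and the use of `O` for non-entity tokens.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical value lists; the lists used for training are not published here.
FIRST_NAMES = ["محمد", "فاطمة", "خالد"]
CITIES = ["الرياض", "جدة", "الدمام"]

# Hypothetical template, roughly "<FIRSTNAME> lives in <CITY>" in Arabic word order.
# Each item is (token, slot); slot is None for fixed words.
TEMPLATE: List[Tuple[str, Optional[str]]] = [
    ("يسكن", None),
    ("{FIRSTNAME}", "FIRSTNAME"),
    ("في", None),
    ("{CITY}", "CITY"),
]


def fill_template(rng: random.Random) -> List[Tuple[str, str]]:
    """Return (token, tag) pairs with slots replaced by sampled values."""
    values = {"FIRSTNAME": rng.choice(FIRST_NAMES), "CITY": rng.choice(CITIES)}
    tagged = []
    for token, slot in TEMPLATE:
        if slot is None:
            tagged.append((token, "O"))  # assumed tag for non-entity tokens
        else:
            tagged.append((values[slot], f"B-{slot}"))
    return tagged


example = fill_template(random.Random(0))
```

Each slot here is a single token, so only `B-` tags appear; multi-token values would also need `I-` tags for the continuation tokens.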
Label Mappings
{"label2id": {
"B-ACCOUNTNAME": 1,
"B-ACCOUNTNUMBER": 3,
"B-AGE": 5,
"B-AMOUNT": 7,
"B-BIC": 9,
"B-BITCOINADDRESS": 11,
"B-BUILDINGNUMBER": 13,
"B-CITY": 15,
"B-COMPANYNAME": 17,
"B-COUNTY": 19,
"B-CREDITCARDCVV": 21,
"B-CREDITCARDISSUER": 23,
"B-CREDITCARDNUMBER": 25,
"B-CURRENCY": 27,
"B-CURRENCYCODE": 29,
"B-CURRENCYNAME": 31,
"B-CURRENCYSYMBOL": 33,
"B-DATE": 35,
"B-DOB": 37,
"B-EMAIL": 39,
"B-ETHEREUMADDRESS": 41,
"B-EYECOLOR": 43,
"B-FIRSTNAME": 45,
"B-GE...}
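To decode predicted class ids back into label strings, the mapping needs to be inverted. The sketch below uses only the first few entries shown above; `entity_types` is a hypothetical helper. On a loaded model, the complete mappings should be available as `model.config.label2id` and `model.config.id2label`.

```python
from typing import Dict, Set

# First entries of label2id as listed above; the full mapping is longer.
label2id: Dict[str, int] = {
    "B-ACCOUNTNAME": 1,
    "B-ACCOUNTNUMBER": 3,
    "B-AGE": 5,
    "B-AMOUNT": 7,
}

# Invert the mapping so that a predicted id can be turned back into a label.
id2label: Dict[int, str] = {idx: label for label, idx in label2id.items()}


def entity_types(mapping: Dict[str, int]) -> Set[str]:
    """Collect distinct entity types by stripping the B-/I- prefix."""
    return {label.split("-", 1)[1] for label in mapping if "-" in label}


print(id2label[3])
print(sorted(entity_types(label2id)))
```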
Base Model
lakshyakh93/deberta_finetuned_pii