Upload Transactor AIBA - Multilingual Banking Transaction NER Model

Files changed (10)

README.md +176 -0
config.json +80 -0
label_mapping.json +52 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
training_args.bin +3 -0
training_config.json +8 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,176 @@

+---
+language:
+- en
+- ru
+- multilingual
+license: apache-2.0
+tags:
+- token-classification
+- ner
+- named-entity-recognition
+- banking
+- transactions
+- financial
+- multilingual
+- bert
+datasets:
+- custom
+metrics:
+- precision
+- recall
+- f1
+- seqeval
+widget:
+- text: "Transfer 12.5mln USD to Apex Industries account 27109477752047116719 INN 123456789 bank code 01234 for consulting"
+- text: "Send 150k RUB to ООО Ромашка счет 40817810099910004312 ИНН 987654321 за услуги"
+- text: "Show completed transactions from 01.12.2024 to 15.12.2024"
+pipeline_tag: token-classification
+---
+# Transactor AIBA - Banking Transaction NER Model
+## Model Description
+**Transactor AIBA** is a multilingual Named Entity Recognition (NER) model fine-tuned from `google-bert/bert-base-multilingual-cased` to extract entities from banking and financial transaction texts. It supports English and Russian.
+## Intended Use
+This model is designed to extract key entities from banking transaction requests, including:
+- Transaction amounts and currencies
+- Account numbers and bank codes
+- Tax identification numbers (INN)
+- Recipient/sender information
+- Transaction purposes
+- Dates and time periods
+## Entity Types
+The model recognizes the following entity types:
+- `amount`
+- `bank_code`
+- `currency`
+- `date`
+- `description`
+- `end_date`
+- `receiver_hr`
+- `receiver_inn`
+- `receiver_name`
+- `start_date`
+- `status`
+## Training Data
+- **Base Model**: `google-bert/bert-base-multilingual-cased`
+- **Training Samples**: 200,015
+- **Validation Samples**: 35,297
+- **Dataset**: Custom banking transaction dataset with multilingual support
+## Training Details
+- **Epochs**: 5
+- **Batch Size**: 16
+- **Learning Rate**: 2e-5
+- **Optimizer**: AdamW
+- **LR Scheduler**: Linear with warmup
+- **Framework**: Transformers + PyTorch
+## Performance
+- **Validation F1 Score**: 0.9999 (0.99986 before rounding, as recorded in `training_config.json`)
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load model and tokenizer
+model_name = "primel/transactor-aiba"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+# Example prediction
+def extract_entities(text):
+    """Return a dict mapping each entity type to the last span found for it."""
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+    with torch.no_grad():
+        outputs = model(**inputs)
+        predictions = torch.argmax(outputs.logits, dim=2)
+    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
+    predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
+    entities = {}
+    current_entity = None
+    current_tokens = []
+
+    def flush():
+        # Store the open span, merging WordPiece pieces back into text.
+        # A later span of the same type overwrites an earlier one.
+        if current_entity and current_tokens:
+            entity_text = tokenizer.convert_tokens_to_string(current_tokens)
+            entities[current_entity] = entity_text.strip()
+
+    for token, label in zip(tokens, predicted_labels):
+        if token in ['[CLS]', '[SEP]', '[PAD]']:
+            continue
+        if label.startswith('B-'):
+            # A B- tag closes any open span and starts a new one.
+            flush()
+            current_entity = label[2:]
+            current_tokens = [token]
+        elif label.startswith('I-') and current_entity == label[2:]:
+            # An I- tag of the same type extends the open span.
+            current_tokens.append(token)
+        else:
+            # "O" or a mismatched I- tag closes the open span.
+            flush()
+            current_entity = None
+            current_tokens = []
+    flush()
+    return entities
+# Example
+text = "Transfer 12.5mln USD to Apex Industries account 27109477752047116719"
+print(extract_entities(text))
+```
+## Example Outputs
+**Input**: "Transfer 12.5mln USD to Apex Industries account 27109477752047116719 INN 123456789 bank code 01234 for consulting"
+**Output**:
+```python
+{
+    "amount": "12.5mln",
+    "currency": "USD",
+    "receiver_name": "Apex Industries",
+    "receiver_hr": "27109477752047116719",
+    "receiver_inn": "123456789",
+    "bank_code": "01234",
+    "description": "consulting"
+}
+```
+## Limitations
+- The model is trained on synthetic and curated banking transaction data
+- Performance may vary on real-world data with different formatting
+- Best results are achieved with transaction texts similar to training distribution
+- May require fine-tuning for specific banking systems or regional variations
+## License
+Apache 2.0
+## Citation
+```bibtex
+@misc{transactor-aiba,
+  author = {Primel},
+  title = {Transactor AIBA: Multilingual Banking Transaction NER},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/primel/transactor-aiba}}
+}
+```
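
Note on the README usage snippet: `extract_entities` stores one string per entity type, so a request with two `date` spans would return only the last one. The following is a minimal sketch of BIO decoding that keeps every span in order. It assumes word-level tokens with exactly one label each (WordPiece merging is left out), and the name `group_bio_spans` is hypothetical; it is not part of the uploaded model or of Transformers. The labels in the usage lines are written by hand for illustration and are not model output.

```python
from typing import List, Tuple


def group_bio_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO labels into (entity_type, text) spans, in order."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []

    def flush() -> None:
        # Record the open span, if there is one.
        if current_type is not None and current_words:
            spans.append((current_type, " ".join(current_words)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # B- closes any open span and opens a new one.
            flush()
            current_type, current_words = label[2:], [token]
        elif label.startswith("I-") and label[2:] == current_type:
            # I- of the same type extends the open span.
            current_words.append(token)
        else:
            # "O" or a mismatched I- tag closes the open span.
            flush()
            current_type, current_words = None, []
    flush()
    return spans


# Hand-written labels, for illustration only.
tokens = "Show completed transactions from 01.12.2024 to 15.12.2024".split()
labels = ["O", "B-status", "O", "O", "B-start_date", "O", "B-end_date"]
print(group_bio_spans(tokens, labels))
```

Returning a list of pairs instead of a dict is a design choice for this sketch: repeated entity types survive, and callers that want the README's dict shape can still build it from the list.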

config.json ADDED

	@@ -0,0 +1,80 @@

+{
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "dtype": "float32",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "B-amount",
+    "1": "B-bank_code",
+    "2": "B-currency",
+    "3": "B-date",
+    "4": "B-description",
+    "5": "B-end_date",
+    "6": "B-receiver_hr",
+    "7": "B-receiver_inn",
+    "8": "B-receiver_name",
+    "9": "B-start_date",
+    "10": "B-status",
+    "11": "I-amount",
+    "12": "I-bank_code",
+    "13": "I-currency",
+    "14": "I-date",
+    "15": "I-description",
+    "16": "I-end_date",
+    "17": "I-receiver_hr",
+    "18": "I-receiver_inn",
+    "19": "I-receiver_name",
+    "20": "I-start_date",
+    "21": "I-status",
+    "22": "O"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B-amount": 0,
+    "B-bank_code": 1,
+    "B-currency": 2,
+    "B-date": 3,
+    "B-description": 4,
+    "B-end_date": 5,
+    "B-receiver_hr": 6,
+    "B-receiver_inn": 7,
+    "B-receiver_name": 8,
+    "B-start_date": 9,
+    "B-status": 10,
+    "I-amount": 11,
+    "I-bank_code": 12,
+    "I-currency": 13,
+    "I-date": 14,
+    "I-description": 15,
+    "I-end_date": 16,
+    "I-receiver_hr": 17,
+    "I-receiver_inn": 18,
+    "I-receiver_name": 19,
+    "I-start_date": 20,
+    "I-status": 21,
+    "O": 22
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.57.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 119547
+}
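
Note on the label ids: the 23 ids in `config.json` coincide with plain alphabetical sorting of the label strings (all `B-` tags, then all `I-` tags, then `O`). Whether the training script built them this way is an assumption; the sketch below only shows that sorting reproduces the same table. `build_label_maps` is a hypothetical helper name.

```python
# Entity types as listed in the README.
ENTITY_TYPES = [
    "amount", "bank_code", "currency", "date", "description", "end_date",
    "receiver_hr", "receiver_inn", "receiver_name", "start_date", "status",
]


def build_label_maps(entity_types):
    """Return (id2label, label2id) for a BIO scheme with ids in sorted order."""
    labels = [f"{prefix}-{name}" for prefix in ("B", "I") for name in entity_types]
    labels.append("O")
    labels.sort()  # "B-..." < "I-..." < "O" in ASCII order
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label), id2label[0], id2label[11], id2label[22])
```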

label_mapping.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "tag2id": {
+    "B-amount": 0,
+    "B-bank_code": 1,
+    "B-currency": 2,
+    "B-date": 3,
+    "B-description": 4,
+    "B-end_date": 5,
+    "B-receiver_hr": 6,
+    "B-receiver_inn": 7,
+    "B-receiver_name": 8,
+    "B-start_date": 9,
+    "B-status": 10,
+    "I-amount": 11,
+    "I-bank_code": 12,
+    "I-currency": 13,
+    "I-date": 14,
+    "I-description": 15,
+    "I-end_date": 16,
+    "I-receiver_hr": 17,
+    "I-receiver_inn": 18,
+    "I-receiver_name": 19,
+    "I-start_date": 20,
+    "I-status": 21,
+    "O": 22
+  },
+  "id2tag": {
+    "0": "B-amount",
+    "1": "B-bank_code",
+    "2": "B-currency",
+    "3": "B-date",
+    "4": "B-description",
+    "5": "B-end_date",
+    "6": "B-receiver_hr",
+    "7": "B-receiver_inn",
+    "8": "B-receiver_name",
+    "9": "B-start_date",
+    "10": "B-status",
+    "11": "I-amount",
+    "12": "I-bank_code",
+    "13": "I-currency",
+    "14": "I-date",
+    "15": "I-description",
+    "16": "I-end_date",
+    "17": "I-receiver_hr",
+    "18": "I-receiver_inn",
+    "19": "I-receiver_name",
+    "20": "I-start_date",
+    "21": "I-status",
+    "22": "O"
+  }
+}
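
Note on loading this file: JSON object keys are always strings, so `id2tag` comes back keyed by `"0"`, `"1"`, and so on, while `tag2id` values are integers. A minimal sketch of loading the mapping with integer ids and checking that the two tables are inverses follows. The inline `raw` string is a three-label excerpt used for illustration, and `load_label_mapping` is a hypothetical helper name.

```python
import json

# Excerpt of label_mapping.json, shortened to three labels for illustration.
raw = (
    '{"tag2id": {"B-amount": 0, "I-amount": 11, "O": 22},'
    ' "id2tag": {"0": "B-amount", "11": "I-amount", "22": "O"}}'
)


def load_label_mapping(text):
    """Parse a label-mapping JSON string and return (tag2id, id2tag) with int ids."""
    data = json.loads(text)
    tag2id = {tag: int(idx) for tag, idx in data["tag2id"].items()}
    id2tag = {int(idx): tag for idx, tag in data["id2tag"].items()}
    # The two tables should be exact inverses of each other.
    if {idx: tag for tag, idx in tag2id.items()} != id2tag:
        raise ValueError("tag2id and id2tag are not inverses")
    return tag2id, id2tag


tag2id, id2tag = load_label_mapping(raw)
print(id2tag[22], tag2id["I-amount"])
```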

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4b1f78ffa7bcf7a93fcdb56fda925503257074b7ba3ef383e047d934c940d4f9
+size 709145500
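
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of parsing it and turning the size into a rough parameter count is shown below. It assumes all weights are stored as float32 (as `"dtype": "float32"` in `config.json` suggests) and ignores the metadata header that a safetensors file also contains, so the figure is a slight overestimate. `parse_lfs_pointer` is a hypothetical helper name.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:4b1f78ffa7bcf7a93fcdb56fda925503257074b7ba3ef383e047d934c940d4f9
size 709145500
"""


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    # Each non-empty line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
# Assumption: 4 bytes per weight (float32), header overhead ignored.
approx_params = info["size"] // 4
print(info["hash_algo"], info["size"], approx_params)
```

Under these assumptions the file holds roughly 177 million parameters.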

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d41dd6c606f9b015995295fdb80c0ab243a2ace6beec1bde55e410d7efa85a40
+size 5777

training_config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "model_name": "google-bert/bert-base-multilingual-cased",
+  "num_train_samples": 200015,
+  "num_val_samples": 35297,
+  "num_epochs": 5,
+  "batch_size": 16,
+  "validation_f1": 0.9998642818660011
+}
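
Note on the numbers in this file: a short sketch of the arithmetic they imply is given below. The step counts assume a single device, no gradient accumulation, and that the last partial batch is kept; none of that is stated in the commit. `summarize_run` is a hypothetical helper name.

```python
import math

# Values copied from training_config.json.
config = {
    "num_train_samples": 200015,
    "num_val_samples": 35297,
    "num_epochs": 5,
    "batch_size": 16,
    "validation_f1": 0.9998642818660011,
}


def summarize_run(cfg):
    """Derive step counts, the validation share and the rounded F1 from the config."""
    # Assumption: one optimizer step per batch, last partial batch included.
    steps_per_epoch = math.ceil(cfg["num_train_samples"] / cfg["batch_size"])
    total = cfg["num_train_samples"] + cfg["num_val_samples"]
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * cfg["num_epochs"],
        "val_fraction": cfg["num_val_samples"] / total,
        "f1_rounded": round(cfg["validation_f1"], 4),
    }


summary = summarize_run(config)
print(summary)
```

Under these assumptions the run takes 12,501 steps per epoch and 62,505 in total, the validation set is about 15% of all samples, and the F1 rounds to the 0.9999 quoted in the README.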

vocab.txt ADDED

The diff for this file is too large to render. See raw diff