Political Meme Classification - MAF Model

Model Description

Multimodal Attention Fusion (MAF) model for binary political-content classification of Bengali memes:

  • NonPolitical (0): Non-political content
  • Political (1): Political content

This model combines visual features from CLIP (ViT-B-16) and textual features from XLM-RoBERTa-Large using multi-head attention to classify meme images with Bengali text.

Architecture

  • Visual Encoder: CLIP ViT-B-16 (fine-tuned, pretrained on LAION-2B)
  • Text Encoder: XLM-RoBERTa-Large (fine-tuned)
  • Fusion: Multi-head Attention (8 heads) for cross-modal interaction (sketched below)
  • Classifier: 2-layer fully connected network with dropout
  • Lexicon Boosting: Political keyword detection for improved accuracy
  • Input: 224x224 images + text (max 70 tokens)
  • Output: Binary classification (NonPolitical/Political)
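
For reference, here is a minimal sketch of the fusion and classification stages. The layer names, the fused dimension, and the dropout rate are assumptions made for illustration; the authoritative implementation is in model_architecture.py.

import torch
import torch.nn as nn

class MAFSketch(nn.Module):
    """Illustrative only: project both modalities to a shared space,
    fuse with multi-head attention, then classify."""
    def __init__(self, vis_dim=512, txt_dim=1024, fused_dim=512,
                 num_heads=8, num_classes=2, dropout=0.3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, fused_dim)   # CLIP ViT-B-16 features
        self.txt_proj = nn.Linear(txt_dim, fused_dim)   # XLM-RoBERTa-Large states
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads,
                                                batch_first=True)
        self.classifier = nn.Sequential(                # 2-layer FC head
            nn.Linear(fused_dim, fused_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(fused_dim // 2, num_classes),
        )

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, 1, vis_dim) pooled image features
        # txt_feats: (B, T, txt_dim) per-token text states
        q = self.vis_proj(vis_feats)
        kv = self.txt_proj(txt_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # image queries attend to text
        return self.classifier(fused.mean(dim=1))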

Training Details

  • Task: Binary Image Classification
  • Dataset: PoliMemeDecode (2,290 training samples, 572 validation samples)
  • Epochs: 5
  • Learning Rate: 8e-05
  • Batch Size: 16
  • Max Text Length: 70
  • Attention Heads: 8
  • Optimizer: AdamW with linear warmup scheduler (setup sketched after this list)
  • Loss: CrossEntropyLoss with label smoothing
  • Fine-tuning: Both CLIP visual encoder and XLM-RoBERTa are fully fine-tuned
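
The optimizer, scheduler, and loss can be configured as below. This is a sketch built from the hyperparameters listed above; the warmup fraction and label-smoothing value are assumptions, and in practice you would pass the MAF model's parameters instead of the placeholder.

import math
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder parameter list; use model.parameters() of the MAF model in practice
params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.AdamW(params, lr=8e-5)
steps_per_epoch = math.ceil(2290 / 16)        # 2,290 training samples, batch size 16
total_steps = steps_per_epoch * 5             # 5 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup is an assumption
    num_training_steps=total_steps,
)
# smoothing value assumed; the built-in label_smoothing argument requires torch>=1.10
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)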

Files in Repository

  • maf_model_full.pth - Complete model state dict (includes all weights)
  • clip_config.pth - CLIP model configuration (required for loading)
  • model_config.pth - Model hyperparameters (required for initialization)
  • model_architecture.py - Model architecture code with lexicon support
  • README.md - This documentation file

Usage

Installation

pip install torch torchvision transformers open_clip_torch pillow huggingface_hub

Loading the Model

from huggingface_hub import hf_hub_download
import torch
import open_clip
from transformers import AutoTokenizer
import importlib.util

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ===== STEP 1: Download all required files =====
repo_id = "lucius-40/bengali-political-maf-v8"

clip_config_path = hf_hub_download(repo_id=repo_id, filename="clip_config.pth")
model_config_path = hf_hub_download(repo_id=repo_id, filename="model_config.pth")
model_weights_path = hf_hub_download(repo_id=repo_id, filename="maf_model_full.pth")
arch_path = hf_hub_download(repo_id=repo_id, filename="model_architecture.py")

# ===== STEP 2: Load configurations =====
clip_config = torch.load(clip_config_path, map_location=device)
model_config = torch.load(model_config_path, map_location=device)

print(f"Loading CLIP: {clip_config['model_name']} ({clip_config['pretrained']})")

# ===== STEP 3: Initialize CLIP visual encoder =====
clip_model, _, _ = open_clip.create_model_and_transforms(
    clip_config['model_name'],             # 'ViT-B-16'
    pretrained=clip_config['pretrained'],  # 'laion2b_s34b_b88k'
    device=device
)  # the returned preprocess transform is unused here; see the Inference Example

# Extract visual encoder only
clip_visual = clip_model.visual.float().to(device)

# ===== STEP 4: Load model architecture =====
spec = importlib.util.spec_from_file_location("model_architecture", arch_path)
model_arch = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_arch)
MAF = model_arch.MAF

# ===== STEP 5: Initialize MAF model =====
model = MAF(
    clip_model=clip_visual,
    num_classes=model_config['num_classes'],
    num_heads=model_config['num_heads'],
    use_lexicon_boost=model_config['use_lexicon_boost']
)

# ===== STEP 6: Load fine-tuned weights =====
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model = model.to(device)
model.eval()

print("✓ Model loaded successfully with fine-tuned CLIP and XLM-RoBERTa weights!")

# ===== STEP 7: Prepare tokenizer =====
tokenizer = AutoTokenizer.from_pretrained(model_config['xlm_model_name'])

Inference Example

from PIL import Image
from torchvision import transforms

# Define image preprocessing (IMPORTANT: Must match training)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load and preprocess image
image_path = "path/to/your/meme.jpg"
image = Image.open(image_path).convert('RGB')
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text (OCR text from the meme)
text = "আপনার বাংলা টেক্সট এখানে"  # Your Bengali text here

# Tokenize text
encoded = tokenizer(
    text,
    max_length=model_config['max_length'],
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
input_ids = encoded['input_ids'].to(device)
attention_mask = encoded['attention_mask'].to(device)

# Calculate lexicon matches (for boosting), using the model_architecture
# module loaded dynamically in Step 4 above
lexicon_matches = model_arch.contains_political_keywords(text)
lexicon_tensor = torch.tensor([lexicon_matches]).to(device)

# Run inference
with torch.no_grad():
    outputs = model(image_tensor, input_ids, attention_mask, lexicon_tensor)
    probs = torch.softmax(outputs, dim=1)
    pred_class = torch.argmax(probs, dim=1).item()
    confidence = probs[0][pred_class].item()

# Print results
class_names = ['NonPolitical', 'Political']
print(f"Prediction: {class_names[pred_class]}")
print(f"Confidence: {confidence:.4f}")
print(f"Probabilities: NonPolitical={probs[0][0]:.4f}, Political={probs[0][1]:.4f}")

Model Performance

Evaluated on the 572-sample validation set; the evaluation covers:

  • Binary classification metrics (Accuracy, Precision, Recall, F1)
  • Class-specific metrics for Political class
  • Confusion matrix analysis
  • Lexicon-based boosting for improved political content detection

Requirements

torch>=1.9.0
torchvision>=0.10.0
transformers>=4.41.2
open_clip_torch>=2.0.0
pillow>=9.5.0
huggingface_hub>=0.16.0

Citation

@inproceedings{ahsan2024multimodal,
  title={A Multimodal Framework to Detect Target Aware Aggression in Memes},
  author={Ahsan, Shawly and Hossain, Eftekhar and Sharif, Omar and Das, Avishek and Hoque, Mohammed Moshiul and Dewan, M. Ali Akber},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2487--2500},
  year={2024}
}

License

Apache 2.0

Limitations

  • Trained specifically on Bengali political memes
  • Requires both image and text input (OCR text)
  • Performance may vary on out-of-domain content
  • Binary classification only (Political vs NonPolitical)
  • Image preprocessing must match training pipeline exactly

Important Notes

Image Preprocessing

⚠️ Critical: The image preprocessing pipeline must match the training pipeline exactly:

  • Resize to 224×224
  • Convert to RGB
  • Normalize with ImageNet statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]

Note that these statistics differ from OpenCLIP's default CLIP normalization, so use the transform shown in the Inference Example rather than the preprocess object returned by open_clip.create_model_and_transforms.

Model Loading

The model must be loaded in this exact order:

  1. Load CLIP configuration
  2. Initialize CLIP with OpenCLIP library
  3. Load model architecture
  4. Initialize MAF model with the CLIP visual encoder
  5. Load the fine-tuned state dict

Lexicon Boosting

The model uses political keyword detection to boost predictions. The contains_political_keywords function is included in model_architecture.py and should be used during inference for optimal performance.
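
As an illustration of the idea only (not the repository's actual mechanism, which is defined in model_architecture.py), boosting can be thought of as nudging the Political logit in proportion to keyword hits:

import torch

def boost_logits(logits: torch.Tensor, lexicon_matches: torch.Tensor,
                 weight: float = 0.5) -> torch.Tensor:
    """Sketch: raise the Political logit (index 1) by a weighted match count."""
    boosted = logits.clone()
    boosted[:, 1] += weight * lexicon_matches.float()
    return boosted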

Model Details

CLIP Visual Encoder

  • Model: ViT-B-16 (Vision Transformer)
  • Pretrained: LAION-2B dataset (34B samples seen)
  • Patch Size: 16×16 (196 patches per image)
  • Output Dimension: 512
  • Fine-tuned: All parameters updated during training

Text Encoder

  • Model: XLM-RoBERTa-Large
  • Parameters: 559M
  • Fine-tuned: All layers updated during training
  • Max sequence length: 70 tokens

Political Lexicon

The model uses a curated lexicon of Bengali and English political keywords to boost detection of political content. The lexicon includes terms related to political parties, leaders, movements, and events specific to Bengali political discourse.
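
A minimal sketch of how such a keyword check might look, with purely illustrative entries; the shipped lexicon and the real contains_political_keywords live in model_architecture.py.

POLITICAL_KEYWORDS = [
    "নির্বাচন",   # election
    "সরকার",      # government
    "ভোট",        # vote
    "রাজনীতি",    # politics
    "minister",
    "parliament",
]

def contains_political_keywords(text: str) -> int:
    """Count how many lexicon terms occur in the text (sketch)."""
    lowered = text.lower()
    return sum(1 for kw in POLITICAL_KEYWORDS if kw.lower() in lowered)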
