Political Meme Classification - MAF Model
Model Description
Multimodal Attention Fusion (MAF) model for binary classification of Bengali political memes:
- NonPolitical (0): Non-political content
- Political (1): Political content
This model combines visual features from CLIP (ViT-B-16) and textual features from XLM-RoBERTa-Large using multi-head attention to classify meme images with Bengali text.
Architecture
- Visual Encoder: CLIP ViT-B-16 (fine-tuned, pretrained on LAION-2B)
- Text Encoder: XLM-RoBERTa-Large (fine-tuned)
- Fusion: Multi-head Attention (8 heads) for cross-modal interaction
- Classifier: 2-layer fully connected network with dropout
- Lexicon Boosting: Political keyword detection for improved accuracy
- Input: 224x224 images + text (max 70 tokens)
- Output: Binary classification (NonPolitical/Political)
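The sketch below illustrates how these pieces might fit together. It is not the shipped implementation (that lives in model_architecture.py); the hidden sizes, dropout value, projection layer, and the exact lexicon-boost mechanism are assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel

class MAFSketch(nn.Module):
    """Illustrative sketch of Multimodal Attention Fusion (not the shipped MAF class)."""
    def __init__(self, clip_visual, num_classes=2, num_heads=8, hidden_dim=512, dropout=0.3):
        super().__init__()
        self.clip_visual = clip_visual                                # CLIP ViT-B-16 visual encoder (512-d output)
        self.text_encoder = AutoModel.from_pretrained("FacebookAI/xlm-roberta-large")
        self.text_proj = nn.Linear(1024, hidden_dim)                  # project XLM-R hidden size (1024) to 512
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(                              # 2-layer FC head with dropout
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, images, input_ids, attention_mask, lexicon_matches=None):
        img = self.clip_visual(images).unsqueeze(1)                   # (B, 1, 512) image query
        txt = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        txt = self.text_proj(txt)                                     # (B, T, 512) text keys/values
        fused, _ = self.fusion(query=img, key=txt, value=txt)         # cross-modal multi-head attention
        logits = self.classifier(fused.squeeze(1))
        if lexicon_matches is not None:                               # hypothetical boost: nudge the Political
            boost = torch.stack([torch.zeros_like(lexicon_matches, dtype=logits.dtype),
                                 lexicon_matches.to(logits.dtype)], dim=1)
            logits = logits + boost                                   # logit by the number of keyword hits
        return logits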
Training Details
- Task: Binary Image Classification
- Dataset: PoliMemeDecode (2,290 training samples, 572 validation samples)
- Epochs: 5
- Learning Rate: 8e-05
- Batch Size: 16
- Max Text Length: 70
- Attention Heads: 8
- Optimizer: AdamW with linear warmup scheduler
- Loss: CrossEntropyLoss with label smoothing
- Fine-tuning: Both CLIP visual encoder and XLM-RoBERTa are fully fine-tuned
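As a rough sketch, the training setup described above might be assembled as follows. The warmup proportion and label-smoothing value are assumptions (they are not published here), and `model` refers to an initialized MAF instance as built in the Usage section below.

import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS, LR, BATCH_SIZE = 5, 8e-5, 16
steps_per_epoch = 2290 // BATCH_SIZE                           # 2,290 training samples
total_steps = EPOCHS * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)       # `model` = initialized MAF (see Usage)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),                   # assumed 10% linear warmup
    num_training_steps=total_steps,
)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)     # assumed smoothing factor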
Files in Repository
- maf_model_full.pth - Complete model state dict (includes all weights)
- clip_config.pth - CLIP model configuration (required for loading)
- model_config.pth - Model hyperparameters (required for initialization)
- model_architecture.py - Model architecture code with lexicon support
- README.md - This documentation file
Usage
Installation
pip install torch torchvision transformers open_clip_torch pillow huggingface_hub
Loading the Model
from huggingface_hub import hf_hub_download
import torch
import open_clip
from transformers import AutoTokenizer
import importlib.util
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ===== STEP 1: Download all required files =====
repo_id = "lucius-40/bengali-political-maf-v8"
clip_config_path = hf_hub_download(repo_id=repo_id, filename="clip_config.pth")
model_config_path = hf_hub_download(repo_id=repo_id, filename="model_config.pth")
model_weights_path = hf_hub_download(repo_id=repo_id, filename="maf_model_full.pth")
arch_path = hf_hub_download(repo_id=repo_id, filename="model_architecture.py")
# ===== STEP 2: Load configurations =====
clip_config = torch.load(clip_config_path, map_location=device)
model_config = torch.load(model_config_path, map_location=device)
print(f"Loading CLIP: {clip_config['model_name']} ({clip_config['pretrained']})")
# ===== STEP 3: Initialize CLIP visual encoder =====
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    clip_config['model_name'],             # 'ViT-B-16'
    pretrained=clip_config['pretrained'],  # 'laion2b_s34b_b88k'
    device=device
)
# Extract visual encoder only
clip_visual = clip_model.visual.float().to(device)
# ===== STEP 4: Load model architecture =====
spec = importlib.util.spec_from_file_location("model_architecture", arch_path)
model_arch = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_arch)
MAF = model_arch.MAF
# ===== STEP 5: Initialize MAF model =====
model = MAF(
    clip_model=clip_visual,
    num_classes=model_config['num_classes'],
    num_heads=model_config['num_heads'],
    use_lexicon_boost=model_config['use_lexicon_boost']
)
# ===== STEP 6: Load fine-tuned weights =====
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model = model.to(device)
model.eval()
print("✓ Model loaded successfully with fine-tuned CLIP and XLM-RoBERTa weights!")
# ===== STEP 7: Prepare tokenizer =====
tokenizer = AutoTokenizer.from_pretrained(model_config['xlm_model_name'])
Inference Example
from PIL import Image
from torchvision import transforms
# Define image preprocessing (IMPORTANT: Must match training)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Load and preprocess image
image_path = "path/to/your/meme.jpg"
image = Image.open(image_path).convert('RGB')
image_tensor = transform(image).unsqueeze(0).to(device)
# Prepare text (OCR text from the meme)
text = "আপনার বাংলা টেক্সট এখানে" # Your Bengali text here
# Tokenize text
encoded = tokenizer(
    text,
    max_length=model_config['max_length'],
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
input_ids = encoded['input_ids'].to(device)
attention_mask = encoded['attention_mask'].to(device)
# Calculate lexicon matches (for boosting)
# Calculate lexicon matches (for boosting) using the architecture module loaded earlier via importlib
lexicon_matches = model_arch.contains_political_keywords(text)
lexicon_tensor = torch.tensor([lexicon_matches]).to(device)
# Run inference
with torch.no_grad():
    outputs = model(image_tensor, input_ids, attention_mask, lexicon_tensor)
    probs = torch.softmax(outputs, dim=1)
    pred_class = torch.argmax(probs, dim=1).item()
    confidence = probs[0][pred_class].item()
# Print results
class_names = ['NonPolitical', 'Political']
print(f"Prediction: {class_names[pred_class]}")
print(f"Confidence: {confidence:.4f}")
print(f"Probabilities: NonPolitical={probs[0][0]:.4f}, Political={probs[0][1]:.4f}")
Model Performance
Evaluated on the 572-sample validation set:
- Binary classification metrics (Accuracy, Precision, Recall, F1)
- Class-specific metrics for the Political class
- Confusion matrix analysis
- Lexicon-based boosting applied during evaluation to improve political content detection
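No numeric scores are reported in this card. To reproduce the metrics above on the validation split, a sketch along these lines works (it assumes a hypothetical val_loader yielding the same inputs as the inference example, and requires scikit-learn, which is not in the requirements list):

import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for images, input_ids, attention_mask, lexicon, labels in val_loader:  # hypothetical DataLoader
        logits = model(images.to(device), input_ids.to(device),
                       attention_mask.to(device), lexicon.to(device))
        all_preds.extend(logits.argmax(dim=1).cpu().tolist())
        all_labels.extend(labels.tolist())

acc = accuracy_score(all_labels, all_preds)
prec, rec, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary', pos_label=1)
print(f"Accuracy={acc:.4f}  Political: P={prec:.4f} R={rec:.4f} F1={f1:.4f}")
print(confusion_matrix(all_labels, all_preds))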
Requirements
torch>=1.9.0
torchvision>=0.10.0
transformers>=4.41.2
open_clip_torch>=2.0.0
pillow>=9.5.0
huggingface_hub>=0.16.0
Citation
@inproceedings{ahsan2024multimodal,
  title={A Multimodal Framework to Detect Target Aware Aggression in Memes},
  author={Ahsan, Shawly and Hossain, Eftekhar and Sharif, Omar and Das, Avishek and Hoque, Mohammed Moshiul and Dewan, M Ali Akber},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2487--2500},
  year={2024}
}
License
Apache 2.0
Limitations
- Trained specifically on Bengali political memes
- Requires both image and text input (OCR text)
- Performance may vary on out-of-domain content
- Binary classification only (Political vs NonPolitical)
- Image preprocessing must match training pipeline exactly
Important Notes
Image Preprocessing
⚠️ Critical: The image preprocessing pipeline must match the training pipeline exactly:
- Resize to 224×224
- Convert to RGB
- Normalize with ImageNet statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Model Loading
The model must be loaded in this exact order:
1. Load the CLIP configuration
2. Initialize CLIP with the OpenCLIP library
3. Load the model architecture
4. Initialize the MAF model with the CLIP visual encoder
5. Load the fine-tuned state dict
Lexicon Boosting
The model uses political keyword detection to boost predictions. The contains_political_keywords function is included in model_architecture.py and should be used during inference for optimal performance.
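The real keyword list and matching logic ship with model_architecture.py; the stand-in below only illustrates the idea, and its keywords are hypothetical placeholders rather than the model's actual lexicon.

# Illustrative stand-in only -- use model_arch.contains_political_keywords in practice.
POLITICAL_KEYWORDS = {"নির্বাচন", "ভোট", "সরকার", "election", "parliament"}   # placeholder terms

def contains_political_keywords_sketch(text: str) -> int:
    """Count lexicon terms present in the text (the real function may return a count or a flag)."""
    text = text.lower()
    return sum(1 for keyword in POLITICAL_KEYWORDS if keyword.lower() in text)

print(contains_political_keywords_sketch("নির্বাচন নিয়ে মিম"))   # -> 1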
Model Details
CLIP Visual Encoder
- Model: ViT-B-16 (Vision Transformer)
- Pretrained: LAION-2B dataset (34B samples seen)
- Patch Size: 16×16 (196 patches per image)
- Output Dimension: 512
- Fine-tuned: All parameters updated during training
Text Encoder
- Model: XLM-RoBERTa-Large
- Parameters: 559M
- Fine-tuned: All layers updated during training
- Max sequence length: 70 tokens
Political Lexicon
The model uses a curated lexicon of Bengali and English political keywords to boost detection of political content. The lexicon includes terms related to political parties, leaders, movements, and events specific to Bengali political discourse.