Political Meme Classification - MAF Model
Model Description
Multimodal Attention Fusion (MAF) model for binary classification of Bengali political memes:
- NonPolitical (0): Non-political content
- Political (1): Political content
This model combines visual features from CLIP (ViT-B-16) and textual features from XLM-RoBERTa-Large using multi-head attention to classify meme images with Bengali text.
Architecture
- Visual Encoder: CLIP ViT-B-16 (fine-tuned, pretrained on LAION-2B)
- Text Encoder: XLM-RoBERTa-Large (fine-tuned)
- Fusion: Multi-head Attention (8 heads) for cross-modal interaction
- Classifier: 2-layer fully connected network with dropout
- Lexicon Boosting: Political keyword detection for improved accuracy
- Input: 224x224 images + text (max 70 tokens)
- Output: Binary classification (NonPolitical/Political)
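The sketch below illustrates how these pieces might fit together. It is not the shipped implementation (that lives in model_architecture.py); the hidden sizes, dropout value, projection layer, and the exact lexicon-boost mechanism are assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel

class MAFSketch(nn.Module):
    """Illustrative sketch of Multimodal Attention Fusion (not the shipped MAF class)."""
    def __init__(self, clip_visual, num_classes=2, num_heads=8, hidden_dim=512, dropout=0.3):
        super().__init__()
        self.clip_visual = clip_visual                                # CLIP ViT-B-16 visual encoder (512-d output)
        self.text_encoder = AutoModel.from_pretrained("FacebookAI/xlm-roberta-large")
        self.text_proj = nn.Linear(1024, hidden_dim)                  # project XLM-R hidden size (1024) to 512
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(                              # 2-layer FC head with dropout
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, images, input_ids, attention_mask, lexicon_matches=None):
        img = self.clip_visual(images).unsqueeze(1)                   # (B, 1, 512) image query
        txt = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        txt = self.text_proj(txt)                                     # (B, T, 512) text keys/values
        fused, _ = self.fusion(query=img, key=txt, value=txt)         # cross-modal multi-head attention
        logits = self.classifier(fused.squeeze(1))
        if lexicon_matches is not None:                               # hypothetical boost: nudge the Political
            boost = torch.stack([torch.zeros_like(lexicon_matches, dtype=logits.dtype),
                                 lexicon_matches.to(logits.dtype)], dim=1)
            logits = logits + boost                                   # logit by the number of keyword hits
        return logits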
Training Details
- Task: Binary Image Classification
- Dataset: PoliMemeDecode (2,290 training samples, 572 validation samples)
- Epochs: 5
- Learning Rate: 8e-05
- Batch Size: 16
- Max Text Length: 70
- Attention Heads: 8
- Optimizer: AdamW with linear warmup scheduler
- Loss: CrossEntropyLoss with label smoothing
- Fine-tuning: Both CLIP visual encoder and XLM-RoBERTa are fully fine-tuned
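As a rough sketch, the training setup described above might be assembled as follows. The warmup proportion and label-smoothing value are assumptions (they are not published here), and `model` refers to an initialized MAF instance as built in the Usage section below.

import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS, LR, BATCH_SIZE = 5, 8e-5, 16
steps_per_epoch = 2290 // BATCH_SIZE                           # 2,290 training samples
total_steps = EPOCHS * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)       # `model` = initialized MAF (see Usage)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),                   # assumed 10% linear warmup
    num_training_steps=total_steps,
)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)     # assumed smoothing factor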
Files in Repository
- maf_model_full.pth - Complete model state dict (includes all weights)
- clip_config.pth - CLIP model configuration (required for loading)
- model_config.pth - Model hyperparameters (required for initialization)
- model_architecture.py - Model architecture code with lexicon support
- README.md - This documentation file
Usage
Installation
pip install torch torchvision transformers open_clip_torch pillow huggingface_hub
Loading the Model
from huggingface_hub import hf_hub_download
import torch
import open_clip
from transformers import AutoTokenizer
import importlib.util
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ===== STEP 1: Download all required files =====
repo_id = "lucius-40/bengali-political-maf-v8"
clip_config_path = hf_hub_download(repo_id=repo_id, filename="clip_config.pth")
model_config_path = hf_hub_download(repo_id=repo_id, filename="model_config.pth")
model_weights_path = hf_hub_download(repo_id=repo_id, filename="maf_model_full.pth")
arch_path = hf_hub_download(repo_id=repo_id, filename="model_architecture.py")
# ===== STEP 2: Load configurations =====
clip_config = torch.load(clip_config_path, map_location=device)
model_config = torch.load(model_config_path, map_location=device)
print(f"Loading CLIP: {clip_config['model_name']} ({clip_config['pretrained']})")
# ===== STEP 3: Initialize CLIP visual encoder =====
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    clip_config['model_name'],             # 'ViT-B-16'
    pretrained=clip_config['pretrained'],  # 'laion2b_s34b_b88k'
    device=device
)
# Extract visual encoder only
clip_visual = clip_model.visual.float().to(device)
# ===== STEP 4: Load model architecture =====
spec = importlib.util.spec_from_file_location("model_architecture", arch_path)
model_arch = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_arch)
MAF = model_arch.MAF
# ===== STEP 5: Initialize MAF model =====
model = MAF(
    clip_model=clip_visual,
    num_classes=model_config['num_classes'],
    num_heads=model_config['num_heads'],
    use_lexicon_boost=model_config['use_lexicon_boost']
)
# ===== STEP 6: Load fine-tuned weights =====
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model = model.to(device)
model.eval()
print("✓ Model loaded successfully with fine-tuned CLIP and XLM-RoBERTa weights!")
# ===== STEP 7: Prepare tokenizer =====
tokenizer = AutoTokenizer.from_pretrained(model_config['xlm_model_name'])
Inference Example
from PIL import Image
from torchvision import transforms
# Define image preprocessing (IMPORTANT: Must match training)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Load and preprocess image
image_path = "path/to/your/meme.jpg"
image = Image.open(image_path).convert('RGB')
image_tensor = transform(image).unsqueeze(0).to(device)
# Prepare text (OCR text from the meme)
text = "আপনার বাংলা টেক্সট এখানে" # Your Bengali text here
# Tokenize text
encoded = tokenizer(
    text,
    max_length=model_config['max_length'],
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
input_ids = encoded['input_ids'].to(device)
attention_mask = encoded['attention_mask'].to(device)
# Calculate lexicon matches (for boosting)
# Calculate lexicon matches (for boosting) using the architecture module loaded earlier via importlib
lexicon_matches = model_arch.contains_political_keywords(text)
lexicon_tensor = torch.tensor([lexicon_matches]).to(device)
# Run inference
with torch.no_grad():
    outputs = model(image_tensor, input_ids, attention_mask, lexicon_tensor)
    probs = torch.softmax(outputs, dim=1)
    pred_class = torch.argmax(probs, dim=1).item()
    confidence = probs[0][pred_class].item()
# Print results
class_names = ['NonPolitical', 'Political']
print(f"Prediction: {class_names[pred_class]}")
print(f"Confidence: {confidence:.4f}")
print(f"Probabilities: NonPolitical={probs[0][0]:.4f}, Political={probs[0][1]:.4f}")
Model Performance
Evaluated on the 572-sample validation set:
- Binary classification metrics (Accuracy, Precision, Recall, F1)
- Class-specific metrics for the Political class
- Confusion matrix analysis
- Lexicon-based boosting applied during evaluation to improve political content detection
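No numeric scores are reported in this card. To reproduce the metrics above on the validation split, a sketch along these lines works (it assumes a hypothetical val_loader yielding the same inputs as the inference example, and requires scikit-learn, which is not in the requirements list):

import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for images, input_ids, attention_mask, lexicon, labels in val_loader:  # hypothetical DataLoader
        logits = model(images.to(device), input_ids.to(device),
                       attention_mask.to(device), lexicon.to(device))
        all_preds.extend(logits.argmax(dim=1).cpu().tolist())
        all_labels.extend(labels.tolist())

acc = accuracy_score(all_labels, all_preds)
prec, rec, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary', pos_label=1)
print(f"Accuracy={acc:.4f}  Political: P={prec:.4f} R={rec:.4f} F1={f1:.4f}")
print(confusion_matrix(all_labels, all_preds))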
Requirements
torch>=1.9.0
torchvision>=0.10.0
transformers>=4.41.2
open_clip_torch>=2.0.0
pillow>=9.5.0
huggingface_hub>=0.16.0
Citation
@inproceedings{ahsan2024multimodal,
  title={A Multimodal Framework to Detect Target Aware Aggression in Memes},
  author={Ahsan, Shawly and Hossain, Eftekhar and Sharif, Omar and Das, Avishek and Hoque, Mohammed Moshiul and Dewan, M Ali Akber},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2487--2500},
  year={2024}
}
License
Apache 2.0
Limitations
- Trained specifically on Bengali political memes
- Requires both image and text input (OCR text)
- Performance may vary on out-of-domain content
- Binary classification only (Political vs NonPolitical)
- Image preprocessing must match training pipeline exactly
Important Notes
Image Preprocessing
⚠️ Critical: The image preprocessing pipeline must match the training pipeline exactly:
- Resize to 224×224
- Convert to RGB
- Normalize with ImageNet statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Model Loading
The model must be loaded in this exact order:
1. Load the CLIP configuration
2. Initialize CLIP with the OpenCLIP library
3. Load the model architecture
4. Initialize the MAF model with the CLIP visual encoder
5. Load the fine-tuned state dict
Lexicon Boosting
The model uses political keyword detection to boost predictions. The contains_political_keywords function is included in model_architecture.py and should be used during inference for optimal performance.
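The real keyword list and matching logic ship with model_architecture.py; the stand-in below only illustrates the idea, and its keywords are hypothetical placeholders rather than the model's actual lexicon.

# Illustrative stand-in only -- use model_arch.contains_political_keywords in practice.
POLITICAL_KEYWORDS = {"নির্বাচন", "ভোট", "সরকার", "election", "parliament"}   # placeholder terms

def contains_political_keywords_sketch(text: str) -> int:
    """Count lexicon terms present in the text (the real function may return a count or a flag)."""
    text = text.lower()
    return sum(1 for keyword in POLITICAL_KEYWORDS if keyword.lower() in text)

print(contains_political_keywords_sketch("নির্বাচন নিয়ে মিম"))   # -> 1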
Model Details
CLIP Visual Encoder
- Model: ViT-B-16 (Vision Transformer)
- Pretrained: LAION-2B dataset (34B samples seen)
- Patch Size: 16×16 (196 patches per image)
- Output Dimension: 512
- Fine-tuned: All parameters updated during training
Text Encoder
- Model: XLM-RoBERTa-Large
- Parameters: 559M
- Fine-tuned: All layers updated during training
- Max sequence length: 70 tokens
Political Lexicon
The model uses a curated lexicon of Bengali and English political keywords to boost detection of political content. The lexicon includes terms related to political parties, leaders, movements, and events specific to Bengali political discourse.