Political Meme Classification - MAF Model

Model Description

A Multimodal Attention Fusion (MAF) model that classifies Bengali memes into two classes:

NonPolitical (0): Non-political content
Political (1): Political content

This model combines visual features from CLIP and textual features from Bangla-BERT using multi-head attention to classify meme images with Bengali text.

Architecture

Visual Encoder: CLIP ViT-B/32 (last 2 transformer blocks fine-tuned)
Text Encoder: Bangla-BERT (last 2 layers fine-tuned)
Fusion: Multi-head Attention (16 heads) for cross-modal interaction
Classifier: 2-layer fully connected network with dropout
Input: 224x224 images + Bengali text (max 70 tokens)
Output: Binary classification (NonPolitical/Political)
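The components listed above could be wired together roughly as follows. This is a minimal sketch, not the released MAF class (that lives in model_architecture.py and is not reproduced here). The class name FusionSketch, the hidden size of 512, the dropout rate, and the choice to use the pooled image feature as the attention query over the text tokens are all assumptions made for illustration. The 512-dimensional image feature and 768-dimensional token features match the output sizes of CLIP ViT-B/32 and a BERT-base encoder.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical cross-modal fusion head; NOT the released MAF class."""

    def __init__(self, visual_dim=512, text_dim=768, hidden_dim=512,
                 num_heads=16, num_classes=2, dropout=0.1):
        super().__init__()
        # Project both modalities into a shared space (assumed design choice).
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # 16-head attention; hidden_dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # 2-layer fully connected classifier with dropout.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, visual_feat, text_tokens, text_padding_mask=None):
        # visual_feat: (batch, visual_dim), text_tokens: (batch, seq_len, text_dim)
        query = self.visual_proj(visual_feat).unsqueeze(1)   # (batch, 1, hidden)
        keys = self.text_proj(text_tokens)                   # (batch, seq_len, hidden)
        # The image feature attends over the text tokens.
        attended, _ = self.attn(query, keys, keys,
                                key_padding_mask=text_padding_mask)
        fused = torch.cat([query.squeeze(1), attended.squeeze(1)], dim=-1)
        return self.classifier(fused)                        # (batch, num_classes)


# Shape check with random stand-in features (no pretrained weights involved).
sketch = FusionSketch().eval()
with torch.no_grad():
    logits = sketch(torch.randn(2, 512), torch.randn(2, 70, 768))
```

The output has one logit per class, so an argmax over the last dimension would give 0 (NonPolitical) or 1 (Political).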

Training Details

Task: Binary multimodal (image + text) classification
Dataset: PoliMemeDecode (2,290 training samples, 572 validation samples)
Epochs: 10
Learning Rate: 8e-05
Batch Size: 16
Max Text Length: 70
Attention Heads: 16
Optimizer: AdamW with linear warmup scheduler
Loss: CrossEntropyLoss
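The hyperparameters above translate into an optimizer and schedule along these lines. This is a sketch under stated assumptions: the warmup length (10% of all steps) and the linear decay after warmup are guesses, since only "linear warmup" is stated, and the nn.Linear model is a stand-in for the real network.

```python
import math

import torch
import torch.nn as nn

TRAIN_SAMPLES = 2290
BATCH_SIZE = 16
EPOCHS = 10
LEARNING_RATE = 8e-5

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = total_steps // 10  # ASSUMPTION: 10% warmup, not stated in the card


def lr_lambda(step):
    """Learning-rate multiplier: linear ramp up, then (assumed) linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


model = nn.Linear(8, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random data.
features = torch.randn(BATCH_SIZE, 8)
labels = torch.randint(0, 2, (BATCH_SIZE,))
loss = criterion(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # the schedule advances once per batch, not once per epoch
```

With 2,290 samples and a batch size of 16, each epoch has 144 steps, so 10 epochs give 1,440 optimizer steps in total.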

Usage

from huggingface_hub import hf_hub_download
import torch
import clip
from transformers import AutoTokenizer

# Download model files
model_path = hf_hub_download(repo_id="lucius-40/bengali-political-maf-v3", filename="maf_model.pth")
arch_path = hf_hub_download(repo_id="lucius-40/bengali-political-maf-v3", filename="model_architecture.py")

# Import architecture
import importlib.util
spec = importlib.util.spec_from_file_location("model_architecture", arch_path)
model_arch = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_arch)
MAF = model_arch.MAF

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

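# Load CLIP; keep the returned preprocessing transform for preparing images later
clip_model, preprocess = clip.load("ViT-B/32", device=device)
# Keep only the visual encoder, cast to float32
clip_model = clip_model.visual.float()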

# Initialize and load trained model
model = MAF(clip_model, num_classes=2, num_heads=16)
model.load_state_dict(torch.load(model_path, map_location=device))
model = model.to(device)
model.eval()

# Prepare tokenizer
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")

# Run inference
# ... (prepare image and text inputs)

Model Performance

Evaluated on the validation set using binary classification metrics:

Accuracy, Precision, Recall, F1 Score
Class-specific metrics for Political class
Confusion matrix analysis
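For reference, the metrics listed above can be computed from predictions as in the sketch below, treating Political (1) as the positive class. The function name binary_metrics is hypothetical, and the labels in the example are toy values chosen to illustrate the arithmetic; they are not results from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and confusion matrix for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes:
        # [[TN, FP], [FN, TP]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels for illustration only (NOT model outputs).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

On these toy labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so accuracy, precision, recall, and F1 all come out to 0.75.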

Requirements

torch>=1.9.0
torchvision>=0.10.0
transformers>=4.41.2
clip @ git+https://github.com/openai/CLIP.git
pillow>=9.5.0

Citation

@inproceedings{ahsan2024multimodal,
  title={A Multimodal Framework to Detect Target Aware Aggression in Memes},
  author={Ahsan, Shawly and Hossain, Eftekhar and Sharif, Omar and Das, Avishek and Hoque, Mohammed Moshiul and Dewan, M},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2487--2500},
  year={2024}
}

License

Apache 2.0

Limitations

Trained specifically on Bengali political memes
Requires both image and text input
Performance may vary on out-of-domain content
Binary classification only (Political vs NonPolitical)

Base Model

openai/clip-vit-base-patch32 (fine-tuned to produce this model)