---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---

# MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules". 
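
As a quick illustration of the pretraining objective, the model can be queried through the standard `fill-mask` pipeline. The following is a minimal sketch, assuming the Hub repository id `fabikru/MolEncoder` and a BERT-style `[MASK]` token; adjust both to the actual values for this repository.

```python
from transformers import pipeline

# Hypothetical repository id; replace with the actual Hub id of this model.
fill_mask = pipeline("fill-mask", model="fabikru/MolEncoder")

# Mask a single character of a SMILES string (mask token assumed to be [MASK]).
smiles = "CC(=O)Oc1ccccc1C(=O)[MASK]"

for prediction in fill_mask(smiles, top_k=3):
    print(prediction["token_str"], prediction["score"])
```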

## Model Description

- **Architecture:** Encoder-only transformer based on ModernBERT  
- **Parameters:** ~15M  
- **Tokenizer:** Character-level tokenizer covering full SMILES vocabulary  
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules; see the sketch after this list)  
- **Pretraining Data:** Pretrained on ~1M molecules (half of ChEMBL)  
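
To make the masking setup concrete, the sketch below shows how a 30% masking ratio can be configured with the standard Hugging Face data collator. This illustrates the objective described above rather than reproducing the exact pretraining code, and the repository id is an assumption.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical repository id; the character-level SMILES tokenizer ships with the model.
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of the tokens instead of the conventional 15% used by the original BERT.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)

# Build a single-example batch: masked input_ids plus the corresponding MLM labels.
batch = collator([tokenizer("CC(=O)Oc1ccccc1C(=O)O")])
print(batch["input_ids"].shape, batch["labels"].shape)
```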

## Key Findings

- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.  
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefits and can degrade efficiency.  
- This 15M-parameter model pretrained on ~1M molecules outperforms much larger models pretrained on more SMILES strings.  

## Intended Uses

- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets   

## How to Use

Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples for fine-tuning the model on custom data and running predictions.  
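
For orientation, the sketch below shows one way to attach a classification head for property prediction using the standard `transformers` Trainer. It assumes the Hub repository id `fabikru/MolEncoder` and uses a tiny in-memory toy dataset; the GitHub repository above contains the vetted fine-tuning workflow.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "fabikru/MolEncoder"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy dataset: SMILES strings with binary property labels.
data = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"],
    "label": [0, 1, 0, 1],
})
data = data.map(lambda batch: tokenizer(batch["smiles"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="molencoder-finetuned", num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```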

## Citation

If you use this model, please cite:
```bibtex
@article{D5DD00369E,
  author    = {Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor},
  title     = {MolEncoder: towards optimal masked language modeling for molecules},
  journal   = {Digital Discovery},
  year      = {2025},
  publisher = {RSC},
  doi       = {10.1039/D5DD00369E},
  url       = {http://dx.doi.org/10.1039/D5DD00369E}
}
```