---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---

# MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules". 
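
As a quick illustration of the pretraining objective, the model can be queried through the standard `fill-mask` pipeline. The following is a minimal sketch, assuming the Hub repository id `fabikru/MolEncoder` and a BERT-style `[MASK]` token; adjust both to the actual values for this repository.

```python
from transformers import pipeline

# Hypothetical repository id; replace with the actual Hub id of this model.
fill_mask = pipeline("fill-mask", model="fabikru/MolEncoder")

# Mask a single character of a SMILES string (mask token assumed to be [MASK]).
smiles = "CC(=O)Oc1ccccc1C(=O)[MASK]"

for prediction in fill_mask(smiles, top_k=3):
    print(prediction["token_str"], prediction["score"])
```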

## Model Description

- **Architecture:** Encoder-only transformer based on ModernBERT  
- **Parameters:** ~15M  
- **Tokenizer:** Character-level tokenizer covering full SMILES vocabulary  
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules; see the sketch after this list)  
- **Pretraining Data:** Pretrained on ~1M molecules (half of ChEMBL)  
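
To make the masking setup concrete, the sketch below shows how a 30% masking ratio can be configured with the standard Hugging Face data collator. This illustrates the objective described above rather than reproducing the exact pretraining code, and the repository id is an assumption.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical repository id; the character-level SMILES tokenizer ships with the model.
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of the tokens instead of the conventional 15% used by the original BERT.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)

# Build a single-example batch: masked input_ids plus the corresponding MLM labels.
batch = collator([tokenizer("CC(=O)Oc1ccccc1C(=O)O")])
print(batch["input_ids"].shape, batch["labels"].shape)
```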

## Key Findings

- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.  
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefits and can degrade efficiency.  
- This 15M-parameter model pretrained on ~1M molecules outperforms much larger models pretrained on more SMILES strings.  

## Intended Uses

- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets   

## How to Use

Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples for fine-tuning the model on custom data and running predictions.  
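
For orientation, the sketch below shows one way to attach a classification head for property prediction using the standard `transformers` Trainer. It assumes the Hub repository id `fabikru/MolEncoder` and uses a tiny in-memory toy dataset; the GitHub repository above contains the vetted fine-tuning workflow.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "fabikru/MolEncoder"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy dataset: SMILES strings with binary property labels.
data = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"],
    "label": [0, 1, 0, 1],
})
data = data.map(lambda batch: tokenizer(batch["smiles"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="molencoder-finetuned", num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```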

## Citation

If you use this model, please cite:
```bibtex
@article{D5DD00369E,
  author    = {Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor},
  title     = {MolEncoder: towards optimal masked language modeling for molecules},
  journal   = {Digital Discovery},
  year      = {2025},
  publisher = {RSC},
  doi       = {10.1039/D5DD00369E},
  url       = {http://dx.doi.org/10.1039/D5DD00369E}
}
```