---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---
# MolEncoder
MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".
## Model Description
- **Architecture:** Encoder-only transformer based on ModernBERT
- **Parameters:** ~15M
- **Tokenizer:** Character-level tokenizer covering full SMILES vocabulary
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules; see the sketch after this list)
- **Pretraining Data:** ~1M molecules (half of ChEMBL)
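
The sketch below illustrates what the MLM objective with a 30% masking ratio looks like using the Hugging Face `DataCollatorForLanguageModeling`. It is a minimal example, not the training code from the paper; the Hub id `fabikru/MolEncoder` and the example SMILES are assumptions, so substitute the actual released checkpoint name.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

# Assumed Hub id -- replace with the released MolEncoder checkpoint name.
model_id = "fabikru/MolEncoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# MLM collator with the 30% masking ratio reported to work best for SMILES.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

# Toy batch: aspirin as a SMILES string, tokenized at the character level.
batch = collator([tokenizer("CC(=O)Oc1ccccc1C(=O)O", truncation=True)])
outputs = model(**batch)  # labels come from the collator, so a loss is returned
print(outputs.loss)
```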
## Key Findings
- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefits and can degrade efficiency.
- This 15M-parameter model pretrained on ~1M molecules outperforms much larger models pretrained on more SMILES strings.
## Intended Uses
- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets
## How to Use
Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples on fine-tuning the model on custom data and running predictions.
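
As a minimal, self-contained sketch (not the official recipe from the repository), the snippet below loads the pretrained encoder with a fresh sequence-classification head for property-prediction fine-tuning. The Hub id `fabikru/MolEncoder` is an assumption; use the actual checkpoint name.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hub id -- replace with the released MolEncoder checkpoint name.
model_id = "fabikru/MolEncoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Fresh regression head (num_labels=1) on top of the pretrained encoder;
# use num_labels >= 2 for classification tasks.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Character-level tokenization of a small batch of SMILES strings.
inputs = tokenizer(["CCO", "c1ccccc1O"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([2, 1]); the head is untrained, so values are not meaningful yet
```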
## Citation
If you use this model, please cite:
```bibtex
@article{D5DD00369E,
  author    = {Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor},
  title     = {MolEncoder: towards optimal masked language modeling for molecules},
  journal   = {Digital Discovery},
  year      = {2025},
  pages     = {-},
  publisher = {RSC},
  doi       = {10.1039/D5DD00369E},
  url       = {http://dx.doi.org/10.1039/D5DD00369E}
}
```