Text Ranking
sentence-transformers
Safetensors
modchembert
cross-encoder
reranker
cheminformatics
smiles
Generated from Trainer
dataset_size:3269544
loss:MultipleNegativesRankingLoss
custom_code
Eval Results (legacy)
How to use Derify/ChemRanker-alpha-sim with sentence-transformers:

    from sentence_transformers import CrossEncoder

    model = CrossEncoder("Derify/ChemRanker-alpha-sim", trust_remote_code=True)

    query = "Which planet is known as the Red Planet?"
    passages = [
        "Venus is often called Earth's twin because of its similar size and proximity.",
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter, the largest planet in our solar system, has a prominent red spot.",
        "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
    ]

    scores = model.predict([(query, passage) for passage in passages])
    print(scores)
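The generic snippet above uses natural-language text, while this model scores pairs of SMILES strings. The following is a minimal sketch of the same `predict` call pattern applied to SMILES reranking. The helper `rerank` and the stand-in `CharOverlapScorer` are hypothetical names introduced here so the example runs without downloading the model; the stand-in's scores are a toy character-overlap measure and say nothing about what the real model would return.

```python
from typing import List, Sequence, Tuple


def rerank(scorer, query: str, candidates: Sequence[str]) -> List[Tuple[str, float]]:
    """Score (query, candidate) pairs and return candidates sorted best-first.

    `scorer` is any object with a `predict(list_of_pairs)` method, which is
    the call pattern used with CrossEncoder in the snippet above.
    """
    scores = scorer.predict([(query, c) for c in candidates])
    paired = zip(candidates, (float(s) for s in scores))
    return sorted(paired, key=lambda item: item[1], reverse=True)


class CharOverlapScorer:
    """Stand-in scorer (NOT the model): Jaccard overlap of character sets."""

    def predict(self, pairs):
        results = []
        for a, b in pairs:
            set_a, set_b = set(a), set(b)
            results.append(len(set_a & set_b) / len(set_a | set_b))
        return results


query = "CCO"                              # ethanol
candidates = ["c1ccccc1", "CCCO", "CCN"]   # benzene, 1-propanol, ethylamine
ranked = rerank(CharOverlapScorer(), query, candidates)
for smiles, score in ranked:
    print(f"{score:.3f}  {smiles}")
```

To try this with the actual model, one would presumably pass `CrossEncoder("Derify/ChemRanker-alpha-sim", trust_remote_code=True)` as `scorer` in place of the stand-in.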
Upload README.md
README.md
CHANGED
@@ -52,7 +52,7 @@ model-index:
 
 This [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) is finetuned from [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) using hard-negative triplets derived from [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity). Positive SMILES pairs are first filtered by quality and similarity constraints, then reduced to one strongest positive target per anchor molecule to create a high-signal training set for reranking. The model computes relevance scores for pairs of SMILES strings, enabling SMILES reranking and molecular semantic search.
 
-For this variant, the positive selection objective is pure similarity ranking
+For this variant, the positive selection objective is pure similarity ranking where each anchor keeps the highest-similarity candidate after filtering, rather than using a QED+similarity composite score. The quality stage uses strict inequality filtering (`QED > 0.85`, `similarity > 0.5`, with similarity also bounded below 1.0), and then keeps the top-scoring pair per anchor molecule.
 
 Hard negatives are mined with [Sentence Transformers](https://www.sbert.net/) using [Derify/ChemMRL-beta](https://huggingface.co/Derify/ChemMRL-beta) as the teacher model and a TopK-PercPos-style margin setting based on [NV-Retriever](https://arxiv.org/abs/2407.15831), with `relative_margin=0.05` and `max_negative_score_threshold = pos_score * percentage_margin`. Training uses triplet-format samples with 5 mined negatives per anchor-positive pair and optimizes a multiple-negatives ranking objective, while reranking evaluation uses n-tuple samples with 30 mined negatives per query.
 
@@ -60,12 +60,11 @@ Hard negatives are mined with [Sentence Transformers](https://www.sbert.net/) us
 
 ### Model Description
 - **Model Type:** Cross Encoder
-
+- **Base model:** [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) <!-- at revision 1d8fd449edb3eadeaa5ebdd1c891e3ce95aebc3d -->
 - **Maximum Sequence Length:** 512 tokens
 - **Number of Output Labels:** 1 label
 - **Training Dataset:**
 - [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) Mined Hard Negatives
-<!-- - **Language:** Unknown -->
 - **License:** apache-2.0
 
 ### Model Sources
@@ -253,11 +252,12 @@ You can finetune this model on your own dataset.
 - `optim`: stable_adamw
 - `optim_args`: decouple_lr=True,max_lr=3e-05
 - `dataloader_persistent_workers`: True
-- `resume_from_checkpoint`:
+- `resume_from_checkpoint`: False
 - `gradient_checkpointing`: True
 - `torch_compile`: True
 - `torch_compile_backend`: inductor
 - `torch_compile_mode`: max-autotune
+- `eval_on_start`: True
 - `batch_sampler`: no_duplicates
 
 #### All Hyperparameters
@@ -344,7 +344,7 @@ You can finetune this model on your own dataset.
 - `skip_memory_metrics`: True
 - `use_legacy_prediction_loop`: False
 - `push_to_hub`: False
-- `resume_from_checkpoint`:
+- `resume_from_checkpoint`: False
 - `hub_model_id`: None
 - `hub_strategy`: every_save
 - `hub_private_repo`: None
@@ -372,7 +372,7 @@ You can finetune this model on your own dataset.
 - `neftune_noise_alpha`: None
 - `optim_target_modules`: None
 - `batch_eval_metrics`: False
-- `eval_on_start`:
+- `eval_on_start`: True
 - `use_liger_kernel`: False
 - `liger_kernel_config`: None
 - `eval_use_gather_object`: False
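The positive-selection rule added in this change, and the margin rule for hard negatives that appears as context beside it, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline. The `Pair` record, the function names, and all numeric values in the toy data are hypothetical. It is also an assumption that `percentage_margin` equals `1 - relative_margin` (0.95 here) and that the highest-scoring candidates under the threshold are the ones kept.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Pair:
    anchor: str        # anchor SMILES
    candidate: str     # candidate positive SMILES
    qed: float         # quality score attached to the pair (assumed field)
    similarity: float  # anchor-candidate similarity


def select_positives(pairs: Iterable[Pair],
                     min_qed: float = 0.85,
                     min_sim: float = 0.5) -> Dict[str, Pair]:
    """Keep pairs with QED > 0.85 and 0.5 < similarity < 1.0 (strict),
    then keep the single highest-similarity pair per anchor."""
    best: Dict[str, Pair] = {}
    for p in pairs:
        if not (p.qed > min_qed and min_sim < p.similarity < 1.0):
            continue
        current = best.get(p.anchor)
        if current is None or p.similarity > current.similarity:
            best[p.anchor] = p
    return best


def select_hard_negatives(pos_score: float,
                          scored_candidates: Iterable[Tuple[str, float]],
                          relative_margin: float = 0.05,
                          num_negatives: int = 5) -> List[str]:
    """Drop candidates scoring above pos_score * percentage_margin, then
    return the highest-scoring survivors as hard negatives."""
    percentage_margin = 1.0 - relative_margin  # assumption, see lead-in
    max_negative_score_threshold = pos_score * percentage_margin
    kept = [(score, smiles) for smiles, score in scored_candidates
            if score <= max_negative_score_threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [smiles for _, smiles in kept[:num_negatives]]


# Toy data with made-up scores, for illustration only.
pairs = [
    Pair("CCO", "CCCO", 0.90, 0.80),
    Pair("CCO", "CCN", 0.95, 0.70),
    Pair("CCO", "CCO", 0.99, 1.00),             # similarity not below 1.0
    Pair("c1ccccc1", "Cc1ccccc1", 0.80, 0.90),  # QED too low
    Pair("c1ccccc1", "Oc1ccccc1", 0.86, 0.50),  # similarity not above 0.5
]
positives = select_positives(pairs)

negatives = select_hard_negatives(
    pos_score=positives["CCO"].similarity,
    scored_candidates=[("CCC", 0.79), ("CO", 0.60), ("CCCC", 0.75), ("C", 0.30)],
    num_negatives=2,
)
print(positives)
print(negatives)
```

In this toy run the candidate scored 0.79 is excluded because it sits too close to the positive's 0.80, which matches the stated purpose of the margin: avoiding negatives that may in fact be unlabelled positives.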