Model Card for Indus (indus-sde-v0.2)
This model was further pre-trained on the full Science Discovery Engine (SDE) website data, starting from nasa-smd-ibm-v0.1, using a Masked Language Modeling (MLM) objective after extending its context size.
Model Details
- Base Model: nasa-impact/nasa-smd-ibm-v0.1
- Tokenizer: nasa-impact/nasa-smd-ibm-v0.1
- Parameters: 125M
- Pretraining Strategy: Masked Language Modeling (MLM)
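If the checkpoint is published on the Hugging Face Hub, it can be loaded and queried as a standard fill-mask model; the repository id below is a placeholder derived from the model name and is an assumption, not a confirmed Hub path:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Placeholder Hub id -- substitute the actual repository name of this checkpoint.
MODEL_ID = "nasa-impact/indus-sde-v0.2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Fill-mask example on a science-style sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"The Science Discovery Engine indexes NASA {fill.tokenizer.mask_token} data."))
```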
Training Data
- Full Science Discovery Engine (SDE) Website Data
Training Procedure
- transformers Version: 4.48.3
- Strategy: Masked Language Modeling (MLM)
- Stage 1 Training: Extend the context size from 512 to 1024 tokens and train only the position-embedding layer, at a low learning rate, for 1 epoch. (This is done to retain the representations learned by the original upstream Indus model, which was trained on a large scientific corpus.)
- Stage 2 Training: Full training of all parameters with a cosine learning-rate schedule with warmup for 5 epochs. (A sketch of both stages appears after this list.)
- Masking Strategy:
  - Weighted Dynamic Masking based on Keyword Importance (YAKE), combined with Random Masking
  - Masking important keywords forces the model to generalize over the "science" keywords that carry the strongest signal for a document. (A sketch of this masking appears after this list.)
- Masked Language Model Probability: 30%
- Batch Size: 6
- Learning Rate: 5e-5
- Warmup Ratio: 0.1
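The two-stage procedure might be set up roughly as follows. This is a minimal sketch assuming the base checkpoint is a RoBERTa-style encoder as loaded by `transformers` (so `model.roberta.embeddings` exists); the context-extension heuristic and the Stage 1 learning rate are illustrative assumptions, not the original training script:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, TrainingArguments

BASE_ID = "nasa-impact/nasa-smd-ibm-v0.1"
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForMaskedLM.from_pretrained(BASE_ID)

# --- Context extension: 512 -> 1024 tokens ----------------------------------
# RoBERTa reserves two extra position slots (padding offset), hence the +2.
old_pos = model.roberta.embeddings.position_embeddings
new_max = 1024 + 2
new_pos = torch.nn.Embedding(new_max, old_pos.embedding_dim, padding_idx=old_pos.padding_idx)
with torch.no_grad():
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
    # Initialize the added positions by copying the learned ones (an assumed
    # extension heuristic, not necessarily the original recipe).
    new_pos.weight[old_pos.num_embeddings :] = old_pos.weight[2 : 2 + new_max - old_pos.num_embeddings]

embeddings = model.roberta.embeddings
embeddings.position_embeddings = new_pos
embeddings.position_ids = torch.arange(new_max).unsqueeze(0)            # refresh cached ids
embeddings.token_type_ids = torch.zeros((1, new_max), dtype=torch.long)  # refresh cached buffer
model.config.max_position_embeddings = new_max
tokenizer.model_max_length = 1024

# --- Stage 1: train only the position embeddings for 1 epoch ----------------
for param in model.parameters():
    param.requires_grad = False
for param in embeddings.position_embeddings.parameters():
    param.requires_grad = True
stage1_args = TrainingArguments(output_dir="indus-sde-stage1", num_train_epochs=1,
                                per_device_train_batch_size=6,
                                learning_rate=1e-5)  # illustrative "slow" Stage 1 LR

# --- Stage 2: unfreeze everything, cosine schedule with warmup, 5 epochs ----
for param in model.parameters():
    param.requires_grad = True
stage2_args = TrainingArguments(output_dir="indus-sde-stage2", num_train_epochs=5,
                                per_device_train_batch_size=6, learning_rate=5e-5,
                                warmup_ratio=0.1, lr_scheduler_type="cosine")
```

Each stage then runs through a standard `Trainer` with an MLM data collator; only the optimizer targets and schedule differ between the two stages.

The keyword-weighted dynamic masking might be implemented along the following lines, using the `yake` keyword extractor to up-weight science keywords and plain random masking elsewhere. The `KEYWORD_BOOST` factor, the helper name, and the keyword-matching logic are illustrative assumptions, and a fast tokenizer is assumed so that character offsets are available:

```python
import random

import torch
import yake
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")
kw_extractor = yake.KeywordExtractor(lan="en", n=1, top=20)

MLM_PROBABILITY = 0.30  # overall masking budget reported above
KEYWORD_BOOST = 3.0     # assumption: keyword tokens are ~3x more likely to be masked


def weighted_mask(text: str, max_length: int = 1024):
    """Return (input_ids, labels) with keyword-weighted dynamic masking."""
    enc = tokenizer(text, truncation=True, max_length=max_length, return_offsets_mapping=True)
    input_ids = torch.tensor(enc["input_ids"])
    labels = torch.full_like(input_ids, -100)  # -100 = ignored by the MLM loss

    # Character spans of YAKE keywords found in the text (lower YAKE score = more important).
    keywords = [kw.lower() for kw, _score in kw_extractor.extract_keywords(text)]
    lowered = text.lower()
    spans = []
    for kw in keywords:
        start = lowered.find(kw)
        if start != -1:
            spans.append((start, start + len(kw)))

    # Per-token masking probability: boosted if the token overlaps a keyword span.
    special = set(tokenizer.all_special_ids)
    probs = []
    for tok_id, (cs, ce) in zip(enc["input_ids"], enc["offset_mapping"]):
        if tok_id in special or cs == ce:
            probs.append(0.0)
        elif any(cs < e and ce > s for s, e in spans):
            probs.append(min(1.0, MLM_PROBABILITY * KEYWORD_BOOST))
        else:
            probs.append(MLM_PROBABILITY)

    # Sample the mask; for brevity this always substitutes <mask> rather than
    # the standard 80/10/10 mask/random/keep corruption.
    for i, p in enumerate(probs):
        if random.random() < p:
            labels[i] = input_ids[i]
            input_ids[i] = tokenizer.mask_token_id
    return input_ids, labels
```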
Dataset
- Total Data Size: 545,717
- Validation Data Size: 10% of total size
- Test Data Size: 10% of total size
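For reference, a split with those proportions can be produced with the `datasets` library; the file path below is a placeholder, not a published dataset:

```python
from datasets import load_dataset

# Placeholder path to the processed SDE corpus.
ds = load_dataset("json", data_files="sde_documents.jsonl", split="train")

# 80/10/10 train/validation/test split, matching the proportions above.
split = ds.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], held_out["train"], held_out["test"]
```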
Evaluation
- Top-k Test Mask Accuracy:
  - top-1: 0.7814
  - top-2: 0.8319
  - top-3: 0.8548
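A sketch of how a top-k mask accuracy of this kind can be computed; the metric definition here is an assumption about what was measured (the fraction of masked test positions whose true token appears among the model's k highest-scoring predictions):

```python
import torch


def topk_mask_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 2, 3)) -> dict:
    """Fraction of masked positions whose true token is in the top-k predictions.

    logits: (batch, seq_len, vocab_size) MLM output scores.
    labels: (batch, seq_len) with -100 at unmasked positions.
    """
    masked = labels != -100
    true_ids = labels[masked]                               # (num_masked,)
    top_ids = logits[masked].topk(max(ks), dim=-1).indices  # (num_masked, max_k)
    return {
        f"top{k}": (top_ids[:, :k] == true_ids.unsqueeze(-1)).any(dim=-1).float().mean().item()
        for k in ks
    }
```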
