Model Card for Indus (indus-sde-v0.2)

This model was further pre-trained from nasa-impact/nasa-smd-ibm-v0.1 on the full Science Discovery Engine (SDE) website data with a Masked Language Modeling (MLM) objective, after extending the base model's context size.

Model Details

  • Base Model: nasa-impact/nasa-smd-ibm-v0.1
  • Tokenizer: nasa-impact/nasa-smd-ibm-v0.1
  • Parameters: 125M
  • Pretraining Strategy: Masked Language Modeling (MLM)
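
The model can be queried through the standard transformers fill-mask pipeline. A minimal sketch (the example sentence is illustrative, and the mask token is read from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

# Load the model for masked-token prediction (repo id from this card).
fill_mask = pipeline("fill-mask", model="nasa-impact/indus-sde-v0.2")

# Use the tokenizer's own mask token so the snippet works whether the
# model expects <mask> or [MASK].
text = f"The Science Discovery Engine indexes NASA {fill_mask.tokenizer.mask_token} data."
for pred in fill_mask(text):
    print(f"{pred['token_str']!r}  score={pred['score']:.4f}")
```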

Training Data

  • Full Science Discovery Engine (SDE) Website Data

Training Procedure

  • transformers Version: 4.48.3
  • Strategy: Masked Language Modeling (MLM)
  • Stage 1 Training: Extend the context size from 512 tokens to 1024 tokens and train only the position-embedding layer for 1 epoch. (Freezing the rest of the model preserves the representations learned by the original upstream Indus model, which was trained on a large scientific corpus; see the Stage 1 sketch after this list.)
  • Stage 2 Training: Full-model training with a cosine learning-rate schedule with warmup for 5 epochs
  • Masking Strategy:
    • Weighted Dynamic Masking based on Keyword Importance (YAKE), combined with Random Masking
      • Masking important keywords forces the model to generalize over the "science" keywords that carry a high signal for the document (see the masking sketch after this list)
    • Masked Language Modeling Probability: 30%
  • Batch Size: 6
  • Learning rate: 5e-5
  • Warmup ratio: 0.1
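
A minimal sketch of the Stage 1 recipe, assuming a RoBERTa-style encoder (the architecture family of the Indus base model); how the new position slots are initialized is an assumption, since the card does not specify it:

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")

old_emb = model.roberta.embeddings.position_embeddings
old_num, dim = old_emb.weight.shape   # RoBERTa reserves 2 extra position slots
new_num = old_num + 512               # extend 512 -> 1024 usable positions

new_emb = torch.nn.Embedding(new_num, dim, padding_idx=old_emb.padding_idx)
with torch.no_grad():
    new_emb.weight[:old_num] = old_emb.weight       # keep the learned positions
    new_emb.weight[old_num:] = old_emb.weight[-512:]  # seed new slots by copying
                                                      # (one of several init choices)
model.roberta.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = new_num
# Depending on the transformers version, registered buffers such as
# embeddings.position_ids / token_type_ids may also need extending.

# Stage 1: train only the position embeddings; freeze everything else
# so the pretrained representations are retained.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```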
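
A sketch of the weighted dynamic masking idea. The exact weighting used in training is not given on this card; the `boost` factor and the substring matching below are illustrative assumptions. Requires a fast tokenizer (for `return_offsets_mapping`):

```python
import yake

# YAKE keyword extractor: extract_keywords returns (phrase, score)
# pairs, where lower scores mean more important keywords.
extractor = yake.KeywordExtractor(lan="en", n=2, top=20)

def masking_probs(text, tokenizer, base_p=0.30, boost=2.0):
    """Per-token masking probabilities: tokens that fall inside a YAKE
    keyword get boost * base_p (capped at 1.0), all others base_p."""
    keywords = [kw.lower() for kw, _ in extractor.extract_keywords(text)]
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True)
    probs = []
    for start, end in enc["offset_mapping"]:
        piece = text[start:end].lower()
        in_keyword = bool(piece) and any(piece in kw for kw in keywords)
        probs.append(min(1.0, base_p * boost) if in_keyword else base_p)
    return enc["input_ids"], probs

# At collation time, draw one Bernoulli per token from these
# probabilities instead of a uniform 30%, skipping special tokens
# (which have empty (0, 0) offsets and fall through to base_p here).
```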

Dataset

  • Total Data Size: 545,717
  • Validation Data Size: 10% of total size
  • Test Data Size: 10% of total size
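
The split procedure itself is not published on this card; a plausible reconstruction of the 80/10/10 split with the datasets library, where `corpus` is a placeholder for the 545,717 SDE items and the seed is hypothetical:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": corpus})                   # 545,717 items
tmp = ds.train_test_split(test_size=0.2, seed=42)          # 80% train
held_out = tmp["test"].train_test_split(test_size=0.5, seed=42)
train, validation, test = tmp["train"], held_out["train"], held_out["test"]
```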

Evaluation

  • Top-k Test Mask Accuracy:
    • Top-1: 0.7814
    • Top-2: 0.8319
    • Top-3: 0.8548
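
For reference, a minimal sketch of how this metric can be computed, assuming model logits and MLM labels that use -100 for unmasked positions (the standard transformers convention):

```python
import torch

def topk_mask_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 2, 3)):
    """Top-k accuracy over masked positions only.

    logits: (batch, seq, vocab) model outputs
    labels: (batch, seq), -100 everywhere except the masked positions
    """
    masked = labels != -100
    true_ids = labels[masked]                               # (n_masked,)
    top_ids = logits[masked].topk(max(ks), dim=-1).indices  # (n_masked, max_k)
    hits = top_ids.eq(true_ids.unsqueeze(-1))               # (n_masked, max_k)
    return {f"top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```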
