Model Card for Indus (indus-sde-v0.2)

This model was further pre-trained from nasa-impact/nasa-smd-ibm-v0.1 on the full Science Discovery Engine (SDE) website data with a Masked Language Modeling (MLM) objective, after extending the base model's context size.

Model Details

  • Base Model: nasa-impact/nasa-smd-ibm-v0.1
  • Tokenizer: nasa-impact/nasa-smd-ibm-v0.1
  • Parameters: 125M
  • Pretraining Strategy: Masked Language Modeling (MLM)
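
The model can be queried through the standard transformers fill-mask pipeline. A minimal sketch (the example sentence is illustrative, and the mask token is read from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

# Load the model for masked-token prediction (repo id from this card).
fill_mask = pipeline("fill-mask", model="nasa-impact/indus-sde-v0.2")

# Use the tokenizer's own mask token so the snippet works whether the
# model expects <mask> or [MASK].
text = f"The Science Discovery Engine indexes NASA {fill_mask.tokenizer.mask_token} data."
for pred in fill_mask(text):
    print(f"{pred['token_str']!r}  score={pred['score']:.4f}")
```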

Training Data

  • Full Science Discovery Engine (SDE) Website Data

Training Procedure

  • transformers Version: 4.48.3
  • Strategy: Masked Language Modeling (MLM)
  • Stage 1 Training: Extend the context size from 512 tokens to 1024 tokens and train only the position-embedding layer for 1 epoch. (Freezing the rest of the model preserves the representations learned by the original upstream Indus model, which was trained on a large scientific corpus; see the Stage 1 sketch after this list.)
  • Stage 2 Training: Full-model training with a cosine learning-rate schedule with warmup for 5 epochs
  • Masking Strategy:
    • Weighted Dynamic Masking based on Keyword Importance (YAKE), combined with Random Masking
      • Masking important keywords forces the model to generalize over the "science" keywords that carry a high signal for the document (see the masking sketch after this list)
    • Masked Language Modeling Probability: 30%
  • Batch Size: 6
  • Learning rate: 5e-5
  • Warmup ratio: 0.1
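
A minimal sketch of the Stage 1 recipe, assuming a RoBERTa-style encoder (the architecture family of the Indus base model); how the new position slots are initialized is an assumption, since the card does not specify it:

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")

old_emb = model.roberta.embeddings.position_embeddings
old_num, dim = old_emb.weight.shape   # RoBERTa reserves 2 extra position slots
new_num = old_num + 512               # extend 512 -> 1024 usable positions

new_emb = torch.nn.Embedding(new_num, dim, padding_idx=old_emb.padding_idx)
with torch.no_grad():
    new_emb.weight[:old_num] = old_emb.weight       # keep the learned positions
    new_emb.weight[old_num:] = old_emb.weight[-512:]  # seed new slots by copying
                                                      # (one of several init choices)
model.roberta.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = new_num
# Depending on the transformers version, registered buffers such as
# embeddings.position_ids / token_type_ids may also need extending.

# Stage 1: train only the position embeddings; freeze everything else
# so the pretrained representations are retained.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```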
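
A sketch of the weighted dynamic masking idea. The exact weighting used in training is not given on this card; the `boost` factor and the substring matching below are illustrative assumptions. Requires a fast tokenizer (for `return_offsets_mapping`):

```python
import yake

# YAKE keyword extractor: extract_keywords returns (phrase, score)
# pairs, where lower scores mean more important keywords.
extractor = yake.KeywordExtractor(lan="en", n=2, top=20)

def masking_probs(text, tokenizer, base_p=0.30, boost=2.0):
    """Per-token masking probabilities: tokens that fall inside a YAKE
    keyword get boost * base_p (capped at 1.0), all others base_p."""
    keywords = [kw.lower() for kw, _ in extractor.extract_keywords(text)]
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True)
    probs = []
    for start, end in enc["offset_mapping"]:
        piece = text[start:end].lower()
        in_keyword = bool(piece) and any(piece in kw for kw in keywords)
        probs.append(min(1.0, base_p * boost) if in_keyword else base_p)
    return enc["input_ids"], probs

# At collation time, draw one Bernoulli per token from these
# probabilities instead of a uniform 30%, skipping special tokens
# (which have empty (0, 0) offsets and fall through to base_p here).
```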

Dataset

  • Total Data Size: 545,717
  • Validation Data Size: 10% of total size
  • Test Data Size: 10% of total size
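
The split procedure itself is not published on this card; a plausible reconstruction of the 80/10/10 split with the datasets library, where `corpus` is a placeholder for the 545,717 SDE items and the seed is hypothetical:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": corpus})                   # 545,717 items
tmp = ds.train_test_split(test_size=0.2, seed=42)          # 80% train
held_out = tmp["test"].train_test_split(test_size=0.5, seed=42)
train, validation, test = tmp["train"], held_out["train"], held_out["test"]
```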

Evaluation

  • Top-k Test Mask Accuracy:
    • Top-1: 0.7814
    • Top-2: 0.8319
    • Top-3: 0.8548
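
For reference, a minimal sketch of how this metric can be computed, assuming model logits and MLM labels that use -100 for unmasked positions (the standard transformers convention):

```python
import torch

def topk_mask_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 2, 3)):
    """Top-k accuracy over masked positions only.

    logits: (batch, seq, vocab) model outputs
    labels: (batch, seq), -100 everywhere except the masked positions
    """
    masked = labels != -100
    true_ids = labels[masked]                               # (n_masked,)
    top_ids = logits[masked].topk(max(ks), dim=-1).indices  # (n_masked, max_k)
    hits = top_ids.eq(true_ids.unsqueeze(-1))               # (n_masked, max_k)
    return {f"top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```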
