Is this model meant for full bfloat16, AMP bfloat16 or no bfloat16?

#7 · opened by umarbutler

The paper does not make it clear.

We trained ModernBERT with amp_bf16. We'll add that detail to our next arXiv preprint update. I imagine ModernBERT will work fine with fp32, amp_bf16, or bf16, although the latter might need additional finetuning depending on the use case.
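
For reference, here is a minimal sketch (not from the reply above) of loading ModernBERT in the precisions discussed, using the Hugging Face transformers API; the `answerdotai/ModernBERT-base` model id is assumed for illustration.

```python
# Minimal sketch of the two loading options: fp32 weights (pair with autocast
# for amp_bf16 training) or weights cast to full bf16 at load time.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# fp32 (default): keep the weights in float32; use torch.autocast during
# training to reproduce the amp_bf16 setup the model was trained with.
model_fp32 = AutoModelForMaskedLM.from_pretrained(model_id)

# full bf16: weights cast to bfloat16; per the reply above, this may need
# additional finetuning depending on the use case.
model_bf16 = AutoModelForMaskedLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```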

@bwarner @umarbutler Does this mean we should load the model in fp32 and ignore the Flash Attention warning that it needs fp16 or bf16, or do you have a better suggestion? I just noticed that loading the model in bf16 and fine-tuning it on a small dataset leads to worse results than loading it in fp32. Thanks!

@ymoslem After having used ModernBERT quite extensively, I can recommend the following (a rough sketch follows the list):

  • Always train with AMP (mixed precision) bfloat16.
  • Training in full bfloat16 is not a good idea as you are likely to see instability.
  • You will see little to no degradation in performance when doing inference in full bfloat16 versus AMP bfloat16.
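
To make that concrete, here is a rough sketch of an AMP-bf16 finetuning loop followed by full-bf16 inference, using PyTorch and transformers. The sequence-classification head, `num_labels=2`, the learning rate, and the `train_loader` DataLoader are all assumptions for illustration, not details from this thread.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Weights stay in fp32; bf16 is only used for the forward/backward compute.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2  # assumed head and label count
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Finetune with AMP bfloat16 via autocast (no GradScaler is needed for bf16).
model.train()
for batch in train_loader:  # train_loader: your tokenized DataLoader (assumed)
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference in full bfloat16: cast the finetuned weights once, then predict.
model_bf16 = model.to(torch.bfloat16).eval()
with torch.no_grad():
    logits = model_bf16(**batch).logits
```

If you train with the Hugging Face Trainer instead of a manual loop, passing `bf16=True` in `TrainingArguments` enables the same bf16 mixed-precision behaviour.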
