Is this model meant for full bfloat16, AMP bfloat16 or no bfloat16?
#7 opened by umarbutler
The paper does not make it clear.
Bump
We trained ModernBERT with amp_bf16. We'll add that detail to our next arXiv preprint update. I imagine ModernBERT will work fine with fp32, amp_bf16, or bf16, although the latter might need additional fine-tuning depending on the use case.
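For reference, a minimal sketch of what amp_bf16 fine-tuning looks like with the Hugging Face Trainer, where `bf16=True` keeps the master weights in fp32 and runs the forward/backward pass under bfloat16 autocast. The model id, classification head, and toy dataset below are only illustrative assumptions:

```python
# A minimal sketch of amp_bf16 fine-tuning with the Hugging Face Trainer.
# The model id, classification head, and two-example toy dataset are only
# illustrative assumptions; swap in your own task and data.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Weights are loaded in fp32; AMP handles the bf16 casting at runtime.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny toy dataset purely so the example runs end to end.
raw = Dataset.from_dict({"text": ["great model", "terrible model"], "label": [1, 0]})
train_ds = raw.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="modernbert-amp-bf16",
    bf16=True,  # AMP bfloat16: fp32 master weights, bf16 autocast (needs bf16-capable hardware)
    per_device_train_batch_size=2,
    num_train_epochs=1,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```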
@bwarner @umarbutler Does this mean we should load the model in fp32 and ignore the Flash Attention warning that it needs fp16 or bf16, or do you have a better suggestion? I just noticed that loading the model in bf16 and fine-tuning it on a small dataset leads to worse results than loading it in fp32. Thanks!
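Roughly the two loading modes I'm comparing, in case it helps (the model id is an assumption, and the rest of the fine-tuning setup is identical between the two runs):

```python
# Hypothetical reproduction of the two loading modes being compared.
import torch
from transformers import AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"  # assumed model id

# Default fp32 load: Flash Attention warns that it expects fp16/bf16.
model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Full bf16 load: no warning, but fine-tuning on my small dataset gave worse results.
model_bf16 = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, torch_dtype=torch.bfloat16
)
```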
@ymoslem after having used ModernBERT quite extensively, I can recommend:
- Always train with AMP (mixed precision) bfloat16.
- Training in full bfloat16 is not a good idea as you are likely to see instability.
- You will see little to no degradation in performance when doing inference in full bfloat16 versus AMP bfloat16 (see the sketch below).
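As a concrete illustration of that last point, here is a minimal sketch of full bfloat16 inference after AMP bf16 fine-tuning. The checkpoint path and classification head are hypothetical; the idea is simply that the fp32 weights from AMP training get cast to bf16 at load time:

```python
# A minimal sketch of full bf16 inference after AMP bf16 fine-tuning.
# The checkpoint path and classification head are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "modernbert-amp-bf16/checkpoint-500"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16  # cast the fp32 weights to full bf16 for inference
).eval()

inputs = tokenizer("ModernBERT is a drop-in encoder replacement.", return_tensors="pt")
with torch.inference_mode():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)
```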