How is the 2.6B model better than this one in literally every use case I have?
The benchmarks show this MoE variant is better, and it should be, but that's not the case. Hell, even the Q4_K_M version of 2.6B performs better somehow.
Yes, this model is stronger overall (especially in code), but maybe not for your particular use cases. Could you tell us more about them?
There is a theory that MoE models aren't that good at small scale: at small parameter counts they tend to be worse in some ways than their dense counterparts, but they still come out ahead at large scale.
I could be wrong, but it seems like that's what's happening here.
Do you have a reference for this theory? I believe the trade-offs between dense and MoE models are well understood overall. In this specific case, LFM2-2.6B is a very deep model, unlike this MoE. It means that reasoning-heavy tasks might work better with the 2.6B, but that is very use-case-dependent. Overall, LFM2-8B-A1B is a stronger (and faster) model.
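To make the dense-vs-MoE trade-off above concrete: in a top-k MoE, each token is routed through only k of the E experts, so per-token compute scales with the *active* parameters while capacity scales with the *total*. Here is a minimal, illustrative sketch of top-k routing (a generic toy, not LFM2's actual architecture; all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

d, E, k = 16, 8, 2                      # hidden size, number of experts, experts per token
experts = rng.normal(size=(E, d, d))    # one weight matrix per expert
gate = rng.normal(size=(d, E))          # router that scores experts for each token

def moe_forward(x):
    """Route a token vector x through its top-k experts only."""
    scores = x @ gate                        # (E,) router logits
    topk = np.argsort(scores)[-k:]           # indices of the k highest-scoring experts
    weights = np.exp(scores[topk])
    weights /= weights.sum()                 # softmax over the selected experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

x = rng.normal(size=d)
y = moe_forward(x)

total_params = experts.size              # capacity: all experts (8 * 16 * 16 = 2048)
active_params = k * d * d                # compute per token: only k experts (2 * 256 = 512)
print(total_params, active_params)
```

In the toy numbers above, each token touches only a quarter of the expert weights, which is why an "8B-A1B"-style model can run much faster than a dense model of the same total size, while the dense model applies all of its parameters to every token.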