Mixture-of-Experts
8 videos • 53 views • by Tunadorable

1. MaskMoE: Forcing rare tokens to only use one expert
2. What happens when you take MoE scaling laws seriously?
3. Multi-Head Mixture-of-Experts
4. Exponentially Faster Language Modeling
5. MoE-Level Performance Without The Added Computation
6. MoE LLMs with Dense Training for Better Performance
7. If early layers don't need tons of experts, can we save compute?
8. Do we really need to use every single transformer layer?