Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
31,231 Views • Jan 22, 2021
#ai #technology #switchtransformer

Scale is the next frontier for AI. Google Brain uses sparsity and hard routing to massively increase a model's parameters, while keeping the FLOPs per forward pass constant. The Switch Transformer compares favorably to its dense counterparts in terms of speed and sample efficiency and breaks the next magic number: One Trillion Parameters.
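To make the hard-routing idea concrete, here is a minimal NumPy sketch of a switch (top-1) feed-forward layer. This is an illustration written for this description, not code from the paper or its T5/Mesh TensorFlow codebase; all names and sizes are arbitrary assumptions. Each token is sent to exactly one expert (the argmax of a softmax router), so adding experts multiplies the parameter count while the FLOPs per token stay roughly constant.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def switch_ffn(tokens, w_router, experts):
    """Top-1 ("switch") routing: every token visits exactly one expert FFN.

    tokens:   [num_tokens, d_model]
    w_router: [d_model, num_experts]
    experts:  list of (w_in [d_model, d_ff], w_out [d_ff, d_model]) pairs
    """
    probs = softmax(tokens @ w_router)               # [num_tokens, num_experts]
    choice = probs.argmax(axis=-1)                   # hard routing decision
    gate = probs[np.arange(len(tokens)), choice]     # probability of the chosen expert

    out = np.zeros_like(tokens)
    for i, (w_in, w_out) in enumerate(experts):
        mask = choice == i
        if not mask.any():
            continue
        h = np.maximum(tokens[mask] @ w_in, 0.0)         # expert FFN with ReLU
        out[mask] = gate[mask][:, None] * (h @ w_out)    # gate-scaled expert output
    return out

# Toy usage: 4 experts give ~4x the FFN parameters, but each token still
# runs through exactly one FFN, so the compute per token does not grow.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 64, 4
tokens = rng.normal(size=(8, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
print(switch_ffn(tokens, w_router, experts).shape)  # (8, 16)
```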

OUTLINE:
0:00 - Intro & Overview
4:30 - Performance Gains from Scale
8:30 - Switch Transformer Architecture
17:00 - Model-, Data- and Expert-Parallelism
25:30 - Experimental Results
29:00 - Stabilizing Training
32:20 - Distillation into Dense Models
33:30 - Final Comments

Paper: arxiv.org/abs/2101.03961
Codebase T5: github.com/google-research/text-to-text-transfer-t…

Abstract:
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
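The "simplified MoE routing" in the abstract is top-1 routing combined with an auxiliary loss that nudges the router toward spreading tokens evenly across experts. The sketch below is my paraphrase of that load-balancing term rather than the authors' code, and the scaling factor alpha is an assumed value; treat the exact form as an approximation.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_choice, num_experts, alpha=0.01):
    """Auxiliary loss in the spirit of the paper's load-balancing term.

    router_probs:  [num_tokens, num_experts] softmax outputs of the router
    expert_choice: [num_tokens] argmax expert index per token
    Returns alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens
    dispatched to expert i and P_i is the mean router probability of expert i.
    The value is smallest when both distributions are uniform (1/N each).
    """
    f = np.bincount(expert_choice, minlength=num_experts) / len(expert_choice)
    p = router_probs.mean(axis=0)
    return alpha * num_experts * float(np.dot(f, p))

# Perfectly balanced routing gives ~alpha; collapsing onto one expert gives ~alpha * N.
probs = np.full((6, 3), 1 / 3)                                        # uniform router
print(load_balancing_loss(probs, np.array([0, 1, 2, 0, 1, 2]), 3))    # ~0.01
```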

Authors: William Fedus, Barret Zoph, Noam Shazeer

Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: youtube.com/c/yannickilcher
Twitter: twitter.com/ykilcher
Discord: discord.gg/4H8xxDF
BitChute: www.bitchute.com/channel/yannic-kilcher
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannickilcher
Patreon: www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Metadata And Engagement

Views : 31,231
Genre: Science & Technology
Date of upload: Jan 22, 2021


Rating : 4.987 (3/920 LTDR)
RYD date created : 2022-03-03T22:20:54.707296Z

YouTube Comments - 91 Comments


@ChocolateMilkCultLeader

1 year ago

Crazy how Yannic has a video on every topic I want to research

5 |

@menzithesonofhopehlope7201

3 years ago

I was crossing my fingers for this video. Thank you.

9 |

@adrienforbu5165

3 years ago

Impressively well explained! Thank you Yannic!

|

@tianyangchen1339

10 months ago

Thank you for your explanation! You always make complex things easy to understand, so great!

1 |

@adrianpetrescu8583

3 years ago

Man, you have a style of telling the story that I find very easy to understand. It is easy for me to learn from you :) Thanks!

|

@NoFunkGiven

3 years ago

Thank you for the high quality of your videos :)

|

@alivecoding4995

10 months ago

Thanks so much, Yannic!

1 |

@sourabmangrulkar9105

3 years ago

Thanks for the great explanation 😄

1 |

@florianhonicke5448

3 years ago

Thanks for another awesome video!!!

|

@simonstrandgaard5503

3 years ago

Great explanation.

1 |

@jahcane3711

3 years ago

Thanks Yannic!

2 |

@hoaxuan7074

3 years ago

Max pooling, locality-sensitive-hashing parameter switching, and ReLU (f(x) = x connect, f(x) = 0 disconnect) are all switching. Convolution, weighted sums, and fast transforms (FFT, Hadamard) are all dot products. A locality-sensitive hash is a random projection followed by binarization, and a random projection is a fixed pattern of randomly chosen sign flips followed by a Hadamard transform; repeat for better quality. Three elements: dot product, switching, and a predicate for the switch state (e.g. x < 0).

1 |
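A small NumPy sketch of the hashing recipe outlined in the comment above (random sign flips, a Hadamard transform, repeat, then binarize by sign). The dimension and the number of repeats are arbitrary choices for illustration, not values from the comment.

```python
import numpy as np

def hadamard_transform(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def lsh_bits(v, sign_patterns):
    """Locality-sensitive hash: repeated (sign flips -> Hadamard), then binarize."""
    for signs in sign_patterns:          # repeating improves the projection quality
        v = hadamard_transform(signs * v)
    return (v > 0).astype(np.uint8)      # one hash bit per output dimension

rng = np.random.default_rng(0)
d = 64                                   # must be a power of two
sign_patterns = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
a = rng.normal(size=d)
b = a + 0.05 * rng.normal(size=d)        # a nearby vector
# Nearby inputs should agree on most hash bits (locality sensitivity).
print(np.mean(lsh_bits(a, sign_patterns) == lsh_bits(b, sign_patterns)))
```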

@veedrac

3 years ago

While I can see how one can rationalize the results otherwise, it seems to me that the scaling differences between dense and Switch (or other MoE) models on downstream tasks, relative to their scaling on perplexity, are further evidence against the idea that these are just memorize-interpolators. One would, I think, expect such memorization and interpolation to be more robust on average to MoE-style partitioning than if the models were also learning more general reasoning. Yet while Switch-Base outperforms T5-Large on perplexity, it underperforms on every downstream task except CB Trivia QA. In other words, this looks like what you get if parameter scaling delivers its benefits predominantly through better memorization, and that seems of a distinctly different character.

5 |

@spicychiley

3 years ago

Re: "model parallelism has high communication costs." Yes and no. Standard data-parallelism (aka layer sequential execution) incurs the overhead of synchronizing all accelerators, reducing all gradients, doing the weight update and distributing the updated parameters again. Model parallel (aka layer parallel aka layer pipelined) execution incurs the overhead of moving the hidden activations, but the weights are not moved. If moving weights is more expensive than moving activations then you probably want to run using model parallel execution. There are many cases where pipelining a model incurs the penalty of moving weights, but avoids a lot of overheads present in layer sequential execution. From Pipelined Backpropagation at Scale: Training Large Models without Batches (Kosson et al 2020, https://arxiv.org/abs/2003.11666) "Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively." In a recent white paper Sambanova shows how they plan to pipeline model. See Figure 4 here: https://sambanova.ai/wp-content/uploads/2020/12/RDA-Whitepaper.pdf Cerebras has also talked about the benefits of pipelining models: https://www.cerebras.net/data-model-pipeline-parallel-training-neural-networks

5 |
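As a rough illustration of the trade-off described above, here is a toy calculation comparing the bytes moved per layer in the two regimes. All sizes (d_model, d_ff, tokens per step, bfloat16) are made-up assumptions for illustration, not numbers from the paper, the video, or the linked references.

```python
# Illustrative only: compare gradient synchronization (data parallel) against
# shipping hidden activations to the next stage (model / pipeline parallel).
d_model, d_ff = 1024, 4096
tokens_per_step = 2048            # batch_size * sequence_length (assumed)
bytes_per_value = 2               # bfloat16

ffn_params = 2 * d_model * d_ff                                  # two FFN weight matrices
grad_sync_bytes = ffn_params * bytes_per_value                   # all-reduce of the gradients
activation_bytes = tokens_per_step * d_model * bytes_per_value   # hand-off between stages

print(f"gradient sync per FFN layer:   {grad_sync_bytes / 1e6:.1f} MB")   # ~16.8 MB
print(f"activation hand-off per layer: {activation_bytes / 1e6:.1f} MB")  # ~4.2 MB
# With these (made-up) numbers the weights outweigh the activations by ~4x,
# which is the regime where moving activations instead of weights pays off.
```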

@EnricoRos

3 years ago

I think the main takeaway for Switch-C is that it outperforms T5-XXL using 1/10th of the FLOPs (although blowing past 1T params), while the smaller Switch model gets the best performance while matching T5's compute. They haven't tried a model with both equal compute and 1T params.

2 |

@conchylicultor

3 years ago

Thank you for the summary, this was very informative. I was just wondering: how did they manage to train the router weights if they only send examples to a single expert?

7 |
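For what it's worth, my reading of the paper (an assumption, not something stated in the video) is that the router stays trainable because the chosen expert's output is multiplied by its router probability, so the gradient reaches the router weights through that scalar even though only one expert runs. A tiny PyTorch sketch with made-up sizes:

```python
import torch

torch.manual_seed(0)
d_model, n_experts = 8, 4
x = torch.randn(1, d_model)
w_router = torch.randn(d_model, n_experts, requires_grad=True)
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

probs = torch.softmax(x @ w_router, dim=-1)   # soft router probabilities
i = int(probs.argmax())                       # hard top-1 routing decision
y = probs[0, i] * experts[i](x)               # scale the expert output by its gate

y.sum().backward()
print(w_router.grad.abs().sum() > 0)          # tensor(True): the router gets a gradient
```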

@Rhannmah

3 years ago

Uh oh, this is getting out of hand! Transformers are crazy and I can't imagine what they can do with that many params... This is also amazing because it potentially gives common folk like me some hope to actually be able to run a reasonably sized transformer on local hardware.

1 |

@ammarkov

3 years ago

>tfw in my MocapNET work I use a classifier that decides on using an ensemble trained on a subset of the problem (basically a poor man's routing), and it was one of the reviewer complaints. This is a fundamentally good idea: divide and conquer!

1 |

@yaxiongzhao6640

3 years ago

So HN has a comment: https://news.ycombinator.com/item?id=26174038 (sorry if @thesz sees this, I did not ask for permission).

The context is that one commenter suggested the Switch Transformer is parameter-inefficient, i.e., it uses too many parameters to achieve performance that some other architecture could reach with far fewer. When someone asked what the basis for that conclusion was, this comment (actually from a different user than the one who made the original inefficiency claim) provided the reasoning. The gist is that TensorFlow does not provide the APIs for experimenting with different algorithms, quote: "researchers at Google cannot do IRLS (search provides IRLS only for logistic regression in Tensorflow), they cannot do Hessian-free optimization ([4], closed due lack of activity - notice the "we can't support RNN due to the WHILE loop" bonanza), etc. All due to the fact they have to use Tensorflow - it just does not support these things."

Any comments? I actually cannot comment on TensorFlow's capabilities at all...

2 |

@ratanrohith1013

4 months ago

Here after Mixtral 8x7B release!

|
