Views: 31,231
Genre: Science & Technology
Date of upload: Jan 22, 2021
Rating: 4.987 (3/920 LTDR)
RYD date created: 2022-03-03T22:20:54.707296Z
Top Comments of this video!! :3
Max pooling, locality-sensitive-hashing-based parameter switching, and ReLU (f(x)=x connect, f(x)=0 disconnect) are all forms of switching. Convolution, weighted sums, and fast transforms (FFT, Hadamard) are all dot products.
Locality sensitive hash = random projection followed by binarization.
Random projection = a fixed pattern of randomly chosen sign flips followed by a Hadamard transform. Repeat for better quality.
Three elements: dot product, switching, and a predicate for the switch state (e.g. x<0).
1 |
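The recipe in the comment above (random sign flips, a fast Hadamard transform, then binarization) can be sketched in a few lines. The snippet below is a minimal NumPy illustration under those assumptions; the names fwht and lsh_bits, the dimension, and the thresholding at zero are choices made for the example, not taken from any library or from the Switch Transformer paper.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def lsh_bits(v, signs):
    """Locality-sensitive hash bits: sign-flip the input, Hadamard-transform, binarize."""
    return fwht(signs * v) > 0  # one bit per output dimension

rng = np.random.default_rng(0)
d = 8                                    # power of two
signs = rng.choice([-1.0, 1.0], size=d)  # fixed random sign-flip pattern
v = rng.normal(size=d)
print(lsh_bits(v, signs).astype(int))
# Repeating with independent sign patterns gives more bits / better quality,
# as the comment notes.
```

Nearby inputs mostly agree on these bits, which is what makes such a hash usable as a cheap routing predicate (the "switch state" in the comment's terminology).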
While I can see how one could rationalize the results otherwise, it seems to me that the scaling differences between dense and Switch (or other MoE) models on downstream tasks, relative to their scaling on perplexity, are further evidence against the idea that these models are just memorizer-interpolators. One would, I think, expect that memorization and interpolation would survive MoE-style partitioning better, on average, than more general reasoning would. Yet while Switch-Base outperforms T5-Large on perplexity, it underperforms on every downstream task except closed-book (CB) Trivia QA. In other words, this looks like what you would get if the extra parameters were helping predominantly through better memorization, and that seems of a distinctly different character from what dense scaling provides.
5 |
Re: "model parallelism has high communication costs."
Yes and no. Standard data parallelism (aka layer-sequential execution) incurs the overhead of synchronizing all accelerators, reducing all gradients, doing the weight update, and distributing the updated parameters again. Model-parallel (aka layer-parallel, aka layer-pipelined) execution incurs the overhead of moving the hidden activations, but the weights are not moved. If moving weights is more expensive than moving activations, then you probably want to run with model-parallel execution. There are many cases where pipelining a model incurs the penalty of moving activations but avoids a lot of the overheads present in layer-sequential execution.
From Pipelined Backpropagation at Scale: Training Large Models without Batches (Kosson et al 2020, https://arxiv.org/abs/2003.11666)
"Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively."
In a recent white paper, Sambanova shows how they plan to pipeline models. See Figure 4 here: https://sambanova.ai/wp-content/uploads/2020/12/RDA-Whitepaper.pdf
Cerebras has also talked about the benefits of pipelining models: https://www.cerebras.net/data-model-pipeline-parallel-training-neural-networks
5 |
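To make the weights-vs-activations trade-off above concrete, here is a rough back-of-the-envelope sketch in Python. The parameter count, batch size, sequence length, hidden size, and fp16 assumption are all made up for illustration; they are not figures from the paper, the linked white papers, or any measured system.

```python
# Hypothetical sizes, chosen only to illustrate the comparison.
BYTES_PER_VALUE = 2            # assume fp16 everywhere

# Data parallel (layer-sequential): every step all-reduces a gradient
# the same size as the weights.
num_params = 1_000_000_000     # assumed 1B-parameter model
grad_bytes = num_params * BYTES_PER_VALUE

# Pipelined / layer-parallel: weights stay put; each stage boundary only
# passes the hidden activations for the batch.
batch, seq_len, hidden = 32, 512, 4096   # assumed shapes
act_bytes = batch * seq_len * hidden * BYTES_PER_VALUE

print(f"gradient all-reduce per step      : {grad_bytes / 1e9:.2f} GB")
print(f"activations per pipeline boundary : {act_bytes / 1e9:.3f} GB")
```

Under these made-up numbers, the activation traffic per stage boundary is far smaller than the gradient traffic, which is the regime where pipelined execution wins; the balance shifts as batch size and sequence length grow or as the model shrinks.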
So HN has a comment on this: https://news.ycombinator.com/item?id=26174038 (apologies to @thesz if they see this; I did not ask for permission).
The context is that one comment suggested that the Switch Transformer is parameter-inefficient, i.e., it uses too many parameters to reach performance that some other architecture could reach with far fewer parameters.
Someone then asked what the basis for this conclusion was. The linked comment provides the reasoning (it is actually from a different user than the one who made the original inefficiency claim).
The gist is that TensorFlow does not provide the APIs needed to experiment with different algorithms, quote:
"researchers at Google cannot do IRLS (search provides IRLS only for logistic regression in Tensorflow), they cannot do Hessian-free optimization ([4], closed due lack of activity - notice the "we can't support RNN due to the WHILE loop" bonanza), etc. All due to the fact they have to use Tensorflow - it just does not support these things."
Any comments? I actually cannot comment on TensorFlow's capability at all...
2 |
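For context on what the quoted comment means by IRLS: it is Newton's method for a generalized linear model, written as a sequence of weighted least-squares solves. The sketch below is a minimal NumPy version for logistic regression, offered only to illustrate the algorithm being discussed; it is not TensorFlow code, and the function name and toy data are invented for the example.

```python
import numpy as np

def irls_logistic(X, y, iters=25, tol=1e-8):
    """Fit logistic regression by IRLS (Newton's method as reweighted least squares)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        w = np.maximum(p * (1.0 - p), 1e-12)       # IRLS weights, guarded against zero
        z = X @ beta + (y - p) / w                 # working response
        # Solve the weighted least-squares system (X^T W X) beta = X^T W z.
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(irls_logistic(X, y))   # should land near true_beta
```

Hessian-free optimization, the other method mentioned in the quote, similarly relies on second-order machinery (Hessian-vector products inside a conjugate-gradient loop) that the commenter claims was awkward to express in the TensorFlow of the time.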
@ChocolateMilkCultLeader
1 year ago
Crazy how Yannic has a video on every topic I want to research
5 |