Video id : 9uw3F6rndnA
Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman
360,075 Views • Nov 1, 2022
Lex Fridman Podcast full episode:    • Andrej Karpathy: Tesla AI, Self-Drivi...  
Please support this podcast by checking out our sponsors:
- Eight Sleep: www.eightsleep.com/lex to get special savings
- BetterHelp: betterhelp.com/lex to get 10% off
- Fundrise: fundrise.com/lex
- Athletic Greens: athleticgreens.com/lex to get 1 month of fish oil

GUEST BIO:
Andrej Karpathy is a legendary AI researcher, engineer, and educator. He's the former director of AI at Tesla, a founding member of OpenAI, and an educator at Stanford.

PODCAST INFO:
Podcast website: lexfridman.com/podcast
Apple Podcasts: apple.co/2lwqZIr
Spotify: spoti.fi/2nEwCF8
RSS: lexfridman.com/feed/podcast/
Full episodes playlist:    • Lex Fridman Podcast  
Clips playlist:    • Lex Fridman Podcast Clips  

SOCIAL:
- Twitter: twitter.com/lexfridman
- LinkedIn: www.linkedin.com/in/lexfridman
- Facebook: www.facebook.com/lexfridman
- Instagram: www.instagram.com/lexfridman
- Medium: medium.com/@lexfridman
- Reddit: redlib.matthew.science/r/lexfridman
- Support on Patreon: www.patreon.com/lexfridman
Metadata And Engagement

Views : 360,075
Genre: Science & Technology
Date of upload: Nov 1, 2022


Rating : 4.968 (64/7,898 LTDR)
RYD date created : 2024-05-19T12:02:28.48065Z

YouTube Comments - 232 Comments

Top Comments of this video!! :3

@LexClips

1 year ago

Full podcast episode: https://www.youtube.com/watch?v=cdiD-9MMpb0
Lex Fridman podcast channel: youtube.com/lexfridman
Guest bio: Andrej Karpathy is a legendary AI researcher, engineer, and educator. He's the former director of AI at Tesla, a founding member of OpenAI, and an educator at Stanford.

36 |

@mauricemeijers7956

1 year ago

Andrej speaks at 1.5x speed and Lex, as always, at 3/4x. Yet, somehow they understand each other.

1K |

@totheknee

11 months ago

Damn. That last sentence. Transformers are so resilient that they haven't been touched in the past FIVE YEARS of AI! I don't think that idea can ever be overstated given how fast this thing is accelerating...

64 |

@baqirhusain5652

1 year ago

My professor Dr. Sageeve Oore gave a very good intuition about residual connections. He told me that residual connections allow a network to learn the simplest possible function first: no matter how many complex layers there are, we start by learning a linear function, and the complex layers add in nonlinearity as needed to learn the true function. A fascinating advantage of this connection is that it provides great generalisation. (Don't know why, I just felt the need to share this.)

212 |
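
A minimal PyTorch sketch of the residual-connection intuition described in the comment above (the module name and dimensions here are illustrative, not something discussed in the episode): the skip connection lets a block start out close to the identity map, with the nonlinear branch adding complexity only as needed.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): if F starts near zero, the block starts near the identity map."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch(x)  # the skip path carries the "simple" (identity/linear) part

x = torch.randn(8, 32)                    # batch of 8 vectors of width 32
block = ResidualBlock(dim=32, hidden=64)
print(block(x).shape)                     # torch.Size([8, 32])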

@oleglevchenko5772

2 months ago

Why doesn't Lex invite the actual inventors of the Transformer, e.g. Ashish Vaswani? People like Sam Altman and Andrej Karpathy are reaping the harvest of the invention from the paper "Attention Is All You Need", yet its authors haven't been invited to Lex's talks even once.

21 |

@SMH1776

1 year ago

It's amazing to have a podcast where the host can hold their own with Kanye West in a manic state and also have serious conversations about state-of-the-art deep learning architectures. Lex is one of one.

259 |

@aangeli702

10 months ago

Andrej's influence on the development of the field is so underrated. He's not only actively contributing academically (i.e. through research and co-founding OpenAI), but he also communicates ideas to the public so well (for free, by the way) that he not only helps others contribute academically to the field, but also encourages many people to get into it, simply because he takes an overwhelmingly complex topic (at least it used to be for me) such as the Transformer and strips it down to something that can be more easily digested. Or maybe that's just me; my professor in my undergrad came nowhere near an explanation of Transformers as good and intuitive as Andrej's videos (don't get me wrong, [most] professors know their stuff very well, but Andrej is just on a whole other level).

25 |

@diedforurwins

1 year ago

6:30 😂 Imagine how fast this sounds to Lex

41 |

@tlz8884

1 year ago

I double-checked whether I was listening at 1.25x speed when Andrej was speaking.

39 |

@wasp082

7 months ago

The name "attention" was already around on other architectures in the past. It was common to see bidirectional recurrent neural networks with "attention" on the encoder side. That's where the title "Attention Is All You Need" comes from: the Transformer basically removes the need for a recurrent or sequential architecture.

3 |
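
For reference, a minimal NumPy sketch of the scaled dot-product attention behind "Attention Is All You Need" (illustrative shapes, not from the episode): every position attends directly to every other position, which is what removes the need for recurrence.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities, (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # each position is a weighted mix of all values

# toy sequence of 4 tokens with 8-dimensional query/key/value projections
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)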

@MsStone-ue6ek

6 months ago

Great interview. Engaging and dynamic. Thank you.

2 |

@bmatichuk

1 year ago

Karpathy has some great insights. Transformers seem to solve the NN architecture problem without hyperparameter tuning. The "next" for transformers is going to be neurosymbolic computing, i.e. integrating logic with neural processing. Right now transformers have trouble with deep reasoning. It's remarkable that reasoning automatically arises in transformers based on pretext structure. I believe there is a deeper concept of AI waiting to be discovered. If the mechanism for auto-generated logic pathways in transformers could be discovered, then it could be scaled up to produce general AI.

83 |

@omarnomad

1 year ago

2:18 Meme your way to greatness

16 |

@amarnamarpan

9 months ago

Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced Transformers, which are the backbone of all other recent models.

15 |

@MrMcSnuffyFluffy

1 year ago

Optimus Prime would be proud.

45 |

@alexforget

1 year ago

Amazing how one paper can change the course of humanity. I like that kind of return on investment; let's get more weird and ambitious.

27 |

@ReflectionOcean

5 months ago

- Understanding the Transformer architecture (0:28)
- Recognizing the convergence of different neural network architectures towards Transformers for multiple sensory modalities (0:38)
- Appreciating the Transformer's efficiency on modern hardware (0:57)
- Reflecting on the paper's title and its meme-like quality (1:58)
- Considering the expressive, optimizable, and efficient nature of Transformers (2:42)
- Discussing the learning process of short algorithms in Transformers and the stability of the architecture (4:56)
- Contemplating future discoveries and improvements in Transformers (7:38)

8 |

@Halopend

11 months ago

Self-attention. Transforming. It's all about giving the AI more parameters with which to optimize the internal representations of the interconnections within the data itself. We've supplied first-order interconnections; what about second order? Third? Or is that expected to be covered by the sliding-window technique itself? It would seem that the more early representations we can add, the more of "the data's" complexity/nuance we can couple to. At the other end, the more we couple to the output, the closer to alignment we can get. But input/output are fuzzy concepts in a sliding-window technique: there is no temporal component to the information; it is represented by large "thinking spaces" of word connections, somewhere between a CNN-like technique that parses certain subsections of the whole thing at once and a fully connected space over all the inputs.

That said, sliding is convenient, since it removes the hard limit on what can be generated and gives an easy-to-understand parameter we can increase at fairly small cost to improve our ability to generate long-form output with deeper nuance/accuracy. Being able to just change the window size and have the network adjust seems a fairly nice way to flexibly scale models, though there is a "cost" to moving around, i.e. network stability: you can only scale up or down so much at a time if you want to keep most of the knowledge from previous training.

Anyway, the key ingredient is that we purposefully encode the spatial information (into the words themselves) to the depth we desire. Or at least that's a possible extension. The next question, of course, is in which areas of representation we can supply more data that encodes, within the mathematics, the information we think is important. Whatever isn't covered by the processes of the system itself (having the same thing represented in multiple ways, i.e. the data plus the system) is a path to overly complicated systems in terms of growth/addendums. The easiest path is to just represent it in the data itself, and patch it. But you can also do stages of processing/filtering along multiple fronts and incorporate them into a larger model more easily, as long as the encodings are compatible (which I imagine will most greatly affect the growth and swappability of these systems once standardized). Ideally this is information that is already self-represented within the data.

FFTs are a great approximation we can use to bridge continuous vs. discrete knowledge. Calculating them on word encodings feels like a poor fit, but we could break the "data signal" into a chosen subset of wavelengths. Note that this doesn't help with the next-word-prediction "component" of the data representation, but it is a past-knowledge-based encoding that could be used in unison with the spatial/self-attention and parser encodings to represent the information. (I'm actually not sure of the balance between spatial and self-attention, except that the importance of each token to the generation of each word, along with possibly higher-order interconnections between the tokens, is contained within the input stream.) If it is higher order, then FFTs may already be represented and I've talked myself in a circle. I wonder what results dropout tied to categorization would yield for the swappability of components between systems, or for the ability to turn various bits and bobs on/off in a way tied to the data.

I think that's also how one can understand the reverse flow of partial derivatives from the loss function: by turning off all but one path at a time to split the parts considered, though that depends on the loss function being used. I imagine categorizing subsections of data and splitting them off into distinct areas would allow finer control over the representations of subsystems, increasing scores on specific tests without affecting other testing areas as much. That could be antithetical to AGI-style understanding, but in a sense it would allow for field-specific interpretation of information. Heck, what if we encoded each word as its dictionary definition?

3 |
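
One concrete instance of "purposefully encoding the spatial information into the data itself", which the comment above circles around, is the sinusoidal positional encoding from the original Transformer paper. A minimal NumPy sketch (illustrative sizes; this is the standard technique, not the commenter's proposal):

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Returns (seq_len, d_model): each position gets a unique pattern of sines and cosines."""
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    dims = np.arange(d_model // 2)[None, :]                     # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * dims / d_model)  # geometric range of wavelengths
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# add positional information to token embeddings before self-attention
embeddings = np.random.randn(16, 64)                            # 16 tokens, d_model = 64
inputs = embeddings + sinusoidal_positional_encoding(16, 64)
print(inputs.shape)  # (16, 64)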

@danparish1344

2 months ago

“Attention is all you need” is great. It’s like a book title that you can’t forget.

1 |

@rajatavaghosh1913

2 months ago

I read the paper and was wondering whether the Transformer was just another kind of LLM for generative tasks, since they describe it as a model and compare it with other models at the end of the paper. But after watching this explanation by Andrej, I finally understood that it is a kind of architecture that learns the relationships between the elements of a sequence.

|
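
As the comment above notes, the Transformer is an architecture rather than a particular model. A minimal PyTorch sketch (illustrative dimensions) of using a Transformer encoder purely as a generic sequence feature extractor, with no language-modelling objective attached:

import torch
import torch.nn as nn

# a stack of self-attention + feed-forward layers, not tied to any language-model objective
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(2, 10, 64)   # (batch, sequence length, embedding dim)
contextualized = encoder(tokens)  # same shape; each position now mixes information from all others
print(contextualized.shape)       # torch.Size([2, 10, 64])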
