OpenAI CLIP: Connecting Text and Images (Paper Explained)
117,337 Views • Jan 12, 2021
#ai #openai #technology

Paper Title: Learning Transferable Visual Models From Natural Language Supervision
CLIP trains on 400 million image-text pairs scraped from the web to learn a model that connects the two modalities. The core idea is a contrastive objective combined with a large batch size. The resulting model can be turned into arbitrary zero-shot classifiers for new image and text tasks.
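
For a concrete view of the objective described above, here is a minimal PyTorch sketch of a symmetric contrastive loss in the spirit of CLIP. This is not the authors' code: the batch size, embedding dimension, and the fixed temperature are illustrative assumptions (the paper learns the temperature as a parameter).

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Project both sets of embeddings onto the unit sphere.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise cosine similarities between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature
    # The i-th image and the i-th text form the only positive pair in row/column i.
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy: images-to-texts and texts-to-images.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Random features stand in for the image and text encoder outputs (batch 8, dim 512).
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))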

OUTLINE:
0:00 - Introduction
3:15 - Overview
4:40 - Connecting Images & Text
9:00 - Building Zero-Shot Classifiers
14:40 - CLIP Contrastive Training Objective
22:25 - Encoder Choices
25:00 - Zero-Shot CLIP vs Linear ResNet-50
31:50 - Zero-Shot vs Few-Shot
35:35 - Scaling Properties
36:35 - Comparison on different tasks
37:40 - Robustness to Data Shift
44:20 - Broader Impact Section
47:00 - Conclusion & Comments

Paper: cdn.openai.com/papers/Learning_Transferable_Visual…
Blog: openai.com/blog/clip/
Code: github.com/openai/CLIP

Abstract:
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
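
To make the abstract's zero-shot claim concrete, here is a hedged sketch of how a zero-shot classifier can be built with the released package from the Code link above (github.com/openai/CLIP). The "ViT-B/32" model name and the overall API follow that repository's README; the label set, prompt template, and image path are placeholders.

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]  # hypothetical label set for the task at hand
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each prompt acts as the classifier logits.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))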

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: youtube.com/c/yannickilcher
Twitter: twitter.com/ykilcher
Discord: discord.gg/4H8xxDF
BitChute: www.bitchute.com/channel/yannic-kilcher
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannickilcher
Patreon: www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Metadata

Genre: Science & Technology

YouTube Comments (93)

Top comments of this video

@hmate1119

3 years ago

This channel is insanely good. Deserves even more recognition. Great work! Subscribed <3

102 likes

@MachineLearningStreetTalk

3 years ago

This is a really important paper, I suggest people pay particular attention to Yannic's "robustness to data shift" section if you are short on time. I hope we can get the authors on to discuss this!

42 likes

@jonatan01i

3 years ago

Thank you so much for this, especially for not keeping the promise of cutting the video short!

8 likes

@user-yx5nh4tm9n

1 year ago

Man, you have a talent for explaining hard things! And your English is awesome!!

2 likes

@ghostlv4030

3 years ago

The idea is so simple that it is hard to believe it is this effective! Okay, I see, NLP is so useful in vision now.

18 likes

@MeatFingerSteam

3 years ago

Absolutely loved the Alec meme, thanks!

7 likes

@naifalkhunaizi7847

7 months ago

Truly great explanation!


@bukovelby

1 year ago

Just a Brilliant overview!


@ashrafg4668

2 years ago

Thank you for the explanation!


@jenishah9825

3 years ago

I can't thank you enough for making such useful videos.


@growthmpsfunnels3358

1 year ago

Dude, you are doing a great job. Perfect for the work.


@aminasadi1040

1 year ago

Thanks a lot for this awesome video! The explanations are very digestible even for a beginner.


@oflasch

2 years ago

Great explanation! 👍

1 like

@xingjian417

3 months ago

thanks for sharing


@florianhonicke5448

3 years ago

New video from Yannic!!! Saved my day :D


@ShivamSingh-xf8nb

1 year ago

Amazing explanation!


@user-jx5pm9nx8p

10 months ago

Excellent! Thanks a lot!


@srinathtangudu4899

1 year ago

Your videos are so good. Thanks :)


@maryamaghili1148

3 years ago

Thank you for your great work! So is there any way we could find the actual labels (text) they used for training? I need to use this model for some classification tasks that I have, but I am wondering how to organize the labels. I only have images with no annotations.


@shengyaozhuang3748

3 years ago

Interestingly, similar training methods have been explored in the field of information retrieval for finding documents relevant to a given query. So a good application of CLIP could probably be searching for a photo on the internet using a text query.

6 likes
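
Following up on the retrieval idea in the comment above, a minimal sketch of text-to-image search with CLIP embeddings could look like the snippet below. It again assumes the github.com/openai/CLIP package; the gallery paths and the query string are hypothetical.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]  # placeholder gallery
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = clip.tokenize(["a dog playing on the beach"]).to(device)  # hypothetical query

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)
    image_emb /= image_emb.norm(dim=-1, keepdim=True)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(0)  # cosine similarity per gallery image

best = scores.argmax().item()
print("Best match:", image_paths[best])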
