Stanford CS25: V4 I Jason Wei & Hyung Won Chung of OpenAI

April 11, 2024
Speakers: Jason Wei & Hyung Won Chung, OpenAI

Intuitions on Language Models (Jason)

Shaping the Future of AI from the History of Transformer (Hyung Won)

About the speakers:
Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena.

Hyung Won Chung is a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of large language models: pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, multilinguality, parallelism strategies, etc. His notable work includes the scaled Flan papers (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.

Comments

The fact that giving the model more freedom, with fewer inductive biases shaped by human subjectivity, actually improves performance is really illuminating. Thanks.

yoesemiat

Excellent. First talk is practical. Second is profound. Thank you.

michaelbernaski

I don't understand anything, but I like how these people teach. May all get to understand the concepts; that's my only prayer.

zyxbody

Outstanding. I teach an AI class and there are loads of great pedagogical nuggets here that I am going to borrow.

TrishanPanch

One of my favourite talks in recent times. I learnt so much from this.

sanesanyo

He has his slides in his head! Loved the content.

ariG

What an amazing lecture. It was simple, yet groundbreaking.

sady

Hilariously, Jensen Huang of NVIDIA spoke in a fireside chat recently about how they're already dependent on AI models for designing chips, so that last comment is already happening. Great talk.

itsaugbog

This lecture is super useful. Really appreciate it.

atdtx

Thanks for all the extra popping into the mic during the intro brrrruh!

laalbujhakkar

The students were asking some great questions; no wonder I don't go to Stanford.

zacharykosove

Thanks for the talk! Really interesting stuff.

I had one question. At 1:04:00 Hyung suggests that uni-directional attention is preferable to bidirectional attention in turn-taking scenarios because it allows previously computed keys and values to be reused from the KV cache.

I'm trying to understand how this fits into his broader thesis that we should be moving toward more generic approaches. On the surface, the use of the KV cache doesn't feel particularly generic. Is the idea that masked self-attention is necessary for next-token generation anyway, so applying a causal attention mask universally is the more uniform choice?

gmccreight
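
On the KV-cache question above: the property that makes the cache valid is that, under a causal mask, a prefix's attention outputs do not change when later tokens are appended. Below is a minimal sketch of that property using a toy single-head attention in numpy; the weights, sizes, and names are illustrative stand-ins, not anything from the talk.

```python
import numpy as np

# Toy single-head attention; all weights and sizes here are illustrative.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(x, causal):
    """x: (seq_len, d) token vectors -> (seq_len, d) attention outputs."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = (q @ k.T) / np.sqrt(d)
    if causal:
        # Each position may attend only to itself and earlier positions.
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

turn1 = rng.standard_normal((5, d))   # earlier conversation turn
turn2 = rng.standard_normal((3, d))   # new turn appended later
full = np.vstack([turn1, turn2])

# Causal: prefix outputs are identical with or without the new turn,
# so the prefix's keys/values can be cached and reused (the KV cache).
print(np.allclose(attend(turn1, True), attend(full, True)[:5]))   # True

# Bidirectional: earlier tokens now also attend to the new turn, so
# every prefix representation changes and must be recomputed.
print(np.allclose(attend(turn1, False), attend(full, False)[:5])) # False
```

And on the "generic" point: since masked self-attention is required for next-token training anyway, applying the causal mask universally removes a special case (a separate bidirectional mode) rather than adding one.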

Leadership styles offer an analogy for how less structure can yield higher performance. An authoritarian leader increases a team's productivity but decreases its creativity, whereas a team under democratic leadership solves problems with more creativity, leading to innovative ideas.

doinitlive

Maybe emergent behavior happens because, for a task to be learned, there is a set of prerequisite tasks that must be learned first. Just brainstorming here.

aliwaheed

I don't quite understand how the overall loss is divided into many sub-losses. Is it true that LLM training only uses cross_entropy, as Karpathy said? Sorry, I'm new to this field.

boybro
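
On the cross-entropy question above: Karpathy is right that pre-training uses a single objective, next-token cross-entropy. The "many sub-losses" framing means that this one number is an average of a separate cross-entropy term per predicted token, and each such term can be viewed as its own tiny task. A minimal PyTorch sketch, with random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

# Illustrative: random logits stand in for a real model's predictions.
torch.manual_seed(0)
vocab_size, seq_len = 50, 6
logits = torch.randn(seq_len, vocab_size)        # one prediction per position
targets = torch.randint(vocab_size, (seq_len,))  # the actual next tokens

# The single pre-training objective: cross-entropy over all positions...
overall = F.cross_entropy(logits, targets)

# ...is just the mean of one cross-entropy term per predicted token.
per_token = F.cross_entropy(logits, targets, reduction="none")
print(torch.allclose(overall, per_token.mean()))  # True
```

Each per-token term is a "sub-loss" in that sense: predicting one next word might exercise grammar, another world knowledge, so the single objective is implicitly massive multi-task learning.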

How do we know what counts as small vs. large? With emergent tasks, the talk highlights that more data can lead to more accuracy given enough compute: the small LM would not have seen accuracy improvements, but the large LM did. For the tasks currently shown as flat, couldn't it be that we simply don't have enough compute yet to know whether they would improve?

heyitsjoshd

Yeah, this is a pretty great talk. It is quite hard to figure out what technical level will reach the widest audience, and this hits it nicely. Not as nice as those flaxen locks, though.

dkierans

I could listen to these gentlemen talk about this stuff all day. Thanks and kudos for making such a fascinating topic relatable.

flavioferlin

Really grateful for this being uploaded! Thank you to both speakers and to Stanford for the generosity.

Highlight of the video for me is Hyung's sheepish refusal to get into predictions about the staying power/relevance of MoE or any specific architecture.

It felt like a wasted question, since the premise of his talk is basically "tl;dr: Sutton's Bitter Lesson".

ricopags