OpenAI Embeddings (and Controversy?!)

#mlnews #openai #embeddings

COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out Arvind :) ):
3. Finally, I'm now working on time travel so that I can cite papers from the future :)
END COMMENTS FROM THE AUTHOR

OpenAI launches an embeddings endpoint in their API, providing high-dimensional vector embeddings for use in text similarity, text search, and code search. While embeddings are universally recognized as a standard tool for processing natural language, people have raised doubts about the quality of OpenAI's embeddings: one blog post found they are often outperformed by open-source models that are much smaller and would cost only a fraction of what OpenAI charges. In this video, we examine the claims made and determine what it all means.
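
A rough sketch (not from the video) of how one might call such an embeddings endpoint and compare two texts, assuming the pre-1.0 openai Python client; the model name and response layout shown here are illustrative and may differ from the current API.

import math
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def embed(text, model="text-similarity-babbage-001"):
    # Return the embedding vector for a single piece of text.
    response = openai.Embedding.create(input=text, model=model)
    return response["data"][0]["embedding"]

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed("How do I reverse a list in Python?")
v2 = embed("Reversing an array in Python")
print(cosine(v1, v2))  # closer to 1.0 means more similar texts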

OUTLINE:
0:00 - Intro
0:30 - Sponsor: Weights & Biases
2:20 - What embeddings are available?
3:55 - OpenAI shows promising results
5:25 - How good are the results really?
6:55 - Criticism: Open models might be cheaper and smaller
10:05 - Discrepancies in the results
11:00 - The author's response
11:50 - Putting things into perspective
13:35 - What about real world data?
14:40 - OpenAI's pricing strategy: Why so expensive?

Sponsor: Weights & Biases

ERRATA: At 13:20 I say "better", it should be "worse"

If you want to support me, the best thing to do is to share the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

Merch: store.ykilcher.com

YannicKilcher

"Furries and football fans are linearly separable" Yannic Kilcher 2022

stacksmasherninja

Hugging Face killed the moat for all these companies riding on the idea of winning with large models. Any business of a decent size can hire a good MLE and beat these models for real-world use cases by customizing the open models on their own datasets. Real-world use cases do not always have to be zero-shot anyway. OpenAI is great (massive respect for some of the tech people who work there), but their business of selling APIs gives me IBM Watson vibes, where they are trying to trick some unsuspecting CEOs of big companies into using their services at unbelievable prices. I have had people come to me and insist that I use GPT for freelance projects, so kudos to their marketing.

TarunKumar-ensi

I feel this is a good service for data scientists with limited NLP experience and high-value projects. After all, hitting an API and getting SoTA or near-SoTA results without having to search for the best model for the task is invaluable.
But of course, the fact that it's API-only, not available under a license, and quite expensive limits the use cases a lot.

MkillrYT

Nils Reimers (author of the blog post) is a co-author of the original Sentence-BERT paper.
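
For context, a minimal sketch of the open-source route discussed in the video, using the sentence-transformers library that grew out of Sentence-BERT; the model name below is just one common choice, not necessarily the one benchmarked in the blog post.

from sentence_transformers import SentenceTransformer, util

# A small open-source embedding model (~22M parameters); runs fine on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "OpenAI launches an embeddings endpoint in their API.",
    "Open-source models can compute text embeddings locally.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the first two sentences should score highest.
print(util.cos_sim(embeddings, embeddings))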

uralbayhan

The cost isn't just for the model - the cost is for the whole infrastructure and the API service. It should be compared to other SaaS offerings rather than to the cost of running another model over the same data. The fact that they've made it available through an API means that other product developers can outsource that entire segment of their product. Clients who use those products are often happy to pay higher ongoing operations and maintenance (OM&A) costs as opposed to front-loading the capital cost of development.

scign

I would go with the embeddings API from EleutherAI: a similar thing, but with open-source models.

serta

I have benchmarked some search tasks and, without any query engineering, GPT Davinci did slightly better than about a year's worth of tinkering on dozens of other models (DeBERTa, RoBERTa, etc). As the dataset is small, I was able to embed 5 variations of it with the free $15 OpenAI provides when you make an account. One of those attempts was better than the literal thousands of attempts I had made on the same project in the last year. Might be worth it to spend a couple hundred trying to optimize the query engineering aspect.

andrewcutler

Seems like a useful thing to try out. Just out of curiosity, does this API support fine-tuning on a specific dataset?

aiexplainai

Yannic, great content, keep it up. What happened, by the way? You just aged 10 years in six months.

ademord

Great overview of Weights & Biases!

connor-shorten

Does Gwern take his tweets down as fast as they go up? I couldn't view any of Gwern's tweets, but the others are there.

scottmiller

OpenAI's marketing guys have read Amos Tversky and Daniel Kahneman, and maybe they're also sitting in on the paper revision meetings.

NGNBoone

As bizarre as it is I find it really funny. Let’s ride those hype waves! 🏄🏼‍♂️

Bigboigoingbig

Why did OpenAI decide to charge for things, unlike before?

wryltxw

BM25 remains the best IR after 30 years :0)
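
For anyone curious, a hedged sketch of that BM25 baseline using the rank_bm25 package (one of several implementations; the toy corpus here is made up for illustration).

from rank_bm25 import BM25Okapi

corpus = [
    "openai launches an embeddings endpoint for text and code search",
    "open source sentence embedding models are small and cheap",
    "bm25 is a strong lexical baseline for information retrieval",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "lexical baseline for retrieval".split()

print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document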

timothy-ulwp

Reading the paper, it's not clear to me how the input is processed. Are x and y both tokenized, plus the BOS/EOS tokens? So the model trains both token embeddings (x, y) and sentence embeddings (v)?

mkamp

If you use the DaVinci model for text generation, it's 10x cheaper ($0.06 per 1,000 tokens). I don't understand the logic behind the incredible price of the embeddings.
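
A back-of-the-envelope comparison using only the numbers quoted in the comment above ($0.06 per 1,000 tokens for Davinci text generation, and roughly 10x that for Davinci embeddings at launch); treat these as the commenter's figures, not current pricing.

# Cost to process a hypothetical 10M-token corpus, at the prices quoted above.
tokens_in_corpus = 10_000_000

generation_price_per_1k = 0.06                          # USD, from the comment
embedding_price_per_1k = 10 * generation_price_per_1k   # "10x cheaper" the other way around

print(f"Davinci generation: ${tokens_in_corpus / 1000 * generation_price_per_1k:,.2f}")
print(f"Davinci embeddings: ${tokens_in_corpus / 1000 * embedding_price_per_1k:,.2f}")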

HoriaCristescu

2mins of advertisement for W&B ?!?!?!?

zannyee

Embeddings have been available to me since three months ago.

kevinamiri