OpenAI Embeddings (and Controversy?!)

#mlnews #openai #embeddings

COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out Arvind :) ):
3. Finally, I'm now working on time travel so that I can cite papers from the future :)
END COMMENTS FROM THE AUTHOR

OpenAI launches an embeddings endpoint in their API, providing high-dimensional vector embeddings for use in text similarity, text search, and code search. While embeddings are universally recognized as a standard tool for processing natural language, people have raised doubts about the quality of OpenAI's embeddings: one blog post found they are often outperformed by open-source models that are much smaller and would cost only a fraction of what OpenAI charges. In this video, we examine the claims made and determine what it all means.
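
A rough sketch (not from the video) of how one might call such an embeddings endpoint and compare two texts, assuming the pre-1.0 openai Python client; the model name and response layout shown here are illustrative and may differ from the current API.

import math
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def embed(text, model="text-similarity-babbage-001"):
    # Return the embedding vector for a single piece of text.
    response = openai.Embedding.create(input=text, model=model)
    return response["data"][0]["embedding"]

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed("How do I reverse a list in Python?")
v2 = embed("Reversing an array in Python")
print(cosine(v1, v2))  # closer to 1.0 means more similar texts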

OUTLINE:
0:00 - Intro
0:30 - Sponsor: Weights & Biases
2:20 - What embeddings are available?
3:55 - OpenAI shows promising results
5:25 - How good are the results really?
6:55 - Criticism: Open models might be cheaper and smaller
10:05 - Discrepancies in the results
11:00 - The author's response
11:50 - Putting things into perspective
13:35 - What about real world data?
14:40 - OpenAI's pricing strategy: Why so expensive?

Sponsor: Weights & Biases

ERRATA: At 13:20 I say "better", it should be "worse"

If you want to support me, the best thing to do is to share the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

Merch: store.ykilcher.com

YannicKilcher

"Furries and football fans are linearly separable" Yannic Kilcher 2022

stacksmasherninja

Hugging Face killed the moat for all these companies riding on the idea of winning with large models. Any business of a decent size can hire a good MLE and beat these models for real-world use cases by customizing the open models on their own datasets. Real-world use cases do not always have to be zero-shot anyway. OpenAI is great (massive respect for some of the tech people who work there), but their business of selling APIs gives me IBM Watson vibes, where they are trying to trick some unsuspecting CEOs of big companies into using their services at unbelievable prices. I have had people come to me and insist that I use GPT for freelance projects, so kudos to their marketing.

TarunKumar-ensi

I feel this is a good service for data scientists with limited NLP experience and high-value projects. After all, hitting an API and getting SoTA or near-SoTA results without having to search for the best model for the task is invaluable.
But of course, the fact that it's API-only, not available under a license, and quite expensive limits the use cases a lot.

MkillrYT

Nils Reimers (author of the blog post) is a co-author of the original Sentence-BERT paper.
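
For context, a minimal sketch of the open-source route discussed in the video, using the sentence-transformers library that grew out of Sentence-BERT; the model name below is just one common choice, not necessarily the one benchmarked in the blog post.

from sentence_transformers import SentenceTransformer, util

# A small open-source embedding model (~22M parameters); runs fine on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "OpenAI launches an embeddings endpoint in their API.",
    "Open-source models can compute text embeddings locally.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the first two sentences should score highest.
print(util.cos_sim(embeddings, embeddings))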

uralbayhan

The cost isn't just for the model - the cost is for the whole infrastructure and the API service. It should be compared to other SaaS offerings rather than to the cost of running another model over the same data. The fact that they've made it available through an API means that other product developers can outsource that entire segment of their product. Clients who use those products are often happy to pay higher ongoing operations and maintenance (OM&A) costs as opposed to front-loading the capital cost of development.

scign

I would go with the embeddings API from EleutherAI: a similar thing, but with open-source models.

serta

I have benchmarked some search tasks and, without any query engineering, GPT Davinci did slightly better than about a year's worth of tinkering on dozens of other models (DeBERTa, RoBERTa, etc). As the dataset is small, I was able to embed 5 variations of it with the free $15 OpenAI provides when you make an account. One of those attempts was better than the literal thousands of attempts I had made on the same project in the last year. Might be worth it to spend a couple hundred trying to optimize the query engineering aspect.

andrewcutler

Seems like a useful thing to try out. Just out of curiosity, does this API support fine-tuning on a specific dataset?

aiexplainai

Yannic, great content, keep it up. What happened, by the way? You just aged 10 years in six months.

ademord

Great overview of Weights & Biases!

connor-shorten

Does Gwern take his tweets down as fast as they go up? I couldn't view any of Gwern's tweets, but the others are there.

scottmiller

OpenAI's marketing guys have read Amos Tversky and Daniel Kahneman, and maybe they're also sitting in on the paper revision meetings.

NGNBoone

As bizarre as it is I find it really funny. Let’s ride those hype waves! 🏄🏼‍♂️

Bigboigoingbig

Why did OpenAI decide to charge for things, unlike before?

wryltxw

BM25 remains the best IR after 30 years :0)
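
For anyone curious, a hedged sketch of that BM25 baseline using the rank_bm25 package (one of several implementations; the toy corpus here is made up for illustration).

from rank_bm25 import BM25Okapi

corpus = [
    "openai launches an embeddings endpoint for text and code search",
    "open source sentence embedding models are small and cheap",
    "bm25 is a strong lexical baseline for information retrieval",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "lexical baseline for retrieval".split()

print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document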

timothy-ulwp

Reading the paper, it's not clear to me how the input is processed. Are x and y both tokenized, plus the BOS/EOS tokens? So the model trains both token embeddings (x, y) and sentence embeddings (v)?

mkamp

If you use the DaVinci model for text generation, it's 10x cheaper ($0.06 per 1,000 tokens). I don't understand the logic behind the incredible price of the embeddings.
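
A back-of-the-envelope comparison using only the numbers quoted in the comment above ($0.06 per 1,000 tokens for Davinci text generation, and roughly 10x that for Davinci embeddings at launch); treat these as the commenter's figures, not current pricing.

# Cost to process a hypothetical 10M-token corpus, at the prices quoted above.
tokens_in_corpus = 10_000_000

generation_price_per_1k = 0.06                          # USD, from the comment
embedding_price_per_1k = 10 * generation_price_per_1k   # "10x cheaper" the other way around

print(f"Davinci generation: ${tokens_in_corpus / 1000 * generation_price_per_1k:,.2f}")
print(f"Davinci embeddings: ${tokens_in_corpus / 1000 * embedding_price_per_1k:,.2f}")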

HoriaCristescu

2mins of advertisement for W&B ?!?!?!?

zannyee

Embeddings have been available to me since three months ago.

kevinamiri