New AI cascade of LLMs - FrugalGPT (Stanford)

ChatGPT and GPT-4 can be expensive for a small enterprise: around $21K per month to support customer service. How can you reduce the cost of ChatGPT and GPT-4 services with prompt adaptation, LLM approximation (caching solutions and fine-tuning smaller Transformer sub-systems), and/or LLM cascades?

Data-adaptive LLM selection!

Stanford University published new insights into how to save costs and increase AI system performance, for business owners and AI developers: the optimal way to use LLM systems like ChatGPT or GPT-4 (and likely future models such as GPT-5) to reduce the price you have to pay (to OpenAI, Microsoft, ...).

The video highlights the challenges faced by small businesses in the United States that use GPT-4 for customer service, as the monthly costs can be around $21,000. Additionally, the energy and environmental impacts of running LLMs in cloud compute centers are significant. To address these issues, the authors compare the costs of 12 different commercial LLMs and introduce the concept of FrugalGPT.

FrugalGPT aims to reduce the resources needed to run LLMs by focusing on prompt adaptation, LLM approximation, and LLM cascading. Prompt adaptation decreases the length of the input prompt, while LLM approximation uses an external cache to store previous query responses. By checking the cache before submitting a new query, the system can avoid unnecessary calls to the LLM. LLM cascading uses different LLMs based on cost and performance, starting with cheaper options like GPT-3 and escalating to more expensive models like GPT-4 only when necessary.
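The three strategies can be sketched together in a few lines. Everything below is a hypothetical stand-in, not the paper's implementation: `call_llm` is a stub instead of a real API client, prompt adaptation is crude word truncation, and the model names are illustrative.

```python
CACHE = {}  # LLM approximation: remember past (prompt -> answer) pairs

def call_llm(model, prompt):
    # Placeholder for a real API call; returns a canned answer for the sketch.
    return f"[{model}] answer to: {prompt}"

def shorten_prompt(prompt, max_words=50):
    # Prompt adaptation: naive truncation; real systems drop redundant
    # few-shot examples or compress context instead.
    return " ".join(prompt.split()[:max_words])

def answer_is_good_enough(answer):
    # Stand-in for a learned scoring function; accept any non-empty answer.
    return bool(answer.strip())

def query(prompt, models=("gpt-3.5-turbo", "gpt-4")):
    prompt = shorten_prompt(prompt)
    if prompt in CACHE:              # check the cache before paying for a call
        return CACHE[prompt]
    for model in models:             # cascade: cheapest model first
        answer = call_llm(model, prompt)
        if answer_is_good_enough(answer) or model == models[-1]:
            CACHE[prompt] = answer
            return answer
```

With this structure, a repeated question is served from the cache for free, and the expensive model is only reached when the cheap one's answer is rejected.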

The authors conducted experiments with different LLM APIs and prompting strategies and found that FrugalGPT can achieve substantial efficiency gains. In one case using a news dataset, they reduced the inference cost of running queries by 98% while still exceeding the performance of GPT-4. The implementation also involved a scoring function, which could be a regression model trained on a dataset of headlines using a distilled version of BERT (e.g. DistilBERT).
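The scoring step can be sketched as follows. In the paper, a small trained model (such as a DistilBERT regressor) outputs a reliability score for a (query, answer) pair; the keyword/length heuristic below is an invented stand-in so the example stays self-contained, and the per-tier thresholds are illustrative, not the paper's tuned values.

```python
def reliability_score(query, answer):
    # Hypothetical stand-in for a trained regressor: penalize hedging
    # phrases, reward longer answers, clamp the score to [0, 1].
    hedges = ("i don't know", "not sure", "cannot answer")
    if any(h in answer.lower() for h in hedges):
        return 0.0
    return min(len(answer.split()) / 20.0, 1.0)

# Each cascade tier pairs a model with an acceptance threshold; in the
# paper these thresholds are learned on a validation set.
TIERS = [("gpt-3.5-turbo", 0.8), ("gpt-4", 0.0)]  # last tier always accepts

def accept(query, answer, threshold):
    # If the score clears the tier's threshold, stop; otherwise escalate.
    return reliability_score(query, answer) >= threshold
```

The key design point is that the scorer is far cheaper to run than the next model in the cascade, so evaluating an answer costs much less than escalating unnecessarily.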

The video emphasizes the potential cost savings and improved performance achieved by using a multi-AI system with various price points and performance levels. By using FrugalGPT and selecting among LLM options based on cost and quality, significant cost reductions and even accuracy improvements can be achieved compared to relying on a single expensive LLM like GPT-4.

The video also briefly mentions the idea of reducing costs even further by using lower-quality LLMs and providing additional semantic and linguistic information to compensate for their limitations. However, this hypothetical scenario would require longer prompts and could contradict the prompt adaptation strategy discussed earlier.

Overall, the publication presents an approach to reduce the cost of using LLMs in cloud computing environments while maintaining or improving performance. The FrugalGPT concept and the use of multiple LLMs with varying costs and performance levels offer potential benefits for businesses seeking cost-effective solutions for their AI applications.

All rights (data, model, charts and tables) are with the authors of the scientific preprint:
FrugalGPT: How to Use Large Language Models
While Reducing Cost and Improving Performance
by Lingjiao Chen, Matei Zaharia, James Zou

Article by MarkTechPost:

#performance
#price
#ai
#gpt4
#chatgpt
#explained
#insights
Comments

Yep. I've been using Chroma and creating vector indexes locally, using free models, training them on our data, and relying on GPT-3 for only some of the needs, GPT-4 very rarely. Cost last month: $8.

krisvq

Very funny ending... or maybe it was just me... Excellent video as usual...

joser

I'm currently designing a customer service app, and when I started doing some maths with the prices of the GPT-4 API, it kind of threw me off.
So this is very interesting, thanks!

Alain.Robert

This is a great study, and a great explanation. I've been looking at a similar idea for generating fine-tuning datasets, similar to the WizardLM/Evol-Instruct approach. The main model (e.g. starting with WizardLM itself) would generate the questions and answers with access to tools, with separate models to score the questions, the answers, and both together. The original/base model would need to be good enough to use tools with some basic reasoning, which is a limitation.

Then the data would be filtered, categorized and packed into training sets to fine-tune the main model, the performance evaluated, and the process repeated with the fine-tuned model. My hypothesis is that the main model's abilities will be limited by the scoring ability of the smaller models before they're limited by parameter size. I'm still setting up this system and learning as I go; this study would likely indicate I'll be wrong (happy to be!), but I'll be interested to see where it goes. Likely smarter people than I are already further along building such a system, or such a system already exists and I have yet to find the relevant papers. Either way, I really appreciate your videos; they have provided a lot of great guidance!

DarrenReidAu

Speaking of coincidences, have you noticed the increasing number of them in your world, some orchestrated (hey Google, hey Alexa) and some leading us to find amazingly pertinent free information tools that enable anyone to build literally anything? This particular subject begs for collaboration and sharing, and is getting it. Great community in AI, how everyone nearly gives it away. Ironically, it definitely makes it hard to get anything done if you don't make a living doing AI (i.e. don't code), but you gotta wonder what cool stuff those deeply in it are working on. Doc, are you doing anything super cool? I'm working on a bootstrap agent.

josephshawa

Brilliant.🎉
But all in all, I feel a well-tuned, free and small LLM run on our own specific data will give the best cost and performance.

CharlesOkwuagwu

FYI, in GPT-4 you can literally copy-paste and send the guidance template; just replace variables like {{query}} with a valid string, and the end effect will be the same.

wojciechzielinski

Nice video. Does the scoring function performed by the BERT intermediary require BERT to be fine-tuned on data specific to the task and knowledge domain the cascade will be used for? The example said to train DistilBERT on headlines, for example, but is that only appropriate if the cascade will be used for similar queries?

theshrubberer

This video is very informative. Many thanks.

jayhu

I think they used this in Bing too, except for the cache part.

xiaojinyusaudiobookswebnov

I wasn't sure how to solve harder questions and was thinking of this approach as cheating (calling GPT-4 for help), but here we go, I guess it's just normal.

LaFragas