StreamingLLM - Extend Llama2 to 4 million tokens & 22x faster inference?

It's hard to get an LLM to generate large amounts of content or take in large inputs. To solve this, StreamingLLM extends Llama-2 and Falcon up to 4 million tokens, with 22x faster inference than your standard LLM ⚡️

Now you can even generate a whole book with an LLM!
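
The core idea behind StreamingLLM is to keep the key/value cache for a few initial "attention sink" tokens plus a rolling window of the most recent tokens, evicting everything in between. Here is a minimal sketch of that cache policy in Python (not the official implementation; the sizes and class name are illustrative only):

```python
# Minimal sketch of an attention-sink cache policy: keep the KV entries of a
# few initial "sink" tokens plus a rolling window of recent tokens, and evict
# everything in between. Sizes are illustrative, not the paper's exact config.
from collections import deque

class StreamingKVCache:
    def __init__(self, num_sink_tokens=4, window_size=2044):
        self.num_sink_tokens = num_sink_tokens
        self.sink = []                            # always-kept initial tokens
        self.recent = deque(maxlen=window_size)   # rolling window; oldest entries fall out

    def append(self, kv_entry):
        # kv_entry stands in for one token's key/value tensors
        if len(self.sink) < self.num_sink_tokens:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)          # deque drops the oldest automatically

    def active_cache(self):
        # What the model attends over at each decoding step
        return self.sink + list(self.recent)
```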

🔗 Links

👋🏻 About Me

#llama2 #meta #gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #gpt4 #machinelearning
Comments

I really liked the format of explaining new concepts introduced in scientific papers; keep it going, Jason, love your channel! You're one of the most unique AI content creators I know.

arthurbm

This is the third video I've seen from you today, and you are so consistent in providing value with your words. Thank you, my new AI guru 🙏

elmflor

I don't understand how it can help, even for books. Will it forget everything in the middle of the book? I try to think about how this works in the human brain. When we read a book we (usually) don't remember each word. What we do is create visual images internally, which compress the book very effectively. Supposedly these images are like tokens, or maybe like embeddings, and don't occupy much space in memory. Is it possible to implement something like this for LLMs? They would need to learn while "reading the book", converting the text to multimodal embeddings, or even finding (creating) an approximate path in embedding space, and later they would need the ability to analyze that path. Not sure how it should be implemented.

Dron

Oh, this is great. I started playing around with streaming; the part I was missing was keeping the initial context. Nice find.

jeffsteyn

Great job at providing information about new developments, Jason! Thanks!

BorutDelFabbro

Here are my ideas:

It's going to need to use prompt compression and RAG.

Let's start with prompt compression. Basically, compress the user's input before feeding it to GPT-4/Opus, have the main LLM respond in compressed format, then have a decompressor at the end.

Now here is where RAG comes into it.

The compressed output goes into a temporary RAG store, forming a working, temporary knowledge base.

And now we insert a RAG query between compression and passing the compressed query to the main LLM.

This RAG query needs to search the working knowledge base for context relevant to the question, which gets appended as context when being fed to the main LLM.

You could probably have a separate RAG knowledge base in this process that stores the must-remember information.

The process would look like this (a rough code sketch follows the diagram):

Input
|
Compressor LM
|
RAG Working Memory
|
Main LLM
|
Decompressor LM
|
Output
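
A rough, runnable sketch of this pipeline, assuming a simple in-memory working-memory store; the compressor, main LLM, and decompressor here are placeholder stubs rather than real model calls:

```python
# Sketch of the compress -> RAG working memory -> main LLM -> decompress loop.
# All model stages are stubs; in practice each would be a call to a real LM.

class WorkingMemory:
    """Temporary RAG store built from compressed outputs."""
    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def retrieve(self, query, k=3):
        # Toy relevance score: word overlap instead of real embeddings.
        overlap = lambda e: len(set(e.split()) & set(query.split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]

def compressor_lm(text):
    return "compressed:" + text                       # stub

def main_llm(query, context):
    return "compressed:answer to " + query + " given " + " | ".join(context)  # stub

def decompressor_lm(text):
    return text.replace("compressed:", "")            # stub

def answer(user_input, memory):
    compressed_query = compressor_lm(user_input)            # 1. compress the input
    context = memory.retrieve(compressed_query)             # 2. RAG query over working memory
    compressed_reply = main_llm(compressed_query, context)  # 3. main LLM replies in compressed form
    memory.add(compressed_reply)                            # 4. grow the temporary knowledge base
    return decompressor_lm(compressed_reply)                # 5. decompress for the user
```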

DarrenAllatt

Thank you for the great content as always, Jason. Could you maybe elaborate a bit more on the specific use cases of StreamingLLM in a future video?

jasonfinance

I like these short videos as well. Time is most precious these days

heagandev

As usual: inspiring, accurate, and up to date. Thank you, sir.

fab_spaceinvaders

Something to consider: when removing text, maybe feed it to a vector store first so that you keep the data. Then, if the model needs to remember something that's no longer in its context, it can still retrieve the original text.
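
One way to sketch that idea on top of a rolling cache: before text falls out of the recent window, archive it in a searchable store so it can be retrieved and re-injected as context later. The class name and the bag-of-words scoring below are toy stand-ins for a real vector store with embeddings:

```python
# Archive evicted text so it can be recalled later instead of being lost.
from collections import Counter

class EvictionArchive:
    def __init__(self):
        self.chunks = []   # list of (bag_of_words, original_text)

    def archive(self, evicted_text):
        self.chunks.append((Counter(evicted_text.lower().split()), evicted_text))

    def recall(self, query, k=2):
        q = Counter(query.lower().split())
        # Score by word overlap; a real system would use embedding similarity.
        scored = sorted(self.chunks, key=lambda c: sum((c[0] & q).values()), reverse=True)
        return [text for _, text in scored[:k]]
```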

jaysonp

The more I chat with the LLM, the slower the answers get. Can I increase speed while saving the dialog? Thank you.

voxyloids

Hold on Jason, the attention sink breakthrough does not necessarily mean a larger context window, does it?

KarlJuhl

Ask the LLM to summarize stuff when it's running low on context; that should be better than just remembering the first bit plus a recent window.

zyxwvutsrqponmlkh

Are the middle tokens summarized or contextualized in any way, or is that information just lost as more data is added?

jzam

Islands of meaning in the stream of thought.
A conversational heartbeat, at which point the conversation is encapsulated in a metaphoric narrative.
A memory-island visit spawns new queries into the dataset based on an interpretation of the metaphoric narrative, re-encapsulated at the conclusion of the current session as a new memory-island narrative.

GrimGriz

Which model has the largest context window token limit as of today?

aldorodriguez

I would like to see an LLM have a tool that lets it access and modify its own vector database for long-term memory storage. So its current memory can be short-term, but it accesses and stores information in its own vector database for long-term memory. Do you think this could be a possible solution?
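
A sketch of what such a memory tool interface could look like; the function names and dispatch below are hypothetical, and a plain list stands in for the real vector database the model would query:

```python
# Two tools the model could call for long-term memory, plus a simple dispatcher.

LONG_TERM_MEMORY = []   # stand-in for a real vector database

def store_memory(text):
    """Tool the model can call to persist a fact long term."""
    LONG_TERM_MEMORY.append(text)
    return "stored"

def search_memory(query):
    """Tool the model can call to recall facts; toy keyword match, not embeddings."""
    words = query.lower().split()
    return [m for m in LONG_TERM_MEMORY if any(w in m.lower() for w in words)]

TOOLS = {"store_memory": store_memory, "search_memory": search_memory}

def dispatch_tool_call(name, argument):
    # An agent loop would route the model's emitted tool calls through here.
    return TOOLS[name](argument)
```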

shadownight

Is there still no solution for extending data?

ilyasshynbergen

Which LLM has the largest token limit to expand the context length of the chat?

aldorodriguez

Hey Jason, amazing stuff, thank you for your hard work.
What about storing chat history in vector databases for long-term memory, in the context of a chatbot for example?

mehdichallakh