StreamingLLM - Extend Llama2 to 4 million tokens & 22x faster inference?

It's hard to get an LLM to generate large amounts of content or take in large inputs. To solve this, StreamingLLM extends Llama-2 and Falcon up to 4 million tokens, with 22x faster inference than your standard LLM ⚡️

Now you can even generate a whole book with an LLM!
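
The core idea behind StreamingLLM is to keep the key/value cache for a few initial "attention sink" tokens plus a rolling window of the most recent tokens, evicting everything in between. Here is a minimal sketch of that cache policy in Python (not the official implementation; the sizes and class name are illustrative only):

```python
# Minimal sketch of an attention-sink cache policy: keep the KV entries of a
# few initial "sink" tokens plus a rolling window of recent tokens, and evict
# everything in between. Sizes are illustrative, not the paper's exact config.
from collections import deque

class StreamingKVCache:
    def __init__(self, num_sink_tokens=4, window_size=2044):
        self.num_sink_tokens = num_sink_tokens
        self.sink = []                            # always-kept initial tokens
        self.recent = deque(maxlen=window_size)   # rolling window; oldest entries fall out

    def append(self, kv_entry):
        # kv_entry stands in for one token's key/value tensors
        if len(self.sink) < self.num_sink_tokens:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)          # deque drops the oldest automatically

    def active_cache(self):
        # What the model attends over at each decoding step
        return self.sink + list(self.recent)
```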

🔗 Links

👋🏻 About Me

#llama2 #meta #gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #gpt4 #machinelearning
Comments

I really liked the format of explaining new concepts introduced in scientific papers; keep it going, Jason, love your channel! You're one of the most unique AI content creators I know.

arthurbm

This is the third video I've seen from you today, and you are so consistent in providing value with your words. Thank you, my new AI guru 🙏

elmflor

I don't understand how it can help, even for books. Will it forget everything in the middle of the book? I try to think about how this works in the human brain. When we read a book we (usually) don't remember each word. What we do is create visual images internally, which compress the book very effectively. Supposedly these images are like tokens, or maybe like embeddings, and don't occupy much space in memory. Is it possible to implement something like this for LLMs? They would need to learn while "reading the book", converting the text to multimodal embeddings, or even finding (creating) an approximate path in embedding space, and later they would need the ability to analyze that path. Not sure how it should be implemented.

Dron

Oh, this is great. I started playing around with streaming; the part I was missing was keeping the initial context. Nice find.

jeffsteyn

Great job at providing information about new developments, Jason! Thanks!

BorutDelFabbro

Here are my ideas:

It's going to need to use prompt compression and RAG.

Let's start with prompt compression. Basically, compress the user's input before feeding it to GPT-4/Opus, have the main LLM respond in compressed format, then have a decompressor at the end.

Now here is where RAG comes into it.

The compressed output goes into a temporary RAG store, forming a working, temporary knowledge base.

And now we insert a RAG query between compression and passing the compressed query to the main LLM.

This RAG query needs to search the working knowledge base for context relevant to the question, which gets appended as context when being fed to the main LLM.

You could probably have a separate RAG knowledge base in this process that stores the must-remember information.

The process would look like this (a rough code sketch follows the diagram):

Input
|
Compressor LM
|
RAG Working Memory
|
Main LLM
|
Decompressor LM
|
Output
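
A rough, runnable sketch of this pipeline, assuming a simple in-memory working-memory store; the compressor, main LLM, and decompressor here are placeholder stubs rather than real model calls:

```python
# Sketch of the compress -> RAG working memory -> main LLM -> decompress loop.
# All model stages are stubs; in practice each would be a call to a real LM.

class WorkingMemory:
    """Temporary RAG store built from compressed outputs."""
    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def retrieve(self, query, k=3):
        # Toy relevance score: word overlap instead of real embeddings.
        overlap = lambda e: len(set(e.split()) & set(query.split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]

def compressor_lm(text):
    return "compressed:" + text                       # stub

def main_llm(query, context):
    return "compressed:answer to " + query + " given " + " | ".join(context)  # stub

def decompressor_lm(text):
    return text.replace("compressed:", "")            # stub

def answer(user_input, memory):
    compressed_query = compressor_lm(user_input)            # 1. compress the input
    context = memory.retrieve(compressed_query)             # 2. RAG query over working memory
    compressed_reply = main_llm(compressed_query, context)  # 3. main LLM replies in compressed form
    memory.add(compressed_reply)                            # 4. grow the temporary knowledge base
    return decompressor_lm(compressed_reply)                # 5. decompress for the user
```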

DarrenAllatt

Thank you for the great content as always, Jason. Could you maybe elaborate a bit more on the specific use cases of StreamingLLM in a future video?

jasonfinance

I like these short videos as well. Time is most precious these days

heagandev

As usual: inspiring, accurate, and up to date. Thank you, sir.

fab_spaceinvaders

Something to consider: when removing text, maybe feed it to a vector store first so that you keep the data. Then, if the model needs to remember something that's no longer in its context, it can still retrieve the original text.
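
One way to sketch that idea on top of a rolling cache: before text falls out of the recent window, archive it in a searchable store so it can be retrieved and re-injected as context later. The class name and the bag-of-words scoring below are toy stand-ins for a real vector store with embeddings:

```python
# Archive evicted text so it can be recalled later instead of being lost.
from collections import Counter

class EvictionArchive:
    def __init__(self):
        self.chunks = []   # list of (bag_of_words, original_text)

    def archive(self, evicted_text):
        self.chunks.append((Counter(evicted_text.lower().split()), evicted_text))

    def recall(self, query, k=2):
        q = Counter(query.lower().split())
        # Score by word overlap; a real system would use embedding similarity.
        scored = sorted(self.chunks, key=lambda c: sum((c[0] & q).values()), reverse=True)
        return [text for _, text in scored[:k]]
```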

jaysonp

The more I chat with the LLM, the slower the answers get. Can I increase speed while saving the dialog? Thank you.

voxyloids

Hold on Jason, the attention sink breakthrough does not necessarily mean a larger context window, does it?

KarlJuhl

Ask the LLM to summarize stuff when it's running low on context; that should be better than just remembering the first bit plus a recent window.

zyxwvutsrqponmlkh

Are the middle tokens summarized or contextualized in any way, or is that information just lost as more data is added?

jzam

Islands of meaning in the stream of thought.
A conversational heartbeat, at which point the conversation is encapsulated in a metaphoric narrative.
A memory-island visit spawns new queries into the dataset based on an interpretation of the metaphoric narrative, re-encapsulated at the conclusion of the current session as a new memory-island narrative.

GrimGriz

Which model has the largest context window token limit as of today?

aldorodriguez

I would like to see an LLM have a tool that lets it access and modify its own vector database for long-term memory storage. So its current memory can be short-term, but it accesses and stores information in its own vector database for long-term memory. Do you think this could be a possible solution?
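
A sketch of what such a memory tool interface could look like; the function names and dispatch below are hypothetical, and a plain list stands in for the real vector database the model would query:

```python
# Two tools the model could call for long-term memory, plus a simple dispatcher.

LONG_TERM_MEMORY = []   # stand-in for a real vector database

def store_memory(text):
    """Tool the model can call to persist a fact long term."""
    LONG_TERM_MEMORY.append(text)
    return "stored"

def search_memory(query):
    """Tool the model can call to recall facts; toy keyword match, not embeddings."""
    words = query.lower().split()
    return [m for m in LONG_TERM_MEMORY if any(w in m.lower() for w in words)]

TOOLS = {"store_memory": store_memory, "search_memory": search_memory}

def dispatch_tool_call(name, argument):
    # An agent loop would route the model's emitted tool calls through here.
    return TOOLS[name](argument)
```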

shadownight

Is there still no solution for extending data?

ilyasshynbergen

Which LLM has the largest token limit to expand the context length of the chat?

aldorodriguez

Hey Jason, amazing stuff, thank you for your hard work.
What about storing chat history in vector databases for long-term memory, in the context of a chatbot for example?

mehdichallakh