The Era of 1-bit LLMs by Microsoft | AI Paper Explained

In this video we dive into a recent research paper by Microsoft: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits".
This paper introduces an interesting and exciting architecture for large language models, called BitNet b1.58, which significantly reduces LLM memory consumption and inference latency. All of that while showing promising results that do not fall short of a comparable LLaMA model!
Large language model quantization already tackles the same problem, and we'll explain the benefits of BitNet b1.58 compared to common quantization techniques.

BitNet b1.58 is an improvement over the BitNet model presented a few months ago.
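
For a concrete picture of what "1.58-bit" weights look like, here is a minimal sketch of the absmean-style ternary quantization the paper describes: scale a weight matrix by its mean absolute value, then round every entry to the nearest value in {-1, 0, +1}. The function name, epsilon, and PyTorch usage are my own illustrative choices, not the paper's code.

    import torch

    def absmean_ternary_quant(w: torch.Tensor, eps: float = 1e-5):
        # Scale by the mean absolute value of the weights (the "absmean"),
        # then round each entry to the nearest value in {-1, 0, +1}.
        gamma = w.abs().mean()
        w_ternary = (w / (gamma + eps)).round().clamp(-1, 1)
        return w_ternary, gamma  # gamma is kept so outputs can be rescaled

    w = torch.randn(4, 4)
    w_q, scale = absmean_ternary_quant(w)
    print(w_q)     # entries are only -1.0, 0.0, or 1.0
    print(scale)   # per-tensor scale factor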

-----------------------------------------------------------------------------------------------

👍 Please like & subscribe if you enjoy this content

-----------------------------------------------------------------------------------------------

Chapters:
0:00 Paper Introduction
0:55 Quantization
1:31 Introducing BitNet b1.58
2:55 BitNet b1.58 Benefits
4:01 BitNet b1.58 Architecture
4:46 Results

Comments

The one thing the paper neglects to mention, which should have been the biggest breakthrough of the 1-bit LLM, is that the VRAM required for training should be drastically less than for its full-fat 16-bit float counterpart. It should be possible to train the 70B 1-bit model on a single RTX 4090 - at present, the 70B model with any meaningful quantization cannot even be run on a single consumer GPU. I made a video on this subject last week.

At present, the VRAM savings of quantized LLMs are only apparent during execution, but what is more important is the democratization of LLM training. Lowering the barrier to training an LLM is a must to stop one company conquering the LLM space entirely.

GeorgeXian
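
For a sense of scale, here is a back-of-the-envelope calculation of the weight memory alone (no activations, gradients, or optimizer state), assuming 70B parameters and the simple bits-per-weight accounting below:

    # Rough weight-only memory footprint of a 70B-parameter model.
    params = 70e9

    def gib(bits_per_weight: float) -> float:
        return params * bits_per_weight / 8 / 2**30

    print(f"fp16     : {gib(16):6.1f} GiB")    # ~130.4 GiB
    print(f"int4     : {gib(4):6.1f} GiB")     # ~32.6 GiB
    print(f"1.58-bit : {gib(1.58):6.1f} GiB")  # ~12.9 GiB, ignoring packing overhead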

Great! Very helpful. One suggestion I'd make: the number of bits being fractional will be unfamiliar to many people. I think it would be useful to make it clear that yes, of course, to represent three states in practice you need two bits, and that the 1.58 figure is the theoretical Shannon entropy.

JulianHarris
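
For reference, 1.58 is just the Shannon entropy of one ternary weight, i.e. the information content of a uniform choice among three values:

    import math

    # One weight drawn uniformly from {-1, 0, +1} carries log2(3) bits
    # of information, even though storing it takes at least 2 whole bits
    # unless several trits are packed together.
    print(math.log2(3))   # 1.584962500721156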

We had a lecture about single-bit neural networks in one of my uni courses, some 5 years ago. It was interesting.

giacintoboccia

Wow, this seems promising. I hope this will reproduce properly and work in other situations too. If it is truly better in general, new hardware could be so much more efficient.

emiel

I wonder why people didn't use this approach from the beginning. It's like LLMs in assembly language. And as far as I know, every linear operator has a kernel. The kernel means that a linear operator H always maps the zero vector to itself. When we use a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs are in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It's important to remember that 0 is the trivial element in the kernel. For example, let Z be the set of all integers, and let H be the multiplication operator. Then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.

hypervanse

OK, but what is the theory on WHY it achieves the same performance? Maybe this shows that no one really understands how neural networks work, and that we are giving them much more complicated representations when they could just be some "quantized" states.

erickweil

So between this, Groq hardware, the Mojo language, and the Mamba architecture... how many of these are compatible and stack their benefits synergistically? And where they do stack, is the performance gain additive or multiplicative?

ArielTavori

📝 Summary of Key Points:

📌 The research paper discusses the era of 1-bit LLMs, focusing on reducing the size of large language models to address issues related to compute and memory resources, as well as environmental concerns.

🧐 The introduction of the BitNet b1.58 model architecture, which utilizes ternary weights (-1, 0, 1) to reduce the number of bits required to represent the model, leading to improved efficiency without sacrificing performance.

🚀 Benefits of the BitNet b1.58 model include reduced memory usage, lower latency, and comparable performance to full-precision models, showcasing its potential for future applications in large language models.

💡 Additional Insights and Observations:

💬 "Quantization in machine learning refers to the process of reducing the precision of model weights to optimize memory usage and speed."
📊 The BitNet b1.58 model demonstrates significant improvements in memory usage, latency, and perplexity compared to existing models like LLaMA.
🌐 The research paper presents compelling evidence of the effectiveness of the BitNet b1.58 model through comparisons with established models and tasks.

📣 Concluding Remarks:

The era of 1-bit LLMs introduces innovative approaches to reducing the size of large language models, with the BitNet b1.58 model showing promising results in terms of efficiency and performance. This research opens up new possibilities for more accessible and environmentally friendly AI models in the future.
Generated using TalkBud

abdelkaioumbouaicha

This is something I predicted would happen in AI. It's cool to see a concrete usage of it. Ternary computers are the most efficient computers and base 3 is the most efficient integer base, so this isn't surprising. Read up on radix economy to learn more.

burthacklin
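
For anyone curious, here is a small sketch of the radix-economy argument: representing a large number N in base b takes about log_b(N) digits with b states each, so the cost scales like b / ln(b), which is minimized at e ≈ 2.718 and, among integer bases, at 3. The comparison below is my own illustration.

    import math

    # Radix economy: hardware "cost" of base b scales like b / ln(b).
    # Lower is better; base 3 wins among integer bases.
    for b in (2, 3, 4, 8, 10):
        print(f"base {b:2d}: b/ln(b) = {b / math.log(b):.4f}")
    # base 2: 2.8854, base 3: 2.7307, base 4: 2.8854, ...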

How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?

gabrielsandstedt

1:22 it's 2 bytes, not 4 bytes, right?

kangalio

To summarize, BitNet was trained from scratch.
Therefore I cannot quantize an existing LLM to 1.58 bits?
Or is there a quantization approach to bring existing LLMs down to 1.58 bits?

simplemanideas

1:56 - is the "same performance" with "Pareto improvement" just an illustration of a theoretical prediction, or actual data from the weights of a real trit-based model?

yash

Model weights will make a lot more sense

arjavgarg

I've made a few contributions to quaternary algebra: I discovered the inclusive and exclusive NOT gate and am currently working on proofs for them.

The issue with ternary and quaternary at the moment is that current computers have to use numerous transistors per ternary or quaternary digit. Until we have a ternary or quaternary transistor, we may have to keep using bytes, just like regular integers. I haven't seen any patents for a working one that isn't several times larger than a binary transistor, which makes going back to binary more efficient - although of course it depends.

I don't know what Microsoft is doing, but on top of this, running ternary requires at an absolute minimum 2 binary bits, meaning 2 physical data lines at best. Depending on how optimized everything is, from your language's compiler to what kinds of operations you're performing, it may use significantly more.

Running ternary on current hardware doesn't quite make practical sense when, for roughly the same number of data lines, you could be using quaternary.

ithaca
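
A concrete sketch of the packing trade-off mentioned above: storing one trit in 2 bits costs 2 / log2(3) ≈ 1.26x more than its information content, while packing 5 trits into one byte (3^5 = 243 ≤ 256) gets down to 1.6 bits per trit. The encode/decode below is my own toy scheme, not anything from the paper.

    # Pack 5 ternary values (trits) into one byte: 3**5 = 243 <= 256,
    # i.e. 8/5 = 1.6 bits per trit instead of 2 bits per trit.

    def pack5(trits):            # trits: 5 values from {-1, 0, +1}
        byte = 0
        for t in reversed(trits):
            byte = byte * 3 + (t + 1)   # map {-1, 0, +1} -> {0, 1, 2}
        return byte

    def unpack5(byte):
        trits = []
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            trits.append(digit - 1)
        return trits

    example = [-1, 0, 1, 1, -1]
    packed = pack5(example)
    print(packed, unpack5(packed))   # round-trips back to the original trits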

Excellent explanations. This seems to be a comparison on Llama1 though, any confirmation if Llama2 models also perform similar after quantization? I am curious to know if this works on later generations, conceptually Llama2 outperforms Llama1 for the same size ( I.e 7B vs 7B, 13B vs 13B). So in effect the same weights now hold more complexity as compared to before, ie compression will work better when weights have more redundancy as compared to later versions where precision is more likely to be driving the performance differences

karansarao

I wonder what the distribution is between the three values. It would be interesting if it were an even 33.33% each.

mshonle
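
The paper doesn't report that split as far as I know, but it is easy to check on any weight matrix by applying an absmean-style quantizer and counting; a minimal sketch on toy Gaussian weights (real BitNet b1.58 weights may well look different):

    import torch

    # Quantize a toy Gaussian weight matrix to {-1, 0, +1} and count shares.
    w = torch.randn(1024, 1024)
    w_q = (w / (w.abs().mean() + 1e-5)).round().clamp(-1, 1)

    total = w_q.numel()
    for v in (-1.0, 0.0, 1.0):
        print(f"{v:+.0f}: {(w_q == v).sum().item() / total:.1%}")
    # For N(0, 1) weights this lands near 34% / 31% / 34%, not an exact
    # even three-way split.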

Why does it still work when it's quantized from float16 to -1, 0, 1? There could be countless numbers in float16 but only 3 values after quantization. I'm confused by this. 😂

gotachange
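
Part of the answer is that the quantization is in the training loop rather than applied afterwards: the optimizer keeps full-precision "latent" weights, the forward pass uses their ternary projection, and a straight-through estimator lets gradients update the latent weights. A minimal sketch of that pattern (my own simplified version, assuming PyTorch; not the paper's code):

    import torch
    import torch.nn as nn

    class TernaryLinear(nn.Module):
        # Full-precision latent weights, ternary weights on the forward pass,
        # straight-through estimator (STE) for the backward pass.
        def __init__(self, d_in, d_out):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)

        def forward(self, x):
            w = self.weight
            scale = w.abs().mean() + 1e-5
            w_q = (w / scale).round().clamp(-1, 1) * scale
            w_ste = w + (w_q - w).detach()   # forward uses w_q, gradient flows to w
            return x @ w_ste.t()

    layer = TernaryLinear(8, 4)
    out = layer(torch.randn(2, 8))
    out.sum().backward()
    print(layer.weight.grad.shape)   # gradients reach the latent fp weights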

Technically they should call it a 2-bit LLM -- which has multiple meanings ;)

rayujohnson

It will be interesting to see how accuracy is impacted in the end.

eck