Gzip is all You Need! (This SHOULD NOT work)

Comments
Author

The way the compression distance is constructed makes it an approximate measure of the mutual information between strings. That's why it works. Similar strings will yield a smaller NCD. So for example, strings containing similar words will compress together into shorter gzip files (relative to the lengths of the compressed files for the separate strings). But gzip is very general; unlike BERT it's not tuned to extract language features per se. In many ways, LLMs (and all generative data models) act like data compressors, but tuned to a very specific set of data (in BERT's case English text). The hidden layers of a deep NN are essentially compressing data and yielding similar outputs for similar inputs (i.e. mapping the compressed data). They are doing something really similar to a general compressor hooked up to a KNN model, but more tuned to a particular data set. If, instead of using gzip, you used a more specialised English language text compressor, you'd get better results and possibly not too far off BERT's performance (although LLMs are capable of much more than sentiment analysis).
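To make the "gzip is just one choice of compressor" point concrete, here is a minimal sketch (the example strings are made up) that computes the same normalized compression distance with three different standard-library compressors. None of them is language-specialised, but the distances already depend on the compressor, which is the point: a compressor better matched to the data should separate similar and dissimilar pairs more cleanly.

import gzip, bz2, lzma

def ncd(x, y, compress):
    # normalized compression distance under a given compressor
    cx = len(compress(x.encode()))
    cy = len(compress(y.encode()))
    cxy = len(compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

similar = ("the acting was wonderful and the plot kept me hooked",
           "wonderful acting and a plot that kept me hooked throughout")
different = ("the acting was wonderful and the plot kept me hooked",
             "quarterly earnings fell short of analyst forecasts")

for name, fn in [("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    print(name, round(ncd(*similar, fn), 3), round(ncd(*different, fn), 3))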

voiceoftreason
Author

Back when Bayesian classification was first getting really popular for spam filtering (in the late 90s or thereabout), I remember people talking about using a technique similar to this for spam classification: keep a corpus of spam messages and a corpus of ham messages, and then append the test message to both corpuses and see which one compresses better. It was just a fun little curiosity then, and wasn't really feasible for running an email server at scale, but it's fun to see that the idea's still got merit.
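A toy sketch of that trick, with made-up placeholder corpora (on inputs this small the results are noisy, and a real filter would need far more text and a faster scheme): append the test message to each corpus and see which one it compresses into more cheaply.

import gzip

spam_corpus = "WIN A FREE PRIZE click now limited offer act fast free money "
ham_corpus = "meeting moved to tuesday, please review the attached notes before then "

def extra_bytes(corpus, message):
    # how many extra compressed bytes the message costs when appended to the corpus
    return len(gzip.compress((corpus + message).encode())) - len(gzip.compress(corpus.encode()))

def classify(message):
    return "spam" if extra_bytes(spam_corpus, message) < extra_bytes(ham_corpus, message) else "ham"

print(classify("claim your free prize now"))
print(classify("notes from tuesday's meeting attached"))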

fluffycritter
Author

Gzip is used as a kind of tape measure for finding the "distance" between two strings. Think of it like cosine similarity, except the "dot product" is taken directly from the two strings; no vector embedding is involved. Your xx2 = len line is the key: it measures the length of the two strings concatenated. Suppose A and B are similar strings, differing only in a few characters or words. Then z(A)+z(B) >> z(A//B) ~= z(A) ~= z(B), where z() is compressed length, // is concatenation, and ~= means "almost the same value". If A and B are dissimilar, then z(A)+z(B) ~= z(A//B). If you understand why gzip gives these two relations, you will understand why the method works.
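A quick way to see those two relations, using made-up example strings (short toy strings carry a fixed gzip header, so the effect is clearer on longer texts):

import gzip

def z(s):
    # z() = compressed length in bytes
    return len(gzip.compress(s.encode()))

A = "this movie was absolutely wonderful, I loved every minute"
B = "this movie was absolutely dreadful, I hated every minute"   # similar: only a few words differ
C = "the quarterly report lists revenue by region and product"   # dissimilar

print(z(A) + z(B), z(A + B))  # the sum is noticeably larger than z(A//B)
print(z(A) + z(C), z(A + C))  # closer to the sum: little shared structure to reuse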

babali
Author

It is not just comparing compressed lengths. The point is that if two strings x1 and x2 are very dissimilar, then the entropy of x1 + '<space>' + x2 will be roughly the sum of their individual entropies. If they are similar, the entropy of the two put together will be much smaller. For example, if x1 = x2, gzip can simply encode x1 first and say "repeat it". So if x1 = x2 the distance will be very close to zero, since the excess left after subtracting the min is just "repeat once more".

nd = (ENTROPY(x1 + '<space>' + x2) - min(ENTROPY(x1), ENTROPY(x2))) / max(ENTROPY(x1), ENTROPY(x2))

ENTROPY(x) = LEN(GZIP(x)) in this case. The better the entropy estimate you use, the better the results you are likely to get.
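A minimal sketch of that formula in Python (variable names and toy strings are mine):

import gzip

def entropy(x):
    # "entropy" here just means gzip-compressed length in bytes
    return len(gzip.compress(x.encode()))

def nd(x1, x2):
    e1, e2 = entropy(x1), entropy(x2)
    e12 = entropy(x1 + " " + x2)
    return (e12 - min(e1, e2)) / max(e1, e2)

print(nd("great film, loved it", "great film, loved it"))          # small: the second copy is almost free
print(nd("great film, loved it", "terrible waste of an evening"))  # larger: little to reuse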

AnitaSV
Author

The thing is, the vector of features is not comparing the compressed lengths of dataset entries to each other. It is comparing how their separate compressed lengths differ from the compressed length of their concatenation. Given what gzip does with strings in terms of information theory, appending a string containing similar information to the original string lets the second string compress more effectively, reusing the dictionary already built from the first string. So similar strings will have smaller NCDs. The feature vector records how dissimilar that string is from all the other strings, i.e. how much new information the other strings still carry if you say them after the original string. Similar texts will naturally and understandably have a small distance from each other and form clusters; that's why it works. Semantics is secondary: gzip is agnostic to the data's interpretation domain. A better, domain-attuned compression algorithm will produce better clusters and discover new ones. But it is a nice dissection and factorization of what these algorithms really do, without having to rely on any kind of NN/AI machinery, just plain statistics and information theory. Nice.

popularmisconception
Author

I get it. You’re compressing two texts individually, and then combining the two texts and compressing that. If the combined texts compress well, it means they’re more similar. If they compress poorly, then they’re more different.

More similar texts are more likely to have a similar sentiment while less similar texts have less similar sentiment.

terjeoseberg
Author

Compression is amazing. You can compare creativity in LLMs' outputs for writing prompts just by comparing compression ratios.
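A rough sketch of the measurement the comment has in mind; whether a low ratio really tracks "creativity" is the commenter's hypothesis, not something this snippet establishes, and the two sample texts below are made up.

import gzip

def compression_ratio(text):
    # original size divided by compressed size; higher = more repetitive/redundant
    raw = text.encode()
    return len(raw) / len(gzip.compress(raw))

repetitive = "Once upon a time there was a brave knight. " * 30
varied = ("Moonlight pooled in the abandoned observatory while a clockwork fox "
          "rehearsed apologies to the constellations it had mispronounced.")

print(round(compression_ratio(repetitive), 2))  # high: gzip removes the repetition
print(round(compression_ratio(varied), 2))      # low: few repeated patterns to exploit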

jonmichaelgalindo
Author

Someone should take a dataset for positive words and negative words and compare the average entropy of both. Maybe positive words have more entropy because they are used less in the English language or vice versa.
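A toy sketch of that experiment. The word lists below are placeholders (a real test would use a proper sentiment lexicon), and gzip's fixed header overhead dominates tiny inputs, so measuring compressed bits per character over the joined list is probably more meaningful than compressing each word separately.

import gzip

# Placeholder lists -- a real experiment would use a sentiment lexicon.
positive = ["wonderful", "delightful", "superb", "charming", "uplifting", "joyful"]
negative = ["dreadful", "awful", "tedious", "grating", "dismal", "horrid"]

def bits_per_char(words):
    text = " ".join(words)
    return 8 * len(gzip.compress(text.encode())) / len(text)

print(round(bits_per_char(positive), 2), round(bits_per_char(negative), 2))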

lennarth.
Author

In line with what many have mentioned already, I feel like you could just replace the compression algorithm with splitting the text into words and making a set out of that, so that the set doesn't grow much when you concatenate two strings that have many words in common.
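That suggestion amounts to a word-set (Jaccard-style) distance dressed up in the NCD formula; a minimal sketch, with the "compressed length" of a text replaced by its number of distinct words:

def set_ncd(a, b):
    # word-set analogue of NCD: "compressed length" = number of distinct words
    sa, sb = set(a.lower().split()), set(b.lower().split())
    c_a, c_b, c_ab = len(sa), len(sb), len(sa | sb)
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)

print(set_ncd("this movie is amazing", "amazing movie this is"))      # 0.0: same word set
print(set_ncd("this movie is amazing", "the soup was far too salty")) # 1.0: no overlap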

키다리헹님
Author

I have been doing research for quite a long time on similar methods, so here is some additional insight that is hopefully helpful in some manner:

The normalized compression distance is closely related to Kolmogorov complexity. Given a perfect (and therefore non-computable) compressor, it would describe precisely how much additional information it takes to produce one string given the other as input. Weakening that to a real-world compressor gives a natural measure of similarity, and how good a measure it is depends on how good the compression algorithm is, up to that unobtainable asymptotic limit.

The rest of the method is just basically that you can turn any distance metric into a vector embedding by constructing a matrix of pairwise distances. And that these embeddings generally perform well on downstream tasks if the distance measure is a good one. If I remember correctly you can extend the whole thing beyond KNN in a fairly natural way by using the SVD of the training matrix to whiten the calculated feature vectors during inference.
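A small sketch of that construction, with toy training texts; the SVD whitening step is one plausible reading of the extension described above, not the paper's method.

import gzip
import numpy as np

def ncd(x, y):
    cx, cy = len(gzip.compress(x.encode())), len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

train = ["loved every minute of it", "utterly boring and slow",
         "a joy from start to finish", "fell asleep halfway through"]

# Matrix of pairwise distances: row i is the feature vector of training sample i.
D = np.array([[ncd(a, b) for b in train] for a in train])

# Whiten new feature vectors using the SVD of the training matrix.
U, S, Vt = np.linalg.svd(D)

def embed(text):
    v = np.array([ncd(text, t) for t in train])
    return (v @ Vt.T) / (S + 1e-9)  # project onto training components, rescale by singular values

print(embed("boring, I nearly fell asleep"))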

timseguine
Author

It’s long been theorized that text compression is very closely related to AI, so it isn’t totally crazy that this works. The Hutter Prize is a text compression competition with a half million Euro prize pool whose purpose is to advance AI.

robvdm
Author

I've a hypothesis.

Compression algorithms produce a longer result when there is more variation. So, when compressing the concatenated string, there is a rough relationship between the similarity of the words and the length of the compressed output. E.g. (very simplified), "this is amazing." and "amazing product." contain the same word, so the compressed result is shorter than for "this is amazing." and "horrible product.". Running KNN then retrieves the indices of training strings that use similar wording to the string being predicted, and if most of those similar sentences have positive sentiment, the new string is likely to have positive sentiment too.

In other words, starting with string A: when it is concatenated with a similar string B, the entropy doesn't increase much, so the compressed length doesn't increase much, so the distance between the strings is small. When it is concatenated with a different string C, the entropy does increase, so the compressed length increases, so the distance is large. If this hypothesis is true, then the units the compression algorithm acts on probably carry some vague information about the sentiment.
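That hypothesis is essentially the whole pipeline; a toy end-to-end sketch (a made-up four-sample training set, nothing like the paper's datasets):

import gzip
from collections import Counter

def clen(s):
    return len(gzip.compress(s.encode()))

def ncd(x, y):
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

train = [("this is amazing, I love it", "pos"),
         ("amazing product, works great", "pos"),
         ("horrible product, broke in a day", "neg"),
         ("this is awful, a total waste", "neg")]

def predict(text, k=3):
    # rank training samples by NCD and take a majority vote over the k nearest
    nearest = sorted(train, key=lambda sample: ncd(text, sample[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(predict("amazing, simply amazing"))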

kaishang
Author

You know, I've been using gzip like this ever since, back at uni, I had the crazy idea of using compression to estimate entropy and detect encryption, and I thought that was clever.

But this is just plain insane :)

I love it :)

staviq
Author

These features do make sense because if there is some similarity/correlation between two strings, gzip should produce a shorter result from the concatenation of the two strings due to redundancy in the information.

AThagoras
Author

Looking at the pseudocode of LZ77, the algorithm handles matches and distances between strings.
In a way, gzip indexes strings by frequency within a sliding window. This representation can be used to separate very common words (the, an, etc.) with high frequencies from less frequent ones.
They use lossless compression to do statistical analysis. That is smart.

Lempel-Ziv 1977:
while input is not empty do
    match := longest repeated occurrence of input that begins in window

    if match exists then
        d := distance to start of match
        l := length of match
        c := char following match in input
    else
        d := 0
        l := 0
        c := first char of input
    end if

    output (d, l, c)

    discard l + 1 chars from front of window
    s := pop l + 1 chars from front of input
    append s to back of window
repeat
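For anyone who wants to poke at those (d, l, c) triples directly, here is a naive, unoptimized Python transcription of the pseudocode above (illustration only; real gzip/DEFLATE adds Huffman coding, overlapping matches, and a far smarter match search):

def lz77_tokens(data, window_size=4096, max_len=255):
    # Emit (distance, length, next_char) triples as in the pseudocode above.
    tokens, i = [], 0
    while i < len(data):
        window = data[max(0, i - window_size):i]
        d = l = 0
        # longest repeated occurrence of the upcoming input found entirely in the window
        for length in range(min(len(data) - i - 1, max_len), 0, -1):
            pos = window.rfind(data[i:i + length])
            if pos != -1:
                d, l = len(window) - pos, length
                break
        c = data[i + l] if i + l < len(data) else ""
        tokens.append((d, l, c))
        i += l + 1  # consume the match plus the literal character
    return tokens

print(lz77_tokens("this is amazing. this is amazing."))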

vincentvoillot
Author

In your NCD method you are calculating the gzip output every time. Why not do this just once for each sample and then reuse it in the NCD method? It should reduce the time greatly, if I'm not mistaken. Surely that takes the most time within that single function.
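Caching does help, but only for the single-string compressions; the concatenated pair still has to be compressed for every training sample, and that is the expensive part. A sketch of the caching, with placeholder sample texts:

import gzip

def clen(s):
    return len(gzip.compress(s.encode()))

train_texts = ["loved it", "hated it", "best film of the year", "a total bore"]  # placeholders
train_clens = [clen(t) for t in train_texts]  # computed once, reused for every query

def ncd_vector(x):
    cx = clen(x)  # compressed once per query string
    return [(clen(x + " " + t) - min(cx, ct)) / max(cx, ct)  # concatenation still compressed per pair
            for t, ct in zip(train_texts, train_clens)]

print(ncd_vector("loved every second of it"))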

shocklab
Author

I'm counting on you having already learned that the compressed length gives you a measure of how much information there is in the text.

Basically it is a measure of entropy: when you type words you use ASCII to encode the letters, but you can fit them into a smaller space by grouping them into tokens. Compression essentially tells you close to the least amount needed to encode the text.

I was thinking about how you might improve the distance between two strings, instead of joining them and taking a proportion. I would suggest XORing the strings against each other (matching characters cancel to zero), so similar texts mostly cancel out. The distance is normalized to zero if both texts are the same, and you can divide by the longer length (since the longer text likely has characters with no match in the other).

Now you might begin to get into the philosophical side of the prediction: how much information a text carries, and how to match that with a sentiment.
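As I read it, the suggestion is something like the sketch below (my interpretation, not the commenter's code). Note that XOR only compares bytes at the same positions, so a single inserted character shifts the alignment and inflates the distance, which is exactly the problem the compression-based distance avoids.

def xor_distance(a, b):
    # XOR aligned bytes: identical characters cancel to zero. Count positions that
    # don't cancel, plus the unmatched tail, then normalise by the longer length.
    ab, bb = a.encode(), b.encode()
    mismatched = sum((x ^ y) != 0 for x, y in zip(ab, bb)) + abs(len(ab) - len(bb))
    return mismatched / max(len(ab), len(bb))

print(xor_distance("this is amazing", "this is amazing"))   # 0.0: identical texts cancel out
print(xor_distance("this is amazing", "this was amazing"))  # large: one insertion shifts everything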

israelfigueroa
Author

🎯 Key Takeaways for quick navigation:

00:00 🧩 A low resource text classification method using K nearest neighbors and gzip compression for sentiment analysis.
01:48 📊 The proposal involves compressing text and using normalized compression distances (NCD) as features for the K nearest neighbors classifier.
04:17 ⏱️ The slowest part of the algorithm is computing the NCDs for all training samples, while the K nearest neighbors classification is fast.
06:07 🏅 Achieved around 70% accuracy in sentiment analysis with just 500 samples using K nearest neighbors and NCDs, outperforming random classification.
09:12 ❓ Questions remain about the method's validity, potential problems, and why NCDs alone would be sufficient for sentiment classification.
10:36 ❓ The creator expresses skepticism about the method, questioning the validity of comparing lengths of compressions for sentiment analysis.
11:00 📈 Accuracy varies significantly depending on the sample size, reaching around 75.7% for 10,000 samples.
13:20 ⏱️ Linear NCD calculation for 10,000 samples takes hours, prompting the need for parallelization using multiprocessing.
15:00 💻 Practical usage involves compressing input strings, calculating NCD vectors against training samples, and using K nearest neighbors for sentiment classification.
17:46 🧠 The method's success challenges the dominance of deep learning, reminding us of the value of revisiting first principles and exploring alternative algorithms like K nearest neighbors for NLP tasks.

Made with HARPA AI

OttoZhao-lo
Author

Without reading the paper (and not going back to the code): suppose you concatenate a movie review written in English to one written in Swahili, compress them together, and compare that length against the compressed length of either one alone. Those lengths are going to be quite different, because Swahili and English share basically no features that gzip could deduplicate (deduplicating repeated substrings is basically what the LZ-style half of gzip does).
If OTOH you do the same with two reviews in English, the lengths are going to be way closer.

Now, assuming that similar sentiment is expressed in similar language, the classifier works. What it won't be able to do is distinguish "This movie is better than this other movie" from "This other movie is better than this movie" -- the combined string will compress to only a couple of bytes more, resulting in a low distance, but the sentiment is opposite.
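A quick check of that blind spot, under the same NCD-with-gzip setup (toy sentences):

import gzip

def ncd(x, y):
    cx, cy = len(gzip.compress(x.encode())), len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "This movie is better than this other movie"
b = "This other movie is better than this movie"
c = "The pasta was undercooked and the service was slow"

print(round(ncd(a, b), 3))  # small distance despite the opposite meaning
print(round(ncd(a, c), 3))  # larger distance for genuinely unrelated text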

But then there's the question of how accurate you need your results to be, and how many people are going to come up with attack vectors to confuse your classifier, and whether "good enough" isn't a much better result than "let's throw a language model at it". If anything this kind of thing can work as a baseline: Anything capable of doing actual semantic analysis has to be at least as good or something is fundamentally wrong with it.

aoeuable
Author

This video took me back two years, to when I started learning about these beautiful algorithms from your videos. You're the best. Please make more of this!

ataadevs