Elegant Compression in Text (The LZ 77 Method) - Computerphile

preview_player
Показать описание
Text compression methods such as LZ can reduce file sizes by up to 80%. Professor Brailsford explains the nuts and bolts of how it is done.

This video was filmed and edited by Sean Riley.

Рекомендации по теме
Комментарии
Автор

I did some experimentation.   If you open a zip or gz file with a hex editor there isn't anything human readable.  You have to compress using an older format such as .lzo (Not sure anything in Windows will do this but it's a compress option in Fedora Linux.)   Opening a .lzo file with a hex editor (I used ghex)  shows much of the original text and it's easy to see where text is replaced by pointers as discussed in the video.   You can see more frequent pointers as you move farther along in the document.  The pointers in the .lzo file were two, three and four bytes long.  I could not see any telltale that identifies a pointer versus text.  There has to be some method for the decompression algorithm to distinguish between original text and a pointer.   Thanks for posting the video.   It stirred my curiousity.

grayswandir
Автор

What's with all the hate here?? Can't you just appreciate his work? He's better than my university lecturers and I thank him for that.

zandaya
Автор

I love it so much when there's a computerphile episode for something I need to learn, makes it so much better than a random ppt tutorial.

jakovmatosic
Автор

Actually, wouldn't that stored string – "computer" – include the leading space?
(" computer")
Because that, too, occurs in both instances.

Also, it occurs to me that you'll never want to tag & remember a repeated string of 0, 1, or even 2 characters, because the "compressed" version of any string in this scheme, is 2 characters (bytes) long, so you're not actually saving anything until you have a string of 3 or more characters.
This allows "encrypting" the 4 bits that encode length, to mean the numbers from 3 - 18, rather than the usual 0 - 15.

These, if they are implemented in L-Z, are refinements that would properly be omitted from this overview, but I thought they might be worth bringing up here in the comments.

And BTW, thank you for a very clear and concise explanation of how this scheme works!

ffggddss
Автор

I remember in my computer science classes on compression, we were taught Lempel-Ziv-Welch (LZW). I thought that one was pretty cool, because instead of looking for specific things it has seen, it just constantly makes a dictionary of combinations of characters it has already seen and can refer back to those as it goes along. The best part is that it doesn't have to store the dictionary anywhere, because the decompressor can rebuild it while it's decompressing, based on the compressed data.

AlexanderKrivacsSchrder
Автор

Can I please have an evening in a pub with Professor Brailsford and just have him explain things to me? I wish he were my grandfather, or at the very least mentor of some sort. He's just such and all round excellent bloke. I think his computer science pupils are very lucky to have such a clear speaking and warm hearted teacher.

Professor Brailsford is the best!

TheDave
Автор

Could a pointer point to a pointer instead of having to remember another instance of the word?

grandadie
Автор

Oh, how beautifully he explained the concept! Totally impressed! Now I wish only if our professor in college had explained LZW this elaborately, but he did not.

ruddyrdx
Автор

Wouldn't it be more efficient to have j be the distance to the END of the string you are referencing? That way you can reference strings (l-1) characters farther back with the same number of bits available for j.

Basically, go j characters back and then take that character and the preceding (l-1) ones.

Doniazade
Автор

First of all I would like to thank you and your peers for the time you invest in this videos.
One thing I didn't grasp was if the implementation of the search of a given match is biased(it's prone to a given design) or it's a "closed analytic solution" and what was the creators approach, thank you for your time.

MrCOPYPASTE
Автор

I am absolutely loving this channel. Anywhere I look online for an explination for something computer or network related, like TCP, I can't understand any of it. But this channel gives me a foundation to work from, to understand stuff like this.

isaac
Автор

I think it speaks volumes of prof Brailsford that he can simplify some of the most complex topics and explain them in a way that is a joy to listen to. However when pressed for more details and information he will gladly oblige you and probably tell you all there is to know on the subject! Brilliant man, could listen to him talking on computers all day.

fredhair
Автор

I absolutely love LZ compression algorithms. Beautiful to code, beautiful to see in motion, and pretty darn effective too. Not only that, but it's easily tweaked to work with whatever data you're using just by changing a few constants.

Fantastic.

Bovineprogrammer
Автор

People often say that learning times tables etc in this day and age is a waste of time, but I disagree. If you don't need to use a calculator as but it means that you both intrinsically understand the numbers and you can be much faster. I new an accountant that could add up 3 digit columns as fast as I can mentally do single digits. In his job that would have been worth gold. Knowing powers of 2 in computer science is similar

etmax
Автор

Hopefully fixed now, thanks for spotting! >Sean

Computerphile
Автор

This sounds literally like how we use our language and even our creative process. Constantly referencing previous instances of a given word, idea, or concept... Im really enjoying this channel 

MECKENICALROBOT
Автор

Actually, along with the jump/length tuple, the LZ77 algorithm outputs an optional literal character. So, for example, the very first output of the compressor would be a jump/length tuple with a length of 0, followed by the first byte of the decompressed data.
Implementations differ on how they specify when there's a literal character or how the length/pair tuple is encoded.

felineboy
Автор

It's hilarious that people at school always say: "you shouldn't have to use your brain during summer!", when I am watching Computerphile, and enjoying learning about compression and stuff. I love this!

seanld
Автор

I've not looked into LZ but I am familiar with a lot of compression techniques.
You are of course correct that the has to be some way of differentiation. This can be designed in two different ways.

METHOD 1: Define a header exactly as you say which marks the following (say 16 bits) as a pointer. You will however need to ''escape' any chance of the same header from appearing in a normal run of the text string. Methods of escape can be done with a different header.

ChrisWalshZX
Автор

One important thing missing there, you need some sort of identification that the 2 bytes are a reference and not just more characters. You could set the highest bit of the 2 character string to on but that kind of shows another problem: most text only uses the lower 7 bits of an 8 bit byte. So if you want to compress things it seems odd to leave that because every character is 12% bigger than it needs to be.

phookadude