Elegant Compression in Text (The LZ 77 Method) - Computerphile

Показать описание

Text compression methods such as LZ can reduce file sizes by up to 80%. Professor Brailsford explains the nuts and bolts of how it is done.

This video was filmed and edited by Sean Riley.

Рекомендации по теме

Комментарии

I did some experimentation. If you open a zip or gz file with a hex editor there isn't anything human readable. You have to compress using an older format such as .lzo (Not sure anything in Windows will do this but it's a compress option in Fedora Linux.) Opening a .lzo file with a hex editor (I used ghex) shows much of the original text and it's easy to see where text is replaced by pointers as discussed in the video. You can see more frequent pointers as you move farther along in the document. The pointers in the .lzo file were two, three and four bytes long. I could not see any telltale that identifies a pointer versus text. There has to be some method for the decompression algorithm to distinguish between original text and a pointer. Thanks for posting the video. It stirred my curiousity.

grayswandir

What's with all the hate here?? Can't you just appreciate his work? He's better than my university lecturers and I thank him for that.

zandaya

I love it so much when there's a computerphile episode for something I need to learn, makes it so much better than a random ppt tutorial.

jakovmatosic

Actually, wouldn't that stored string – "computer" – include the leading space?
(" computer")
Because that, too, occurs in both instances.

Also, it occurs to me that you'll never want to tag & remember a repeated string of 0, 1, or even 2 characters, because the "compressed" version of any string in this scheme, is 2 characters (bytes) long, so you're not actually saving anything until you have a string of 3 or more characters.
This allows "encrypting" the 4 bits that encode length, to mean the numbers from 3 - 18, rather than the usual 0 - 15.

These, if they are implemented in L-Z, are refinements that would properly be omitted from this overview, but I thought they might be worth bringing up here in the comments.

And BTW, thank you for a very clear and concise explanation of how this scheme works!

ffggddss

I remember in my computer science classes on compression, we were taught Lempel-Ziv-Welch (LZW). I thought that one was pretty cool, because instead of looking for specific things it has seen, it just constantly makes a dictionary of combinations of characters it has already seen and can refer back to those as it goes along. The best part is that it doesn't have to store the dictionary anywhere, because the decompressor can rebuild it while it's decompressing, based on the compressed data.

AlexanderKrivacsSchrder

Can I please have an evening in a pub with Professor Brailsford and just have him explain things to me? I wish he were my grandfather, or at the very least mentor of some sort. He's just such and all round excellent bloke. I think his computer science pupils are very lucky to have such a clear speaking and warm hearted teacher.

Professor Brailsford is the best!

TheDave

Could a pointer point to a pointer instead of having to remember another instance of the word?

grandadie

Oh, how beautifully he explained the concept! Totally impressed! Now I wish only if our professor in college had explained LZW this elaborately, but he did not.

ruddyrdx

Wouldn't it be more efficient to have j be the distance to the END of the string you are referencing? That way you can reference strings (l-1) characters farther back with the same number of bits available for j.

Basically, go j characters back and then take that character and the preceding (l-1) ones.

Doniazade

First of all I would like to thank you and your peers for the time you invest in this videos.
One thing I didn't grasp was if the implementation of the search of a given match is biased(it's prone to a given design) or it's a "closed analytic solution" and what was the creators approach, thank you for your time.

MrCOPYPASTE

I am absolutely loving this channel. Anywhere I look online for an explination for something computer or network related, like TCP, I can't understand any of it. But this channel gives me a foundation to work from, to understand stuff like this.

isaac

I think it speaks volumes of prof Brailsford that he can simplify some of the most complex topics and explain them in a way that is a joy to listen to. However when pressed for more details and information he will gladly oblige you and probably tell you all there is to know on the subject! Brilliant man, could listen to him talking on computers all day.

fredhair

I absolutely love LZ compression algorithms. Beautiful to code, beautiful to see in motion, and pretty darn effective too. Not only that, but it's easily tweaked to work with whatever data you're using just by changing a few constants.

Fantastic.

Bovineprogrammer

People often say that learning times tables etc in this day and age is a waste of time, but I disagree. If you don't need to use a calculator as but it means that you both intrinsically understand the numbers and you can be much faster. I new an accountant that could add up 3 digit columns as fast as I can mentally do single digits. In his job that would have been worth gold. Knowing powers of 2 in computer science is similar

etmax

Hopefully fixed now, thanks for spotting! >Sean

Computerphile

This sounds literally like how we use our language and even our creative process. Constantly referencing previous instances of a given word, idea, or concept... Im really enjoying this channel

MECKENICALROBOT

Actually, along with the jump/length tuple, the LZ77 algorithm outputs an optional literal character. So, for example, the very first output of the compressor would be a jump/length tuple with a length of 0, followed by the first byte of the decompressed data.
Implementations differ on how they specify when there's a literal character or how the length/pair tuple is encoded.

felineboy

It's hilarious that people at school always say: "you shouldn't have to use your brain during summer!", when I am watching Computerphile, and enjoying learning about compression and stuff. I love this!

seanld

I've not looked into LZ but I am familiar with a lot of compression techniques.
You are of course correct that the has to be some way of differentiation. This can be designed in two different ways.

METHOD 1: Define a header exactly as you say which marks the following (say 16 bits) as a pointer. You will however need to ''escape' any chance of the same header from appearing in a normal run of the text string. Methods of escape can be done with a different header.

ChrisWalshZX

One important thing missing there, you need some sort of identification that the 2 bytes are a reference and not just more characters. You could set the highest bit of the 2 character string to on but that kind of shows another problem: most text only uses the lower 7 bits of an 8 bit byte. So if you want to compress things it seems odd to leave that because every character is 12% bigger than it needs to be.

phookadude

Elegant Compression in Text (The LZ 77 Method) - Computerphile

Elegant Compression in Text (The LZ 77 Method) - Computerphile

What Is the Difference in Lossy vs Lossless Compression?

Compression - Computerphile

ChatGPT has its own compression Algorithm

these compression algorithms could halve our image file sizes (but we don't use them) #SoMEpi

The Beauty of Lempel-Ziv Compression

Abstraction Can Make Your Code Worse

Huffman Codes: An Information Theory Perspective

Video Compression Is Magical

text analysis and lossless compression

CMPRSN (Compression Overview) - Computerphile

How PNG Works: Compromising Speed for Quality

First Contentful Paint (FCP): What it is & How to Optimize your Website for it

Build a Simple File Compression Tool with Huffman Coding! 📂 | Coding Challenge #codechallenge

Building Microservices with Go: 12 Using Gzip compression for HTTP responses

JPEG 'files' & Colour (JPEG Pt1)- Computerphile

Encodec: High Fidelity Neural Audio Compression Explained

COMP526 Unit 7-4 2020-03-23 Compression: Run-Length Encoding

5 Step technique for cutting wood under tension/compression with a bore cut

Bro’s hacking life 😭🤣

The Unreasonable Effectiveness of JPEG: A Signal Processing Approach

Explaining Audio 'Compression' in 30 seconds. 🗜️🎶

1-Minute Photoshop - Juicy HDR Effect!

Data Compression as Fast As Possible