finding plain text files vs. binary files on unix or linux - this is surprisingly tough.

Yo what's up everyone my name's dave and you suck at programming.

📖 Keywords
you suck at programming #programming
#devops #bash #linux #unix #software #terminal #shellscripting #tech #stem
Comments
Author

The real answer is "send it back to the business analyst and ask for better defined acceptance criteria"

daddysolaire
Author

You just turn a plain text file into a ship of theseus.

daddysolaire
Author

It's not an easy or hard problem; it isn't well defined in the first place.

All files are binary - they consist of bytes; that's the only thing that can be held in memory or stored on disk. Text is *one way* of interpreting that binary content, and not even a well-defined way in and of itself; you need an encoding. You can check whether some data is valid according to the encoding rules, but lots of text encoding rules say every byte is always meaningful. You need more heuristics after that to decide whether the resulting interpreted text actually makes sense to use as text. And then obvious rules ("are these lines super long?") might not be relevant depending on what you want to use the text for.

la.zanmal.
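
To make the two-step idea above concrete, here is a minimal shell sketch: first check validity under one specific encoding, then apply one extra heuristic. UTF-8 is an assumed encoding, `looks_like_utf8_text` is a made-up helper name, and `iconv` is assumed to be installed. Passing both checks only means the bytes can be read as text, not that the result is useful as text.

```sh
# Sketch: "valid in the encoding" and "plausibly text" are separate questions.
# looks_like_utf8_text is a hypothetical helper; UTF-8 is an assumed encoding.
looks_like_utf8_text() {
    f=$1
    # Step 1: encoding check. iconv exits non-zero on byte sequences
    # that are not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || return 1
    # Step 2: heuristic. NUL is valid UTF-8 but rare in real text,
    # so reject the file if stripping NUL bytes changes its size.
    total=$(wc -c < "$f")
    without_nul=$(tr -d '\000' < "$f" | wc -c)
    [ "$total" -eq "$without_nul" ]
}
```

Swapping UTF-8 for a single-byte encoding such as ISO-8859-1 would make step 1 accept almost anything, which is exactly the "every byte is always meaningful" problem.
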
Author

Not to be a complainer, but if it were me and I were making a video (which I’m not) I would at least research what an executable file header looks like prior to taking to the Internet to pontificate. On the other hand, thanks for all the free knowledge!

gregoryshields
Author

`find . -type f -exec file {} \; | grep "text"` will help you filter down to the files that are more likely text, though.

You're right, it's a nearly impossible question, but `file` does 95% of the work for you.

greyfade
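
A slightly stricter variant of greyfade's one-liner, as a sketch: it asks `file` for the MIME type only (`-b --mime-type`) and keeps paths whose type starts with `text/`, so a file that merely has "text" in its name is not matched. `list_text_files` is a hypothetical helper name, and the result is still only as good as `file`'s guess.

```sh
# Sketch: list files whose MIME type, as guessed by file(1), starts with "text/".
# list_text_files is a hypothetical helper name, not a standard tool.
list_text_files() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            case $(file -b --mime-type "$f") in
                text/*) printf "%s\n" "$f" ;;
            esac
        done
    ' sh {} +
}
```

Usage: `list_text_files /some/dir` prints one matching path per line; with no argument it starts from the current directory.
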
Author

I love just testing things out and seeing where I go. Then never having to do that ever again and completely forgetting how I did it before… good times… good times

jeh_io
Author

I'd start with the `file` command (it checks the magic number). But clearly I've not plumbed the depths of this problem or considered bad actors.

robertkielty
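
For anyone wondering what "checks the magic number" amounts to, here is a toy sketch that compares a file's first four bytes with a handful of well-known signatures. `magic_kind` is a hypothetical helper and the table is deliberately tiny; `file` consults a far larger database. Note that `unknown` does not mean "text", and anyone can put these bytes at the start of any file, which is the bad-actor problem.

```sh
# Sketch: compare a file's leading bytes with a few well-known signatures.
# magic_kind is a hypothetical helper; the signature table is illustrative only.
magic_kind() {
    # First 4 bytes as lowercase hex, e.g. "7f454c46".
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case $sig in
        7f454c46)  echo elf ;;      # 0x7f 'E' 'L' 'F'
        89504e47)  echo png ;;      # 0x89 'P' 'N' 'G'
        25504446)  echo pdf ;;      # '%' 'P' 'D' 'F'
        504b0304)  echo zip ;;      # 'P' 'K' 0x03 0x04
        1f8b*)     echo gzip ;;     # two-byte signature
        *)         echo unknown ;;
    esac
}
```
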
Author

You might be able to get by with the `file` command. It determines the MIME type if you're lucky... but it can also be tricked...

gerke_kok
Author

Most Western languages have a thing for vowels, and most files are written in Western languages. (Not only prose; you have to consider source code too.)
Now do some statistics: count how often each byte value occurs. Does the file contain the ASCII codes for e, a, and i unusually often? Then consider it a text file.

Or just use grep.

hansdampf
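
A rough sketch of the byte-statistics idea above: dump the file as decimal byte values with `od` and let `awk` compute the share that are the ASCII letters e, a, or i. `vowel_share` is a hypothetical helper name, and counting upper case too is my assumption. Any cutoff you apply to the result is a guess, not an established constant.

```sh
# Sketch: share of bytes that are the ASCII letters e, a, or i (either case).
# vowel_share is a hypothetical helper; the choice of letters follows the comment.
vowel_share() {
    od -An -v -tu1 "$1" | LC_ALL=C awk '
        { for (i = 1; i <= NF; i++) {
              n++
              # 101=e 97=a 105=i, 69=E 65=A 73=I
              if ($i == 101 || $i == 97 || $i == 105 ||
                  $i == 69  || $i == 65 || $i == 73) v++
          } }
        END { if (n == 0) print "0.00"; else printf "%.2f\n", v / n }'
}
```

For uniformly random bytes the expected share is 6/256, roughly 0.02; English prose typically scores far higher, while text in other scripts may not, so treat this as one signal among several.
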
Author

If I remember correctly, there were three magic bytes 🤔🧐

Kalvi-Karing
Author

In my mind: printer driver not working on some particular day because of a bug in the `file` command

kf
Author

The magic number is arbitrarily long, and some file formats actually have more than one magic number; HTML comes to mind. [HTML has either <!DOCTYPE html> or <html> as its magic number.] (In fact, the whole point of the doctype tag is to tell the HTML interpreter what type of document it has, because XML and SVG are things. Why is XML still a thing?)
The issue is that a character is a character because ASCII takes up the whole byte, so a validly formatted ANSI stream buffer is indistinguishable from any other binary data. [That being said, a "plain text" file probably won't have any code points less than 0x20 other than tab, line feed, and carriage return, but it would be valid ASCII for it to.]

skeleton_craftGaming
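
The 0x20 remark above suggests a cheap test: count control bytes. Tab, line feed, and carriage return are themselves below 0x20 and routine in text, so this sketch leaves them out of the count. `control_byte_count` is a hypothetical helper name, and zero control bytes is a hint, not proof, that the file is text.

```sh
# Sketch: count bytes below 0x20 other than tab (0x09), LF (0x0A), CR (0x0D).
# control_byte_count is a hypothetical helper name.
control_byte_count() {
    # tr -cd keeps only the listed bytes; the ranges skip \011, \012, \015.
    n=$(LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < "$1" | wc -c)
    # Arithmetic expansion strips any padding wc may add.
    echo $((n))
}
```
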
Author

It's for sure that in a UNIX system everything is a file.

ArmandoCalderon
Author

Why not use the `file` utility? It tells you what the file is (image, text, ...).

amorsmor
Author

Doesn’t the `file` command have this built in?

SupportToolsBlog
Author

how about: index all files and create a database table or a json file with the following schema: filename, directory_path, plaintext_percent

jaytea
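
A minimal sketch of jaytea's index, emitting one JSON object per line with the three proposed fields. `plaintext_percent` and `index_tree` are hypothetical helper names. Two assumptions are baked in: "plaintext" means tab, LF, CR, or printable ASCII (0x20 to 0x7E), and paths contain no quotes, backslashes, or newlines, since nothing here does JSON escaping.

```sh
# Sketch: per-file "plaintext percent" plus a JSON-lines index of a tree.
# Both function names are hypothetical; no JSON escaping is performed.
plaintext_percent() {
    od -An -v -tu1 "$1" | LC_ALL=C awk '
        { for (i = 1; i <= NF; i++) {
              n++
              # Tab, LF, CR, or printable ASCII counts as plaintext.
              if ($i == 9 || $i == 10 || $i == 13 || ($i >= 32 && $i <= 126)) t++
          } }
        END { if (n == 0) print 0; else printf "%d\n", 100 * t / n }'
}

index_tree() {
    find "${1:-.}" -type f | while IFS= read -r path; do
        printf '{"filename":"%s","directory_path":"%s","plaintext_percent":%s}\n' \
            "$(basename "$path")" "$(dirname "$path")" "$(plaintext_percent "$path")"
    done
}
```

Usage: `index_tree /some/dir > index.jsonl`. The percentage is truncated to a whole number, and UTF-8 text with many non-ASCII characters will score low under this definition.
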
Author

If they aren't compressed, surely entropy is a good measure and could probably reliably tell you if something is text or binary.

You could definitely also use some machine learning library tools for this.

CosumuHcos
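
The entropy idea above can be sketched in a few lines: count how often each byte value occurs and compute the Shannon entropy in bits per byte. `byte_entropy` is a hypothetical helper name, and this is an order-0 estimate only (it ignores the order of the bytes).

```sh
# Sketch: Shannon entropy of a file in bits per byte
# (0 = one repeated byte value, 8 = all 256 values equally likely).
# byte_entropy is a hypothetical helper name.
byte_entropy() {
    od -An -v -tu1 "$1" | LC_ALL=C awk '
        { for (i = 1; i <= NF; i++) { count[$i]++; n++ } }
        END {
            if (n == 0) { print "0.00"; exit }
            for (b in count) {
                p = count[b] / n
                h -= p * log(p) / log(2)
            }
            printf "%.2f\n", h
        }'
}
```

Compressed or encrypted data sits close to 8, and natural-language text is usually well below that, but plenty of uncompressed binary formats fall in between, so the number is a signal rather than a verdict.
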
Author

Man, you're outside with nice weather, most likely some grass around to touch. And you're talking about this nonsense.

mikemiller