PDF Parser in C | Extracting Text

Today let's take a look at the PDF file format. In this video we will write a program that extracts text information from a PDF file using C and zlib.

If you enjoy my work, consider buying me a cup of coffee for getting through those long coding sessions :)

Chapters:
00:00 Intro
04:46 Code
36:15 Results and Outro
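
As a rough illustration of the text-extraction step mentioned in the description: once a page's content stream has been inflated (for example with zlib), much of the visible text sits in literal strings such as `(Hello) Tj`. The sketch below is not the code from the video. `extract_literal_strings` is a hypothetical helper that assumes the stream is already decompressed and simply collects every parenthesized string, one per line.

```c
#include <stddef.h>

/* Hypothetical helper (not from the video): copies the contents of every
 * literal string "( ... )" in an already-decompressed content stream into
 * out, one string per line. Returns the number of bytes written, not
 * counting the terminating NUL. Output is silently truncated at cap - 1. */
size_t extract_literal_strings(const char *src, size_t len,
                               char *out, size_t cap)
{
    size_t n = 0;
    int depth = 0;                     /* nesting depth of unescaped parentheses */

    if (cap == 0) return 0;
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (depth == 0) {
            if (c == '(') depth = 1;   /* a literal string starts here */
            continue;                  /* operators and operands are skipped */
        }
        if (c == '\\' && i + 1 < len) {
            /* Simplification: the byte after a backslash is taken as-is.
             * This is right for \( \) and \\ but not for \n or octal codes. */
            c = src[++i];
        } else if (c == '(') {
            depth++;                   /* balanced inner parenthesis */
        } else if (c == ')') {
            depth--;
            if (depth == 0) c = '\n';  /* end of string: emit a separator */
        }
        if (n + 1 < cap) out[n++] = c;
    }
    out[n] = '\0';
    return n;
}
```

Called on `BT /F1 12 Tf (Hello) Tj ET`, this writes `Hello` followed by a newline. It ignores hex strings (`<...>`), text positioning, and font encodings, so real files will often need more than this.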
Comments
Author

I am always so amazed by how Alex and Tsoding make programming look so easy.
They aren't trying to use every complex feature the language has but just what can get the job done.

acestandard
Author

The objects don't necessarily have to start immediately after the header lines. Since objects are all located by file offsets in the xref table at the end, you could hide data between lines 3 and 4 (adjusting the xrefs of course) and most software should ignore it.

luserdroog
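
To make the point about offsets concrete: a reader typically starts from the end of the file, finds the `startxref` keyword, and jumps to the byte offset written after it, so bytes that no offset points at are never visited. The following is a minimal sketch of that first step only. `find_startxref` is a hypothetical name, and the whole file is assumed to be in memory already.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the byte offset that follows the last
 * "startxref" keyword in buf, or -1 if the keyword or number is missing. */
long find_startxref(const char *buf, size_t len)
{
    static const char key[] = "startxref";
    const size_t klen = sizeof key - 1;

    if (len < klen) return -1;

    /* Scan backwards so the last occurrence in the file wins. */
    for (size_t i = len - klen + 1; i-- > 0; ) {
        if (memcmp(buf + i, key, klen) != 0) continue;

        /* Copy what follows into a NUL-terminated scratch buffer so that
         * strtol cannot read past the end of buf. */
        char num[32];
        size_t rest = len - (i + klen);
        if (rest >= sizeof num) rest = sizeof num - 1;
        memcpy(num, buf + i + klen, rest);
        num[rest] = '\0';

        char *end;
        long off = strtol(num, &end, 10);   /* skips the leading newline */
        return (end == num || off < 0) ? -1 : off;
    }
    return -1;
}
```

The returned offset is where the `xref` section begins. Each entry there gives the offset of one object, which is why padding between objects goes unnoticed as long as the offsets are adjusted. Files that use cross-reference streams instead of a classic table are not covered by this sketch.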
Author

Could this kind of approach be used to extract text from PDFs that have columnar text? For example, a scientific article may be organized in 2 columns. The text is read:
- 1st column -> top to bottom
- 2nd column -> top to bottom

There are Python libraries that do not extract this text in order.

Also, what about tables? Could text be extracted from tables in order, using separators for both rows and columns?

JaimeSanchoMolero
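
One possible way to approach the reading-order question, sketched under strong assumptions: each text fragment already carries a page position (which a parser would have to track from the text-positioning operators, not shown here), the page has exactly two columns, and the x coordinate separating them is known. The `Fragment` type and `order_two_columns` are hypothetical names for illustration.

```c
#include <stdlib.h>

/* Hypothetical fragment record: a piece of text plus its page position.
 * PDF user space has y growing upward, so a larger y is higher on the page. */
typedef struct {
    double x, y;
    const char *text;
    int column;          /* filled in by order_two_columns */
} Fragment;

static int cmp_reading_order(const void *pa, const void *pb)
{
    const Fragment *a = pa;
    const Fragment *b = pb;

    if (a->column != b->column) return a->column - b->column; /* left column first */
    if (a->y != b->y) return (a->y > b->y) ? -1 : 1;          /* top to bottom */
    if (a->x != b->x) return (a->x < b->x) ? -1 : 1;          /* left to right */
    return 0;
}

/* Assigns each fragment to the left (0) or right (1) column using the
 * assumed split position, then sorts into reading order. */
void order_two_columns(Fragment *frags, size_t count, double split_x)
{
    for (size_t i = 0; i < count; i++)
        frags[i].column = (frags[i].x < split_x) ? 0 : 1;
    qsort(frags, count, sizeof *frags, cmp_reading_order);
}
```

Detecting the split automatically, handling full-width titles, and grouping table cells into rows are separate problems. A similar sort by row and then by x, joined with a separator, could be a starting point for simple tables.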
Author

This is great. +1 sub and looking forward to more

hi_arav
Author

This is good. I wonder why PDF readers don't allow this kind of functionality? Maybe the big corporations don't want you to download their images, bruh.

korigamik
Author

I am trying it in pure Go, with no libraries.

YabseraPython
Author

Without any library?
Brave... I did text extraction using a library which broke the PDF into all its logical parts, and it was still hard because of character maps.
Beware that a PDF can be constructed in many ways, so your parser will probably fail on many of them.

AK-vxdy