Marker: This Open-Source Tool will make your PDFs LLM Ready

In this video, I discuss the challenges of working with PDFs for LLM applications and introduce you to an open-source tool called Marker. Marker simplifies the conversion of complex PDF files into structured Markdown, making data extraction much easier. I compare Marker with Nougat, showing its superior performance in preserving document structure. Additionally, I give a detailed tutorial on installing Marker, using it to convert single or multiple PDF files, and review some example results. If you're interested in efficient data preprocessing for LLMs, this video is for you!
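
For reference, here is a rough sketch of the workflow covered in the tutorial, based on the commands in Marker's README around the time of recording (the package name, commands, and flags are assumptions from that README and may differ in newer releases; paths are placeholders):

pip install torch        # install PyTorch first, choosing the build for your OS/GPU from pytorch.org
pip install marker-pdf   # Marker itself

# convert a single PDF to Markdown
marker_single /path/to/file.pdf /path/to/output/folder

# convert every PDF in a folder, using multiple worker processes
marker /path/to/input/folder /path/to/output/folder --workers 4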

Sign up for Advanced RAG:

LINKS:

TIMESTAMPS
00:00 Introduction: The Importance of Good Data for LLM Applications
00:13 Challenges of Working with PDFs
00:43 Approaches to Make PDFs LLM Ready
01:10 Advantages of Using Markdown
01:31 Introducing Marker: An Open Source Tool
02:19 Marker vs. Nougat: Performance Comparison
03:35 Features and Limitations of Marker
05:45 Installation and Setup of Marker
07:34 Converting PDFs to Markdown: Step-by-Step Guide
08:21 Examples and Results
13:32 Conclusion and Future Videos

All Interesting Videos:

Comments

🎯 Key points for quick navigation:

00:00 *📄 Introduction and Challenges with PDFs*
- Introduction to the video topic,
- Challenges of extracting data from PDFs for LLM applications,
- Different elements and structures in PDFs complicating extraction.
01:09 *🔧 Existing Approaches to PDF Conversion*
- Overview of methods to convert PDFs to plain text,
- Use of machine learning models and OCR for extraction,
- Comparison of PDFs to Markdown for ease of processing.
02:17 *🛠️ Introduction to Marker Tool*
- Introduction to the Marker tool for converting PDFs to Markdown,
- Comparison with other tools like Nougat,
- Performance and accuracy benefits of using Marker.
03:36 *📚 Features of Marker*
- Supported document types and languages,
- Removal of headers, footers, and artifacts,
- Formatting of tables and code blocks, image extraction,
- Limitations and operational capabilities on different systems.
05:00 *📝 Licensing and Limitations*
- Licensing terms based on organizational revenue,
- Limitations in converting equations and formatting tables,
- Discussion on practical limitations noticed in usage.
05:54 *💻 Setting Up and Installing Marker*
- Steps to create a virtual environment for Marker,
- Instructions for installing PyTorch based on OS,
- Detailed steps to install Marker and optional OCR package.
07:31 *🧪 Example Conversion Process*
- Steps to convert a single PDF file to Markdown,
- Explanation of command parameters and process flow,
- Initial example with a scientific paper.
10:10 *📊 Reviewing Conversion Output*
- Review of the output structure and accuracy,
- Metadata extraction and image handling,
- Preview of converted Markdown and comparison with the original PDF.
12:13 *📜 Additional Examples and Output Review*
- Example with Andrew Ng’s CV and another paper,
- Review of the extracted content and any noticed issues,
- Importance of secondary post-processing for accuracy.
13:34 *🎥 Conclusion and Future Content*
- Summary of Marker tool’s utility and performance,
- Announcement of future videos on related topics,
- Invitation to subscribe for more content.

Made with HARPA AI

ilianos

Man, a couple of weeks ago I was fighting this PDF chaos. Thanks for your video.

ernestuz

Can you make a tutorial on the next step? I mean, how would we pass these markdown files to the LLM and the vector database?

GorkaBiurrun-qj

If anyone is having trouble where it's running but not actually placing the new files in the output directory: if you followed the GitHub example command, the "--min_length 10000" is what's doing it. It simply goes through that whole process and then decides it's too short. Either reduce that number to a much lower number of chars or remove the option entirely. Thirty minutes of hunting through TMP folders for the files before I finally figured it out.
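
For example, a hedged version of the batch command with the threshold lowered (paths are placeholders; the flag name is the one from the repo's example command mentioned above):

marker /path/to/input/folder /path/to/output/folder --workers 4 --min_length 1000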

kineticraft

Thanks for the video. I tried it, but it made a bit of a mess of the tables in my PDF, so it's not working really well for me. The rest of the text gets resolved properly, but the tables not so much; only some of them come out nicely structured.

Alvaro-cszs

Brilliant vid - it is a godsend. OCRing a PDF is just not workable, period. I had given up on trying to parse PDFs. This new information is amazing and I am once again excited.

gregsLyrics

Have you tried scanned documents instead of digital PDFs? And handwritten text as well?

anandu

What if I have a table inside an image? Is it able to extract the data properly?

bhavya-cc

Thanks for covering Marker, this is brilliant!!
Would love to see batch processing of PDFs using Marker.
Also, for the web scraping projects, can we include one where we scrape apartment rental data (that keeps changing/evolving) from websites like Craigslist, store it persistently in a vector store or DB, and then run queries on that info?

ai-whisperer

If you do make a video about scraping data, please go over content that requires javascript to load. It’s been difficult to find a clear guide specifically for capturing this data for LLM usage. I loved this video, thank you!

greymooses

Amazing, can't wait to test it. Converting maths from PDF to LaTeX used to cost thousands of dollars; now it's free.

synthclub

Thank you so much for the video it was really helpful.

Can you please make a video on how to convert this complex extracted raw text/Markdown into an LLM-ready dataset? My aim is to fine-tune a model that can answer questions and even summarize from complex PDFs/extracted text (including tables, equations, etc.),
but I'm confused about how to convert this data into a reasonable format that can easily be fed to an LLM for fine-tuning.

Any guide or video will be really helpful. Please respond. I'm willing to pay consultant fees as well.
Many Thanks.

Sara-fpzw

I actually tried it out today before seeing this video and sadly it produced quite messed-up results for a not-so-complicated document. Some sections and tables were parsed perfectly, but even with only some scrambled parts the results are useless :/

tedp

I am getting a float error. I have installed the CUDA version. Any suggestions?

samarthmath

Very interesting. Once you have converted the PDF file, how can we give all this info to the vector database for RAG?

mariongully

This is really helpful for preparing PDFs before adding them to RAG. But is there any way to install this Marker application as a Docker container?

drmetroyt

Marker only used 4 GB of VRAM out of an A6000; can you increase the batch size and get some more speed gain? Or is it stuck at that speed regardless of the batch size? 100 seconds per page is a huge improvement over Nougat, but still very slow 😢

I love the video tho, I struggled with this one time for hours making a custom script to scrape this one pdf. Definitely gonna use marker sometime soon.

Nick_With_A_Stick

Tabula-py or this? Which is better when it comes to extracting tables?

stanTrX

Are there ways I can also convert comments/annotations into a markdown format?

chauyuhin

The git repository isn't mentioned in the video, only whatever you've given in the description box.
