Marker: This Open-Source Tool will make your PDFs LLM Ready

In this video, I discuss the challenges of working with PDFs for LLM applications and introduce you to an open-source tool called Marker. Marker simplifies the conversion of complex PDF files into structured Markdown, making data extraction much easier. I compare Marker with Nougat, showing its superior performance in preserving document structure. Additionally, I give a detailed tutorial on installing Marker, using it to convert single or multiple PDF files, and review some example results. If you're interested in efficient data preprocessing for LLMs, this video is for you!
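
For reference, here is a rough sketch of the workflow covered in the tutorial, based on the commands in Marker's README around the time of recording (the package name, commands, and flags are assumptions from that README and may differ in newer releases; paths are placeholders):

pip install torch        # install PyTorch first, choosing the build for your OS/GPU from pytorch.org
pip install marker-pdf   # Marker itself

# convert a single PDF to Markdown
marker_single /path/to/file.pdf /path/to/output/folder

# convert every PDF in a folder, using multiple worker processes
marker /path/to/input/folder /path/to/output/folder --workers 4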

Sign up for Advanced RAG:

LINKS:

TIMESTAMPS
00:00 Introduction: The Importance of Good Data for LLM Applications
00:13 Challenges of Working with PDFs
00:43 Approaches to Make PDFs LLM Ready
01:10 Advantages of Using Markdown
01:31 Introducing Marker: An Open Source Tool
02:19 Marker vs. Nougat: Performance Comparison
03:35 Features and Limitations of Marker
05:45 Installation and Setup of Marker
07:34 Converting PDFs to Markdown: Step-by-Step Guide
08:21 Examples and Results
13:32 Conclusion and Future Videos

All Interesting Videos:

Comments

🎯 Key points for quick navigation:

00:00 *📄 Introduction and Challenges with PDFs*
- Introduction to the video topic,
- Challenges of extracting data from PDFs for LLM applications,
- Different elements and structures in PDFs complicating extraction.
01:09 *🔧 Existing Approaches to PDF Conversion*
- Overview of methods to convert PDFs to plain text,
- Use of machine learning models and OCR for extraction,
- Comparison of PDFs to Markdown for ease of processing.
02:17 *🛠️ Introduction to Marker Tool*
- Introduction to the Marker tool for converting PDFs to Markdown,
- Comparison with other tools like Nougat,
- Performance and accuracy benefits of using Marker.
03:36 *📚 Features of Marker*
- Supported document types and languages,
- Removal of headers, footers, and artifacts,
- Formatting of tables and code blocks, image extraction,
- Limitations and operational capabilities on different systems.
05:00 *📝 Licensing and Limitations*
- Licensing terms based on organizational revenue,
- Limitations in converting equations and formatting tables,
- Discussion on practical limitations noticed in usage.
05:54 *💻 Setting Up and Installing Marker*
- Steps to create a virtual environment for Marker,
- Instructions for installing PyTorch based on OS,
- Detailed steps to install Marker and optional OCR package.
07:31 *🧪 Example Conversion Process*
- Steps to convert a single PDF file to Markdown,
- Explanation of command parameters and process flow,
- Initial example with a scientific paper.
10:10 *📊 Reviewing Conversion Output*
- Review of the output structure and accuracy,
- Metadata extraction and image handling,
- Preview of converted Markdown and comparison with the original PDF.
12:13 *📜 Additional Examples and Output Review*
- Example with Andrew Ng’s CV and another paper,
- Review of the extracted content and any noticed issues,
- Importance of secondary post-processing for accuracy.
13:34 *🎥 Conclusion and Future Content*
- Summary of Marker tool’s utility and performance,
- Announcement of future videos on related topics,
- Invitation to subscribe for more content.

Made with HARPA AI

ilianos

Man, a couple of weeks ago I was fighting this PDF chaos. Thanks for your video.

ernestuz

Can you make a tutorial on the next step? I mean, how would we pass these markdown files to the LLM and the vector database?

GorkaBiurrun-qj

If anyone is having trouble where it's running but not actually placing the new files in the output directory: if you followed the GitHub example command, the "--min_length 10000" is what's doing it. It simply goes through that whole process and then decides it's too short. Either reduce that number to a much lower number of chars or remove the option entirely. Thirty minutes of hunting through TMP folders for the files before I finally figured it out.
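
For example, a hedged version of the batch command with the threshold lowered (paths are placeholders; the flag name is the one from the repo's example command mentioned above):

marker /path/to/input/folder /path/to/output/folder --workers 4 --min_length 1000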

kineticraft

Thanks for the video. I tried it, but it made a bit of a mess of the tables in my PDF, so it's not working really well for me. The rest of the text gets resolved properly, but the tables not so much; only some of them come out nicely structured.

Alvaro-cszs

Brilliant vid - it is a godsend. OCRing a PDF is just not workable, period. I had given up on trying to parse PDFs. This new information is amazing and I am once again excited.

gregsLyrics

Have you tried scanned documents instead of digital PDFs? And handwritten text as well?

anandu

What if I have a table inside an image? Is it able to extract the data properly?

bhavya-cc

Thanks for covering Marker, this is brilliant!!
Would love to see batch processing of PDFs using Marker.
Also, for the web scraping projects, can we include one where we scrape apartment rental data (that keeps changing/evolving) from websites like Craigslist, store it persistently in a vector store or DB, and then run queries on that info?

ai-whisperer

If you do make a video about scraping data, please go over content that requires javascript to load. It’s been difficult to find a clear guide specifically for capturing this data for LLM usage. I loved this video, thank you!

greymooses

Amazing, can't wait to test it. Converting maths from PDF to LaTeX used to cost thousands of dollars; now it's free.

synthclub

Thank you so much for the video it was really helpful.

Can you please make a video on how to convert this complex extracted raw text/Markdown into an LLM-ready dataset? My aim is to fine-tune a model that can answer questions and even summarize from complex PDFs/extracted text (including tables, equations, etc.),
but I'm confused about how to convert this data into a reasonable format that can easily be fed to an LLM for fine-tuning.

Any guide or video will be really helpful. Please respond. I'm willing to pay consultant fees as well.
Many Thanks.

Sara-fpzw

I actually tried it out today before seeing this video and sadly it produced quite messed-up results for a not-so-complicated document. Some sections and tables were parsed perfectly, but even with only some scrambled parts the results are useless :/

tedp

I am getting a float error. I have installed the CUDA version. Any suggestions?

samarthmath

Very interesting. Once you have converted the PDF file, how can we give all this info to the vector database for RAG?

mariongully

This is really helpful for preparing PDFs before adding them to RAG. But is there any way to install this Marker application as a Docker container?

drmetroyt

Marker only used 4 GB of VRAM out of an A6000; can you increase the batch size and get some more speed gain? Or is it stuck at that speed regardless of the batch size? 100 seconds per page is a huge improvement over Nougat, but still very slow 😢

I love the video tho, I struggled with this one time for hours making a custom script to scrape this one pdf. Definitely gonna use marker sometime soon.

Nick_With_A_Stick

Tabula-py or this? Which is better when it comes to extracting tables?

stanTrX

Are there ways I can also convert comments/annotations into a markdown format?

chauyuhin

The git repository isn't mentioned in the video, only whatever you've given in the description box.
