Extract Text from PDFs & Images for LLMs Using Python

This video presents a few techniques for efficiently extracting text from any type of document.
After completing this tutorial, you will have a clear idea of which tool to use depending on your use case.

About me:
Hey there! My name is Zoumana. I got my Bachelor's in Mathematics and Computer Science, and my Master's in Computer Science with concentrations in Machine Learning & Data Science in Paris. I hold a second Master's in Data Science & Business from Texas Tech University. I want to share my expertise with you and help you grow 📈 in your career as a Data Scientist so that you can become an expert yourself.
#python #datascience #ocr #langchain #machinelearning

Tech With Zoum

Comments

Thank you for this video. I needed this information at this very moment.

eloghosaikponmwoba

Please subscribe and like the video to help me stay motivated to make awesome videos like this one. :)

techwithzoum

Fantastic tutorial, so well simplified. Great job!

AbdulAhad-Family

Excellent tutorial, thank you. Your video was referenced on AI Jason's channel and I am glad he did.

kenchang

The voice is very clear and crisp, which makes it easy for non-native English speakers to understand. The content is excellent and explained exquisitely. For tables in a PDF or Word document, can we use RAG and prompting to join tables, filter table data, and perform other operations with Gen AI?

SMCGPRA

Hey, I'm getting an error:
in isfile(path)
28 """Test whether a path is a regular file"""
29 try:
---> 30 st = os.stat(path)
31 except (OSError, ValueError):
32 return False

TypeError: stat: path should be string, bytes, os.PathLike or integer, not JpegImageFile

Can you help me with a fix?

rashmikasaha
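
The traceback above suggests that an opened image object (`JpegImageFile` is Pillow's class for JPEG images) was passed to code that expects a file path. Below is a minimal sketch of one possible workaround, assuming the loader in question accepts a plain path string. The names `to_path` and `StandInImage` are hypothetical and introduced only for illustration. The stand-in class mimics the `filename` attribute that Pillow sets on images created with `Image.open()`, so the sketch runs without Pillow installed.

```python
import os
from pathlib import Path


class StandInImage:
    """Hypothetical stand-in for a Pillow image opened from disk (not the real class)."""

    def __init__(self, filename):
        # Pillow stores the source path in `.filename` for images made by Image.open().
        self.filename = filename


def to_path(source):
    """Return a plain path string for a path-like value or an opened image object."""
    # Strings and os.PathLike objects are already usable as paths.
    if isinstance(source, (str, os.PathLike)):
        return os.fspath(source)
    # Fall back to the `.filename` attribute; in-memory images have none.
    filename = getattr(source, "filename", "")
    if isinstance(filename, (str, os.PathLike)) and filename:
        return os.fspath(filename)
    raise TypeError(f"cannot derive a file path from {type(source).__name__}")


# Usage: normalise the input before handing it to a path-based loader.
image = StandInImage("invoice.jpg")
print(to_path(image))
print(to_path(Path("scans") / "page1.png"))
```

If the image exists only in memory (so it has no source file), this sketch raises a `TypeError` instead of guessing; saving the image to disk first and passing that path would be one way around it.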

What is the name of the web page where you write the code?

cbylnrh

Hi there, I plan to use the EasyOCR library for some sensitive documents. Is it safe, or could data leaks occur? Also, is there any documentation for the library that I can refer to?

Thanks!!

svbkjfj

I run into the following error when I try langchain's UnstructuredImageLoader:

TypeError: stat: path should be string, bytes, os.PathLike or integer, not JpegImageFile

shooby

I used a third-party tool's API to extract text and tables, but it does not work for images; it does not even recognise them. If I use Python libraries instead, they recognise the images, and I can save them to other folders and work on them later, but I guess table extraction won't work with the Python libraries.

piyushchhawachharia_

Thanks so much. Can we have an example of data extraction from a table in an image?

thibauteka

@zoumdatascience Can we give the image output with text based on the questions?

vivekpatel

Which among these has the best accuracy?

Abhi_interiors

Important question: if a PDF has a picture in it, then when converting the PDF to an image, is that picture added as text or is it skipped?
Secondly, is there a way to *know* that the extracted text is coming from an image within the PDF? Some sort of metadata, at least, to get that info?
Thanks for the video. Nice content with good overall breadth. I hope you can answer my question.

SuiGio

I am getting a "list index out of range" error with langchain. Can you suggest something there?

anubhav

Great work. Can we extract information from charts such as histograms or bar plots?

susmitsekhar

All the methods for converting the files failed, and I don't know why.

cbylnrh

Hello, are you Malian? Great tutorial, thanks for sharing.

ibrahimkouma

That was very useful for me, but I ran into one problem with langchain's UnstructuredImageLoader: it says PNGImage is not allowed? I don't get it. I tried to resolve it but couldn't. Please, if there is any way to contact you, I'd appreciate it. Thank you.

hajarabdullah
