Combine and Extract multiple PDF tables to clean Excel Data using Tabula library of python

In this video, we will explore the tabula library for Python to combine, convert, and extract multiple PDF tables into clean Excel data that is ready for further analysis.

We will also use the pandas library for Python to clean the extracted data.
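As a rough illustration of that workflow (not the exact code from the video), here is a minimal sketch. It assumes the tabula-py package and a Java runtime are installed. The function names, the cleaning rules, and the output file name are hypothetical; `tabula.read_pdf` with `pages="all"` and `multiple_tables=True` returns a list of DataFrames.

```python
import pandas as pd


def extract_tables(pdf_paths):
    """Read every table from every PDF into one flat list of DataFrames."""
    import tabula  # tabula-py; needs a Java runtime available on PATH

    tables = []
    for path in pdf_paths:
        # pages="all" scans the whole file; the result is a list of DataFrames
        tables.extend(tabula.read_pdf(path, pages="all", multiple_tables=True))
    return tables


def clean_table(df):
    """Generic cleanup (assumed rules): drop empty rows/columns, trim text."""
    df = df.dropna(how="all").dropna(axis=1, how="all").copy()
    df.columns = [str(c).strip() for c in df.columns]
    # Strip whitespace from string cells only; leave numbers and NaN untouched
    df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))
    return df.reset_index(drop=True)


def combine_to_excel(tables, out_path):
    """Stack the cleaned tables and write them to one Excel sheet."""
    combined = pd.concat([clean_table(t) for t in tables], ignore_index=True)
    combined.to_excel(out_path, index=False)  # requires openpyxl for .xlsx
    return combined


# Hypothetical usage:
# combine_to_excel(extract_tables(["report1.pdf", "report2.pdf"]), "combined.xlsx")
```

Real reports usually need cleaning rules specific to their layout, so treat `clean_table` as a starting point only.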

If you already have Java installed and are still getting an error, please try the steps below. The Java setup is a bit tricky, but hopefully it is a one-time setup.

From the Windows Start menu, search for Environment Variables and open *Edit environment variables*, then follow the steps below:

**Under System Variables, click Path and then press Edit... instead of New. Then, in the next screen (Edit environment variable for the Path variable), click New and add the address, e.g. C:\Program Files (x86)\Java\jre1.8.0_201\bin. Press OK and the Path variable will be appended/updated.**
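To check whether the Path change took effect, a small standard-library sketch like the one below may help. The helper names are hypothetical. `add_to_path_for_this_process` is only a temporary workaround that affects the running Python process and does not replace the system setting described above.

```python
import os
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")


def java_version_text(java_path):
    """Run `java -version` and return its output (Java prints it to stderr)."""
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


def add_to_path_for_this_process(directory):
    """Append a directory to PATH for the current Python process only."""
    os.environ["PATH"] = os.environ.get("PATH", "") + os.pathsep + directory


# Hypothetical usage, with the example folder from the steps above:
# add_to_path_for_this_process(r"C:\Program Files (x86)\Java\jre1.8.0_201\bin")
# java = find_java()
# print(java_version_text(java) if java else "java was not found on PATH")
```

Open a new terminal or restart the editor after editing the environment variables, since programs that are already running keep the old Path.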

Answer taken from below:

Python Source code:
Comments

When I typed pdf_files or pdf_files[1] in the editor, I didn't get any results. When I typed pdf_files[0] in the terminal, I got an error saying the term is not recognized as the name of a cmdlet.

TheCopperMystic
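On the "not recognized as the name of a cmdlet" error: that message comes from PowerShell, which means the Python expression was typed into the system shell rather than into a Python session. A bare expression inside a script also prints nothing unless it is wrapped in `print()`. The sketch below assumes `pdf_files` is a list of PDF paths built with `glob`. The video may build it differently, and the folder name is hypothetical.

```python
import glob
import os


def list_pdfs(folder):
    """Return the paths of all .pdf files in `folder`, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))


# Hypothetical usage, run with Python rather than typed into PowerShell:
# pdf_files = list_pdfs("pdfs")
# print(pdf_files[0])  # print() is needed to see output when run as a script
```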

01:40 How did you edit this so that the VS editor has separate cells? Someone please let me know.

TheCopperMystic

Thank you! Love this content! The only problem for me is that I have a monthly report with 61 different PDFs, with three table types in each representing Deposits, Fees, and Discounts. They vary from 2-11 pages, and each table can be longer or shorter than another in each PDF, so I can't create consistent rules like you did in this video.
Is there a way I could filter through the tables, make lists of the ones with the same headers, and then append and process them?
Thank you in advance! This video already helped me out a ton!

mpfiesty
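One possible approach to the question above is to group the extracted DataFrames by their column headers and concatenate each group. This is a minimal pandas sketch. The function name is hypothetical, and it assumes tables of the same type share the same header text apart from case and surrounding whitespace.

```python
from collections import defaultdict

import pandas as pd


def group_by_header(tables):
    """Concatenate DataFrames that share the same (normalized) column names."""
    groups = defaultdict(list)
    for df in tables:
        # Normalize headers so "Date " and "date" land in the same group
        key = tuple(str(c).strip().lower() for c in df.columns)
        renamed = df.copy()
        renamed.columns = key
        groups[key].append(renamed)
    return {key: pd.concat(frames, ignore_index=True) for key, frames in groups.items()}
```

Each key in the returned dictionary is a tuple of header names, so the Deposits, Fees, and Discounts tables would end up under three separate keys if their headers differ.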

What was the formatting you did at 1:44?

prakharjain

Hello, I cannot get pdf_files[0]. There is an error saying the term 'pdf_files[0]' is not recognized.

mustaqimjohari

Thank you, this video is very helpful :) But in my case there is a large PDF with more than 100 pages, and the column names appear only on the first page, so this extracts data from the first page only. I want to extract data from all pages. Can you provide some guidance on how to solve this? Thank you

AIWorld-
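For a long PDF whose column names appear only on the first page, one option is to read every page without treating any row as a header, then reuse the first page's header for the rest. The sketch below assumes the raw tables come from something like `tabula.read_pdf(path, pages="all", pandas_options={"header": None})` and that all pages have the same number of columns. The function name is hypothetical.

```python
import pandas as pd


def stitch_pages(raw_tables):
    """Combine header-less page tables, using row 0 of the first as the header.

    raw_tables: DataFrames in page order, read without a header row.
    """
    header = [str(v).strip() for v in raw_tables[0].iloc[0]]
    # Data rows: everything after the header on page 1, plus all later pages
    body = [raw_tables[0].iloc[1:]] + list(raw_tables[1:])
    # Skip fragments whose column count does not match the header (assumption)
    body = [t for t in body if t.shape[1] == len(header)]
    combined = pd.concat(body, ignore_index=True)
    combined.columns = header
    return combined
```

If some pages are detected with a different number of columns, those fragments are skipped here and would need separate handling.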

Please send the source code. By the way, I am getting an error like "java not found", so please help me resolve it. I appreciate your work.

sarayumallam

Hello, I am getting "processSubtype14 WARNING: Format 14 cmap table is not supported and will be ignored"

smithndongla