How to extract key-value & table info from PDF & save it as CSV - Amazon Textract tutorial p5

Welcome to part 5 of the tutorial series on Amazon Textract. In this video, I cover how to extract text, key-value pairs, and table information from a multi-page PDF file and save the output as CSV.

Comments
Author

Excellent, Chirag, you are saving me a ton of time. Very detailed. Thanks very much.

SK-gnrs
Author

Thank you so much for these! You're a lifesaver. If you end up creating any more, it'd be really helpful to get a primer on adding queries functionality to this multi-page PDF parser.

AJvanuw
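
For the question above about queries, here is a minimal sketch rather than the tutorial author's code. It assumes the parser starts jobs with `start_document_analysis` and already has the full list of result blocks. The function names `build_queries_request` and `collect_query_answers`, the alias `INVOICE_NO`, and the sample question are hypothetical.

```python
def build_queries_request(bucket, key, questions):
    """Build keyword arguments for textract.start_document_analysis.

    `questions` maps a short alias to the natural-language question.
    """
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS", "TABLES", "QUERIES"],
        "QueriesConfig": {
            "Queries": [
                # "*" asks Textract to apply the query to every page.
                {"Text": text, "Alias": alias, "Pages": ["*"]}
                for alias, text in questions.items()
            ]
        },
    }


def collect_query_answers(blocks):
    """Map each query alias to its answer text ("" when nothing was found)."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block["Query"]
        alias = query.get("Alias") or query["Text"]
        texts = []
        for relation in block.get("Relationships", []):
            if relation["Type"] == "ANSWER":
                # ANSWER relationships point at QUERY_RESULT blocks.
                texts.extend(
                    by_id[i].get("Text", "") for i in relation["Ids"] if i in by_id
                )
        answers[alias] = " ".join(texts)
    return answers
```

The request dictionary would be passed as `boto3.client("textract").start_document_analysis(**request)`; the answers can then be written out next to the key-value CSV.
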
Author

Hi, the “T extract_async_kv_table.yaml” file you uploaded to AWS CloudFormation is different from the one in the Git repository. Could you please help me with this? I need the main file.

digambarsonavane
Author

Hi Srce Cde, I followed the exact steps you mentioned in the video, but I'm getting the error below in the Lambda function's CloudWatch logs. Can you please help me out? Thanks a lot in advance.
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'lambda_function'
Traceback (most recent call last):

SurajDubey
Author

Can you tell us whether the code shown in the video has been updated?

amalsalilan
Author

Hi Chirag, this video is really helpful. I am trying to save the output files under folders such as KV, Table, Signatures and Text, but each file is named after the JobID with a CSV extension. Instead, I want the files named after the input file. For example, if I upload Sample.PDF, I want Sample.CSV in each of the output folders (KV, Tables, Signatures and Text). Could you please assist here?

akshayavarshini
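
For the naming question above, one possible approach is to derive the output key from the input object key instead of the JobID. This is a sketch under the assumption that the processing function can recover the original S3 key of the uploaded PDF; the function name `output_key` and the folder names passed to it are illustrative.

```python
from pathlib import PurePosixPath


def output_key(input_key, category):
    """Build an S3 key such as "KV/Sample.csv" from "uploads/Sample.PDF".

    Only the file name without its extension is kept, so the same stem
    can be reused under every output folder.
    """
    stem = PurePosixPath(input_key).stem
    return f"{category}/{stem}.csv"
```

Calling `output_key("uploads/Sample.PDF", "Tables")` gives `Tables/Sample.csv`; the same call with `"KV"`, `"Signatures"` or `"Text"` covers the other folders.
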
Author

Good day,

Thanks a lot for your videos; I have learnt a lot going through each of them. All of the implementations I tried are working except this one. For some reason it times out in the process_response method: it gets stuck after the message from logging.info("Fetching response"). I even set the timeout to 15 minutes and tried different files.

BonginkosiBrian
Author

Hello.
I have used your code on a multi-page PDF and it's extracting only the first page.

saikrishnachalavadi
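
One possible cause of the behaviour described above (an assumption, since the thread does not show the failing code) is that only the first batch of results is read. `get_document_analysis` returns results in batches linked by `NextToken`, so every batch has to be fetched. A minimal sketch, with the hypothetical helper name `fetch_all_blocks` and the Textract client passed in by the caller:

```python
def fetch_all_blocks(client, job_id):
    """Collect the blocks from every result batch of a finished Textract job.

    `client` is expected to behave like boto3.client("textract").
    """
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        token = response.get("NextToken")
        if not token:
            # No token means this was the last batch.
            return blocks
        kwargs["NextToken"] = token
```

Each block carries a `Page` number, so the combined list can still be split per page after it has been collected.
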
Author

How can we add ['SIGNATURE'] to the FeatureTypes and put the result in the table? Is there any way to detect whether a signature exists, like a key-value pair, and if there is no signature, just return an empty string or something like " "? Thank you!

SkalarBG
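
For the signature question above: in the Textract API the feature type is spelled `"SIGNATURES"`, and detected signatures come back as blocks whose `BlockType` is `"SIGNATURE"`. The sketch below assumes the job was started with `"SIGNATURES"` in `FeatureTypes` and that the result blocks are already loaded; the helper name `signature_row` and the value `"Detected"` are illustrative choices.

```python
def signature_row(blocks):
    """Return a ["Signature", value] row for the key-value CSV.

    The value is "Detected" when at least one SIGNATURE block exists,
    otherwise an empty string.
    """
    has_signature = any(
        block.get("BlockType") == "SIGNATURE" for block in blocks
    )
    return ["Signature", "Detected" if has_signature else ""]
```

The returned row can be appended to the key-value rows before they are written out, so documents without a signature still get a row with an empty value.
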
Author

Hi Chirag, I'm getting this error: "[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': attempted relative import with no known parent package Traceback (most recent call last):". It looks like it is due to this line: "from helper.helper import process_response, process_error". How do I fix this?

SK-gnrs
Author

Hi Chirag, all the tutorials are really helpful. I am trying to save all the output into a folder named after the input file, with the different subfolders (Text, Tables, kv, Textract) underneath it. For example, if I import 123.pdf, I want to create a 123 folder in S3 with all the subfolders (Text, Tables, kv, Textract) underneath. Thank you in advance :)

jalpazaveri
Author

Wow, thank you very much! I set everything up as in the video but am still getting an error on the JobLambda saying "[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'lambda_function' Traceback (most recent call last)".

I did some googling, and some of the solutions on Stack Overflow were to validate that the handler matches the Lambda Function.py name (which it does), create a separate blank init.py file (like in the Process function), and change permissions on the files before zipping. I have tried all of them and am not sure what is causing the error.

curiousl
Author

Hello, thank you very much for this tutorial. Can you help me understand why I can extract CSV from small PDFs (10-30 pages) but not from large PDF files (more than 100 pages)? Also, could you please post how we should modify the parser to generate a single CSV with all the tables? Thank you in advance!

naillazrak
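
For the single-CSV question above, one simple option is to write every table into one file and prefix each row with the table number. This sketch assumes the parser has already turned each table into a list of rows (lists of cell strings); the actual data structure in the tutorial code may differ, and `tables_to_single_csv` is a hypothetical name.

```python
import csv
import io


def tables_to_single_csv(tables):
    """Serialize all tables into one CSV string.

    The first column holds the 1-based table number so rows from
    different tables stay distinguishable.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, rows in enumerate(tables, start=1):
        for row in rows:
            writer.writerow([index, *row])
    return buffer.getvalue()
```

The resulting string can be uploaded as a single object instead of one object per table.
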
Author

This is really helpful! Thanks so much. I have a question, though.

If I upload 10 separate forms in PDF format at the same time, I just need to iterate over the bucket's objects, right? Sorry if it is a dumb question; I'm new to the whole thing. Thanks again.

turkishboy
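
For the question above, and assuming the Lambda function is triggered by S3 upload notifications (an assumption about the setup), listing the whole bucket may not be needed: each invocation receives the uploaded objects in the event's `Records`. A small sketch, with the hypothetical helper name `uploaded_objects`:

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs for every S3 record in a Lambda event.

    Object keys arrive URL-encoded in S3 notifications, so they are
    decoded before use.
    """
    return [
        (
            record["s3"]["bucket"]["name"],
            unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in event.get("Records", [])
        if "s3" in record
    ]
```

A Textract job can then be started for each pair, which also covers several PDFs uploaded at the same time.
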
Author

Thank you so much, great tutorial bro. I have an issue when uploading multi-page PDFs, but it doesn't happen with every PDF file. When it happens, I get this error message: <listcomp>\n v = \" \".join([self.word_map[i] for i in relation[\"Ids\"]])\n",

skripsi
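
The traceback above is truncated, so the exact exception is unknown. If it is a `KeyError`, one possible explanation is that a relationship points at a block that is not in `word_map` (for example a non-word block, or a block from a result batch that was never loaded). A defensive version of the join, written as a standalone function with the hypothetical name `join_words`:

```python
def join_words(word_map, relation):
    """Join the text of related blocks, skipping Ids missing from word_map."""
    return " ".join(
        word_map[block_id]
        for block_id in relation.get("Ids", [])
        if block_id in word_map
    )
```

Skipping unknown Ids hides the symptom only; if words are missing from the output afterwards, it is worth checking that all result batches were loaded before the map was built.
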
Author

I want to ask something about Textract. Can I get your contact details?

purnishsinha