Python & Web Scraping Canvas PNG Image Processing for Text


Whilst exploring front-end web scraping I came across an HTML canvas element in a weather table. When I clicked on it in the browser's inspector I found that, as well as its XPath and CSS selector, I could copy its image data URL, and when I pasted that into the browser it returned an image.
This was presumably a method the website developers used to stop people scraping their site, as the data is returned as text inside an image rather than as text in the page.
I took this as a bit of a challenge, so I downloaded the image data URL via Selenium, decoded the data with the base64 library and wrote it to a PNG file.
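A minimal sketch of that step, not the exact code from the video. It assumes a Selenium 4 driver is already open on the page, and the function names and the CSS selector argument are made up for illustration. A canvas data URL carries the PNG as base64 text after the comma, so that payload is decoded before being written to disk.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def fetch_canvas_data_url(driver, css_selector):
    """Ask the browser to serialise a canvas element as a PNG data URL.

    `driver` is assumed to be a Selenium 4 WebDriver; "css selector" is the
    string value behind selenium's By.CSS_SELECTOR.
    """
    canvas = driver.find_element("css selector", css_selector)
    return driver.execute_script(
        "return arguments[0].toDataURL('image/png');", canvas
    )


def data_url_to_bytes(data_url):
    """Split 'data:image/png;base64,<payload>' and decode the payload."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload)


def save_canvas_png(data_url, path):
    """Decode the data URL and write the PNG bytes to `path`."""
    png_bytes = data_url_to_bytes(data_url)
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG")
    with open(path, "wb") as fh:
        fh.write(png_bytes)
    return len(png_bytes)
```

With a live driver, something like `save_canvas_png(fetch_canvas_data_url(driver, "canvas"), "table.png")` would produce the file, where the `"canvas"` selector and the file name are placeholders.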
I then ran the PNG through Tesseract OCR, but the quality of the results was poor: only about a third of the numbers were usable.
It sort of worked, but with so little usable data I was disappointed that it couldn't be used as a reliable process.
I tried using the Python cv2 library to change the image background to white, along with other transformations, but this generally degraded the resulting image, and passing it back through Tesseract gave me worse results.
Then I downloaded the image from the browser, which showed a white background, and when I passed that through the OCR the results were very impressive: almost 100% accuracy (though only half of the info was showing).
When I compared file and image sizes, I found that the image from the browser had a smaller file size and was about 4500 px x 100 px, whereas my initial image had a larger file size and was about 6000 px x 113 px.
So I used an image resizer program to reduce my initial image to about 82% of its size, so that it roughly matched the second image's pixel density, and when I ran it through the OCR again the output was exact.
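The resize-then-OCR step could be sketched in Python roughly as follows. This is an assumption-laden illustration rather than what I actually ran (I used a separate resizer program): it assumes Pillow and pytesseract are installed along with the Tesseract binary, and the function names and the number pattern are invented for the example. For reference, the sizes quoted above give ratios of 0.75 in width and about 0.88 in height, and 0.82 sits between the two.

```python
import re


def scaled_size(size, factor):
    """Return (width, height) scaled by `factor`, rounded to whole pixels."""
    width, height = size
    return max(1, round(width * factor)), max(1, round(height * factor))


def extract_numbers(ocr_text):
    """Pull integers and decimals (optionally negative) out of OCR text."""
    return re.findall(r"-?\d+(?:\.\d+)?", ocr_text)


def ocr_resized(png_path, factor=0.82):
    """Shrink the PNG, run Tesseract on it, and return the numbers found.

    Third-party imports are kept local so the helpers above work without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(png_path) as img:
        small = img.resize(scaled_size(img.size, factor), Image.LANCZOS)
    return extract_numbers(pytesseract.image_to_string(small))
```

The 0.82 default only reflects the one image described here; for another canvas the factor would need to be found by trial.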
So it is possible to take a canvas image from a website and scrape it for the data.
I was pleased with the exercise. The method I actually used to get the data from the table, though, was to go to the backend and make a GET request for the JSON data being fed to the page, which is a far easier way to get the information.
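For comparison, the backend route needs only a few lines of standard-library Python. This is a sketch under assumptions: the endpoint URL has to be found in the browser's network tab, and the function names here are hypothetical.

```python
import json
import urllib.request


def decode_json_body(body, charset="utf-8"):
    """Turn the raw response bytes into Python objects."""
    return json.loads(body.decode(charset))


def fetch_json(url, timeout=10):
    """Send a GET request and parse the JSON response.

    `url` is whatever endpoint the page itself calls; none is given here.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_json_body(response.read(), charset)
```

The shape of the returned data depends entirely on the site, so any further parsing would have to follow what the endpoint actually sends.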
Still, I gained a bit of familiarity with OCR and regex along the way, and that was pleasing.
Kind regards, Max Drake


Comments

How would I go about performing OCR on legacy desktop applications and saving the text in cells in LibreOffice Calc?
I have a couple of Windows desktop applications where I want to OCR the text fields native to those apps and play back some of the OCRed text with a text-to-speech library. What workflow and libraries should I use to do that? Thank you!

encapsulatio
