Training Tesseract 5 for a New Font

Build Tesseract from source video:

GitHub repository link:

Training command:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Correction: I believe the box file contains the bounding box (OBB) coordinates of the character within the image
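
To make the correction above concrete, here is a minimal Python sketch that parses one line of a Tesseract box file. It assumes the classic layout of a symbol followed by left, bottom, right, top and page number, separated by spaces, with the coordinate origin at the bottom-left corner of the image. The names `BoxEntry` and `parse_box_line` are made up for this illustration and are not part of Tesseract or the video's repository.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    """One box-file line: a symbol and its bounding box on a page."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> BoxEntry:
    """Parse '<symbol> <left> <bottom> <right> <top> <page>'.

    Splitting from the right keeps symbols that are themselves
    whitespace (a space or a tab marker) intact.
    """
    symbol, *numbers = line.rstrip("\r\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top, page = map(int, numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


entry = parse_box_line("A 12 30 40 58 0")
width = entry.right - entry.left    # horizontal extent in pixels
height = entry.top - entry.bottom   # vertical extent, measured from the bottom edge
print(entry.symbol, width, height)
```

Box files generated for LSTM training may repeat the same line-level box for every character in a text line, so the numbers do not always describe a tight box around a single glyph.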

Comments

God, I love you. I just recently started messing with OCR, specifically Tesseract, and after a few hours of reading through the documentation on the steps I just wanted to end my life hahahaha. Thank you for this, this is extremely encouraging. I can't wait to try this!

taylorbarnes

I think the word error rate is high because the font doesn't distinguish between uppercase and lowercase (it's all uppercase), but the ground truth labels do distinguish between the two.

yichenyao
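
The effect described in the comment above can be illustrated with a small, generic word error rate calculation. This is only a sketch: it uses a plain word-level edit distance, which is not necessarily how Tesseract's own evaluation computes its figures, and the sample strings are invented. It shows how an all-uppercase prediction scores against mixed-case ground truth, with and without case folding.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: mixed-case ground truth, all-uppercase OCR output.
ground_truth = "Apex Legends Season"
prediction = "APEX LEGENDS SEASON"

case_sensitive = word_error_rate(ground_truth, prediction)
case_folded = word_error_rate(ground_truth.upper(), prediction.upper())
print(case_sensitive, case_folded)
```

If the case-folded score is much lower than the case-sensitive one, writing the ground truth transcriptions in uppercase only would be one way to test the explanation.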

This training video is the only source I have found that actually gets you results if you follow it! Many thanks for this video!

AchievementHuntGuru

Thank you so much, man. I've been looking everywhere for a Tesseract tutorial, and everything just points to the shitty, unreadable docs. Without you I don't know where I'd be.

donjuanpond

Tesseract's documentation is abysmal.

bunyn

I was racking my brain trying to understand the official tutorial, and you explain it in a simple way. I'm your subscriber number 666. Thank you very much.

fivalt

Haven't watched the video yet, but if this works, you'll have my eternal gratitude

videos

Hey Gabriel, I am following your steps to train my model on handwritten text, but it always fails with this error:

unicharset_extractor --output_unicharset "data/Apex/my.unicharset" --norm_mode 2 "data/Apex/all-gt"
Failed to read data from: data/Apex/all-gt
Wrote unicharset file data/Apex/my.unicharset

Can you please help me here? I am stuck. Thanks!

madhavpandey
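
One plausible cause of the "Failed to read data from: data/Apex/all-gt" message above is that `all-gt` ended up empty because no transcription files were found. This is not confirmed by the video author. As far as the tesstrain Makefile is documented, it collects `*.gt.txt` files from `data/MODEL_NAME-ground-truth` (here `data/Apex-ground-truth`) by default, each paired with an image of the same base name. The sketch below only inspects such a folder. The function name and the list of image suffixes are assumptions made for this illustration.

```python
from pathlib import Path

# Assumed image suffixes; adjust to whatever your training setup accepts.
IMAGE_SUFFIXES = (".tif", ".png")


def summarize_ground_truth(gt_dir):
    """Count transcripts and report empty ones or ones without an image."""
    gt_dir = Path(gt_dir)
    transcripts = sorted(gt_dir.glob("*.gt.txt")) if gt_dir.is_dir() else []
    empty = []
    missing_image = []
    for transcript in transcripts:
        if not transcript.read_text(encoding="utf-8").strip():
            empty.append(transcript.name)
        stem = transcript.name[: -len(".gt.txt")]
        if not any((gt_dir / (stem + suffix)).exists() for suffix in IMAGE_SUFFIXES):
            missing_image.append(transcript.name)
    return {
        "transcripts": len(transcripts),
        "empty": empty,
        "missing_image": missing_image,
    }


if __name__ == "__main__":
    # Path assumed from MODEL_NAME=Apex in the training command.
    print(summarize_ground_truth("data/Apex-ground-truth"))
```

A result of zero transcripts would point to a folder name or location that does not match what the Makefile expects.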

I've been experimenting with this tutorial for three days, and the file structure in the video doesn't quite match the GitHub repo. Can you please update the repo if possible? I'm running into too many folder inconsistencies when trying to connect the dots, as that part was brushed over really quickly. Thank you :)

ConfusedProgrammer

I tried this with a font for the Hindi language (Kruti Dev 010), and also with Kruti Dev 016, but it shows: Error: Call PrepareToWrite before WriteTesseractBoxFile!!

ganeshrajv

Hi. There's a font used in a game that I would like to prepare for training. Would I just need to screen-capture words rendered in that font, as you describe, or do I need a different approach?

nobafan

Great tutorial. Using WSL, I was constantly getting new errors. Switching to an OS installed in VirtualBox solved it. I was able to train on my dataset, and it was surprisingly easy.

wojd_

Thank you for making this video. But I can't wrap my head around where to put all those data files. I'm trying to fine-tune on variations of letters with accents, and I'm stuck.

ombieautopilot

Hi Gabriel.
Thank you for this tutorial.
I was trying to run the code but I'm receiving this error:
Fontconfig error: Cannot load default config file: No such file: (null)
This error appears to be font-related. I've experimented with several fonts but I'm unable to resolve this issue.
Could you help me please?

shadyas.

I want to custom-train Tesseract 5 to read car license plates that are detected by a YOLO model. How can I do this, given that I have a couple of thousand images?
What steps do I need to follow?

Leo-hkkk

While running the script 'split_training_text.py', I am getting the following error:

Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored

Could you help me resolve this?

aayushjain

So far, this is the only tutorial on Tesseract 5. The old method of training with bash scripts has been abandoned since December 2022.

adityanjsg

The title says this is for a new font. Can I also use it for a new language, using TIFF images?

ganeshrajv

When Tesseract training starts for Sindhi, it shows the warning below:
Can't encode transcription: 'पिए वई। ज़ख़मनि जो सूर वधंदो वियो हू चीखन्दो
How can I handle this problem?

DalvinderKaur-izsn

Hi. I tried this on Colab. I installed Tesseract and went on to run split_training_text.py, but got this error: FileNotFoundError: [Errno 2] No such file or directory: 'text2image'. Is there a solution?

listentomusicfeellikehome
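
The `FileNotFoundError` for 'text2image' above suggests that the script launches `text2image` as an external program and that the executable was not found on `PATH`. `text2image` belongs to Tesseract's training tools, which are built and installed separately from the main `tesseract` binary, so having Tesseract itself installed does not guarantee it is present. A minimal sketch for checking this before running the script follows. The helper name and the exact list of tools are assumptions for this illustration.

```python
import shutil

# Tesseract programs commonly involved in training; the exact set a given
# workflow needs is an assumption here.
REQUIRED_TOOLS = (
    "tesseract",
    "text2image",
    "unicharset_extractor",
    "lstmtraining",
    "combine_tessdata",
)


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed training tools were found.")
```

If `text2image` is reported as missing, building the training tools from source as in the linked build video, or installing a package that ships them, would be the next thing to try.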