How to loop through pictures for text recognition
tngo
New Altair Community Member
Hi everyone,
I am new to RapidMiner and would appreciate any help you can provide. I have a database with a field of URLs, all of which point to pictures. I need a process that extracts the text from the image behind each URL for every row in my dataset, without my having to click on the URLs manually. My dataset has hundreds of thousands of rows.
Answers
-
As RapidMiner has no out-of-the-box image-to-text operators, you will need to use the Python extension here.
One possible workflow would be: use RapidMiner to loop over all of your database records -> use the Web Mining extension to download each image and store it locally -> in Python, read the image with, for instance, OpenCV -> run pytesseract to do the OCR and get the text -> return the text to RapidMiner and continue with the next image.
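To make the Python side of that workflow more concrete, here is a minimal sketch, not a tested RapidMiner process. It assumes the opencv-python and pytesseract packages plus the Tesseract binary are installed, that the script runs inside an Execute Python operator (which calls `rm_main` with a pandas DataFrame), and that the URL column is named `url`. The helper names, the `text` output column and the use of `urllib` in place of the Web Mining extension for the download are all assumptions made for illustration.

```python
import os
import tempfile
import urllib.request


def download_image(url, folder):
    """Fetch `url` and store it in `folder`; return the local file path."""
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = response.read()
    fd, path = tempfile.mkstemp(dir=folder, suffix=".img")
    with os.fdopen(fd, "wb") as out:
        out.write(payload)
    return path


def ocr_file(path):
    """Read an image with OpenCV and return the text Tesseract finds in it."""
    # Imported here so the script still loads where these packages are missing.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:  # cv2.imread returns None instead of raising
        raise ValueError("could not decode image: " + path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray)


def extract_text(url, folder, ocr=ocr_file):
    """Download one image, OCR it, and clean up; None marks a failed row."""
    path = None
    try:
        path = download_image(url, folder)
        return ocr(path).strip()
    except Exception:
        # A dead link or unreadable image should not stop a long run.
        return None
    finally:
        if path and os.path.exists(path):
            os.remove(path)


def rm_main(data):
    # Entry point called by the Execute Python operator with a pandas DataFrame.
    # "url" and "text" are assumed column names.
    with tempfile.TemporaryDirectory() as folder:
        data["text"] = [extract_text(u, folder) for u in data["url"]]
    return data
```

Each image is deleted again right after OCR, so the temporary folder stays small even with hundreds of thousands of rows. Rows whose download or OCR fails end up with a missing value in `text` and can be filtered and retried afterwards.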
-
With the new functionality in the Deep Learning extension, you can do this easily by using the "extract text from image" operator, which uses the Tesseract OCR library. If you have multiple images, you can loop over them by adding another operator, "Read Image Meta-Data", to the process.
-
@rdesai, thank you so much! I tried your process and it worked. However, I either need to be able to automatically download all images from the URLs in the database to my own folder, or I need an alternative way to run this without needing to download images to a folder. Do you have any thoughts?
-
You could use the [Open File] operator, which lets you select a file based on a URL. If you combine this with the [Write File] operator, you can save the file to your disk. You will probably need to do some tweaking with macros to define the file name and folder, but in essence this should work fine.
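If you would rather do the download step in Python than with macros, the file name and folder part could look roughly like the sketch below. It uses only the Python standard library. The function names and the hash-based naming scheme are assumptions for illustration, not something RapidMiner provides.

```python
import hashlib
import os
import urllib.parse
import urllib.request


def local_name(url):
    """Build a stable file name from a URL."""
    path = urllib.parse.urlparse(url).path
    extension = os.path.splitext(path)[1].lower() or ".img"
    # Hashing the full URL avoids clashes between images that share a base name.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return digest + extension


def save_image(url, folder):
    """Download `url` into `folder` unless it is already there; return the path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, local_name(url))
    if not os.path.exists(target):
        with urllib.request.urlopen(url, timeout=30) as response:
            payload = response.read()
        with open(target, "wb") as out:
            out.write(payload)
    return target
```

Because the name depends only on the URL, re-running the process skips images that were already downloaded, which helps when a long run is interrupted partway through.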