Thursday, August 1, 2019

Arabic OCR in Python

We will use the Tesseract OCR engine through the pytesseract Python wrapper.

How to install?
On Linux (Debian/Ubuntu, where the Arabic language data is a separate package):
sudo apt-get install tesseract-ocr tesseract-ocr-ara
pip3 install pillow pytesseract

On Mac:
brew install tesseract
brew install tesseract-lang
pip3 install pillow pytesseract
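
After installing, it may be worth checking that the Arabic language data is actually present. The sketch below calls tesseract --list-langs and looks for "ara" in the result. The helper names parse_language_list and arabic_is_installed are made up for this example, and the header line format ("List of available languages ...") is an assumption based on common Tesseract 4 and 5 output.

```python
import subprocess


def parse_language_list(output):
    """Return the language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line starts with "List of"; everything else is a code.
    return [line for line in lines if not line.lower().startswith("list of")]


def arabic_is_installed(tesseract_cmd="tesseract"):
    """Run tesseract and report whether the 'ara' language data is available."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some versions print the list to stderr, so both streams are inspected.
    return "ara" in parse_language_list(result.stdout + result.stderr)
```

If arabic_is_installed() returns False, the language package from the install step is probably missing.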

Then, if pytesseract cannot find the tesseract executable, correct the path inside pytesseract.py.
First locate pytesseract.py. A typical path on Mac with Python 3.7 is:

"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pytesseract/pytesseract.py"

Change tesseract_cmd = 'tesseract' so that it points to the tesseract executable, e.g.
tesseract_cmd = '/usr/local/bin/tesseract'

(Run "which tesseract" in a terminal to confirm the actual path.)





Note:
You can skip the previous step and instead add the following line to each new OCR Python script:
pytesseract.pytesseract.tesseract_cmd = '<path-to-tesseract-bin>'
e.g.
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
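
Instead of hard-coding the path, a script could look the executable up at run time. This is only a sketch: find_tesseract is a made-up helper name, and the fallback locations (including the Homebrew prefix /opt/homebrew/bin used on Apple Silicon Macs) are assumptions that may not match your machine.

```python
import os
import shutil


def find_tesseract(extra_candidates=(
    "/usr/local/bin/tesseract",     # path used in this post
    "/opt/homebrew/bin/tesseract",  # assumed Homebrew location on Apple Silicon
    "/usr/bin/tesseract",           # assumed location for apt installs
)):
    """Return the path of the tesseract executable, or None if not found."""
    # First ask the PATH, the same way the shell would.
    found = shutil.which("tesseract")
    if found:
        return found
    # Then try a few common install locations.
    for candidate in extra_candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Usage (requires pytesseract to be installed):
#   import pytesseract
#   path = find_tesseract()
#   if path:
#       pytesseract.pytesseract.tesseract_cmd = path
```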




Test OCR from the command line
tesseract image.png text -l ara
This converts image.png to text.txt (tesseract adds the .txt extension to the output name itself); -l ara sets the recognition language to Arabic.
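
The same command can also be driven from Python with the standard subprocess module, which is handy for checking the install before involving pytesseract. This is a minimal sketch; build_tesseract_command and run_tesseract are hypothetical helper names, and it assumes tesseract is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="ara"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    # tesseract appends ".txt" to the output base on its own.
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_tesseract(image_path, output_base, lang="ara"):
    """Run tesseract and return the name of the text file it should write."""
    cmd = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return f"{output_base}.txt"
```

For example, run_tesseract("image.png", "text") should leave the recognized text in text.txt.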



Simple Python script to convert an image to text

from PIL import Image
import pytesseract

# Path to the tesseract executable (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'

# Open the image and run OCR with the Arabic language data
im = Image.open('/Users/rafie/Desktop/ocr.png')
text = pytesseract.image_to_string(im, lang='ara')
print(text)
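
If the recognized text should be kept, it is safer to write it out explicitly as UTF-8, since Arabic characters may not survive a platform's default encoding. The sketch below wraps the script above in a function and saves the result. The names save_text and ocr_to_file are invented for this example, and it assumes tesseract can already be found by pytesseract.

```python
from pathlib import Path


def save_text(text, out_path):
    """Write text as UTF-8 so Arabic characters are stored correctly."""
    Path(out_path).write_text(text, encoding="utf-8")


def ocr_to_file(image_path, out_path, lang="ara"):
    """OCR one image and save the recognized text to out_path."""
    image_path = Path(image_path)
    if not image_path.is_file():
        raise FileNotFoundError(f"Image not found: {image_path}")
    # Imported here so the path check above runs even if the OCR packages are missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as im:
        text = pytesseract.image_to_string(im, lang=lang)
    save_text(text, out_path)
    return text
```

A call such as ocr_to_file('/Users/rafie/Desktop/ocr.png', 'ocr.txt') would then write the text to ocr.txt.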









