Development Tips & Tricks : Arabic OCR in Python

Thursday, August 1, 2019

Arabic OCR in Python

We will use the Tesseract OCR engine, through its Python wrapper pytesseract.

How to install?
On Linux:
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-ara
pip3 install pillow pytesseract

On Mac:
brew install tesseract
brew install tesseract-lang
pip3 install pillow pytesseract

Then correct the tesseract path in pytesseract.py.
First, find pytesseract.py.

Default path: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pytesseract/pytesseract.py"

Change tesseract_cmd = 'tesseract' to point to the tesseract executable, i.e.,
tesseract_cmd = '/usr/local/bin/tesseract'

(You can run "which tesseract" to confirm where the executable is installed.)
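The lookup can also be done from Python. The following is a minimal sketch, not part of pytesseract: the helper name find_tesseract and the fallback locations are assumptions chosen for illustration, so adjust them for your system.

```python
import os
import shutil


def find_tesseract(candidates=(
    "/usr/local/bin/tesseract",     # assumed: common Homebrew location on Intel Macs
    "/opt/homebrew/bin/tesseract",  # assumed: common Homebrew location on Apple Silicon
    "/usr/bin/tesseract",           # assumed: common apt location on Linux
)):
    """Return the path to the tesseract executable, or None if it is not found."""
    # Ask PATH first; this is the same lookup that `which tesseract` performs.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try the fallback locations (illustrative, not exhaustive).
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


if __name__ == "__main__":
    print(find_tesseract())
```

If the function returns a path, that is the value to use for tesseract_cmd. If it returns None, tesseract is probably not installed or lives somewhere unusual.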

Note:

You can skip the previous step and instead add the following line to any new OCR Python script:

pytesseract.pytesseract.tesseract_cmd = '<path-to-tesseract-bin>'

For example:

pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'

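Test OCR using the command line

tesseract image.png text -l ara
This converts image.png to text.txt, using Arabic (ara) as the recognition language. Tesseract appends .txt to the output name itself, so pass text rather than text.txt.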
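If the command fails with a missing-language error, the Arabic data is probably not installed. As a rough sketch, the check below runs tesseract --list-langs and looks for ara. The helper names parse_langs and installed_langs are hypothetical, and the parsing assumes the header line ends with a colon, as it does in recent Tesseract releases.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which is assumed to end with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract="tesseract"):
    """Return the language codes the tesseract binary reports, or [] if it is missing."""
    if shutil.which(tesseract) is None:
        return []
    result = subprocess.run(
        [tesseract, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list on stderr
        text=True,
        check=False,
    )
    return parse_langs(result.stdout)


if __name__ == "__main__":
    print("Arabic installed:", "ara" in installed_langs())
```

If ara is missing, install the language data as described in the installation steps above and run the check again.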

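Simple Python script to convert an image to text

from PIL import Image
import pytesseract

# Point pytesseract at the tesseract executable (adjust for your system)
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'

# Open the image and run OCR with the Arabic language data
im = Image.open('/Users/rafie/Desktop/ocr.png')
text = pytesseract.image_to_string(im, lang='ara')

print(text)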
