UnicodeDecodeError with Tesseract OCR in Python

randomusername 2015-12-16 00:22

The problem is that Python is decoding Tesseract's output with the Windows default encoding (CP1252) instead of the encoding the output is actually written in (UTF-8). Tesseract has produced a non-ASCII character, and PyTesseract is now trying to decode its UTF-8 bytes as CP1252, which fails because some of those bytes have no CP1252 mapping. On a platform whose default encoding is UTF-8 you won't encounter this error.
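To see the failure in isolation, here is a minimal sketch that needs neither Tesseract nor an image. The sample text is an assumption chosen for illustration: it contains a right double quotation mark (U+201D), whose UTF-8 encoding includes the byte 0x9D, which Python's cp1252 codec leaves undefined.

```python
# Illustrative text only; U+201D encodes to b'\xe2\x80\x9d' in UTF-8.
sample = "He said \u201chello\u201d"
utf8_bytes = sample.encode("utf-8")

# Decoding with the encoding the bytes were written in round-trips cleanly.
assert utf8_bytes.decode("utf-8") == sample

# Decoding the same bytes as cp1252 fails, because 0x9d has no cp1252 mapping.
try:
    utf8_bytes.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("cp1252 decode failed:", exc)
```

Text that happens to contain only bytes cp1252 can map will not raise at all; it will decode silently into the wrong characters instead.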

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about decoding). You could change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could install Cygwin and run Python there to get clean UTF-8 output.

If you want a quick-and-dirty solution that won't break anything (yet), here's an approach you might consider:

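import builtins

# Keep a reference to the real open() so it can be restored afterwards.
original_open = open
def bin_open(filename, mode='rb'):       # note, the default mode now opens in binary
    return original_open(filename, mode)

from PIL import Image
import pytesseract

img = Image.open('binarized_image.png')

try:
    # Temporarily replace open() so files opened without an explicit mode are read as bytes.
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if the OCR call raises.
    builtins.open = original_open

# Decoding as cp1252 keeps the text printable on a cp1252 console, but
# non-ASCII characters come out garbled and 'ignore' drops unmappable bytes.
print(str(bts, 'cp1252', 'ignore'))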
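If the garbling from that last line is a problem, one alternative is to decode the bytes as UTF-8 (which is what Tesseract writes) and only replace the characters the console cannot show. This is a rough sketch, not part of PyTesseract: decode_ocr_bytes and console_safe are hypothetical helper names, and raw is a stand-in for the bts value above.

```python
import sys

def decode_ocr_bytes(raw):
    """Decode OCR output bytes as UTF-8, replacing any malformed sequences."""
    return raw.decode("utf-8", errors="replace")

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# Stand-in for the bytes returned by the patched OCR call.
raw = "na\u00efve \u2192 caf\u00e9".encode("utf-8")

text = decode_ocr_bytes(raw)   # full-fidelity str, suitable for further processing
print(console_safe(text))      # lossy copy used only for display
```

Keeping the full str in text and using the lossy version only for printing means the console's limits do not leak into whatever you do with the text next.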

MPlanchard 2015-12-15 23:50:15

Seems like there's some good information potentially related to this answer here.

Benjamin Hodgson 2015-12-15 23:51:53

Yeah, that sounds like the issue I've run into before. This answer would be better if you gave some code explaining how to configure PyTesseract to open that file with a UTF-8 encoding, if possible.

randomusername 2015-12-16 00:19:49

@BenjaminHodgson PyTesseract doesn't have a way to specify the encoding, but we can inject our own open alternative...

Nwawel A Iroume 2015-12-16 16:32:53

@randomusername does your solution have an impact on the fidelity of the extracted text? I am getting a lot of strange characters, whereas the original document is plain English characters, even if it is a little bit blurred! An example is: iÃŽc1-zo1sÃ®Ã¢zzaÃ¯zÃœl VE0Ã2ÃŽE BP797Z5S

randomusername 2015-12-16 23:30:11

@NwawelAIroume no, but it does have a severe impact on the resulting output. Try printing the output as the original bytes object to see if you can salvage what you can. Or you can store the output in a file and use a UTF-8 capable text editor to view it.
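Both suggestions could look roughly like this. It is only a sketch: raw stands in for the bytes returned by the OCR call, and the file name ocr_output.txt is a hypothetical choice.

```python
import os
import tempfile

# Stand-in for the bytes object returned by the OCR call.
raw = "caf\u00e9".encode("utf-8")

# repr() shows the raw bytes as they are, with no decoding step involved.
print(repr(raw))

# Write the bytes untouched; binary mode performs no encoding conversion.
out_path = os.path.join(tempfile.gettempdir(), "ocr_output.txt")  # hypothetical path
with open(out_path, "wb") as fh:
    fh.write(raw)

# Reading back with an explicit encoding sidesteps the platform default.
with open(out_path, encoding="utf-8") as fh:
    recovered = fh.read()
```

Opening the saved file in a UTF-8 capable editor should then show the text without the stray characters, provided the OCR itself read the image correctly.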
