Source: stackoverflow.com, "UnicodeDecodeError with Tesseract OCR in Python"
python python-tesseract tesseract

UnicodeDecodeError with Tesseract OCR in Python

Posted on 2020-03-27 10:47:31

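I am trying to extract text from an image file using Tesseract OCR in Python, but I am hitting an error that I can't figure out how to deal with. My environment itself seems fine, since I have successfully run OCR on some sample images from Python.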

Here is the code:

from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))

print(strs)

Here is the error I get from the Eclipse console:

strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
  File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am on Windows 10 using Python 3.5 x64.


Asked by
Nwawel A Iroume
Views
26
randomusername 2015-12-16 00:22

The problem is that Python is decoding Tesseract's output with the Windows default encoding (CP1252) instead of the encoding the text is actually in (UTF-8). As the traceback shows, pytesseract reads the OCR output in text mode, the text contains a non-ASCII character, and one of its UTF-8 bytes (0x9d) has no mapping in CP1252, so decoding fails. On a platform where the default encoding is UTF-8 you won't encounter this error.
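The mismatch can be reproduced without Tesseract at all. The following is a minimal sketch, assuming the OCR output contained a right curly quote (the actual character in the asker's image is unknown); `decode_attempts` is a hypothetical helper name used only for illustration.

```python
# Assumption: the OCR text contains curly quotes. The UTF-8 encoding of the
# right curly quote (U+201D) ends in byte 0x9d, which CP1252 leaves undefined.
raw = "“quoted”".encode("utf-8")


def decode_attempts(data):
    """Try to decode the same bytes as UTF-8 and as CP1252 (hypothetical helper)."""
    results = {}
    for enc in ("utf-8", "cp1252"):
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError as exc:
            # Record why decoding failed instead of raising
            results[enc] = "failed: {}".format(exc.reason)
    return results


results = decode_attempts(raw)
for enc, outcome in results.items():
    print(enc, "->", outcome)
```

Decoding as UTF-8 succeeds, while decoding as CP1252 fails on the 0x9d byte, which matches the error in the traceback.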

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could install Cygwin and run Python under it to get clean UTF-8 output.
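The console printing problem mentioned above can be sidestepped by replacing characters the console cannot represent before printing. This is only a sketch: `to_console_safe` is a hypothetical helper, and the CP1252 default is an assumption based on the asker's traceback, not something the console is guaranteed to use.

```python
def to_console_safe(text, encoding="cp1252"):
    """Replace characters that `encoding` cannot represent with '?'.

    Hypothetical helper; the default encoding is an assumption.
    """
    # Round-trip through the target encoding; unencodable characters become '?'
    return text.encode(encoding, "replace").decode(encoding)


# Curly quotes exist in CP1252 and survive; the check mark does not
safe = to_console_safe("“hi” ✓")
print(safe)
```

The returned string can always be encoded in the target encoding, so `print` should not raise `UnicodeEncodeError` on a console that uses it, at the cost of losing the replaced characters.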

If you want a quick and dirty solution that won't break anything (yet), consider the following approach:

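import builtins

from PIL import Image
import pytesseract

original_open = open

def bin_open(filename, mode='rb'):       # note, the default mode now opens in binary
    return original_open(filename, mode)

img = Image.open('binarized_image.png')

try:
    # Temporarily patch open() so pytesseract reads its output file as bytes
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if OCR raises
    builtins.open = original_open

# Tesseract writes UTF-8, so decode the raw bytes as UTF-8 rather than CP1252.
# Printing may still fail on a console that cannot display every character.
print(str(bts, 'utf-8', 'replace'))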