UnicodeDecodeError with Tesseract OCR in Python

randomusername 2015-12-16 00:22

The problem is that Python is using the console's encoding (CP1252) instead of the one it is meant to use (UTF-8). PyTesseract has found a Unicode character and is now trying to handle it as CP1252, which it can't do. On another platform you won't encounter this error because Python will get to use UTF-8.
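To make the failure concrete, here is a minimal sketch that needs neither Tesseract nor an image. It assumes the OCR output contains a right curly quote (U+201D), which is only an illustrative choice; any character whose UTF-8 bytes include a value that CP1252 leaves undefined should behave the same way. The helper name try_decode is made up for this example.

```python
# UTF-8 encodes a right double quotation mark (U+201D) as the bytes E2 80 9D.
# The sample text is an assumption for illustration, not real OCR output.
utf8_bytes = "say \u201chello\u201d".encode("utf-8")

def try_decode(data, encoding):
    """Return the decoded text, or the name of the error that was raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

as_utf8 = try_decode(utf8_bytes, "utf-8")      # decodes cleanly
as_cp1252 = try_decode(utf8_bytes, "cp1252")   # byte 0x9D is undefined in CP1252

print(ascii(as_utf8))   # ascii() keeps the print itself safe on any console
print(as_cp1252)
```

Characters whose UTF-8 bytes all happen to be valid in CP1252 do not raise; they decode to the wrong characters instead, which is where output such as "Ã©" in place of "é" comes from.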

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could install Cygwin and run Python there to get clean UTF-8 output.
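On the console problem mentioned above: one way to sidestep it is to make the text safe for whatever encoding the console reports before printing it. This is a minimal sketch, assuming a hypothetical helper named console_safe (it is not part of pytesseract). Characters the console cannot show are replaced with backslash escapes instead of raising an error.

```python
import sys

def console_safe(text, encoding=None):
    """Round-trip text through the console encoding, escaping what does not fit."""
    # Fall back to ASCII if stdout does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# Sample text is an assumption standing in for real OCR output.
sample = "caf\u00e9 \u201cok\u201d"
print(console_safe(sample))
```

The output is lossy only in how it is displayed: the original string is untouched, so it can still be written to a UTF-8 file afterwards.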

If you want a quick and dirty solution that won't break anything (yet), consider the following approach:

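import builtins

from PIL import Image
import pytesseract

original_open = open

def bin_open(filename, mode='rb'):       # note, the default mode now opens in binary
    return original_open(filename, mode)

img = Image.open('binarized_image.png')

try:
    # Temporarily swap in bin_open so pytesseract reads its output file as bytes
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open, even if the OCR call fails
    builtins.open = original_open

# Decode for the Windows console, dropping any bytes CP1252 cannot map
print(str(bts, 'cp1252', 'ignore'))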

MPlanchard 2015-12-15 23:50:15

It looks like there is some useful information in a possibly related answer here.

Benjamin Hodgson 2015-12-15 23:51:53

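Yes, this sounds like the problem I ran into before. This answer would be better if you provided some code showing how to configure PyTesseract to open the file with UTF-8 encoding.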

randomusername 2015-12-16 00:19:49

@BenjaminHodgson PyTesseract has no way to specify the encoding, but we can inject our own choice of open...

Nwawel A Iroume 2015-12-16 16:32:53

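@randomusername Does your solution affect the fidelity of the extracted text? I am getting a lot of strange characters, even though the original document, while a bit blurry, contains only plain English characters! An example is iÃŽc1-zo1sÃ®Ã¢zzaÃ¯zÃœlVE0Ã2ÃŽEBP797Z5SiÃŽc1-zo1sÃ®Ã€Ã§ÃçÃ§ÃléVE0Ã2ÃŽEBP797Z5S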

randomusername 2015-12-16 23:30:11

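@NwawelAIroume No, but it does have a serious effect on the resulting output. Try printing the output as the raw bytes object to see whether you can salvage anything. Alternatively, you can store the output in a file and view it in a text editor that supports UTF-8.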
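Both suggestions can be sketched in a few lines. This assumes you already hold the OCR result as a bytes object; the helper name dump_ocr_bytes, the file name ocr_output.txt, and the sample bytes are made up for illustration.

```python
import os
import tempfile

def dump_ocr_bytes(raw, path):
    """Write the bytes unchanged and return an ASCII-only view for the console."""
    with open(path, "wb") as fh:   # binary mode: no encoding step involved
        fh.write(raw)
    return repr(raw)               # e.g. b'caf\xc3\xa9', safe on any console

# Stand-in for the bytes returned by the patched pytesseract call (assumed sample).
raw = "caf\u00e9".encode("utf-8")
out_path = os.path.join(tempfile.gettempdir(), "ocr_output.txt")
print(dump_ocr_bytes(raw, out_path))

# Reading the file back as UTF-8 recovers the text a UTF-8 editor would show.
with open(out_path, encoding="utf-8") as fh:
    recovered = fh.read()
```

If the repr shows sensible UTF-8 sequences, the OCR result itself is fine and only the decoding step is at fault; if the bytes are already nonsense, the problem is upstream in the image or in Tesseract.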
