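I am trying to extract text from an image file using Tesseract OCR in Python, but I am getting an error that I can't figure out how to handle. My environment is set up correctly, since I have already tested OCR on some sample images in Python!
Here is the code: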
from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))
print(strs)
Here is the error I get from the Eclipse console:
strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>
I am on Windows 10, using Python 3.5 x64.
The problem is that Python is decoding Tesseract's output with the Windows default encoding (CP1252) instead of the encoding the output is actually written in (UTF-8). PyTesseract has hit a byte (0x9d) that is part of a UTF-8 encoded Unicode character and is trying to interpret it as CP1252, which has no character at that position. On a platform whose default encoding is UTF-8 you won't encounter this error.
You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change the default encoding of Python as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could download Cygwin and run Python on that to get clean UTF-8 output.
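To make the failure concrete, here is a minimal sketch that reproduces the same kind of error without Tesseract at all. The three bytes below are only a stand-in I picked for illustration (the UTF-8 encoding of a right double quotation mark); the actual bytes in your output file may differ.

```python
# Stand-in for a few bytes of Tesseract's UTF-8 output: U+201D, a right
# double quotation mark, is encoded in UTF-8 as the bytes E2 80 9D.
raw = b'\xe2\x80\x9d'

try:
    # CP1252 has no character assigned to byte 0x9d, so this raises.
    raw.decode('cp1252')
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
    print(exc)

# Decoding the same bytes as UTF-8 works and yields a single character.
utf8_text = raw.decode('utf-8')
print(repr(utf8_text))
```

The bytes are fine; only the codec used to read them is wrong, which is why reading the output as bytes or as UTF-8 avoids the exception.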
If you want a quick and dirty solution that doesn't break anything (yet), consider the following approach:
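import builtins

original_open = open

def bin_open(filename, mode='rb'):  # note, the default mode now opens in binary
    return original_open(filename, mode)

from PIL import Image
import pytesseract

img = Image.open('binarized_image.png')

try:
    # Temporarily swap in bin_open so pytesseract reads Tesseract's output as bytes.
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if the OCR call fails.
    builtins.open = original_open

# Decode as CP1252 and drop any bytes that have no CP1252 character.
print(str(bts, 'cp1252', 'ignore'))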
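A variation on the same trick, in case you would rather get a str back: instead of forcing binary mode, the injected open can default text-mode calls to UTF-8. This is only a sketch. The helper name utf8_text_open is made up for illustration, and it assumes that pytesseract reads Tesseract's output through the built-in open in text mode without passing an encoding, which is what the traceback in the question suggests.

```python
import builtins
import contextlib


@contextlib.contextmanager
def utf8_text_open():
    """Temporarily make text-mode open() calls default to UTF-8.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    original_open = builtins.open

    def utf8_open(file, mode='r', *args, **kwargs):
        # args would hold (buffering, encoding, ...) if passed positionally.
        # Only touch text-mode calls that did not choose an encoding.
        if 'b' not in mode and len(args) < 2 and 'encoding' not in kwargs:
            kwargs['encoding'] = 'utf-8'
        return original_open(file, mode, *args, **kwargs)

    builtins.open = utf8_open
    try:
        yield
    finally:
        # Restore the real open() even if the wrapped code raises.
        builtins.open = original_open


# Intended usage (requires Pillow, pytesseract and the image file):
# with utf8_text_open():
#     text = pytesseract.image_to_string(img)
```

Like the binary version, this patches a global for the duration of the call, so keep the with block as small as possible. Printing the resulting string on a Windows console can still fail if it contains characters the console encoding cannot represent.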
It looks like there may be some useful information in a related answer here.
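Yes, this sounds like a problem I have run into before. This answer would be better if you provided some code showing how to configure PyTesseract to open the file with UTF-8 encoding.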
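@BenjaminHodgson PyTesseract doesn't have a way to specify the encoding, but we can inject our own open of choice...
@randomusername Does your solution affect the fidelity of the extracted text? I am getting a lot of weird characters, even though the original document, while a bit blurry, contains only plain English characters! One example is iÃŽc1-zo1sîâzzaïzÃœlVE0Ã2ÃŽEBP797Z5SiÃŽc1-zo1sîÀçÃççÃléVE0Ã2ÃŽEBP797Z5S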
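@NwawelAIroume No, but it does have a serious effect on the resulting output. Try printing the output as a raw bytes object to see whether you can salvage anything. Alternatively, you can store the output in a file and view it with a text editor that supports UTF-8.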
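The suggestion in the last comment could look roughly like this. It is a sketch under assumptions: raw stands in for the bytes object returned by the patched image_to_string call, and the sample text and the file name ocr_output.txt are made up for illustration.

```python
import os
import tempfile

# Stand-in for the bytes returned by OCR; real output will differ.
raw = 'café “quoted”'.encode('utf-8')

# 1. Inspect the raw bytes: non-ASCII characters show up as \x.. escapes.
print(repr(raw))

# 2. Decoding UTF-8 bytes as CP1252 is one way to end up with characters
#    such as 'Ã' in the output, while decoding as UTF-8 gives the real text.
mojibake = raw.decode('cp1252', 'ignore')
clean = raw.decode('utf-8', 'replace')
print(ascii(mojibake))
print(ascii(clean))

# 3. Write the untouched bytes to a file and open it in a UTF-8 aware editor.
out_path = os.path.join(tempfile.gettempdir(), 'ocr_output.txt')
with open(out_path, 'wb') as f:
    f.write(raw)
```

If the text still looks wrong when decoded as UTF-8, the odd characters are more likely coming from the OCR step itself than from the encoding.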