
[Python Crawler] Part 9: Machine Vision and Image Recognition with Tesseract

2022-11-08 09:31:53 Deng Dashuai



Thank you all for your continued attention and support. Today (February 17, 2020) I was certified as a CSDN blogging expert, 11 months after my first post on March 14, 2019. I am grateful to CSDN for the official recognition of my blog. The certification is a great encouragement, and I will keep working to write more and better posts and to keep exploring and making progress in my research field.


1. Machine vision

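From Google's self-driving cars to vending machines that can detect counterfeit bills, machine vision has long been a field with broad applications, far-reaching impact, and grand ambitions.

We will focus on one branch of machine vision, text recognition, and show how to use a few Python libraries to recognize and use the text in online images.

People can easily read the text in an image, but it is very hard for a machine to do the same. CAPTCHAs exploit exactly this kind of image, which human users can read normally but most bots cannot. CAPTCHAs also vary widely in difficulty; some are much harder to read than others.

Converting images into text is generally called optical character recognition (OCR). There are not many low-level libraries that implement OCR; many libraries today either share the same few underlying OCR libraries or customize on top of them. Python has always been an excellent language for reading and processing images, for image-related machine learning, and for creating images. Although many libraries can do image processing, here we focus on just one: Tesseract.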


2. Installing Tesseract

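Tesseract is an OCR library currently sponsored by Google (a company that is itself well known for its OCR and machine learning technology). Tesseract is widely regarded as the best and most accurate open-source OCR system available. Besides its high accuracy, it is also very flexible: it can be trained to recognize any font, and it can recognize any Unicode character.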

(1) Installing Tesseract

  • Windows: download the executable installer from https://code.google.com/p/tesseract-ocr/downloads/list and run it.
  • Linux: install it with apt-get: $ sudo apt-get install tesseract-ocr
  • Mac OS X: it can be installed easily with a third-party package manager such as Homebrew (http://brew.sh/).

(2) Installing pytesseract

Tesseract is a command-line tool, not a Python library that you load with an import statement. Once installed, the tesseract command runs outside Python, but we can use pip to install a Python wrapper for it:

pip install pytesseract
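
Because pytesseract only wraps the tesseract command, it is of little use if the binary itself cannot be found. Here is a minimal sketch of a check that uses only the standard library. The helper name find_tesseract is my own and is not part of pytesseract.

```python
import shutil

def find_tesseract(command="tesseract"):
    """Return the full path of the given command, or None if it is not on PATH."""
    return shutil.which(command)

path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it before using pytesseract")
else:
    print("tesseract found at", path)
```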

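3. Processing well-formatted text

It is best to work with text that is relatively clean and well formatted. Well-formatted text usually has the following characteristics:

  • It uses one standard font (no handwriting, cursive, or highly "decorative" fonts)
  • Even if photocopied or photographed, the characters are still crisp, with no extra marks or smudges
  • It is neatly aligned, with no slanted characters
  • It does not run off the image, is not cut off, and does not sit tight against the image edges

Some formatting problems can be fixed when the image is preprocessed. For example, you can convert the image to grayscale, adjust brightness and contrast, and crop and rotate it as needed (the details require some understanding of image and signal processing).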
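
To make the grayscale and contrast steps concrete without depending on an imaging library, here is a sketch that works on a plain list of RGB tuples. The weights 0.299, 0.587 and 0.114 are the usual ITU-R 601 luma values, which PIL's convert("L") also uses. The function names and pixel values are illustrative only.

```python
def to_gray(pixel):
    """Convert one (R, G, B) pixel to a single 0-255 gray value."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))

def stretch_contrast(grays):
    """Linearly rescale gray values so the darkest becomes 0 and the brightest 255."""
    lo, hi = min(grays), max(grays)
    if hi == lo:
        return list(grays)  # flat image: nothing to stretch
    return [int(round((g - lo) * 255 / (hi - lo))) for g in grays]

# Three made-up pixels: light background, dark text, mid-gray smudge
row = [(200, 200, 200), (90, 90, 90), (150, 150, 150)]
grays = [to_gray(p) for p in row]
print(grays)
print(stretch_contrast(grays))
```

After stretching, the dark text and the light background sit at opposite ends of the range, which makes later steps such as thresholding less sensitive to the exact cutoff.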

(1) An ideal example of well-formatted text

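With an image of clean, well-formatted text saved as test.jpg, run Tesseract with the following command to read the file and write the result to a text file: tesseract test.jpg text

cat text.txt then displays the result.

The recognition result is very accurate, although the symbols ^ and * were rendered as a double quote and a single quote respectively. On the whole, the output reads comfortably.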
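
The same command-line call can be driven from Python. The sketch below assumes the tesseract binary is installed and on PATH; the helper names are illustrative and are not part of any library.

```python
import subprocess

def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG]."""
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command

def run_tesseract(image_path, output_base, lang=None):
    """Run Tesseract and return the text it wrote to OUTPUTBASE.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()

print(build_tesseract_command("test.jpg", "text"))
```

Calling run_tesseract("test.jpg", "text") would then return the same text that cat text.txt shows.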

(2) Doing the same in Python code

import pytesseract
from PIL import Image

image = Image.open('test.jpg')
text = pytesseract.image_to_string(image)
print(text)

Output:

This is some text, written in Arial, that will be read by
Tesseract. Here are some symbols: !@#$%"&*()

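(3) Thresholding and denoising images

Images on the web often show text on a gradient background. Tesseract cannot fully process such an image, mainly because the background color is a gradient. As the background darkens from left to right, the text becomes harder and harder to recognize, and the last few characters of every line that Tesseract reads are wrong. For this kind of problem, you can first clean up the image with a Python script. Using the PIL library, we can create a threshold filter that removes the gradient background and leaves only the text, which makes the image clearer and easier for Tesseract to read: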

from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Apply a threshold filter (values below 143 become black, all others white)
    image = image.point(lambda x: 0 if x < 143 else 255)
    # Save the cleaned image
    image.save(newFilePath)

    # Call the system tesseract command to run OCR on the image
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open the output file and read the result
    with open("output.txt", 'r') as f:
        print(f.read())

if __name__ == "__main__":
    cleanFile("text2.png", "text2clean.png")
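
To see why a single cutoff removes a gradient, consider one row of gray values. This sketch uses plain Python lists rather than PIL, and the pixel values are made up for illustration; the cutoff of 143 is the one used in the script.

```python
def threshold(values, cutoff=143):
    """Map each gray value to 0 (black) if it is below the cutoff, else 255 (white)."""
    return [0 if v < cutoff else 255 for v in values]

# A background that darkens from left to right (250 down to 150),
# with dark text pixels (value 40) at positions 1 and 4.
row = [250, 40, 210, 180, 40, 150]
print(threshold(row))
```

Every background value stays above the cutoff and becomes white, while the text pixels fall below it and become black. If the background became darker than the cutoff, or the text lighter, one global threshold would no longer separate them.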

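After the earlier "blurry" image is passed through the threshold filter, most of the text is read correctly, apart from some punctuation marks that are unclear or lost. This is the best result Tesseract gives for this image.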

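(4) Scraping text from website images

Using Tesseract to read text from images on your hard drive may not be very exciting, but combined with a web scraper it becomes a powerful tool.

Images on a website may not deliberately obscure their text (as with the artistic lettering on a JPG image of a restaurant menu), but the text on them is still hidden from web scrapers. For example:

  • Although Amazon's robots.txt file allows scraping of the site's product pages, book preview pages are usually not picked up by web bots.

  • Book previews are loaded by user-triggered Ajax scripts, and the preview images are hidden beneath div nodes. To an ordinary visitor they look more like a Flash animation than image files. And even if we can get the images, reading them as text is not that simple.

The program below solves this problem. It navigates to the large-print edition of Tolstoy's War and Peace, opens the reader, collects the image URLs, then downloads the images, runs OCR on them, and finally prints the text of each image. Because the program is fairly complex and reuses pieces from several earlier chapters, I have added comments to make the purpose of each section clearer: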

import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver
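
# Create a new Selenium driver.
# Note: PhantomJS and the find_element_by_* methods used below belong to
# older Selenium releases; Selenium 4 removed them in favor of
# find_element(By.ID, ...) and a headless Chrome or Firefox driver.
driver = webdriver.PhantomJS()

# To try the Firefox browser with Selenium instead:
# driver = webdriver.Firefox()

driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
# Click the book preview button
driver.find_element_by_id("sitbLogoImg").click()
imageList = set()
# Wait for the page to finish loading
time.sleep(5)
# Keep turning pages while the right arrow is clickable
while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Collect the newly loaded pages (several can load at once; the set ignores duplicates)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)
driver.quit()

# Run Tesseract on the image URLs we collected
for image in sorted(imageList):
    # Save the image
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Wait for Tesseract to finish before reading its output file
    p.wait()
    with open("page.txt", "r") as f:
        print(f.read())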

As with our earlier uses of Tesseract, this program prints many of the book's long passages perfectly. The preview of page six looks like this:

6
     "A word of friendly advice, mon
     cher. Be off as soon as you can,
     that's all I have to tell you. Happy
     he who has ears to hear. Good-by,
     my dear fellow. Oh, by the by!" he
     shouted through the doorway after
     Pierre, "is it true that the countess
     has fallen into the clutches of the
     holy fathers of the Society of je-
     sus?"
     Pierre did not answer and left Ros-
     topchin's room more sullen and an-
     gry than he had ever before shown
     himself.

But when the text appears on the color cover, the result is much less perfect:

   WEI' nrrd Peace
   Len Nlkelayevldu Iolfluy
   Readmg shmdd be ax
   wlnvame asnossxble Wenfler
   an mm m our cram: Llhvary
   - Leo Tmsloy was a Russian rwovelwst
   I and moval phflmopher med lur
   A ms Ideas 01 nonviolenx reswslance m 5 We range     0, "and"

Turning text like this into something an ordinary reader can understand would take a lot more processing.

One option is training: given a large set of known text-to-image mappings, Tesseract can "learn" to recognize a particular font and then reach very high precision and accuracy, even ignoring the background color and the relative position of the text in the image.
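
One rough way to quantify how far OCR output is from the truth is a character-level similarity ratio. The sketch below uses difflib from the standard library. The expected string is my assumption about what the first cover line should read, and the ratio is only a similarity score, not the precision or accuracy figure used in formal OCR evaluation.

```python
from difflib import SequenceMatcher

def similarity(ocr_text, expected_text):
    """Return a 0.0-1.0 similarity ratio between OCR output and the expected text."""
    return SequenceMatcher(None, ocr_text, expected_text).ratio()

expected = "War and Peace"  # assumed ground truth for the cover title
print(similarity("War and Peace", expected))
print(similarity("WEI' nrrd Peace", expected))
```

Tracking a score like this before and after preprocessing or training gives a simple signal of whether a change actually helped.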


4. Handling CAPTCHAs

(1) Trying to handle the CAPTCHA on Zhihu.com

Many popular content management systems have added CAPTCHA modules. So how do we recognize a CAPTCHA?

CAPTCHA images generated by most websites have the following properties:

  • They are images generated dynamically by a server-side program. The src attribute of a CAPTCHA image may look different from that of an ordinary image, for example <img src="WebForm.aspx?id=8AP85CQKE9TJ">, but the image can be downloaded and processed like any other.
  • The answers to the images are stored in a server-side database.
  • Many CAPTCHAs expire: if you take too long to solve one, it becomes invalid.

The usual approach is to download the CAPTCHA image to the hard drive, clean it up, process it with Tesseract, and finally return a recognition result in the form the website requires.
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
import time
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup

def captcha(data):
    # Save the CAPTCHA bytes to disk so PIL can open them
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    time.sleep(1)
    image = Image.open("captcha.jpg")
    text = pytesseract.image_to_string(image)
    print("The CAPTCHA recognized by the machine is: " + text)
    command = input("Enter Y to accept it, or press any other key to type it yourself: ")
    if command == "Y" or command == "y":
        return text
    else:
        return input('Enter the CAPTCHA: ')

def zhihuLogin(username, password):

    # Build a session object that keeps the Cookie values
    sessiona = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

    # Fetch the page first to find the data to POST (this also records the page's Cookie)
    html = sessiona.get('https://www.zhihu.com/#signin', headers=headers).content

    # Find the input tag whose name attribute is _xsrf and take its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

    # Fetch the CAPTCHA; the value of r is a Unix timestamp in milliseconds (time.time() * 1000)
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)

    data = {
        "_xsrf": _xsrf,
        "email": username,
        "password": password,
        "remember_me": True,
        "captcha": captcha(response.content)
    }

    response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
    print(response.text)

    response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
    print(response.text)


if __name__ == "__main__":
    zhihuLogin('[email protected]','ALAxxxxIME')
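
OCR output often carries stray whitespace or punctuation that a login form will reject. Here is a sketch of a cleanup step that could run on the text before it is returned. The expected length of 4 and the alphanumeric-only rule are assumptions for illustration, not Zhihu's actual requirements, and normalize_captcha is a hypothetical helper.

```python
import re

def normalize_captcha(text, expected_length=4):
    """Strip non-alphanumeric characters; return None if the length is wrong."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
    if len(cleaned) != expected_length:
        return None  # let the caller fall back to manual entry
    return cleaned

print(normalize_captcha(" a7 Kx\n"))
print(normalize_captcha("a7K"))
```

Returning None for an implausible result lets the caller skip the confirmation prompt and go straight to manual entry, or request a fresh CAPTCHA.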

(2) Trying to handle Chinese characters

If you have Chinese training data on hand, you can also try recognizing Chinese.

The command tesseract --list-langs shows the languages currently supported; chi_sim in the list means Simplified Chinese is supported.
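
A script can make the same check before asking for chi_sim. The sketch below parses text in the shape that tesseract --list-langs prints, a header line followed by one language code per line. The sample string is made up, and the exact header wording varies between Tesseract versions.

```python
def parse_langs(output):
    """Return the language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]  # the first non-empty line is the header

# Made-up sample in the general shape of the real output
sample = "List of available languages (3):\nchi_sim\neng\nosd\n"
langs = parse_langs(sample)
print("chi_sim" in langs)
```

With Tesseract installed, the real text could be captured with subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True); note that older releases may write the list to stderr rather than stdout.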

Then, when running Tesseract, you can specify the language to recognize, for example:

tesseract -l chi_sim paixu.png paixu

In a program, this can be written as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from PIL import Image
import subprocess

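def cleanFile(filePath):
    # Open the image first so a missing or invalid file fails early
    image = Image.open(filePath)

    # Call the system tesseract command to run Chinese OCR on the image
    subprocess.call(["tesseract", "-l", "chi_sim", filePath, "paixu"])

    # Open the output file and read the result (Tesseract writes UTF-8)
    with open("paixu.txt", 'r', encoding='utf-8') as f:
        print(f.read())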

if __name__ == "__main__":
    cleanFile("paixu.png")

Running the script prints the recognized Chinese text.


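5. Training Tesseract

To use some of Tesseract's features, such as training it to recognize letters as in the later examples, first set a new environment variable, $TESSDATA_PREFIX, so that Tesseract knows where its training data files are stored. Then create a tessdata data directory and place it in the Tesseract directory.

  • On most Linux and Mac OS X systems, you can set it like this: $ export TESSDATA_PREFIX=/usr/local/share/Tesseract

  • Windows is similar. You can set the environment variable with the following command (the path contains spaces, so it must be quoted): setx TESSDATA_PREFIX "C:\Program Files\Tesseract OCR\Tesseract"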
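
The variable can also be set for a single Tesseract call from Python, without changing the system configuration. This is a sketch: tessdata_env is a hypothetical helper, and the path is just the example used for Linux and Mac OS X.

```python
import os

def tessdata_env(prefix, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set to prefix."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = prefix
    return env

env = tessdata_env("/usr/local/share/Tesseract")
print(env["TESSDATA_PREFIX"])
```

The resulting dictionary can be passed as the env argument of subprocess.call or subprocess.run, so that only that one Tesseract process sees the setting.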

Run Tesseract on an image with the following command:

tesseract captchaExample.png output

Most other CAPTCHAs are relatively simple. For example, the popular PHP content management system Drupal has a well-known CAPTCHA module (https://www.drupal.org/project/captcha) that can generate CAPTCHAs of varying difficulty.

Factors that affect how hard a CAPTCHA is to recognize, and why:

  • Size: fonts that are too small need additional training to be recognized
  • Number of fonts: the more fonts are used, the harder recognition becomes
  • Tilt: randomly tilted characters confuse OCR software
  • Mixed letters and digits: this increases the number of characters to search for
  • Overlap and crossing: if the boxes drawn around each letter overlap, recognition is harder
  • Background color and lines: these create noise that interferes with the OCR program
  • Contrast between background and font color: the lower the contrast, the harder recognition becomes

Creating a sample library to train Tesseract

To train Tesseract to recognize a kind of text, whether an obscure font or a CAPTCHA, you need to give it several samples of each character in different forms.

The first step is to collect a large number of CAPTCHA samples; the number and complexity of the samples determine how well the training works. The second step is to tell Tesseract exactly what each character in an image is and exactly where it is located.

This requires creating box files, one per CAPTCHA image. You can also use the jTessBoxEditor software to adjust the rectangle positions.

A box file for one image looks like this:

      4  15 26 33 55 0
      M  38 13 67 45 0
      m  79 15 101 26 0
      C  111 33 136 60 0
      3  147 17 176 45 0

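The first column holds each character in the image. The next four numbers are the coordinates of the smallest rectangle that encloses the character. The origin (0,0) is the bottom-left corner of the image, and the four numbers are the x and y coordinates of the rectangle's bottom-left corner followed by the x and y coordinates of its top-right corner. The final number, 0, is the page number of the sample, which matters only for multi-page documents.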
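
Reading such a file back in Python is straightforward. The sketch below parses one line and, because the box format measures y from the bottom of the image while PIL measures it from the top, also flips the coordinates. The image height of 70 pixels is an assumption, since the height of the sample image is not stated here, and the names are illustrative.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")

def parse_box_line(line):
    """Parse one box-file line: CHAR LEFT BOTTOM RIGHT TOP PAGE."""
    char, left, bottom, right, top, page = line.split()
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))

def to_top_left_origin(box, image_height):
    """Convert to (left, upper, right, lower) with the origin at the top-left,
    the convention used by PIL's Image.crop."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)

box = parse_box_line("4  15 26 33 55 0")
print(box)
print(to_top_left_origin(box, image_height=70))  # 70 is an assumed height
```

Cropping each character with the converted rectangle is a quick way to check by eye that the boxes in a file line up with the image.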

A box file must be saved as a text file with the .box extension (for example, 4MmC3.box).

A good training tutorial on cnblogs: http://www.cnblogs.com/mjorcen/p/3800739.html?utm_source=tuicool&utm_medium=referral

The material above is only a brief overview of Tesseract's font training and recognition capabilities. If you are interested in other ways to train Tesseract, or even plan to build your own library of CAPTCHA training files, I recommend reading the official Tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki. Good luck!


Coming up next:

  • [Python Crawler] Part 10: The Scrapy framework

If you have any questions or suggestions, I look forward to your comments!

Copyright notice
Author: Deng Dashuai. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/312/202211080910333508.html
