current position:Home>A little trick every day, python can easily convert PDF to text and bid farewell to copy and paste

A little trick every day, python can easily convert PDF to text and bid farewell to copy and paste

2022-02-02 04:23:05 jiejie

For many people , take PDF Converting to editable text is just a matter of time , But there is no simple way . In the project introduced in this article , come from K1 Digital The senior machine learning engineer of Lucas Soares, Try to use OCR( Optical character recognition ) Automatic transcription pdf Slide , The transcription effect is good .

Traditional lectures are usually accompanied by a group of pdf Slide . Generally speaking , Want to take notes of such lectures , Need from pdf Copy 、 Paste a lot of content .
lately , come from K1 Digital The senior machine learning engineer of Lucas Soares I've been trying to use OCR( Optical character recognition ) Automatic transcription pdf Slide , So as to directly in markdown Manipulate their contents in the file , This avoids manual copying and pasting pdf Content , Automate this process .

 picture

On the left is the project author Lucas Soares.
Why not use traditional pdf What about the text transfer tool ?
Lucas Soares Finding traditional tools often brings more problems , It takes time to solve . He once tried to use the traditional Python software package , But there are many problems ( For example, complex regular expression patterns must be used to parse the final output, etc ), So I decided to try it Target detection and OCR To solve .
The basic process can be divided into the following steps :

  • take pdf Convert to picture ;
  • Detect and recognize the text in the image ;
  • Show sample output .

Based on deep learning OCR take pdf Transcribe into text
take pdf Convert to image
Soares The use of pdf Slides from David Silver Enhanced learning ( See below pdf Slide address ). Use 「pdf2image」 The package converts each slide into png Image format .

 picture

*pdf Slide example . Address :www.davidsilver.uk/wp-content/… The code is as follows :

from pdf2image import convert_from_path
from pdf2image.exceptions import (
 PDFInfoNotInstalledError,
 PDFPageCountError,
 PDFSyntaxError
)

pdf_path = "path/to/file/intro_RL_Lecture1.pdf"
images = convert_from_path(pdf_path)
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")
 Copy code 

After treatment , be-all pdf The slides are converted to png Format image :

 picture

Detect and recognize the text in the image
In order to detect and identify png Text in image ,Soares Use ocr.pytorch Text detector in Library . Follow the instructions to download the model and save it in checkpoints In the folder .
The code is as follows :

# adapted from this source: https://github.com/courao/ocr.pytorch
%load_ext autoreload
%autoreload 2
import os
from ocr import ocr
import time
import shutil
import numpy as np
import pathlib
from PIL import Image
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pytesseract

def single_pic_proc(image_file):
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result,image_framed

image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'

# If the output folder exists we will remove it and redo it.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)

for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file) # detecting and recognizing the text
    filename = pathlib.Path(image_file).name
    output_file = os.path.join(result_dir, image_file.split('/')[-1])
    txt_file = os.path.join(result_dir, image_file.split('/')[-1].split('.')[0]+'.txt')
    txt_f = open(txt_file, 'w')
    Image.fromarray(image_framed).save(output_file)
    for key in result:
        txt_f.write(result[key][1]+'\n')
    txt_f.close()
 Copy code 

Set input and output folders , Then traverse all the input images ( Converted pdf Slide ), And then through single_pic_proc() Function run OCR Detection and recognition model in module , Finally, save the output to the output folder .
Which detects inheritance (inherit) 了 Pytorch CTPN Model , Identify inherited Pytorch CRNN Model , Both exist in OCR Module .
Sample output
The code is as follows :

import cv2 as cv

output_dir = pathlib.Path("./output_images_with_boxes")

# image = cv.imread(str(np.random.choice(list(output_dir.iterdir()),1)[0]))
image = cv.imread(f"{output_dir}/image7.png")
size_reshaped = (int(image.shape[1]),int(image.shape[0]))
image = cv.resize(image, size_reshaped)
cv.imshow("image", image)
cv.waitKey(0)
cv.destroyAllWindows()
 Copy code 

The left side of the figure below shows the original pdf Slide , On the right is the transcribed output text , The accuracy of post transcription is very high .

 picture

The text recognition output is as follows :

filename = f"{output_dir}/image7.txt"
with open(filename, "r") as text:
    for line in text.readlines():
        print(line.strip("\n"))
 Copy code 

By the above methods , Finally, you can get a very powerful tool to transcribe all kinds of documents , From detecting and recognizing handwritten notes to detecting and recognizing random text in photos . Own your own OCR Tools to process some text content , This is much better than relying on external software to transcribe documents .

Python It is a very diverse and well developed language , So there will certainly be many functions I didn't consider , If anyone knows , You can tell me in the comments section

copyright notice
author[jiejie],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202020423029606.html

Random recommended