current position:Home>Van * Python | save the crawled data with docx and PDF

Van * Python | save the crawled data with docx and PDF

2022-01-31 07:27:25 coder_ pig

This is my participation 11 The fourth of the yuegengwen challenge 3 God , Check out the activity details :2021 One last more challenge

0x1、 introduction

Last section 《Van*Python | Simple crawling of a planet 》 There are readers chatting privately backstage. I said :

Save crawl results to Markdown in , It's inconvenient to watch on your mobile phone ,

I

Out of line , Cough , Maybe I will use it in the future , Let's toss about , Then the afternoon fishing time , After searching a wave of keywords , I saw two conventional ways to play , Let's try , Let's toss about the seemingly simple library first ↓

pandoc

0x2、pandoc Kuchu experience

Support super ! super ! super ! Multiple types of mutual conversion , There is a big lump below :

well , You don't have to look at which file formats are supported , Basically everything you can think of , We're here to put Markdown Turn into PDF.

How to install it, you can see INSTALL.md, Like the author Windows Download the compressed package directly or use choco Installation is OK , The former is used here , Direct download Installation free package

Then try Configure environment variables , So that... Can be performed everywhere pandoc command :

Decompress the package → Enter folder → Choose pandoc.exe → Hold down shift Right click → Copy path → This computer → Right click to open properties → Find the advanced system Settings → environment variable → In system variable (S) It's about → find PATH → edit → newly build → Paste the path just copied here :

After the configuration , open cmd, type :pandoc -v, If not output :pandoc The command can't be found or something , And the following :

Indicating successful configuration , If the path is correct , After power on and power off , If it still doesn't work , Like the author, you can directly throw the unzipped file into Python Of Scripts Under the table of contents , adopt where Command acquisition Python Installation directory :

Come to the path below , Just throw all the documents here

End of configuration , Then let's see how to use :

Relatively simple , Is to execute on the command line :

pandoc -o  File to be converted   Converted file 
 Copy code 

Try putting txt Turn into pdf:

You need to specify the latex engine , There are the following options :

It's a bit of a problem , Some have to be alone next Latex Mirror image ,4 More than a G, So direct or direct conversion to Word Format of document → docx 了 :

pandoc 123.txt -o test.docx
 Copy code 

Open to see the effect :

ok , Then there are some general operations of traversing folders , Splicing cmd character string , utilize subprocess Carry out orders , The sample code is as follows :

def md_to_doc(file_path):
    cmd = "pandoc {} -o {}"
    sep_split = file_path.split(os.path.sep)
    #  Switch to the picture Directory 
    os.chdir(output_root_dir)
    #  Determine whether the folder exists , Not creating 
    cp_file_utils.is_dir_existed(os.path.join(doc_save_dir, sep_split[-2]))
    doc_file_path = os.path.join(doc_save_dir, '{}{}{}.docx'.format(sep_split[-2], os.path.sep, sep_split[-1][:-4]))
    subprocess.call(cmd.format(file_path, doc_file_path), shell=True)
    print(" Generate the file :", doc_file_path)
 Copy code 

The file generation is complete , Then write a script to put so many Word The document is synthesized into a , The following libraries are required ( direct pip Just install it ):

pip install python-docx
pip install docxcompose
 Copy code 

Then go straight to the liver code :

from docx import Document
from docxcompose.composer import Composer

def compose_docx(docx_list):
    #  First file 
    master = Document(docx_list[0])
    master.add_page_break()  #  Force a new page 
    composer = Composer(master)
    #  Subsequent documents are added and merged 
    for docx in docx_list[1:]:
        print(" Currently processing files :", docx)
        temp = Document(docx)
        temp.add_page_break()
        composer.append(temp)
    composer.save("result.docx")
    print(" File merge complete ...")
 Copy code 

After running, wait until the program is finished , Because the default merge order is by file name , When we create the file, we use the time stamp method , So don't worry about order . Look at the synthesized document , Sure :

841mb,2927 page ,WPS Open it and get stuck on the spot :

0x3、 A slightly more troublesome solution

Then let's talk about the second scheme , It's about putting Markdown Render into HTML, And then convert it to PDF, Use these two libraries :

pip install markdown
pip install pdfkit
 Copy code 

also :wkhtmltopdf, Also download the compressed package , How to configure environment variables :

Type... On the command line :wkhtmltopdf -V Check whether the configuration is effective ~

Then you can do it directly , Write the following test demo:

import pdfkit
from markdown import markdown

def md_to_pdf(file_path):
    html = markdown(cp_file_utils.read_file_text_content(file_path), output_format='html')
    pdfkit.from_string(html, "out.pdf", options={'encoding': 'utf-8'})
 Copy code 

Pass in md File path , If... Occurs after operation :

Pdfkit OSError: No wkhtmltopdf executable found

The environment variable above doesn't work , Reopen a window , Or specify the path through the following code :

import pdfkit
config = pdfkit.configuration(wkhtmltopdf=r"D:\xxx\bin\wkhtmltopdf.exe")
pdfkit.from_url(html, filename, configuration=config)
 Copy code 

Then the following error is reported during the operation :

It refers to external resources , But I didn't find , Sure try-except Capture a wave of anomalies :

def md_to_pdf(file_path):
    html = markdown(cp_file_utils.read_file_text_content(file_path), output_format='html')
    try:
        pdfkit.from_string(html, "out.pdf", options={'encoding': 'utf-8'})
    except IOError as e:
        #  Ignore exceptions directly 
        pass
    finally:
        print(" File generation finished ...")
 Copy code 

After running, open the generated PDF, Look at the effect :

ok , But this default rendering , Annotation is not supported 、 form 、LaTeX、 Code block 、 flow chart 、 Sequence diagram and Gantt Chart , More extensions need to be introduced .

Examples are as follows :

#  Enable tables Expand 
html = markdown(text, output_format='html', extensions=['tables'])
 Copy code 

Examples are as follows :

#  Install math pack 
pip install python-markdown-math

#  Enable math pack extensions 
text = markdown(text, output_format='html', extensions=['mdx_math'])
 Copy code 
  • Use of third parties Markdown Rendering HTML Tool export HTML ( Such as operation tribe ), Again into pdf

Examples are as follows :

pdfkit.from_file('test.html', 'test.pdf', options={'encoding': 'utf-8'})
 Copy code 

For more details, please refer to :Python take MarkDown turn PDF( Perfect transition HTML, contain LaTeX、 Tables, etc )

0x4、 Summary

It looks very simple , actually , There is a need for in-depth customized styles , Have to toss , But coincidentally , I didn't , Just look at it , Interested readers can feel for themselves ~

Crawling data storage methods and skills +1, The reading experience is also more ,23333, The above is the whole content of this paper , If you have any questions, please point out that , thank ~

copyright notice
author[coder_ pig],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201310727216527.html

Random recommended