Programmers over 25 ought to know a few Chinese herbal medicines: Python crawler lesson 9-9

2022-01-31 19:20:42 Dream eraser

The only Chinese herbal medicines Eraser can name are nux vomica, cassia seed, cocklebur, lotus seed, huangyaozi, bitter bean, Kawa Ko, and "I want face", all learned from "Compendium of Materia Medica". Beyond those, I also know medlar, sanqi, Huoxiang Zhengqi water, and Banlangen. To get out of the predicament of not knowing Chinese herbal medicine, I decided to crawl Chinese herbal medicine data and store it locally. That is the background of this article.

First, here are pictures of the Chinese medicinal materials just mentioned, so we can get acquainted (I really did recognize one: as a child, Xanthium burrs would stick to my legs when I walked through the fields).

Analysis before crawling

The target website is www.zhongyaocai.com/. Opening its Chinese medicinal materials library shows 752 pages of data with about 12 entries per page, about 9,000 herbs in total (752 × 12 = 9,024). Our goal today is to store this data locally.

Each field can be extracted separately with a regular expression. The HTML source to be matched is as follows:

<div class="poem-head">
  <a class="poem-title" href="https://www.zhongyaocai.com/zyc/gelifen_2542.htm" > Clam powder </a
  >
  <div class="poem-handler"></div>
</div>

<div class="poem-body">
  <div class="poem-sub">
    <span class="list_span"> The original form :</span
    ><span> Four horned clams , The shell is slightly quadrangular , It's tough , Shell length 36-48mm, shell ......</span>
  </div>
  <div class="poem-sub">
    <span class="list_span"> Sexual flavour :</span><span> It's salty ; Sexual cold </span>
  </div>
  <div class="poem-sub">
    <span class="list_span"> Usage and dosage :</span
    ><span> Take orally : Fried soup ,50-100g; Or into the pill 、 scattered ,3-10g......</span>
  </div>
  <div class="poem-sub">
    <span class="list_span"> Function of the attending :</span
    ><span> Clearing away heat ; Resolving phlegm and dampness ; Soft and hard ......</span>
  </div>
</div>

The regular expression part is as follows :

    pattern = re.compile(
        r'<div class="poem-head"><a class="poem-title" href="(.*?)">(.*?)</a>')
    title_url = pattern.findall(html)
    xing = re.findall(
        r'<span class="list_span"> The original form :</span><span>(.*?)</span>', html)
    wei = re.findall(
        r'<span class="list_span"> Sexual flavour :</span><span>(.*?)</span>', html)
    liang = re.findall(
        r'<span class="list_span"> Usage and dosage :</span><span>(.*?)</span>', html)
    zhi = re.findall(
        r'<span class="list_span"> Function of the attending :</span><span>(.*?)</span>', html)
    items = []
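
The patterns above expect the tags to sit directly next to each other, while the HTML sample is shown with line breaks and indentation between them. A minimal sketch of a whitespace-tolerant variant follows, run against an abridged copy of that sample. The helper field_re and the constant names are hypothetical, and the label text is the one used in this article; the live page's labels may differ and must be matched exactly.

```python
import re

# Abridged copy of the HTML sample shown earlier (labels as they appear in this article).
SAMPLE_HTML = """
<div class="poem-head">
  <a class="poem-title" href="https://www.zhongyaocai.com/zyc/gelifen_2542.htm" > Clam powder </a
  >
  <div class="poem-handler"></div>
</div>
<div class="poem-body">
  <div class="poem-sub">
    <span class="list_span"> Sexual flavour :</span><span> It's salty ; Sexual cold </span>
  </div>
</div>
"""

# \s* between tokens tolerates line breaks and indentation;
# re.S lets "." match newlines inside a captured value.
TITLE_RE = re.compile(
    r'<div class="poem-head">\s*<a class="poem-title" href="(.*?)"\s*>(.*?)</a',
    re.S)


def field_re(label):
    """Build a pattern for one labelled <span> pair (hypothetical helper)."""
    return re.compile(
        r'<span class="list_span">\s*' + re.escape(label)
        + r'\s*:</span\s*>\s*<span>(.*?)</span>',
        re.S)


# Strip the padding spaces around each captured value.
title_url = [(url, name.strip()) for url, name in TITLE_RE.findall(SAMPLE_HTML)]
wei = [value.strip() for value in field_re("Sexual flavour").findall(SAMPLE_HTML)]

print(title_url)
print(wei)
```

Whether the extra tolerance is needed depends on how the server actually formats the page, so it is worth checking the raw response before choosing between the two styles.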

Once the data has been matched, it is stored locally, this time in JSON format. The main reason is to avoid the mess that <br> tags cause when the data is written to Excel. Storing the data directly in a database would avoid that problem as well.
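
As a small illustration of why JSON copes well with such text, the sketch below serializes a made-up record whose value contains both a <br> tag and a real line break. The record contents are invented for the example; only the {"yaos": [...]} wrapper follows the crawler in this article.

```python
import json

# Made-up record: the value contains a <br> tag and an embedded newline.
item = {
    "name": "Clam powder",
    "liang": "Take orally : Fried soup ,50-100g;<br>\nOr into the pill",
    "note": "中药",
}

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
line = json.dumps({"yaos": [item]}, ensure_ascii=False)

# The newline inside the value is escaped as \n, so the record stays on one line.
print(line)

# Loading the line gives back exactly the original data.
restored = json.loads(line)
print(restored["yaos"][0] == item)
```

Because every record stays on a single physical line, one line per page can be appended to the same file and split apart again later.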

Code time

This case is lecture 9 of this part of the crawler course. By now it should be very simple for you: start a few threads and crawl directly.

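import json
import os
import re
import threading

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

LAST_PAGE = 752  # total number of list pages
flag_page = 0  # last page number handed out, shared by all threads
page_lock = threading.Lock()  # guards flag_page
file_lock = threading.Lock()  # guards writes to the output file

def anay(html):
    # Parse one list page into a list of dicts, one per herb.
    # The label text in each pattern must match the page source exactly.
    pattern = re.compile(
        r'<div class="poem-head"><a class="poem-title" href="(.*?)">(.*?)</a>')
    title_url = pattern.findall(html)
    xing = re.findall(
        r'<span class="list_span"> The original form :</span><span>(.*?)</span>', html)
    wei = re.findall(
        r'<span class="list_span"> Sexual flavour :</span><span>(.*?)</span>', html)
    liang = re.findall(
        r'<span class="list_span"> Usage and dosage :</span><span>(.*?)</span>', html)
    zhi = re.findall(
        r'<span class="list_span"> Function of the attending :</span><span>(.*?)</span>', html)
    items = []
    # zip() stops at the shortest list, so a missing field cannot raise IndexError
    for (url, name), xing_i, wei_i, liang_i, zhi_i in zip(title_url, xing, wei, liang, zhi):
        dict_item = {
            "name": name,
            "url": url,
            "xing": xing_i,
            "wei": wei_i,
            "liang": liang_i,
            "zhi": zhi_i
        }
        items.append(dict_item)
    return items

def save(json_data):
    # Append one line of JSON per page; create the output folder if needed
    os.makedirs("./data1", exist_ok=True)
    with file_lock:
        with open("./data1/one.json", "a+", encoding="utf-8") as f:
            f.write(json_data + "\n")

def get_list():
    global flag_page
    while True:
        # Claim the next page under the lock, so that no page is fetched twice
        # and no thread runs past the last page
        with page_lock:
            if flag_page >= LAST_PAGE:
                return
            flag_page += 1
            page = flag_page
        url = f"https://www.zhongyaocai.com/zyc_p{page}.htm"
        print(url)
        r = requests.get(url=url, headers=headers, timeout=10)
        r.encoding = "utf-8"
        data = anay(r.text)
        json_data = json.dumps({"yaos": data}, ensure_ascii=False)
        save(json_data)

if __name__ == "__main__":
    # Start 5 worker threads named t1..t5
    for i in range(1, 6):
        t = threading.Thread(target=get_list, name=f"t{i}")
        t.start()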
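
One alternative to sharing a page counter between threads is to put the page numbers into a queue and let each worker take the next one. The sketch below shows the idea without any network access: crawl_pages and fake_fetch are hypothetical names, and fake_fetch merely builds the page URL where a real worker would download and parse the page.

```python
import queue
import threading


def crawl_pages(fetch, last_page, workers=5):
    """Run fetch(page) for pages 1..last_page across worker threads.

    `fetch` is supplied by the caller (hypothetical here); in a real crawler
    it would download and parse one list page.
    """
    pages = queue.Queue()
    for page in range(1, last_page + 1):
        pages.put(page)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # The queue hands out each page number exactly once.
                page = pages.get_nowait()
            except queue.Empty:
                return
            data = fetch(page)
            with results_lock:
                results[page] = data

    threads = [threading.Thread(target=worker, name=f"t{i}")
               for i in range(1, workers + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every page has been processed
    return results


def fake_fetch(page):
    # Stand-in for the real download: just build the list-page URL.
    return f"https://www.zhongyaocai.com/zyc_p{page}.htm"


results = crawl_pages(fake_fetch, last_page=20)
print(len(results))
```

Joining the threads before returning also gives the caller a clear point at which all pages are done, which is convenient when the results are post-processed afterwards.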

The data is stored locally with one line per page, and every line is a JSON document. Once read back, it can be processed however you like.
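
Reading the file back could look like the following sketch. The function name load_items is hypothetical, and the demo writes a small temporary file with made-up records in the same one-line-per-page layout instead of touching ./data1/one.json.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a one-JSON-document-per-line file and flatten all "yaos" lists."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            items.extend(json.loads(line)["yaos"])
    return items


# Demo with made-up data: two "pages", three records in total.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "one.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"yaos": [{"name": "Clam powder"}]},
                           ensure_ascii=False) + "\n")
        f.write(json.dumps({"yaos": [{"name": "demo herb A"},
                                     {"name": "demo herb B"}]},
                           ensure_ascii=False) + "\n")
    items = load_items(path)

print(len(items))
```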

Overall summary of the crawler course

This series of lessons mainly covers the basics of the requests library. I hope that after these 9 lessons everyone has a fairly comprehensive understanding of it. The knowledge points not covered here will be picked up automatically the longer you spend programming; many a "Mr. Yun" who has written about learning methods has given the same answer.

The most important job of the requests library is sending requests and getting data back. The core methods are get and post, together with two commonly used properties, text and content. Everything else is supplementary knowledge.
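
The difference between the two properties is that content holds the raw bytes of the response, while text is those bytes decoded with the response's encoding, which is why the crawler sets r.encoding = "utf-8" before reading r.text. The sketch below imitates this without any network access; the bytes are a made-up stand-in for r.content, not a real response.

```python
# Stand-in for r.content: the raw bytes of a UTF-8 encoded page (made-up sample).
content = "<title>中药材</title>".encode("utf-8")

# What r.text would look like with the correct encoding ...
text_ok = content.decode("utf-8")

# ... and with a wrong one: every byte still decodes, but the Chinese is garbled.
text_bad = content.decode("iso-8859-1")

print(type(content).__name__)
print(text_ok)
print(text_ok == text_bad)
```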

That wraps up the requests library part of the crawler course.

Today is day 1 of 100 of continuous writing. If you have ideas or techniques you would like to discuss, feel free to leave a comment in the comments section.

copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311920393889.html