
Lazy Listening Network: audio novel category data collection with multi-threaded fast crawling, case 23 of 120 Python crawler examples

2022-02-01 17:15:52 Dream eraser

「This is day 4 of my participation in the November More-Text Challenge (30 days). Check out the activity details: 2021 Last More-Text Challenge」



Multithreading is applied throughout the Python crawler learning process for one reason: to speed up, speed up, and speed up again.

Target site analysis

The target of this crawl is the Lazy Listening Network. I picked a category at random, the audio fiction channel; other channels can be crawled the same way, and once you add traversal over categories you can crawl the whole site. The pagination rules of the list page are shown below. This time only the list-page data is extracted, with the threading module added to improve collection efficiency.

http://www.lrts.me/book/category/1/recommend/1/20
http://www.lrts.me/book/category/1/recommend/2/20

The URL extraction template is as follows:

http://www.lrts.me/book/category/1/recommend/{page number}/20

The total number of pages on the site can simply be read off by eye; if you want to acquire it dynamically, extract the page-count data from the page itself.
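If you do want to read the page count dynamically, a minimal sketch is shown below. The XPath and URL pattern here are hypothetical placeholders (the real pagination markup of lrts.me was not shown in this article), so inspect the live list page and adjust them before use:

```python
import re

import requests
from lxml import etree


def parse_total_pages(html: str) -> int:
    """Pull the largest page number out of pagination links in the HTML.

    The XPath and the '/recommend/<n>/20' pattern are assumptions --
    verify them against the real list-page markup.
    """
    element = etree.HTML(html)
    hrefs = element.xpath('//div[@class="pagination"]//a/@href')
    pages = []
    for href in hrefs:
        m = re.search(r'/recommend/(\d+)/20', href)
        if m:
            pages.append(int(m.group(1)))
    return max(pages) if pages else 1


def get_total_pages(list_url: str) -> int:
    """Fetch the first list page and read the total page count from it."""
    res = requests.get(list_url, timeout=5)
    return parse_total_pages(res.text)
```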

The final data to extract is shown in the figure below; it consists of three parts: book title, author, and anchor (narrator).


Code time

For the multi-threaded part of this case, besides sharing global variables, a semaphore mechanism is added, i.e. a limit on the number of concurrently running threads.

A simple demo of the semaphore mechanism is shown below:


import threading
import time


def run(n, semaphore):
    # Acquire the semaphore (blocks when the limit is reached)
    semaphore.acquire()
    time.sleep(2)
    print(f'Running thread {n}')
    # Release the semaphore
    semaphore.release()


if __name__ == '__main__':
    # Allow at most 3 threads to run at the same time
    semaphore = threading.BoundedSemaphore(3)
    for i in range(10):
        t = threading.Thread(target=run, args=(f'thread number: {i}', semaphore))
        t.start()
    # Busy-wait until only the main thread is left
    while threading.active_count() != 1:
        pass
    else:
        print('All threads have finished')

Run the code and you will find that 3 threads run first, then the next 3, and so on; of course, among threads running at the same time there is no fixed order.

The semaphore here is the BoundedSemaphore class from the threading module. This class limits how many threads may hold the semaphore at once, i.e. at most that many threads can run at the same time.
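It is worth noting why BoundedSemaphore is used rather than the plain Semaphore: a BoundedSemaphore raises ValueError if it is released more times than it was acquired, which catches acquire/release pairing bugs early. A small sketch:

```python
import threading

# A plain Semaphore silently lets the counter grow past its initial value
plain = threading.Semaphore(2)
plain.release()            # counter is now 3 -- the bug goes unnoticed

# A BoundedSemaphore refuses to exceed its initial value
bounded = threading.BoundedSemaphore(2)
try:
    bounded.release()      # one release too many
except ValueError:
    print("BoundedSemaphore caught the unbalanced release")
```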

The complete code for the case is as follows:

import random
import threading

import requests
from lxml import etree


def get_headers():
    uas = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    ]
    ua = random.choice(uas)
    headers = {
        "user-agent": ua,
        "referer": "https://www.baidu.com/"
    }
    return headers


def run(url, semaphore):
    headers = get_headers()
    semaphore.acquire()  # acquire the semaphore
    try:
        res = requests.get(url, headers=headers, timeout=5)
        if res:
            text = res.text
            element = etree.HTML(text)
            titles = element.xpath('//a[@class="book-item-name"]/text()')
            authors = element.xpath('//a[@class="author"]/text()')
            anchors = element.xpath('//a[@class="g-user-shutdown"]/text()')
            save(url, titles, authors, anchors)
    finally:
        semaphore.release()  # release even if the request raises


def save(url, titles, authors, anchors):
    data_list = zip(titles, authors, anchors)
    with open("./data.csv", "a+", encoding="utf-8") as f:
        for item in data_list:
            f.write(f"{item[0]},{item[1]},{item[2]}\n")
    print(url, "URL data written")


if __name__ == '__main__':
    url_format = 'https://www.lrts.me/book/category/1/recommend/{}/20'
    # Build the URL list (a globally shared variable)
    urls = [url_format.format(i) for i in range(1, 1372)]
    semaphore = threading.BoundedSemaphore(5)  # allow at most 5 threads at the same time
    for url in urls:
        t = threading.Thread(target=run, args=(url, semaphore))
        t.start()
    while threading.active_count() != 1:
        pass
    else:
        print('All threads have finished')
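One caveat about save(): a title or author that itself contains a comma will corrupt the hand-rolled CSV line. A safer sketch using the standard csv module, which quotes such fields automatically (same three columns; the file name is just the one used above):

```python
import csv


def save(url, titles, authors, anchors):
    # csv.writer quotes any field containing commas, quotes, or newlines
    with open("./data.csv", "a+", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(zip(titles, authors, anchors))
    print(url, "URL data written")
```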

The threading.active_count() call in the code checks whether any worker threads are still active; once only the main thread remains, the program ends.
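Note that the while loop on threading.active_count() is a busy-wait that burns a full CPU core while polling. A gentler pattern with the same semaphore logic is to collect the Thread objects and join() each one:

```python
import threading
import time


def run(n, semaphore):
    with semaphore:        # acquire on entry, release on exit, even on error
        time.sleep(0.1)
        print(f'thread {n} done')


if __name__ == '__main__':
    semaphore = threading.BoundedSemaphore(3)
    threads = [threading.Thread(target=run, args=(i, semaphore))
               for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()           # block until each worker finishes
    print('all threads finished')
```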

Run the code; the results are shown below. With that, example No. 23 is done.


Collection time

Code repository address: codechina.csdn.net/hihell/pyth…. Give it a follow or a Star.

== Since you've come this far, how about a comment, a like, and a bookmark? ==

Today is day 203/365 of continuous writing. You can follow me, like my posts, comment, and bookmark.


Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011715514018.html
