Are your Python crawlers slow? Understand concurrent programming

2022-01-30 19:12:25 Dream, killer


A web crawler is an I/O-intensive program (page requests, file reads and writes). I/O blocks the program and consumes a lot of time. Python provides several ways to program concurrently, and they can improve the efficiency of I/O-intensive programs to a certain extent. Before you start, you need to understand the following concepts.

Basic concepts

Concurrency: several things happen within the same period of time. On a single-core CPU, multiple tasks run concurrently. Because there is only one core, the CPU divides a period of time into short time slices, and each task runs only during its own slice. If a task does not finish within its slice, the CPU switches to the next task. Because each slice is very short and switching is frequent, the tasks appear to run "at the same time".

Parallelism: several things happen at the same instant. On a multi-core CPU, tasks can truly run at the same time: while one core executes a process, other cores can execute other processes, and the processes do not compete with each other for a core.

Synchronous: tasks do not run independently. They execute in order, and a later task can start only after the previous task has finished.

Asynchronous: each task can run independently, and tasks do not have to wait for one another.

In a crawler, asynchronous is like opening a web page and, without waiting for it to finish loading, going on to open the next one. Synchronous is like opening a web page and waiting until it has fully loaded before opening the next one.
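The difference can be sketched with simulated page loads. This is only an illustration under stated assumptions: `fake_fetch` and the URLs are hypothetical, `time.sleep` stands in for the time a real page request would take, and a thread pool from `concurrent.futures` is just one of several ways to overlap the waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.2):
    """Hypothetical helper: sleep instead of doing real network I/O."""
    time.sleep(delay)
    return f"content of {url}"

# Hypothetical URLs; nothing is actually requested.
urls = [f"https://example.com/page/{n}" for n in range(5)]

# Synchronous style: open a page, wait for it, then open the next one.
start = time.perf_counter()
sync_pages = [fake_fetch(url) for url in urls]
sync_elapsed = time.perf_counter() - start

# Overlapping style: start every page load without waiting for the previous one.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    overlapped_pages = list(pool.map(fake_fetch, urls))
overlapped_elapsed = time.perf_counter() - start

print(f"one after another: {sync_elapsed:.2f}s")
print(f"overlapping waits: {overlapped_elapsed:.2f}s")
```

Both versions return the same pages. Only the total waiting time differs, because in the second version the waits overlap instead of adding up.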

There are three ways to speed up a crawler: multithreading, multiprocessing, and coroutines. First, what are processes, threads, and coroutines?

Process: a process is a program unit that can run independently. It is made up of one or more threads.

Thread: the smallest unit that the operating system schedules, and also the smallest unit of execution within a process.

Coroutine: a coroutine is an execution unit even smaller than a thread and is often described as a lightweight thread. Threads are scheduled by the operating system, whereas coroutines are scheduled in user space. Their advantage over threads is a lower switching cost.
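As a rough sketch of user-space scheduling, the following uses the standard `asyncio` module. The names `worker` and `events` are made up for this illustration. Each `await` hands control back to the event loop, which then resumes another coroutine, and everything happens inside a single OS thread.

```python
import asyncio
import threading

events = []  # records (coroutine name, step, thread name) in execution order

async def worker(name, steps):
    for step in range(steps):
        events.append((name, step, threading.current_thread().name))
        # Yield to the event loop so that another coroutine can run.
        # No operating-system thread switch is involved.
        await asyncio.sleep(0)

async def main():
    # Run both coroutines concurrently on the same event loop.
    await asyncio.gather(worker("A", 2), worker("B", 2))

asyncio.run(main())
for event in events:
    print(event)
```

The recorded steps of A and B are interleaved, and every entry carries the same thread name, which is the point: the switching is done by the program itself rather than by the operating system.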


GIL stands for Global Interpreter Lock. In Python (CPython) multithreading, each thread executes as follows:

Acquire the GIL -> execute the thread's code -> release the GIL

To run, a thread must first acquire the GIL. You can think of the GIL as a permit, and each Python process has only one. A thread can run only while it holds the permit, so even on a multi-core machine, only one thread in a Python process can execute Python code at any given moment.

For I/O-intensive tasks (page requests and so on) this is not a big problem, because a thread releases the GIL while it waits for I/O. For CPU-intensive tasks, however, the GIL means that multithreading may be less efficient overall than a single thread.
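A small experiment can make the CPU-intensive case concrete. The sketch below is an assumption-laden illustration, not a benchmark: `count_down` and `N` are made up, and the timings depend on the machine and the Python version, so no expected output is given. On a CPython build with the GIL, the two-thread version is usually no faster than the single-thread version.

```python
import threading
import time

def count_down(n):
    """Pure-Python CPU-bound loop (hypothetical workload)."""
    while n > 0:
        n -= 1

N = 2_000_000

# Do the work twice, one after the other, in the main thread.
start = time.perf_counter()
count_down(N)
count_down(N)
single_elapsed = time.perf_counter() - start

# Do the same amount of work in two threads.
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both threads before stopping the clock
threaded_elapsed = time.perf_counter() - start

print(f"single thread: {single_elapsed:.3f}s")
print(f"two threads:   {threaded_elapsed:.3f}s")
```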


Multithreading is suited to I/O-intensive programs, such as:

  • Database requests
  • Page requests
  • File reads and writes

Because of the GIL, only one thread is allowed to execute at a time. To give every thread a chance to complete its task, the interpreter has to switch between threads frequently.
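To get a feel for how frequent this switching is, CPython exposes the interval after which the running thread is asked to give up the GIL. The snippet below only reads that value with `sys.getswitchinterval()`. The value is an implementation detail and may differ between versions.

```python
import sys

# Interval, in seconds, after which CPython asks the running thread
# to release the GIL so that another thread can be scheduled.
interval = sys.getswitchinterval()
print(f"thread switch interval: {interval} seconds")
```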

Python implements multithreaded programming with the threading module. Each Thread object we create represents one thread, and each thread can handle a different task.

There are two ways to create a Thread object.

  • Pass a callback function as an argument and create a Thread object directly.
  • Create a new subclass of threading.Thread, override its run() method, and call start() on an instance to start a new thread.

Creating a Thread object directly

threading.Thread(group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None)

  • group: reserved for future extension; it should be left as None.
  • target: the callable object to be invoked by the run() method. Defaults to None, meaning that nothing is called.
  • name: the thread name. By default, a unique name of the form "Thread-N" is constructed, where N is a decimal number.
  • args: the tuple of positional arguments for the target call. Defaults to ().
  • kwargs: the dictionary of keyword arguments for the target call. Defaults to None.
  • daemon: whether the thread is a daemon thread. Defaults to None, which means the value is inherited from the creating thread. Threads created from MainThread (the main thread) are therefore non-daemon, and the program waits for them to finish before it exits.

import threading
import time

def block(second):
    print(threading.current_thread().name, 'is running')
    # Sleep for `second` seconds to simulate blocking I/O
    time.sleep(second)
    print(threading.current_thread().name, 'has ended')

print(threading.current_thread().name, 'is running')

for i in [1, 3]:
    # Create a Thread object, specifying the callback function block,
    # the thread name, and the positional argument i
    thread = threading.Thread(target=block, name=f'thread test {i}', args=[i])
    # Start the thread
    thread.start()

print(threading.current_thread().name, 'has ended')

threading.current_thread().name returns the name of the current thread. The logic of the code above is simple: define the function block, which prints the current thread's information, loop twice to create and start the Thread objects, and finally print the end message of the main thread. Pay attention to the order of the output: the main thread prints its end message before the threads test 1 and test 3 have finished.
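If the main thread should continue only after the workers are done, each thread can be joined. The sketch below is a variation on the example above, with made-up details: the `label` keyword argument and the `finished` list exist only for this illustration, and the delays are shortened. It also shows `args`, `kwargs` and `daemon` being passed explicitly.

```python
import threading
import time

finished = []  # filled in by the worker threads as they complete

def block(second, label="worker"):
    time.sleep(second)  # simulate blocking I/O
    finished.append(f"{label}:{threading.current_thread().name}")

threads = []
for i in [0.1, 0.3]:
    thread = threading.Thread(
        target=block,
        name=f"thread test {i}",
        args=(i,),                 # positional arguments for block
        kwargs={"label": "demo"},  # keyword arguments for block
        daemon=False,              # non-daemon: the program waits for it
    )
    thread.start()
    threads.append(thread)

# join() blocks the main thread until the given worker has finished.
for thread in threads:
    thread.join()

print(finished)
print(threading.current_thread().name, "ends after the workers")
```

With `join()`, the last line is printed only after both workers have appended their entries to `finished`.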

Subclassing Thread

Now modify the example above directly, using a custom class that inherits from Thread to implement multithreading.

import threading
import time

class TestThread(threading.Thread):
    def __init__(self, name=None, second=0):
        super().__init__(name=name)
        self.second = second

    def run(self):
        # run() is executed in the new thread once start() is called
        print(threading.current_thread().name, 'is running')
        # Sleep for self.second seconds to simulate blocking I/O
        time.sleep(self.second)
        print(threading.current_thread().name, 'has ended')

print(threading.current_thread().name, 'is running')

for i in [1, 3]:
    thread = TestThread(name=f'thread test {i}', second=i)
    # Start the thread
    thread.start()

print(threading.current_thread().name, 'has ended')
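One practical reason to subclass Thread is that the instance can carry its own input and result. The following is a minimal sketch under stated assumptions: the `FetchThread` class, its `result` attribute and the URLs are hypothetical, and `time.sleep` replaces a real page request.

```python
import threading
import time

class FetchThread(threading.Thread):
    """Hypothetical worker that 'fetches' one URL and keeps the result."""

    def __init__(self, url, delay=0.1):
        super().__init__(name=f"fetch {url}")
        self.url = url
        self.delay = delay
        self.result = None

    def run(self):
        time.sleep(self.delay)  # stand-in for a real page request
        self.result = f"content of {self.url}"

# Hypothetical URLs; nothing is actually requested.
urls = ["https://example.com/a", "https://example.com/b"]
workers = [FetchThread(url) for url in urls]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()  # wait so that .result is set before it is read

results = [worker.result for worker in workers]
print(results)
```

Reading `result` only after `join()` avoids seeing the initial `None` while the thread is still running.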

This article is only a simple beginning. Follow-up articles will continue the series until we have mastered concurrent crawlers in Python.

If you are new to Python or want to get started with it, you can search for "Python New horizons" on WeChat to contact the author and learn together. We all started as novices. Sometimes a simple problem keeps you stuck for a long time, and a word of advice from someone else suddenly makes it clear. I sincerely hope we can make progress together.

Copyright notice
Author: Dream, killer. Please include the original link when reprinting. Thank you.