Are your Python crawlers slow? Understand concurrent programming
2022-01-30 19:12:25 【Dream, killer】
Preface
A web crawler is an IO-intensive program (page requests, file reads and writes). These IO operations block the program and consume a lot of time. Python provides several concurrent programming approaches that can, to a certain extent, improve the efficiency of IO-intensive programs. Before we start, you need to understand the following concepts.
Basic concepts
Concurrency: several things happen within the same period of time. On a single-core CPU, multiple tasks run concurrently: because there is only one core, the CPU divides time into short slices, and each task runs only during its own slice. If a task does not finish within its slice, the CPU switches to the next task. Because each slice is very short and the switching is frequent, the tasks appear to run "at the same time".
Parallelism: several things happen at the same instant. On a multi-core CPU, tasks can truly run at the same time: while one core executes one process, the other cores can execute other processes, and the processes do not compete for the same CPU resources.
Synchronous: tasks do not run independently. They execute in a fixed order, and a task can start only after the previous one has finished.
Asynchronous: tasks can run independently and do not have to wait for each other.
In a crawler, asynchronous means opening a web page and, without waiting for it to finish loading, going on to open a new one. Synchronous means opening a web page and waiting until it is fully loaded before opening the next one.
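The difference between the two modes can be sketched with the standard library's asyncio. This is only an illustration under assumptions: open_page, the page names, and the load times are hypothetical, and asyncio.sleep stands in for the real network wait.

```python
import asyncio


async def open_page(name, load_time, log):
    """Pretend to load a page; asyncio.sleep stands in for the network wait."""
    log.append(f"start {name}")
    await asyncio.sleep(load_time)
    log.append(f"done {name}")


async def open_synchronously(pages):
    """Open the pages one by one, waiting for each to finish loading."""
    log = []
    for name, load_time in pages:
        await open_page(name, load_time, log)
    return log


async def open_asynchronously(pages):
    """Open every page right away and wait for all of them together."""
    log = []
    await asyncio.gather(*(open_page(name, t, log) for name, t in pages))
    return log


# Hypothetical pages: (name, simulated load time in seconds)
PAGES = [("page-1", 0.2), ("page-2", 0.1)]

sync_log = asyncio.run(open_synchronously(PAGES))
async_log = asyncio.run(open_asynchronously(PAGES))
print(sync_log)   # each page finishes before the next one starts
print(async_log)  # both pages start before either one finishes
```

In the synchronous log every "start" is immediately followed by its own "done"; in the asynchronous log both pages start first, and the faster page finishes first.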
There are three ways to speed up a crawler: multithreading, multiprocessing, and coroutines. Let's first look at what processes, threads, and coroutines are.
Process: a process is a program unit that can run independently. It is a collection of threads and consists of one or more threads.
Thread: a thread is the smallest unit of scheduling in the operating system, and also the smallest unit of execution within a process.
Coroutine: a coroutine is an execution unit even smaller than a thread; it can be thought of as a lightweight thread. Threads are scheduled by the operating system, whereas coroutines are scheduled in user space. Their advantage over threads is a lower switching cost.
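To make "a process consists of one or more threads" concrete, here is a small sketch. The names report and NUM_THREADS are chosen for illustration only. Every thread reports the same process ID but its own thread ID.

```python
import os
import threading


def report(barrier, seen):
    # Wait until every thread is alive, so that no thread id can be reused.
    barrier.wait()
    seen.append((os.getpid(), threading.get_ident()))


NUM_THREADS = 3
barrier = threading.Barrier(NUM_THREADS)
seen = []

threads = [
    threading.Thread(target=report, args=(barrier, seen))
    for _ in range(NUM_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

process_ids = {pid for pid, _ in seen}
thread_ids = {tid for _, tid in seen}
print("distinct process ids:", len(process_ids))
print("distinct thread ids:", len(thread_ids))
```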
GIL
GIL stands for Global Interpreter Lock. In Python (CPython) multithreading, each thread executes as follows:
Acquire the GIL >>> execute the thread's code >>> release the GIL
A thread that wants to execute must first acquire the GIL. You can think of the GIL as a license, and each Python process has only one. A thread can run only while it holds the license, which means that even on a multi-core machine, only one of the threads in a Python process can execute at any given moment.
For IO-intensive tasks (page requests and so on) this is not a big problem, because a thread releases the GIL while it waits on IO. For CPU-intensive tasks, however, the GIL means that multithreading may be less efficient overall than a single thread.
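A rough way to see why the GIL matters little for IO-intensive work is to time a simulated request with and without threads. This is a sketch under assumptions: fake_request is a hypothetical stand-in for a page request, and time.sleep plays the role of the IO wait (it releases the GIL while sleeping).

```python
import threading
import time


def fake_request(delay):
    # Sleeping releases the GIL, much like waiting for a network response.
    time.sleep(delay)


def run_sequential(count, delay):
    """Issue the simulated requests one after another."""
    start = time.perf_counter()
    for _ in range(count):
        fake_request(delay)
    return time.perf_counter() - start


def run_threaded(count, delay):
    """Issue the simulated requests in separate threads."""
    start = time.perf_counter()
    threads = [
        threading.Thread(target=fake_request, args=(delay,))
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the longest single wait. A CPU-intensive function would not see this benefit, because its threads would have to take turns holding the GIL.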
Multithreading
Multithreading is suited to IO-intensive programs, such as:
- Database request
- Page request
- Read and write files
Because of the GIL, only one thread can execute at any given moment. To make sure that every thread gets to complete its task, the interpreter therefore has to switch between threads frequently.
In Python, multithreaded programming is done with the threading module. Each Thread object we create represents one thread, and each thread can handle a different task.
There are 2 ways to create a Thread object.
- Pass a callable as an argument and create the Thread object directly.
- Create a subclass of threading.Thread, override its run() method, then instantiate it and call its start() method to start the new thread.
Creating a Thread object
threading.Thread(group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None)
- group: should be None; it is reserved for future extension.
- target: the callable object to be invoked by the run() method. The default is None, meaning that nothing is called.
- name: the thread name. By default, a unique name of the form "Thread-N" is constructed, where N is a decimal number.
- args: the tuple of positional arguments passed to target. The default is ().
- kwargs: the dictionary of keyword arguments passed to target. The default is None.
- daemon: whether the thread is a daemon thread. By default, MainThread (the main thread) waits for the other threads to end before the program exits; daemon threads are not waited for. The default is None, in which case the thread inherits the daemon setting of the thread that creates it.
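As a quick illustration of how these parameters fit together, here is a minimal sketch. The fetch function, its retries and timeout arguments, and the URL are hypothetical and only serve to show where args and kwargs end up.

```python
import threading

results = {}


def fetch(url, retries=1, timeout=5):
    # Hypothetical worker: record which thread ran it and what it received.
    results[url] = (threading.current_thread().name, retries, timeout)


worker = threading.Thread(
    target=fetch,                        # callable invoked by run()
    name="fetch-worker",                 # replaces the default "Thread-N"
    args=("https://example.com/page",),  # positional arguments for target
    kwargs={"retries": 3},               # keyword arguments for target
    daemon=True,                         # do not block interpreter exit
)
worker.start()
worker.join()  # wait explicitly, since daemon threads are not waited for

print(results)
```

Note the trailing comma in args: a single positional argument still has to be passed as a one-element tuple.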
import threading
import time


def block(second):
    print(threading.current_thread().name, 'is running')
    # Sleep for `second` seconds
    time.sleep(second)
    print(threading.current_thread().name, 'has ended')


print(threading.current_thread().name, 'is running')
for i in [1, 3]:
    # Create a Thread object, specifying the callback function block,
    # the thread name, and the positional argument i
    thread = threading.Thread(target=block, name=f'thread test {i}', args=(i,))
    # Start the thread
    thread.start()
print(threading.current_thread().name, 'has ended')
threading.current_thread().name returns the name of the current thread. A quick walk through the logic of the code above: it first defines the function block, which prints the current thread's information. The loop then runs twice, each time creating a Thread object and starting it. Finally, the main thread prints its end message. Pay attention to the order of the output: the main thread prints its end message before thread test 1 and thread test 3 have finished.
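If the main thread should instead wait for its workers, Thread.join() can be called on each of them. The following variation on the example above is a sketch, not part of the original code: it uses shorter sleeps and collects messages in a list named events (an assumption made for illustration) rather than printing from each thread.

```python
import threading
import time

events = []


def block(second):
    name = threading.current_thread().name
    events.append(f"{name} is running")
    time.sleep(second)
    events.append(f"{name} has ended")


threads = []
for i in [1, 3]:
    # Sleep times are scaled down to tenths of a second to keep the run short.
    thread = threading.Thread(target=block, name=f"thread test {i}", args=(i / 10,))
    thread.start()
    threads.append(thread)

# join() blocks the main thread until the given thread has finished.
for thread in threads:
    thread.join()

events.append(f"{threading.current_thread().name} has ended")
print("\n".join(events))
```

With the join() calls in place, the main thread's end message is always the last one recorded.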
Subclassing Thread
Now let's modify the example above and implement multithreading with a custom class that inherits from Thread.
import threading
import time


class TestThread(threading.Thread):
    def __init__(self, name=None, second=0):
        # Initialize the base Thread class first
        super().__init__(name=name)
        self.second = second

    def run(self):
        # run() is executed in the new thread once start() is called
        print(threading.current_thread().name, 'is running')
        time.sleep(self.second)
        print(threading.current_thread().name, 'has ended')


print(threading.current_thread().name, 'is running')
for i in [1, 3]:
    thread = TestThread(name=f'thread test {i}', second=i)
    # Start the thread
    thread.start()
print(threading.current_thread().name, 'has ended')
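One practical reason to subclass Thread is that the instance can carry state, for example a result computed in run(). The sketch below assumes a hypothetical SquareThread class and a result attribute; neither is part of the threading module.

```python
import threading
import time


class SquareThread(threading.Thread):
    """Hypothetical worker that stores its result on the instance."""

    def __init__(self, number):
        super().__init__(name=f"square-{number}")
        self.number = number
        self.result = None  # filled in by run()

    def run(self):
        time.sleep(0.1)  # stand-in for a slow IO operation
        self.result = self.number ** 2


threads = [SquareThread(n) for n in (2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # results are only safe to read after the thread has finished

results = [t.result for t in threads]
print(results)
```

Reading result before join() returns could still yield None, which is why the values are collected only after every thread has been joined.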
This article is only a simple beginning. Follow-up articles will continue the series until we have mastered concurrent crawlers in Python.
If you are new to Python or want to get started with it, you can search WeChat for "Python New horizons" to contact the author and learn together. We all started out as novices. Sometimes a simple problem keeps you stuck for a long time, and a hint from someone else suddenly makes everything clear. I sincerely hope we can all make progress together.
Copyright notice
Author: [Dream, killer]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301912238997.html