Python crawler native code learning (I)
2022-01-30 21:41:45 【Cooper_ wwj】
Statement: This article is my original work. For quotation or commercial use, please contact me; otherwise I reserve the right to pursue legal responsibility.
Introduction
Crawler technology has become increasingly mature. To improve crawling efficiency and shorten crawl time, frameworks such as "scrapy" and "PySpider" have appeared, along with excellent integrated software such as "Octopus".
For learners, these excellent frameworks have a drawback: they hide the underlying architecture. A learner who only knows how to use them does not see what happens underneath, cannot tell where the problem is when something goes wrong, and ends up stuck and frustrated for a long time.
So, what does native code look like?
With this question in mind, I experimented on my own and worked out a multi-threaded way to crawl an entire site. Comments and advice are welcome.
Software configuration environment
A good hunter usually has a good gun in hand, and the same goes for programmers. Here we use the Python 3.9.2 interpreter and the PyCharm IDE.
Learning process
We divide the basic crawler workflow into five parts:
(1) Page analysis: analyze the target page with the browser's developer tools and decide how to write the crawler.
(2) Sending a request: use the urllib library or the requests library to send a request to the target page and get the response data.
(3) Parsing the response data and extracting content: use the re library, the xpath library, or the BeautifulSoup library to extract the required content.
(4) Storing content: use the sqlite3 library, an Excel worksheet, a Word document, or a MySQL database to store the content permanently.
(5) Optimizing the process: 1. library selection; 2. enabling multithreading; 3. deduplication.
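As a rough illustration of steps (2) to (5), here is a minimal sketch that uses only the standard library: urllib for the request, re for parsing, and sqlite3 for storage. It is not the author's code. The function names (fetch, parse, store, crawl), the LINK_RE pattern, the links table, and the links.db file name are all assumptions made for this example, and ThreadPoolExecutor is just one possible way to enable multithreading.

```python
import re
import sqlite3
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern: captures the href and inner text of every <a> tag.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def fetch(url, timeout=10):
    """Step 2: send a request and return the decoded response body."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse(html):
    """Step 3: extract (href, text) pairs with the re library."""
    return [
        (href, re.sub(r"<[^>]+>", "", text).strip())  # drop nested tags
        for href, text in LINK_RE.findall(html)
    ]


def store(conn, rows):
    """Step 4: persist rows; PRIMARY KEY + INSERT OR IGNORE skips duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO links VALUES (?, ?)", rows)
    conn.commit()


def crawl(urls, db_path="links.db", workers=4):
    """Step 5: deduplicate the URL list, then fetch pages in a thread pool."""
    unique_urls = list(dict.fromkeys(urls))  # order-preserving deduplication
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, unique_urls))
    # Write from a single thread so the sqlite3 connection is never shared.
    conn = sqlite3.connect(db_path)
    try:
        for html in pages:
            store(conn, parse(html))
    finally:
        conn.close()
```

Calling crawl() with a list of page URLs would fetch them concurrently and save every link found into links.db. A regular expression is enough for a sketch like this, but for real pages an HTML parser such as BeautifulSoup is usually more robust.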
Flow chart
[Figure: flow chart of the crawler process]
Copyright notice
Author: [Cooper_ wwj]. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201302141430458.html