Python in practice | Get started with web crawlers in just 4 steps
2022-01-30 06:25:56 【Mengy7762 Mengya】
A web crawler fetches data from the network through a URL, parses that data according to a target, and stores the target information. An automated program can carry out the whole process, and it behaves much like a spider: the Internet is the web, each page is a point on it, and the spider crawls from one page to the next by following links.
Web crawlers are also one way to acquire data. In the big data industry the value of data is self-evident, and in this age of information explosion there is a huge amount of information on the Internet. For small, medium, and micro companies, making sensible use of crawlers to collect valuable data is one of the few practical ways to make up for their inherent shortage of data.
Based on the above, a web crawler can be divided into four steps:
- Get web data
- Parse web data
- Store web page data
- Analyze web data
Step 1: Get web data
Getting web data means fetching data from the network through a URL (Uniform Resource Locator), much as a search engine does. When we enter a URL, we send a request to the website's server. The server receives the request, processes it, and returns a response. If the network is working and the website is up, we usually get the page content back; otherwise we get an error status code such as 404. The whole process is called request and response.
There are two common request methods, GET and POST. A GET request carries its parameters in the URL. For example, searching for "crawler" on Baidu sends a GET request to www.baidu.com/s?wd=crawler. POST requests are mostly used for forms, such as one that asks for a user name and password. Their parameters travel in the request body rather than in the URL, so they do not show up in the address bar or the browser history, although they are only protected in transit when HTTPS is used. The HTTP specification does not limit the size of either kind of request, but browsers and servers cap URL length in practice (the often-quoted 1024-byte figure for GET is one such implementation limit), so large payloads are sent with POST.
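To make the difference between GET and POST concrete, here is a minimal sketch that only builds the two requests with the standard library and does not send them. The login URL and the form field names are hypothetical placeholders, not part of the original article.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the query string of the URL.
query = urlencode({"wd": "crawler"})
get_request = Request("https://www.baidu.com/s?" + query)

# POST: the parameters are encoded into the request body instead.
# The URL and field names below are hypothetical, for illustration only.
form = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")
post_request = Request("https://example.com/login", data=form)

# Nothing is sent over the network here; we only inspect the requests.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that the POST parameters never appear in the URL, while the GET parameters are part of it.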
In a Python program, this process amounts to fetching the source code of a page and then extracting data from it. First, here is how to view the source of a page. In Google Chrome, right-click the page you want to crawl and choose Inspect, then open the Network tab and click the first entry, www.baidu.com, to see the request details and the source.
The details of that entry are split into three parts. The first part, General, holds basic information about the request, such as the status code 200. The second part, Response Headers, holds the information the server sends back with the response, such as Set-Cookie and Server. The third part, Request Headers, holds the information sent for the server to use, such as Cookie and User-Agent.
In Python, fetching this page source only takes a library such as urllib or requests. One note: requests is more convenient and quicker to use than urllib, and once you have learned it you are unlikely to go back.
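As a rough illustration of the request and response cycle, the sketch below fetches a page with the standard-library urllib. The helper name fetch_html and the User-Agent string are assumptions made for this example, not something the original article defines.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, or None if the request fails."""
    # Some sites reject requests without a browser-like User-Agent header.
    # This header value is an arbitrary example.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial-crawler)"})
    try:
        with urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        # The server answered with an error status such as 404.
        print("HTTP error:", err.code)
    except URLError as err:
        # The request never got a response (DNS failure, refused connection...).
        print("Network error:", err.reason)
    return None


# Example call (performs a real network request when uncommented):
# html = fetch_html("https://www.baidu.com")
```

With the third-party requests library installed, a similar fetch can be written as requests.get(url, headers=headers, timeout=10).text.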
Step 2: Parse web data
In the first step we obtained the page source, which is our raw data. Next we parse it to extract the parts we need for analysis. Common methods include regular expressions and XPath.
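The two approaches can be sketched with the standard library alone. The page snippet below is made up for the example, and it is deliberately well-formed, because xml.etree.ElementTree only supports a limited subset of XPath and cannot read the sloppy HTML found on many real sites (lxml is the usual choice there).

```python
import re
import xml.etree.ElementTree as ET

# A made-up, well-formed page used only for this illustration.
page = """<html><body>
<h1>Book list</h1>
<ul>
  <li><a href="/book/1">Python Basics</a></li>
  <li><a href="/book/2">Web Crawling</a></li>
</ul>
</body></html>"""

# Regular expression: capture the value of every href attribute.
links = re.findall(r'href="([^"]+)"', page)

# XPath-style queries: select nodes by their position in the tag tree.
root = ET.fromstring(page)
titles = [a.text for a in root.findall(".//li/a")]
heading = root.findtext("./body/h1")

print(links)
print(titles)
print(heading)
```

Regular expressions are quick for simple patterns but ignore the document structure, while path queries follow the structure and tend to survive small changes in the markup.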
In Python, libraries such as Beautiful Soup, pyquery, and lxml are widely used. They extract page information efficiently, such as a node's attributes and text values.
Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree" that corresponds to the entire content of an HTML/XML document. It is simple to install with pip, where the package is published as beautifulsoup4.
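Here is a small sketch of walking the tag tree with Beautiful Soup. It assumes the beautifulsoup4 package is installed, and the HTML snippet is invented for the example.

```python
from bs4 import BeautifulSoup

# A made-up page used only for this illustration.
html = """<html><head><title>Book list</title></head><body>
<ul>
  <li><a class="book" href="/book/1">Python Basics</a></li>
  <li><a class="book" href="/book/2">Web Crawling</a></li>
</ul>
</body></html>"""

# "html.parser" is the parser bundled with Python; lxml is a faster option.
soup = BeautifulSoup(html, "html.parser")

# Text value of a node.
page_title = soup.title.get_text()

# Attributes and text of every <a> node in the tree.
books = [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a")]

print(page_title)
print(books)
```

find_all also accepts attribute filters, for example soup.find_all("a", class_="book"), which helps when a page mixes several kinds of links.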
Step 3: Store web page data
After parsing the data, you can save it. If there is not much of it, consider saving it as a txt, csv, or json file. If the number of crawled records is large, consider storing it in a database, which means learning how to use MySQL, MongoDB, or SQLite. Going further, you can study database query optimization.
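Both storage options can be sketched with the standard library. The records, table name, and column names below are invented for the example, and the database lives in memory so the sketch leaves no files behind.

```python
import csv
import io
import sqlite3

# Made-up records standing in for parsed crawl results.
records = [
    {"title": "Python Basics", "price": 10.0},
    {"title": "Web Crawling", "price": 20.0},
]

# Small result sets: write CSV. For a real file, replace the buffer with
# open("books.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Larger result sets: store rows in a database. ":memory:" keeps this
# sketch self-contained; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books (title, price) VALUES (:title, :price)", records
)
conn.commit()
rows = conn.execute(
    "SELECT title, price FROM books ORDER BY price DESC"
).fetchall()
conn.close()

print(csv_text)
print(rows)
```

The named placeholders let sqlite3 escape the values itself, which is safer than building SQL strings from crawled text.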
JSON (JavaScript Object Notation) is a lightweight data exchange format based on a subset of ECMAScript. It uses a text format that is completely language independent, but follows conventions familiar from the C family of languages (including C, C++, Java, JavaScript, Perl, and Python). These characteristics make JSON an ideal data exchange language. It is easy for people to read and write, easy for machines to parse and generate, and compact enough to keep network payloads small.
In Python, JSON data is represented with list and dict. The official documentation for Python's json module is at docs.python.org/3/library/j…
The json module converts between the two forms with json.dumps and json.loads.
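A minimal round trip with the json module might look like this. The records and the file name books.json are made up for the example.

```python
import json
import os
import tempfile

# Made-up records: JSON arrays map to list, JSON objects map to dict.
records = [{"title": "Crawler café", "tags": ["python", "crawler"], "price": 10.0}]

# Python objects -> JSON text. ensure_ascii=False keeps non-ASCII
# characters readable instead of escaping them as \uXXXX.
text = json.dumps(records, ensure_ascii=False, indent=2)

# JSON text -> Python objects.
restored = json.loads(text)

# The same conversion straight to and from a file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "books.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(text)
```

dumps and loads work on strings, while dump and load work on file objects.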
Step 4: Analyze web data
The purpose of a crawler is to analyze web data and reach the conclusions we are after. In Python, we can analyze the data saved in step 3 directly. The three main libraries are NumPy, Pandas, and Matplotlib:
- NumPy: the fundamental package for high-performance scientific computing and data analysis.
- Pandas: a tool built on NumPy and created for data analysis tasks. It is often regarded as a must-have tool.
- Matplotlib: the best-known plotting library in Python. It can draw scatter plots, line charts, bar charts, histograms, pie charts, box plots, and more.
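As a small sketch of how these libraries fit together, the example below summarizes a few made-up records. It assumes numpy and pandas are installed; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records standing in for data saved in step 3.
records = [
    {"category": "books", "price": 10.0},
    {"category": "books", "price": 20.0},
    {"category": "music", "price": 30.0},
]

# Pandas: load the records into a table and aggregate by group.
df = pd.DataFrame(records)
mean_by_category = df.groupby("category")["price"].mean()

# NumPy: numeric work on the underlying array.
overall_mean = float(np.mean(df["price"].to_numpy()))

print(mean_by_category)
print(df["price"].describe())
print(overall_mean)

# Matplotlib: with matplotlib installed, the grouped result can be drawn
# as a bar chart with mean_by_category.plot(kind="bar").
```

The plotting call is left as a comment so the sketch runs without a display.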
copyright notice
Author: Mengy7762 Mengya. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300625533449.html