
Python in Practice | Just Four Steps to Get Started with Web Crawlers

2022-01-30 06:25:56 Mengy7762 Mengya

A web crawler obtains data from the network through websites, parses that data according to a target, and stores the target information. The whole process can be carried out by an automated program that acts like a spider: if the web pages are strands of a spider web, the spider can crawl along them from one page to the next.

Web crawling is also a way to acquire data. For the big data industry, the value of data is self-evident. In this age of information explosion there is an enormous amount of information on the Internet, and for small and medium-sized companies, making sensible use of crawlers to collect valuable data is often the only practical way to make up for an inherent shortage of data.

Based on the above, we can divide the work of a web crawler into four steps:

  • Get web data
  • Parse web data
  • Store web page data
  • Analyze web data

Step 1: Get web data

Getting web data means using a URL (Uniform Resource Locator) to fetch data from the network, much as a browser or search engine does. When you enter a URL, you are effectively sending a request to the website's server; the server receives it, processes and parses it, and then returns a response. If the network is working and the site is healthy, you generally get the page content back; otherwise the server returns an error code, such as 404. The whole exchange is called request and response.

There are two common request methods, GET and POST. A GET request carries its parameters inside the URL itself: searching for "crawler" on Baidu, for example, issues a GET request to a link like www.baidu.com/s?wd=crawler. A POST request usually comes from a form: you enter a user name and password, and the data does not show up in the URL, which is safer. A POST request's body has no fixed size limit, while a GET request is constrained by the maximum URL length browsers and servers accept (often cited as roughly 1024 to 2048 bytes, depending on the software).
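As a minimal sketch, the two request types look like this with the requests library; the URLs and form fields here are illustrative examples only:

```python
import requests

# GET: the parameters ride along in the URL as a query string
resp = requests.get("https://www.baidu.com/s", params={"wd": "crawler"})
print(resp.url)          # the final URL, with ?wd=... appended
print(resp.status_code)  # 200 on success

# POST: the data travels in the request body, not in the URL
resp = requests.post("https://httpbin.org/post",
                     data={"username": "demo", "password": "secret"})
print(resp.status_code)
```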

In a Python program, the process above comes down to fetching the page's source code and then extracting data from it. First, let's look at how to view a site's source: in the Google Chrome browser, right-click the page and choose Inspect to open the developer tools for the URL you want to crawl. In the Network tab, click the first entry, i.e. www.baidu.com, to see the request details and the returned source code.

In the panel that opens, the first part is General, which holds basic information about the exchange, such as the status code (e.g. 200). The second part is Response Headers, the response information the server sent back, including fields such as Set-Cookie and Server. The third part is Request Headers, the additional information your client sends for the server to use, such as Cookie and User-Agent.

In Python, fetching the page source can be done with libraries such as urllib or requests, as shown below. One special note: requests is more convenient and faster than urllib; once you learn the requests library, you will hardly want to put it down.

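Here is a minimal sketch of both approaches, assuming a simple public page; real crawlers often need extra headers such as a User-Agent:

```python
import urllib.request

import requests

url = "https://www.baidu.com"

# Using the standard-library urllib
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")
print(html[:200])

# Using the third-party requests library (pip install requests)
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.encoding = "utf-8"
print(resp.text[:200])
```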

Step 2: Parse web data

In the first step we obtained the page's source code; that is our raw data. The next task is to parse out the pieces we need for analysis. There are many common approaches, such as regular expressions and XPath.

In Python we usually reach for libraries such as Beautiful Soup, pyquery, or lxml, which can efficiently extract information from a page, such as a node's attributes or its text.

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" that corresponds to the entire content of an HTML/XML document. Installation is very simple, as follows:

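A minimal sketch, from installation to basic use; the HTML snippet below is a made-up example:

```python
# Install first:
#   pip install beautifulsoup4
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="title">Hello, crawler</h1>
  <a href="https://example.com/page1">link 1</a>
  <a href="https://example.com/page2">link 2</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # build the tag tree
print(soup.h1.get_text())                  # text of a node  -> Hello, crawler
print(soup.h1["class"])                    # attribute of a node -> ['title']
for a in soup.find_all("a"):               # traverse all <a> tags
    print(a["href"])
```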

Step 3: Store web page data

Once the data is parsed, it can be saved. If there is not much of it, consider saving it as txt, csv, or JSON text; if the number of crawled records is large, consider storing them in a database, which means learning the basics of MySQL, MongoDB, or SQLite (see the sketch below). Going deeper, you can study database query optimization.
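As a small illustration, here is a hedged sketch of saving crawled records with the standard-library sqlite3 module; the table name and fields are invented for the example:

```python
import sqlite3

# Pretend these rows came out of the parsing step
rows = [("Python tutorial", "https://example.com/1"),
        ("Crawler basics", "https://example.com/2")]

conn = sqlite3.connect("crawl.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                    title TEXT,
                    url   TEXT
                )""")
conn.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", rows)
conn.commit()

# Read the records back to verify
for row in conn.execute("SELECT title, url FROM pages"):
    print(row)
conn.close()
```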

JSON (JavaScript Object Notation) is a lightweight data-interchange format based on a subset of ECMAScript. JSON uses a text format that is completely language-independent, but it also follows conventions familiar from the C family of languages (including C, C++, Java, JavaScript, Perl, Python, and others). These characteristics make JSON an ideal data-interchange language: easy for humans to read and write, and easy for machines to parse and generate (it is also commonly used to keep network payloads small).

In Python, JSON maps onto lists and dicts. The official documentation for Python's json module is at docs.python.org/3/library/j…

Basic usage looks like this:

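A minimal sketch of the json module, showing how lists and dicts map to JSON arrays and objects; the records below are made up for illustration:

```python
import json

data = [{"title": "Python tutorial", "url": "https://example.com/1"},
        {"title": "Crawler basics",  "url": "https://example.com/2"}]

# Serialize to a JSON string and write it to a file
text = json.dumps(data, ensure_ascii=False, indent=2)
with open("pages.json", "w", encoding="utf-8") as f:
    f.write(text)

# Deserialize the file back into Python objects
with open("pages.json", encoding="utf-8") as f:
    restored = json.load(f)
print(restored[0]["title"])
```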

Step 4: Analyze web data

The purpose of a crawler is to analyze the data it gathers and reach the conclusions we care about. For data analysis in Python, we can work directly with the data saved in step 3. The main libraries are NumPy, Pandas, and Matplotlib (see the sketch after the list below):

  • NumPy: the fundamental package for high-performance scientific computing and data analysis.
  • Pandas: a tool built on top of NumPy, created specifically for data-analysis tasks; it can feel almost like cheating.
  • Matplotlib: Python's best-known plotting system. It can produce scatter plots, line charts, bar charts, histograms, pie charts, box plots, and more.
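A minimal sketch of the three libraries working together; the keyword counts are made-up sample data standing in for results saved in step 3:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Pretend these counts were extracted from crawled pages
df = pd.DataFrame({"keyword": ["python", "crawler", "pandas"],
                   "mentions": [120, 75, 40]})

print(df.describe())                      # quick summary statistics
print("mean:", np.mean(df["mentions"]))   # NumPy works on the same data

# A simple bar chart of the results
df.plot(kind="bar", x="keyword", y="mentions", legend=False)
plt.ylabel("mentions")
plt.tight_layout()
plt.show()
```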

Copyright notice
Author: Mengy7762 Mengya. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300625533449.html
