
Python series -- web crawler

2022-02-01 11:44:31 ALKAOUA

Web crawlers in plain terms

As the name suggests, a web crawler works like this: you release a lot of crawlers onto the Internet, they fetch data and bring it back, and then you consolidate that data and store it.

The basic steps of web crawler

Taking Python as an example:

  1. Find the page whose content you want to crawl.
  2. Open the page's inspection view (press F12 to see the HTML code).
  3. Find the data you want to extract in the HTML code.
  4. Write Python code to request and parse the page.
  5. Store the data.
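
The five steps above can be sketched end to end. This is a minimal offline sketch under stated assumptions, not the article's code: the sample HTML string, the class names `title` and `views`, and the helper names `parse_page` and `store_rows` are made up for illustration, and the standard library's `xml.etree.ElementTree` (which supports a small XPath subset) and `csv` stand in for the requests and lxml libraries used later.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Steps 1-3: a hypothetical page that has already been fetched.
# In practice this text would come from an HTTP request.
SAMPLE_HTML = (
    "<html><body>"
    "<h1 class='title'>Example question</h1>"
    "<span class='views'>1024</span>"
    "</body></html>"
)

def parse_page(html):
    """Step 4: extract the fields we care about with XPath-style queries."""
    root = ET.fromstring(html)
    title = root.find(".//*[@class='title']").text
    views = root.find(".//*[@class='views']").text
    return {"title": title, "views": int(views)}

def store_rows(rows, out):
    """Step 5: write the extracted records as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=["title", "views"])
    writer.writeheader()
    writer.writerows(rows)

record = parse_page(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open("result.csv", "w", newline="")
store_rows([record], buffer)
print(buffer.getvalue())
```

Keeping parsing and storage in separate functions means the fetching step can be swapped in later without touching either of them.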

An example

Goal: crawl the data for the Zhihu question "How did you first learn to write Python crawlers?" (question title, description, follower count, view count).


In fact, the Python knowledge required here is limited to the basics you need: data types, variables, operators, functions, modules, and other simple syntax. Recommended learning sites: "Python Getting Started" and "Python Study - Liao Xuefeng's official website".

1. Fetching the HTML source code

The library used here for web page requests is requests, a very popular HTTP request library.

The Requests library automatically decodes content from the server; most Unicode character sets can be decoded seamlessly.

Requests handles all of this properly.
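
To make "automatic decoding" concrete, here is a rough sketch of the idea: the server sends bytes plus a `Content-Type` header, and the client turns them into text using the charset named there. This is not the Requests implementation. The helper `decode_body` and its `fallback` parameter are hypothetical, and only the standard library is used. In Requests itself the guessed encoding is exposed as `r.encoding`, which you can set by hand before reading `r.text` if the guess is wrong.

```python
from email.message import Message

def decode_body(raw, content_type, fallback="utf-8"):
    """Decode response bytes using the charset named in a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset
    charset = msg.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")

original = "知乎"
from_utf8 = decode_body(original.encode("utf-8"), "text/html; charset=utf-8")
from_gbk = decode_body(original.encode("gbk"), "text/html; charset=GBK")
no_charset = decode_body(original.encode("utf-8"), "text/html")  # fallback is used

print(from_utf8 == original, from_gbk == original, no_charset == original)
```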

Code:

import requests
# Replace the placeholder with your own browser's User-Agent string
headers = {'User-Agent': 'Your browser headers'}
# Pass in the URL and the request headers
r = requests.get('https://www.zhihu.com/question/21358581', headers=headers)
# The content of the response
print(r.text)

The server returns the page, and after requests decodes it, it looks like this:

This is the HTML source code we need!

The next step is to extract the four pieces of information we need from the HTML.

XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document.
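
As a small illustration of traversing elements and attributes, the sketch below runs a few path queries on a made-up, well-formed document. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset (for example, it has no `/text()` step, so the text is read from the matched element instead); the document contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed, made-up document.
doc = ET.fromstring(
    "<library>"
    "<book id='b1' lang='en'><title>Crawling 101</title></book>"
    "<book id='b2' lang='zh'><title>Parsing 102</title></book>"
    "</library>"
)

# Select elements by path: every <title> anywhere under the root.
titles = [t.text for t in doc.findall(".//title")]

# Filter by attribute value: the <book> whose lang attribute is 'zh'.
zh_book = doc.find(".//book[@lang='zh']")

# Read an attribute from the matched element.
zh_id = zh_book.get("id")

print(titles, zh_id)
```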

Here we use another handy library, lxml, which makes it easy to search for information with the XPath language.

XPath only works on XML documents, but what we just got is merely an HTML text string, so it has to be converted first.

Continuing the code above:

from lxml import etree
# Convert the HTML text into a document tree that XPath can query
s = etree.HTML(r.text)
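
Why the conversion step matters: real-world HTML is often not well-formed XML, so a strict XML parser rejects it, while an HTML-aware parser such as the one behind `etree.HTML` tolerates it. The sketch below shows that contrast with the standard library only; the snippet and the `TextCollector` class are made up for illustration, and `html.parser` stands in for lxml's HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical HTML: <br> is never closed, which is fine in HTML but not in XML.
html = "<p>first line<br>second line</p>"

try:
    ET.fromstring(html)
    strict_ok = True
except ET.ParseError:
    strict_ok = False  # the strict XML parser rejects the snippet

class TextCollector(HTMLParser):
    """Lenient HTML parser that simply gathers the text nodes it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed(html)
collector.close()

print(strict_ok, collector.chunks)
```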

Now we can use XPath to extract the information.

I won't go into the details of XPath here; you can look it up online, and you can learn it in an hour and a half.

Here is a simple method: after you find the source code for the information you want in the developer tools, right-click it and copy its XPath:

Continuing the code above:

q_content = s.xpath('//*[@class="QuestionHeader-title"]/text()')[0]  # Get the question title
q_describe = s.xpath('//*[@class="RichText ztext"]/text()')[0]  # Get the question description
q_numbers = s.xpath('//*[@class="NumberBoard-itemValue"]/text()')  # Get the follower and view counts
concern_num = q_numbers[0]
browing_num = q_numbers[1]
# Print the results
print('Question:', q_content, '\n', 'Description:', q_describe, '\n', 'Followers:', concern_num, '\n', 'Views:', browing_num)
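
One caveat with the `[0]` indexing above: an XPath query returns a list, and if the page layout changes and nothing matches, indexing an empty list raises `IndexError`. A small guard can make that case explicit. The helper `first_text` below is hypothetical, and the two lists are stand-ins for what `s.xpath(...)` might return.

```python
def first_text(results, default=""):
    """Return the first XPath text result, stripped, or a default if none matched."""
    if not results:
        return default
    return str(results[0]).strip()

# Stand-ins for XPath results: a list of strings, possibly empty.
matched = ["  How did you get started?  "]
unmatched = []

found = first_text(matched)
missing = first_text(unmatched, default="(not found)")

print(repr(found))
print(repr(missing))
```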

The final result:

[image: printed output of the four fields]

Complete code:

import requests
from lxml import etree
# Replace the placeholder with your own browser's User-Agent string
headers = {'User-Agent': 'Your browser headers'}
r = requests.get('https://www.zhihu.com/question/21358581', headers=headers)  # Pass in the URL and the request headers
s = etree.HTML(r.text)  # Convert the HTML text into a document tree that XPath can query
q_content = s.xpath('//*[@class="QuestionHeader-title"]/text()')[0]  # Get the question title
q_describe = s.xpath('//*[@class="RichText ztext"]/text()')[0]  # Get the question description
q_numbers = s.xpath('//*[@class="NumberBoard-itemValue"]/text()')  # Get the follower and view counts
concern_num = q_numbers[0]
browing_num = q_numbers[1]
# Print the results
print('Question:', q_content, '\n', 'Description:', q_describe, '\n', 'Followers:', concern_num, '\n', 'Views:', browing_num)

The example comes from Zhu Weijun's answer to "In plain terms, what is a web crawler?" on Zhihu.


Copyright notice
Author: ALKAOUA. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011144288440.html
