Python series -- web crawler
2022-02-01 11:44:31 【ALKAOUA】
A plain-language view of web crawlers
As the name suggests, a web crawler is like releasing a swarm of crawling bugs onto the Internet: they fetch data and bring it back, and you then consolidate it and store it.
The basic steps of a web crawler
Taking Python as an example:
- Find the page that contains the content you want to crawl
- Open the page's developer tools (press F12) to inspect the HTML source
- Locate the data you want to extract in the HTML
- Write Python code to request and parse the page
- Store the data
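The last step, storing the data, is not shown in the example later in this article. A minimal sketch of it, assuming each crawled record is a dict and that a CSV file is an acceptable destination, could look like this. The field names and sample values are hypothetical placeholders, not real crawled data.

```python
import csv
import io

# Hypothetical field names for one crawled record; adjust to your own data.
FIELDNAMES = ["question", "description", "followers", "views"]


def write_records(records, fileobj):
    """Write a list of dicts to fileobj as CSV, with a header row first."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)


# Placeholder values, not real crawled data.
sample = [{"question": "Example question", "description": "Example text",
           "followers": "10", "views": "200"}]

# StringIO stands in for a real file such as
# open("result.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_records(sample, buffer)
print(buffer.getvalue())
```

Writing to any file-like object keeps the function easy to try out in memory before pointing it at a real file.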
An example
The goal: crawl the Zhihu question "How did you get started writing Python crawlers?" and extract four pieces of data: the question title, its description, the number of followers, and the number of views.
This needs only basic Python: data types, variables, operators, functions, modules, and other simple syntax. Recommended learning sites: Python Getting started, and the Python tutorial on Liao Xuefeng's official website.
1. Crawling the HTML source
The library used here for the page request is requests, a very popular HTTP request library.
requests automatically decodes the content returned by the server, and most Unicode character sets are decoded seamlessly, so you rarely need to handle the encoding yourself.
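To make the idea of "decoding the content from the server" concrete, here is a rough standard-library sketch: read the charset the server declares in its Content-Type header and decode the raw bytes with it. This is an illustration only, not how requests is implemented internally; the function names and the sample bytes are made up for the example.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode response bytes using the charset the server declared."""
    encoding = charset_from_content_type(content_type)
    # errors="replace" substitutes a marker for invalid bytes instead of raising
    return raw_bytes.decode(encoding, errors="replace")


# Pretend these bytes and this header came back from a server.
raw = "网络爬虫".encode("gbk")
print(decode_body(raw, "text/html; charset=GBK"))
```

With requests itself, the encoding it picked is available as r.encoding, and you can assign a different value to it before reading r.text if the text comes out garbled.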
Code :
import requests
headers = {'User-Agent': Your browser headers}
# Pass in url And the request header
r = requests.get('https://www.zhihu.com/question/21358581',headers = headers)
# The content of the response
print(r.text)
The server returns the page, and after requests has decoded it, r.text holds the page as a string.
This is the HTML source we need!
The next step is to extract the four pieces of information we want from that HTML.
XPath is a language for locating information in an XML document; it can traverse the document's elements and attributes.
Another useful library here is lxml, which makes it easy to search a document with XPath expressions.
XPath only works on a parsed document, but what we just got back is a plain HTML text string, so the string has to be parsed first.
Continuing the code above:
from lxml import etree

# Parse the HTML string into an element tree that XPath can query
s = etree.HTML(r.text)
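The difference between a text string and a parsed tree can be tried without any network access. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml, purely so it runs with no extra install; it accepts only well-formed markup and supports a limited subset of XPath (no text() step, for instance), so it is not a replacement for lxml on real pages. The sample markup is made up and only borrows the class names used in this article.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a page; the content is made up.
page = """
<html>
  <body>
    <h1 class="QuestionHeader-title">Example question</h1>
    <strong class="NumberBoard-itemValue">12</strong>
    <strong class="NumberBoard-itemValue">345</strong>
  </body>
</html>
"""

# A string has no structure to query; parsing turns it into a tree of elements.
root = ET.fromstring(page)

# ".//*[@class='...']" selects descendant elements with that exact class attribute.
title = root.find(".//*[@class='QuestionHeader-title']").text
numbers = [el.text for el in root.findall(".//*[@class='NumberBoard-itemValue']")]

print(title)
print(numbers)
```

In lxml the equivalent query ends in /text() and returns the strings directly, which is why the article's code indexes the result list instead of reading a .text attribute.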
Now we can use lxml's xpath method to extract information.
XPath syntax is not covered in detail here; there is plenty of material online, and the basics take about an hour and a half to learn.
Here is a shortcut: in the browser's developer tools, find the element that holds the information you want, right-click it, and copy its XPath.
Continuing the code above:
q_content = s.xpath('//*[@class="QuestionHeader-title"]/text()')[0]  # question title
q_describe = s.xpath('//*[@class="RichText ztext"]/text()')[0]  # question description
q_numbers = s.xpath('//*[@class="NumberBoard-itemValue"]/text()')  # follower count and view count
concern_num = q_numbers[0]
browing_num = q_numbers[1]
# Print the results
print('Question:', q_content, '\n', 'Description:', q_describe, '\n', 'Followers:', concern_num, '\n', 'Views:', browing_num)
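Indexing an XPath result with [0] or [1] raises IndexError when the expression matches nothing, for example if the page layout has changed or the server returned a different page than expected. A small helper, hypothetical and not part of lxml, can make that case explicit; plain lists stand in for XPath results here.

```python
def item_or_default(matches, index=0, default=None):
    """Return matches[index], or default when the list is too short.

    Only non-negative indexes are treated as valid.
    """
    if 0 <= index < len(matches):
        return matches[index]
    return default


# Plain lists standing in for the lists that s.xpath(...) returns.
found = ["Example question"]
missing = []

print(item_or_default(found))
print(item_or_default(missing, default="(not found)"))
print(item_or_default(["12"], index=1, default="(not found)"))
```

Used with the code above, this would read, for instance, item_or_default(q_numbers, 1, default='unknown') in place of q_numbers[1].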
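The final result: the script prints the question title, the description, the number of followers, and the number of views.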
Complete code:
import requests
from lxml import etree

# Replace the placeholder with your own browser's User-Agent string
headers = {'User-Agent': 'your browser User-Agent string'}
r = requests.get('https://www.zhihu.com/question/21358581', headers=headers)  # pass in the URL and the request headers
s = etree.HTML(r.text)  # parse the HTML string into a tree that XPath can query
q_content = s.xpath('//*[@class="QuestionHeader-title"]/text()')[0]  # question title
q_describe = s.xpath('//*[@class="RichText ztext"]/text()')[0]  # question description
q_numbers = s.xpath('//*[@class="NumberBoard-itemValue"]/text()')  # follower count and view count
concern_num = q_numbers[0]
browing_num = q_numbers[1]
# Print the results
print('Question:', q_content, '\n', 'Description:', q_describe, '\n', 'Followers:', concern_num, '\n', 'Views:', browing_num)
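The values extracted by the complete code are strings. If you want to store or compare the counts as numbers, they need converting first. The sketch below assumes the counts arrive as digit strings that may contain thousands separators, such as '1,234'; that format is an assumption for illustration, so check what the page actually returns. The function name is hypothetical.

```python
def parse_count(text):
    """Convert a count string such as '1,234' to an int.

    Returns None when the text is not a plain number.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


print(parse_count("1,234"))
print(parse_count(" 56 "))
print(parse_count("n/a"))
```

Returning None instead of raising lets the caller decide whether a missing count should be skipped, logged, or stored as empty.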
The example comes from Zhu Weijun's answer to the Zhihu question "In plain terms, what is a web crawler?"
Copyright notice
Author: ALKAOUA. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011144288440.html