
XPath Quick Start: A Web Page Parsing Tool for Python Crawlers

2022-01-29 20:40:13 Skin shrimp


Code Pipi shrimp here, a simple, fun-loving guy. Like most of you, I enjoy music and games, and beyond that I've picked up an interest in writing. It's going to be a long road, so let's keep at it together.

If you like what you see, please give me a follow.

1. Introduction to XPath

XPath is a language for finding information in XML documents. It can be used to navigate through the elements and attributes of an XML document.
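As a minimal sketch of this, here is XPath at work on a tiny hand-written XML document (the bookstore data is made up purely for illustration), using the same lxml library as the examples below:

```python
from lxml import etree

# A tiny XML document to demonstrate XPath (hypothetical sample data)
xml = """
<bookstore>
    <book category="web">
        <title>Learning XML</title>
        <price>39.95</price>
    </book>
    <book category="cooking">
        <title>Everyday Italian</title>
        <price>30.00</price>
    </book>
</bookstore>
"""

root = etree.fromstring(xml)

# Select the text of every <title> element, anywhere in the document
titles = root.xpath("//title/text()")
print(titles)  # ['Learning XML', 'Everyday Italian']
```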

2. XPath path expressions

| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of the named node |
| / | Selects from the root node |
| // | Selects matching nodes anywhere in the document, starting from the current node, regardless of their position |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
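The table above can be walked through on a small XML document (the `library` markup is invented for this demo):

```python
from lxml import etree

xml = "<library><shelf id='s1'><book lang='en'>Python</book><book lang='zh'>XPath</book></shelf></library>"
root = etree.fromstring(xml)

# nodename: selects child nodes named 'shelf'
print(root.xpath("shelf"))  # a list holding one <shelf> element

# / : selects from the root node
print(root.xpath("/library/shelf/book/text()"))  # ['Python', 'XPath']

# // : selects matching nodes anywhere, regardless of position
print(root.xpath("//book/text()"))  # ['Python', 'XPath']

# . and .. : the current node and its parent
book = root.xpath("//book")[0]
print(book.xpath("..")[0].tag)  # 'shelf'

# @ : selects attributes, or filters by them inside [...]
print(root.xpath("//shelf/@id"))  # ['s1']
print(root.xpath("//book[@lang='zh']/text()"))  # ['XPath']
```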

3. Worked examples

Here I'll use Baidu's page to walk you through a few examples.

==Example==: Suppose I want to grab the hot-search entries on Baidu. Open the browser console, and we can locate the element directly by the class value of its div tag (this is the most commonly used XPath pattern).

from lxml import etree
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}
url = ""
response = requests.get(url=url, headers=headers)
# Parse the HTML with etree
data = etree.HTML(response.text)

# Compare with the table above: //div matches a div tag under any path,
# @class selects the class attribute, and text() extracts the text content
name = data.xpath("//div[@class='title-text c-font-medium c-color-t']/text()")


==Example==: Now I want to get every entry on the hot-search list. Inspecting the page shows that they all sit under one ul tag, with each entry in its own li tag.

data = etree.HTML(response.text)
# //ul matches a ul tag under any path;
# /li then selects all the li tags under that ul
ul = data.xpath("//ul[@class='s-hotsearch-content']/li")
# While crawling you may run into tags with no class attribute;
# you can locate those by id, or through their parent tag instead:
# ul = data.xpath("//ul[@id='hotsearch-content-wrapper']/li")

# Iterate over the li tags
for li in ul:
    # .//span matches any span tag under the current node; we narrow it
    # down by its class value and read the text with text()
    name = li.xpath(".//span[@class='title-content-title']/text()")


==Example==: Locate a hot-search title, then step up to its parent node, an a tag, and read its href attribute.

data = etree.HTML(response.text)

# .. steps up to the parent node (the a tag), whose href attribute we read
url = data.xpath("//div[@class='title-text c-font-medium c-color-t']/../@href")
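Putting the last two examples together, one loop can collect each title and its link side by side. The markup below is a simplified, hypothetical imitation of Baidu's hot-search structure, so the sketch runs without any network request:

```python
from lxml import etree

# Simplified, made-up markup imitating Baidu's hot-search list
html = """
<ul class="s-hotsearch-content">
    <li><a href="https://example.com/1">
        <span class="title-content-title">Topic one</span></a></li>
    <li><a href="https://example.com/2">
        <span class="title-content-title">Topic two</span></a></li>
</ul>
"""

data = etree.HTML(html)

# One pass over the li tags, pulling the title text and the link together
items = []
for li in data.xpath("//ul[@class='s-hotsearch-content']/li"):
    title = li.xpath(".//span[@class='title-content-title']/text()")[0]
    href = li.xpath(".//a/@href")[0]
    items.append((title, href))

print(items)
```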


XPath syntax really isn't difficult; what it takes is practice on real pages, and then you'll master it quickly. The crawler tutorial index below contains many spiders written with XPath that you can read through.


I am Code Pipi shrimp, a shrimp lover who enjoys sharing knowledge. I will keep publishing blog posts that are useful to you, and I look forward to your follow!!!

Creating content isn't easy. If this post helped you, I hope you'll give it a like, a comment, and a follow. Thanks for your support, and see you next time~~~


Copyright notice
Author: [Skin shrimp]. Please include a link to the original when reprinting. Thank you.
