XPath quick start: a handy web page parsing tool for Python crawlers
2022-01-29 20:40:13 【Skin shrimp】
I'm Code Pipi shrimp, a simple, fun-loving guy. Like most of you, I enjoy music and games, and I also like writing. It's a long road, so let's work hard together.
If you find this useful, please follow me.
1、XPath introduction
XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document, and it works the same way on HTML once the page has been parsed into a tree.
2、XPath path expressions
Expression | Description |
---|---|
nodename | Selects the child elements of the current node that are named nodename. |
/ | Selects from the root node; between steps, it moves to direct children. |
// | Selects matching nodes anywhere in the document (or, after a step, anywhere below that node), regardless of their position. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
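To make the table concrete, here is a minimal sketch that tries each expression with lxml. The `bookstore` document is a hypothetical sample invented for this illustration; it does not come from any real page.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book lang="en"><title>Learning XPath</title></book>
  <book lang="zh"><title>Python Crawlers</title></book>
</bookstore>
"""
root = etree.fromstring(xml)

# "/" starts from the root node: an absolute path to every title's text.
titles = root.xpath("/bookstore/book/title/text()")

# "nodename" selects child elements with that name (children of root here).
books = root.xpath("book")

# "//" matches nodes anywhere in the document.
all_titles = root.xpath("//title")

# "@" selects attributes.
langs = root.xpath("//book/@lang")

# "." is the current node and ".." is its parent.
first_title = all_titles[0]
same_node = first_title.xpath(".")[0]
parent_tag = first_title.xpath("..")[0].tag

print(titles, len(books), langs, parent_tag)
```

Note that `xpath()` always returns a list here, even when only one node matches.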
3、Explanation with examples
Here I use the Baidu home page to walk through some examples.
== Example ==: I want to get the title of the hot list on Baidu. Open the browser console and we can locate it directly by the class value of its div tag (this is the XPath syntax we use most often).
from lxml import etree
import requests
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}
url = "https://www.baidu.com/"
response = requests.get(url=url, headers=headers)
# Parse the HTML with etree
data = etree.HTML(response.text)
# Compare with the table above: //div matches a div tag under any path, @class selects the class attribute, and text() gets the text content
name = data.xpath("//div[@class='title-text c-font-medium c-color-t']/text()")
print(name[0])
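One thing worth knowing about `@class='...'` is that it compares the whole attribute string, so it stops matching if the page adds, removes, or reorders a class. The sketch below runs offline on a hypothetical HTML snippet (not the real Baidu markup) and contrasts the exact match with the standard XPath `contains()` function. It also guards against an empty result before indexing.

```python
from lxml import etree

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<div class="title-text c-font-medium c-color-t">Hot searches</div>
<div class="c-color-t title-text">Other block</div>
"""
data = etree.HTML(html)

# Exact match: the whole class string must be identical, including order.
exact = data.xpath("//div[@class='title-text c-font-medium c-color-t']/text()")

# contains() matches any div whose class attribute includes the substring.
loose = data.xpath("//div[contains(@class, 'title-text')]/text()")

# xpath() returns a list, so check it before taking the first item.
first = exact[0] if exact else None
print(first, loose)
```

`contains()` is a plain substring test, so it can also match unrelated class names that happen to include the same text. Choose the substring with that in mind.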
== Example ==: I want to get the entries of the hot list. Inspecting the page shows that they all sit under one ul tag, and each entry is an li tag.
data = etree.HTML(response.text)
# //ul matches a ul tag under any path,
# and /li gets all li tags directly under that ul
ul = data.xpath("//ul[@class='s-hotsearch-content']/li")
# While crawling you may run into tags that have no class attribute; in that case you can locate them by id, or locate their parent tag instead, as in the line below
# ul = data.xpath("//ul[@id='hotsearch-content-wrapper']/li")
# Iterate over the li tags
for li in ul:
    # .//span matches any span tag below the current node; we then locate it by its class value and use text() to get the text
    name = li.xpath(".//span[@class='title-content-title']/text()")
    print(name[0])
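The leading dot in `.//span` matters. Without it, `//span` searches the whole document again, even though it is called on a single li. The following sketch shows the difference on a hypothetical list; the markup is invented and only the class names are borrowed from the example above.

```python
from lxml import etree

# Hypothetical markup that mimics the structure described above.
html = """
<ul class="s-hotsearch-content">
  <li><span class="title-content-title">Topic A</span></li>
  <li><span class="title-content-title">Topic B</span></li>
</ul>
"""
data = etree.HTML(html)
items = data.xpath("//ul[@class='s-hotsearch-content']/li")

# Relative path: each li only sees its own span.
relative = [li.xpath(".//span[@class='title-content-title']/text()") for li in items]

# Absolute path: every li returns all spans in the document.
absolute = [li.xpath("//span[@class='title-content-title']/text()") for li in items]

print(relative)
print(absolute)
```

With the relative path each li yields one title. With the absolute path each li yields both titles, which is a common source of duplicated rows in scraped data.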
== Example ==: Locate the Baidu hot list title, then go up to its parent node, an a tag, and read its href attribute.
data = etree.HTML(response.text)
# .. selects the parent node
url = data.xpath("//div[@class='title-text c-font-medium c-color-t']/../@href")
print(url[0])
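The parent step can be written in more than one way. As a sketch on a hypothetical snippet (the URL and class name are made up for illustration), the three queries below should all return the same href: the `..` shorthand, the explicit `parent::` axis, and a predicate that selects the a tag by its child.

```python
from lxml import etree

# Hypothetical snippet: a link wrapping a title div.
html = """
<a href="https://example.com/hot">
  <div class="title-text">Hot list</div>
</a>
"""
data = etree.HTML(html)

# ".." steps from the div up to its parent, then @href reads the attribute.
by_dots = data.xpath("//div[@class='title-text']/../@href")

# parent::a is the long form of ".." and also checks the parent's tag name.
by_axis = data.xpath("//div[@class='title-text']/parent::a/@href")

# Alternatively, select the a tag that has such a div as a child.
by_predicate = data.xpath("//a[div[@class='title-text']]/@href")

print(by_dots, by_axis, by_predicate)
```

The `parent::a` form is a little stricter than `..`, because it returns nothing when the parent is not an a tag.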
XPath syntax is actually not hard. Practice it on real crawling tasks and you will pick it up quickly. You can also look at the crawler tutorial index below; many of the crawlers there are written with XPath.
Finally
I'm Code Pipi shrimp, a guy who loves sharing knowledge. I will keep posting articles that I hope are useful to you, and I look forward to your follow!
Writing takes effort. If this post helped you, please like, bookmark, and follow. Thank you for your support, and see you next time.
Copyright notice
Author: [Skin shrimp]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201292040103354.html