Python crawlers from beginner to mastery (IV): extracting information from web pages
2022-01-31 17:37:51 【zhulin1028】
I. Types of data
The data found in web pages falls into three categories:
1. Structured data
Data that can be represented with a uniform structure. It can be represented and stored in a relational database as a two-dimensional table. Its general characteristics: the data is organized in rows, each row describes one entity, and every row has the same attributes.
For example, the data in a MySQL table:
id    name      age  gender
aid1  ma        46   male
aid2  Jack Ma   53   male
aid3  Robin Li  49   male
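As a rough illustration of what "structured" means in practice, the sketch below loads the same three rows into a relational table. It uses Python's built-in sqlite3 module with an in-memory database instead of MySQL (an assumption made only so the example runs without a server), and the table name person is made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, age INTEGER, gender TEXT)"
)

# Every row has exactly the same four attributes.
rows = [
    ("aid1", "ma", 46, "male"),
    ("aid2", "Jack Ma", 53, "male"),
    ("aid3", "Robin Li", 49, "male"),
]
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", rows)

# Because the structure is uniform, the data can be queried by column.
older = conn.execute(
    "SELECT name FROM person WHERE age > 48 ORDER BY age"
).fetchall()
conn.close()
```

Here older holds the names of the people older than 48, ordered by age.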
2. Semi-structured data
A form of structured data that does not conform to the data model of a relational database or other data table, but that contains tags to separate semantic elements and to layer records and fields. It is therefore also called a self-describing structure. Common semi-structured formats are HTML, XML, and JSON, which are in effect stored as trees or graphs.
For example, a simple XML representation:
<person>
<name>A</name>
<age>13</age>
<class>aid1710</class>
<gender>female</gender>
</person>
or
<person>
<name>B</name>
<gender>male</gender>
</person>
The order of the attributes within a node does not matter, and different semi-structured records need not have the same number of attributes. This format can freely express a lot of useful information, including self-describing information (metadata). Semi-structured data is therefore highly extensible and especially well suited to large-scale distribution over the Internet.
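To make the point concrete, here is a small sketch that parses the two person records above with the standard library's xml.etree.ElementTree and turns each into a dictionary. The helper name to_dict is made up for this example.

```python
import xml.etree.ElementTree as ET

records = [
    "<person><name>A</name><age>13</age>"
    "<class>aid1710</class><gender>female</gender></person>",
    "<person><name>B</name><gender>male</gender></person>",
]

def to_dict(xml_text):
    """Map each child tag of the record to its text content."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

people = [to_dict(r) for r in records]
# The two records describe themselves and carry different sets of fields.
field_counts = [len(p) for p in people]
```

The first record yields four fields and the second only two, which a fixed-column table could not represent without empty cells.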
3. Unstructured data
Data with no fixed structure. Documents, pictures, video, audio, and so on are all unstructured data. Such data is usually stored directly as a whole, generally in a binary format.
All data other than structured and semi-structured data is unstructured data.
II. XML, HTML, DOM, and JSON
1. XML, HTML, DOM
XML, the Extensible Markup Language, is a metalanguage used to define other languages; its predecessor is SGML (Standard Generalized Markup Language). It has no tag set and no grammatical rules, but it does have syntax rules. For any type of application to parse it correctly, every XML document must be well-formed: every opening tag must have a matching closing tag, tags must not be closed out of order, and the syntax must meet the specification. An XML document may be valid, but it does not have to be. A valid document is one that conforms to its document type definition (DTD). A document that conforms to a schema is said to be schema valid.
HTML (HyperText Markup Language) is the description language of the World Wide Web. The differences and connections between HTML and XML:
XML and HTML are both used to handle data or data structures and are broadly similar in structure, but they differ in essence. The information available online can be summarized as follows.
(1) Different syntax requirements:
- HTML is case-insensitive; XML is strictly case-sensitive.
- HTML is sometimes lenient: if the context makes clear where a paragraph or list item ends, a closing tag such as </p> or </li> can be omitted. XML is a strict tree structure, and closing tags must never be omitted.
- In XML, an element that has a single tag and no matching closing tag must end with a / character, so that the parser knows not to look for a closing tag.
- In XML, attribute values must be enclosed in quotation marks. In HTML, the quotation marks are optional.
- In HTML, an attribute name may appear without a value. In XML, every attribute must have a value.
- In an XML document, the parser does not automatically strip whitespace; HTML, however, collapses runs of whitespace.
In short, XML's syntax requirements are stricter than HTML's.
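The difference in strictness can be observed with two standard-library parsers. In this sketch, xml.etree.ElementTree stands in for an XML parser and html.parser.HTMLParser for an HTML parser; the helper is_well_formed and the class TagCollector are names made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A list whose <li> closing tags are omitted: acceptable HTML, not well-formed XML.
snippet = "<ul><li>one<li>two</ul>"

def is_well_formed(text):
    """Return True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

class TagCollector(HTMLParser):
    """Record every start tag the HTML parser reports."""
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

collector = TagCollector()
collector.feed(snippet)
collector.close()

xml_ok = is_well_formed(snippet)                               # rejected as XML
xml_ok_closed = is_well_formed("<ul><li>one</li><li>two</li></ul>")
```

The HTML parser reports all three start tags without complaint, while the XML parser rejects the snippet until every tag is closed.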
(2) Different tags:
- HTML uses a fixed set of tags; XML has no fixed tags.
- HTML tags are predefined; XML tags are free-form, user-defined, and extensible.
(3) Different purposes:
- HTML is used to display data; XML is used to describe and store data, so it can serve as a persistence medium. HTML combines data with its presentation and shows the data on a page; XML separates data from presentation. XML is designed to describe data and focuses on its content; HTML is designed to display data and focuses on its appearance.
- XML is not a replacement for HTML. They are two different languages with different design goals, and XML can be regarded as a complement to HTML.
- XML does nothing by itself, and neither does HTML: neither language performs any operation (this is what they have in common).
- Perhaps the best description of XML is: a cross-platform, software- and hardware-independent tool for processing and transmitting information.
- XML is expected to be everywhere in the future and to become the most common tool for data processing and data transmission.
About the DOM:
The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing extensible markup languages. On a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model used to represent the objects in a document is called the DOM. Its history goes back to the "browser wars" between Microsoft and Netscape in the late 1990s. Competing fiercely over JavaScript and JScript, both sides packed their browsers with powerful features. Microsoft added many proprietary technologies to its web stack, including VBScript, ActiveX, and its own DHTML format, so that many pages could not be displayed properly on non-Microsoft platforms and browsers. The DOM was created in that period.
DOM stands for Document Object Model. It lets the content and structure of a document be accessed and modified in a platform- and language-independent way; in other words, it is a common way of representing and processing an HTML or XML document. The DOM's design follows the conventions of the Object Management Group (OMG), so it can be used from any programming language. It was at first regarded as a way to make JavaScript portable across browsers, but its applications have gone far beyond that. DOM techniques let a page change dynamically, for example by showing or hiding an element, changing its attributes, or adding an element, which greatly enhances a page's interactivity.
The DOM is in effect a document model described in an object-oriented way. It defines the objects needed to represent and modify a document, the behavior and properties of those objects, and the relationships between them. You can think of the DOM as a tree representation of the data and structure of a page, although the page need not actually be implemented as a tree.
With JavaScript you can restructure an entire HTML document: you can add, remove, change, or rearrange items on a page. To change something on the page, JavaScript needs an entry point to all the elements in the HTML document. That entry point, together with the methods and properties for adding, moving, changing, or removing HTML elements, is provided by the Document Object Model (DOM).
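The same DOM operations are available outside the browser. As a minimal sketch, the following uses xml.dom.minidom from Python's standard library (rather than JavaScript) to read, add, and remove nodes; the sample document is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString("<person><name>A</name><age>13</age></person>")
root = doc.documentElement

# Read: find an element by tag name and take the text of its first child.
name_text = doc.getElementsByTagName("name")[0].firstChild.data

# Add: create a new element with a text node and attach it to the root.
gender = doc.createElement("gender")
gender.appendChild(doc.createTextNode("female"))
root.appendChild(gender)

# Remove: detach the <age> element from its parent.
age = doc.getElementsByTagName("age")[0]
root.removeChild(age)

result = root.toxml()
# <person><name>A</name><gender>female</gender></person>
```

The method names (getElementsByTagName, createElement, appendChild, removeChild) are the same ones JavaScript uses in the browser, which reflects the language-independent design described above.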
2. JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript standard published by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its simple, clear hierarchy makes JSON an ideal data-interchange language: it is easy for people to read and write, easy for machines to parse and generate, and efficient to transmit over the network.
JSON syntax rules:
In JavaScript, nearly everything is an object, so any type that JSON supports can be expressed with it: strings, numbers, objects, arrays, and so on.
Objects and arrays are the two most distinctive and commonly used types:
1. Objects are written as key/value pairs
2. Data items are separated by commas
3. Curly braces hold objects
4. Square brackets hold arrays
A JSON key/value pair is one way of storing a JavaScript object, and it is written much like a JavaScript object literal:
the key name comes first, wrapped in double quotation marks "", followed by a colon : and then the value
{"firstName": "Json","class":"aid1710"}
This is easy to understand; it is equivalent to the following JavaScript object literal:
{firstName : "Json","class":"aid1710"}
The relationship between JSON and JavaScript objects:
Many people are unclear about the relationship between JSON and JavaScript objects, and some cannot even tell which is which. It can be understood this way: JSON is a string representation of a JavaScript object. It uses text to represent the information in an object, and it is in essence a string.
For example:
var obj = {a: 'Hello', b: 'World'}; // This is an object; note that the key names may also be wrapped in quotation marks
var json = '{"a": "Hello", "b": "World"}'; // This is a JSON string; in essence it is just a string
A simple demonstration of JSON operations in Python:
For a code example, see josnTest.py.
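The referenced file is not reproduced in this article, so the following is only a minimal sketch of the kind of operations it presumably covers, using the standard library's json module. The variable names are made up for the example.

```python
import json

# A Python dict plays the role of the JavaScript object.
record = {"firstName": "Json", "class": "aid1710"}

# Serialize: Python object -> JSON string.
text = json.dumps(record)
# {"firstName": "Json", "class": "aid1710"}

# Deserialize: JSON string -> Python object.
restored = json.loads(text)

# JSON arrays map to Python lists, JSON objects to dicts.
nested = json.loads('{"city": ["Harbin", "Daqing"]}')
first_city = nested["city"][0]
```

As with JavaScript, the serialized form is just a string, and it only becomes a usable object again after json.loads.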
JSON compared with XML:
1. Readability:
JSON and XML are about equally readable: one has a simpler syntax, the other a standardized tag form, and it is hard to pick a winner.
2. Extensibility:
XML is naturally extensible, and so is JSON; there is nothing XML can extend that JSON cannot. JSON, however, is on home ground in JavaScript, where it can store JavaScript compound objects, an advantage XML cannot match.
3. Encoding difficulty:
XML has plenty of encoding tools, such as Dom4j and JDom, and JSON has tools too. Even without tools, a skilled developer can quickly write the XML document or JSON string they need, although the XML document will contain many more structural characters.
4. Decoding difficulty:
There are two ways to parse XML:
One is to parse through the document model, that is, to retrieve a group of tags by tag name from a parent node, for example xmlData.getElementsByTagName("tagName"). This only works when the document structure is known in advance, so it cannot be wrapped in a general-purpose way.
The other is to traverse the nodes (document and childNodes), which can be done recursively. The parsed data still comes in varying shapes, however, and often does not meet the requirements set in advance. Any extensible data structure of this kind is bound to be hard to parse, and the same is true of JSON. When the JSON structure is known in advance, though, transferring data as JSON works very well and leads to practical, clean, readable code.
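The two XML parsing approaches can be sketched in Python with xml.dom.minidom, which exposes the same getElementsByTagName and childNodes names. The sample document and the helper walk are made up for this example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<cities><city>Harbin</city><city>Daqing</city></cities>"
)

# Approach 1: look up elements by tag name (requires knowing the structure).
by_tag = [node.firstChild.data for node in doc.getElementsByTagName("city")]

# Approach 2: recursive traversal over childNodes (works for any structure).
def walk(node, depth=0, out=None):
    """Collect (depth, label) pairs for every element and non-empty text node."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        out.append((depth, node.tagName))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        out.append((depth, node.data.strip()))
    for child in node.childNodes:
        walk(child, depth + 1, out)
    return out

visited = walk(doc.documentElement)
```

The first approach returns the city names directly; the second returns a generic list of nodes that still has to be interpreted, which is the extra effort the text describes.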
If you are a pure front-end developer, you will probably like JSON very much. If you are an application developer, you may like it less; after all, XML is the true structured markup language for data transfer. And parsing JSON without knowing its structure is a nightmare: it costs time and effort, the code becomes redundant and long-winded, and the results are unsatisfying.
This does not stop many front-end developers from choosing JSON, because toJSONString() in json.js shows the string structure of the JSON. Without that string, of course, it is still a nightmare. People who use JSON regularly understand its structure as soon as they see the string, which makes it easier to work with. The above concerns only parsing XML and JSON for data transfer in JavaScript.
In JavaScript territory, JSON is on home ground, and there its advantages far exceed those of XML.
If the JSON stores JavaScript compound objects whose structure is unknown, many programmers will struggle to parse it. Beyond the points above, another big difference between JSON and XML is the proportion of useful data. JSON is more efficient as a transfer format because, unlike XML, it does not need strictly paired closing tags. This raises the ratio of useful data to total packet size and so reduces the network load for the same amount of data.
Example comparison:
XML and JSON both mark up data in a structured way. Here is a simple comparison.
The data for some provinces and cities of China, expressed in XML:
<?xml version="1.0" encoding="utf-8"?>
<country>
    <name>China</name>
    <province>
        <name>Heilongjiang</name>
        <cities>
            <city>Harbin</city>
            <city>Daqing</city>
        </cities>
    </province>
    <province>
        <name>Guangdong</name>
        <cities>
            <city>Guangzhou</city>
            <city>Shenzhen</city>
            <city>Zhuhai</city>
        </cities>
    </province>
    <province>
        <name>Taiwan</name>
        <cities>
            <city>Taipei</city>
            <city>Kaohsiung</city>
        </cities>
    </province>
    <province>
        <name>Xinjiang</name>
        <cities>
            <city>Urumqi</city>
        </cities>
    </province>
</country>
The same data expressed in JSON:
{
    "name": "China",
    "province": [{
        "name": "Heilongjiang",
        "cities": {
            "city": ["Harbin", "Daqing"]
        }
    }, {
        "name": "Guangdong",
        "cities": {
            "city": ["Guangzhou", "Shenzhen", "Zhuhai"]
        }
    }, {
        "name": "Taiwan",
        "cities": {
            "city": ["Taipei", "Kaohsiung"]
        }
    }, {
        "name": "Xinjiang",
        "cities": {
            "city": ["Urumqi"]
        }
    }]
}
As you can see, JSON's simple syntax and clear hierarchy are noticeably easier to read than XML, and because JSON uses far fewer characters than XML for the same data, it can save considerable bandwidth in data exchange.
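The size difference can be checked with a short sketch that serializes the same data both ways using only the standard library. It uses two of the provinces from the example, and the helper to_xml is a name made up here; the exact character counts depend on whitespace and formatting choices, so treat the comparison as indicative only.

```python
import json
import xml.etree.ElementTree as ET

data = {
    "name": "China",
    "province": [
        {"name": "Heilongjiang", "cities": {"city": ["Harbin", "Daqing"]}},
        {"name": "Guangdong",
         "cities": {"city": ["Guangzhou", "Shenzhen", "Zhuhai"]}},
    ],
}

def to_xml(country):
    """Build the XML form of the data, without indentation."""
    root = ET.Element("country")
    ET.SubElement(root, "name").text = country["name"]
    for prov in country["province"]:
        p = ET.SubElement(root, "province")
        ET.SubElement(p, "name").text = prov["name"]
        cities = ET.SubElement(p, "cities")
        for city in prov["cities"]["city"]:
            ET.SubElement(cities, "city").text = city
    return ET.tostring(root, encoding="unicode")

# Compact JSON (no spaces after separators) versus unindented XML.
json_text = json.dumps(data, separators=(",", ":"))
xml_text = to_xml(data)
json_is_smaller = len(json_text) < len(xml_text)
```

Each additional city costs a pair of opening and closing tags in XML but only quotation marks and a comma in JSON, which is where most of the difference comes from.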
III. How to extract information from web pages
1. XPath and lxml
XPath is a language for finding information in XML documents. An understanding of XPath underlies many advanced XML applications; XPath navigates through the elements and attributes of an XML document.
lxml is a third-party Python library for XML. Under the hood it wraps libxml2 and libxslt, which are written in C, and it exposes a simple yet powerful Python API that is compatible with, and extends, the well-known ElementTree API.
Install: pip install lxml
Use: from lxml import etree
1. XPath terminology:
In XPath, an XML document is treated as a tree of nodes, and the root of the tree is also called the document node. XPath divides the nodes in the tree into seven kinds: element, attribute, text, namespace, processing-instruction, comment, and document nodes.
Consider this example XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
In the XML document above:
<bookstore> (this is the "root")
<author>J K. Rowling</author> (this is an "element")
lang="en" (this is an "attribute")
Seen from another angle:
bookstore (root)
    book (element)
        title (element)
            lang = en (attribute)
            text = Harry Potter (text)
        author (element)
            text = J K. Rowling (text)
        year (element)
            text = 2005 (text)
        price (element)
            text = 29.99 (text)
2. Relationships between nodes
Parent: every element has a parent node, and the parent of the topmost element is the root node. Likewise every attribute has a parent, which is its element. In the example XML document above, the root bookstore is the parent of the element book; book is the parent of title, author, year, and price; and title is the parent of lang.
Children: an element can have zero or more children. In the example XML document above, title, author, year, and price are children of book.
Siblings: nodes with the same parent are siblings of one another. In the example XML document above, title, author, year, and price are siblings.
Ancestors: a node's parent, its parent's parent, and so on, all the way up to the root node. In the example XML document above, the ancestors of title, author, year, and price are book and bookstore.
Descendants: a node's children, its children's children, and so on, down to the last child nodes. In the example XML document above, the descendants of bookstore are book, title, author, year, and price.
3. Selecting nodes
The basic path expressions are listed below. Remember that every XPath path expression is evaluated relative to some node; initially the current node is usually the root node. This works like switching paths in a Linux file system.
Expression  Description
nodename    Selects the child element nodes named nodename under the matched node.
/           At the start of a path, selects from the root node.
//          Selects nodes among the descendants of the matched node, wherever they are.
.           Selects the current node.
..          Selects the parent element node of the current node.
@           Selects attributes.
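These expressions can be tried out without installing anything: the standard library's xml.etree.ElementTree understands a limited subset of XPath, and lxml accepts the same calls because it is compatible with the ElementTree API. The sketch below is therefore an approximation, not lxml's full XPath engine. ElementTree paths are always relative to the element they are called on, so the leading "/" form is not shown.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>""")

# nodename: child elements named "book" under the current node.
child_books = root.findall("book")

# //: any descendant named "title" (written .// in ElementTree).
all_titles = root.findall(".//title")

# ..: the parent element of each matched title.
title_parents = root.findall(".//title/..")

# @: attributes are read with get() or tested inside square brackets.
lang = all_titles[0].get("lang")
titles_with_lang = root.findall(".//title[@lang]")
```

With lxml installed, the full syntax is available through the xpath() method of an element, for example root.xpath("//title/@lang").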
4. Wildcards
*       Matches any element.
@*      Matches any attribute.
node()  Matches a node of any type.
5. Predicates (conditional selection)
A predicate is used to find a specific node or the nodes that satisfy a condition; the predicate expression is written in square brackets. The "|" operator selects the union of several paths, that is, the nodes matching one path "or" another.
For concrete examples, see the code in lxmlTest.py.
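Since lxmlTest.py is not included here, the following is only a sketch of typical predicates. It again uses the standard library's xml.etree.ElementTree, which supports positional and attribute predicates but not comparisons such as price>35 or the "|" operator, so those two are emulated in Python. The second and third books are made-up sample data.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="zh">Journey to the West</title><price>39.95</price></book>
  <book><title lang="en">Learning XML</title><price>49.50</price></book>
</bookstore>""")

# Positional predicates: first and last <book> child.
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text

# Attribute predicate: titles whose lang attribute equals "en".
english = [t.text for t in root.findall(".//title[@lang='en']")]

# Value comparison, emulated in Python because ElementTree cannot
# evaluate price>35 inside the square brackets.
expensive = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35
]

# Union ("|"), emulated by concatenating the results of two paths.
titles_and_prices = root.findall(".//title") + root.findall(".//price")
```

With lxml, the last two queries can be written directly as root.xpath("//book[price>35]/title") and root.xpath("//title | //price").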
6. Axes
XPath axes: an axis defines a set of nodes relative to the current node.
Axis name           Meaning
ancestor            Selects all ancestors of the current node, including the root node.
ancestor-or-self    Selects all ancestors of the current node and the current node itself.
attribute           Selects all attributes of the current node.
child               Selects all child elements of the current node.
descendant          Selects all descendant elements of the current node.
descendant-or-self  Selects all descendant elements of the current node and the current node itself.
following           Selects all nodes in the document after the closing tag of the current node.
following-sibling   Selects all siblings after the current node.
namespace           Selects all namespace nodes of the current node.
parent              Selects the parent of the current node.
preceding           Selects all nodes in the document before the opening tag of the current node.
preceding-sibling   Selects all siblings before the current node.
self                Selects the current node.
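To show what two of these axes return, the sketch below emulates ancestor and following-sibling in plain Python on top of xml.etree.ElementTree, which has no axis syntax of its own. The helpers ancestors and following_siblings are names made up for this example; with lxml the same results come from expressions such as ancestor::* and following-sibling::*.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bookstore><book>"
    "<title lang='en'>Harry Potter</title>"
    "<author>J K. Rowling</author>"
    "<year>2005</year><price>29.99</price>"
    "</book></bookstore>"
)

# ElementTree elements do not store their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """Emulate the ancestor axis: parent, grandparent, ... up to the top element."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result

def following_siblings(node):
    """Emulate the following-sibling axis: later children of the same parent."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]

author = root.find(".//author")
ancestor_tags = [n.tag for n in ancestors(author)]
following_tags = [n.tag for n in following_siblings(author)]
```

For the author element, the ancestors are book and bookstore, and the following siblings are year and price.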
7. Location path expressions
A location path can be absolute or relative. An absolute path starts with "/". Every path consists of one or more steps separated by "/".
Absolute path: /step/step/…
Relative path: step/step/…
Each step is evaluated against the nodes in the current node set.
A step consists of three parts:
Axis: defines the relationship between the selected nodes and the current node.
Node test: identifies the nodes within an axis.
Predicate: a condition that further filters the node set.
Step syntax: axis::nodetest[predicate]
2. BeautifulSoup4
Beautiful Soup is an HTML/XML parser written in Python. It copes well with non-standard markup and builds a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. It can save a lot of programming time.
Install: (sudo) pip install beautifulsoup4
Use:
Import the Beautiful Soup library into the program:
from bs4 import BeautifulSoup # For processing HTML; pass the "xml" parser name to process XML (requires lxml)
# The BeautifulSoup 3 imports (from BeautifulSoup import ...) do not work with the beautifulsoup4 package
# Code example
from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
# Name the parser explicitly; lxml (installed earlier) closes the unclosed <p> tags
soup = BeautifulSoup(''.join(doc), 'lxml')
print(soup.prettify())
Navigating to elements of the soup is simple. For the example above (the outputs shown assume the lxml parser):
soup.contents[0].name
# 'html'
soup.contents[0].contents[0].name
# 'head'
head = soup.contents[0].contents[0]
head.parent.name
# 'html'
head.next_element
# <title>Page title</title>
head.next_sibling.name
# 'body'
head.next_sibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.next_sibling.contents[0].next_sibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
You can also search the soup for a specific tag, or for tags with specific attributes, and modifying the soup is just as simple.
BS4 compared with lxml:
lxml: implemented in C, traverses the document locally, fast; more complex, and its syntax is less friendly.
BS4: implemented in Python, loads the entire document, slow; simple, with a user-friendly API.
3. Regular expressions (re)
Regular expressions are used to search for or replace text that matches a pattern (rule). For text filtering and rule matching they are the most powerful tool, and they are indispensable for Python crawlers.
Basic matching rules:
[0-9]   Any digit; equivalent to \d
[a-z]   Any lowercase letter
[A-Z]   Any uppercase letter
[^0-9]  Any non-digit; equivalent to \D
\w      Letters, digits, and underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
\W      The negation of \w
.       Any character (except a newline, by default)
[]      Matches any one of the characters inside the brackets
[^]     Negates the character set
*       Matches the preceding character or subexpression 0 or more times
+       Matches the preceding character or subexpression at least once
?       Matches the preceding character or subexpression 0 or 1 times
^       Matches the start of the string
$       Matches the end of the string
Using regular expressions in Python
Python's re module
pattern: a compiled regular expression
Several important methods:
match: matches once, only at the beginning of the string;
search: matches once, scanning from any position in the string;
findall: returns all matches;
split: splits the string at each match;
sub: replaces the matches;
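A short sketch of the five methods on one made-up sample string; the phone-number-like text and the pattern are assumptions chosen only for illustration.

```python
import re

text = "tel: 010-1234, fax: 021-5678"
pattern = re.compile(r"\d{3}-\d{4}")     # compiled pattern object

at_start = pattern.match(text)           # None: the text does not start with digits
first = pattern.search(text).group()     # first match anywhere in the text
all_numbers = pattern.findall(text)      # every match, as a list of strings
parts = re.split(r",\s*", text)          # split at commas followed by optional spaces
masked = pattern.sub("***", text)        # replace every match
```

Note the difference between match and search: match only succeeds if the pattern matches at position 0, which is a common source of surprises.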
Two modes to be aware of:
Greedy mode: (.*) matches as much text as possible
Lazy mode: (.*?) matches as little text as possible
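The difference shows up as soon as a delimiter appears more than once in the text. A minimal sketch on a made-up HTML fragment:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the FIRST </b> after each <b>.
lazy = re.findall(r"<b>(.*?)</b>", html)
```

When scraping pages, the lazy form is usually the one you want, because tags of the same kind tend to repeat.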
Exercise: use a regular expression to convert
i=d%0A&from=AUTO&to=AUTO&smartresult=dict
into the following form:
i:d%0A
from:AUTO
to:AUTO
smartresult:dict
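One possible solution to the exercise, offered as a sketch rather than the author's own answer: capture each key and value with findall, then join them with a colon, one pair per line.

```python
import re

query = "i=d%0A&from=AUTO&to=AUTO&smartresult=dict"

# A key is anything up to "=", a value is anything up to the next "&".
pairs = re.findall(r"([^=&]+)=([^&]*)", query)

converted = "\n".join(f"{key}:{value}" for key, value in pairs)
print(converted)
```

The same result can be produced with a single re.sub call, but collecting the pairs first also leaves them available as a list of tuples, for example to build request headers or form data.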
Summary: comparison of regular expressions, BS4, and lxml
Copyright notice
Author: zhulin1028. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311737494815.html