
Python crawler from introduction to mastery (IV) extracting information from web pages

2022-01-31 17:37:51 zhulin1028


I. Types of data

The data found in web pages falls into three categories:

1. Structured data

Data that can be represented by a uniform structure. It can be represented and stored in a relational database as two-dimensional tables. Its general characteristics are: the data is organized in rows, each row describes one entity, and every row has the same attributes.

For example, the data in a MySQL table:

id      name        age     gender
aid1    ma          46      male
aid2    Jack ma     53      male
aid3    Robin Li    49      male

2. Semi-structured data

A form of structured data that does not conform to the data model of a relational database or other data table, but contains tags that separate semantic elements and give records and fields a hierarchy. It is therefore also called a self-describing structure. Common semi-structured formats are HTML, XML and JSON, which are in effect stored as a tree or graph.

For example, a simple XML representation:

<person>
    <name>A</name>
    <age>13</age>
    <class>aid1710</class>
    <gender>female</gender>
</person>

or

<person>
    <name>B</name>
    <gender>male</gender>
</person>

The order of the fields inside a node does not matter, and different records need not have the same number of fields. This format can freely express a lot of useful information, including self-describing information (metadata). Semi-structured data therefore extends well and is especially suited to large-scale exchange over the Internet.

3. Unstructured data

Data with no fixed structure. Documents, pictures, video and audio of all kinds are unstructured data. Data of this type is usually stored directly as a whole, generally in a binary format.

All data other than structured and semi-structured data is unstructured data.

II. About XML, HTML, DOM and JSON

1. XML, HTML, DOM

XML, the Extensible Markup Language, is a metalanguage used to define other languages; its predecessor is SGML (Standard Generalized Markup Language). It has no tag set and no grammatical rules, but it does have syntax rules. For any application to parse it correctly, an XML document must be well-formed: every opening tag must have a matching closing tag, tags must not overlap (they close in the reverse of the order in which they opened), and the syntax must meet the specification. An XML document may be valid, but does not have to be. A valid document is one that conforms to its document type definition (DTD); a document that conforms to a schema is said to be schema valid.
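The well-formedness rule can be checked mechanically. The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree as the parser; the helper name is_well_formed and the three sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<person><name>A</name><age>13</age></person>"
bad_unclosed = "<person><name>A</name><age>13</person>"        # <age> is never closed
bad_overlap = "<person><name>A<age></name>13</age></person>"   # tags overlap

print(is_well_formed(good))          # True
print(is_well_formed(bad_unclosed))  # False
print(is_well_formed(bad_overlap))   # False
```

Note that this only tests well-formedness; checking validity against a DTD or schema needs a validating parser.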

HTML (HyperText Markup Language) is the description language of the WWW. The differences and connections between HTML and XML:

XML and HTML are both used to work with data or data structures and look roughly alike in structure, but they differ in essence. The points below summarize information collected from the Internet.

(1) Different syntax requirements:

  1. HTML is case-insensitive; XML is strictly case-sensitive.

  2. HTML is sometimes lenient: if the context makes clear where a paragraph or list item ends, the end tag (such as </p> or </li>) may be omitted. XML is a strict tree structure, and end tags must never be omitted.

  3. In XML, an element that has a single tag and no matching end tag must end with a / character, so the parser knows it does not have to look for an end tag.

  4. In XML, attribute values must be enclosed in quotation marks. In HTML, the quotation marks are often optional.

  5. In HTML, an attribute name may appear without a value. In XML, every attribute must have a value.

  6. In an XML document, whitespace is not removed automatically by the parser; in HTML, runs of whitespace are collapsed.

In short, XML's syntax requirements are stricter than HTML's.

(2) Different tags:

  1. HTML uses a fixed set of tags; XML has no fixed tags.

  2. HTML tags are predefined; XML tags are free, self-defined and extensible.

(3) Different purposes:

  1. HTML is used to display data; XML is used to describe and store data, so it can serve as a persistence medium. HTML combines data with its presentation and shows the data on a page; XML separates data from presentation.

  2. XML is not a substitute for HTML; they are two different languages. XML is not meant to replace HTML and can in fact be seen as a complement to it. Their goals differ: HTML is designed to display data and focuses on how the data looks, while XML is designed to describe data and focuses on what the data contains.

  3. XML does nothing by itself. Like HTML, XML performs no operations (a point the two have in common).

  4. Perhaps the best description of XML is: a cross-platform tool, independent of software and hardware, for processing and transmitting information.

  5. XML is expected to be used everywhere and to become one of the most common tools for data processing and data transmission.

About DOM:

The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing extensible markup languages. In a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model used to represent the objects in a document is called the DOM. Its history goes back to the "browser wars" between Microsoft and Netscape in the late 1990s. Fighting for survival over JavaScript and JScript, both sides gave their browsers a large number of powerful features. Microsoft added many proprietary technologies to the web, including VBScript, ActiveX and its own DHTML format, so that many pages could not be displayed properly on non-Microsoft platforms and browsers. The DOM was created in that period.

DOM stands for Document Object Model. It allows the content and structure of a document to be accessed and modified in a platform- and language-independent way; in other words, it is a common way to represent and process an HTML or XML document. The DOM's design is based on the specifications of the Object Management Group (OMG), so it can be used from any programming language. It was originally seen as a way to make JavaScript portable across browsers, but its applications have gone far beyond that. DOM techniques let pages change dynamically, for example by showing or hiding an element, changing its attributes, or adding an element, which greatly improves a page's interactivity.

The DOM is in effect a document model described in an object-oriented way. It defines the objects needed to represent and modify a document, the behavior and properties of those objects, and the relationships between them. You can think of the DOM as a tree representation of the data and structure of a page, although the page need not actually be implemented as a tree.

With JavaScript you can restructure an entire HTML document: you can add, remove, change or rearrange items on the page. To change something on the page, JavaScript needs an entry point to all the elements of the HTML document. That entry point, together with the methods and properties for adding, moving, changing or removing HTML elements, is provided by the Document Object Model (DOM).
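Because the DOM is language-independent, the same add, change and remove operations can also be tried from Python. The following is a minimal sketch using the standard library's xml.dom.minidom rather than a browser; the <person> document is reused from earlier in this article, while the new name "C" and the added <age> element are invented for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<person><name>B</name><gender>male</gender></person>")
root = doc.documentElement            # the <person> element

# Read: look an element up by tag name and read its text child.
name = root.getElementsByTagName("name")[0]
print(name.firstChild.data)           # B

# Change: overwrite the text node of <name>.
name.firstChild.data = "C"

# Add: create a new <age> element with a text child and append it.
age = doc.createElement("age")
age.appendChild(doc.createTextNode("13"))
root.appendChild(age)

# Remove: detach the <gender> element from its parent.
gender = root.getElementsByTagName("gender")[0]
root.removeChild(gender)

print(root.toxml())
```

The method names (getElementsByTagName, createElement, appendChild, removeChild) are the same ones JavaScript uses in a browser, which is the point of a language-neutral interface.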

2. JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript standard published by Ecma International) and stores and represents data in a text format that is completely independent of any programming language. Its simple, clear hierarchy makes JSON an ideal data-interchange language: it is easy for people to read and write, easy for machines to parse and generate, and efficient to transmit over a network.

JSON syntax rules:

In JavaScript, everything is an object, so any type that JSON supports can be expressed in it, such as strings, numbers, objects and arrays.

Objects and arrays are two special and commonly used types:

  1. Objects are represented as key/value pairs

  2. Data items are separated by commas

  3. Curly braces hold objects

  4. Square brackets hold arrays

JSON key/value pairs are one way of storing a JS object, and they are written much like a JS object literal.

In a key/value pair the key name comes first, wrapped in double quotation marks "", followed by a colon : and then the value:

{"firstName": "Json","class":"aid1710"}

This is easy to understand; it is equivalent to this JavaScript object literal:

{firstName : "Json","class":"aid1710"}

The relationship between JSON and JS objects:

Many people are unclear about the relationship between JSON and JS objects, or even which is which. It can be understood this way: JSON is a string representation of a JS object. It uses text to represent the information in a JS object, and is in essence a string.

For example:

var obj = {a: 'Hello', b: 'World'}; // This is an object; note that the key names may also be wrapped in quotation marks
var json = '{"a": "Hello", "b": "World"}'; // This is a JSON string; in essence it is a string

A simple demonstration of JSON operations in Python:

See josnTest.py for the code example.
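josnTest.py itself is not reproduced in this article, so the following is only a minimal sketch of what such a demonstration might look like, using the standard json module. The sample dictionary (including the scores and graduated keys) is invented for illustration.

```python
import json

# A Python dict plays the role of the JS object.
obj = {"firstName": "Json", "class": "aid1710", "scores": [90, 85], "graduated": False}

# Serialize: Python object -> JSON string.
text = json.dumps(obj)
print(type(text).__name__)   # str
print(text)                  # note that Python's False is written as JSON false

# Deserialize: JSON string -> Python object.
restored = json.loads('{"a": "Hello", "b": "World"}')
print(restored["a"], restored["b"])   # Hello World

# indent=2 pretty-prints the structure, which helps when inspecting unknown JSON.
print(json.dumps(obj, indent=2))
```

json.dumps and json.loads are the string-based pair; json.dump and json.load do the same with file objects.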

Comparison of JSON and XML:

1. Readability:

JSON and XML are about equally readable: one has a simple syntax, the other a standardized tag form, so it is hard to pick a winner.

2. Extensibility:

XML is naturally extensible, and so is JSON; there is nothing XML can extend to that JSON cannot. JSON, however, is on home ground in JavaScript: it can store JavaScript compound objects, an advantage XML cannot match.

3. Encoding difficulty:

XML has plenty of encoding tools, such as Dom4j and JDom, and JSON has tools too. Even without tools, a skilled developer can quickly write the XML document or JSON string they need, but the XML document requires many more structural characters.

4. Decoding difficulty

There are two ways to parse XML:

One is to parse through the document model, that is, to retrieve a group of tags through their parent tag, for example xmlData.getElementsByTagName("tagName"). This only works when the document structure is known in advance, so it cannot be wrapped in a general-purpose way.

The other is to traverse the nodes (document and childNodes). This can be done recursively, but the parsed data still comes in varying shapes and often fails to meet the original requirements. Any data with such an extensible structure is bound to be hard to parse, and JSON is no exception. If the JSON structure is known in advance, though, transferring data as JSON works very well, and the code can be practical, clean and readable.

If you are a pure front-end developer, you will probably like JSON very much. If you are an application developer, you may like it less; after all, XML is the real structured markup language for data transfer. And parsing JSON without knowing its structure is a nightmare: it costs time and effort, the code becomes redundant and long-winded, and the results are unsatisfactory.

This does not stop many front-end developers from choosing JSON, because toJSONString() in json.js lets you see the string structure of the JSON. Without that string it is still a nightmare, but people who use JSON regularly understand the structure as soon as they see the string, which makes the JSON easier to work with. All of the above concerns parsing XML and JSON for data transfer in JavaScript only.

In JavaScript, JSON is on home ground, and its advantages there far exceed XML's.

If the JSON stores JavaScript compound objects whose structure is unknown, many programmers will struggle to parse it. Beyond the points above, another big difference between JSON and XML is the effective data rate. JSON is more efficient as a transmission format because, unlike XML, it does not need strict closing tags. This raises the ratio of payload to total packet size, which reduces the load on the network for the same amount of data.

Example comparison:

XML and JSON both mark up data in a structured way. Here is a simple comparison.

Data for some provinces and cities of China, expressed in XML:

<?xml version="1.0" encoding="utf-8"?>
<country>
    <name>China</name>
    <province>
        <name>Heilongjiang</name>
        <cities>
            <city>Harbin</city>
            <city>Daqing</city>
        </cities>
    </province>
    <province>
        <name>Guangdong</name>
        <cities>
            <city>Guangzhou</city>
            <city>Shenzhen</city>
            <city>Zhuhai</city>
        </cities>
    </province>
    <province>
        <name>Taiwan</name>
        <cities>
            <city>Taipei</city>
            <city>Kaohsiung</city>
        </cities>
    </province>
    <province>
        <name>Xinjiang</name>
        <cities>
            <city>Urumqi</city>
        </cities>
    </province>
</country>

Expressed in JSON:

{
    "name": "China",
    "province": [{
        "name": "Heilongjiang",
        "cities": {
            "city": ["Harbin", "Daqing"]
        }
    }, {
        "name": "Guangdong",
        "cities": {
            "city": ["Guangzhou", "Shenzhen", "Zhuhai"]
        }
    }, {
        "name": "Taiwan",
        "cities": {
            "city": ["Taipei", "Kaohsiung"]
        }
    }, {
        "name": "Xinjiang",
        "cities": {
            "city": ["Urumqi"]
        }
    }]
}

As you can see, JSON's simple syntax and clear hierarchy are easier to read than XML's, and because JSON needs far fewer characters than XML for the same data, it can save a noticeable amount of bandwidth in data exchange.
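The size difference can be measured rather than eyeballed. The following is a rough sketch, assuming the standard json and xml.etree.ElementTree modules and a shortened two-province version of the data above; the helper to_xml is hypothetical, and the exact character counts depend on formatting choices such as whitespace.

```python
import json
import xml.etree.ElementTree as ET

data = {
    "name": "China",
    "province": [
        {"name": "Heilongjiang", "cities": {"city": ["Harbin", "Daqing"]}},
        {"name": "Guangdong", "cities": {"city": ["Guangzhou", "Shenzhen", "Zhuhai"]}},
    ],
}


def to_xml(country):
    """Build the <country> document used in the article from the dict form."""
    root = ET.Element("country")
    ET.SubElement(root, "name").text = country["name"]
    for prov in country["province"]:
        p = ET.SubElement(root, "province")
        ET.SubElement(p, "name").text = prov["name"]
        cities = ET.SubElement(p, "cities")
        for city in prov["cities"]["city"]:
            ET.SubElement(cities, "city").text = city
    return ET.tostring(root, encoding="unicode")


# Serialize both without optional whitespace so the comparison is fair.
json_text = json.dumps(data, separators=(",", ":"))
xml_text = to_xml(data)

print(len(json_text), len(xml_text))
print(f"JSON is {len(json_text) / len(xml_text):.0%} of the XML size")
```

The saving comes mostly from XML repeating every element name in its closing tag; for data with long text values and few tags, the gap narrows.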

III. How to extract information from web pages

1. XPath and lxml

XPath is a language for finding information in XML documents. Understanding XPath is the foundation for many advanced XML applications; XPath navigates through the elements and attributes of an XML document.

lxml is a third-party Python library for processing XML. Under the hood it wraps libxml2 and libxslt, which are written in C, behind a simple and powerful Python API that is compatible with, and extends, the well-known ElementTree API.

Install: pip install lxml

Use: from lxml import etree

1. XPath terminology:

In XPath, an XML document is treated as a tree of nodes, whose root is also called the document node. XPath divides the nodes of the tree into seven kinds: element, attribute, text, namespace, processing instruction, comment, and document node.

Consider this example XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>

In the XML document above:

<bookstore>                      (this is the "root")
<author>J K. Rowling</author>    (this is an "element")
lang="en"                        (this is an "attribute")

Seen from another angle:

bookstore                           (root)
    book                            (element)
        title                       (element)
            lang = en               (attribute)
            text = Harry Potter     (text)
        author                      (element)
            text = J K. Rowling     (text)
        year                        (element)
            text = 2005             (text)
        price                       (element)
            text = 29.99            (text)

2. Relationships between nodes

Parent: every element has a parent node, and the parent of the topmost element is the root node. Likewise every attribute has a parent, which is its element. In the example XML document above, the root bookstore is the parent of the element book, book is the parent of title, author, year and price, and title is the parent of lang.

Children: an element can have zero or more children. In the example XML document above, title, author, year and price are the children of book.

Sibling: nodes with the same parent are siblings of one another. In the example XML document above, title, author, year and price are siblings.

Ancestor: a node's parent, its parent's parent, and so on, up to the root node. In the example XML document above, the ancestors of title, author, year and price are book and bookstore.

Descendant: a node's children, its children's children, and so on, down to the last child node. In the example XML document above, the descendants of bookstore are book, title, author, year and price.

3. Selecting nodes

The basic path expressions are listed below. Remember that every XPath path expression is evaluated relative to some node; the initial current node is usually the root node. This works on the same principle as changing paths in Linux.

Expression     Description

nodename       Selects the child element nodes named nodename under the matched node.
/              At the start of a path, selects from the root node.
//             Selects nodes among the descendants of the matched node, wherever they are.
.              Selects the current node.
..             Selects the parent element of the current node.
@              Selects attributes.

4. Wildcards

*              Matches any element.
@*             Matches any attribute.
node()         Matches a node of any type.
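The path expressions and wildcards above can be tried against the bookstore document. The following is a minimal sketch, assuming lxml is installed as described earlier; it is not the article's own test file, and the variable names are illustrative.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)          # root is the <bookstore> element

# "/" at the start: an absolute path from the root of the document.
titles = root.xpath("/bookstore/book/title/text()")
print(titles)                         # ['Harry Potter']

# "//": any <title> in the document, then "@" selects its lang attribute.
langs = root.xpath("//title/@lang")
print(langs)                          # ['en']

# nodename and "*": every child element of each <book> child of the current node.
child_tags = [e.tag for e in root.xpath("book/*")]
print(child_tags)                     # ['title', 'author', 'year', 'price']

# "..": the parent of the first <title>.
title = root.xpath("//title")[0]
parent_tag = title.xpath("..")[0].tag
print(parent_tag)                     # book

# "@*": every attribute of <title>, whatever its name.
all_attrs = title.xpath("@*")
print(all_attrs)                      # ['en']
```

xpath() always returns a list, even for a single hit, so indexing with [0] is needed to get one element.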

5. Predicates (conditional selection)

A predicate is used to find a specific node, or the nodes that satisfy some condition; it is written in square brackets. The "|" operator combines several paths and selects the nodes that match any of them.

See lxmlTest.py for specific examples.
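lxmlTest.py is not included in this article, so the following is only a sketch of the kind of predicates it might contain, assuming lxml. The three-book document is invented for illustration; only the first book comes from the example above.

```python
from lxml import etree

root = etree.fromstring("""
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="zh">Journey to the West</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>5.00</price></book>
</bookstore>""")

# Positional predicates: XPath positions start at 1, not 0.
first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
print(first, last)        # ['Harry Potter'] ['Untitled Draft']

# Attribute predicates: has the attribute / attribute equals a value.
with_lang = root.xpath("//title[@lang]/text()")
chinese = root.xpath("//title[@lang='zh']/text()")
print(with_lang)          # ['Harry Potter', 'Journey to the West']
print(chinese)            # ['Journey to the West']

# Value predicate on a child element: books whose price is above 30.
expensive = root.xpath("//book[price>30]/title/text()")
print(expensive)          # ['Journey to the West']

# "|" unions two paths; here the title and the price of the first book.
union_tags = [e.tag for e in root.xpath("//book[1]/title | //book[1]/price")]
print(union_tags)         # ['title', 'price']
```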

6. Axes

XPath axes: an axis defines a node set relative to the current node.

Axis name             Meaning

ancestor              Selects all ancestors of the current node, up to and including the root.
ancestor-or-self      Selects all ancestors of the current node and the current node itself.
attribute             Selects all attributes of the current node.
child                 Selects all children of the current node.
descendant            Selects all descendants of the current node.
descendant-or-self    Selects all descendants of the current node and the current node itself.
following             Selects everything in the document after the end tag of the current node.
following-sibling     Selects all siblings after the current node.
namespace             Selects all namespace nodes of the current node.
parent                Selects the parent of the current node.
preceding             Selects everything in the document before the start tag of the current node.
preceding-sibling     Selects all siblings before the current node.
self                  Selects the current node.

7. Location path expressions

A location path can be absolute or relative. An absolute path starts with "/". Every path consists of one or more steps, separated by "/".

    Absolute path: /step/step/…

    Relative path: step/step/…

Each step is evaluated against the nodes in the current node set.

A step has three parts:

    Axis:         defines the relationship between the selected nodes and the current node.

    Node test:    identifies nodes within the axis.

    Predicate:    a condition that filters the node set further.

Step syntax: axis::node-test[predicate]
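The axis::node-test[predicate] form can be tried directly in lxml. The following is a minimal sketch, assuming lxml and the single-book bookstore document from above; the helper tags is hypothetical and only turns elements into their tag names for printing.

```python
from lxml import etree

root = etree.fromstring(
    "<bookstore><book>"
    "<title lang='en'>Harry Potter</title>"
    "<author>J K. Rowling</author>"
    "<year>2005</year>"
    "<price>29.99</price>"
    "</book></bookstore>"
)
author = root.xpath("//author")[0]    # use <author> as the current node


def tags(nodes):
    """Return the tag names of a list of elements."""
    return [n.tag for n in nodes]


ancestors = tags(author.xpath("ancestor::*"))             # book and bookstore
after = tags(author.xpath("following-sibling::*"))        # siblings after <author>
before = tags(author.xpath("preceding-sibling::*"))       # siblings before <author>
parent = tags(author.xpath("parent::book"))               # parent, if it is a <book>
print(ancestors, after, before, parent)

# Full step syntax, axis::node-test[predicate], chained with "/".
# "child::" is the default axis, so this equals book/title[@lang='en']/text().
en_titles = root.xpath("child::book/child::title[attribute::lang='en']/text()")
print(en_titles)                                          # ['Harry Potter']
```

The abbreviations used in earlier sections map onto axes: "@lang" is short for attribute::lang, ".." for parent::node(), and "//" for /descendant-or-self::node()/.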

2. BeautifulSoup4

Beautiful Soup is an HTML/XML parser written in Python. It copes well with non-standard markup and builds a parse tree. It provides simple, common operations for navigating, searching and modifying the parse tree, and can save a lot of programming time.

Install: (sudo) pip install beautifulsoup4

Use:

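Import the Beautiful Soup library into your program:

from bs4 import BeautifulSoup    # Beautiful Soup 4 is imported from the bs4 package

# The older Beautiful Soup 3 imports (from BeautifulSoup import BeautifulSoup,
# BeautifulStoneSoup) do not apply to the beautifulsoup4 package installed above.

# Code example
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <p> and <body> tags are never closed.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

# Name the parser explicitly. 'lxml' (installed earlier) repairs the unclosed tags;
# other parsers may build a different tree from the same broken markup.
soup = BeautifulSoup(''.join(doc), 'lxml')
print(soup.prettify())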

Locating elements in the soup is simple. Using the example above (attribute names follow Beautiful Soup 4):

soup.contents[0].name
# 'html'

soup.contents[0].contents[0].name
# 'head'

head = soup.contents[0].contents[0]
head.parent.name
# 'html'

head.next_element
# <title>Page title</title>

head.next_sibling.name
# 'body'

head.next_sibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.next_sibling.contents[0].next_sibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

You can also use the soup to get a specific tag, or tags with a specific attribute, and modifying the soup is just as simple.
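Searching and modifying might look like the following minimal sketch. It assumes Beautiful Soup 4 with Python's built-in "html.parser" and, unlike the example above, uses properly closed markup so the result does not depend on how a parser repairs broken tags. The replacement title and align value are invented for illustration.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# Get a specific tag: attribute access returns the first tag with that name.
print(soup.title.string)                        # Page title

# Get all tags with a given name.
ids = [p["id"] for p in soup.find_all("p")]
print(ids)                                      # ['firstpara', 'secondpara']

# Get a tag with a specific attribute value.
second = soup.find("p", id="secondpara")
print(second.get_text())                        # This is paragraph two.
centered = soup.find("p", align="center")
print(centered["id"])                           # firstpara

# Modify the soup: change a tag's text and one of its attributes.
soup.title.string = "New title"
centered["align"] = "left"
print(soup.title)                               # <title>New title</title>
```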

Comparison of BS4 and lxml:

lxml    Implemented in C; traverses only the part of the document it needs; fast; but more complex, with a less friendly syntax.

BS4     Implemented in Python; loads the entire document; slower; but simple, with a user-friendly API.

3. Regular expressions (re)

Regular expressions are used to find or replace text that matches a pattern (rule). For filtering text or matching rules they are the most powerful tool, and an indispensable weapon for Python crawlers.

Basic matching rules:

[0-9]     any digit, equivalent to \d
[a-z]     any lowercase letter
[A-Z]     any uppercase letter
[^0-9]    any non-digit, equivalent to \D
\w        for ASCII text, equivalent to [a-zA-Z0-9_]: letters, digits and underscore
\W        the negation of \w
.         any character (except a newline, by default)
[]        matches any one of the characters inside
[^]       negates the character set
*         matches the preceding character or subexpression 0 or more times
+         matches the preceding character or subexpression at least 1 time
?         matches the preceding character or subexpression 0 or 1 time
^         matches the beginning of the string
$         matches the end of the string

Using regular expressions in Python

Python's re module

pattern: a compiled regular expression

Several important methods:

match: matches once, only at the beginning of the string;

search: matches once, scanning for the first match anywhere in the string;

findall: returns all matches;

split: splits the string wherever the pattern matches;

sub: replaces the matches;
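The five methods can be compared side by side. The following is a minimal sketch using the standard re module; the sample strings are made up for illustration.

```python
import re

text = "Order 66 shipped on 2022-01-31, order 7 pending"

# compile() turns a pattern string into a reusable pattern object.
digits = re.compile(r"\d+")

# match() only succeeds if the pattern matches at the very start of the string.
at_start = digits.match(text)
print(at_start)                                   # None, the string starts with "Order"
word = re.match(r"[A-Z][a-z]+", text).group()
print(word)                                       # Order

# search() scans forward and returns the first match anywhere.
first_number = digits.search(text).group()
print(first_number)                               # 66

# findall() returns every non-overlapping match as a list of strings.
numbers = digits.findall(text)
print(numbers)                                    # ['66', '2022', '01', '31', '7']

# split() cuts the string wherever the pattern matches.
parts = re.split(r"[,\s]+", "a, b,c  d")
print(parts)                                      # ['a', 'b', 'c', 'd']

# sub() replaces every match with the replacement string.
masked = re.sub(r"\d", "#", "tel: 555-1234")
print(masked)                                     # tel: ###-####
```

match() and search() return a match object or None, so check the result before calling group() on real input.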

Two modes to watch out for:

Greedy mode: (.*) matches as much text as possible

Lazy mode: (.*?) matches as little text as possible
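The difference shows up as soon as a pattern has a closing delimiter that occurs more than once. A minimal sketch with the standard re module and a made-up HTML fragment:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", html)
print(greedy)     # ['one</b> and <b>two']

# Lazy: .*? stops at the FIRST </b> it can, so each tag is matched separately.
lazy = re.findall(r"<b>(.*?)</b>", html)
print(lazy)       # ['one', 'two']
```

When scraping tag contents with regular expressions, the lazy form is usually the one that is wanted.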

Exercise 1. Use a regular expression to achieve the following.

Convert i=d%0A&from=AUTO&to=AUTO&smartresult=dict

into this form:

i:d%0A
from:AUTO
to:AUTO
smartresult:dict
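The article leaves this as an exercise; one possible solution, offered as a sketch rather than the author's answer, is to capture each key and value with findall and join them back together. The variable names are illustrative.

```python
import re

query = "i=d%0A&from=AUTO&to=AUTO&smartresult=dict"

# Key: one or more characters that are neither "=" nor "&".
# Value: everything up to the next "&" (possibly empty).
pairs = re.findall(r"([^=&]+)=([^&]*)", query)

# With two groups, findall returns a list of (key, value) tuples.
result = "\n".join(f"{key}:{value}" for key, value in pairs)
print(result)
```

This prints the four key:value lines shown above. For real query strings, urllib.parse.parse_qsl from the standard library does the same job and also decodes percent-escapes such as %0A.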

Summary: comparison of regular expressions, BS and lxml

Copyright notice
Author: zhulin1028. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311737494815.html
