
Python Crawler from Entry to Mastery (V): The Challenges of Dynamic Web Pages

2022-01-31 20:12:23 zhulin1028


Preface

Data on many websites, such as product prices and comments on e-commerce sites, are loaded dynamically, so a crawler may not be able to obtain that data the moment it first requests the page. How should we deal with this problem?

I. Usage scenarios of dynamic web pages

Let's take a look at the following example:

[Screenshot: a book detail page on JD.com]

This is a book detail page on JD.com (Jingdong). After opening a book's page, we find that the price, ranking, and review information are not loaded at the moment the page is first opened; they are fetched through a second request or through several asynchronous requests. Pages like this are dynamic pages.

Why websites use dynamic pages:

Scenarios where asynchronous refresh is desirable: some pages contain a lot of content, loading everything at once puts heavy pressure on the server, and many users never look at all of it anyway.

II. Back to basics: the original methods of sending request data to an HTTP server

1. The GET method

GET appends the parameter data to the URL as a query string of key-value pairs, where each key corresponds to one value; the parameters can be seen directly in the URL.

Some symbols and characters are not handled well in a browser URL, so a set of encoding rules is needed to convey the information reliably: the sender applies urlencode, and the receiver applies urldecode (a short sketch follows the two examples below).

www.baidu.com/s?ie=utf-8&…

Online testing tool: tool.chinaz.com/tools/urlen…

1. www.baidu.com/s?wd=DNS

   ?xxx=yyy&time=zzz is the telltale form of a GET request

2. acb.com/login?name=…

   login: name=zhangsan  password=123
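To make the GET/urlencode idea concrete, here is a minimal sketch using the requests library and urllib.parse (the Baidu search URL and the wd/ie parameters come from the example above; the User-Agent header is just an illustrative value):

    # Minimal GET sketch: requests URL-encodes the parameters for us.
    import requests
    from urllib.parse import urlencode

    params = {"wd": "DNS", "ie": "utf-8"}

    # What the query string looks like after urlencode:
    print(urlencode(params))              # wd=DNS&ie=utf-8

    resp = requests.get("https://www.baidu.com/s",
                        params=params,
                        headers={"User-Agent": "Mozilla/5.0"})
    print(resp.url)                       # full URL, with ?wd=DNS&ie=utf-8 visible
    print(resp.status_code)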

2. The POST method

Let's illustrate the use of the POST method with an example:

[Screenshot: the Youdao online translation page]

This is the Youdao translation page. A closer look reveals that no matter which word the user enters to translate, the page URL does not change. This is a typical case of asynchronous Ajax, with the data transmitted in JSON format.

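From the crawler's point of view, such an Ajax interaction is simply a POST request whose response body is JSON. Here is a minimal sketch; it posts to httpbin.org/post, a public echo service, since Youdao's real endpoint is not shown here, and the field names below are purely illustrative:

    # Minimal POST sketch: send form data, read a JSON response.
    import requests

    url = "https://httpbin.org/post"                     # echo service used as a stand-in
    data = {"word": "hello", "from": "en", "to": "zh"}   # illustrative field names

    resp = requests.post(url, data=data,
                         headers={"User-Agent": "Mozilla/5.0"})
    result = resp.json()                                 # the response body is JSON
    print(result["form"])                                # httpbin echoes the form data back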
III. Dynamic websites that are harder to deal with

1. Dealing with websites that require simulating many rounds of data interaction

We sometimes encounter large websites, such as Taobao, that care a great deal about data copyright. Their sites are maintained by large teams of engineers, and they may use many rounds of interactive packets to complete the exchange between the web server and the user's browser. If we stick to the traditional approach of analyzing the packets by hand, the job becomes far more complex and difficult. So, is there a once-and-for-all way to solve this kind of problem?

Our solution is: Selenium + PhantomJS.

What our crawler really does is simulate the behavior of a browser.

2. Selenium

Selenium is a web automation testing tool, originally developed for automated website testing. It is much like the "Button Wizard" (按键精灵) macro tool people use for games: Selenium can do something similar, but it does it inside the browser.

Installation: sudo pip install selenium (or simply pip install selenium)

In Python, run from selenium import webdriver to test whether it is installed.

Note: readers who want to do automated testing with Python are encouraged to study Selenium thoroughly.
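As a first taste, here is a minimal Selenium sketch that drives a real Firefox browser (it assumes Firefox and a matching geckodriver are installed and on the PATH):

    # Minimal Selenium sketch: open a page in Firefox and read its title.
    from selenium import webdriver

    driver = webdriver.Firefox()          # launches a real browser window
    driver.get("https://www.baidu.com")
    print(driver.title)                   # title of the fully rendered page
    driver.quit()                         # always close the browser when done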

3. PhantomJS and the browser

Note: in class we use the Firefox browser, which has a graphical interface, because that makes teaching easier.

PhantomJS is a WebKit-based headless browser: it loads a website into memory and executes the JavaScript on the page, but it has no graphical user interface, so it consumes fewer resources.

Installation: sudo apt install phantomjs (this route may not install everything; some features may be unusable)

The method for a complete installation under Ubuntu Linux (see blog.csdn.net/m0_38124502…) is roughly:

 wget bitbucket.org/ariya/phant…
 cd download
 tar -xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
 cd phantomjs-2.1.1-linux-x86_64/
 cd bin/
 sudo cp phantomjs /usr/bin

When it runs, Python starts up and launches the phantomjs browser process.

Testing: use the samples under SpiderCodes\Phantomjs.., for example helloworld.js and pageload.js, to verify the installation.

Note: PhantomJS may leak resources. To avoid this, you need a strategy to kill the phantomjs process at the right time, as shown in the sketch below.
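To make that point concrete, here is a minimal Selenium + PhantomJS sketch that always shuts the browser process down, even if the crawl fails halfway (it assumes an older Selenium release, before 4.x, in which webdriver.PhantomJS is still available, and that the phantomjs binary is on the PATH):

    # Minimal Selenium + PhantomJS sketch with guaranteed cleanup.
    from selenium import webdriver

    driver = webdriver.PhantomJS()        # starts a phantomjs process
    try:
        driver.get("https://www.baidu.com")
        print(driver.title)               # page title after the JS has executed
    finally:
        driver.quit()                     # kills the phantomjs process, avoiding a leak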

IV. Summary of capturing information from dynamic websites

Overall, our crawler should imitate, as closely as possible, the behavior of a real user visiting the website in a browser. Simulating the browser-server exchange directly with GET or POST requests costs less, but for complex sites, or sites that the server defends carefully, it is hard to fool the server. The Selenium + PhantomJS solution makes our program look much more like an ordinary user, but it is far less efficient and much slower. Crawling data at a large scale brings many new challenges of its own (for example, settings related to the scale of the crawl, wait times, and so on).

Copyright notice
Author: zhulin1028. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201312012225813.html
