Python crawler from entry to mastery (V): challenges of dynamic web pages
2022-01-31 20:12:23 【zhulin1028】
Preface
Data on many websites, such as product prices and comments on e-commerce sites, is loaded dynamically, so a crawler may not be able to obtain that data directly when it first requests the page. So how do we deal with this problem?
One 、 Usage scenarios of dynamic web pages
Let's take a look at the following example:
This is the page for a book on Jingdong (JD.com). After opening a book's page, we find that the price, ranking, and review information are not loaded together with the first response. They are fetched through a second request or through several asynchronous requests. Pages like this are dynamic pages.
When dynamic pages are used:
Scenarios that call for asynchronous refreshing. Some pages hold a lot of content; loading it all at once puts heavy pressure on the server, and some users will not look at all of it anyway.
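To make the idea concrete, here is a small offline sketch. The HTML skeleton, the JSON body, and the names INITIAL_HTML, ASYNC_RESPONSE, and PriceParser are all invented for illustration and do not come from any real site. It only shows why parsing the first response finds no price, while the later asynchronous response carries it.

```python
import json
from html.parser import HTMLParser

# Hypothetical first response: the page skeleton arrives with an empty price slot.
INITIAL_HTML = '<div class="book"><h1>Some Book</h1><span id="price"></span></div>'

# Hypothetical body of a later asynchronous response that carries the data.
ASYNC_RESPONSE = '{"price": "59.00", "rank": 12}'


class PriceParser(HTMLParser):
    """Collect the text inside the element whose id is "price"."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.price_text += data


parser = PriceParser()
parser.feed(INITIAL_HTML)

# The static HTML has nothing inside the price element yet.
price_from_html = parser.price_text

# The value only shows up in the data returned by the asynchronous request.
price_from_async = json.loads(ASYNC_RESPONSE)["price"]

print(repr(price_from_html), repr(price_from_async))
```

A crawler that stops after the first response sees only the empty slot, which is the problem the rest of this article deals with.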
Two 、 Back to basics: the original methods of sending request data to an HTTP server
1、GET Method
GET appends the parameter data to the URL as a query string. Each field is a Key and Value pair, and all of them are visible in the URL.
Some symbols and characters cannot be carried reliably in a browser URL, so an encoding scheme is needed to transmit them. The sender performs urlencode and the receiver performs urldecode.
Online testing tools : tool.chinaz.com/tools/urlen…
A query string such as ?xxx=yyy&time=zzz is the mark of a GET request.
Example, a login request: name=zhangsan password=123
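The urlencode and urldecode steps can be tried with Python's standard urllib.parse module. This is a minimal sketch: the host example.com and the /login path are placeholders, and the parameters are the login example above.

```python
from urllib.parse import urlencode, parse_qs, quote, unquote

# Sender side: turn Key/Value pairs into a query string.
params = {"name": "zhangsan", "password": "123"}
query = urlencode(params)

# example.com and /login are placeholders, not a real login endpoint.
url = "http://example.com/login?" + query

# Characters that a URL cannot carry directly are percent-encoded.
encoded = quote("张三")
decoded = unquote(encoded)

# Receiver side: decode the query string back into fields.
received = parse_qs(query)

print(url)
print(encoded, decoded)
print(received)
```

Note that parse_qs returns a list for each key, because the same key may appear more than once in a query string.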
2、 POST Method
An example of the POST method in use:
This is the Youdao translation page. A close look shows that each time the user enters a word to translate, the page URL does not change, because the data travels in the request body rather than in the URL. This is a typical asynchronous use of Ajax, with the data transferred in JSON format.
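As a rough sketch of what such a request looks like from Python, the snippet below only builds a POST request with the standard library and does not send it. The URL http://example.com/translate and the field name "q" are hypothetical; the real Youdao interface has its own address and parameters, which you would read from the browser's network panel.

```python
import json
from urllib.request import Request


def build_translate_request(word):
    """Build (but do not send) a POST request that carries the word in its body."""
    # Hypothetical payload layout: a single field named "q".
    payload = json.dumps({"q": word}).encode("utf-8")
    return Request(
        "http://example.com/translate",  # placeholder endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_translate_request("hello")

# The word is in the body, so the URL stays the same for every word.
print(req.get_method(), req.full_url)
print(req.data)
```

To actually send it, pass the request to urllib.request.urlopen against a real endpoint.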
Three 、 Dynamic websites that are harder to handle
1、 Dealing with websites that require simulating multiple data exchanges
We sometimes run into large websites such as Taobao that pay special attention to data copyright. These sites are maintained by a large number of engineers, and they may use several rounds of data packets to complete the interaction between the web server and the user's browser. If we still rely on the traditional method of analyzing the data packets, the work becomes more complex and more difficult. So is there a once-and-for-all way to solve this kind of problem?
Our solution is: Selenium + PhantomJS.
Our crawler is, in effect, simulating the behavior of a browser.
2、 Selenium
Selenium is a Web automated testing tool, originally developed for automated testing of websites. Just as key-macro tools automate input when we play games, Selenium can do something similar, but it does so in the browser.
Install: sudo pip install selenium (or pip install selenium)
In Python, run from selenium import webdriver to test whether it is installed.
Note: readers who want to do automated testing in Python can study Selenium in depth.
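The installation check and a first use of Selenium might look like the sketch below. The function names selenium_installed and fetch_rendered_html are made up for this example. It assumes Firefox and its geckodriver are available on the machine, and it drives Firefox in headless mode, since recent Selenium releases no longer include a PhantomJS driver.

```python
import importlib.util


def selenium_installed():
    """Return True if the selenium package can be imported."""
    return importlib.util.find_spec("selenium") is not None


def fetch_rendered_html(url):
    """Load url in headless Firefox and return the HTML after the page's JS has run."""
    # Imported here so that the rest of the script still works without selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser process


print("selenium installed:", selenium_installed())
```

Calling fetch_rendered_html with a page address returns the HTML as the browser sees it, including content filled in by asynchronous requests that have finished by then.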
3、 PhantomJS and browsers
Note: in class we use the Firefox browser, which has a graphical interface, to make teaching easier.
PhantomJS is a WebKit-based headless browser (a browser with no interface). It loads the website into memory and executes the JS on the page, but it has no graphical user interface, so it consumes fewer resources.
Install: sudo apt install phantomjs (this method may give an incomplete installation in which some functions cannot be used)
Complete installation method on Ubuntu Linux (see blog.csdn.net/m0_38124502…):
wget
cd download
tar -xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cd phantomjs-2.1.1-linux-x86_64/
cd bin/
sudo cp phantomjs /usr/bin
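After copying the binary, one way to confirm from Python that the phantomjs command is reachable is to look it up on the PATH. This is only a sketch, and the helper name find_executable is invented here.

```python
import shutil


def find_executable(name):
    """Return the full path of a command on the PATH, or None if it is missing."""
    return shutil.which(name)


phantomjs_path = find_executable("phantomjs")
if phantomjs_path is None:
    print("phantomjs was not found on the PATH")
else:
    print("phantomjs found at", phantomjs_path)
```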
Python starts the browser process phantomjs.
Test:
Use the scripts under SpiderCodes\Phantomjs.., for example helloworld.js and pageload.js, to test.
Note: phantomjs processes that are started and never closed may cause resource leaks. To avoid this, there needs to be a strategy that kills the phantomjs process at the right time.
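One possible strategy, sketched here with the standard library only, is to wrap the child process in a context manager so that it is always terminated when the crawling block ends, even if an exception is raised. The name managed_process is invented for this example, and a sleeping Python interpreter stands in for the phantomjs command so that the sketch runs on machines without PhantomJS.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_process(cmd, grace_seconds=5):
    """Start a child process and make sure it is gone when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:      # still running
            proc.terminate()         # ask it to stop first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()          # force it if it ignores the request
                proc.wait()


# Stand-in child: a Python interpreter that would sleep for a minute.
# In a crawler this would be the phantomjs command line instead.
sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]

with managed_process(sleeper) as child:
    pass  # the crawling work would happen here

print("child has exited:", child.poll() is not None)
```

When the browser is driven through Selenium instead, calling the driver's quit method in a finally block serves the same purpose.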
Four 、 Summary of dynamic website information capture
In general, our crawler should try to simulate the behavior of a real user visiting the website in a browser. Using GET or POST to simulate the communication between browser and server costs less, but it is hard to fool the server on complex websites or on websites with careful defenses. The Selenium + PhantomJS solution makes our program look more like an ordinary user, but it is much less efficient and much slower. Many new challenges may also come up when crawling data on a large scale (for example, configuring the page size and the waiting time).
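The waiting time mentioned above usually means polling until the dynamic content has appeared or a time limit is reached. Below is a minimal, library-independent sketch of that idea. The helper wait_until and the stand-in condition fake_page_ready are invented for illustration; Selenium offers its own WebDriverWait for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for "has the dynamic content loaded yet?": ready on the third check.
checks = {"count": 0}


def fake_page_ready():
    checks["count"] += 1
    return "loaded" if checks["count"] >= 3 else None


status = wait_until(fake_page_ready, timeout=2.0, interval=0.01)
print(status, "after", checks["count"], "checks")
```

A short interval makes the crawler react faster but checks more often; a long timeout tolerates slow pages at the cost of overall speed.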
copyright notice
author[zhulin1028],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201312012225813.html