Python crawler introductory case day08: Jitu network

2022-08-06 13:14:14 Self-taught Python crawler

Target Site

jituwang

Destination URL

http://jituwang.com/

Development environment

1. Windows 11
2. Python 3.7
3. PyCharm Community Edition 2021.2.1
4. dual-core browser
5. the browser's built-in developer tools

Site Analysis

Pulling the scroll bar down does not load any new pictures, so the site initially appears to be static. At the bottom of the page, the site uses pagination. Capturing packets with the browser's built-in developer tools shows that the pictures displayed on the page can be found in the page's own response packet, which confirms that the site is static.
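
The static-site check described above can be sketched as a small helper: if an image URL seen in the rendered page is already present in the raw HTML response, the page is most likely rendered on the server. This is only a minimal sketch; the function names and the sample markup are illustrative assumptions, not taken from the site.

```python
import re
from typing import List


def image_urls_in_html(html: str) -> List[str]:
    """Return the src value of every <img> tag found in raw HTML text."""
    return re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, flags=re.IGNORECASE)


def looks_static(html: str, rendered_image_url: str) -> bool:
    """True if an image URL seen in the browser is already in the raw HTML."""
    return rendered_image_url in image_urls_in_html(html)


# Hypothetical sample standing in for the response body seen in developer tools.
sample_html = '<div class="list"><img src="/pic/001.jpg" alt="demo"></div>'
print(looks_static(sample_html, "/pic/001.jpg"))
```

If the URL were missing from the raw HTML, the pictures would probably be loaded by a separate request, and that request would be the one to analyze instead.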

To reduce the risk of the crawler being blocked by anti-crawling measures, we first register an account on the site, then log in and capture the login data packet. Using the information in that packet, we perform a simulated login through a session.
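
A simulated login through a session might look roughly like the sketch below. It assumes the third-party `requests` package. The login path, the form field names and the `User-Agent` value are placeholders, not values from the site; the real ones have to be read from the captured login packet. The Ajax header is included on the assumption that the login is sent as an Ajax request.

```python
import requests  # third-party: pip install requests

# Assumption: hypothetical login path; take the real one from the captured packet.
LOGIN_URL = "http://jituwang.com/login"

# Header that marks a request as Ajax.
AJAX_HEADERS = {"X-Requested-With": "XMLHttpRequest"}


def build_session() -> requests.Session:
    """Create a session that keeps cookies between requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    return session


def login(session: requests.Session, username: str, password: str) -> requests.Response:
    """Post the login form; the session stores the cookies returned by the site."""
    # Assumption: hypothetical form field names.
    payload = {"username": username, "password": password}
    response = session.post(LOGIN_URL, data=payload, headers=AJAX_HEADERS, timeout=10)
    response.raise_for_status()
    return response


# Usage (credentials are read at run time instead of being hard-coded):
#   from getpass import getpass
#   session = build_session()
#   login(session, input("Account: "), getpass("Password: "))
```

After a successful login, later `session.get(...)` calls reuse the same cookies, so the page requests are made as the logged-in user.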

Data Analysis

We first persist the page response to a local file in HTML format, then open it in the dual-core browser and locate the image URLs and image names with the help of the Xpather browser plug-in. The analysis shows that the pictures are stored in two lists, latest updates and most viewed. There is also a third list, called the document list, but it contains no data.
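
The two steps above, saving the response locally and then extracting image URLs and names with XPath, can be sketched as follows. This assumes the third-party `lxml` package. The XPath expressions, the `class="list"` container and the sample markup are hypothetical, since the real page structure is not reproduced in the text.

```python
from pathlib import Path
from typing import List, Tuple

from lxml import etree  # third-party: pip install lxml


def save_html(raw: bytes, path: str) -> None:
    """Persist the raw response bytes locally so the page can be opened in a browser."""
    Path(path).write_bytes(raw)


def parse_images(raw: bytes, encoding: str = "gb2312") -> List[Tuple[str, str]]:
    """Return (name, url) pairs for the images in a listing block."""
    html_text = raw.decode(encoding, errors="replace")
    tree = etree.HTML(html_text)
    # Assumption: hypothetical expressions; adjust them to the real page.
    names = tree.xpath('//div[@class="list"]//img/@alt')
    urls = tree.xpath('//div[@class="list"]//img/@src')
    # zip() pairs the n-th name with the n-th URL.
    return list(zip(names, urls))


# Hypothetical sample encoded the way the site declares its pages (gb2312).
sample = (
    '<div class="list">'
    '<img src="/pic/001.jpg" alt="风景">'
    '<img src="/pic/002.jpg" alt="forest">'
    '</div>'
).encode("gb2312")

for name, url in parse_images(sample):
    print(name, url)
```

Writing the bytes unchanged keeps the file consistent with the charset declared in the page, so the browser displays the saved copy the same way as the live page.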

Source code

[image: source code]

Crawled pictures

[image: crawled pictures]

Summary

1. When using XPath to parse data, if nothing is parsed, first check whether the web page actually returned data, and only then check whether the XPath expression is wrong.
2. When you write a loop, use the yield keyword wherever possible.
3. Do not write private data such as account numbers and passwords directly in the code; receive it from outside at run time, for example with the input() function.
4. To simulate an Ajax request, be sure to add the header "X-Requested-With": "XMLHttpRequest".
5. Before converting the response data to a string, be sure to check the page encoding; in this case the page source declares charset=gb2312.
6. When using XPath to parse data, the zip() function is used in the subsequent data processing. It is very powerful and well worth learning.
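
Points 1 and 2 can be illustrated with a short sketch: a generator that yields page URLs one at a time, and a helper that applies the "check the response first, then the XPath" order of diagnosis. The URL pattern and the function names are hypothetical and only serve the illustration.

```python
from typing import Iterator, List


def page_urls(base: str, pages: int) -> Iterator[str]:
    """Yield one listing-page URL at a time instead of building a full list."""
    for page in range(1, pages + 1):
        # Assumption: hypothetical URL pattern; read the real one from the pager links.
        yield f"{base}/list_{page}.html"


def diagnose(raw: bytes, parsed: List[str]) -> str:
    """Explain an empty parse result in the order suggested in the summary."""
    if not raw.strip():
        return "empty response: check the request (URL, headers, login)"
    if not parsed:
        return "response has data but nothing matched: check the XPath expression"
    return "ok"


for url in page_urls("http://jituwang.com", 3):
    print(url)
```

Because `page_urls` is a generator, each URL is produced only when the loop asks for it, which keeps memory use flat even for a long run of pages.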

Copyright notice
Author: Self-taught Python crawler. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/218/202208061306256412.html
