You need to master these before learning Python crawlers

2022-02-01 03:06:27 Internet Lao Xin

This is day 17 of my participation in the November article-update challenge. Event details: 2021 Last Update Challenge.

Common protocols

HTTP and HTTPS

HTTP: the Hypertext Transfer Protocol, a method for publishing and receiving HTML pages. The default port is 80.

HTTPS: the encrypted version of HTTP, which adds an SSL layer beneath HTTP. The default port is 443.

For example, the official Meituan website is served over HTTPS, so the port is 443.
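As a small illustration of the default ports described above, the sketch below uses the standard library's `urllib.parse` to work out which port a URL would use. The `DEFAULT_PORTS` table and the `effective_port` helper are hypothetical names introduced for this example, and the URLs are sample values only; nothing is sent over the network.

```python
from urllib.parse import urlsplit

# Default ports from the text above: 80 for HTTP, 443 for HTTPS.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> int:
    """Return the explicit port in the URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://example.com/"))       # 80
print(effective_port("https://www.meituan.com/"))  # 443
print(effective_port("http://localhost:8000/"))    # 8000
```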

URL and URI
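A URL is the most common kind of URI: besides identifying a resource, it also says how to reach it. The sketch below shows how a URL breaks into its parts using the standard library's `urlsplit`. The URL itself is an illustrative value, not one taken from a real site.

```python
from urllib.parse import parse_qs, urlsplit

# Illustrative URL; it is only parsed, never requested.
url = "https://example.com/search?q=python&page=2#results"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # example.com
print(parts.path)      # /search
print(parts.query)     # q=python&page=2
print(parts.fragment)  # results

# The query string can be decoded into a dict of lists.
params = parse_qs(parts.query)
print(params)          # {'q': ['python'], 'page': ['2']}
```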

Common request methods

The HTTP protocol requires that a request method be chosen whenever the browser and the server exchange data. HTTP defines 8 request methods; the most common are GET and POST.

GET request: generally only fetches data from the server and has no effect on server resources.

When sending a request, pay attention to:

  • URL
  • Request method
  • Request headers
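The three items above map directly onto the arguments of `urllib.request.Request`. The sketch below only builds the request object and inspects it; the User-Agent string is a made-up example value, and nothing is sent over the network.

```python
from urllib.request import Request

# Example values for illustration only.
url = "https://movie.douban.com/"
headers = {"User-Agent": "Mozilla/5.0 (example crawler)"}

request = Request(url, headers=headers, method="GET")

print(request.full_url)        # the URL
print(request.get_method())    # the request method: GET
print(request.header_items())  # the request headers
```

Passing `request` to `urllib.request.urlopen` would actually send it.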

POST request: sends data to the server (for example, logging in), uploads files, and so on. When a request affects server resources, POST is used.

However, some websites have anti-crawler mechanisms and use POST requests even when you are only fetching information, so when writing a crawler, be sure to analyze the website first.
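A minimal sketch of what a POST request looks like with the standard library, assuming a form-style login. The endpoint and the form fields are hypothetical, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical login endpoint and form fields, for illustration only.
login_url = "https://example.com/login"
form = {"username": "alice", "password": "secret"}

# The form is URL-encoded and carried as bytes in the request body.
body = urlencode(form).encode("utf-8")
request = Request(login_url, data=body)

print(request.get_method())  # POST, because a body is attached
print(request.data)          # b'username=alice&password=secret'
```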

Common request header parameters

In the HTTP protocol, when a request is sent to the server, the data can be carried in three places:

  • In the URL
  • In the body (POST requests)
  • In the headers
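To make the three places concrete, the sketch below takes a hand-written HTTP/1.1 request and splits it into its request line (which contains the URL part), its headers, and its body. The `split_request` helper and the sample request are assumptions made for this example; real servers use far more careful parsers.

```python
def split_request(raw: bytes):
    """Split a raw HTTP/1.1 request into method, target, headers and body."""
    # A blank line separates the head (request line + headers) from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, headers, body


# Hypothetical request, written out by hand for illustration.
raw = (
    b"POST /login?next=%2Fhome HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 30\r\n"
    b"\r\n"
    b"username=alice&password=secret"
)

method, target, headers, body = split_request(raw)
print(method, target)   # data in the URL: the query string after "?"
print(headers["host"])  # data in the headers
print(body)             # data in the body
```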

The common request header parameters are:

  • user-agent: the name of the browser
  • referer: the URL from which the current request came
  • cookie: HTTP is stateless, meaning that when the same person sends two requests, the server has no way of knowing whether they came from the same person. Cookies are used to identify the user across requests.
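One reason these headers matter for crawlers: by default, Python's `urllib` announces itself as `Python-urllib/<version>` rather than as a browser. The sketch below shows that default and then replaces it on an opener. The browser-like strings and the cookie value are made-up examples, and no request is sent.

```python
import urllib.request

# A fresh opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# Replace the defaults with browser-like values (hypothetical strings).
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("Referer", "https://movie.douban.com/"),
    ("Cookie", "sessionid=abc123"),
]
custom_headers = dict(opener.addheaders)
print(custom_headers)
```

Every request made through `opener.open(...)` would then carry these three headers.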

Common response status codes

  • 200: the request succeeded and the server returned data normally
  • 301: permanent redirect
  • 404: the requested URL could not be found on the server
  • 418: the request ran into the server's anti-crawler mechanism, and the server refused to return data
  • 500: internal server error, possibly caused by a bug on the server
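A crawler usually branches on these codes. The sketch below looks up the standard reason phrases with `http.HTTPStatus` and pairs each code from the list with a possible reaction. The `ACTIONS` table and the `next_step` helper are hypothetical, and the reactions are suggestions rather than fixed rules.

```python
from http import HTTPStatus

# Hypothetical crawler reactions for the status codes listed above.
ACTIONS = {
    200: "parse the returned data",
    301: "follow the new location",
    404: "skip this URL",
    418: "slow down and review the request headers",
    500: "retry later",
}


def next_step(status_code: int) -> str:
    """Return a suggested reaction, or a fallback for unlisted codes."""
    return ACTIONS.get(status_code, "inspect manually")


for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", next_step(code))
```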

The HTTP request and response process

The browser sends a request to the server, and the server returns a response made up of a status code, headers, and a body.
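To see one full request and response cycle without touching a real website, the sketch below starts a tiny local server from the standard library, sends it a single GET request, and prints what comes back. The `HelloHandler` class, the `fetch_once` helper, and the page content are all assumptions made for this example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET with a small HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)  # status code
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()       # blank line between headers and body
        self.wfile.write(body)   # response body

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


def fetch_once():
    """Start a local server, send one GET request, return the response."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        # An empty ProxyHandler keeps the request local even if a proxy is configured.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as response:
            return response.status, response.headers["Content-Type"], response.read()
    finally:
        server.shutdown()
        server.server_close()


status, content_type, html = fetch_once()
print(status)        # 200
print(content_type)  # text/html; charset=utf-8
print(html)          # b'<html><body>hello</body></html>'
```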

Use your browser for website analysis

The website we want to analyze is movie.douban.com. The browser's developer tools provide the following panels:

  • Elements: used to analyze the structure of a web page

Everything shown on the page has a corresponding element in the Elements panel.

  • Console: messages printed by the page appear here, such as recruitment notices left by the site, warnings, and so on.

  • Sources
  • Network: all the requests generated while the page is displayed

Headers: the header information of each request

Session and cookie

Session: a session represents one conversation between the server and the browser. It is a server-side mechanism used to store the information needed for a specific user's session, and it is kept in memory, in a cache, or in a database.

Cookie: a cookie is generated by the server and sent to the client, where it is stored.

How cookies work: 1) create the cookie, 2) store the cookie, 3) send the cookie, 4) read the cookie.
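The four steps, together with the server-side session store, can be sketched with the standard library's `http.cookies`. This is a simplified model under stated assumptions: the `SESSIONS` dict stands in for the server's memory, and the function names and the `sessionid` cookie name are hypothetical.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side session store (in memory), keyed by session id


def server_login(username: str) -> str:
    """Step 1, server: create a session and the cookie that points to it."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"username": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    cookie["sessionid"]["path"] = "/"
    return cookie["sessionid"].OutputString()  # value of a Set-Cookie header


def client_store(set_cookie_value: str) -> str:
    """Steps 2 and 3, client: store the cookie, then build the Cookie header."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def server_read(cookie_header: str):
    """Step 4, server: read the cookie and look up the matching session."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("sessionid")
    return SESSIONS.get(morsel.value) if morsel else None


set_cookie = server_login("alice")
cookie_header = client_store(set_cookie)
print(set_cookie)                  # sessionid=<random hex>; Path=/
print(cookie_header)               # sessionid=<random hex>
print(server_read(cookie_header))  # {'username': 'alice'}
```

The cookie carries only the session id; the user's data stays in the server-side store.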

Summary: to learn Python crawlers, networking knowledge is essential.

copyright notice
Author: Internet Lao Xin. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010306262893.html
