
Python crawler from introduction to mastery (VI): forms and crawler login

2022-01-31 20:12:23 zhulin1028


  Preface

In the previous chapter, we introduced how the client and the server exchange data. We can interact with the server using the GET and POST methods. Sensitive data should only be sent in a POST request, so that it is not exposed in the URL. The server may also support other HTTP methods such as PUT and DELETE, but HTML forms themselves support only GET and POST.
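As a rough illustration of the difference, the sketch below builds (but does not send) one GET and one POST request with the standard library. The URL and the field names are made up for the example and do not come from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field names, used only for illustration.
LOGIN_URL = "http://example.com/login"
fields = {"username": "alice", "password": "s3cret"}

# GET: the form fields are appended to the URL as a query string,
# so they can end up in browser history and server logs.
get_request = Request(LOGIN_URL + "?" + urlencode(fields))

# POST: the same fields travel in the request body instead of the URL.
post_request = Request(LOGIN_URL, data=urlencode(fields).encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Passing `data` is what makes `urllib.request.Request` default to POST; without it the request is a GET.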

One 、 About forms

The client browser needs to interact with the web server, and the server needs to return the appropriate information based on the user's input. A form is how the browser collects that input and submits it.

See the w3school example:

www.w3school.com.cn/html/html_f…

For how GET and POST interact with the server, see section 5.2.

Let's focus on how to deal with the login form .

Two 、 Managing cookies

1、 Logging in with cookies

HTTP itself is a stateless protocol, so how can a website remember that a user has already visited or logged in?

Some mechanism on top of the basic HTTP protocol is needed to identify the user. This is why Session and Cookie exist.

What is a Cookie, and what is a Session?

Session tracking is a common technique in web applications for following a user's entire session. The two common session-tracking techniques are Cookie and Session. A Cookie determines the user's identity by recording information on the client side; a Session does so by recording information on the server side.

    The word "cookie" literally means a small biscuit. The mechanism was first developed by Netscape and later standardized. Cookies are now the norm, and all major browsers such as IE, Netscape, Firefox, and Opera support them. Because HTTP is a stateless protocol, the server cannot tell who the client is from a single network connection. So the server issues each client a pass, one per client, and whoever visits must bring their own pass. The server can then identify the client from the pass. This is how cookies work.

A cookie is really just a short piece of text. When the client sends a request and the server needs to record the user's state, the server issues a cookie to the client browser in its response. The browser saves the cookie. When the browser requests the site again, it submits the cookie to the server together with the requested URL. The server checks the cookie to identify the user's state. The server can also modify the cookie's content as needed.
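The round trip described above can be sketched with the standard library's `http.cookies` module. The header value below is a made-up example of what a server might send, not output captured from a real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value from a server response.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# The client parses the header and stores the cookie.
jar = SimpleCookie()
jar.load(set_cookie_header)

# On the next request to the same site, the client sends the cookie
# back in a Cookie header as name=value pairs.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print(cookie_header)
```

Attributes such as `Path` and `HttpOnly` stay on the client side; only the name and value are sent back to the server.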

Sometimes a crawler can only crawl a page's content after logging in, for example on Weibo, Zhihu, or Renren. Let's look at how to use cookies to log in.

For more details about cookies, see: www.w3cschool.cn/pegosu/skj8…

2、 Supplementary knowledge: using CookieJar

Cookies have expiry times, domain restrictions, encoding issues, and so on. Managing cookies by hand gets complicated quickly, especially when several cookies have to be tracked at once.

When a login page responds with a 302 redirect, the Response returned by urllib2 loses the Set-Cookie information, and the login fails.

We need a general-purpose cookie tool that automatically processes Set-Cookie headers, discards expired cookies, and sends each cookie only to the domain it belongs to. CookieJar was introduced to handle these problems.
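Here is a minimal sketch of the 302 problem and of how a CookieJar avoids it. It uses Python 3's `urllib.request` and `http.cookiejar` (the successors of urllib2 and cookielib) against a toy local server. The server, its `/login` and `/home` paths, and the cookie value are all invented for this demo.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: /login sets a cookie and redirects (302) to /home."""

    def do_GET(self):
        if self.path == "/login":
            self.send_response(302)
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
            self.send_header("Location", "/home")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            # /home reports whether the session cookie came back.
            ok = "sessionid=abc123" in self.headers.get("Cookie", "")
            body = b"logged in" if ok else b"anonymous"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_login_page(use_cookiejar):
    """Start the toy server, request /login, and return the final body."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/login"
    try:
        # An empty ProxyHandler keeps the local request away from any proxy.
        handlers = [urllib.request.ProxyHandler({})]
        if use_cookiejar:
            handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
        opener = urllib.request.build_opener(*handlers)
        with opener.open(url, timeout=5) as response:
            return response.read().decode()
    finally:
        server.shutdown()
        server.server_close()


print(fetch_login_page(use_cookiejar=False))
print(fetch_login_page(use_cookiejar=True))
```

Without the jar, the redirect is followed but the cookie set by the 302 response is dropped, so the toy server treats the client as anonymous. With `HTTPCookieProcessor`, the cookie is stored and sent along on the redirected request.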

Three 、 About the verification code (CAPTCHA)

A CAPTCHA is a defensive measure that websites use to guard against malicious abuse and attacks by automated programs. It is said that PayPal was the first company to introduce the technique; it is now widely used across the Internet.

    There are generally two ways to handle a CAPTCHA:

     1) When a CAPTCHA must be entered, the program displays the image and lets the user type in the answer;

     2) Image recognition is used to read the characters in the image.

Optical character recognition (OCR) is the process in which an electronic device (for example, a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and bright, and then uses a character recognition method to translate those shapes into computer text.
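The "dark and bright" detection step can be illustrated with a simple threshold in pure Python. This is only a sketch of the idea: the tiny image and the cutoff of 128 are assumptions made for the example, and a real OCR engine does far more than this.

```python
def binarize(gray_rows, threshold=128):
    """Map each grayscale pixel (0-255) to 1 for dark or 0 for bright."""
    return [
        [1 if pixel < threshold else 0 for pixel in row]
        for row in gray_rows
    ]


# A tiny hypothetical grayscale image containing one dark vertical stroke.
image = [
    [250, 30, 245],
    [248, 25, 250],
    [251, 20, 249],
    [247, 35, 252],
    [250, 28, 246],
]

# Print dark pixels as '#' and bright pixels as '.'.
for row in binarize(image):
    print("".join("#" if bit else "." for bit in row))
```

Once the image is reduced to dark and bright pixels, the remaining work is recognizing which character each dark shape represents.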

Ways for a program to handle complex CAPTCHAs:

  1. Use Tesseract, an open-source OCR project from Google;

Installing Tesseract on Ubuntu:

         sudo apt-get install tesseract-ocr

         pip install pytesseract

Training and testing: www.cnblogs.com/cnlian/p/57…

A simple Python test:

from PIL import Image
import pytesseract

# Load the image
image = Image.open('test1.jpg')

# Run OCR on the image and print the recognized text
text = pytesseract.image_to_string(image)
print(text)

2. Use an online service such as Baidu AI.


copyright notice
author[zhulin1028],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201312012214982.html
