Python crawler from introduction to mastery (VI): forms and crawler login
2022-01-31 20:12:23 【zhulin1028】
Preface
In the previous chapter, we covered how data is exchanged between the client and the server. We can use the GET and POST methods to interact with the server. Sensitive data should only be sent in a POST request, so that it is not exposed in the URL. Servers also support other HTTP methods, such as PUT and DELETE, but HTML forms do not support them.
I. About forms
The client browser needs to interact with the web server, and the server needs to return the appropriate information based on what the user entered.
See this example from w3school:
www.w3school.com.cn/html/html_f…
For how GET and POST interact with the server, see section 5.2.
Here we focus on how to handle login forms.
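As a minimal sketch of submitting a login form from a crawler, the snippet below builds a POST request with the Python 3 standard library. The URL and the field names `username` and `password` are hypothetical; a real form's field names have to be read from the `name` attributes of its `<input>` elements.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, username, password):
    """Build (but do not send) a POST request carrying the form fields."""
    # Hypothetical field names; copy the real ones from the form's HTML.
    form = {"username": username, "password": password}
    body = urlencode(form).encode("utf-8")
    # Supplying `data` makes urllib send the request as POST, so the
    # credentials travel in the request body rather than in the URL.
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_login_request("http://example.com/login", "alice", "s3cret")
print(request.get_method())  # POST
print(request.data)          # b'username=alice&password=s3cret'
```

Passing the request to `urllib.request.urlopen` (or to an opener) would actually send it; the sketch stops short of that so it runs without a network connection.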
II. Managing cookies
1. Logging in with cookies
The HTTP protocol itself is stateless, so how can a site remember that a user has visited or logged in?
Some mechanism on top of the HTTP protocol is needed to identify the user. That is why Session and Cookie exist.
What is a Cookie, and what is a Session?
Session tracking is a technique commonly used in web applications to follow a user's entire session. The common session-tracking techniques are Cookie and Session: a Cookie identifies the user by recording information on the client side, while a Session identifies the user by recording information on the server side.
"Cookie" literally means a small biscuit. The mechanism was first developed by Netscape and was later standardized by the IETF (currently RFC 6265). Cookies are now a standard, and all mainstream browsers, such as IE, Netscape, Firefox, and Opera, support them. Because HTTP is a stateless protocol, the server cannot tell who the client is from a single network connection. So the server issues each client a pass, one per client, and whoever visits must present their pass. This lets the server identify the client from the pass. That is how cookies work.
A cookie is really just a short piece of text. When the client sends a request and the server needs to record the user's state, the server issues a cookie to the client browser in the response (via the Set-Cookie header). The browser saves the cookie. When the browser requests the site again, it submits the requested URL together with the cookie (in the Cookie header). The server checks the cookie to identify the user's state. The server can also modify the cookie's contents as needed.
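The exchange described above can be sketched with the standard library's `http.cookies` module. The cookie name `sessionid` and its value are made up for illustration; the sketch only shows how a `Set-Cookie` value issued by the server turns into the `Cookie` header the client sends back.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Extract name/value pairs from a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    # Attributes such as Path or HttpOnly are stored on the morsel,
    # so only the actual cookie names appear as keys.
    return {name: morsel.value for name, morsel in cookie.items()}


def build_cookie_header(cookies):
    """Format stored cookies the way a client sends them back."""
    return "; ".join("%s=%s" % (name, value) for name, value in cookies.items())


# Hypothetical header from a server response.
stored = parse_set_cookie("sessionid=abc123; Path=/; HttpOnly")
print(stored)                       # {'sessionid': 'abc123'}
print(build_cookie_header(stored))  # sessionid=abc123
```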
Let's look at how to use cookies to log in. Sometimes a crawler can only crawl a page's content after logging in, for example on Weibo, Zhihu, or Renren.
For more details about cookies, see: www.w3cschool.cn/pegosu/skj8…
2. Supplementary knowledge: using CookieJar
Cookies have expiry times and domain restrictions, and there are encoding issues as well. Managing cookies yourself quickly becomes complicated, especially when several cookies have to be tracked at once.
When a login page responds with a 302 redirect, the urllib2 response loses the Set-Cookie information, so the login fails.
We need a general-purpose cookie tool that automatically handles Set-Cookie headers, automatically discards expired cookies, and automatically sends each cookie only to the domain it belongs to. To deal with these problems, we use CookieJar.
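The following is a sketch of CookieJar surviving a 302 redirect, written for Python 3, where Python 2's `cookielib` and `urllib2` became `http.cookiejar` and `urllib.request`. To stay runnable offline it starts a throwaway local server; the `/login` and `/home` paths, the `LoginHandler` class, and the `sessionid` cookie are all invented for this illustration and do not come from any real site.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a cookie and redirects with 302."""

    def do_GET(self):
        if self.path == "/login":
            self.send_response(302)
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
            self.send_header("Location", "/home")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            # Echo back whatever Cookie header the client sent.
            body = ("cookie seen: " + self.headers.get("Cookie", "")).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_cookiejar():
    """Log in through the redirect and return (page text, cookie names)."""
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),          # ignore proxy settings
            urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
        )
        url = "http://127.0.0.1:%d/login" % server.server_port
        with opener.open(url) as response:
            page = response.read().decode("utf-8")
        return page, [cookie.name for cookie in jar]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    page, names = fetch_with_cookiejar()
    print(page)
    print(names)
```

The opener records the cookie from the 302 response before following the redirect, so the request to `/home` already carries it. Reusing the same opener for later requests keeps the crawler logged in.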
III. About CAPTCHAs (verification codes)
A CAPTCHA is a defensive measure that websites use to prevent abuse and attacks by automated programs. It is said to have been introduced first by PayPal, and it is now widely used across the web.
There are generally two ways to handle a CAPTCHA:
1) When a CAPTCHA has to be entered, the program shows the image and lets the user type the answer;
2) Use image recognition to read the characters in the image.
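The first approach can be sketched roughly as follows. The helper name `solve_captcha_manually` and the file path are hypothetical, and the `ask` parameter is only there so the prompt can be swapped out; by default it is the built-in `input`.

```python
import os
import tempfile


def solve_captcha_manually(image_bytes, path, ask=input):
    """Save the CAPTCHA image and return what the user types."""
    with open(path, "wb") as f:
        f.write(image_bytes)
    prompt = "Open %s and type the characters you see: " % path
    return ask(prompt).strip()


if __name__ == "__main__":
    # Placeholder bytes; a real crawler would pass the downloaded image.
    demo_path = os.path.join(tempfile.gettempdir(), "captcha_demo.jpg")
    # A canned answer stands in for the user so the demo does not block.
    answer = solve_captcha_manually(
        b"fake image bytes", demo_path, ask=lambda prompt: " ab12 "
    )
    print(answer)  # ab12
```

The returned string would then be submitted in the login form alongside the other fields.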
Optical character recognition (OCR): an electronic device (for example, a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition to translate those shapes into computer text.
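To make "detecting patterns of dark and light" concrete, here is a toy sketch in plain Python that thresholds grayscale pixel values into a dark/light bitmap. The pixel data and the threshold of 128 are made up for illustration, and this is not how any particular OCR engine is implemented.

```python
def binarize(gray_rows, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (dark) or 0 (light)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


def render(bitmap):
    """Draw the bitmap as text so the dark pattern is visible."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in bitmap)


# A made-up 3x3 grayscale patch with a dark vertical stroke in the middle.
patch = [
    [250, 20, 245],
    [240, 35, 250],
    [255, 10, 230],
]
print(render(binarize(patch)))
# .#.
# .#.
# .#.
```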
Ways for a program to handle complex CAPTCHAs:
1. Use Tesseract, Google's open-source OCR project;
Installing Tesseract:
On Ubuntu:
sudo apt-get install tesseract-ocr
pip install pytesseract
Training and testing: www.cnblogs.com/cnlian/p/57…
A simple Python test:
from PIL import Image
import pytesseract
# Load the CAPTCHA image (test1.jpg must exist in the working directory)
image = Image.open('test1.jpg')
# Run OCR; this requires the tesseract binary installed above
text = pytesseract.image_to_string(image)
print(text)
2. Use an online recognition service such as Baidu AI.
Copyright notice
Author: zhulin1028. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201312012214982.html