A detailed guide to web crawling with Python + Fiddler 4 (the most complete on Station C)

2022-02-02 09:01:20 E iosers

Fiddler 4

1. Packet capture (session types shown in the list)
(1) <>: HTML
(2) JSON: JSON data; this may also be an API endpoint
(3) CSS: a CSS file
(4) JS: a JavaScript file
2. Stop capturing: File -> Capture, click to toggle
3. Click a request -> on the right, select Inspectors
(1) Upper right: the HTTP request
① Raw: the full details of the request, including the request headers;
② WebForms: the request parameters (query_string, formdata).
(2) Lower right: the HTTP response (the response is usually compressed, so decode it first by clicking the yellow bar)
① Click the yellow bar to decode;
② Raw: everything in the response;
③ Headers: the response headers;
④ JSON: what the interface returns (the response content).
(3) Lower left: the command box (QuickExec, for operating the program quickly)
① clear: clear all requests;
② select + …: quickly select the matching sessions;
③ ? + content (com/du…): quickly search for sessions that match the content.

Summary: Fiddler is a professional packet-capture tool. In Google Chrome's developer tools, loading a new page by default overwrites the requests captured from the previous page. Fiddler does not do this: it keeps the data captured from the pages you visited earlier.

[Figure: the Fiddler 4 interface]

Python urllib library

1. Purpose: a library that simulates a browser sending requests. It ships with Python (standard library).
2. Python 3: the functionality used here lives in two modules, urllib.request and urllib.parse
3. Related functions
(1) urllib.request
(2) urlopen(url): send a request and return a response object
(3) urlretrieve(url, image_path): download the resource at url to a local file
(4) urllib.parse
i: quote
ii: unquote
iii: urlencode
(5) response: the object returned by urlopen()
(6) read(): read the response body; the content is of type bytes
(7) geturl(): get the requested URL
(8) getheaders(): get the response headers (a list of tuples)
(9) getcode(): get the status code
(10) readlines(): read the body line by line; returns a list of bytes

Complete URL:

http://www.baidu.com:80/index.html?name=goudan&password=123#lala

http: protocol (scheme)
www.baidu.com: domain name
:80: port
/index.html: file path
name=goudan&password=123: GET parameters (query string)
#lala: anchor
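
The same breakdown can be checked in code. The following is a minimal sketch using `urlsplit` and `parse_qs` from the standard library `urllib.parse` module; it assumes the example URL is written with the path `index.html` and with no spaces around `=`.

```python
from urllib.parse import urlsplit, parse_qs

url = 'http://www.baidu.com:80/index.html?name=goudan&password=123#lala'
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # domain name
print(parts.port)      # port, as an int
print(parts.path)      # file path
print(parts.query)     # raw query string
print(parts.fragment)  # anchor, without the leading '#'

# parse_qs turns the query string into a dict of lists,
# because a parameter name may appear more than once
params = parse_qs(parts.query)
print(params)
```

Note that `parse_qs` returns every value as a list of strings, so `params['name']` is `['goudan']` rather than `'goudan'`.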

Sending a request with urlopen

import urllib.request

url = 'http://www.baidu.com'
response = urllib.request.urlopen(url=url)  # Send the request
# print(response)                     # response is an object
# print(response.read().decode())     # Get the content of the object as a string
# print(response.getheaders())        # Get the headers as a list of tuples
# print(dict(response.getheaders()))  # Show the headers as a dictionary
# print(response.getcode())           # Get the status code
# print(response.readlines())         # Read line by line; returns a list of bytes

# The content returned by read() is bytes (shown with a b prefix), so it has to be
# converted to a string before it can be written in text mode.
#   1. encode(): string -> bytes
#   2. decode(): bytes -> string
# With no argument, both default to utf-8; pass another encoding such as 'gbk' if the
# page uses it. Check the encoding of the page before decoding.
content = response.read()  # The body can only be read once, so keep it in a variable

with open('baidu.html', 'w', encoding='utf-8') as fp:
    fp.write(content.decode())
# The content is now saved in baidu.html. Open the file in a browser to see the Baidu home page.

with open('baidu1.html', 'wb') as fp:  # 'wb' is binary write mode, so the bytes are written as they are
    fp.write(content)
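
The bytes/string conversion described in the comments above can be tried without a network connection. The following is a minimal sketch; `io.BytesIO` is used here only as a stand-in for the response object (an assumption for illustration, since both are file-like objects that return bytes from `read()`), and the sample text is arbitrary.

```python
import io

text = '百度一下'                # a str containing Chinese characters
utf8_bytes = text.encode()       # str -> bytes, utf-8 by default
gbk_bytes = text.encode('gbk')   # str -> bytes, using gbk instead

# decode() has to use the same encoding that produced the bytes
print(utf8_bytes.decode() == text)
print(gbk_bytes.decode('gbk') == text)

# Stand-in for a response: a file-like object holding the encoded page
fake_response = io.BytesIO(utf8_bytes)
first = fake_response.read()     # returns the whole body as bytes
second = fake_response.read()    # the stream is already exhausted

print(first.decode())
print(second)
```

The second `read()` returns empty bytes. A real response behaves the same way, which is why the body should be read once and kept in a variable if it is needed more than once.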

urllib.request and urllib.parse: building requests

First, copy the address of the image.

import urllib.request

image_url = 'https://pics7.baidu.com/feed/caef76094b36acafca8bf2e6c1a8091601e99c3d.jpeg?token=13591fc3a05c20a49617777abbaabf53'
response = urllib.request.urlopen(image_url)
# An image can only be written to a local file in binary mode
with open('qing.jpg', 'wb') as fp:
    fp.write(response.read())

With urlopen(url), the image can be saved locally in this way.

The second method:

import urllib.request

image_url = 'https://pics7.baidu.com/feed/caef76094b36acafca8bf2e6c1a8091601e99c3d.jpeg?token=13591fc3a05c20a49617777abbaabf53'
urllib.request.urlretrieve(image_url, 'chun.jpg')

urlretrieve(url, image_path) downloads the resource and writes it directly to the given file.
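
The two methods can be wrapped in small helpers and compared side by side. The following is a sketch under stated assumptions: the function names `save_with_urlopen` and `save_with_urlretrieve` are hypothetical, and a `data:` URL with a dummy payload stands in for a real image address so that the example runs without network access. Files are written to a temporary directory.

```python
import base64
import os
import tempfile
import urllib.request

def save_with_urlopen(url, path):
    """Fetch url with urlopen() and write the raw bytes to path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, 'wb') as fp:   # binary mode: bytes are written unchanged
        fp.write(data)
    return len(data)

def save_with_urlretrieve(url, path):
    """Let urlretrieve() fetch url and write it to path in one call."""
    filename, headers = urllib.request.urlretrieve(url, path)
    return filename

# Stand-in payload (not a real image) embedded in a data: URL
payload = b'not really a jpeg'
demo_url = 'data:application/octet-stream;base64,' + base64.b64encode(payload).decode()

out_dir = tempfile.mkdtemp()
path_a = os.path.join(out_dir, 'qing.jpg')
path_b = os.path.join(out_dir, 'chun.jpg')

size = save_with_urlopen(demo_url, path_a)
save_with_urlretrieve(demo_url, path_b)
print(size, os.path.getsize(path_b))
```

With a real image, replace `demo_url` with the copied image address. Both helpers produce the same file; `urlretrieve` is simply shorter when no processing of the bytes is needed.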

urllib.parse

i: quote
ii: unquote
iii: urlencode
import urllib.parse
image_url = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Finews.gtimg.com%2Fnewsapp_match%2F0%2F11638167724%2F0.jpg&refer=http%3A%2F%2Finews.gtimg.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1638521018&t=9ae6d69f2f0ed50aba91c06a1685e596'

Using the encoding function parse.quote() and the decoding function parse.unquote()

A URL can only be made up of specific characters: letters, digits, underscores and a few other symbols.
If it contains anything else, such as $, spaces or Chinese characters, it is not a legally encoded URL and those characters have to be encoded. Use parse.quote for this.

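url = 'http://www.baidu.com/index.html?name=狗蛋&pwd=12345'  # Contains Chinese, so it is not a legally encoded URL
# By default quote() also escapes ':', '?', '=' and '&'; list them in safe to keep the URL structure
ret = urllib.parse.quote(url, safe=':/?=&')  # Encoding function
decoded = urllib.parse.unquote(ret)  # Decoding function, applied to the encoded string
print(ret)
print(decoded)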

You can also use an online URL encoding/decoding tool for this (a Baidu search will find one).
quote: the URL encoding function; converts Chinese characters into %XX form.
unquote: the URL decoding function; converts %XX back into the original characters.
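
The behaviour of the two functions can be seen in a short example. This is a minimal sketch; the name value `狗蛋` and the list of characters passed to `safe` are choices made for illustration.

```python
from urllib.parse import quote, unquote

name = '狗蛋'
encoded_name = quote(name)      # each UTF-8 byte becomes %XX
print(encoded_name)             # %E7%8B%97%E8%9B%8B
print(unquote(encoded_name))    # 狗蛋

url = 'http://www.baidu.com/index.html?name=狗蛋&pwd=12345'

# With the default safe='/', the delimiters ':', '?', '=' and '&' are escaped as well,
# so the result is no longer usable as a URL
print(quote(url))

# Listing the delimiters in safe leaves the URL structure intact
encoded_url = quote(url, safe=':/?=&')
print(encoded_url)
```

A common alternative is to apply `quote` only to the individual parameter values and then assemble the URL, which avoids having to list the delimiters.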

parse.urlencode

import urllib.parse

url = 'http://baidu.com/index.html'  # Parameters are required when sending a request to this URL
# The parameters to add are name, age, sex and height, so they have to be spliced onto the URL
name = 'goudan'
age = 18
sex = 'boy'
height = 180

# Target: 'http://baidu.com/index.html?name=goudan&age=18&sex=boy&height=180'
# Build it by splicing the contents of a dictionary
data = {
    'name': name,
    'age': age,
    'sex': sex,
    'height': height,
}
# Traverse the dictionary
it = []
for k, v in data.items():
    it.append(k + '=' + str(v))
query_string = '&'.join(it)
url = url + '?' + query_string
# However, urllib already provides a function for this, thanks to the Python library developers

query_string = urllib.parse.urlencode(data)

urlencode() takes a dictionary and returns the same query_string that was built by hand above. It also encodes illegal characters, so when using urlencode() there is no need to deal with them yourself.
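
The difference between manual splicing and `urlencode()` shows up as soon as a value contains characters that are not legal in a URL. The following is a minimal sketch; the dictionary values and the `base_url` name are chosen for illustration.

```python
from urllib.parse import urlencode

data = {'name': '狗蛋', 'age': 18, 'sex': 'boy', 'height': 180}

# Manual splicing: the Chinese characters are left as they are
manual = '&'.join(k + '=' + str(v) for k, v in data.items())
print(manual)

# urlencode: same structure, but values are percent-encoded
encoded = urlencode(data)
print(encoded)

# Spaces are encoded as '+' in the query string
print(urlencode({'q': 'a b'}))

base_url = 'http://baidu.com/index.html'
full_url = base_url + '?' + encoded
print(full_url)
```

Non-string values such as `18` are converted to strings automatically, so the dictionary can hold numbers directly.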

copyright notice
author[E iosers],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202020901168820.html
