Python crawler programming analysis and cases Chapter 1 mitmproxy + Python crawler programming

2022-02-02 10:55:04 cwh5920

Chapter 1: Using mitmproxy + Python as an intercepting proxy
What is mitmproxy?
Installation
Running
Operation
Scripts
Events

  1. HTTP lifecycle
  2. TCP lifecycle
  3. WebSocket lifecycle
  4. Network connection lifecycle
  5. General lifecycle

Example
Summary

What is mitmproxy?
As the name suggests, mitmproxy is a proxy for MITM, that is, for man-in-the-middle attacks. A proxy used for a man-in-the-middle attack first forwards requests like an ordinary proxy, so that the client and the server can still communicate. Beyond that, it inspects and records the data it intercepts, or tampers with that data to trigger specific behavior on the server or the client.

Unlike packet-capture tools such as Fiddler or Wireshark, mitmproxy not only intercepts requests so that developers can view and analyze them, it can also be extended through custom scripts. For example, Fiddler can filter the browser's requests to a particular URL and let you inspect and analyze the data, but it does not easily accommodate highly customized requirements such as: "Intercept the browser's requests to this URL, blank out the returned content, save the real response to a database, and send an email notification when something goes wrong." With mitmproxy, requirements like this are easy to implement by loading a custom Python script.

That said, mitmproxy cannot really be used for man-in-the-middle attacks on unsuspecting users. mitmproxy works at the HTTP layer, and the widespread adoption of HTTPS means clients can detect and refuse man-in-the-middle attacks. For mitmproxy to work properly, the client (an app or a browser) must explicitly trust mitmproxy's SSL certificate or ignore certificate errors, which means the app or browser belongs to the developer. Clearly, this is development or testing work, not cybercrime.

What practical use does such a tool have? As far as I know, it is currently widely used for emulation-based crawlers, that is, crawlers that use a phone emulator or a headless browser to scrape data from apps or websites. Acting as the proxy, mitmproxy can intercept and store the data the crawler obtains, or modify the data to adjust the crawler's behavior.

In fact, everything above describes mitmproxy working in forward proxy mode. By adjusting the configuration, mitmproxy can also act as a transparent proxy, reverse proxy, upstream proxy, SOCKS proxy, and so on. These modes seem to be less commonly used with mitmproxy, so this article discusses only the forward proxy mode.
Installation
The phrase "install mitmproxy" is ambiguous. It can mean "install the mitmproxy tool" or "install the Python mitmproxy package". Note that the latter includes the former.

If you only want mitmproxy as a replacement for Fiddler and have no need for customization, installing the mitmproxy tool is enough. Download an installer from the mitmproxy official website and it works out of the box, with no need to set up a Python development environment beforehand. That is obviously not what this article is about, though. What we need is to install the Python mitmproxy package.

Installing the Python mitmproxy package gives you the mitmproxy tool plus the package dependencies needed to develop custom scripts. The installation process is not complicated.

First install Python, version 3.6 or later, together with its bundled package manager pip. How to install Python 3 differs between operating systems; see the Python download page. The details are not covered here, and from this point on it is assumed that you have such an environment ready.
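A quick way to confirm that the interpreter meets this requirement is to ask Python itself. The sketch below is only an illustration: the helper name meets_minimum is made up for this example, and the (3, 6) floor is simply the one stated above.

```python
import sys

# Minimum version stated in this article for the mitmproxy package.
MINIMUM = (3, 6)


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


print("Python %d.%d OK: %s" % (sys.version_info[0], sys.version_info[1], meets_minimum()))
```

If this prints False, upgrade Python before running pip.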

Start the installation.

On Linux:

sudo pip3 install mitmproxy

On Windows, run cmd or PowerShell as administrator:

pip3 install mitmproxy

Installation complete.

Once installation finishes, the system has three commands: mitmproxy, mitmdump, and mitmweb. Because the mitmproxy command cannot run on Windows (this is not a problem, so don't worry), we use mitmdump to test whether the installation succeeded. Run:

mitmdump --version

You should see output like this:

Mitmproxy: 4.0.1
Python: 3.6.5
OpenSSL: OpenSSL 1.1.0h 27 Mar 2018
Platform: Windows-10-10.0.16299-SP0
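If you want to run this check from a setup script instead of reading the output by eye, something like the following may help. It is a sketch under assumptions: both function names are invented for this example, and the parser only assumes the "Key: value" layout shown in the sample output above.

```python
import subprocess


def parse_version_output(text):
    """Turn 'Key: value' lines into a dict, skipping lines without a colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def mitmdump_version():
    """Run `mitmdump --version` and parse its output (needs mitmdump on PATH)."""
    result = subprocess.run(
        ["mitmdump", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return parse_version_output(result.stdout)


# The sample output from this article, used here so the parser can be tried offline.
sample = """Mitmproxy: 4.0.1
Python: 3.6.5
OpenSSL: OpenSSL 1.1.0h 27 Mar 2018
Platform: Windows-10-10.0.16299-SP0"""
print(parse_version_output(sample)["Mitmproxy"])  # prints 4.0.1
```

Calling mitmdump_version() raises an exception if the command is missing or exits with an error, which is itself a useful signal that the installation did not succeed.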
Running
To start mitmproxy, use any one of the three commands mitmproxy, mitmdump, or mitmweb. All three provide the same functionality and can load custom scripts. The only difference is the interface.

The mitmproxy command provides a command-line interface once started. You can watch requests in real time, filter them with commands, and inspect request data.
The mitmweb command provides a web interface once started. You can watch requests in real time, and filter requests and inspect request data through the GUI.
The mitmdump command, as you may have guessed, has no interface. The program runs silently, so mitmdump offers no interactive way to filter requests or inspect data. It can only work quietly in combination with custom scripts.

The mitmproxy command is somewhat complicated to interact with and does not support Windows, and our main use is loading custom scripts, which needs no interaction. In principle mitmdump alone would be enough. Still, an interactive interface makes troubleshooting easier, so this article uses the mitmweb command as its example. In practice, choose whichever command suits the situation.

Start mitmproxy:

mitmweb

You should see the following output:

Web server listening at http://127.0.0.1:8081/
Proxy server listening at http://*:8080
mitmproxy binds *:8080 as the proxy port and serves its web interface at 127.0.0.1:8081.
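Since the end goal is a crawler, it is worth noting that a script can use this proxy port just as a browser does. Below is a minimal standard-library sketch, assuming the proxy is listening on 127.0.0.1:8080 as reported above. The function names are made up for this example, and certificate verification is switched off for local testing only.

```python
import ssl
import urllib.request

PROXY = "127.0.0.1:8080"  # the proxy port reported by mitmweb above


def build_proxy_opener(proxy=PROXY, verify_tls=False):
    """Build a urllib opener that routes HTTP and HTTPS traffic through the proxy."""
    proxy_handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    context = ssl.create_default_context()
    if not verify_tls:
        # Testing only: accept mitmproxy's certificate without trusting its CA.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


def fetch(url, opener=None):
    """Fetch a URL through the proxy; mitmproxy must be running for this to work."""
    opener = opener or build_proxy_opener()
    with opener.open(url, timeout=10) as response:
        return response.status, response.read()
```

With mitmweb running, a call such as fetch("http://example.com/") should make the request appear in the web interface. The URL here is only a placeholder.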

Now you can test the proxy by having Chrome use mitmproxy as its proxy and ignore certificate errors. To avoid affecting normal use, we do not change Chrome's configuration. Instead, we start Chrome from the command line with parameters. If you use a different browser, you can look up its equivalent startup parameters; there should be no pitfalls. The example covers only Windows, since developers on Linux or macOS are likely more familiar with the command line and should be able to work out the equivalent steps for their own environment.

Because Chrome is about to be put through its paces, we make do with Edge or another browser to open 127.0.0.1:8081, so that we can keep interacting with mitmproxy through the web interface. As an aside, I use Edge because there is no other browser on this machine (IE does not count). Edge has a default setting that blocks access to loopback addresses; see the solution.

Next, close all Chrome windows; otherwise the extra command-line parameters will have no effect. Open cmd and run:

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --proxy-server=127.0.0.1:8080 --ignore-certificate-errors

The long quoted string at the front is Chrome's installation path and should be adjusted to match your system. The two parameters that follow set the proxy address and force certificate errors to be ignored. Open a website in Chrome, and the requests appear at the same time in the mitmweb interface open in Edge.
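When the crawler is driven from Python, the same Chrome launch can be scripted. The following is only a sketch: the function names are invented here, and the default path is the one used in this article's example, which may not match your machine.

```python
import subprocess

# Path used in this article's example; adjust it to your own installation.
CHROME_PATH = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"


def chrome_command(chrome_path=CHROME_PATH, proxy="127.0.0.1:8080"):
    """Return the argument list that starts Chrome behind the proxy."""
    return [
        chrome_path,
        "--proxy-server=" + proxy,
        "--ignore-certificate-errors",
    ]


def launch_chrome(chrome_path=CHROME_PATH, proxy="127.0.0.1:8080"):
    """Start Chrome without waiting for it to exit."""
    return subprocess.Popen(chrome_command(chrome_path, proxy))
```

Passing the arguments as a list avoids the quoting problems that the space in "Program Files" causes on the command line.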
Operation
Keyboard shortcuts in the mitmproxy interface

Key      Description
q        Quit (works like a back key, going up one level at a time)
d        Delete the request the cursor (yellow arrow) points to
D        Restore the request just deleted
G        Jump to the latest request
g        Jump to the first request
C        Clear the console (uppercase C)
i        Enter the file or domain name to intercept (commas must be escaped with \; for example: feezu.cn)
a        Release the intercepted request
A        Release all intercepted requests
?        Show help for the current view
^ v      Move the cursor with the up and down arrow keys
enter    View the details of the entry under the cursor
tab      Switch between the Request and Response details
/        Search the contents of the body
esc      Exit edit mode
e        Enter edit mode

Author: healthbird
Link: https://www.jianshu.com/p/0cc558a8d6a2
Source: Jianshu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.
Scripts
With the work above done, we have the basic ability to operate mitmproxy. Next comes developing custom scripts, which is where mitmproxy's real power lies.

Scripts must follow one of two patterns prescribed by mitmproxy. Pick either one.

The first pattern is to write a .py file for mitmproxy to load. The file defines several functions, each implementing one of the events mitmproxy provides, and mitmproxy calls the corresponding function when that event occurs. For example:

import mitmproxy.http
from mitmproxy import ctx

num = 0


def request(flow: mitmproxy.http.HTTPFlow):
    # Count every HTTP request that passes through the proxy.
    global num
    num = num + 1
    ctx.log.info("We've seen %d flows" % num)

The second pattern is to write a .py file for mitmproxy to load. The file defines a variable named addons, which is a list. Each element is a class instance, and these classes have several methods, each implementing one of the events mitmproxy provides. mitmproxy calls the corresponding method when that event occurs. Each such class is called an addon. For example, an addon named Counter:

import mitmproxy.http
from mitmproxy import ctx


class Counter:
    def __init__(self):
        self.num = 0

    def request(self, flow: mitmproxy.http.HTTPFlow):
        # Count every HTTP request that passes through the proxy.
        self.num = self.num + 1
        ctx.log.info("We've seen %d flows" % self.num)


addons = [
    Counter()
]

The second pattern is strongly recommended. It is more convenient to use and easier to manage and extend, and it is also how some of the official built-in addons are implemented.
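To make the idea of "mitmproxy calls the corresponding method when an event occurs" concrete, here is a deliberately simplified model of that dispatch in plain Python. It is not mitmproxy's actual implementation: the trigger function and the Silent class are invented for illustration, and this Counter leaves out the logging call so that it runs without mitmproxy installed.

```python
class Counter:
    def __init__(self):
        self.num = 0

    def request(self, flow):
        self.num = self.num + 1


class Silent:
    """An addon that implements no events at all."""


def trigger(addons, event, *args):
    """Call `event` on every addon that defines a method with that name."""
    handled = 0
    for addon in addons:
        handler = getattr(addon, event, None)
        if callable(handler):
            handler(*args)
            handled += 1
    return handled


addons = [Counter(), Silent()]
for _ in range(3):
    trigger(addons, "request", object())  # object() stands in for a flow
print(addons[0].num)  # prints 3
```

The point of the model is that an addon only needs to define methods for the events it cares about. Events it does not implement are simply skipped.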

Save the sample code for the second pattern as addons.py, then restart mitmproxy:

mitmweb -s addons.py

When the browser accesses a site through the proxy, you should see logs like this in the console:

Web server listening at http://127.0.0.1:8081/
Loading script addons.py
Proxy server listening at http://*:8080
We've seen 1 flows
……
……
We've seen 2 flows
……
We've seen 3 flows
……
We've seen 4 flows
……
……
We've seen 5 flows
……

This shows that the custom script is in effect.

Events
The script above hardly needs explaining: whenever a request occurs, it increments the counter and prints a log line. That corresponds to the request event. So how many events are there in total? Not many, but not few either. They are described in detail below.

Events fall into 5 categories according to lifecycle. "Lifecycle" here refers to the level at which an event is viewed. For example, the same web request can be understood as the process "HTTP request -> HTTP response", or as the process "TCP connect -> TCP communication -> TCP disconnect". So if I want to reject requests from a client with a particular IP, I should register the function for the tcp_start event of the TCP lifecycle. If I want to block requests to a particular domain name, I should register it for the http_connect event of the HTTP lifecycle. Other cases work the same way.

The next part is long and tedious. If you lack the patience to read it all, at least glance at the HTTP lifecycle events and then jump to the example.

1. HTTP lifecycle

def http_connect(self, flow: mitmproxy.http.HTTPFlow):
Called when an HTTP CONNECT request is received from the client. Setting a non-2xx response on the flow returns that response and closes the connection. CONNECT is an uncommon HTTP request method whose purpose is to establish a proxy connection to the server. It is only communication between the client and the proxy, so a CONNECT request does not trigger the regular HTTP events such as request and response.

def requestheaders(self, flow: mitmproxy.http.HTTPFlow):
Called when the headers of an HTTP request from the client have been read successfully. At this point the body of the request in the flow is empty.

def request(self, flow: mitmproxy.http.HTTPFlow):
Called when an HTTP request from the client has been read completely.

def responseheaders(self, flow: mitmproxy.http.HTTPFlow):
Called when the headers of an HTTP response from the server have been read successfully. At this point the body of the response in the flow is empty.

def response(self, flow: mitmproxy.http.HTTPFlow):
Called when an HTTP response from the server has been read completely.

def error(self, flow: mitmproxy.http.HTTPFlow):
Called when an HTTP error has occurred, for example an invalid server response or a dropped connection. Note that this is not the same as a valid HTTP error response, which is a correct server response that simply carries an error status code.

(OK, you can jump to the example now.)

2. TCP lifecycle

def tcp_start(self, flow: mitmproxy.tcp.TCPFlow):
Called when a TCP connection has been established.

def tcp_message(self, flow: mitmproxy.tcp.TCPFlow):
Called when the TCP connection receives a message. The most recent message is stored in flow.messages[-1]. Messages can be modified.

def tcp_error(self, flow: mitmproxy.tcp.TCPFlow):
Called when a TCP error has occurred.

def tcp_end(self, flow: mitmproxy.tcp.TCPFlow):
Called when the TCP connection is closed.

3. WebSocket lifecycle

def websocket_handshake(self, flow: mitmproxy.http.HTTPFlow):
Called when the client tries to establish a WebSocket connection. The handshake can be altered by manipulating the WebSocket-specific HTTP headers. The request attribute of the flow is guaranteed to be non-empty.

def websocket_start(self, flow: mitmproxy.websocket.WebSocketFlow):
Called when a WebSocket connection has been established.

def websocket_message(self, flow: mitmproxy.websocket.WebSocketFlow):
Called when a WebSocket message is received from the client or the server. The most recent message is stored in flow.messages[-1]. Messages can be modified. There are currently two message types, corresponding to BINARY frames and TEXT frames.

def websocket_error(self, flow: mitmproxy.websocket.WebSocketFlow):
Called when a WebSocket error has occurred.

def websocket_end(self, flow: mitmproxy.websocket.WebSocketFlow):
Called when the WebSocket connection is closed.

4. Network connection lifecycle

def clientconnect(self, layer: mitmproxy.proxy.protocol.Layer):
Called when a client connects to mitmproxy. Note that one connection may correspond to multiple HTTP requests.

def clientdisconnect(self, layer: mitmproxy.proxy.protocol.Layer):
Called when a client disconnects from mitmproxy.

def serverconnect(self, conn: mitmproxy.connections.ServerConnection):
Called when mitmproxy connects to a server. Note that one connection may correspond to multiple HTTP requests.

def serverdisconnect(self, conn: mitmproxy.connections.ServerConnection):
Called when mitmproxy disconnects from a server.

def next_layer(self, layer: mitmproxy.proxy.protocol.Layer):
Called when the network layer is being switched. You can return a new layer object to change which layer will be used. See the definition of layer.

5. General lifecycle

def configure(self, updated: typing.Set[str]):
Called when the configuration changes. The updated parameter is a set-like object containing all the options that changed. This event also fires when mitmproxy starts, in which case updated contains all options.

def done(self):
Called when the addon is shut down or removed, or when mitmproxy itself shuts down. Because this event fires after the event loop has terminated, it is the last event an addon sees. The log is also shut down by this time, so calls to log functions produce no output.

def load(self, entry: mitmproxy.addonmanager.Loader):
Called when the addon is first loaded. The entry parameter is a Loader object, which contains methods for adding options and commands. This is where the addon configures itself.

def log(self, entry: mitmproxy.log.LogEntry):
Called when a new log entry is created through mitmproxy.ctx.log. Be careful not to log from inside this event, or it will cause an infinite loop.

def running(self):
Called when mitmproxy has fully started and is running. At this point mitmproxy has bound its ports and all addons have been loaded.

def update(self, flows: typing.Sequence[mitmproxy.flow.Flow]):
Called when one or more flow objects have been modified, usually by a different addon.

Example
You are probably dizzy after reading about so many events. That is normal; nobody could remember them all. In fact, given how mitmproxy is actually used, in most cases we only need a few events from the HTTP lifecycle. To pare it down further, the three events http_connect, request, and response are enough to meet most requirements.

Here is a slightly tongue-in-cheek example that covers these three events and shows how to work with mitmproxy.

The requirements are:

  1. Because Baidu search is unreliable, whenever the client runs a Baidu search, record the user's search term, then modify the request to change the search term to "360 Search" (360搜索).
  2. Because 360 Search is not reliable either, whenever the client visits 360 Search, change every occurrence of "search" (搜索) on the page to "please use Google" (请使用谷歌).
  3. Because Google is a "non-existent" website, there is no point wasting time trying to connect to its server, so whenever the client tries to access Google, drop the connection immediately.
  4. Package the features above into an addon named Joker, keep the Counter addon shown earlier, and load both into mitmproxy.

The first requirement involves tampering with the client's request, so implement a request event:

def request(self, flow: mitmproxy.http.HTTPFlow):
    # Ignore anything that is not a Baidu search URL.
    if flow.request.host != "www.baidu.com" or not flow.request.path.startswith("/s"):
        return

    # Make sure the request parameters contain a search term.
    if "wd" not in flow.request.query.keys():
        ctx.log.warn("can not get search word from %s" % flow.request.pretty_url)
        return

    # Log the original search term.
    ctx.log.info("catch search word: %s" % flow.request.query.get("wd"))
    # Replace the search term with "360搜索" (360 Search).
    flow.request.query.set_all("wd", ["360搜索"])
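To see what this kind of query rewrite does to the outgoing URL without running the proxy, the same effect can be approximated with the standard library. This is only a sketch of the idea: replace_query_param is a made-up helper, not part of mitmproxy, and the sample URL and replacement value are illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_query_param(url, key, value):
    """Drop every existing `key` parameter and append a single new one."""
    parts = urlsplit(url)
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    pairs.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(replace_query_param("https://www.baidu.com/s?wd=python&ie=utf-8", "wd", "360 search"))
# prints https://www.baidu.com/s?ie=utf-8&wd=360+search
```

The other parameters pass through untouched and only wd changes, which is why the rest of the search page keeps working after the request is tampered with.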

The second requirement involves tampering with the server's response, so implement a response event:

def response(self, flow: mitmproxy.http.HTTPFlow):
    # Ignore anything that is not a 360 Search address.
    if flow.request.host != "www.so.com":
        return

    # Replace every "搜索" (search) in the response with "请使用谷歌" (please use Google).
    text = flow.response.get_text()
    text = text.replace("搜索", "请使用谷歌")
    flow.response.set_text(text)
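The response handler works on text rather than raw bytes: the body is decoded, the text is replaced, and the result is encoded again. As a rough standalone illustration of that round trip, here is a sketch in which rewrite_body is a hypothetical helper and the charset is assumed to be known in advance.

```python
def rewrite_body(raw, charset, old, new):
    """Decode the body, replace text, and encode it back with the same charset."""
    text = raw.decode(charset, errors="replace")
    return text.replace(old, new).encode(charset, errors="replace")


page = "<title>360搜索</title>".encode("utf-8")
print(rewrite_body(page, "utf-8", "搜索", "请使用谷歌").decode("utf-8"))
```

Replacing on decoded text matters for pages that are not UTF-8, because the byte sequence for the same characters differs between encodings.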

The third requirement is to reject the client's request, so implement an http_connect event:

def http_connect(self, flow: mitmproxy.http.HTTPFlow):
    # Check whether the client wants to access www.google.com.
    if flow.request.host == "www.google.com":
        # Returning a non-2xx response closes the connection.
        flow.response = http.HTTPResponse.make(404)
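The check above matches one exact host name. If you wanted to block a whole domain, including its subdomains, the comparison could be factored into a small helper like the one below. This is a hypothetical extension rather than part of the original example; is_blocked and its arguments are names invented here.

```python
def is_blocked(host, blocked_domains):
    """Return True if host equals a blocked domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    for domain in blocked_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


print(is_blocked("www.google.com", ["google.com"]))  # prints True
print(is_blocked("notgoogle.com", ["google.com"]))   # prints False
```

Matching on "." + domain rather than a bare suffix avoids accidentally blocking unrelated hosts whose names merely end with the same letters.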
To meet the fourth requirement, we tidy up the code so that it is easy to manage and easy to read.

Create a joker.py file with the following content:

import mitmproxy.http
from mitmproxy import ctx, http


class Joker:
    def request(self, flow: mitmproxy.http.HTTPFlow):
        if flow.request.host != "www.baidu.com" or not flow.request.path.startswith("/s"):
            return

        if "wd" not in flow.request.query.keys():
            ctx.log.warn("can not get search word from %s" % flow.request.pretty_url)
            return

        ctx.log.info("catch search word: %s" % flow.request.query.get("wd"))
        flow.request.query.set_all("wd", ["360搜索"])

    def response(self, flow: mitmproxy.http.HTTPFlow):
        if flow.request.host != "www.so.com":
            return

        text = flow.response.get_text()
        text = text.replace("搜索", "请使用谷歌")
        flow.response.set_text(text)

    def http_connect(self, flow: mitmproxy.http.HTTPFlow):
        if flow.request.host == "www.google.com":
            flow.response = http.HTTPResponse.make(404)

Create a counter.py file with the following content:

import mitmproxy.http
from mitmproxy import ctx


class Counter:
    def __init__(self):
        self.num = 0

    def request(self, flow: mitmproxy.http.HTTPFlow):
        self.num = self.num + 1
        ctx.log.info("We've seen %d flows" % self.num)

Create an addons.py file with the following content:

import counter
import joker

addons = [
    counter.Counter(),
    joker.Joker(),
]

Put the three files in the same folder, open a command line in that folder, and run:

mitmweb -s addons.py

As before, close all Chrome windows, then start Chrome from the command line with the proxy specified and certificate errors ignored.

Then test the result.

Copyright notice
Author: cwh5920. Please include the original link when reprinting.
https://en.pythonmana.com/2022/02/202202021054572973.html
