Python memory leaks: troubleshooting tips from a real incident

2022-02-02 05:47:24 Charlie is not a dog

Lessons learned: Python memory leak troubleshooting tips

Abstract: I recently ran into a memory leak at work, and the operations team urgently asked for a fix. While solving it, I also wrote up the common ways to troubleshoot memory leaks in a systematic form.

First, we pinned down the symptoms:

1.     The service was released on the 13th, and from the 23rd onward its memory kept rising. Instances were restarted when memory reached the alert threshold, after which it climbed even faster.

2.     The service is deployed on two kinds of chips, A and B. Apart from model inference, almost all pre-processing and post-processing shares one codebase. Only chip B raised memory leak alerts; chip A showed nothing abnormal.

Approach 1: Compare the old and new source code and the second-party dependencies

Given these two symptoms, the first suspect was the release on the 13th. A change could have come in from two directions:

1.     Our own code

2.     Second-party dependencies

We checked both:

  • For our own code, we compared the source of the two versions using the Git history and the BeyondCompare tool, reading the code paths that handle chips A and B separately with particular care. Nothing abnormal turned up.
  • For dependencies, we compared the packages in the two images with the pip list command. Only the version of pytz, the time zone library, had changed.

After looking into it, we judged that this package was unlikely to cause a memory leak and set it aside for the time being.
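The package comparison can be automated with a few lines of Python. This is a minimal sketch, assuming the package lists of the two images were captured with `pip list --format=freeze`; the helper names and the version numbers in the sample data are made up for illustration and are not taken from the incident.

```python
def parse_freeze(text):
    """Parse `pip list --format=freeze` output into {name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned as name==version.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_packages(old, new):
    """Return {name: (old_version, new_version)} for every package that differs."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Hypothetical package lists from the old and the new image.
old_image = "numpy==1.19.5\npytz==2021.1\n"
new_image = "numpy==1.19.5\npytz==2021.3\n"

changes = diff_packages(parse_freeze(old_image), parse_freeze(new_image))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

A package that exists in only one image shows up with `None` on the other side, so additions and removals are reported as well as version changes.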

At this point, hunting for the leak by studying the source changes between the old and new versions looked like a dead end.

Approach 2: Monitor memory changes and compare the old and new versions

Common memory inspection tools for Python currently include pympler, objgraph, and tracemalloc.

First, I used objgraph to observe and tally the 50 most common object types in the old and new services.

Commonly used objgraph commands:

    # Global count of each type
    objgraph.show_most_common_types(limit=50)
    # Growth since the previous call
    objgraph.show_growth(limit=30)

To make the trend easier to observe, I wrapped this in a small helper that writes the data straight to a CSV file.

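    import os
    import time

    import objgraph
    import pandas as pd

    # Snapshot the 50 most common types and tag the row with the current time
    stats = objgraph.most_common_types(limit=50)
    stats_path = "./types_stats.csv"
    tmp_dict = dict(stats)
    req_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    tmp_dict['req_time'] = req_time
    df = pd.DataFrame.from_dict(tmp_dict, orient='index').T

    if os.path.exists(stats_path):
        # header=True repeats the header on every append; the type ranking (and so
        # the column order) can change between samples, so each row keeps its own
        df.to_csv(stats_path, mode='a', header=True, index=False)
    else:
        df.to_csv(stats_path, index=False)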

As shown in the figure below, I ran the old and new versions against a batch of pictures for one hour. Everything was rock steady: the count of every type stayed flat.

At that point I remembered that, before handing a build over for testing or release, I usually run a batch of malformed pictures as a boundary check.

The testers had certainly verified these abnormal inputs before the release, but as a last resort I ran the test anyway.

That broke the calm. As the red box in the figure below shows, the counts of key types such as dict, function, method, tuple, and traceback began to rise.

Meanwhile, the memory used by the container kept growing with no sign of leveling off.

So although I could not yet confirm that this was the online problem, it was at least a bug. Going back to the logs, I found something strange:

Normally, when a special picture causes an exception, the log should output the following information, in which the check_image_type method appears only once in the exception stack.

In practice, check_image_type was printed many times over, and the number of repetitions grew with the number of test requests.

I went back over the exception handling code.

The exception declaration is as follows:

The code that raises the exception is as follows:

The problem

After some thought, I worked out the likely root cause:

Each exception instance is in effect defined as a global variable, and it is this global instance that gets raised. A raised instance keeps its traceback in its __traceback__ attribute, and because the global variable keeps the instance alive, nothing is reclaimed once handling is finished. Every later raise adds more entries to the same traceback.

So as more calls arrive with wrongly formatted pictures, the information held in the exception stack keeps growing. Because that stack also holds the requested picture, memory grows by megabytes at a time.
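The original declaration and raise site are not reproduced here, so the following is only a sketch of the pattern described above. `ImageFormatError`, `SHARED_IMAGE_FORMAT_ERROR`, and the body of `check_image_type` are hypothetical stand-ins, not the service's real code. It shows that raising one shared instance repeatedly makes the traceback chained on that instance longer each time.

```python
class ImageFormatError(Exception):
    """Hypothetical stand-in for the service's image-format exception."""


# Anti-pattern: one instance created at import time and shared by every request.
SHARED_IMAGE_FORMAT_ERROR = ImageFormatError("unsupported image format")


def check_image_type(image_bytes):
    # image_bytes is a local of this frame; it stays alive for as long as
    # some traceback still references the frame.
    raise SHARED_IMAGE_FORMAT_ERROR


def traceback_length(exc):
    """Count the entries chained on exc.__traceback__."""
    count = 0
    tb = exc.__traceback__
    while tb is not None:
        count += 1
        tb = tb.tb_next
    return count


def run_requests(n):
    """Send n bad 'pictures' and record the traceback length after each one."""
    lengths = []
    for _ in range(n):
        try:
            check_image_type(bytes(1024))
        except ImageFormatError:
            pass
        lengths.append(traceback_length(SHARED_IMAGE_FORMAT_ERROR))
    return lengths


print(run_requests(3))
```

The printed lengths increase from one request to the next, which matches the repeated check_image_type entries seen in the log. Each traceback entry references a frame, and each frame references its locals, which is how the picture data ends up retained.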

But this code had been online for a long time. If it really was the cause of the online problem, why had nothing gone wrong before, and why was chip A unaffected?

With these two questions in mind, we ran two checks:

First, we confirmed that the previous version and chip A show the same problem.

Second, we looked at the online call records and found that a new customer had recently been calling the service at one site with a large number of pictures that trigger this problem (most sites run on chip B). We pulled some online examples and saw the same pattern in the logs.

That largely answered both questions. After the bug was fixed, the memory problem did not recur.
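The post does not show the fix itself. One common way to avoid this class of leak, sketched here with the hypothetical names `ImageFormatError` and `check_image_type`, is to create a new exception instance at the raise site, so that each request gets its own traceback and nothing outlives the handler.

```python
class ImageFormatError(Exception):
    """Hypothetical stand-in for the service's image-format exception."""


def check_image_type(image_bytes):
    # A fresh instance per failure: its traceback is released together with
    # the instance once the except block has finished.
    raise ImageFormatError("unsupported image format")


def traceback_length(exc):
    """Count the entries chained on exc.__traceback__."""
    count = 0
    tb = exc.__traceback__
    while tb is not None:
        count += 1
        tb = tb.tb_next
    return count


def run_requests(n):
    """Send n bad 'pictures' and record the traceback length of each exception."""
    lengths = []
    for _ in range(n):
        try:
            check_image_type(bytes(1024))
        except ImageFormatError as exc:
            lengths.append(traceback_length(exc))
    return lengths


print(run_requests(3))
```

Here every request reports the same traceback length, because no state is carried over between raises. If shared error codes or messages are needed, they can live in module-level constants while the exception objects are still built per request.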

Going further

Fair enough, with the problem solved the job looked finished. But I asked myself a question: if this log line had not been printed, or a lazy developer had not logged the full exception stack, how would I have located the problem?

With that question in mind, I kept studying the objgraph and pympler tools.

We already know that abnormal pictures cause the leak, so let's focus on what changes in that case:

With the following commands we can see, each time an exception occurs, which objects were added to memory and how much memory they take.

1.     Use objgraph: objgraph.show_growth(limit=20)

2.     Use pympler:

    from pympler import tracker
    tr = tracker.SummaryTracker()
    tr.print_diff()
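To make clear what such a growth report measures, here is a rough standard-library approximation: count live objects by type before and after an operation and report the types whose count increased. It is not how objgraph or pympler are implemented, and `LeakedThing` is a hypothetical type used only to produce a visible increase.

```python
import gc
from collections import Counter


def type_counts():
    """Count live, gc-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=20):
    """Return (type_name, count_now, increase) for types that grew, largest first."""
    rows = [
        (name, after[name], after[name] - before[name])
        for name in after
        if after[name] > before[name]
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


class LeakedThing:
    """Hypothetical type that is retained on purpose."""


leaked = []

gc.collect()
before = type_counts()
leaked.extend(LeakedThing() for _ in range(100))  # simulate the leak
gc.collect()
after = type_counts()

for name, count, delta in growth(before, after):
    print(f"{name:<20} {count:>8} {delta:>+8}")
```

Only objects tracked by the garbage collector are counted, so plain integers and strings do not appear in this kind of report.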

The following code writes out which references hold these new objects, for further analysis.

    import logging
    import os

    import objgraph

    logger = logging.getLogger(__name__)
    os.makedirs("./dots", exist_ok=True)  # output directory for the .dot files

    gth = objgraph.growth(limit=20)
    for gt in gth:  # gt is (type name, current count, growth since the last call)
        logger.info("growth type:%s, count:%s, growth:%s" % (gt[0], gt[1], gt[2]))
        # Skip types with too many instances to inspect
        if gt[2] > 100 or gt[1] > 300:
            continue
        obj = objgraph.by_type(gt[0])[0]  # first instance of the type, as a sample
        objgraph.show_backrefs(obj, max_depth=10, too_many=5,
                               filename="./dots/%s_backrefs.dot" % gt[0])
        objgraph.show_refs(obj, max_depth=10, too_many=5,
                           filename="./dots/%s_refs.dot" % gt[0])
        # A shortest reference chain from a module to the object
        objgraph.show_chain(
            objgraph.find_backref_chain(obj, objgraph.is_proper_module),
            filename="./dots/%s_chain.dot" % gt[0]
        )

Use graphviz's dot tool to convert the generated graph files into pictures:

    dot -Tpng xxx.dot -o xxx.png

Basic types such as dict, list, frame, tuple, and method have too many instances to inspect, so the code above filters out types with very high counts.
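An alternative to filtering by count is to drop the basic container types by name. This is a small sketch under that assumption: `BASIC_TYPES`, `filter_growth`, and the sample rows are illustrative, with the rows shaped like the (type name, count, growth) tuples returned by objgraph.growth().

```python
# Type names considered too generic to inspect one by one.
BASIC_TYPES = {"dict", "list", "frame", "tuple", "method"}


def filter_growth(rows, skip=BASIC_TYPES):
    """Keep only rows whose type name is not in `skip`."""
    return [row for row in rows if row[0] not in skip]


# Hypothetical growth rows: (type name, current count, growth).
rows = [
    ("dict", 5120, 310),
    ("traceback", 96, 48),
    ("ImageReqWrapper", 24, 12),
    ("tuple", 9000, 150),
]

for name, count, delta in filter_growth(rows):
    print(name, count, delta)
```

Filtering by name keeps rare but telling types such as traceback in view even when their counts are small compared with dict or tuple.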

Reference chain of the newly added ImageReqWrapper objects:

Reference chain of the newly added traceback objects:

Admittedly, prior knowledge draws our attention straight to traceback and the corresponding IMAGE_FORMAT_EXCEPTION exception.

But even without it, the root cause can be found quickly by asking why these objects, which should have been reclaimed once the service call finished, are still alive, and in particular why no traceback object can be reclaimed after IMAGE_FORMAT_EXCEPTION has been raised, and by running a few small experiments.
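One such small experiment is to hold only a weak reference to a request object and check whether it is still alive after the exception has been handled. This is a sketch with hypothetical names (`ImageRequest`, `ImageFormatError`, `raise_shared`, `raise_fresh`); it is not the service's code.

```python
import gc
import weakref


class ImageRequest:
    """Hypothetical request wrapper that carries the picture bytes."""

    def __init__(self, data):
        self.data = data


class ImageFormatError(Exception):
    """Hypothetical stand-in for the service's image-format exception."""


SHARED_ERROR = ImageFormatError("unsupported image format")


def raise_shared(request):
    # Raises the module-level instance; its traceback keeps this frame alive.
    raise SHARED_ERROR


def raise_fresh(request):
    # Raises a new instance that disappears once it has been handled.
    raise ImageFormatError("unsupported image format")


def is_retained(handler):
    """Return True if the request object is still alive after handling the error."""
    request = ImageRequest(bytes(1024 * 1024))
    ref = weakref.ref(request)
    try:
        handler(request)
    except ImageFormatError:
        pass
    del request   # drop our own strong reference
    gc.collect()  # also clear anything that is only held by reference cycles
    return ref() is not None


print("shared instance retains request:", is_retained(raise_shared))
print("fresh instance retains request:", is_retained(raise_fresh))
```

With the shared instance, the traceback stored on the exception still references the handler's frame and therefore its `request` argument, so the weak reference stays alive. With a fresh instance, the exception, its traceback, and the frame are all released after handling.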

From this we can conclude:

Because the raised exception cannot be reclaimed, the exception stack and the request body attached to it cannot be reclaimed either. Since the request body contains the picture, each such request leaks memory on the order of megabytes.

In addition, I found along the way that Python 3 ships with a memory analysis tool, tracemalloc. The following code shows the relationship between lines of code and memory. It may not be exact, but it can provide some clues.

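    import gc
    import logging
    import tracemalloc

    logger = logging.getLogger(__name__)

    tracemalloc.start(25)  # record up to 25 frames per allocation
    snapshot = tracemalloc.take_snapshot()  # baseline

    def log_memory_growth():
        """Log the largest allocation differences since the previous snapshot."""
        global snapshot
        gc.collect()
        snapshot1 = tracemalloc.take_snapshot()
        top_stats = snapshot1.compare_to(snapshot, 'lineno')
        logger.warning("[ Top 20 differences ]")
        for stat in top_stats[:20]:
            if stat.size_diff < 0:  # skip lines whose memory shrank
                continue
            logger.warning(stat)
        snapshot = tracemalloc.take_snapshot()  # baseline for the next call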
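When a single line number is not enough of a clue, tracemalloc can also group allocations by their full traceback. The following self-contained sketch uses a hypothetical leak (`retained` and `handle_request` are made up for illustration) and prints the call stack responsible for the largest growth between two snapshots.

```python
import tracemalloc

retained = []  # hypothetical leak: objects appended here are never released


def handle_request():
    # Each call keeps another 256 KiB buffer alive.
    retained.append(bytearray(256 * 1024))


tracemalloc.start(10)  # record up to 10 frames per allocation
baseline = tracemalloc.take_snapshot()
for _ in range(20):
    handle_request()
current = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group by full traceback instead of a single line; the largest change comes first.
top_stats = current.compare_to(baseline, "traceback")
biggest = top_stats[0]
print(f"+{biggest.size_diff / 1024:.0f} KiB in {biggest.count_diff} blocks")
for line in biggest.traceback.format():
    print(line)
```

Grouping by "traceback" costs more memory than "lineno" because every recorded frame is kept, so the frame limit passed to tracemalloc.start() is a trade-off between detail and overhead.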

If this article helped you, please give it a like.

copyright notice
author[Charlie is not a dog],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202020547228549.html
