
Lessons learned the hard way: Python memory leak troubleshooting tips

2022-02-01 07:41:48 Huawei cloud developer community

Abstract: Our service recently hit a memory leak, and the operations team urgently asked us to fix it. While solving the problem, I also wrote up the common approaches to troubleshooting memory leaks.

This article is shared from the Huawei Cloud community post 《Python memory leak troubleshooting tips》, author: lutianfei.


First, we pinned down the symptoms of the problem:

  1. The service was last released on the 13th. Starting on the 23rd, memory kept rising; when it reached the alert threshold the instance was restarted, after which it climbed even faster.

  2. The service is deployed on two kinds of chips, A and B. Apart from model inference, almost all preprocessing and post-processing share one code base. Chip B raised memory leak alerts, while chip A showed nothing abnormal.

image.png

Approach 1: Compare the old and new source code and the second-party library dependencies

Given the two observations above, the first suspect was the update released on the 13th. The change could have come from two places:

  1. In-house code

  2. Second-party dependencies

We investigated both:

  • On one hand, we compared the source code of the two versions using Git history and BeyondCompare, and read closely the code paths that handle chips A and B separately. Nothing abnormal was found.

  • On the other hand, we compared the packages in the two images with the pip list command. Only the version of pytz, the time zone library, had changed.

After some analysis, we judged it unlikely that this package caused the memory leak, so we set it aside for the time being.
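The package comparison can also be scripted rather than done by eye. The following is a minimal sketch, assuming each image's package list has been captured with `pip list --format=freeze`. The helper names and the sample version numbers are hypothetical and are not taken from the original investigation.

```python
def parse_freeze(text):
    """Parse `pip list --format=freeze` output into {package: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_packages(old_text, new_text):
    """Return {package: (old_version, new_version)} for every difference."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical sample data; real input would be captured from each image.
old_image = "numpy==1.19.5\npytz==2021.1\n"
new_image = "numpy==1.19.5\npytz==2021.3\n"
print(diff_packages(old_image, new_image))
```

A package that exists in only one image shows up with `None` on the other side, so added and removed dependencies are reported as well as version changes.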

image.png

At this point, trying to find the memory leak by studying the source code changes between the old and new versions seemed to have reached a dead end.

Approach 2: Monitor memory changes and differences between the old and new versions

Common memory inspection tools for Python currently include pympler, objgraph, and tracemalloc.

First, we used objgraph to observe and collect statistics on the top 50 object types in the old and new services.

Commonly used objgraph commands are as follows:

import objgraph

# Print the number of live objects per type (top 50)
objgraph.show_most_common_types(limit=50)

# Print the types whose object counts grew since the previous call
objgraph.show_growth(limit=30)
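To make it clearer what a growth report measures, here is a rough standard-library approximation: count live objects by type before and after some work, then list the types whose counts increased. This is a simplified sketch, not objgraph's actual implementation, and the `Leaky` class and `held` list are hypothetical stand-ins for leaked objects.

```python
import gc
from collections import Counter


def type_counts():
    """Count live, GC-tracked objects by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return [(type_name, count_now, delta)] for types whose count increased."""
    deltas = [
        (name, count, count - before.get(name, 0))
        for name, count in after.items()
        if count > before.get(name, 0)
    ]
    return sorted(deltas, key=lambda item: item[2], reverse=True)[:limit]


class Leaky:
    """Hypothetical stand-in for an object that is never released."""


held = []  # keeps every Leaky instance alive, simulating a leak
baseline = type_counts()
held.extend(Leaky() for _ in range(1000))

for name, count, delta in growth(baseline, type_counts()):
    print(name, count, delta)
```

Types that appear at the top of this report request after request, without ever dropping back, are the ones worth tracing further.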

To observe the trend over time more easily, I wrote a small wrapper that writes the data straight to a CSV file.

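import os
import time

import objgraph
import pandas as pd

# Snapshot the 50 most common object types as {type_name: count}.
stats = objgraph.most_common_types(limit=50)
stats_path = "./types_stats.csv"
tmp_dict = dict(stats)
req_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
tmp_dict['req_time'] = req_time
df = pd.DataFrame.from_dict(tmp_dict, orient='index').T

if os.path.exists(stats_path):
    # Append without repeating the header; rows line up only while the
    # set of top-50 types stays the same between snapshots.
    df.to_csv(stats_path, mode='a', header=False, index=False)
else:
    df.to_csv(stats_path, index=False)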

As shown in the figure below, after running a batch of images against both the old and new versions for 1 hour, everything was rock steady: the count of every type stayed flat.

image.png

At this point I remembered that before handing a build over to testing or releasing it, I usually run a batch of malformed images as a boundary check.

The testers had surely verified these abnormal cases before release, but with nothing to lose I ran them anyway.

The calm was broken. As shown in the red box below, the counts of key types such as dict, function, method, tuple, and traceback began to rise.

image.png

At the same time, the memory used by the running image kept growing, with no sign of leveling off.

image.png

So although we could not yet confirm that this was the online problem, it was at least a bug. Going back to the logs, I noticed something odd: normally, when a special image causes an exception, the log should contain the following output, in which the check_image_type method appears only once in the exception stack.

image.png

In reality, however, check_image_type was printed many times, and the number of repetitions grew with the number of test requests.

image.png

I then re-examined the exception handling code.

The exception is declared as follows:

image.png

The code that raises it is as follows:

image.png

The problem

After some thought, I worked out roughly where the root of the problem lies:

Each exception instance here is effectively defined as a global variable, and it is this global instance that gets raised every time. In Python 3 the traceback is stored on the exception instance itself (its __traceback__ attribute), so once the global instance has been raised and handled it is not garbage-collected, and neither is the traceback attached to it.

So as more requests with malformed images arrive, the information in the exception stack keeps accumulating. And because the stack also references the requested image data, memory grows by megabytes at a time.
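The behavior is easy to reproduce in isolation. The sketch below is an illustration under assumptions rather than the service's real code: the original declaration was only shown as an image, so `ValueError` and `handle_request` are hypothetical stand-ins, while `IMAGE_FORMAT_EXCEPTION` and `check_image_type` reuse names mentioned in this article.

```python
# Hypothetical stand-in for the service's module-level exception instance.
IMAGE_FORMAT_EXCEPTION = ValueError("unsupported image format")


def check_image_type(image_bytes):
    # Raises the SAME instance on every call.
    raise IMAGE_FORMAT_EXCEPTION


def traceback_length(exc):
    """Count the entries chained on exc.__traceback__."""
    length, tb = 0, exc.__traceback__
    while tb is not None:
        length += 1
        tb = tb.tb_next
    return length


def handle_request(image_bytes):
    try:
        check_image_type(image_bytes)
    except ValueError:
        pass  # the request fails, but the global instance keeps its traceback


lengths = []
for _ in range(5):
    handle_request(b"\x00" * 1024)
    lengths.append(traceback_length(IMAGE_FORMAT_EXCEPTION))

print(lengths)  # the traceback gets longer after every failed request
```

Each traceback entry keeps its stack frame alive, together with that frame's local variables, which is why the repeated check_image_type lines in the log and the growing memory are two views of the same thing.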

But this code had been online for a long time. If it really was the cause of the online problem, why had there been no problem before, and why was chip A unaffected? To answer these two questions we did two checks:

First, we confirmed that the previous version and chip A also exhibit this problem.

Second, we reviewed the online call records and found that a new customer had recently been calling the service at one site (most sites run on chip B) with a large number of images that trigger this kind of error. We pulled some online instances and saw the same pattern in their logs.

That basically answers the questions above. After this bug was fixed, the memory leak no longer occurred.
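The article does not show the fix itself. One common way to avoid this class of leak, sketched here as an assumption rather than as the change that was actually made, is to define an exception class and raise a fresh instance each time. `ImageFormatError` is a hypothetical name.

```python
class ImageFormatError(Exception):
    """Hypothetical exception class replacing the shared global instance."""


def check_image_type(image_bytes):
    # A new instance per call: its traceback is released with the instance.
    raise ImageFormatError("unsupported image format")


def traceback_length(exc):
    """Count the entries chained on exc.__traceback__."""
    length, tb = 0, exc.__traceback__
    while tb is not None:
        length += 1
        tb = tb.tb_next
    return length


seen = []
for _ in range(5):
    try:
        check_image_type(b"\x00" * 1024)
    except ImageFormatError as exc:
        seen.append(traceback_length(exc))

print(seen)  # the same length every time; nothing accumulates
```

Because each instance goes out of scope when its except block ends, the frames and the request data they reference become collectable after every request.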

Going further

In all fairness, with the problem solved the work could have ended here. But I asked myself a question: if this log line had not been printed, or the developer had been lazy and not logged the full exception stack, how would we have located the problem?

With this question in mind, I dug further into objgraph and pympler.

We already know that abnormal images trigger the leak, so let's focus on what changes in that case:

With the following commands we can see which objects are added to memory, and how much memory grows, each time an exception occurs.

  1. Using objgraph: objgraph.show_growth(limit=20)

image.png

  2. Using pympler:

    from pympler import tracker
    tr = tracker.SummaryTracker()
    tr.print_diff()

image.png

The following code writes out the reference graphs of these new objects, showing what refers to them, for further analysis.

import os

import objgraph

# `logger` is the service's existing logger.
os.makedirs("./dots", exist_ok=True)

gth = objgraph.growth(limit=20)
for gt in gth:
    # Each entry is (type name, current count, increase since the last call).
    logger.info("growth type:%s, count:%s, growth:%s" % (gt[0], gt[1], gt[2]))
    # Skip very numerous basic types; their graphs are too large to read.
    if gt[2] > 100 or gt[1] > 300:
        continue
    objgraph.show_backrefs(objgraph.by_type(gt[0])[0], max_depth=10, too_many=5,
                           filename="./dots/%s_backrefs.dot" % gt[0])
    objgraph.show_refs(objgraph.by_type(gt[0])[0], max_depth=10, too_many=5,
                       filename="./dots/%s_refs.dot" % gt[0])
    objgraph.show_chain(
        objgraph.find_backref_chain(objgraph.by_type(gt[0])[0], objgraph.is_proper_module),
        filename="./dots/%s_chain.dot" % gt[0]
    )

Using graphviz's dot tool, convert the graph data produced above into images:

dot -Tpng xxx.dot -o xxx.png

Because there are too many objects of basic types such as dict, list, frame, tuple, and method, which makes them hard to inspect, the code above filters them out.

Reference chain of the newly added ImageReqWrapper objects:

image.png

Reference chain of the newly added traceback objects:

image.png

Admittedly, our prior knowledge naturally draws attention to traceback and its corresponding IMAGE_FORMAT_EXCEPTION.

But by asking why the objects above, which should have been collected once the service call finished, were not collected (in particular, why none of the traceback objects could be collected after IMAGE_FORMAT_EXCEPTION was raised), and by running a few small experiments, I believe the root cause could be located quickly even without that knowledge.

In addition, for memory leaks caused by caching an Exception in Python 3, there is a clearer explanation on Zhihu: zhuanlan.zhihu.com/p/38600861

From this we can conclude: because the raised exception cannot be collected, the corresponding exception stack, request body, and other objects cannot be collected either. Since the request body contains image data, each such request leaks memory on the order of megabytes.
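One of the small experiments mentioned above could look like the following sketch, which uses a weak reference to check whether a request body is still alive after the request has finished. `RequestBody`, `SHARED_ERROR`, `process`, and `serve_once` are all hypothetical names chosen for illustration, and the behavior described is what I would expect on CPython.

```python
import gc
import weakref


class RequestBody:
    """Hypothetical stand-in for a request object that carries image data."""

    def __init__(self, size):
        self.data = bytearray(size)


SHARED_ERROR = ValueError("bad image")  # module-level instance (assumption)


def process(body):
    raise SHARED_ERROR  # `body` is a local variable of this frame


def serve_once():
    body = RequestBody(1024 * 1024)
    ref = weakref.ref(body)
    try:
        process(body)
    except ValueError:
        pass
    return ref


ref = serve_once()
gc.collect()
alive_before = ref() is not None
print("request body still alive:", alive_before)

# Dropping the traceback releases the frames and everything they reference.
SHARED_ERROR.__traceback__ = None
gc.collect()
alive_after = ref() is not None
print("alive after clearing the traceback:", alive_after)
```

The request body survives the request only because the traceback stored on the shared exception still references the frames that held it, and it is released as soon as that traceback is dropped.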

In addition, I found along the way that Python 3 ships with a memory analysis tool, tracemalloc. The following code relates memory growth to lines of code. It may not be precise, but it can provide some clues.

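import gc
import tracemalloc

# Run once at service startup: record up to 25 frames per allocation.
tracemalloc.start(25)
snapshot = tracemalloc.take_snapshot()


# Call after each request (the function name is illustrative);
# `logger` is the service's existing logger.
def log_memory_growth():
    global snapshot
    gc.collect()
    snapshot1 = tracemalloc.take_snapshot()
    top_stats = snapshot1.compare_to(snapshot, 'lineno')
    logger.warning("[ Top 20 differences ]")
    for stat in top_stats[:20]:
        if stat.size_diff < 0:
            continue
        logger.warning(stat)
    # Take a new baseline for the next comparison.
    snapshot = tracemalloc.take_snapshot()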
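For readers who want to try tracemalloc outside the service, here is a self-contained sketch of the same snapshot comparison. The `leaked` list and `handle_request` function are hypothetical and simulate a leak on purpose; the sizes are arbitrary.

```python
import gc
import tracemalloc

leaked = []  # hypothetical leak: a list that only ever grows


def handle_request():
    leaked.append(bytearray(256 * 1024))  # stands in for retained image data


tracemalloc.start(25)
baseline = tracemalloc.take_snapshot()

for _ in range(20):
    handle_request()

gc.collect()
current = tracemalloc.take_snapshot()
top_stats = current.compare_to(baseline, "lineno")
for stat in top_stats[:3]:
    print(stat)

biggest = top_stats[0]  # the entry with the largest change in size
tracemalloc.stop()
```

The first entry should point at the line that allocates the bytearray. That tells you where the memory was allocated, not who is still holding it, which is why it complements the reference graphs rather than replacing them.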

image.png

References

testerhome.com/articles/19…

blog.51cto.com/u_3423936/…

segmentfault.com/a/119000003…

www.cnblogs.com/zzbj/p/1353…

drmingdrmer.github.io/tech/progra…

zhuanlan.zhihu.com/p/38600861


copyright notice
author[Huawei cloud developer community],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202010741474164.html
