current position:Home>Python garbage collection summary

Python garbage collection summary

2022-01-30 07:08:10 Zioyi


Reading recently 《 Algorithm and implementation of garbage collection 》, There are some commonly used garbage collection (Garbage Collect) Algorithm , Such as : Mark - eliminate 、 Reference count 、 Generational recycling and so on . Later we will talk about Python Garbage collection strategy of , Record here .

To measure GC Four elements of performance

  1. throughput

Throughput is... Per unit time GC Ability to come out . The formula is :GC Heap capacity processed / GC Time spent . Usually , People like high throughput GC Algorithm .

  1. Maximum pause time

GC The maximum time that the user thread is suspended during execution . If GC drawn-out , Will affect the normal execution of the program . Larger throughput and shorter maximum pause times are often not both .

  1. Heap efficiency

GC Is the automatic memory management function , So the ideal situation is GC Don't take up too much heap space . Two factors that affect the efficiency of the heap are : The size of the header and the usage of the heap . The larger the heap available ,GC The faster it runs ; contrary , The more you want to make effective use of the limited heap ,GC The longer it takes .

  1. The locality of the visit

There is usually a possibility of continuous access between objects with reference relationships . This is common in most programs , be called “ The locality of the visit ”. Considering access locality , Arrange objects with reference relationships closer to the heap , It can improve the utilization of data .

Python Using reference counting GC The algorithm of , The advantage of reference counting algorithm is that it can shorten Maximum pause time , The defect is that it faces great challenges in increasing and decreasing the maintenance count ( If you forget to perform the decrement operation, the object will not be released ).

Where does the reference count exist

about Python The data of , image List、Set、Tuple、Dict、Str、Int, In its underlying data structure , There will be one PyObject Members of type , Used to maintain the reference count of objects

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeoject *ob_type;
} PyObject;
 Copy code 

Among them ob_refcnt Members are responsible for maintaining reference counts such , All built-in structures are retained at the beginning PyObject Structure to maintain reference counts .

Memory allocator

Python Memory allocation is not a simple call malloc/free Function to request memory from the operating system . But for efficiency reasons, reduce system calls as much as possible , Divide the memory allocator into 3 layer .

    Object-specific allocators
    _____   ______   ______       ________
   [ int ] [ dict ] [ list ] ... [ string ]       Python core         |
+3 | <----- Object-specific memory -----> | <-- Non-object memory --> |
    _______________________________       |                           |
   [   Python's object allocator   ]      |                           |
+2 | ####### Object memory ####### | <------ Internal buffers ------> |
    ______________________________________________________________    |
   [          Python's raw memory allocator (PyMem_ API)          ]   |
+1 | <----- Python memory (under PyMem manager's control) ------> |   |
   [    Underlying general-purpose allocator (ex: C library malloc)   ]
 0 | <------ Virtual memory allocated for the python process -------> |
   [                OS-specific Virtual Memory Manager (VMM)               ]
-1 | <--- Kernel dynamic storage allocation & management (page-based) ---> |
    __________________________________   __________________________________
   [                                  ] [                                  ]
-2 | <-- Physical memory: ROM/RAM --> | | <-- Secondary storage (swap) --> |

 Copy code 

The first 0 Layer to tier 3 Layer is Python Distributor level for management , If you take the dictionary object as an example ,

PyDict_New()             ----3 layer 
  PyObject_GC_New()      ----2 layer 
  PyObject_Malloc()      ----2 layer 
    new_arena()          ----1 layer 
      malloc()           ----0 layer 
 Copy code 

The first 0 Layer is to directly call the allocation function provided by the operating system , Such as malloc.Python Not all objects are generated and called 0 layer , Instead, different allocation schemes are selected according to the size of memory to be allocated :

  • The requested memory is greater than 256 byte , Use the 0 layer
  • Requested memory is less than 256 byte , Use the first and second layers

Based on experience , A large number of objects used in the coding process are less than 256 Bytes of , And the life cycle is very short , for example :

for x in range(100):
 Copy code 

If this process is called frequently malloc and free, The efficiency will be very low .Python By adding section 1 Tier and tier 2 layer , And in the first layer, we will start from the... In advance 0 The floor applies for some space to be reserved and managed , When the allocated object memory is less than 256 when , Use this space .

Python The memory structures managed are 3 Parts of :blockpoolarena, The relationship between them is as follows :

  • arena Used to manage storage pool
  • pool Used to manage storage block
  • block The smallest unit of memory available to an applicant

To avoid frequent calls malloc() and free(), The first 1 The distributor of the layer will be in the largest unit arena To preserve memory .pool Is used to effectively manage empty block The unit of .

The first 2 The allocator of layer is responsible for managing pool Internal block. take block Return the beginning address of to the applicant , And release block etc. . The distributor will pool According to 8 The size of multiple of bytes is divided to produce several block, Such as :8 Bytes of block、16 Bytes of block、24 Bytes of block、... 256 Bytes of block. When applying for memory , Will return the appropriate size block.

The first 3 Layers are object specific allocators ,Python Various built-in types in : Such as :list、dict etc. , There will be their own unique distributors , such as dict The distributor looks like this :

#ifndef PyDict_MAXFREELIST
#define PyDict_MAXFREELIST 80
static PyDictObject *free_list[PyDict_MAXFREELIST];
static int numfree = 0;
 Copy code 

When allocating dictionary objects , Will check the free list first free_list Whether there are available objects , If it has been exhausted , Then go through the second 2 layer PyObject_GC_New Go to the applicant .

As a whole Python The interaction when generating dictionary objects is as follows :

Reference counting

Python Reference counting method is implemented in , In fact, it is to change the reference quantity of each object , If the number of references to an object increases , Just add... To the counter 1, Conversely, if the number of references decreases , Just subtract... From the counter 1.

Incremental operation

The actual counting operation is performed by the macro Py_INCREF To achieve

#define Py_INCREF(op) ( \ ((PyObject*)(op))->ob_refcnt++)
 Copy code 

ob_refcnt The type of is 32 Bit environment is int type , stay 64 Bit environment is long type . Because there are sign bits , So only half of the values can be expressed as nonnegative integers . But because the pointer is basically pressing 4 Byte alignment Of , So even if the reference counter is referenced by all pointers , It won't overflow . Designed to allow negative numbers for debugging purposes , When the reference counter has a negative number , There is the possibility of excessive decrement operation or the legacy of incremental operation .

Decrement operation

The actual counting operation is performed by the macro Py_DECREF To achieve

#define Py_DECREF(op) \ if (--((PyObject*)(op))->ob_refcnt != 0) \ _Py_CHECK_REFCNT(op) \ else \ _Py_Dealloc((PyObject*)(op))

# define _Py_Dealloc(op) ( \ (*Py_TYPE(op)->tp_dealloc)((PyObject*)(op)))
 Copy code 

Count the counter first , If not for 0, Call the macro _Py_CHECK_REFCNT Check whether the reference counter becomes negative . If the counter is 0, Then call the macro _Py_Dealloc, adopt Py_TYPE Identify the type of object and call the corresponding function pointer responsible for releasing each object , such as : The function pointer responsible for releasing the tuple object is tupledealloc.

static void tupledealloc(register PyTupleObject *op) {
    register Py_ssize_t i;
    register Py_ssize_t len = Py_Size(op);
    if (len > 0) {
        i = len;
        /*  Decrement the elements in the tuple  */
        while(--i >= 0)
    /*  Release tuple object  */
    Py_TYPE(op)->tp_free((PyObject *)op);
 Copy code 

member tp_free There is a pointer to the release processing function of each object , Most of them call PyObect_GC_Del function

void PyObject_GC_Del(void *op) {
    PyGC_Head *g = AS_GC(op);
    /* ... */
 Copy code 

The call diagram of tuple decrement operation is as follows :

  _Py_Dealloc              ————  Decrement operation 
      PyObject_GC_Del      ————  Tuple release processing 
          PyObject_Free    ————  Free memory   
 Copy code 

Ownership of references

As explained above, the core operation of reference counting is counting +1 And counting -1. Just to be clear , Who will be responsible for the operation , such as :A Object by calling a function func1 To obtain the B object , that B Reference count for object +1 Who should be in charge ? This is about “ Ownership of references ”, That is, who holds the ownership of the reference , Who is responsible for decrementing the reference counter of the object when this reference is not needed .

  1. Pass reference ownership ( Return value )

That is, the function side hands over the ownership of the reference and the return value to the caller . The caller of the function is responsible for the decrement operation at the end of the reference . When objects are generated , Will give ownership of the reference to the caller , such as : Generation of dictionary objects

PyObject *dict = PyDict_New();
/*  Do something ,  After the end */
dict = NULL;
 Copy code 
  1. Ownership of lending references ( Return value )

Return the value to the caller only , As for the ownership of reference, it is just lending , The caller cannot decrement the reference count . In charge of from “ aggregate ” The function that takes the element out of the , All lending ownership , such as : Tuple gets the function of the specified index

PyObject * PyTuple_GetItem(register PyObject *op, register Py_ssize_t i) {
    /* Check operation */
    return ((PyTupleObject *)op) -> ob_item[i]
 Copy code 
  1. Take ownership of references ( Parameters )

That is, when the caller passes parameters to the function , Sometimes the reference side of this function will occupy the ownership of this parameter , At this time, the function party is responsible for the reference management of this parameter . Functions that append elements to the linked list will occupy the reference ownership of parameters , such as : A function that adds an element to a specified position in a tuple

PyObject *tuple, *dict;

tuple = PyTuple_New(3);
dict = PyDict_New();    /* Returned with ownership to dict*/
PyTuple_SetItem(tuple, 0, dict);    /* Reference ownership is tuple Occupy */
dict = NULL;
 Copy code 
  1. Lending refers to ownership ( Parameters )

That is, the caller lends the reference ownership of the parameter to the function party . When the caller of the function wants to lend the ownership of the reference , From giving the object to the function to the end of the function execution , During this time, the caller must retain ownership of the reference to the object .

Circular reference garbage collection

The citation counting method has a flaw , Can't solve the circular reference problem , When two objects refer to each other , Reference count cannot be cleared , That is, it is impossible to GC. therefore , For circular references ,Python adopt Part of the mark - Clear algorithm To solve .

Algorithm principle : Part of the mark - Clear algorithm

For example, below , There is a circular reference before the three objects on the left , Causes normal reference counts to fail to recycle them We first copy their current reference count to another area for later operations There is a premise :Python When an object is generated, it connects itself to a Object linked list in ( Connect with each other through two-way pointing ), In the figure, two-way arrows are used to indicate Based on this , We traverse Object linked list All objects in , The copy count of all their referenced objects is reduced by one For now Mark Operation , After the above processing, the copy count is still greater than 0 The object of is marked as “ Reachable objects ”, That is, they are referenced by other active objects , And mark the objects they reference as reachable objects , Connect to the list of reachable objects ; Then count the copies as 0 The object of is marked as “ Unreachable object ”, Connect to unreachable object linked list . You can see , The unreachable object is the object of circular reference , Next, we will eliminate operation , We free the memory of the objects in the unreachable object linked list , Reconnecting objects in the reachable object list will Object linked list in Here we are , We have completed the garbage collection of circular reference objects .

Not all objects have circular references

The root cause of circular references is that some objects may retain references to other objects , For such objects , We call it “ Container object ”. Image tuple 、 Lists and dictionaries, etc , All belong to container objects , You only need to solve the problem of circular reference to these container objects . Container objects are allocated header structures for circular reference garbage collection .

tyepdef union _gc_head {
    struct {
        union _ge_head *gc_next; /* For two-way linked list */
        union _gc_head *gc_prev; /* For two-way linked list */
        Py_ssize_t gc_refs;      /* Used to copy count */
    } gc;
    long double dummy;
} PyGC_Head;
 Copy code 

As mentioned earlier , When the container object is generated , Will be connected to Object linked list , Take dictionary objects as an example , Take a look at the code he generates

PyObject * PyDict_New(void) {
    regiseter PyDictObject *mp;
    /* Operation of generating objects */
    return (PyObject *)mp;

 Copy code 

_PyObject_GC_TRACK Responsible for connecting to Object linked list The operation

#define _PyObject_GC_TRACK(o) do { \ PyGC_Head *g = _Py_AS_GC(o); \ g->gc.gc_refs = _PyGC_REFS_REACHABLE; \ g->gc.gc_next = _PyGC_generation0; \ g->gc.gc_prev = _PyGC_generation0->gc.gc_prev; \ g->gc.gc_prev ->gc.gc_next = g; \ _PyGC_generation0->gc.gc_prev = g; \ } while (0);
 Copy code 

Container object generation management

Python Divide the container object linked list mentioned above into “0 generation ”、“1 generation ”、“2 generation ”, Manage through the following structure

struct gc_generation {
    PyGC_Head head;   /*  Head correction  */
    int threshold;    /*  Start GC The threshold of  */
    int count;        /*  Number of modified objects  */

# define GEN_HEAD(n) (&generations[n].head)

PyGC_Head *_PyGC_generation0 = GEN_HEAD(0); /* 0 Container object of generation  */
 Copy code 

At first, all container objects are connected 0 The object of generation . When 0 The generation container object passes through 1 Circular reference garbage collection , The surviving object is promoted to 1 generation ; Once again, the object that survived the circular reference garbage collection process is promoted to 2 generation .

Remove special cases

During circular reference garbage collection , We will have finalizers The object of is removed from the unreachable linked list

static void move_finalizers(PyGC_Head *unreachable, PyGC_Head *finalizers) {
    PyGC_Head *gc;
    PyGC_Head *next;
    for (gc = unreachable->gc.gc_next; gc != unreachable; gc = next) {
        PyObject *op = FROM_GC(gc);
        next = gc->gc.gc_next;
        if (has_finalizer(op)) {
            gc_list_move(gc, finalizers);
            gc->gc.gc_refs = GC_REACHABLE;
 Copy code 

That's why , Because it is very troublesome to want to release the terminator object with circular reference . We're not sure it's reasonable to release that object first , If you put the first 1 An object is released , What if the finalizer executing when releasing the second object references the first object ?

For these circular reference garbage objects with finalizers , Python Global variables are provided garbage, Let developers remove circular references to objects from the perspective of English programs .

import gc
 Copy code 


Python Of GC In two parts :

  • adopt Reference counting algorithm To manage the garbage collection of regular objects , And by optimizing the memory structure to minimize GC
  • adopt generational + eliminate - Mark To manage the garbage collection of circular reference objects

copyright notice
author[Zioyi],Please bring the original link to reprint, thank you.

Random recommended