Teach You to Read the CPython Source Code (Part 1)

2022-01-29 19:49:54 cxapython

Translated from: realpython.com/cpython-sou…

Contents

Part 1 - Introduction to CPython

  • What's in the source code?
  • How to compile CPython
  • What does a compiler do?
  • Why is CPython written in C and not Python?
  • The Python language specification
  • Memory management in CPython
  • Conclusion

Part 2 - The Python interpreter process

  • Establishing the runtime configuration
  • Reading files and input
  • Lexing and parsing
  • Abstract syntax trees
  • Conclusion

Part 3 - The CPython compiler and execution loop

  • Compiling
  • Execution
  • Conclusion

Part 4 - Objects in CPython

  • The base object type
  • The bool and long integer types
  • A review of the generator type
  • Conclusion

Part 5 - The CPython standard library

  • Python modules
  • Python and C modules
  • The CPython regression test suite
  • Installing a user-defined C version

Finally - The CPython source code: Conclusion

···------------------------- Text begins ----------------------------···

Part 1 - Introduction to CPython

Preface

While using Python, have you ever wondered why looking something up in a dictionary is so much faster than iterating over a list? How does a generator remember the state of its variables each time it yields a value? Why don't we have to allocate memory in Python the way we do in other languages? It turns out that CPython, the most popular Python implementation, has a runtime written in human-readable C and Python code.

This article is all about CPython. It covers the concepts behind CPython's internals, how they work, and visual explanations along the way. You will learn how to:

  • Read and navigate the source code
  • Compile CPython from source
  • Understand concepts such as lists, dictionaries, and generators and how they work internally
  • Run the test suite
  • Modify or upgrade components of the CPython library, perhaps to contribute them to future versions of Python

This article is long but useful. If you decide to study CPython, I hope you will read on; you will find it good learning material. The article is divided into 5 parts, so you can schedule your reading as your time allows. Each part takes a while to work through, and if you study some of the examples yourself you will feel a sense of accomplishment, because you will have grasped core concepts of Python that make you a better Python programmer. When we talk about Python, we usually mean CPython. CPython is one of many Python implementations, alongside PyPy, Jython, and others. It is the official implementation and the one used by most examples you find online, so it is the one discussed here. Note: This article is written against version 3.8.0b3 of the CPython source code.

What's in the source code ?

The CPython source distribution comes with a whole range of tools, libraries, and components. We will explore them in this article. First, we'll focus on the compiler. Start by downloading the CPython source code with git:

git clone https://github.com/python/cpython
cd cpython
git checkout v3.8.0b3  # check out the tag we need

Note: If you don't have Git, you can download the source as a ZIP file directly from the GitHub website. After unzipping the download, the directory structure is as follows:

cpython/
│
├── Doc      ← Source for the documentation
├── Grammar  ← The computer-readable language definition
├── Include  ← The C header files (header files usually hold declarations that are reused)
├── Lib      ← Standard library modules written in Python
├── Mac      ← macOS support files
├── Misc     ← Miscellaneous files
├── Modules  ← Standard library modules written in C
├── Objects  ← Core types and the object model
├── Parser   ← The Python parser source code
├── PC       ← Windows build support files
├── PCbuild  ← Windows build support files for older Windows versions
├── Programs ← Source code for the python executable and other binaries
├── Python   ← The CPython interpreter source code
└── Tools    ← Standalone tools useful for building or extending Python

Next, we will compile CPython from source. This step requires a C compiler and some build tools. The steps differ between operating systems, so here I use macOS.

Compiling CPython on macOS is straightforward. In the terminal, run the following command to install the C compiler and toolkit:

$ xcode-select --install

This command pops up a prompt to download and install a set of tools, including Git, Make, and a C compiler (on macOS this is Apple's Clang, which also answers to the gcc command). You also need a working copy of OpenSSL, which is used to fetch packages from the PyPI.org website. If you later plan to use this build to install additional packages, SSL validation is required. The simplest way to install OpenSSL on macOS is with Homebrew. If you already have Homebrew installed, you can install CPython's dependencies with the brew install command:

$ brew install openssl xz zlib

Now that you have the dependencies, you can run the configure script in the CPython directory:

$ CPPFLAGS="-I$(brew --prefix zlib)/include" \
 LDFLAGS="-L$(brew --prefix zlib)/lib" \
 ./configure --with-openssl=$(brew --prefix openssl) --with-pydebug

In the command above, CPPFLAGS holds options for the C preprocessor, here the location of the zlib header files, and LDFLAGS holds options for the linker, here the location of the zlib library files. $(brew --prefix openssl) is command substitution: the shell runs the command inside the parentheses and substitutes its output, which is the OpenSSL installation path. You could run that command beforehand and paste its result in place of $(brew --prefix openssl) with the same effect. The backslash at the end of each line lets you break the line without executing the command, so the three lines run as a single command.

Running the command above generates a Makefile in the root of the repository, which you can use to automate the build process. The ./configure step only needs to be run once. You can build the CPython binary by running the following command:

$ make -j2 -s

The -j2 flag allows make to run 2 jobs simultaneously. If you have 4 cores, you can change this to 4. The -s flag stops the Makefile from printing every command it runs to the console. You can remove it, but the output becomes very verbose. During the build you may receive some errors, and the summary will tell you that not all packages could be built. For example, _dbm, _sqlite3, _uuid, nis, ossaudiodev, spwd, and _tkinter will fail to build with this set of instructions. If you aren't planning to develop against those packages, the errors are harmless. If you do need them, see devguide.python.org/. The build takes a few minutes and generates a new binary called python.exe. Every time you change the source code, you need to re-run make to recompile. The python.exe binary is the debug binary of CPython. Run it to see the version of the running Python:

$ ./python.exe
Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
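Once the build finishes, a quick way to confirm which interpreter you are running is to ask it from Python. This is a minimal sketch, assuming it is run with the freshly built ./python.exe. The describe_build name is made up for this example, and the sys.gettotalrefcount check relies on that function only being present in debug (--with-pydebug) builds.

```python
import sys
import sysconfig


def describe_build():
    """Collect a few facts about the interpreter that is running this code."""
    return {
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "implementation": sys.implementation.name,
        # sys.gettotalrefcount is only defined in debug builds of CPython.
        "debug_build": hasattr(sys, "gettotalrefcount"),
        # The arguments passed to ./configure (None where not recorded).
        "configure_args": sysconfig.get_config_var("CONFIG_ARGS"),
    }


info = describe_build()
for key, value in info.items():
    print(f"{key}: {value}")
```

On a build configured as above, debug_build should be reported as True; a stock release interpreter reports False.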

(Python 3.9 has in fact been released since then, and I compiled that version as well.)

What does a compiler do?

The purpose of a compiler is to convert one language into another. Think of it as a translator, for example one who renders the English "Hello" as its Chinese equivalent.

Some compilers compile into low-level machine code that can be executed directly on a system. Others compile into an intermediary language to be executed by a virtual machine. One important decision when choosing a compiler is the system portability requirement. Java and the .NET CLR compile into an intermediary language so that the compiled code is portable across system types. C, Go, C++, and Pascal compile into a low-level executable that only works on systems similar to the one it was compiled on. Python applications are typically distributed as source code and run directly with the python command; internally, the CPython runtime compiles your code when it runs. Most people think of Python as an interpreted language. Strictly speaking, it is actually compiled.

Python code is not compiled into machine code. It is compiled into a special low-level intermediary language called bytecode, which only CPython understands. In Python 3, the bytecode is cached in .pyc files inside the __pycache__ directory so that it can be executed quickly next time. So if you run the same Python application twice without changing the source code, the second run will usually start faster, because it loads the compiled bytecode instead of compiling the source again.
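To make the idea of bytecode concrete, the sketch below compiles a two-line program and lists the instruction names CPython would execute, then asks importlib where the cached .pyc for a module would be stored. The source string and the example.py file name are placeholders chosen for this illustration, and the exact instruction names vary between CPython versions.

```python
import dis
import importlib.util

# A tiny program to compile; any valid source would do.
source = "x = 1 + 2\nprint(x)\n"

# compile() turns source text into a code object holding bytecode.
code = compile(source, "<example>", "exec")

# Each instruction has a name such as LOAD_NAME or RETURN_VALUE.
instructions = [ins.opname for ins in dis.get_instructions(code)]
print(instructions)

# Where the bytecode cache for a (hypothetical) example.py would live.
cache_path = importlib.util.cache_from_source("example.py")
print(cache_path)
```

The raw bytes themselves are available as code.co_code; the dis module simply decodes them into readable names.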

Why is CPython written in C and not Python?

The C in CPython is a reference to the C programming language, implying that this Python distribution is written in C. The compiler in CPython is written in pure C. However, many of the standard library modules are written in pure Python or in a combination of C and Python.

So why is CPython written in C and not Python?

The answer lies in how compilers work. There are two types of compiler:

  • Self-hosted compilers are written in the language they compile, such as the Go compiler.
  • Source-to-source compilers are written in another language that already has a compiler.

This means that if you write a new programming language from scratch, you need an executable application to compile your compiler. You need a compiler to execute anything, so new languages are usually written first in an older, more established language, which also saves time and learning cost. A good example is Go: the first Go compiler was written in C, and once Go could be compiled, the compiler was rewritten in Go.

CPython kept its C heritage: many of the standard library modules, such as the ssl module or the socket module, are written in C to access low-level operating system APIs. The APIs in the Windows and Linux kernels for creating network sockets, working with the file system, or interacting with the display are all written in C. It therefore made sense for Python's extensibility layer to focus on the C language. Later in this article, we will cover the Python standard library and the C modules.

In addition, there is a Python compiler written in Python called PyPy. PyPy's logo is an Ouroboros, representing the self-hosting nature of the compiler.

Another example of a cross-compiler for Python is Jython. Jython is written in Java and compiles Python source code into Java bytecode. In the same way that CPython makes it easy to import C libraries and use them from Python, Jython makes it easy to import and reference Java modules and classes.

The Python language specification

The CPython source code contains the definition of the Python language. This is the reference specification used by all Python interpreters. The specification comes in both human-readable and machine-readable form. The documentation explains in detail what the Python language allows and how each statement should behave.

Documentation

Located in the Doc/reference directory are reStructuredText files that explain each feature of the Python language. They form the official Python reference guide on docs.python.org. Inside the directory are the files you need to understand the whole language, its structure, and its keywords:

cpython/Doc/reference
|
├── compound_stmts.rst
├── datamodel.rst
├── executionmodel.rst
├── expressions.rst
├── grammar.rst
├── import.rst
├── index.rst
├── introduction.rst
├── lexical_analysis.rst
├── simple_stmts.rst
└── toplevel_components.rst

In compound_stmts.rst you can see a simple example defining the with statement. The with statement can be used in several ways in Python, the simplest being the instantiation of a context manager and a nested block of code:

with x():
   ...

You can bind the result to a name using as:

with x() as y:
   ...

You can also chain several context managers in one statement:

with x() as y, z() as jk:
   ...
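The three forms above can be tried out directly. In this sketch, x and z are stand-ins built on a hypothetical managed() helper (none of these names come from the CPython documentation); each one logs when it is entered and exited so the ordering is visible.

```python
from contextlib import contextmanager

events = []  # records the order in which things happen


@contextmanager
def managed(name):
    """Hypothetical context manager that logs when it is entered and exited."""
    events.append(f"enter {name}")
    try:
        yield name
    finally:
        events.append(f"exit {name}")


def x():
    return managed("x")


def z():
    return managed("z")


# Form 1: a context manager and a nested block.
with x():
    events.append("body 1")

# Form 2: bind the value produced by the context manager to a name.
with x() as y:
    events.append(f"body 2 got {y}")

# Form 3: several context managers chained in one statement.
with x() as y, z() as jk:
    events.append(f"body 3 got {y} and {jk}")

print("\n".join(events))
```

In the chained form the managers are entered left to right and exited in reverse order, just as if the with statements had been nested.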

Next, we will explore the computer-readable definition of the Python language.

Grammar

The documentation contains the human-readable specification, while the machine-readable specification is housed in a single file, Grammar/Grammar. The Grammar file is written in a notation called Backus-Naur Form (BNF). BNF is not specific to Python and is often used as the notation for grammars in many other languages. The concept of grammatical structure in programming languages was inspired by Noam Chomsky's work on Syntactic Structures in the 1950s. Python's grammar file uses the Extended BNF (EBNF) specification with regular-expression syntax. So, in the grammar file you can use:

  • * for repetition
  • + for at-least-once repetition
  • [] for optional parts
  • | for alternatives
  • () for grouping

If you search for the with statement, you will see its definition:

.. productionlist::
   with_stmt: "with" `with_item` ("," `with_item`)* ":" `suite`
   with_item: `expression` ["as" `target`]

Everything in quotes is a string literal, which is how keywords are defined. So with_stmt is specified as:

  1. Starting with the word with
  2. Followed by a with_item, which is a test and, optionally, the word as and an expression
  3. Followed by any number of further items, each separated by a comma
  4. Ending with the character :
  5. Followed by a suite

These lines refer to some other definitions:

  • suite is a block of code with one or more statements.
  • test is a simple statement that is evaluated.
  • expr is a simple expression.

(The snippet quoted above is the form used in the reference documentation, where the two parts of a with_item are named expression and target; test and expr are the names used in Grammar/Grammar.) If you want to explore these in detail, the whole Python grammar is defined in this single file.
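One way to see the with_stmt rule at work, without reading the parser itself, is to parse a with statement and inspect the result with the standard ast module. This is only an illustrative sketch: the sample source line is made up, and the mapping of context_expr and optional_vars onto the two halves of with_item is my reading of the rule rather than something stated in the grammar file.

```python
import ast

# Two with_items separated by a comma, then ":" and a suite.
tree = ast.parse("with x() as y, z() as jk:\n    ...\n")
with_node = tree.body[0]

# Each item holds the expression before "as" and the optional target after it.
items = [
    (
        ast.dump(item.context_expr),
        item.optional_vars.id if item.optional_vars is not None else None,
    )
    for item in with_node.items
]
targets = [target for _, target in items]

print(type(with_node).__name__, "with", len(with_node.items), "items")
print("targets:", targets)
print("statements in suite:", len(with_node.body))
```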

For a recent example of how the grammar is used, see PEP 572, which added the := operator. It appears as a new COLONEQUAL entry:

  ATEQUAL                 '@='
  RARROW                  '->'
  ELLIPSIS                '...'
+ COLONEQUAL              ':='
  OP
  ERRORTOKEN
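The effect of that one-line addition is visible from Python itself. The sketch below assumes Python 3.8 or newer, where the token module exposes COLONEQUAL; the small program passed to exec() is made up for the example.

```python
import token

# COLONEQUAL is the token registered for ":=" by PEP 572.
walrus_name = token.tok_name[token.COLONEQUAL]
print(walrus_name)

# With the token and grammar rule in place, assignment expressions compile and run.
namespace = {}
exec("if (n := 10) > 5:\n    result = n\n", namespace)
print(namespace["result"])
```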

Using pgen

The Grammar file itself is never used by the Python compiler. Instead, a tool called pgen creates a parser table from it: pgen reads the grammar file and converts it into a parser table. If you change the grammar file, you must regenerate the parser table and recompile Python.

Note: The pgen application was rewritten from C to pure Python in Python 3.8.

To see pgen in action, let's change part of the Python grammar, then recompile and run Python. The Grammar directory contains two files, Grammar and Tokens. Search Grammar for pass_stmt and you will see this:

pass_stmt: 'pass'

Change that line so that either 'pass' or 'proceed' is accepted as the keyword:

pass_stmt: 'pass' | 'proceed'

In the root directory of CPython, run make regen-grammar to run pgen over the altered grammar file. You should see output similar to the following, showing that new Include/graminit.h and Python/graminit.c files have been generated. Here is part of the output:

# Regenerate Include/graminit.h and Python/graminit.c
# from Grammar/Grammar using pgen
PYTHONPATH=. python3 -m Parser.pgen ./Grammar/Grammar \
        ./Grammar/Tokens \
        ./Include/graminit.h.new \
        ./Python/graminit.c.new
python3 ./Tools/scripts/update_file.py ./Include/graminit.h ./Include/graminit.h.new
python3 ./Tools/scripts/update_file.py ./Python/graminit.c ./Python/graminit.c.new

With the regenerated parser table in place, you need to recompile CPython to see the new syntax. Use the same compilation steps you used earlier for your operating system.

make -j4 -s

If the code compiles successfully, run the new CPython binary to start a REPL.

./python.exe

In the REPL, you can now try defining a function that uses the proceed keyword you compiled into the Python grammar in place of the pass statement:

Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def example():
...    proceed
... 
>>> example()

This is what I got when I ran it. Interestingly, there is no error.
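It is worth knowing what an unmodified interpreter does with the same function, because the definition alone does not fail there either. The sketch below is a small probe, not part of the original walkthrough: it defines example() from source and calls it. On a stock Python, proceed is parsed as an ordinary name, so the call raises NameError; on the rebuilt interpreter described here, the call should return quietly.

```python
SOURCE = "def example():\n    proceed\n"

namespace = {}
# Defining the function succeeds on any interpreter: a bare name is a valid statement.
exec(compile(SOURCE, "<proceed-demo>", "exec"), namespace)

try:
    namespace["example"]()
    outcome = "proceed was accepted as a statement"
except NameError:
    outcome = "proceed was treated as an ordinary (undefined) name"

print(outcome)
```

So the meaningful check is the call, example(), rather than the def itself.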

Next, we'll look at the Tokens file and its relationship to Grammar.

Tokens

Alongside the grammar file in the Grammar folder is a Tokens file, which contains each of the unique types found as leaf nodes in a parse tree. We'll cover parse trees in depth later. Each token also has a name and a generated unique ID. The names make it simpler to refer to tokens in the tokenizer.

Note: The Tokens file is a new feature in Python 3.8.

For example, the left parenthesis is called LPAR, and the semicolon is called SEMI. You'll see these tokens later in this article:

LPAR                    '('
RPAR                    ')'
LSQB                    '['
RSQB                    ']'
COLON                   ':'
COMMA                   ','
SEMI                    ';'
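The same name-to-symbol table can be queried at runtime. This short sketch assumes Python 3.8 or newer, where the token module provides the EXACT_TOKEN_TYPES mapping; the symbols list simply repeats the entries shown above.

```python
import token

symbols = ["(", ")", "[", "]", ":", ",", ";"]

# Map each symbol to its numeric ID, then to its name.
names = {
    symbol: token.tok_name[token.EXACT_TOKEN_TYPES[symbol]]
    for symbol in symbols
}

for symbol, name in names.items():
    print(f"{name:<8} {symbol!r}")
```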

As with the Grammar file, if you change the Tokens file, you need to run pgen again. To see tokens in action, you can use the tokenize module in CPython. Create a simple Python script called test_tokens.py:

# Hello world!
def my_function():
   proceed

Then pass this file through tokenize, a module built into the standard library, for example with ./python.exe -m tokenize -e test_tokens.py. You will see the list of tokens by line and character. The -e flag outputs the exact token names:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,14:           COMMENT        '# Hello world!'
1,14-1,15:          NL             '\n'
2,0-2,3:            NAME           'def'
2,4-2,15:           NAME           'my_function'
2,15-2,16:          LPAR           '('
2,16-2,17:          RPAR           ')'
2,17-2,18:          COLON          ':'
2,18-2,19:          NEWLINE        '\n'
3,0-3,3:            INDENT         '   '
3,3-3,7:            NAME           'proceed'
3,7-3,8:            NEWLINE        '\n'
4,0-4,0:            DEDENT         ''
4,0-4,0:            ENDMARKER      ''

In the output, the first column is the range of line/column coordinates, the second column is the name of the token, and the last column is the value of the token.

The tokenize module has also implied some tokens that were not in the file: the ENCODING token for utf-8, a blank line at the end, a DEDENT to close the function declaration, and an ENDMARKER to end the file. The tokenize module is written in pure Python and is located in Lib/tokenize.py in the CPython source code.
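The same listing can be produced programmatically, which makes the implied tokens easy to spot. A minimal sketch: the script is held in a bytes literal instead of being read from test_tokens.py, and the rows and names variables are made up for this example.

```python
import io
import token
import tokenize

# The contents of test_tokens.py, as bytes so that tokenize detects the encoding.
SOURCE = b"# Hello world!\ndef my_function():\n   proceed\n"

rows = [
    (
        f"{tok.start[0]},{tok.start[1]}-{tok.end[0]},{tok.end[1]}:",
        token.tok_name[tok.exact_type],  # exact name, as with the -e flag
        tok.string,
    )
    for tok in tokenize.tokenize(io.BytesIO(SOURCE).readline)
]

for position, name, value in rows:
    print(f"{position:<20}{name:<15}{value!r}")

names = [name for _, name, _ in rows]
```

Because the Python tokenizer only splits text into tokens, it reports proceed as a plain NAME whether or not the grammar knows it as a keyword.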

Important note: There are two tokenizers in the CPython source code: one written in Python, demonstrated above, and another written in C. The one written in Python is meant as a utility, and the one written in C is used by the Python compiler. They have identical output and behavior. The C version is designed for performance, and the Python module is designed for debugging.

To see a verbose readout of the C tokenizer, you can run Python with the -d flag. Using the test_tokens.py script created earlier, run it with the following command:

./python.exe -d test_tokens.py

The output looks like this:

Token NAME/'def' ... It's a keyword
 DFA 'file_input', state 0: Push 'stmt'
 DFA 'stmt', state 0: Push 'compound_stmt'
 DFA 'compound_stmt', state 0: Push 'funcdef'
 DFA 'funcdef', state 0: Shift.
Token NAME/'my_function' ... It's a token we know
 DFA 'funcdef', state 1: Shift.
Token LPAR/'(' ... It's a token we know
 DFA 'funcdef', state 2: Push 'parameters'
 DFA 'parameters', state 0: Shift.
Token RPAR/')' ... It's a token we know
 DFA 'parameters', state 1: Shift.
  DFA 'parameters', state 2: Direct pop.
Token COLON/':' ... It's a token we know
 DFA 'funcdef', state 3: Shift.
Token NEWLINE/'' ... It's a token we know
 DFA 'funcdef', state 5: [switch func_body_suite to suite] Push 'suite'
 DFA 'suite', state 0: Shift.
Token INDENT/'' ... It's a token we know
 DFA 'suite', state 1: Shift.
Token NAME/'proceed' ... It's a keyword
 DFA 'suite', state 3: Push 'stmt'
...
  ACCEPT.

In the output, you can see that proceed is highlighted as a keyword. In the next chapter, we'll see how executing the Python binary gets to the tokenizer and what happens from there to execute your code. Now that you have an overview of the Python grammar and the relationship between tokens and statements, note that there is a way to convert the pgen output into an interactive graph. Here is a screenshot of the Python 3.8a2 grammar:

[Image: graph of the Python 3.8a2 grammar]

Don't worry if you can't make it out clearly. The Python package used to generate this graph, instaviz, is introduced in a later chapter.

Memory management in CPython

Throughout this article, you will see references to a PyArena object.

The arena is one of CPython's memory management structures. The code is in Python/pyarena.c and contains a wrapper around C's memory allocation and deallocation functions.

When writing a C program, the developer must allocate memory for data structures before writing data into them. This allocation marks the memory as belonging to the process with the operating system. It is also up to the developer to deallocate, or "free," the allocated memory when it is no longer used and return it to the operating system's table of free memory blocks. If a process allocates memory for a variable, say within a function or a loop, that memory is not automatically given back to the operating system in C when the function completes. So if it is not explicitly deallocated in the C code, it causes a memory leak. The process takes more memory each time the function runs until, eventually, the system runs out of memory and crashes!

Python takes that responsibility away from the programmer and uses two algorithms: a reference counter and a garbage collector. Whenever an interpreter is instantiated, a PyArena is created and attached to the interpreter. During the lifecycle of a CPython interpreter, many arenas may be allocated. They are connected with a linked list.

The arena stores a list of pointers to Python objects as a PyListObject. Whenever a new Python object is created, a pointer to it is added using PyArena_AddPyObject. This function call stores the pointer in the arena's list, a_objects. The PyArena serves a second function, which is to allocate and reference a list of raw memory blocks. For example, a PyList needs extra memory if you add thousands of additional values. The C code for PyList does not allocate memory directly. The object gets raw blocks of memory from the PyArena by calling PyArena_Malloc from the PyObject with the required memory size. This task is completed by another abstraction in Objects/obmalloc.c. In the object allocation module, memory can be allocated, freed, and reallocated for a Python object. A linked list of allocated blocks is stored inside the arena, so that when an interpreter is stopped, all managed memory blocks can be released in one go using PyArena_Free.
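The two roles described above, tracking object pointers and handing out raw blocks that are all released together, can be pictured with a toy model. This is only an illustrative sketch in Python, not CPython's C implementation: the ToyArena class and its method names are hypothetical, and a bytearray stands in for a raw memory block.

```python
class ToyArena:
    """Toy model of an arena: it tracks blocks and objects, then drops them together."""

    def __init__(self):
        self.blocks = []   # stands in for the linked list of raw memory blocks
        self.objects = []  # stands in for the a_objects pointer list

    def malloc(self, size):
        """Hand out a block of the requested size and remember it."""
        block = bytearray(size)
        self.blocks.append(block)
        return block

    def add_object(self, obj):
        """Remember a reference to an object that lives as long as the arena."""
        self.objects.append(obj)

    def free(self):
        """Release everything the arena manages in one go."""
        released = len(self.blocks) + len(self.objects)
        self.blocks.clear()
        self.objects.clear()
        return released


arena = ToyArena()
buffer = arena.malloc(64)
arena.add_object("some object")
arena.add_object(["another", "object"])
released = arena.free()
print("released", released, "tracked items")
```

The point of the design is the last call: callers never free individual blocks, they only discard the whole arena.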

Take PyListObject as an example. If you .append() an object to the end of a Python list, you usually don't need to reallocate memory, because the list uses memory it has already set aside. The .append() method calls list_resize(), which handles memory allocation for lists. Each list object keeps track of the amount of memory it has allocated. If the item to append fits inside the existing free memory, it is simply added. If the list needs more memory, it is expanded. The allocated length grows in the pattern 0, 4, 8, 16, 25, 35, 46, 58, 72, 88.
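As a rough check of that growth pattern, the sketch below reproduces it in Python. It assumes the over-allocation rule used by list_resize() in the CPython 3.8 sources (the new size, plus one eighth of it, plus a small constant); the helper names are made up for this example, and later CPython versions adjust the formula.

```python
def next_capacity(newsize):
    """Capacity chosen when a list must grow to hold newsize items (3.8-style rule)."""
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)


def growth_pattern(appends):
    """Simulate appending items one at a time and record each new capacity."""
    capacities = [0]
    allocated = 0
    for size in range(1, appends + 1):
        if size > allocated:  # no free slot left, so the list is resized
            allocated = next_capacity(size)
            capacities.append(allocated)
    return capacities


pattern = growth_pattern(80)
print(pattern)
```

Because each resize leaves spare slots, most appends find room already available and only a handful trigger a reallocation.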

Calling PyMem_Realloc expands the memory allocated to a list. PyMem_Realloc is an API wrapper for pymalloc_realloc. Python also has a special wrapper for the C call malloc, which sets the maximum size of a memory allocation to help prevent buffer overflow errors (see PyMem_RawMalloc). In summary:

  • Allocation of raw memory blocks is done via PyMem_RawMalloc.
  • The pointers to Python objects are stored within the PyArena.
  • The PyArena also stores a linked list of allocated memory blocks.

For more information on the API, see the CPython documentation.

Reference counting

To create a variable in Python, you assign a value to a uniquely named variable:

my_variable = 180392

Whenever a value is assigned to a variable in Python, the name of the variable is checked within the locals and globals scope to see if it already exists. Because my_variable is not already in the locals() or globals() dictionary, a new object is created and given the value of the numeric constant 180392. There is now one reference to my_variable, so its reference counter is incremented by 1. You will see the calls Py_INCREF and Py_DECREF throughout the C source code of CPython. These increment and decrement the count of references to an object. References to an object are decremented when a variable falls outside the scope in which it was declared. Scope in Python can refer to a function or method, a comprehension, or a lambda function. These are some of the more obvious scopes, but there are many other implicit scopes, such as passing variables to a function call. The handling of incrementing and decrementing references is in the CPython compiler and the core execution loop, ceval.c. We'll cover this in more detail later in this article.

Whenever Py_DECREF is called and the counter becomes 0, the PyObject_Free function is called. For that object, PyArena_Free is called for all of the memory that was allocated.
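Reference counts can be observed from Python with sys.getrefcount(). The sketch below uses a list instead of an integer constant so that the numbers are not affected by constants the interpreter shares internally; the variable names are made up for this example. Only the differences between the readings are meaningful, since getrefcount() itself holds a temporary reference while it runs.

```python
import sys

data = []                              # one name refers to the new list
baseline = sys.getrefcount(data)

alias = data                           # a second name for the same object
after_alias = sys.getrefcount(data)    # one reference more than before

del alias                              # dropping the name removes that reference
after_del = sys.getrefcount(data)

print(baseline, after_alias, after_del)
```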

Garbage collection

CPython's garbage collector is enabled by default. It runs in the background and works to deallocate memory that was used for objects no longer in use. Because the garbage collection algorithm is a lot more complex than the reference counter, it doesn't run all the time; otherwise it would consume a huge amount of CPU resources. Instead it runs periodically, after a set number of operations. CPython's standard library comes with a Python module to interface with the arena and the garbage collector, the gc module. Here is how to use the gc module in debug mode:

>>> import gc
>>> gc.set_debug(gc.DEBUG_STATS)

This prints statistics whenever the garbage collector runs. You can call get_threshold to get the thresholds at which the garbage collector runs:

>>> gc.get_threshold()
(700, 10, 10)

You can also get the current threshold count :

>>> gc.get_count()
(688, 1, 1)

Finally, you can run the collection algorithm manually:

>>> gc.collect()
24

This calls collect() in the Modules/gcmodule.c file, which contains the implementation of the garbage collector algorithm.
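A reference cycle is the classic case that reference counting alone cannot clean up and the garbage collector can. The sketch below builds one on purpose; the Node class is hypothetical and exists only for this example, and the exact number returned by gc.collect() depends on the CPython version.

```python
import gc


class Node:
    """Hypothetical object that can point at a partner."""

    def __init__(self):
        self.partner = None


gc.collect()                   # clear out any garbage left from earlier code
a, b = Node(), Node()
a.partner, b.partner = b, a    # a and b now reference each other
del a, b                       # the counts stay above zero because of the cycle
unreachable = gc.collect()     # the cycle detector finds and frees them
print("unreachable objects found:", unreachable)
```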

Conclusion

In Part 1, we covered the structure of the source code repository, how to compile from source, and the Python language specification. These core concepts will be critical in Part 2, as you dive deeper into the Python interpreter process.

The original post is in my Zhihu column: zhuanlan.zhihu.com/p/79656976

copyright notice
Author: cxapython. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201291949469211.html