Update: The paper version of this post is accepted at PLDI SOAP 2024. Take a look at the preprint (PDF).
PyPy is an alternative implementation of the Python language. PyPy’s C API compatibility layer has some performance issues. CF Bolz-Tereick and I are working on a way to make PyPy’s C API interactions much faster. It’s looking very promising. Here’s a sketch of how it works.
Python is pretty widely-used. For years, CPython was the only implementation, and CPython was not designed to be fast. The Python community needed some programs to go faster and determined that the best path forward was to write some modules in C and interact with them from Python. Worked just fine.
Then other Python runtimes like PyPy came along. PyPy includes a JIT compiler and its execution of normal Python code is very fast, at least until it hits a call from Python to a C extension function. Then things go a little bit sideways. First and foremost, the PyPy JIT can’t “see into” the native code: it’s generated outside of the JIT by some other compiler and is therefore opaque. And second, the binding API for the aforementioned C modules (“the C API”) requires a totally different object and memory representation than PyPy has internally.
PyPy has its own object model, runtime, and moving garbage collector, all to get better performance. Unfortunately, this means that whenever you call a C API function from PyPy, it has to stop what it’s doing, set up some C API scaffolding, do the C API call, and then take down the scaffolding.
For example, the C API is centered around PyObject pointers. PyPy does not use PyObjects in the normal execution of Python code. When it interacts with C API code, it has to allocate a PyObject, make it point into the PyPy heap, call the C API function, and then (potentially) free the PyObject. (This ignores GIL stuff and exception checking, which is also an issue.)
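To get a feel for just the box/unbox round trip (setting aside the PyPy-specific work of linking proxies into its moving heap, which this CPython-flavored sketch can’t show), here is a small self-contained program; the loop count is arbitrary:

#include <Python.h>

int main(void) {
  Py_Initialize();
  long total = 0;
  for (long i = 0; i < 1000000; i++) {
    // Box a C long into a PyObject and immediately unbox it again:
    // this round trip sits on every C API call boundary.
    PyObject* box = PyLong_FromLong(i);
    total += PyLong_AsLong(box);
    Py_DECREF(box);
  }
  printf("%ld\n", total);
  Py_Finalize();
  return 0;
}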
That’s a lot of overhead. (And there’s more, too. See Antonio Cuni’s excellent blog post.) And it’s a hard problem that has bitten multiple alternative Python runtimes1.
In addition to the overhead of boxing into a PyObject, the underlying C function that the C extension calls may not even need the PyObject to exist in the first place. For example, a lot of C API functions are structured like this:
long inc_impl(long num) {
  return num + 1;
}

PyObject* inc(PyObject* obj) {
  long num = PyLong_AsLong(obj);
  if (num == -1 && PyErr_Occurred()) {
    return NULL;
  }
  long result = inc_impl(num);
  return PyLong_FromLong(result);
}
In this example, the PyObject* function inc is only a wrapper around another function inc_impl that works directly on C integers. All of the bits in the middle between the JIT and the C implementation (the entire inc function, really) are “wasted work” because the work is not needed for the actual execution of the user’s program.
So even if the PyPy JIT is doing great work and has eliminated memory allocation in Python code—PyPy could have unboxed some heap-allocated Python integer object into a C long, for example—it still has to heap-allocate a PyObject for the C API… only to throw it away soon after.
If there were a way to communicate to PyPy that inc expects a long and is going to unbox it into a C long (and will also return a C long), it wouldn’t need to do any of these shenanigans.
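A minimal sketch of what that would mean, with a made-up main for illustration: the JIT already holds the argument as an unboxed C long, so it could call the underlying function directly.

#include <stdio.h>

long inc_impl(long num) {
  return num + 1;
}

int main(void) {
  // The JIT already has the argument unboxed as a C long...
  long unboxed = 41;
  // ...so it could call inc_impl directly: no PyObject allocation,
  // no PyLong_AsLong, no PyErr_Occurred check, no PyLong_FromLong.
  long result = inc_impl(unboxed);
  printf("%ld\n", result);
  return 0;
}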
And yes, ideally there wouldn’t be a C API call at all. But sometimes you have to make one (perhaps because you have no control over the code), and you might as well speed up that call. Let’s see if it can be done.
This is where I come in. I was at the ECOOP conference in 2022, where CF introduced me to some other implementors of alternative Python runtimes. I got to talk to the authors of PyPy and ZipPy and GraalPython over some coffee and beer. They’re really nice.
They’ve been collectively working on a project called HPy. HPy is a new design for a Python C API that takes alternative runtimes into account. As part of this design, they were investigating a way to pipe type information from C extension modules through the C API and into a place where the host runtime can read it.
It’s a tricky problem because not only is there a C API, but also a C ABI (note the “B” for “binary”). While an API is an abstract contract between caller and callee for how to call a function, an ABI is more concrete. In the case of the C ABI, it means not changing struct layouts or sizes, adding function parameters, things like that. This is kind of a tight constraint and it wasn’t clear what the best backward-compatible way to add type information was.
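To see why growing a struct is off the table, consider this hypothetical sketch (the struct names are made up, not real CPython definitions): adding a field changes the struct’s size, and with it the stride of every statically allocated array of method definitions baked into already-compiled extensions.

#include <stdio.h>

// Hypothetical "version 1" of a method definition struct...
struct MethodDefV1 {
  const char* name;
  void* func;
};

// ...and a "version 2" that naively adds a field. The size changes, so
// an array of these compiled against the old header has the wrong
// stride from the new runtime's point of view, and vice versa.
struct MethodDefV2 {
  const char* name;
  void* func;
  int type_info;
};

int main(void) {
  printf("v1: %zu bytes, v2: %zu bytes\n",
         sizeof(struct MethodDefV1), sizeof(struct MethodDefV2));
  return 0;
}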
Sometime either in this meeting or shortly after, I had an idea for how to do it without changing the API or ABI and I decided to take a stab at implementing it for Cinder (the Python runtime I was working on at the time).
In order to better understand the problem, let’s take a look at the kind of type metadata we want to attach to each typed method, represented as a C struct.
struct PyPyTypedMethodMetadata {
  int arg_type;
  int ret_type;
  void* underlying_func;
};
typedef struct PyPyTypedMethodMetadata PyPyTypedMethodMetadata;
In this artificially limited example, we store the type information for one argument (but more could be added in the future), the type information for the return value, and the underlying (non-PyObject*) C function pointer.
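For inc, a filled-in instance might look like this sketch (it reuses the struct above and the inc_impl function from earlier; the T_C_LONG tag matches the patch shown later in this post, but the enum definition here is assumed):

// Assumed type-tag enum; the real patch defines something similar.
enum TypeTag { T_C_LONG };

// Metadata for inc: takes a C long, returns a C long, and the
// underlying function to call is inc_impl.
PyPyTypedMethodMetadata inc_sig = {
  .arg_type = T_C_LONG,
  .ret_type = T_C_LONG,
  .underlying_func = (void*)inc_impl,
};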
But it’s not clear where to put that in a PyMethodDef. The existing PyMethodDef struct looks like this. It contains a little bit of metadata and a C function pointer (the PyObject* one). In an ideal world, we would “just” add the type metadata to this struct and be done with it. But we can’t change its size for ABI reasons.
struct PyMethodDef {
  const char *ml_name;  /* The name of the built-in function/method */
  PyCFunction ml_meth;  /* The C function that implements it */
  int ml_flags;         /* Combination of METH_xxx flags, which mostly
                           describe the args expected by the C func */
  const char *ml_doc;   /* The __doc__ attribute, or NULL */
};
typedef struct PyMethodDef PyMethodDef;
What to do? Well, I decided to get a little weird with it and see if we could sneak in a pointer to the metadata somehow. My original idea was to put the entire PyPyTypedMethodMetadata struct behind the PyMethodDef struct (kind of like how malloc works), but that wouldn’t work so well: PyMethodDefs are commonly statically allocated in arrays, and we can’t change the layout of those arrays.
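As a refresher on the malloc analogy: allocators often stash a bookkeeping header just before the pointer they hand out and recover it later with pointer arithmetic. Here is a toy, self-contained version with made-up names:

#include <stdio.h>
#include <stdlib.h>

// A toy "header behind the pointer" allocator, purely for illustration.
struct Header {
  size_t size;
};

void* my_alloc(size_t size) {
  struct Header* h = malloc(sizeof(struct Header) + size);
  if (h == NULL) return NULL;
  h->size = size;
  return h + 1;  // hand out the memory just past the header
}

size_t my_size(void* ptr) {
  return ((struct Header*)ptr - 1)->size;  // read backwards to the header
}

int main(void) {
  void* p = my_alloc(42);
  printf("%zu\n", my_size(p));  // prints 42
  free((struct Header*)p - 1);
  return 0;
}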
But what we can do is point the ml_name field to a buffer inside another struct2.
Then, when we notice that a method is marked as typed (with a new METH_TYPED flag we can add to the ml_flags bitset), we can read backwards to find the PyPyTypedMethodMetadata struct. Here’s how you might do that in C:
struct PyPyTypedMethodMetadata {
  int arg_type;
  int ret_type;
  void* underlying_func;
  const char ml_name[100];  // New field!
};
typedef struct PyPyTypedMethodMetadata PyPyTypedMethodMetadata;

PyPyTypedMethodMetadata*
GetTypedSignature(PyMethodDef* def)
{
  assert(def->ml_flags & METH_TYPED);  // A new type of flag!
  return (PyPyTypedMethodMetadata*)(def->ml_name - offsetof(PyPyTypedMethodMetadata, ml_name));
}
And here’s a diagram to illustrate this because it’s really weird and confusing.
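To make the layout concrete, here’s a minimal, self-contained sketch of the trick with stand-in definitions (no Python.h; the METH_TYPED value and the type tags are made up for illustration):

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Stand-in definitions so this sketch compiles without Python.h.
#define METH_TYPED 0x0400  // hypothetical flag value
struct PyMethodDef {
  const char* ml_name;
  void* ml_meth;
  int ml_flags;
  const char* ml_doc;
};

struct PyPyTypedMethodMetadata {
  int arg_type;
  int ret_type;
  void* underlying_func;
  const char ml_name[100];
};
typedef struct PyPyTypedMethodMetadata PyPyTypedMethodMetadata;

long inc_impl(long num) { return num + 1; }

// The metadata sits in memory directly before its ml_name buffer...
PyPyTypedMethodMetadata inc_sig = {
  .arg_type = 1, .ret_type = 1,  // made-up type tags
  .underlying_func = (void*)inc_impl,
  .ml_name = "inc",
};

// ...and the method table points into that buffer, so the layout of the
// PyMethodDef array itself never changes.
struct PyMethodDef inc_def = {inc_sig.ml_name, NULL, METH_TYPED, "Add one"};

PyPyTypedMethodMetadata* GetTypedSignature(struct PyMethodDef* def) {
  assert(def->ml_flags & METH_TYPED);
  return (PyPyTypedMethodMetadata*)(def->ml_name -
                                    offsetof(PyPyTypedMethodMetadata, ml_name));
}

int main(void) {
  PyPyTypedMethodMetadata* sig = GetTypedSignature(&inc_def);
  long (*f)(long) = (long (*)(long))sig->underlying_func;
  printf("%s(41) = %ld\n", sig->ml_name, f(41));  // prints inc(41) = 42
  return 0;
}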
I started off with a mock implementation of this in C (no Python C API, just fake structures to sketch it out) and it worked. So I implemented a hacky version of it in Cinder, but never shipped it because my integration with Cinder was a little too hacky. I wrote up the ideas for posterity in case someone wanted to take up the project.
A year later, nobody else had, so I decided to poke CF and see if we could implement it in PyPy. We’ll see how that implementation looks in a minute. But first, an aside on where C extensions come from.
Well, in PyPy, there are none in the standard library: PyPy’s standard library is almost entirely written in Python so that the code is visible to the JIT. But people like using Python packages, and some Python packages contain C extensions.
There are a couple of different ways to write a C extension. The “simplest” (as in, all the components are visible and there is no magic and there are no external dependencies) is to hand-write it. If you don’t want to do that, you can also use a binding generator to write the glue code for you. I have the most experience with Cython, but other binding generators like nanobind, pybind11, and even CPython’s own Argument Clinic exist too!
Let’s recall the inc/inc_impl function from earlier. That’s a reasonable example of a function that could be integrated as a hand-written C extension to Python. In order to make it callable from Python, we have to make a full C extension module. In this case, that’s just a list of function pointers and how to call them.
#include <Python.h>

long inc_impl(long arg) {
  return arg + 1;
}

PyObject* inc(PyObject* module, PyObject* obj) {
  (void)module;
  long obj_int = PyLong_AsLong(obj);
  if (obj_int == -1 && PyErr_Occurred()) {
    return NULL;
  }
  long result = inc_impl(obj_int);
  return PyLong_FromLong(result);
}

static PyMethodDef mytypedmod_methods[] = {
    {"inc", inc, METH_O, "Add one to an int"},
    {NULL, NULL, 0, NULL}};

static struct PyModuleDef mytypedmod_definition = {
    PyModuleDef_HEAD_INIT, "mytypedmod",
    "A C extension module with type information exposed.", -1,
    mytypedmod_methods,
    NULL,
    NULL,
    NULL,
    NULL};

PyMODINIT_FUNC PyInit_mytypedmod(void) {
  PyObject* m = PyState_FindModule(&mytypedmod_definition);
  if (m != NULL) {
    Py_INCREF(m);
    return m;
  }
  return PyModule_Create(&mytypedmod_definition);
}
We have an array of PyMethodDef structs, one for each method we want to wrap. Then we have a PyModuleDef to define the module, which can also contain attributes and some other stuff. Then we provide a sort of __new__ function for the module, in the form of a PyInit_ function, which the C extension loader built into Python finds with dlopen and dlsym.
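As a rough sketch of that discovery step (the shared-object filename below is made up; the real loader derives it from the module name, Python version, and platform, and does much more error handling):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  // Build with `cc loader.c -ldl`. The filename is for illustration only.
  void* handle = dlopen("./mytypedmod.cpython-310-x86_64-linux-gnu.so",
                        RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  // The loader looks up the module's PyInit_ function by name; calling
  // it (inside an initialized Python runtime) creates the module object.
  void* init = dlsym(handle, "PyInit_mytypedmod");
  if (init == NULL) {
    fprintf(stderr, "dlsym failed: %s\n", dlerror());
    return 1;
  }
  printf("found PyInit_mytypedmod at %p\n", init);
  dlclose(handle);
  return 0;
}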
It’s possible to manually augment this module by adding a PyPyTypedMethodMetadata struct and a METH_TYPED flag. It’s a little cumbersome, but if it speeds up interactions with the module… well, extension authors might be cajoled into adding the type information or at least accepting a pull request.
But not all extensions are hand-written. Many are generated by binding generators like Cython. And Cython is interesting because it can generate the type signatures automatically…
Unlike many other binding generators for Python, Cython provides a fully-featured Python-like programming language that compiles to C. The types obey different rules than in normal Python code and can be used for optimization. Cython also has primitive types. Let’s see an example.
In this snippet of Cython code, we make a function that adds two machine integers and returns a machine integer.
cpdef int add(int a, int b):
    return a + b
Cython will generate a very fast C function that adds two machine integers. Calls to this from Cython are type checked at compile time and will be as fast as your C compiler allows:
static int add(int __pyx_v_a, int __pyx_v_b) {
  return __pyx_v_a + __pyx_v_b;
}
Since we used cpdef instead of cdef, Cython will also generate a wrapper C extension function so that this function can be called from Python.
This means that the generated Cython wrapper code looks something like (a much uglier version of) the code below. You don’t need to understand or really even read the big blob of cleaned-up and annotated generated code below. You just need to say “ooh” and “aah” and “wow, so many if-statements and so much allocation and so many function calls.”
And it’s also a little worse than the METH_O example above since it has to unwrap an array of fastcall args and do some argument processing.
static PyObject *add_and_box(CYTHON_UNUSED PyObject *__pyx_self,
                             int __pyx_v_a,
                             int __pyx_v_b) {
  int result = add(__pyx_v_a, __pyx_v_b);
  // Check if an error occurred (unnecessary in this case)
  if (result == ((int)-1) && PyErr_Occurred()) {
    return NULL;
  }
  // Wrap result in PyObject*
  return PyLong_FromLong(result);
}

static PyObject *add_python(PyObject *__pyx_self,
                            PyObject *const *__pyx_args,
                            Py_ssize_t __pyx_nargs,
                            PyObject *__pyx_kwds) {
  // Check how many arguments were given
  PyObject* values[2] = {0,0};
  if (__pyx_nargs == 2) {
    values[1] = __pyx_args[1];
    values[0] = __pyx_args[0];
  } else if (__pyx_nargs == 1) {
    values[0] = __pyx_args[0];
  }
  // Check if any keyword arguments were given
  Py_ssize_t kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
  // Match up mix of positional/keyword args to parameters
  if (__pyx_nargs == 0) {
    // ...
  } else if (__pyx_nargs == 1) {
    // ...
  } else if (__pyx_nargs == 2) {
    // ...
  } else {
    // ...
  }
  // Unwrap PyObject* args into C int
  int __pyx_v_a = PyLong_AsLong(values[0]);
  // Check for error (unnecessary if we know it's an int)
  if ((__pyx_v_a == (int)-1) && PyErr_Occurred()) {
    return NULL;
  }
  int __pyx_v_b = PyLong_AsLong(values[1]);
  // Check for error (unnecessary if we know it's an int)
  if ((__pyx_v_b == (int)-1) && PyErr_Occurred()) {
    return NULL;
  }
  // Call generated C implementation of add
  return add_and_box(__pyx_self, __pyx_v_a, __pyx_v_b);
}
Now, to be clear: this is probably the fastest thing possible for interfacing with CPython. Cython has been worked on for years and years and it’s very fast. But CPython isn’t the only runtime in town and the other runtimes have different performance characteristics, as we explored above.
Since so many C extensions are generated with Cython, there’s a big opportunity: if we manage to get the Cython compiler to emit typed metadata for the functions it compiles, those functions could become much faster under runtimes such as PyPy.
In order to justify such a code change, we have to see how much faster the typed metadata makes things. So let’s benchmark.
Let’s try benchmarking the interpreter interaction with the native module with a silly benchmark. It’s a little silly because it’s not super common (in use cases I am familiar with anyway) to call C code in a hot loop like this without writing the loop in C as well. But it’ll be a good reference for the maximum amount of performance we can win back.
# bench.py
import mytypedmod

def main():
    i = 0
    while i < 10_000_000:
        i = mytypedmod.inc(i)
    return i

if __name__ == "__main__":
    print(main())
We’ll try running it with CPython first because CPython doesn’t have this problem making PyObjects—that is just the default object representation in the runtime.
$ python3.10 setup.py build
$ time python3.10 bench.py
10000000
846.6ms
$
Okay, so the text output is a little fudged since I actually measured this with hyperfine, but you get the idea. CPython takes a very respectable 850ms to go back and forth with C 10 million times.
Now let’s see how PyPy does on time, since it’s doing a lot more work at the boundary.
$ pypy3.10 setup.py build
$ time pypy3.10 bench.py
10000000
2.269s
$
Yeah, okay, so all that extra unnecessary work that PyPy does (before our changes) really adds up. Our benchmark of inc takes almost three times as long as CPython. Oof. But this post is all about adding types. What if we add types to the C module and measure with our changes to PyPy?
Here are the changes to the C module:
diff --git a/tmp/cext.c b/tmp/cext-typed.c
index 8f5b31f..52678cb 100644
--- a/tmp/cext.c
+++ b/tmp/cext-typed.c
@@ -14,8 +14,15 @@ PyObject* inc(PyObject* module, PyObject* obj) {
   return PyLong_FromLong(result);
 }
 
+PyPyTypedMethodMetadata inc_sig = {
+  .arg_type = T_C_LONG,
+  .ret_type = T_C_LONG,
+  .underlying_func = inc_impl,
+  .ml_name = "inc",
+};
+
 static PyMethodDef mytypedmod_methods[] = {
-    {"inc", inc, METH_O, "Add one to an int"},
+    {inc_sig.ml_name, inc, METH_O | METH_TYPED, "Add one to an int"},
     {NULL, NULL, 0, NULL}};
 
 static struct PyModuleDef mytypedmod_definition = {
And now let’s run it with our new patched PyPy:
$ pypy3.10-patched setup.py build
$ time pypy3.10-patched bench.py
10000000
168.1ms
$
168ms! To refresh your memory, that’s 5x faster than CPython and 13x faster than baseline PyPy. I honestly did not believe my eyes when I saw this number. And CF and I think there is still room for more improvements like doing the signature/metadata finding inside the JIT instead of calling that C function.
This is extraordinarily promising.
But as I said before, most applications don’t consist of a Python program calling a C function and only a C function in a tight loop. It would be important to profile how this change affects a representative workload. That would help motivate the inclusion of these type signatures in a binding generator such as Cython.
In the meantime, let’s take a look at how the changes look in the PyPy codebase.
PyPy consists of two main parts: a Python interpreter, and a tool to transform interpreters into JIT compilers.
This means that instead of writing fancy JIT compiler changes to get this to work, we wrote an interpreter change. PyPy’s cpyext (C API) handling code already contains a little “interpreter” of sorts to make calls to C extensions. It looks at ml_flags to distinguish between METH_O and METH_FASTCALL, for example.
So we added a new case that looks like this pseudocode:
diff --git a/tmp/interp.py b/tmp/typed-interp.py
index 900fa9c..b973f13 100644
--- a/tmp/interp.py
+++ b/tmp/typed-interp.py
@@ -1,7 +1,17 @@
 def make_c_call(meth, args):
     if meth.ml_flags & METH_O:
         assert len(args) == 1
+        if meth.ml_flags & METH_TYPED:
+            return handle_meth_typed(meth, args[0])
         return handle_meth_o(meth, args[0])
     if meth.ml_flags & METH_FASTCALL:
         return handle_meth_fastcall(meth, args)
     # ...
+
+def handle_meth_typed(meth, arg):
+    sig = call_scary_c_function_to_find_sig(meth)
+    if isinstance(arg, int) and sig.arg_type == int and sig.ret_type == int:
+        unboxed_arg = convert_to_unboxed(arg)
+        unboxed_result = call_f_func(sig.underlying_func, unboxed_arg)
+        return convert_to_boxed(unboxed_result)
+    # ...
Since the JIT probably already knows about the types of the arguments to the C extension function (and probably has also unboxed them), all of the intermediate checks and allocation can be elided. This makes for much less work!
To check out the actual changes to PyPy, look at this stack of commits.
This project isn’t merged or finished. While we have a nice little test suite and a microbenchmark, ideally we would do some more: benchmark a representative workload, for example, and work with a binding generator such as Cython to emit the type metadata.
Let us know if you have any ideas!
lifthrasiir on Hacker News points out that this struct should be versioned for future-proofing. They also suggest potentially avoiding METH_TYPED by shipping a sort of sibling symbol _PyPyTyped_foo for each foo. That’s interesting.
Wenzel Jakob (of pybind11 and nanobind fame) wrote in that we should not ignore overloads (apparently fairly common in the wild) in our implementation. Apparently people like to write, for example, myextension.abs(x), where the call to abs_float or abs_int is dispatched in the binding glue (a real example that I found in a search on GitHub is available here). A JIT like PyPy could make the dynamic dispatch go away.
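Here is a hypothetical sketch of that kind of dispatching glue, with made-up names (abs_python standing in for the binding layer): one Python-visible function checks the argument’s runtime type and picks a typed implementation.

#include <Python.h>
#include <math.h>
#include <stdlib.h>

// Hypothetical binding glue: myextension.abs(x) dispatches on type.
static PyObject* abs_python(PyObject* module, PyObject* obj) {
  (void)module;
  if (PyFloat_Check(obj)) {
    // The "abs_float" overload.
    return PyFloat_FromDouble(fabs(PyFloat_AsDouble(obj)));
  }
  if (PyLong_Check(obj)) {
    // The "abs_int" overload.
    long v = PyLong_AsLong(obj);
    if (v == -1 && PyErr_Occurred()) {
      return NULL;
    }
    return PyLong_FromLong(labs(v));
  }
  PyErr_SetString(PyExc_TypeError, "expected int or float");
  return NULL;
}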
Another idea, from CF, a little more difficult: what if the Cython compiler could generate Python code or bytecode? Or, what if PyPy could ingest Cython code?
This C API problem has bitten pretty much every alternative runtime to CPython. They all try changing something about the object model for performance and then they all inevitably hit the problem of pervasive C API modules in the wild. This is part of what hurt the Skybison project. ↩
I later learned that this is common in the Linux kernel. ↩