Probing the depths of a python 3.3 dictionary
How could I instrument the code that implements the dictionary so that I can count the collisions for each key inserted? I wanted to get under the hood of the dictionary, put some probes on the wires, and expose what goes on. But how???
Q: How to we get an instance of PyDictObject that underlies our dictionary, d = {1:1, 2:2, 3:3}?
A: cast(id(d), POINTER(PyDictObject)).contents
The Answer, In English...
We instantiate a pointer to data type PyDictObject and initialize it with the address of the dictionary d. We then dereference the pointer to get the PyDictObject object underlying dictionary d.
dictobj = cast(id(d), POINTER(PyDictObject)).contents
How do we access the key-value pairs stored in dictobj?
By knowing the layout of the PyDictObject type, we can access its nested fields until we get to the array holding the dictionary data.
How do I write python code that is based on the C source?
One good way to get started is to read code that someone else wrote. By starting with this code I learned some patterns. To figure out what to do for python 3.3, I had to read the source code dictobject.c and dictobject.h. I started with the C structs provided in the source.
What's a C struct? In python, think "class". It is a way of grouping together data types into a higher-level, more abstract data type.
Digging in ... The PyDictObject Type
Three C structs are involved in defining the dictionary: PyDictObject, PyDictKeysObject, and PyDictKeyEntry.
PyDictObject is the top-level struct that contains fields (in python, think attributes), one of which is PyDictKeysObject. This is another struct that contains fields, one of which is an array called dk_entries. Each element of the array is a PyDictKeyEntry, which is another struct that contains the fields of a key-value pair. When we insert key-value pairs into a dictionary, they reside in the fundamental data structure -- an array -- called dk_entries.
The ctypes module allows a python programmer to translate the C structs into python classes. Here are the structs underlying the python dictionary:
class PyDictKeyEntry(Structure):
"""An entry in a dictionary."""
_fields_ = [ ('me_hash', c_ulong),
('me_key', py_object),
('me_value', py_object), ]
The field names are identical to those in the C code. The field types are provided by this table in the ctypes module docs. I had a question about the type for me_hash, which is defined in the source code as Py_hash_t. This type is not documented anywhere. Is c_ulong the best choice? Is it portable to Windows, Linux, Mac OS (32/64 bit) or is c_size_t better? Experience certainly helps.
The PyDictKeysObject was the trickiest struct to translate into a python class because its last field is variable length. In the definition, the last field is a dk_entries array of size 1 (ie, 1 PyDictKeyEntry). As keys are added to the dictionary, this array grows.
class PyDictKeysObject(Structure):
"""An object of key entries."""
_fields_ = [
('dk_refcnt', c_ssize_t),
('dk_size', c_ssize_t),
('dk_lookup', LOOKUPFUNC), # a function prototype
('dk_usable', c_ssize_t),
('dk_entries', PyDictKeyEntry * 1), ]
# an array of size 1; size grows as keys are inserted into dictionary; this variable-sized field was the trickiest part to translate into python
The top-level struct is the dictionary object, PyDictObject:
PyDictObject._fields_ = [
('ob_refcnt', c_ssize_t),
('ob_type', c_void_p), # not in the docs
('ma_used', c_ssize_t),
('ma_keys', POINTER(PyDictKeysObject)),
('ma_values', POINTER(py_object)),
]
So, how to we get an instance of PyDictObject that underlies our dictionary, d = {1:1, 2:2, 3:3}?
obj is the PyDictObject instance that contains the key-value pairs:
obj = cast(id(d), POINTER(PyDictObject)).contents
How do we get our keys out of this obj?
By navigating down the structure of PyDictObject, we can access the keys. First get the PyDictKeysObject object:
key_obj = obj.ma_keys.contents
And Finally....Our key!!!
A pointer to the key stored in slot 0 of the dictionary is:
key_obj.dk_entries[0].me_key !!! This is our key
A Few Things I learned along the way....
C89, C99, POSIX: need to know this to ensure portability and to understand what C programmers are talking about
how to parse C declarations: even comes with a fun quiz at the end
pointer to an array (the whole array!) and why its useful (Even though Eli Bendersky says: "Truly, I can’t imagine why one would use a pointer to an array in real life."
rereading the HowTo on descriptors...again...
The Beginning... (certainly not the end)
Don't know if you made it this far, but there is TONS more to learn about the ctypes module as a foreign function library. And alternatives like a library called cffi. PyCon videos are a good source of info, too.
UPDATE: The following article just came out in the Python Weekly Newsletter: Why Python Is Slow: Looking Under the Hood This post looks at types int and list under the hood with more info on ctypes module and provides good pictures too.