Python 3.3 dictionary's hard-to-believe performance
In my post Insights Into Python's Dictionary I talked about what goes on inside python 3.3 dictionary in terms of the underlying structure (a hash table) and a high-level view of the hashing algorithm.
Hashing Performance Using a Textbook Algorithm
Using an open addressing collision strategy described in an algorithms text book, I built a hash table, inserted the words from "A Tale of Two Cities" and plotted the mean number of collisions before each key successfully found a home. This ipython notebook shows the results. (The hash table approximately doubles its size when it becomes half full. The peaks in the plot indicate resizing.) In the end, there are 4371 unique keys and the table size grows to 16381. The average number of collisions is about 3.3.
Interlude
My goal was to do the same for a python 3.3 dictionary. But how could I instrument the code that implements the dictionary so that I can count the collisions for each key inserted? I wanted to get under the hood of the dictionary, put some probes on the wires, and expose what goes on. But how??? First, I show the results.
Hashing Performance Using Python 3.3 Dictionary
This ipython notebook shows the performance results of the python 3.3 algorithm when the same keys, in the same order, are inserted into a python 3.3 dictionary. (The table resizes when it becomes 2/3 full.) The final table size is half the size of the textbook algorithm and the average number of collisions are about 1.5. Wow.
The file cpython/Objects/dictobject.c says:
Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases.
[Python's non-random algorithm] gives better-than-random behavior in common cases, and that's very desirable.
This was fun but the real challenge was to figure how to probe the internal dictionary machinery. This led me to the ctypes module. A whole new world on the other side of the python/C divide.
My next post will be on my eye-opening Interlude.












