Exploring the memoize function
*In this article we will explore some concrete examples of the many uses and intricacies of the memoize function from the Clojure standard library. If you enjoy this article, you might very well enjoy the Clojure Standard Library book: save 37% off the book with code fccborgatti. *
memoize is a function in the Clojure standard library that adds caching capabilities to another existing function using the invocation arguments as key. When the wrapped function is invoked with the same list of arguments, the result is returned immediately from the cache without any additional computation. The effects of memoize are readily visible if we print any message from the wrapped function:
(defn- f* [a b] (println (format "Cache miss for [%s %s]" a b)) (+ a b)) (def f (memoize f*)) (f 1 2) ;; Cache miss for [1 2] ;; 3 (f 1 2) ;; 3 (f 1 3) ;; Cache miss for [1 3] ;; 4
The first invocation generates the message (the following invocations don’t), confirming that the wrapped function f* is not invoked again. Conventionally, if the wrapped function is not meant to be used directly, the star * character gets added at the end of the name, while the "memoized" version is named without one.
"Memoize" invocation contract
memoize formal specification of the function signature is as follows:
"f" needs to be a function and is mandatory, so (fn? f) should return true. You can still provide memoize with a non-invokable object but the resulting function would be useless and throw an exception.
ClassCastException if "f" is not invokable.
A new function of a variable number of arguments. memoize will eventually delegate the call to the wrapped function, so the number of arguments passed to the generated function needs to be compatible.
memoize works well for non-trivial computations accepting and returning values with a relatively small memory footprint. The following example illustrates this point. The Levenshtein distance (this Wikipedia article contains a good introduction to the Levenshtein Distance algorithm) is a simple metric to measure how different two strings are. Levenshtein distance can be used, for example, to suggest corrections for the most common spelling mistakes. The Levenshtein distance is straightforward to implement, but becomes computationally intensive for longer strings (above 10 characters long). We could use memoize to save us from computing the distance of the same pair of strings over and over again. The input (the strings arguments) and the output (a small integer) are relatively small in size, so we can cache a large amount of them without exhausting memory (assuming the list of words with which the function is invoked is a finite number that we can estimate).
To feed our example we are going to use the dictionary available on Unix systems at "/use/share/dict/words" which is a plain text list of words. If we were asked to implement an auto-correction service, it could work as follows:
The user inputs a misspelled word
The system checks the distance of the word against the words in the dictionary
The results are returned back in order of smaller distance
Although, in our example, we are just implementing the essentials, concentrating mainly the application of memoize, we are also going to pre-compute a dictionary of words starting with the initials of the word, a technique to further speed-up the Levenshtein distance calculation:
(defn levenshtein* [[c1 & rest1 :as str1] ; 1. [c2 & rest2 :as str2]] (let [len1 (count str1) len2 (count str2)] (cond (zero? len1) len2 (zero? len2) len1 :else (min (inc (levenshtein* rest1 str2)) (inc (levenshtein* str1 rest2)) (+ (if (= c1 c2) 0 1) (levenshtein* rest1 rest2)))))) (def levenshtein (memoize levenshtein*)) ; 2. (defn to-words [txt init] ; 3. (->> txt slurp clojure.string/split-lines (filter #(.startsWith % init)) (remove #(> (count %) 8)) doall)) (defn best [misp dict] ; 4. (->> dict (map #(-> [% (levenshtein misp %)])) (sort-by last) (take 3))) (defn dict [init] (to-words "/usr/share/dict/words" init)) (def dict-ac (dict "ac")) ; 5. (time (best "achive" dict-ac)) ;; "Elapsed time: 4671.226198 msecs" ; 6. (["achieve" 1] ["achime" 1] ["active" 1]) (time (best "achive" dict-ac)) ;; "Elapsed time: 0.854094 msecs" ; 7. (["achieve" 1] ["achime" 1] ["active" 1])
The Levenshtein algorithm presented here is a variation of the many available on-line. The important aspect to remember is that it grows roughly as O(n*m) where m and n are the length of the strings, which is O(n^2) in the worst scenario.
This def actually builds the wrapping function through memoize, conveniently called levenshtein without the final * that is reserved for the non-memoized version.
to-words is a helper function to prepare the dictionary filtered by the initial string. to-words is part of the "static" or "learning" phase of the algorithm, since we can prepare words by initial off-line and store them for later use.
The best function is responsible for the application of the levenshtein memoized function to the words in the dictionary. It then sorts the results with sort-by and returns the lowest distances.
The def invocation defines a filtered dictionary starting by "ac" so it doesn't need to be computed multiple times. This also prevents the time function from reporting on the time needed to read and process the file.
The first invocation to search the best matches for the misspelled word returns in almost 5 seconds.
The second invocation returns much faster.
The memoized version of the Levenshtein distance function is storing each new pair of strings as key and the returned distance as the value of an internal map. Each time the function is invoked with the same arguments, the return value is looked up inside the map. Comparing long strings seems to benefit from caching intermediate results and the elapsed confirms the theory.
This example also shows the way the memoized Levenshtein distance is "trained" before actual use. The application could pre-compute the set of dictionaries by initials (similar to the indexing happening inside a database). This technique contributes to the speed-up seen in our Levenshtein implementation, but consider that there are also other algorithms out-performing Levenshtein (See the list of metrics available on Wikipedia).
What's in a name: memoize?
There is a reason why storing arguments and return values is called "memoization" instead of just "caching". Memoization is more specific because it implies two features normally present in functional languages: pure and higher order functions.
Pure functions The wrapped function needs to be referentially transparent. If there was a way for the output to be influenced by factors other than the function arguments, then cached results could be different depending on the context. The cache would then need to be aware of the context and use it as part of the key. Memoization becomes straightforward in functional languages supporting referential transparency.
Higher order functions "Higher order" refers to the property of a function to be treated as a value. As such, the function can be stored, passed to other functions or returned. Not all languages offer higher order functions, although it is now more common to offer this feature. By describing this kind of caching as "memoization" it is implied that a function can be transparently decorated with caching capabilities. "Transparently" in this context means that the original wrapped function remains untouched.
These are other functions in the Clojure standard library that have a similar or related use to memoize.
lazy-seq creates a "thunk" (wrapper function around a value) that evaluates its content on the first access and return a cached version on following calls. When the thunks are joined together in a sequence they form a lazy sequence. Lazy sequences are comparable to a cache where the order and value of the keys is predetermined. An "evaluate once" semantic on collections could be achieved with lazy-seq. Since all Clojure sequences are lazy, you might already be using a "cached data structure" without knowing it.
atom creates a Clojure Atom, one of the possible Clojure reference types. Atoms are used by memoize to store intermediate results. Use an atom directly if the memoize implementation is too restrictive for the kind of caching you need to implement. You can, for example, look into something different than a Clojure hash-map to store items in the map, like a mutable Java map with soft-references (there are several examples of use of SoftReference for caching in Java. This is a good starting point. Keep in mind that there are already libraries like core.cache to provide common caching strategies if this is what you're looking for.
Memoize Performance and Implementation Details
O(1) steps (function generation)
O(n log n) ** steps (generated function), n number of unique keys**
O(n) ** space (generated function), n number of unique keys**
The main aspect to consider about memoize, is that it stores cached items indefinitely. Constant accumulation of new cached values will eventually exhaust memory. memoize users should pay attention to these facts when designing their solution, more specifically to the prospected distribution of keys in the cache. memoize should not be used in cases of long-running services when the amount of argument permutations is potentially infinite or not easy to predict.
We can gather some statistics about the key distribution with some changes to the original memoize function. The following memoize2 contains additional atoms to collect data cache hits, misses and total number of calls at run-time.
(defn memoize2 [f] (let [mem (atom {}) ; 1. hits (atom 0) miss (atom 0) calls (atom 0)] (fn [& args] (if (identical? :done (first args)) ; 2. (let [count-chars (reduce + (map count (flatten (keys @mem))))] {:calls @calls :hits @hits :misses @miss :count-chars count-chars :bytes (* (int (/ (+ (* count-chars 2) 45) 8)) 8)}) ; 3. (do (swap! calls inc) ; 4. (if-let [e (find @mem args)] (do (swap! hits inc) (val e)) (let [ret (apply f args) (swap! miss inc)] (swap! mem assoc args ret) ret)))))))
Along with the actual cache, additional counters are added to the initial let block.
:done is a sentinel value that can be used to extract statistics during run-time.
This is an estimate of the amount of memory necessary to store the keys given the number of chars footnote:[A good enough formula to estimate the amount of memory necessary to store strings in Java is: http://www.javamex.com/tutorials/memory/string_memory_usage.shtml].
Additional swap! operations are performed to update counters.
By accessing the additional stats at run-time, we can estimate the key-space size or the memory footprint. For example, if we run the previous Levenshtein memoize example replacing memoize with memoize2 we can extract the following results:
(def levenshtein (memoize2 levenshtein*)) (best "achive" dict-ac) (["achieve" 1] ["achime" 1] ["active" 1]) (levenshtein :done) {:calls 400, :hits 0, :misses 400 :count-chars 5168 :bytes 10376} (best "achive" dict-ac) (["achieve" 1] ["achime" 1] ["active" 1]) (levenshtein :done) {:calls 800, :hits 400, :misses 400 :count-chars 5168 :bytes 10376}
As you can see, the first time the best function is invoked it generates 400 misses, while the second time it results in all hits. We can also estimate the memory taken by the strings stored in memory which is around 10Kb.
The second aspect to consider when using memoize is the additional hash-map assoc operation and atom swap! that is added for each new key combination presented as input. The hash-map adds O(n log n) steps to add a new key while the atom could under-perform under heavy thread contention. Depending on the application requirement, memoize could be built on top of a transient data structure to avoid the performance penalty of filling the cache. Another option to consider, when possible, is "warming the cache": while the application is still not serving live traffic, the cache could be populated artificially with the most common keys.
Now that you know some of the ins and outs of using memoize, perhaps you’re more interested in learning more about the Clojure standard library. If so, download the free first chapter of Clojure Standard Library and see thia Slideshare presentation for more details.