Checkpointing Conventional Program State

Next: Persistent System Objects, Previous: File System Replacement, Up: Design Patterns

6.3 Checkpointing Conventional Program State

Another challenge for many programs is saving some subset of program state. This could involve checkpointing an evolving computation, keeping track of state for the purposes of 'undo' or enabling crash recovery at key points in the program's execution.

One approach is to transform all our program state into persistent objects. However if the use of program state is slot-access intensive, this can have a significant performance impact. To improve the performance of the application, careful use of transactions is needed which further complicates program design and operation.

Can Elephant be used to provide a simple solution that retains the in-memory performance that we want? Can we do all this without having to put a ton of persistence assumptions into our main program code? The answer is yes, assuming you are willing to explicitly checkpoint your code and adhere to some simple constraints in accessing your program objects.

6.3.1 Assumptions

To maintain processing speed and convenience we would like all our objects to be standard lisp objects without special harnesses that would interfere with applying the full power of lisp. At some point during execution, we want to store the current state of a set of objects to disk and yet make it easy to reproduce the original state at a later point in time. For simplicity, we'll limit ourselves to collections of CLOS objects.

A complication is that many programs have sets of interdependant objects. These could be complex program graphs, the state of an ongoing search process or a standard OO system that uses a bunch of different program object types to run. This means that we need to persist not just object state, but also references to other objects.

Using CLOS reflection we can provide a general solution to capturing objects, slot values and references. However to reproduce references, we'll need to be able to find the object referenced and the only way to do that is to store it as well. Thus we want to create a snapshot of a closed set of self-referential objects.

The assumptions underlying the snapshot mechanism is:

Use standard CLOS objects and references to other CLOS objects. We need reflection to
Use standard hash tables to keep track of sets of objects. Your program should use the hash table as an entry point to find objects. When objects are restored, just replace an existing hash table with the new one and access your objects that way. Any parts of your program that have pointers into your objects but are not themselves snapshotted, will need to be able to refresh their pointers in some way.
Find your root object (s) and know what is “reachable” from them. Ensure that you aren't referring to standard objects outside those you want to store as they will be stored too (persistent object references are fine though). Make sure your root refers to objects that refers to other objects and so on such that all objects you want to store can be reached by some set of pointer traversals. Looping references are fine.

6.3.2 Snapshot Set

The snapshot implementation is called a snapshot-set. The next section will go into detail, but a walkthrough will help make it clearer¹.

A snapshot set is quite easy to use. Load the complete code and play with this simple walk through. The code can be located in the Elephant source tree under src/conrib/eslick/snapshot-set.lisp.

The first step is to create a snapshot-set object,

     (setf my-set (make-instance 'snapshot-set))

and add it to the root so we don't lose track of it.

     (add-to-root 'my-set my-set)

Then we need some objects to play with.

     (defclass my-test-class ()
       ((value :accessor test-value :initarg :value)
        (reference :accessor test-reference :initarg :reference)))
     
     (setf obj1 (make-instance 'my-test-class :value 1 :reference nil))
     (setf obj2 (make-instance 'my-test-class :value 2 :reference obj1))
     (setf obj3 (make-instance 'my-test-class :value 3 :reference obj2))
     
     (register-object obj3 my-set)
     (snapshot my-set)

Now your set should have persistent versions of all three classes that are reachable from obj3.

     (map-set (lambda (x) (print (test-value x))) my-set)
     =>
     3
     2
     1

Of course such fully connected objects are not always common, so we'll demonstrate using hash tables to create root indexes into our objects and sidestep registration calls entirely. We'll create a fresh set to work with.

     (setf my-set (make-instance 'snapshot-set))
     (add-to-root 'my-set my-set)
     
     (setf obj4 (make-instance 'my-test-class :value 4 :reference obj1))
     (setf obj5 (make-instance 'my-test-class :value 5 :reference nil))
     
     (setf hash (make-hash-table))
     (setf (snapshot-root my-set) hash)
     
     (setf (gethash 'obj3 hash) obj3)
     (setf (gethash 'obj4 hash) obj4)
     (setf (gethash 'obj5 hash) obj5)
     
     (snapshot my-set)

To properly simulate restoring objects, we need to drop our old hash table as well as clear the persistent object cache so the snapshot set transient object is reset.

     (setf my-set nil)
     (setf hash nil)
     (elephant::flush-instance-cache *store-controller*)

Now we'll pretend we're startup up a new session.

     (setf my-set (get-from-root 'my-set))
     (setf hash (snapshot-root my-set))

The cache is automatically populated by the implicit restore call during snapshot-set initialization, and our hash table should now have all the proper references. We'll pull out a few.

     (setf o4 (gethash 'obj4 hash))
     (setf o3 (gethash 'obj3 hash))
     (setf o2 (test-reference o3))
     
     (not (or (eq o4 obj4)
              (eq o3 obj3)
              (eq o2 obj2)))
     => t

The new objects should not be eq the old ones as we have restored fresh copies from the disk.

If you review the setup above, obj3 references obj2 which references obj1 and obj4 also references obj1. So if the objects were properly restored, these references should be eq.

     (eq (test-reference o2) (test-reference o4))
     => t

And finally we can demonstrate the restorative power of snapshot sets.

     (remhash 'obj5 hash)
     
     (gethash 'obj5 hash)
     => nil nil
     
     (restore my-set)
     (setf hash (snapshot-root my-set))
     
     (gethash 'obj5 hash)
     => #<MY-TEST-CLASS ..> t
     
     (test-value *)
     => 5

This means that while our set object was not reset, the restore operation properly restored the old reference structure of our root hash object. Unfortunately, in this implementation you have to reset your lisp pointers to get access to the restored objects.

A future version could traverse the existing object cache, dropping new references and restoring old ones so that in-memory lisp pointers were still valid.

6.3.3 Snapshot Set Implementation

In this section we walk through the implementation of the snapshot set in detail as it provides:

Insight into constraints in serialization and lisp object identity
How to leverage Elephant for some more sophisticated applications than persistent indices and class slots.
Helps you understand a useful utility (that we may add to an extensions release in the future)

To generalize the behavior discussed above, we will define a new persistent class called a snapshot set. The set itself is a wrapper around the btree, but provides all the automation to store and recover sets of standard objects.

     (defpclass snapshot-set ()
       ((index :accessor snapshot-set-index :initform (make-btree))
        (next-id :accessor snapshot-set-next-id :initform 0)
        (root :accessor snapshot-set-root :initform nil)
        (cache :accessor snapshot-set-cache
               :initform (make-hash-table :weak-keys t)
               :transient t)
        (touched :accessor snapshot-set-touched
                 :initform (make-array 20 :element-type 'fixnum
                              :initial-element 0 :fill-pointer t
                              :adjustable t)
                 :transient t))
       (:documentation "Keeps track of a set of standard objects
         allowing a single snapshot call to update the store
         controller with the latest state of all objects registered with
         this set"))

The set class keeps track of IDs, a set of cached objects in memory, the on-disk btree for storing instances by uid and the current uid variable value. Notice the use of the transient keyword argument for the cache.

There are two major operations supported by sets snapshot and restore. These save objects to disk and restore objects to memory, along with proper recovery of multiple references to the same object.

Additional operations are:

Registration: Adding and removing objects from a set
Root operations: Easy access to a single root hash table or object
Mapping: Walk over all objects in a set

To enable snapshots, we have to register a set of root objects with the set. This function ignores objects that are already cached, otherwise allocates a new ID and caches the object.

     (defmethod register-object ((object standard-object) (set snapshot-set))
       "Register a standard object.  Not recorded until
        the snapshot function is called on db"
       (aif (lookup-cached-id object set)
            (values object it)
            (let ((id (incf (snapshot-set-next-id set))))
     	 (cache-snapshot-object id object set)
     	 (values object id))))
     
     (defun lookup-cached-id (obj set)
       (gethash obj (snapshot-set-cache set)))
     
     (defun cache-snapshot-object (id obj set)
       (setf (gethash obj (snapshot-set-cache set)) id))

A parallel function registers hash tables. One very important invariant implied here is that the cache always contains objects that are eq and mapped back to a serialized object in the backing btree. There is no need, however, to immediately write objects to the store and this gives us some transactional properties: snapshots are atomic, consistent and durable. Isolation is not enforced by snapshots.

This means that the transient cache has to be valid immediately after the snapshot set is loaded from the data store.

     (defmethod initialize-instance :after ((set snapshot-set) &key lazy-load &allow-other-keys)
       (unless lazy-load (restore set)))

This also has consequences for unregistration. Removing a root object should also result in the removal of all objects that are unreachable from other roots. However, since side effects are not permanent until a snapshot operation, we merely have to garbage collect id's that were not touched during a snapshot operation. This makes unregistration simple.

     (defmethod unregister-object (object (set snapshot-set))
       "Drops the object from the cache and backing store"
       (let ((id (gethash object (snapshot-set-cache set))))
         (when (null id)
           (error "Object ~A not registered in ~A" object set))
         (drop-cached-object object set)))

But snapshots are a little bit more work.

     (defmethod snapshot ((set snapshot-set))
       "Saves all objects in the set (and any objects reachable from the
        current set of objects) to the persistent store"
       (with-transaction (:store-controller (get-con
                                              (snapshot-set-index set)))
         (loop for (obj . id) in
                   (get-cache-entries (snapshot-set-cache set))
               do
     	  (save-snapshot-object id obj set))
         (collect-untouched set)))
     
     (defun save-snapshot-object (id obj set)
       (unless (touched id set)
         (setf (get-value id (snapshot-set-index set))
     	  (cond ((standard-object-subclass-p obj)
     		 (save-proxy-object obj set))
     		((hash-table-p obj)
     		 (save-proxy-hash obj set))
     		(t (error "Cannot only snapshot standard-objects and hash-tables"))))
         (touch id set))
       id)
     
     (defun collect-untouched (set)
       (map-btree (lambda (k v)
     	       (unless (touched k set)
     		 (remove-kv k (snapshot-set-index set))))
     	     (snapshot-set-index set))
       (clear-touched set))

We go through all objects in the cache, storing objects as we go via save-snapshot-object. This function is responsible for storing objects and hash tables and recursing on any instances that are referenced. Any object that is saved is added to a touch list so they are not stored again and we can mark stored instances for the collect-untouched call which ensures that newly unreachable objects are deleted from the persistent store. Any newly found objects are added to the in-memory cache which, being a weak array, should eventually drop references to objects that are not referred to elsewhere.

It should be noted that garbage objects not garbage collected from the weak-array based cache may be stored to and restored from the persistent store. However this is merely a storage overhead as they will eventually be dropped across sessions as there are no saved references to them.

Now when we serialize a standard object, all the slot values are stored inline. This means that by default, a slot that refers to a standard object would get an immediately serialized version rather than a reference. This of course makes it impossible to restore multiple references to a single object. The approach taken here is to instantiate a proxy object which is a copy of the original class and stores references to normal values in its slots. Any references to hashes or standard classes are replaced with a reference object that records the unique id of the object so it can be properly restored.

     (defun save-proxy-object (obj set)
       (let ((svs (subsets 2 (slots-and-values obj))))
         (if (some #'reified-class-p (mapcar #'second svs))
     	(let ((proxy (make-instance (type-of obj))))
     	  (loop for (slotname value) in svs do
     	       (setf (slot-value proxy slotname)
     		     (if (reify-class-p value)
     			 (reify-value value set)
     			 value)))
     	  proxy)
     	obj)))

The function checks whether any slot value can be reified (represented by a unique id) and if so, makes a new proxy instance and properly instantiates its slots, returning it to the main store function which writes the proxy object to the btree.

On restore, we simply load all objects into memory.

     (defmethod restore ((set snapshot-set))
       "Restores a snapshot by setting the snapshot-set state to the last
     snapshot.  If this is used during runtime, the user needs to drop all
     references to objects and retrieve again from the snapshot set.  Also
     used to initialize the set state when a set is created, for example
     pulled from the root of a store-controller, unless :lazy-load is
     specified"
       (clear-cache set)
       (map-btree (lambda (id object)
     	       (load-snapshot-object id object set))
     	     (snapshot-set-index set)))
     
     (defun load-snapshot-object (id object set)
       (let ((object (ifret object (get-value id (snapshot-set-index set)))))
         (cond ((standard-object-subclass-p object)
     	   (load-proxy-object id object set))
     	  ((hash-table-p object)
     	   (load-proxy-hash id object set))
     	  (t (error "Unrecognized type ~A for id ~A in set ~A"
                         (type-of object) id set)))))

If an object has a reference object in a slot, then we simply restore that object as well. load-snapshot-object accepts null for an object so it can be used recursively when a reference object refers to an object (via the unique id) that is not yet cached. The load functions return an object so that they can used directly to create values for writing slots or hash entries.

     (defun load-proxy-object (id obj set)
       (ifret (lookup-cached-object id set)
     	 (progn
     	   (cache-snapshot-object id obj set)
     	   (let ((svs (subsets 2 (slots-and-values obj))))
     	     (loop for (slotname value) in svs do
     		  (when (setrefp value)
     		    (setf (slot-value obj slotname)
     			  (load-snapshot-object (snapshot-set-reference-id value) nil set)))))
     	   obj)))

6.3.4 Isolating multiple snapshot sets

A brief note on how to separate out the objects you want to store from those you don't may be useful. We want to snapshot groups of inter-referential objects without sucking in the whole system in one snapshot. These object sets must be closed and fully connected. If the program consists of a set of subgraphs, a root element of each graph should be stored in a hash table that is then treated as the snapshot root.

Manual registration: Objects without external references are easy, just register or unregister them from the snapshot-set as needed and then map over them to get them back.
Implicit registration: Just store objects in a hash that is the root of a snapshot-set and you are good to go.
Graphs: Graphs are easy to store as they naturally consist of a closed set of objects. If the graph nodes reference other system objects that you don't want to store, you'll need to implement something akin to the indirection provided here. Just store the root of the graph in the snapshot set root and go from there.
All instances of a type: Another easy way to create sets is to overload make-instance to store all new objects in a weak hash table that is treated as the root of a snapshot-set (NOTE: I have not verified that weak hashes are properly serialized and reproduced - I suspect they are not so you might have to copy after a restore).

For more complex applications, you can isolate these closed sets of objects by using snapshot-set root hash tables as an indirection mechanism. Instead of storing direct references in an object slot or hash value, isolation is ensured by storing keys and indirecting through a hash table to get the target object. This can be hidden from the programmer in multiple ways. The easiest way is just to make sure that when you store references you store a key and overload the slot accessor. A sketch of this follows:

     (defparameter *island1-hash* (make-hash-table))
     (defparameter *island2-hash* (make-hash-table))
     (defvar *unique-id* 0)
     
     (defclass island1-object ()
       ((pointer-to-island1 :accessor child :initform nil)
        (pointer-to-island2 :accessor neighbor :initform nil)))
     
     (defmethod neighbor :around ((obj island1-object))
       (let ((key (call-next-method)))
          (when key (gethash key *island2-hash*))))
     
     (defmethod (setf neighbor) :around (ref (obj island1-object))
       (cond ((subtypep (type-of ref) 'island2-object)
              (let ((key (find-object ref *island2-hash*)))
                (if key
                    (progn
                      (call-next-method key obj)
                      obj)
                    (progn
                      (setf (gethash (incf *unique-id*) *island2-hash*) ref)
                      (call-next-method *unique-id* obj)
                      obj))))
             (t (call-next-method))))
     
     (defun find-object (obj hash)
        (map-hash (lambda (k v)
                    (declare (ignore k))
                    (if (eq obj v)
                        (return-from find-object obj)))
                   hash))

The same template would apply to island2 references to island1 objects. You could further simplify creating these hash table indirections with a little macro:

     (defmacro def-snapshot-wrapper (accessor-name
               (source-classname target-classname hashname uid))
       (with-gensysms (obj key ref)
        `(progn
           (defmethod ,accessorname :around ((,obj ,source-classname))
              (let ((,key (call-next-method)))
                (when ,key (gethash ,key ,hashname))))
           (defmethod (setf ,accessorname) :around
                      (,ref (,obj ,source-classname))
              (cond ((subtypep (type-of ,ref) ,target-classname)
                     (let ((,key (find-object ,ref ,hashname)))
                       (if ,key
                           (progn
                             (call-next-method ,key ,obj)
                             ,obj)
                           (progn
                             (setf (gethash (incf ,uid) ,hashname) ,ref)
                             (call-next-method ,uid ,obj)
                             ,obj))))
                    (t (call-next-method)))))))
     
     (defclass island2-object ()
       ((pointer-to-island2 :accessor child :initform nil)
        (pointer-to-island1 :accessor neighbor :initform nil)))
     
     (def-snapshot-wrapper neighbor
                           (island2 island1 *island1-hash* *unique-id*))

Of course this doesn't work for multi-threaded environments, or for separating more complex collections of types. I am also sure that more elegant solutions are possible. In most cases, we assume the user will have a natural collection of objects that can be closed over by types or references so such efforts are unnecessary.

Footnotes

[1] Example provided by Ian Eslick, April 2007