4.2 Serialization details
There are consequences to trying to move values from lisp memory onto
disk in order to persist them. The first consequence is that that
pointers cannot be guaranteed to be valid and so references to lisp
objects cannot be maintained. This is very similar to the problems
with passing references in foreign function interfaces. The second,
and more frustrating limitation is that lisp operations that commit
side effects on aggregate objects, such as objects, arrays, etc,
cannot be trapped and replicated on the disk representation. This
leads up to a very important consequence: all lisp objects are stored
by value. This policy has a number of consequences which are
detailed below.
4.2.1 Restrictions of Store-by-Value
- Lisp identity can't be preserved.
Since this is a store which persists across invocations of Lisp,
this probably doesn't even make sense. However if you get an object
from the index, store it to a lisp variable, then get it again - they
will not be eq:
(setq foo (cons nil nil))
=> (NIL)
(add-to-root "my key" foo)
=> (NIL)
(add-to-root "my other key" foo)
=> (NIL)
(eq (get-from-root "my key")
(get-from-root "my other key"))
=> NIL
- Nested aggregates are serialized recursively into a single buffer.
If you store an set of objects in a hash table you try to store a hash
table, all of those objects will get stored in one large binary buffer
with the hash keys. This is true for all aggregates that can store
type T (cons, array, standard object, etc).
- Circular References.
One benefit provided by the serializer is that the recursive
serialization process does not lead to infinite loops when they
encounter circular references among aggregate types. It accomplishes
this by assigning an ID to any non-atomic object and keeping a mapping
between previously serialized objects and these ids. This same
mapping is used to reconstruct references in lisp memory on
deserialization such that the original structure is properly
reproduced.
- Storage limitations.
The serializer writes sequentially into a contiguous foreign byte
array before passing that array to a given data store's API. There
are practical limits to the size of the foreign buffer that lisp can
allocate (usually somewhere on the order of 10-100MB due to address
space fragmentation). Moreoever, most data stores will have a
practical limit to the size of a transaction or the size of key or
value they will store. Either of these considerations should
encourage you to plan to limit the size of objects that you serialize
to disk. A good rule of thumb is to stay under a handful of
megabytes. We have successfully serialized arrays over 100MB in the
past, but have not tested the robustness of these large values over
time.
- Mutated substructure does not persist.
(setf (car foo) T)
=> T
(get-from-root "my key")
=> (NIL)
This will affect all aggregate types: objects, conses, hash-tables, et
cetera. (You can of course manually re-store the cons.) In this
sense elephant does not automatically provide persistent collections.
If you want to persist every access, you have to use Persistent Sets
(see Persistent Sets) or BTrees (see Persistent BTrees).
- Serialization and deserialization can be costly. While
serialization is pretty fast, but it is still expensive to store large
objects wholesale. Also, since object identity is impossible to
maintain, deserialization must re-cons or re-allocate the entire
object every time increasing the number of GCs the system does. This
eager allocation is contrary to how most people want to use a
database: one of the reasons to use a database is if your objects
can't fit into main memory all at once.
- Merge-conflicts in heavily multi-process/threaded situations.
This is the common read-modify-write problem in all databases. We will talk
more about this in the Transaction Details section.
- Byte Ordering.
The primitive elements such as integers are written to disk in
the native byte ordering of the machine on which the lisp runs. This
means that little endian machines cannot read values written by big
endian machines and vice a versa.
- Unicode codes and Serialized Strings.
The characters and strings stored to disk can store and recover
lisp character codes that implement unicode, but the character maps
are the lisp character maps (produced by
char-code
) and not
strict unicode codes so lisps may not be able to interoperably read
characters unless they have identical character code maps for the
character sets you are reading and writing. All standard ASCII
strings should be portable. Here is what we know about specific
lisps, but this should not be taken as gospel.
- SBCL: In versions with the :sb-unicode feature (after 0.8.17)
char-code
produces proper Unicode codes
- Allegro: In the interational version,
char-code
produces proper Unicode codes for codes < 2^16
- OpenMCL: OpenMCL 1.1 supports unicode, we are unsure about earlier versions
- Lispworks: Lispworks 5 does not, to our knowledge, produce proper Unicode characters.
(This can be fixed on request iff users ask for it and are willing to pay the performance hit)
4.2.2 Atomic Types
Atomic types have no recursive substructure. That is they cannot
contain arbitrary objects and are of a bounded size. (Bignums are an
exception, but they have a predictable structure and cannot reference
or otherwise encapsulate other objects). The following is a list of
atoms and a discussion of how they are serialized.
nil
:
nil has it's own special tag in the serializer so it is easily
identifiable. nil
is an awkward value as it is also a boolean.
The boolean value t
is stored as the symbol 'T.
- fixnums:
The serializer will store both 32-bit and 64-bit fixnums. Both
types of fixnums are readable by a 32-bit or 64-bit lisp, but 64-bit
fixnums are only written if the underlying lisp is supports fixnums
between 32 and 64 bits.
- bignums:
Bignums are broken into a sequence of fixnum-sized chunks and
assembled by masking words onto the bignum. This is awfully
expensive, but it's always correct and fully portable.
- small-float:
Supported only on Lispworks 5 where type
small-float
is
not equivalent to type single-float
as it is on all other
supported platforms. Written to disk and deserialized as a single
float so any memory footprint savings of small-float
is lost.
- single-float:
32-bit floating point numbers
- double-float:
64-bit floating point numbers
- rational:
A rational is merely a ratio of two integers stored as fixnums or bignums.
- complex:
A complex is a pair of floating point values, rationals or integers.
- char:
Standalone chars are represented by their char-code and are
stored in 32-bit format to ensure that all lisps are stored correctly.
- strings:
Strings can be represented as 8, 16 or 32 bit sequences
depending on the character sizes used in the underlying lisp. Because
strings can be such a large percentage of on-disk space, Elephant uses
a peculiar method of encoding strings. Strings are converted from
their in-memory representation using
char-code
. The size of
the first character dictates the word width used for encoding. If a
character violates the word width, the string encoding is aborted and
the next larger width is chosen. The rationale here is that many
strings consist of Latin characters with codes less than 256. Strings
stored in other character sets tend to all be of codes > 256.
Therefore it is likely that the first character will properly
determine the word size of the string. (On request, we can easily make
a configuration option to fix the word width for encoding)
- pathname:
A pathname is merely the
namestring
of the path object
stored as a string. The path object is reconstructed from the
namestring using parse-namestring
during deserialization.
- symbol:
Symbols are stored as two strings, the package name and the symbol name in that package. When deserialized, the target package is searched for and the symbol is interned in that package.
4.2.3 Aggregate Types
The next list are aggregate types, meaning that elements of
that type can contain references to elements of type T
. That
means, in theory, that storing an aggregate type to disk that refers
to other objects can copy every reachable object! This is a direct
and dire consequence of the “store-by-value” restriction.
(see Persistent Classes and Objects for how to design around the
store-by-value restriction).
This list describes how aggregates are handled by the serializer.
- cons:
Cons is simply stored as a cons record containing two nested
elements. Linear lists are not treated specially (i.e. no cdr-coding)
by the serializer.
- array:
Arrays are stored as sequences of nested, serialized elements.
The array parameters are also stored so that arrays with fill
pointers, adjustable arrays can be stored and reconstructed. The only
arrays that cannot be reproduced are displaced arrays, which are
copied by value and reconstructed as standard arrays during
deserialization.
- hash-table:
Hash tables are stored as a sequence of key-value pairs, where
the key and value can be any serializable value. On deserialization,
the reconstructed key and value quantities are written incrementally
into the hash table. The hash table does remember it's test, rehash
size and threshold and it's total count. The final size of the new
hash table is set to
(* (/ size reshash-threshold) rehash-size)
.
- struct:
Structure objects are serialized using the metaprotocol. Each
slot where the value is bound is serialized by serializing the slot
name and the value in sequence. The underlying lisp must support the
struct-constructor
method so that a new, empty instance of the
structure can be created and then populated by the stored keys and
values.
- object:
Instances of subclasses of standard-object are stored almost
identically to structs. The type of the object is stored and the
object slots with bound values are serialized as slotname-value pairs.
To read an object of this type, the lisp image must have the class
defined and it must have at least the slots that are stored on disk.
There is no good method for schema evolution (redefining objects to
have less slots) of ordinary classes.
One final strategic consideration is to whether you plan on sharing the binary
database between machines or between different lisp platforms on the
same machine. This is almost possible today, but there are some
restrictions. In the section Repository Migration and Upgrade
we will discuss possible ways of migrating an existing database across
platforms and lisps.