The FSet Functional Collections Libraries

Important Links

Introduction

The FSet library is a functional implementation of set-theoretic collection datatypes. By "functional" I mean that all update operations return a new collection rather than modifying an existing one in place.

The ultimate goal of FSet is an ambitious one: to be the default choice for collection types in all the languages it supports. By "default", I mean the one you can be in the habit of using most of the time, except when you know you need something more specialized. Any collection library has strengths and weaknesses; none is good at everything. FSet seeks, however, to have no fatal weaknesses that make it terrible for some uses; it tries to do everything pretty well, at the possible expense of being the best at any one thing. This means, in roughly decreasing order of priority:

Every operation has acceptable time complexity.
The interfaces are general and easy to use.
Each implementation is well integrated with its host language.
Space overhead is reasonably small.
Every operation is reasonably fast in a constant-factor sense, considering that updates are functional (and thus inevitably more expensive than imperative updates).

To achieve these goals I have made the following implementation choices:

The primary datatypes (sets, bags, maps, sequences) are implemented using weight-balanced binary trees.
These trees are heterogeneous, in that instead of consisting of nodes all the way down, they have bounded-length vectors at the leaves. This design greatly reduces space overhead, with little effect on time performance.
While the set-like types (sets, bags, and maps) are dependent on an ordering relation for performance, they do not require this relation to be as fine-grained as equality. That is, elements that are equivalent according to the ordering, but nonetheless unequal, are treated as distinct.

These choices are discussed in more detail on the project page. Here, I want to address a more fundamental question: am I really suggesting that the default collection library for ordinary programming should be functional? And if so, why?

Yes, I am absolutely saying that: functional collections are the best choice for general use in many languages like Lisp. Let's look at some reasons why.

Collections are not objects

The first reason I want to discuss is primarily stylistic in nature and may seem abstruse, but it has very real consequences. In a nutshell: collections are not objects; they are mathematical values. More precisely, in the object-oriented analysis of a domain, it is very rare that a collection will correspond exactly to a stateful entity in the resulting model. Far more commonly, the collection is an attribute of some larger entity that maintains invariants involving the collection as well as others of its attributes.

Here's an example. Consider the object-oriented implementation of a Web shopping cart. This might have, as attributes, a bag (multiset) of items, a current total price, and a current total shipping weight. Whenever an item is added to or removed from the bag, the price and shipping weight need to be updated. That is, adding and removing items are necessarily operations on the shopping cart object as a whole; they don't just operate on the bag. This is what I mean when I say that the bag doesn't correspond to a stateful entity in the model; the stateful entity is the whole cart, with the state of the bag being just a component of that state.

Note that in this example, the price and shipping weight are, almost certainly, going to be implemented as values (numbers). Why does it seem so inevitable that the collection of items would be itself a mutable object? I think the biggest answer is simply habit. Programming began with imperative languages; an appreciation of the benefits of functional programming has only come more recently. We simply haven't gotten around to implementing functional collections in a way that would make them widely useful. Obviously, that is the situation I am seeking to rectify with FSet.

Also, a lot of the time, collection values are single-threaded — referring here not to threads of control, but to cases where a newly computed collection value replaces a previous one, such that the previous value is no longer needed. In the shopping cart example, when we add an item to the bag, we no longer care about its previous contents; so the use of a mutable collection is not ruled out, and seems natural. But actually, collections aren't special in this way; lots of values are single-threaded. Again in the example, the current price and shipping weight are also single-threaded, yet it's unlikely we would create mutable price and weight objects and update them — what would be the point?

You see, it really is mostly just habit. Concepts that can be implemented in a single machine word we have tended to implement as values, while those that require allocation we have tended to implement as mutable objects; but there's nothing necessary about this. Lisp's bignums, for example, are allocated but have value semantics, so the programmer doesn't even have to know when a computation on fixnum (single-word) operands has overflowed into a bignum (allocated) result; and similarly, Java's BigInteger, which is allocated, also has value semantics.

And there are real, practical, everyday problems with mutable collections. The worst of these has to do with passing them across interfaces (where, note, single-threadedness cannot be guaranteed). Consider the interface that the shopping cart presents to its client. Obviously, it will be necessary at some point for the client to ask the shopping cart for its items (e.g., to display them on the Web page). What is this operation going to return? If it returns the mutable collection instance the cart is using internally, then the cart exposes itself to invariant breakage: the client could add an item to the collection directly, whereupon the price and shipping weight wouldn't be updated. The cart's only way to prevent that, however, is to copy the collection before returning it: an unpalatable linear-time operation.

In this simple example it may seem implausible that the cart's client would be so obtuse as to modify the returned collection, but programmers do make errors of this kind. Indeed, in more complex situations I have seen very talented and experienced senior developers with PhD's in computer science have a real problem with this. It gets to be very tempting, when you're in a hurry, to mentally conflate the collection itself with the object that owns it — to the point of even sometimes forgetting to create such an object — with the inevitable result that critical invariants get neglected.

Nested collections

A key goal for FSet is to provide completely automatic support for arbitrarily nested collections. Compound types like maps keyed by sets of sequences just work with no additional programmer effort. While it's not every day one needs such a type, it does happen, more often than one might think. (Well, maybe I'm biased by working in compilers and static analysis — such types arise in this field fairly frequently.) It's dangerous to do this kind of nesting with mutable collections, because (for instance) modifying a set that is in use as a map key will blow the map's ordering invariant.

[Could use a good example of a nested collection type]

Thread safety and locking

As multiprocessor hardware becomes much more common, multithreaded software is becoming much more important. This brings up a second very good reason for expanding the use of functional collections: they can be thread-safe without locking. This is a very big deal, because collections are by nature general-purpose. One certainly wouldn't want to pay the cost of synchronization on every access to a collection, since many collections are used in only one thread; yet failure to synchronize a collection that really is being updated in multiple threads is a recipe for disaster. I think it is interesting to look at what the Java API documentation says about this in, for instance, the page for java.util.TreeSet:

Note that this implementation is not synchronized. If multiple threads access a set concurrently, and at least one of the threads modifies the set, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the set.

Notice the agreement with my argument in the previous section: there is usually an object that naturally encapsulates any collection.

If no such object exists, the set should be "wrapped" using the Collections.synchronizedSet method.

It's nice that Java makes it easy to apply this synchronization wrapper, but (a) you have to remember to do so, and (b) once you do so, even non-modifying operations, which tend to be the most common, are synchronized. So from a performance perspective it will be tempting to try to do without synchronization whenever possible; but the price of omitting it when it is needed is very high. And threading exacerbates the quandary of whether to copy a collection before returning it from an interface; the possibility that the client will modify the returned collection in another thread, which would not just cause it to have the wrong contents but could break it completely, provides additional motivation to copy it.

Functional collections cure all these ills. They never need to be synchronized. They can be returned from interfaces without copying and without risk. And the additional performance cost of functional collections relative to imperative ones — consisting primarily of additional load on the garbage collector — is somewhat made up for by the elimination of copying and synchronization costs.

[Update: Java 1.6 now has ConcurrentSkipListSet, which is thread-safe without locking. That renders this particular point moot — for those using Java 1.6. For Lisp users, and those Java users who haven't yet switched to 1.6, it still applies. I can't resist pointing out, though, that the Java designers obviously didn't intend ConcurrentSkipListSet for widespread use, or they would have given it a shorter name.]

Collections and subclasses

(While this issue doesn't apply to Lisp, I think it's worth mentioning concerning functional collections generally.) The addition of generic types to Java raised a conundrum. One would think, given a class A with a subclass B, that Set<B> would be a subclass of Set<A>. But with mutable collections, this obvious and highly desirable rule is problematic. Consider:

  Set<B> bs;
  A some_a;
  Set<A> as = bs;   // here's the problem
  as.add(some_a);

Once the assignment on the commented line has been made, it becomes possible, as on the last line, to add an element to the set that violates its original type — the same set is, after all, stil accessible as bs. So subtyping cannot extend through generic types.

Functional collections cure this problem, making it possible for, e.g., PureSet<B> to be treated as a subclass of PureSet<A> (PureSet is the FSet/Java functional set interface). All we need is a way to specify, in the definition of the generic class, that the subtype relationship applies. Alas, Java, unlike Scala, doesn't have a way to do that. I hope one will be added soon.

Recap and summary

Do you think it's strange that I would argue that functional collections are the superior choice for general use? I think it's strange that this opinion is not already the conventional wisdom. Consider this: as I mentioned above, both Lisp and Java have functional infinite-precision integers. Why are these functional rather than mutable? One could make these arguments:

Functional semantics are consistent with the natural behavior of integers that fit in a machine word
The use of mutable integers would complicate interfaces
Mutable infinite-precision integers would have thread-safety problems
Mutable integers are bug-prone because of accidental sharing

Someone with the opposite opinion could argue thus in counterpoint:

Functional integers put more load on the GC
Functional integers are unnatural to people used to imperative programming (more on this in a moment)

With the exception of the first point — consistency with the natural behavior of machine-word integers — these arguments are all identical, on both sides, to those that apply to functional collections. And as far as that first point goes, if there was ever a property that could be called merely an accident of implementation, surely this is such a property! Do you really want to argue that, of two mathematical concepts, integers and sets, our implementations should differ in design and semantics in such a deep way just because some integers fit in a machine word? In fact, certain restricted-domain sets can be implemented as bitfields, and therefore sometimes also fit in a machine word!

Now to this argument that people used to imperative programming find functional semantics unnatural. Of course this is true, but it's also circular; the more that people are exposed to functional programming, the more natural they will find it. (I understand that it's not uncommon for inexperienced Java programmers to think that, for instance, the BigInteger.add method modifies the instance on which it's invoked. I can't resist pointing out that this is partly because the method is badly misnamed. Names of functional methods like these should be prepositions (plus, with) or nouns (union, difference); they should never be verbs like add. It's easy to see how a casual or inexperienced coder might think that a.add(b) would modify a; a.plus(b) is more likely to suggest a + b.)

So if you're willing to pay the GC cost for bignums or BigIntegers, I don't see how you can argue that the GC cost of functional collections is too high. Sure, there may be occasional situations where profiling reveals that it would be worth recoding a performance-sensitive algorithm to use imperative collections; and you always have the option to do that. But for general use, the default choice — the one to use unless you have a specific reason to do otherwise — should clearly be functional.

Time complexity

Small updates to FSet collections — insertions and deletions of single elements, for instance — take only "log time", i.e., time logarithmic in the size of the collection, because in a functional binary tree, only the particular leaf being modified, plus its ancestors up to the root, need to have new versions created; the rest of the tree is shared between the old version of the collection and the new one.

Here is a list of each FSet operation which has the given time complexity.

"constant", O(1): obtaining the size of any collection; obtaining an arbitrary element of a set or bag or an arbitrary pair of a map
"log", O(log n): set/bag membership testing; map lookup; adding or removing an element to/from a set or bag, or a pair to/from a map; sequence insertion or deletion; sequence element access (first, last, or by index); subsequence, sequence concatenation
"linear", O(n): set union, intersection, difference, and subset test; map merge; iterating over any collection with an iteration macro or an iterator; reversing a sequence; obtaining the domain set of a map; comparing two collections (constant time in many cases, but linear in the worst case); conversion of any collection to a Lisp list or vector; conversion of a Lisp list or vector to a sequence
O(n log n): conversion of a Lisp list or vector to a set, bag, or map; obtaining the range set of a map; sorting a sequence

Also, the set and bag union, intersection, difference, and subset/subbag test operations, along with map merge and all the collection compare operations, implement an optimization whereby if the operands are historically related collections — one was formed by doing a small number of single-element insertion or deletion operations on the other one, or they are both related in this way to a third set — these operations can run in log time rather than linear time.

History of FSet

I don't really know where the idea of functional collections originated. I got it from the little-known language called Refine, originally developed at Kestrel Institute in the mid-1980s. But I recently came across some documentation for an older language called SETL, and it seems pretty clear that Refine was heavily influenced by SETL in this area. But I don't know whether SETL was the first language with functional collections.

Anyway, I worked in Refine for years, mostly doing software reengineering and static analysis. It was this experience that convinced me of the stylistic benefits of fully nestable functional collections. Refine's implementation of these, however, used unordered lists, which is great for very small collections and terrible for large ones. I wanted a Common Lisp implementation with much better time complexity properties. So I wrote FSet.