As Datarank has grown, certain areas of our system had to be standardized to facilitate common coding patterns. We would like to minimize the number of dependencies in our system and minimize the knowledge-set required to jump in. When it comes to collection libraries, there are certainly a lot of choices (HPPC, Java Collections, Trove, FastUtil, GS-Collections, Javolution, Commons). In this post we will discuss the libraries that we have settled on at Datarank.

Rather than declaring one collection library better than another, this post focuses on the pros and cons of each library, with a brief description of suggested usages. In addition, each library has a very basic example section that shows how it is used in the most common contexts: Java 5/6/7, LambdaJ, and Java 8.

LambdaJ might be unfamiliar to some. It tries to fill the gap between Java’s procedural style and the more expressive functional style of other languages (or now Java 8). It uses reflection to do some neat tricks and works very well in general. However, it’s not the most performant library so it’s not ideal for low latency components.
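To make the reflection tricks a little less mysterious, here is a minimal sketch of the general idea behind an on(...)-style call: a proxy records which getter was invoked so that the call can be replayed against real objects later. This is not LambdaJ's actual implementation. The names OnSketch, HasId, on, and extract are hypothetical, and the sketch relies on the JDK's java.lang.reflect.Proxy, which only works for interfaces.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

class OnSketch {

    /** Hypothetical interface; JDK proxies can only implement interfaces. */
    interface HasId {
        int getId();
    }

    /** The most recent method invoked on a recording proxy (per thread). */
    static final ThreadLocal<Method> LAST_CALL = new ThreadLocal<>();

    /** Returns a proxy that records which method was called instead of doing real work. */
    static <T> T on(Class<T> type) {
        InvocationHandler recorder = (proxy, method, args) -> {
            LAST_CALL.set(method);
            return 0; // placeholder result; this sketch only supports int getters
        };
        return type.cast(Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, recorder));
    }

    /**
     * Replays the recorded getter reflectively against a real object. The
     * second argument is ignored; evaluating it is what records the call.
     */
    static Object extract(Object target, Object recordedCall) {
        try {
            return LAST_CALL.get().invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HasId person = () -> 42;
        // on(HasId.class).getId() runs first and records "getId"
        System.out.println(extract(person, on(HasId.class).getId())); // prints 42
    }
}
```

Every extracted value costs a reflective call in a design like this, which hints at why this style is a poor fit for low latency components.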

Java Collections

The de facto standard and obvious default. However, performance and memory usage are not always the best. For any publicly shared interface the Java Collections API will be easiest to consume.

Pros

  1. No dependencies
  2. Good enough for most cases
  3. Well documented and understood

Cons

  1. Interfaces are somewhat aged and clunky
  2. Lacks several features found in other libraries

Examples

/*
 * very verbose; loops can't be chained and the Collections.sort call is
 * inelegant compared to the other methods
 */
public Iterable<Integer> findAllIdsForLastNameSortedByIdUsingJava7(
        Iterable<Person> people, String lastName) {

    final List<Integer> smiths = new ArrayList<>();

    for (Person person : people) {
        if (person.getLastName().equals(lastName)) {
            smiths.add(person.getId());
        }
    }

    Collections.sort(smiths);

    return smiths;
}

/*
 * less verbose but things are read "inside-out" and there is a bit of
 * "magic" going on with the "on" call
 */
public Iterable<Integer> findAllIdsForLastNameSortedByIdUsingLambdaJ(
        Iterable<Person> people, String lastName) {

    // filter by last name, sort the matches by id, then pull out the ids
    return Lambda.extract(
            Lambda.sort(
                    Lambda.filter(having(on(Person.class).getLastName(),
                                         Matchers.equalTo(lastName)), people),
                    on(Person.class).getId()),
            on(Person.class).getId());
}

/*
 * forced to return an IntStream or to copy all elements into a
 * List<Integer> or int[]
 */
public IntStream findAllIdsForLastNameSortedByIdUsingJava8(
        Iterable<Person> people, String lastName) {

    // Iterable to stream is awkward but workable and the
    // overall code is cleaner
    return StreamSupport
            .stream(people.spliterator(), false)
            .filter(p -> p.getLastName().equals(lastName))
            .mapToInt(Person::getId)
            .sorted();
}
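For completeness, here is a sketch of the "copy all elements" alternatives mentioned in the Java 8 comment above, using only the JDK. The Person class is a hypothetical stand-in for the type used in this post, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

class Java8CopySketch {

    /** Hypothetical stand-in for the Person type used in this post. */
    static final class Person {
        private final int id;
        private final String lastName;

        Person(int id, String lastName) {
            this.id = id;
            this.lastName = lastName;
        }

        int getId() { return id; }
        String getLastName() { return lastName; }
    }

    /** Boxes every id so the result can be handed out as a List<Integer>. */
    static List<Integer> sortedIdsAsList(Iterable<Person> people, String lastName) {
        return StreamSupport
                .stream(people.spliterator(), false)
                .filter(p -> p.getLastName().equals(lastName))
                .map(Person::getId)
                .sorted()
                .collect(Collectors.toList());
    }

    /** Stays primitive the whole way and copies the result into an int[]. */
    static int[] sortedIdsAsArray(Iterable<Person> people, String lastName) {
        return StreamSupport
                .stream(people.spliterator(), false)
                .filter(p -> p.getLastName().equals(lastName))
                .mapToInt(Person::getId)
                .sorted()
                .toArray();
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person(3, "smith"), new Person(1, "smith"), new Person(2, "jones"));
        System.out.println(sortedIdsAsList(people, "smith"));                   // prints [1, 3]
        System.out.println(Arrays.toString(sortedIdsAsArray(people, "smith"))); // prints [1, 3]
    }
}
```

The List<Integer> version is the easiest to consume from other code but pays for boxing, while the int[] version avoids boxing at the cost of a less convenient return type.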

Guava

Google’s own library that arguably breaks the bounds of a “collections” library. Guava excels at handling large streams of data and simplifying cases that involve concurrency.

Pros

  1. Well designed interfaces
  2. Many useful features such as: Caching, Optional, Immutables, Improved Reflection, and more
  3. Designed to work well with large streams of data
  4. Designed to work with concurrent workloads

Cons

  1. API churn: there have been a few cases, specifically around older versions of Hadoop, where we had to be very careful about which version was included for compatibility reasons.
  2. No primitive collections

Examples

/*
 * While not as verbose as the original example it is still wordy but operators
 * are chained for better readability and flexibility; though toSortedList
 * should have a default ordering
 */
public Iterable<Integer> findAllIdsForLastNameSortedByIdUsingJava7(
        Iterable<Person> people, final String lastName) {

    return FluentIterable
            .from(people)
            .filter(new Predicate<Person>() {
                public boolean apply(Person input) {
                    return input.getLastName().equals(lastName);
                }
            })
            .transform(new Function<Person, Integer>() {
                public Integer apply(Person input) {
                    return input.getId();
                }
            })
            .toSortedList(Ordering.natural());
}

/* very clean and to the point */
public Iterable<Integer> findAllIdsForLastNameSortedByIdUsingLambdaJ(
        Iterable<Person> people, String lastName) {

    return FluentIterable
            .from(extract(filter(having(on(Person.class).getLastName(),
                                        Matchers.equalTo(lastName)), people),
                          on(Person.class).getId()))
            .toSortedList(Ordering.natural());
}

/* extremely clean and easy to follow */
public Iterable<Integer> findAllIdsForLastNameSortedByIdUsingJava8(
        Iterable<Person> people, String lastName) {

    return FluentIterable
            .from(people)
            .filter(p -> p.getLastName().equals(lastName))
            .transform(Person::getId)
            .toSortedList(Ordering.natural());
}

GS-Collections

Goldman Sachs isn’t usually thought of as being an open source innovator, but GS-Collections really fills a lot of gaps in the core Java Collections library. In particular, it is very fast and memory/GC friendly, with a very strong interface. In nearly every case it can replace the core libraries, but for published interfaces it is probably best to stick with what is built in.

Pros

  1. Well designed interfaces
  2. Containers optimized for primitive types
  3. Very efficient (outperforming Trove in most cases)
  4. Works well with Java 8 syntax
  5. Has a Code Kata project for learning how to use the library

Cons

  1. Not as well documented or known (though the documentation is good, it’s just not great)
  2. There are some interface asymmetries between primitive and object classes but these seem to be closing with each release
  3. Harder to debug due to internal structure of the class
  4. There are a few interface conflicts with Java 8. They are easy to work around but can be confusing to new users
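To illustrate what "containers optimized for primitive types" buys you, here is a minimal sketch of a list backed directly by an int[]. It is a hypothetical class written for this post, not GS-Collections code: values are stored inline in the array, so no Integer object is allocated per element and the garbage collector has far fewer objects to track.

```java
import java.util.Arrays;

/** Hypothetical growable list of ints; illustrative only, not a library class. */
class IntListSketch {

    private int[] items = new int[8];
    private int size;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2);
        }
        items[size++] = value;
    }

    /** Returns the value at the given index without any unboxing. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    /** Sorts only the occupied part of the backing array, in place. */
    void sortThis() {
        Arrays.sort(items, 0, size);
    }

    /** Copies the occupied part of the backing array. */
    int[] toArray() {
        return Arrays.copyOf(items, size);
    }

    public static void main(String[] args) {
        IntListSketch ids = new IntListSketch();
        ids.add(3);
        ids.add(1);
        ids.add(2);
        ids.sortThis();
        System.out.println(Arrays.toString(ids.toArray())); // prints [1, 2, 3]
    }
}
```

A List<Integer> holding the same values would store a reference per element plus a separate Integer object for most values, which is where the memory and GC savings of primitive containers come from.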

Examples

/*
 * Verbosity is on par with Guava here but toSortedList "just works" and we get
 * a nice primitive iterable
 */
public OrderedIntIterable findAllIdsForLastNameSortedByIdUsingJava7(
        Iterable<Person> people, final String lastName) {

    return LazyIterate
            .adapt(people).select(new Predicate<Person>() {
                @Override
                public boolean accept(Person person) {
                    return person.getLastName().equals(lastName);
                }
            }).collectInt(new IntegerFunctionImpl<Person>() {
                @Override
                public int intValueOf(Person person) {
                    return person.getId();
                }
            })
            .toSortedList();
}

/*
 * Same as above
 */
public OrderedIntIterable findAllIdsForLastNameSortedByIdUsingLambdaJ(
        Iterable<Person> people, String lastName) {

    return LazyIterate
            .adapt(Lambda.filter(having(on(Person.class).getLastName(),
                                        Matchers.equalTo(lastName)), people))
            .collectInt(new IntegerFunctionImpl<Person>() {
                @Override
                public int intValueOf(final Person anObject) {
                    return anObject.getId();
                }
            })
            .toSortedList();
}

/*
 * The best of all worlds: a primitive iterable and a fluent interface
 */
public OrderedIntIterable findAllIdsForLastNameSortedByIdUsingJava8(
        Iterable<Person> people, String lastName) {

    return LazyIterate
            .adapt(people)
            .select(p -> p.getLastName().equals(lastName))
            .collectInt(Person::getId)
            .toSortedList();
}

Conclusion

As expected, there isn’t a one-size-fits-all library. We tend to favor GS-Collections for performance-critical classes while sticking with the standard collection libraries for boundary points and general cases. Until Java 8 is fully adopted, LambdaJ will continue to be used in place of trivial loops, filters, and mappings, but it is avoided in the rare performance-critical situations. Guava is used in places where concurrency is critical, and its helper methods are sprinkled throughout the system.

As the libraries and Java itself have matured, it’s clear that the APIs have started to look pretty similar, with the exception that the Java Collections still lack an immutable interface. However, the implementations are still very different. Guava’s thread-safe classes far outperform the built-in synchronized counterparts. Likewise, GS-Collections is far better at handling large collections in general, and primitive types specifically, in terms of memory and CPU performance.
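As a small illustration of the missing immutable interface, the sketch below uses only the JDK. Collections.unmodifiableList still returns a plain List, so add() compiles and only fails at runtime, and the wrapper is a read-only view over the original list rather than an independent copy. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class UnmodifiableViewSketch {

    /** Returns true if the list refuses additions at runtime. */
    static boolean rejectsAdd(List<Integer> list) {
        try {
            list.add(99); // compiles fine: the List type says nothing about mutability
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Integer> backing = new ArrayList<>(Arrays.asList(1, 2));
        List<Integer> view = Collections.unmodifiableList(backing);

        System.out.println(rejectsAdd(view)); // prints true

        // The view is read-only, not immutable: changes to the backing list show through.
        backing.add(3);
        System.out.println(view); // prints [1, 2, 3]
    }
}
```

Because the static type is the same for mutable and read-only lists, callers cannot tell from a method signature whether a returned collection is safe to modify or guaranteed not to change.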