Collection Pipeline
Collection pipelines are a programming pattern where you organize some computation as a sequence of operations which compose by taking a collection as the output of one operation and feeding it into the next. (Common operations are filter, map, and reduce.) This pattern is common in functional programming, and also in object-oriented languages that have lambdas. This article describes the pattern with several examples of how to form pipelines, both to introduce the pattern to those unfamiliar with it, and to help people understand the core concepts so they can more easily take ideas from one language to another.
21 July 2014
The collection pipeline is one of the most common, and pleasing, patterns in software. It's something that's present on the unix command line, the better sorts of OO languages, and gets a lot of attention these days in functional languages. Different environments have slightly different forms, and common operations have different names, but once you get familiar with this pattern you don't want to be without it.
First encounters
I first came across the collection pipeline pattern when I started with Unix. For example, let's imagine I want to find all my bliki entries that mention "nosql" in the text. I can do this with grep:
grep -l 'nosql' bliki/entries/*
I might then want to find out how many words are in each entry
grep -l 'nosql' bliki/entries/* | xargs wc -w
and perhaps sort them by their word count
grep -l 'nosql' bliki/entries/* | xargs wc -w | sort -nr
and then just print the top 3 (removing the total)
grep -l 'nosql' bliki/entries/* | xargs wc -w | sort -nr | head -4 | tail -3
Compared with other command line environments I'd come across before (or indeed since), this was extremely powerful.
At a later date I found the same pattern when I started using
Smalltalk. Let's imagine I have a collection of article objects (in
someArticles) each
of which has a collection of tags and a word count. I can select
only those articles that have the #nosql tag with
someArticles select: [ :each | each tags includes: #nosql]
The select method takes a single argument, a lambda
(defined by the square brackets, and called a
"block" in Smalltalk), which defines a boolean function. select
applies that function to every element in someArticles and
returns a collection of only those articles for which the lambda
evaluates to true.
To sort the result, I expand the expression.
(someArticles
select: [ :each | each tags includes: #nosql])
sortBy: [:a :b | a words > b words]
The sortBy method is another method that takes a
lambda, this time the code used to compare elements for sorting. Like
select, it returns a new collection, so I can continue
the pipeline
((someArticles
select: [ :each | each tags includes: #nosql])
sortBy: [:a :b | a words > b words])
copyFrom: 1 to: 3
The core similarity to the unix pipeline is that each of the
methods involved (select, sortBy, and
copyFrom) operate on a collection of records and return
a collection of records. In unix that collection is a stream with
the records as lines in the stream, in Smalltalk the collection is
of objects, but the basic notion is the same.
These days, I do much more programming in Ruby, where the syntax makes it nicer to set up a collection pipeline as I don't have to wrap the earlier stages of the pipeline in parentheses
some_articles
.select{|a| a.tags.include?(:nosql)}
  .sort_by{|a| -a.words}
.take(3)
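To make that concrete, here's a minimal, self-contained sketch of the same Ruby pipeline. The Article struct, the titles, and the word counts are all made up purely to exercise the pipeline; nothing here comes from a real article collection.

```ruby
# Hypothetical Article type and sample data, just to run the pipeline.
Article = Struct.new(:title, :words, :tags)

some_articles = [
  Article.new("NoSQL Distilled",      2000, [:nosql, :database]),
  Article.new("Refactoring",          3500, [:design]),
  Article.new("Polyglot Persistence", 1200, [:nosql]),
  Article.new("Aggregate Oriented",    900, [:nosql]),
  Article.new("Microservices",        5000, [:architecture])
]

top_three = some_articles
  .select  { |a| a.tags.include?(:nosql) }
  .sort_by { |a| -a.words }   # negate for descending word count
  .take(3)

top_three.each { |a| puts "#{a.title}: #{a.words}" }
```

Each stage returns a fresh collection, so the chain reads top to bottom in execution order, just like the unix version.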
Forming a collection pipeline as a method chain is a natural approach when using an object-oriented programming language. But the same idea can be expressed with nested function invocations.
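To contrast the two shapes before moving to lisp, here's a small Ruby sketch of the same pipeline written as nested function calls rather than a method chain. The helper lambdas and the data are illustrative only, not from any real code:

```ruby
# Illustrative helpers: each takes a collection and returns a collection.
nosql_only    = ->(articles) { articles.select  { |a| a[:tags].include?(:nosql) } }
by_words_desc = ->(articles) { articles.sort_by { |a| -a[:words] } }
first_three   = ->(articles) { articles.take(3) }

articles = [
  { title: "A", words: 500, tags: [:nosql] },
  { title: "B", words: 900, tags: [:nosql] },
  { title: "C", words: 300, tags: [:design] },
  { title: "D", words: 700, tags: [:nosql] },
  { title: "E", words: 100, tags: [:nosql] }
]

# Reads inside-out: select first, then sort, then take.
result = first_three.(by_words_desc.(nosql_only.(articles)))
```

The computation is identical, but the reader has to start at the innermost call and work outwards, which is exactly the readability problem the lisp examples below run into.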
To go back to some basics, let's approach how you might set up a
similar pipeline in common lisp. I can store each article in a
structure called article, which allows me to access the
fields with functions named like article-words and
article-tags. The function some-articles
returns the ones I start with.
The first step is to select only the nosql articles.
(remove-if-not (lambda (x) (member 'nosql (article-tags x))) (some-articles))
As with Smalltalk and Ruby, I use a function
remove-if-not that takes both the list to operate on
and a lambda to define the predicate. I can then expand the
expression to sort them, again using a lambda
(sort
(remove-if-not
(lambda (x) (member 'nosql (article-tags x)))
(some-articles))
(lambda (a b) (> (article-words a) (article-words b))))
I then select the top 3 with subseq.
(subseq
(sort
(remove-if-not
(lambda (x) (member 'nosql (article-tags x)))
(some-articles))
(lambda (a b) (> (article-words a) (article-words b))))
0 3)
The pipeline is there, and you can see how it builds up pretty nicely as we go through it step by step. However, it's questionable whether its pipeline nature is clear once you look at the final expression [1]. The unix pipeline, and the Smalltalk/Ruby ones, have a linear ordering of the functions that matches the order in which they execute. You can easily visualize the data starting at the top-left and working its way right and/or down through the various filters. Lisp uses nested functions, so you have to resolve the ordering by reading from the deepest function outwards.
Clojure, a popular recent lisp, avoids this problem by allowing me to write it like this.
(->> (articles)
(filter #(some #{:nosql} (:tags %)))
(sort-by :words >)
(take 3))
The "->>" symbol is a threading macro, which uses lisp's powerful syntactic macro capability to thread the result of each expression into an argument of the next expression. Providing you follow conventions in your libraries (such as making the subject collection the last argument in each of the transformation functions) you can use this to turn a series of nested functions into a linear pipeline.
For many functional programmers, however, a linear layout like this isn't that important. Such programmers handle the depth ordering of nested functions just fine, which is why it took such a long time for an operator like "->>" to make it into a popular lisp.
These days I often hear fans of functional programming extol the virtues of collection pipelines, saying that they are a powerful feature of functional languages that OO languages lack. As an old Smalltalker I find this rather annoying, as Smalltalkers used them widely. The reason people say that collection pipelines aren't a feature of OO programming is that the popular OO languages like C++, Java, and C# didn't adopt Smalltalk's use of lambdas, and thus didn't have the rich array of collection methods that underpin the collection pipeline pattern. As a result collection pipelines died out for most OO programmers. Smalltalkers like me cursed the lack of lambdas when Java became the big thing in town, but we had to live with it. There were various attempts to build collection pipelines using what we had in Java; after all, to an OOer, a function is merely a class with one method. But the resulting code was so messy that even those familiar with the technique tended to give up. Ruby's comfortable support for collection pipelines was one of the main reasons I started using Ruby heavily around 2000. I'd missed things like that a lot from my Smalltalk days.
These days lambdas have shaken off much of their reputation as an advanced and rarely-used language feature. Among mainstream languages, C# has had them for several years, and even Java has finally joined in. [2] So collection pipelines are now viable in many languages.
Defining Collection Pipeline
I consider Collection Pipeline to be a pattern of how we can modularize and compose software. Like most patterns, it pops up in lots of places, but looks superficially different when it does so. However if you understand the underlying pattern, it makes it easy to figure out what you want to do in a new environment.
A collection pipeline lays out a sequence of operations that pass a collection of items between them. Each operation takes a collection as input and emits another collection (except the last operation, which may be a terminal that emits a single value). The individual operations are simple, but you can create complex behavior by stringing together the various operations, much as you might plug pipes together in the physical world, hence the pipeline metaphor.
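As a sketch of that shape in Ruby (with hypothetical data): two collection-to-collection operations followed by a terminal reduce that collapses the collection to a single value.

```ruby
# Made-up data just to show the pipeline shape.
articles = [
  { words: 1200, tags: [:nosql] },
  { words: 3500, tags: [:design] },
  { words: 800,  tags: [:nosql] }
]

total_nosql_words = articles
  .select { |a| a[:tags].include?(:nosql) }  # collection -> collection
  .map    { |a| a[:words] }                  # collection -> collection
  .reduce(0, :+)                             # terminal: collection -> value
```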
Collection Pipeline is a special case of the Pipes and Filters
pattern. The filters in Pipes and Filters correspond to the
operations in Collection Pipeline; I say "operation" rather than
"filter" because filter is a common name for one particular kind of
operation in a Collection Pipeline. From another perspective, the
collection pipeline is a particular, but common, case of composing higher-order
functions, where the functions all act on some form of sequence
data structure.
The operations and the collections that are passed between the operations take different forms in various contexts.
In Unix the collection is a text file whose items are the lines in the file. Each line contains various values, separated by whitespace, and the meaning of each value is given by its position in the line. The operations are unix processes, and they are composed using the pipeline operator, with the standard output of one process piped to the standard input of the next.
In an object-oriented program the collections are instances of a collection class (list, array, set, etc). The items in the collection are objects; these objects may themselves be, or contain, further collections. The operations are the various methods defined on the collection class - usually on some high-level superclass - and they are composed through a method chain.
In a functional language the collections are similar to those of an object-oriented language. However the items are often generic data structures rather than objects - where an OO collection pipeline would use objects, a functional program would often use a hashmap. The elements of the top-level collection may themselves be collections, and a hashmap's values may be collections too, so as in the object-oriented case we can have an arbitrarily complex hierarchic structure. The operations are functions; they may be composed either by nesting or by an operator that forms a linear representation, such as Clojure's threading macros.
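This point about generic data structures can be sketched even in Ruby: the same pipeline works unchanged when the items are plain hashes rather than objects, because the operations only reach into each item through the blocks they're given. The data here is made up for illustration.

```ruby
# Items as plain hashes rather than objects; the pipeline
# operations don't care which, since access goes through blocks.
articles = [
  { title: "P", words: 400, tags: [:nosql] },
  { title: "Q", words: 900, tags: [:nosql] },
  { title: "R", words: 650, tags: [:ruby]  }
]

result = articles
  .select  { |a| a[:tags].include?(:nosql) }
  .sort_by { |a| -a[:words] }
  .take(3)
  .map     { |a| a[:title] }
```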
The pattern pops up in other places too. When the relational model was first defined it came with a relational algebra which you can think of as a collection pipeline where the intermediate collections are constrained to be relations. But SQL doesn't use the pipeline approach, instead using an approach that's rather like comprehensions (which I'll discuss later.)
The notion of a series of transformations like this is a common approach to structuring programs - hence the harvesting of the Pipes and Filters architectural pattern. Compilers often work this way, transforming from source code, to syntax tree, through various optimizations, and then to output code. The distinguishing thing about a collection pipeline is that the common data structure between stages is a collection, which leads to a particular set of common pipeline operations.
I'm releasing this article in installments. Future installments will include more examples of collection pipelines, some comments on alternatives and laziness, and a catalog of selected pipeline operations. Follow the RSS feed or my twitter stream to know when they appear.
Footnotes
1: A more idiomatic lisp pipeline
One issue here is that the lisp example isn't that idiomatic,
since it's common to use named functions (easily referenced
using the #'some-function syntax) - creating small
functions for particular cases as you need them. This might be a
better factoring of that example.
(defun nosqlp (article)
  (member 'nosql (article-tags article)))

(subseq
 (sort (remove-if-not #'nosqlp (some-articles))
       #'> :key #'article-words)
 0 3)
2: Java Pipeline
Here's the initial pipeline in Java
articles.stream()
.filter(a -> a.getTags().contains("nosql"))
.sorted(Comparator.comparing(Article::getWords).reversed())
.limit(3)
.collect(toList());
As you might expect, Java manages to be extra verbose in several respects. A particular feature of collection pipelines in Java is that the pipeline functions aren't defined on a collection class, but on the Stream class (which is different to IO streams). So you have to convert the articles collection into a stream at the beginning and back to a list at the end.