Skip to content

Building a Generic Statistics Library, Part 3: Fold

This post is the third in a series on building a statistics library using Delphi 2009’s generic types.  If you’re new to the series, you might want to start at the beginning.

Computing the Average or Standard Deviation of a set of numbers is a fairly trivial task.  For smaller sets, you can do it in your head.  It certainly does not require advanced programming techniques to do in code.  But recall that my purpose in writing this series was to answer commentor Raymond’s request for "something like MapReduce." So when you read this post and wonder why I’ve chosen such a roundabout method of doing some simple sums, keep this purpose in mind.  The point is to learn about generic list operations, not to learn how to compute an average.  If you need help computing an average, you probably won’t understand the rest of this post.

I chose statistical functions as a sample problem because it is so easy to express a statistical function in terms of operations across lists, and operations across lists are what mapping functions do well.  In this post, I am going to discuss two list operations, Fold and Map, which will be useful in implementing the Average and StandardDeviation functions. Then I’ll go into some detail about what Fold does and how I use it to compute the sum and count of list elements. I’ll discuss Map further in the next post in this series.

Both Fold and Map are "higher order functions," which simply means that they take a function as one of their arguments.  Both Fold and Map transform a list, but they do it in different ways.  Fold applies a function to pairs of entries in a list, carrying an aggregated value as it works, whereas Map generates a new list by applying a function to individual entries in a list, and creating a new list from the results of the function.

In my implementation of the average and standard deviation functions, I will use Fold to calculate the sum and count of entries in a list, and Map to create a new list containing the squared deviations of entries in the original list.

Anyone who has ever balanced a checkbook can intuitively understand what Fold does.  Yet, people often find the function confusing, because of differences in how different versions of the function operate.  So I’m going to examine it in some detail.

Fold is a function which can only be used on ordered lists such as arrays and TList. Note that when I say "ordered" here I don’t mean alphabetical or numeric order, only that each item in the list has a defined position.  In other words, something unlike a (mathematical) set. Fold takes three arguments: The list of values, a function which takes two values as arguments and returns a single value, and an "initial" value.  The Fold traverses the list, performing an operation on each item in the list, and also passing to this operation the result of the previous operation.

Since the fold function internally works on pairs of values, and since it does the same operation on each item in the list, there is also a need for an "initial" value.  Otherwise, we would not have a second argument for the passed function when we evaluated the first (or at last, depending on implementation) entry in the list. This is why I need the "mapping" function I described in my first post in this series.  The function allows me to express an "initial" value in terms of the type used as the type parameter.  However, as I’ve developed this series, I’ve decided to call the function an "explicit cast" instead of a "mapping" function.  Explicit is a Delphi operator which performs exactly the task I’m trying to do.  Using this name is both to avoid confusion with the unrelated Map function I’ll describe later, and for consistency with the notion of passing functions as an alternative to "real" operator constraints.

I will use Fold to calculate the sum of the entries in the list like this:

class function TStatistics<T>.Sum(
  const AData: TEnumerable<T>;
  AAdder: TBinaryOp<T>;
  AExplicitCast: TFunc<integer, T>): T;
var
  Init: T;
begin
  Init := AExplicitCast(0);
  Result := FoldL(AAdder, AData, Init);
end;

FoldL will execute like this: Start with the initial value.  Execute the passed function (in this case, addition), passing the initial value and the first item in the list.  Execute the function again, passing the result of the first execution and the second item in the list.  Execute the function again, passing the results of the second execution and the third item in the list.  Continue in this way until we are out of items.  Since the "initial" value I pass is zero, the result will be the sum of the values in the list.

By modifying the AAdder function slightly, I can calculate a count of the items in the list rather than a sum.  How can I modify a function passed as an argument?  By wrapping it in an anonymous method:

class function TStatistics<T>.Count(
  const AData: TEnumerable<T>;
  AAdder: TBinaryOp<T>;
  AExplicitCast: TFunc<integer, T>): T;
var
  Init: T;
begin
  Init := AExplicitCast(0);
  Result := FoldL(
    function(ALeft, ARight: T): T
    begin
      Result := AAdder(ALeft, AExplicitCast(1));
    end,
    AData,
    Init);
end;

So Count works almost exactly like Sum, except that I alter the passed AAdder function such that it ignores its second argument and always adds 1 to the running total. (Why couldn’t I just use the Count property of the list? TEnumerable<T> doesn’t have one.)

You might have noticed that I called my Fold function FoldL instead of just Fold.  This is because there are two variants of Fold: A left fold and a right fold. Despite the names, both variants process entries in the list in the same order, from start to finish.  The "left" and "right" variants differ in terms of where the "initial" value is used.  If we write down the entries in the list from left to right, the left fold variant positions the initial value on the far left-hand side of the list, and the right fold variant positions the value on the far right-hand side of the list.  In a recursive implementation of fold, the left and right variants also differ in terms of the point at which recursion occurs.  If the compiler does a tail recursion optimization, this results in a substantial performance benefit for the right fold variant.  But in a looping implementation, like the one I’m going to demonstrate, performance should be identical in both variants.

With a commutative function, like addition, the result will also be identical in both variants. So for my Count and Sum functions, I could have used either FoldL or FoldR. With a noncommutative operation such as subtraction, FoldL and FoldR will obviously give different results.  The following diagrams from Wikipedia illustrate the two functions well.

My implementation of FoldL is quite simple.  Note that this is a very limited FoldL; in keeping with my desire to not make this series overly long I’ve restricted this FoldL to only be able to return T. A more general FoldL would allow an arbitrary return type.

class function TStatistics<T>.FoldL(AFunc: TBinaryOp<T>;
  const AData: TEnumerable<T>; const AInit: T): T;
var
  Current, Previous: T;
begin
  Previous := AInit;
  for Current in AData do begin
    Previous := AFunc(Previous, Current);
  end;
  Result := Previous;
end; 

In my next post, I’ll discuss Map.

Post a Comment

Your email is never published nor shared. Required fields are marked *

Bad Behavior has blocked 713 access attempts in the last 7 days.

Close