Back to Blog

Why SelectMany is the most powerful LINQ operator

30 Sep 2016

Version after version C# becomes more and more functional language. We are now waiting for pattern matching, better tuples and other functional-inspired sugar. Funny thing is that really strong functional feature is already in C#, although implemented by the occasion. In functional world this feature is named differently in different languages - it’s “do notation” in Haskell or “computation expression” in F#. It allows to write complex, imperative code in simple way (in fact, it is all about mysterious Mr. M, but you aren’t going to need him for now). In C#, it is not as generic as its functional counterparts but it’s there.

This feature is named LINQ expressions.

In this post I will show how to use LINQ in a different ways than querying collections or generating SQL statements. Because LINQ expressions are not limited to IEnumerable<T> we can use it to handle many imperative constructs like null checking, threads, side effects or even exceptions in pure functional way.

if(blah!=null) no more

Let’s talk about null-conditional operator and null checking in general. This topic doesn’t seem to be related to functional programming concepts at the beginning but be patient. I will start by writing very small sample code from real application which we will try to refactor a bit.

So here we have a class that represents a process:

public class Process
{
  public DateTime? StartTime { get; set; }
  public DateTime? FinishTime { get; set; }

  public TimeSpan? Duration
  {
     get
     {
        if (StartTime.HasValue == false)
        {
          return null;
        }

        var finishTime = FinishTime.HasValue
  	      ? FinishTime.Value
  	      : DateTime.UtcNow;

        return finishTime - StartTime.Value;
     }
  }
}

A process might be run in some point of time and take a while, so it will have a start and finish time. It might also have a duration. The happy path is that the duration is just the difference between start and finish time. But nothing is so simple in programming, so of course we need some more ifs. Duration makes sense only if a process already started (StartTime exists). And if a process is not finished yet (FinishTime doesn’t exist), it will be the time elapsed from the start till now. All this logic is expressed in the Duration getter.

From the first sight there is no problem actually, but I want to keep it simple. The issue here is nesting that makes more complex rules in larger codebase hard to read and error prone.

In the example above there is a defensive code around Nullable<T> value checks. Let’s think about how Nullable<T> could be implemented. Default implementation is just a struct with HasValue and Value properties. Pretty natural way to implement this concept in language like C#.

But it could also be implemented with a slight different approach - instead of utilizing properties we could say that Nullable<T> can be a special collection that contains either only one element or it can be empty. In other words, the case when HasValue == true is equivalent of collection of exact one element while HasValue == false is equivalent of empty, generic collection (I’m using collection term but it actually could be lazy sequence, i.e. IEnumerable<T>). Before we elaborate on benefits of this approach let’s try to implement it. In this new version Nullable<T> could be written as an IEnumerable<T> implementation:

struct Nullable<T> : IEnumerable<T>
{
}

OK, it couldn’t. Of course, the implementation of Nullable<T> cannot be changed so the line above won’t work. But existing type can be converted to IEnumerable<T>:

public static IEnumerable<T> AsEnumerable<T>(this Nullable<T> _this)
	where T : struct
{
  if (_this.HasValue)
  {
    yield return _this.Value;
  }
}

Let’s play with it for a while. Having Nullable converted to collection we can rewrite Duration property in the way that foreach is the equivalent of ifs:

public TimeSpan? Duration
{
  foreach (var startTime in StartTime.AsEnumerable())
  {
    var finishTime = FinishTime.AsEnumerable()
                               .DefaultIfEmpty(DateTime.UtcNow);
      foreach (var endTime in finishTime)
      {
        return endTime - startTime;
      }                
    }
  return null;
}

It might not seem to have a lot of sense initially, but now this is the moment where things got more interesting. Instead of using loops, the code above can be expressed in LINQ query:

public IEnumerable<TimeSpan> Duration
{
  get
  {
    return
      from startTime in StartTime.AsEnumerable()
      from finishTime in FinishTime.AsEnumerable()
                                   .DefaultIfEmpty(DateTime.UtcNow)
      select finishTime - startTime;                
  }
}

Even if you don’t “feel” this syntax yet. Here is the thing: all the nesting is gone. No if’s especially the first one which was a some kind of clutter here and doesn’t provide any meaning in the business logic. This code is more concise, easier to read (once you get the feel of it) and immutable.

One small difference is that the return type is different. But is it, really? We just said that Nullable<T> is the same thing as IEnumerable<T> and in this case return value will meet the requirements of this contract. We just need to convert IEnumerable<T> back to original type.

Using IEnumerable<T> in this example was for the sake of showing the analogy. Actually we can take this type out of the equation now - the good news is LINQ expressions doesn’t require collections after all. Type used in LINQ expression doesn’t matter: it only has to have methods with proper signatures. There is one extension method that is called under the hood in the snippet above: SelectMany. In other words, we need to implement this extension method for Nullable<T> type.

But let’s make one step further already and implement our own type, Maybe<T>. Purpose of this type will be the same as Nullable<T> but we will also be able to handle reference types. By declaring anything as Maybe<T> we explicitly express that value is optional and consumer code needs to take care of case when no value is present. All implementation details are not important here but, not to mention DefaultIfEmpty, having such SelectMany:

public static Maybe<TResult> SelectMany<TSource, TTemp, TResult>(
  this Maybe<TSource> _this,
  Func<TSource, Maybe<TTemp>> optionSelector,
  Func<TSource,TTemp,TResult> resultSelector  )
  {
    if (_this.HasValue)
    {
      var temp =  optionSelector(_this.Value);
      if (temp.HasValue)
      {
          return new Maybe<TResult>(
            resultSelector(_this.Value,temp.Value));
      }
    }
    return new Maybe<TResult>();
  }
}		

we can write the whole class as:

public class Process
{
  public Maybe<DateTime> StartTime { get; set; }
  public Maybe<DateTime> FinishTime { get; set; }

  public Maybe<TimeSpan> Duration =>
    from startTime in StartTime
    from finishTime in FinishTime.DefaultIfEmpty(DateTime.UtcNow)
    select finishTime - startTime;
}

Because now the getter is a (LINQ) expression we can use even shorter notation with =>. And finally we end up with 3 lines of code instead of 13.

One more example. Instead of using Maybe and its HasValue we could want to check if(blah !=null). We can then write SelectMany for every object that can be null:

public static TResult SelectMany<TSource, TTemp, TResult>(
   this TSource _this,
   Func<TSource, TTemp> optionSelector,
   Func<TSource, TTemp, TResult> resultSelector)
   where TResult:class
{
   if (_this!=null)
   {
      var temp = optionSelector(_this);
      if (temp!=null)
      {
        return resultSelector(_this, temp);
      }
   }
   return null;
}

Having that method we can use LINQ expression as a null-conditional operator:

public class Node
{
   public Node Parent { get; set; }

   //this property:
   public Node GranGrandPa =>
     from parent in Parent
     from grandPa in parent.Parent
     select grandPa.Parent;

   //is equivalent of
   public Node GranGrandPa2 => Parent?.Parent?.Parent;
}

The difference between LINQ expression and built-in operator is that the former is more generic: you are not limited to getting class members. “Extracted” value may come from everywhere. You can also use let clauses like in normal LINQ queries.

private Tuple<Node, int> Sample()
{
   return from parent in Parent                   
          from grandPa in parent.Parent
          let now = DateTime.Now
          from details in GetNodeDetails(parent, now)
          select Tuple.Create(grandPa, details);
}

Method above will return null if anything turns to be null on the right side of in. Also, first null stops further execution: if Parent or parent.Parent is null then GetNodeDetails will never be called. So here we have real “LINQ to objects” now:)

Nevertheless, in this sample I wanted to show the analogy with null-conditional operator and how far actually we can get with LINQ syntax. Personally I wouldn’t implement SelectMany in this way, because it may turn to be harder to read and interpret which implementation is really used (we will implement async/await with LINQ in a moment). Instead, I would stick with explicit Maybe<T> (or Option<T>, call it what you want) in this case - it is good practice for your domain model even if you aren’t going to use all those LINQ tricks.

Reimplementing async/await

The place where functional syntax like above can be very useful is asynchronous code. The main benefit is that it allows to write asynchronous flow like normal code, without the need of nested callbacks. In fact, async/await is actually built-in implementation of pretty much the same pattern. To explain where this analogy comes from, let me start with another small example:

Task<int> GetInt(){...}

Task<string> GetString(){...}

Task<Tuple<char, double>> AsyncSample()
{
  return GetInt().ContinueWith(
    i =>
      {
        var idx = i.Result;
        var dateTime = DateTime.UtcNow;
        return GetString().ContinueWith(s =>
        {
          var diff = DateTime.UtcNow - dateTime;
          var str = s.Result;
          return Tuple.Create(str[idx], diff.TotalMilliseconds);
        }).Result;
      }                
    );
}

Yuck. Please don’t try this at home. Here we have two tasks: one comes from GetInt(), second is returned from GetString(). The latter is run when the former is finished + plus some values calculated under way. Nothing special but while I believe you can live with if’s, this one may become real monster in production code once it gets bigger. Thus let’s simplify it a little bit and get rid of nested lambdas. Besides async/await there is one more solution already available to achieve this. I’m talking about Reactive Extensions.

Rx is a one of a few examples of innovations that come from .NET world which gained wide adoption beyond their original ecosystem. Very brilliant idea that this library brings in is that events, or event streams, are treated as collections. This is a real-life example of the concept I’m writing here - it also utilizes LINQ queries initially designed only for collections. At the same time, the library is just implementation of observer pattern, with “observables” as objects that emits events which you can subscribe to.

In order to translate example above to reactive version, first we have to convert Task<T> to type that Rx can understand: IObservable<T>. This is Rx equivalent of Subject in observer pattern. How can we achieve this? Turns out, you can map asynchronous task to observable stream of events if you say that task is just an observable object that will emit exactly one event in the future. It totally makes sense: subscribing to events means starting a task while emitting an event means that the task is done (and it will call your callback). Fortunately there is no need to implement anything from scratch because Rx already provides ToObservable extension method for tasks which does exactly this.

On the other hand, observables can be converted to IEnumerable and as I mentioned they can be treated as collections. When you iterate over collection which was created from observable your program will stop on IEnumerator.OnNext() until new event is emmitted (so this is some kind of translation from asynchronous to synchronous code).

Having all those transformations under your belt, let’s map Task<T> to IObservable<T> and then to IEnumerable<T> so we can rewrite example above like this:

public static IEnumerable<Tuple<char, double>> AsyncSample()
{
  foreach (var idx in GetInt().ToObservable().ToEnumerable())//freezes on OnNext until task is finished
  {
    var dateTime = DateTime.UtcNow;
    foreach (var str in GetString().ToObservable().ToEnumerable())//freezes on OnNext until task is finished
    {
      var diff = DateTime.UtcNow - dateTime;
      yield return Tuple.Create(str[idx], diff.TotalMilliseconds);
    }
  }            
}

No lambdas at least but still some rough edges. Because we have two nested loops, we can harness SelectMany again and join them (or bind them) in LINQ query:

IEnumerable<Tuple<char, double>> AsyncSample()
{
  return from idx in GetInt().ToObservable().ToEnumerable()
    let dateTime = DateTime.UtcNow
    from str in GetString().ToObservable().ToEnumerable()
    let diff = DateTime.UtcNow - dateTime
    select Tuple.Create(str[idx], diff.TotalMilliseconds);
}

Still the main problem with this code is that it removes asynchrony and freezes caller or the method. Fortunately Rx already have SelectMany implementation for its own interfaces we can remove ToEnumerable() at all:

IObservable<Tuple<char, double>> AsyncExample()
{
  return from idx in GetInt().ToObservable()
         let dateTime = DateTime.UtcNow
         from str in GetString().ToObservable()
         let diff = DateTime.UtcNow - dateTime
         select Tuple.Create(str[idx], diff.TotalMilliseconds);
}

I imagine that if you are not familiar with Rx those steps may not fully clear, but that’s fine; all details in the snippet above are not so important. (although I encourage you to take a look at this library - there is a lot of cool stuff developed these days around reactive concepts). But one thing to remember: this analogy where Task<T> is just event source which emits exactly one event. By the way, look, this is the same analogy we used when talking about collections and nullable values (nullable value is a collection with maximum one element). If GetInt doesn’t emit event and task never ends, further code will never be called.

And like for Nullable<T>, one difference is that return type is different now. But, again, we have also ToTask<T> extension method in Rx so this is not a problem to map IObservable<T> back to original type.

Finally, we also can get rid of ToObservable<T>, if we implement special SelectMany<T> for Task<T>:

Task<V> SelectMany<T, U, V>(this Task<T> task, Func<T, Task<U>> f, Func<T, U, V> c)
{
  return Task.Run(() =>
  {                
    var t = task.Result;
    var ut = f(t);                
    var utr = ut.Result;
    return c(t, utr);
  });
}

This method tells us how to “bind” two tasks. Func<T, Task<U>> f is crucial here: it’s a some kind of “glue” that provides second task having value of T that first one produced.

You may notice that LINQ query in AsyncExample() also needs Select<T> but it is not as important. You will find full implementation along with examples in GitHub repo. I encourage you to take a look and try to play with it. It will be more efficient than explaining all details step-by-step here.

Having our “LINQ to Tasks” implemented, we can write final version:

public static Task<Tuple<char, double>> AsyncExample() =>
	from idx in GetInt()
	let dateTime = DateTime.UtcNow
	from str in GetString()
	let diff = DateTime.UtcNow - dateTime
	select Tuple.Create(str[idx], diff.TotalMilliseconds);

Now, how to write the same thing with async/await? Just replace from .. in with var .. = await, let with var, select with return and you have it:

async Task<Tuple<char, double>> AsyncExample()
{
   var idx = await GetInt();
   var dateTime = DateTime.UtcNow;
   var str = await GetString();
   var diff = DateTime.UtcNow - dateTime;
   return Tuple.Create(str[idx], diff.TotalMilliseconds);
}

Yes, you can do more with async. You can mutate state between tasks, call void methods, catch exceptions and do all this imperative “goodness”. But this is “lambda school” and I wanted to present how we can achieve the same things with the immutable functions and expressions. Because at the end of the day, all you provide to LINQ queries is Func<T, Something<U>> and other functions that are pure - get a value and return a value, nothing more. And immutability removes all classes of problems and makes the code easier to understand because pure functions, by definition, are simple.

All this gives you benefits on more higher, architectural level, much bigger than only syntactic sugar over switch statements (Yes, I’m talking about C# 7.0 “pattern matching”. This is not real pattern matching because, unlike in Haskell, Ocaml or F# implementations, you cannot assign this thing to variable. In other words, it’s not an expression but statement which is totally contrary to functional philosophy).

There is more.

And by the way, you can implement error handling and mutating state with LINQ too. There is a C# library inspired by Haskell, LanguageExt where you will find more constructs written this way.

So, that’s all about C# from me:)

Till next time.