Thursday, April 13, 2017

Lambdas, LInQ, Javascript and so on

As a .NET developer I am very familiar with LInQ, or Language Integrated Queries, which is a collection of fluent interface methods that deal with querying data from collections. However, so many people outside the .NET ecosystem are not familiar with the concept or they use it as disparate functions in their language of choice. What makes it even more confusing is that the same concept is implemented in other languages with different names. Let me give you an example:
var arr=[1,2,3,4,5,6];
var result=arr
  .Where(v=>v%2==0) //get only even values
  .Select(v=>v*10)  //return their values multiplied with 10
  .Aggregate(15,(s,v)=>v+s); //aggregate their value into a sum that starts with a seed of 15
// result should be 15+2*10+4*10+6*10=135

We see here the use of three of these methods:
  • Where - filters the values on a condition
  • Select - changes the values it returns
  • Aggregate - creates an aggregate value using an operation on all the values in the collection

Let me write you the same C# code without using these methods:
var arr=[1,2,3,4,5,6];
var result=15;
foreach(var v in arr) {
  if (v%2==0) {

In this case, some people might prefer the second version, but it is only an example. LInQ is not a silver bullet that replaces all loops, only a tool amongst many in a large toolset. Some advantages of using such a method are concise code, better readability, a common API for iterating, filtering and querying collections, etc. For example in the largely used Entity Framework or its previous incarnations such as Linq over SQL, the queries would look the same, but they would be translated into SQL and sent to the database and executed just once. So it would not get a list of thousands of records to filter it in memory, instead it would translate the expression of the function sent to the query into SQL and execute it there. The same sort of operations can be used on streams of data, rather than fixed collections, like in the case of Reactive Extensions.

Some other methods in this set include:
  • First/Last - getting the first or last element in an enumerable that satisfies a boolean condition
  • Skip - ignoring a number of values in a collection
  • Take - returning a number of values in a collection
  • Any/All - returning true if at least one or all of the items satisfy a boolean condition
  • Average/Sum/Min/Max - specific aggregating methods for the elements in the collection
  • OrderBy/OrderByDescending - sorting
  • Count - counting

There are many others, you can look them up here.

Does this system of querying data seem familiar to you? To SQL developers it will feel second nature. In SQL the same result from above would be achieved by using something like:
SELECT 15+SUM(SELECT v*10 FROM table WHERE v%2=0)

Note that other than putting the source of the data in front, LInQ syntax is almost identical.

However, in other languages this sort of data query is called map/reduce and in fact there is a very used programming model called MapReduce that applies in big data processing. In Java, the function that filters data is called filter, the one that alters the values is called map and the one that aggregates data is called reduce. Similar in Javascript. Here is the same code in Javascript:
var arr=[1,2,3,4,5,6];
var result=arr
  .filter(v=>v%2==0) //get only even values
  .map(v=>v*10)  //return their values multiplied with 10
  .reduce((s,v)=>v+s,15); //aggregate their value into a sum that starts with a seed of 15
// result should be 15+2*10+4*10+6*10=135
Note that the lambda syntax of writing functions used here is new in ECMA Script version 6. Before you would have to use the function(x) { return [something with x]; } syntax.

In Haxe, the concept is achieved by using the Lambda library and the functions are again named differently: filter for filtering, map for altering and fold for aggregating.

There is another sort of people that would instantly recognize this model of data querying: functional programming people. Indeed, SQL is a functional programming language at its core and the same standard for data querying is used very efficiently in functional programming languages, since they know whether a function is pure or not (has side effects). When only dealing with pure functions, some optimizations can be made on the query by the compiler before anything is even executed. Haskell has the same naming as Haxe (filter, map, fold) for example.

So whenever I get to review other people's code, especially people that have little experience with either SQL or C#, I cringe to see stuff like this:
var max=-1;
for (var i=0; i<arr.length; i++) {
  if (max<arr[i]) max=arr[i];
In my head this should be simply arr.max(); And considering how easy it is to implement something like this in Javascript, for example, it's a crime for not using it:
Array.prototype.max=function() { return Math.max.apply(null,this); }

Yet there is more to this than my personal preference for reading code. Composition, for example. Because this works like a fluent API or a builder pattern, one can keep adding conditions to a query. Imagine you have to filter a list of strings based on a Google like query string. At the very minimum you would need to split the query into strings and filter repeatedly on each one. Something like this:
var arr=['this is my special query string','this is a string','my query string is this awesome','no query strings here, move along','these are not the strings you are looking for'];
var query="this is a query string";
var splits=query.split(/\s+/g);
var result=arr;

There is a lot of stuff I could be saying about this subject, but I need to summarize it. It's all about inverting loops. Instead of having to go through a collection, a stream or some other data source, then executing some code for each element, this method allows you to encapsulate the operations you want to execute on those elements, pass them around, compose them, translate them, then use them on any data source in the same way. A common API means reusability, better readability of code, less written code and a simpler declaration of intent. Because we get out of the loop system, we can expand the use for other paradigms, such as infinite data streams or event buses.