Welcome to MSDN Blogs Sign in | Join | Help

This is the eleventh in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you’ll want to do so before proceeding, or at least before proceeding to copy the code into your own project and telling your boss you single-handedly solved the data layer problem over the weekend.

Part I - Reusable IQueryable base classes
Part II - Where and reusable Expression tree visitor
Part II - Local variable references
Part IV - Select
Part V - Improved Column binding
Part VI - Nested queries
Part VII - Join and SelectMany
Part VIII - OrderBy
Part IX - Removing redundant subqueries
Part X – GroupBy and Aggregates

No excuses for a late post this time. In my estimation, not only am I early but I have arrived with abundance. This is not my ordinary, keep it to the basics, slow and steady post about adding that next incremental feature to a feature-starved sample program. This is the Big Kahuna! This is the post where I finally go over the edge, lose my cool and let loose on the code. No more Mr. Nice guy.  Now, I’m really taking it to the cleaners. 

There’s so much goodness here I don’t even know where to start.  Even if I did, how could I possibly describe it all in one little post? I’m not even going to try. You’re going to have to download the source to get the real truth. A link is located at the bottom of the post. Click it. Feel the power flow over the internet, through the wires and onto your screen. Feel the radiance seep in through your eyes and take hold of your brain with the vice-like grip of a big-time wrestler.

So just what have I been cooking?  Let’s take a look at the highlights.

More Query Operators

That’s right; a lot more. Don’t believe me? Look at the code and see for yourself. Too busy? Boss breathing down your neck. Okay, here’s the list.

Distinct - Not only is there translation for this, but Distinct also makes an appearance inside most aggregates! AVG(DISTINCT t0.Price) anyone!

Skip & Take – Both work and work well together using the ROW_NUMBER & BETWEEN combo that gets incredible performance.

First, FirstOrDefault, Single & SingleOrDefault – No, these are not just the same thing. They really do behave differently.

Any, All & Contains - Not only do these guys work, but all three work with local collections. Use collection.Any(xxx) to get any predicate expression expanded into a series of OR conditions over the input set. 

Framework Methods and Properties


Hallelujahs, brothers! Now you can write queries that reference simple String, DateTime, Decimal, and Math operations, and get them translated into equivalent SQL.  I’ve not implemented every possible method but you’ll find a lot there. There are so many I can’t even list them all here. Rest assured you’ll get classics like string StartsWith, EndsWith and Contains and every variation of Equals and Compare that the framework can throw at me.

Parameters

That’s what I said, parameters. No more SQL injection attacks in the sample code, I’ve finally gotten around to adding parameter support. Take a look, it’s all rather straight forward; a new node type NamedValueExpression, a new visitor Parameterizer and a few extra lines of code to create and assign parameter values at execution time. Not every argument is parameterized either. Simple numeric values are left as literals.

Simpler Queries

The RedundantSubqueryRemover has been enhanced to know how to merge more types of SelectExpression’s together. Now, you’ll get even less layers of subqueries.

Compiled Queries

OMG! Can it be so? Yes, compiled queries too. How is it possible? It took a little refactoring of how queries are executed. There is now a new low-level Execute method on DbQueryProvider that takes a translated command, parameters and a function that converts a DbDataReader into a result (this could be sequence or a single value.)  Given that and LINQ Expression trees themselves, it is easy to see how I can turn a request to execute a LambdaExpression into constructing a function that when called, calls down to the low-level execute method with pre-translated SQL queries and a pre-translated object reader.

Unit Tests

Okay, get back up into your chair. I realized how easy it is to lose control and find the need to roll around the floor in ecstasy. However, your co-workers are starting to wonder if they should call for medical assistance. 

This does not mean I’m using TDD, or DDT or any TLA methodology; it just an apology really. The sample codebase has become so unwieldy that even I can’t trust myself anymore and have discovered so many prior features that I’ve broken in the process of doing bigger and better things that I literally broke down and wrote some simple unit tests. AND they are simple too. For the most part they prove 1) some LINQ operators are translated into the SQL I thought looked good at the time, and 2) the database server actually accepts these queries and does something with them.  There are over 200 of these tests now.  There is also a start at verifying actual semantics by looking at the results.  I started a second bank of tests for these and so far only have a few compiled-query tests listed.  More surely to come.

 

Whew!  I know that’s a lot to GROK, but take a look anyway.  If you’ve been building your own provider and have been stuck on how to do some of these kinds of features (except for tests, shame on you if you haven’t got your own tests yet), now you can get moving again.

But WAIT, there’s more!  I’m not done yet.  Even with all this I’ve shown you today, I still have a few more ideas.  More posts to come. 

This is the tenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you'll want to find a nice shady tree, relax and meditate on why your world is so confused and full of meaningless tasks that it has kept you from pursuing the perfection of writing IQueryable providers.

Part I - Reusable IQueryable base classes
Part II - Where and reusable Expression tree visitor
Part II - Local variable references
Part IV - Select
Part V - Improved Column binding
Part VI - Nested queries
Part VII - Join and SelectMany
Part VIII - OrderBy
Part IX - Removing redundant subqueries

Last time I blamed the television writers strike for delaying me in getting out my next installment.  This time I blame the lack of one, and sunny days, and my son riding his bicycle, and the grueling, tiresome task of getting paid.  Would you believe some of the code we are writing has nothing what-so-ever to do with IQueryable LINQ providers? Crazy, I know. Maybe it would be a different story around here if we had a few more shady trees. 

Fortunately for you I have a prime piece of savory source code ready for slow grilling over a bed of hot mesquite. It's something I'd like to say I was saving for a special occasion, but the truth is I've been putting it off because I thought the code would be just too overwhelming. I mean for me, not you. You see I've been trying to do two things with this series; one, provide a guide to help developers navigate and understand the power and flexibility of the IQuerayble interface, and two, attempt to prove to myself that code written in a purely functional style (or as pure as I can make it) can end up much cleaner and easier on the eyes. I've been horrified ever since tackling OrderBy that this subject would become my undoing. Knowing how involved the translation was for LINQ to SQL I was dreading the immutable madness that would ensue.  Obviously, I survived the ordeal and so will you.

Grappling with GroupBy and Aggregates

Translating just GroupBy itself might not be so difficult if I did not have to also account for aggregates. It seems that GroupBy is not very interesting without the ability to write expressions involving Count() and Max() and Sum(), and that's where the difficulty sets in.

A query involving groups and aggregates that looks like this:

var query = from o in db.Orders

            group o by o.CustomerID into g

            select g.Max(o => o.OrderID);

Is translated into a series of method calls that looks like this:

var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));

 

And this is a problem because as I translate from the bottom up (as I have been doing all along) building individual SELECT subqueries for each query operator, I won’t even discover the use of the aggregate ‘Max’ until after I’ve created the SELECT with the GROUP BY.  So how do I get the two pieces together in the same place at the same time without violating my functional style and immutable expression nodes?  Weirder yet, how do I even know that a call to an aggregate method should be tied back to a particular GroupBy? What if I had two group-bys?

 

So where do I start to explain?  Maybe the easy stuff first.  I’ve added a GroupBy property to SelectExpression.

 

internal class SelectExpression : Expression {
    ...

    ReadOnlyCollection<Expression> groupBy;

 

    internal SelectExpression(
        ...,

        IEnumerable<Expression> groupBy

        )

    internal ReadOnlyCollection<Expression> GroupBy {

        get { return this.groupBy; }

    }

}

 

Now every place where I construct one I have to specify a collection of grouping expressions (or null).  GroupBy expressions also get visited after everything else (so far).  Here’s the new version of VisitSelect in DbExpressionVisitor.

 

protected virtual Expression VisitSelect(SelectExpression select) {

    Expression from = this.VisitSource(select.From);

    Expression where = this.Visit(select.Where);

    ReadOnlyCollection<ColumnDeclaration> columns = this.VisitColumnDeclarations(select.Columns);

    ReadOnlyCollection<OrderExpression> orderBy = this.VisitOrderBy(select.OrderBy);

    ReadOnlyCollection<Expression> groupBy = this.VisitExpressionList(select.GroupBy);

    if (from != select.From

        || where != select.Where

        || columns != select.Columns

        || orderBy != select.OrderBy

        || groupBy != select.GroupBy

        ) {

        return new SelectExpression(select.Type, select.Alias, columns, from, where, orderBy, groupBy);

    }

    return select;

}

 

I also added an AggregateExpression class. It represents a call to an aggregate operator.

 

    internal enum AggregateType {

        Count,

        Min,

        Max,

        Sum,

        Average

    }

 

    internal class AggregateExpression : Expression {

        AggregateType aggType;

        Expression argument;

        internal AggregateExpression(Type type, AggregateType aggType, Expression argument)

            : base((ExpressionType)DbExpressionType.Aggregate, type) {

            this.aggType = aggType;

            this.argument = argument;

        }

        internal AggregateType AggregateType {

            get { return this.aggType; }

        }

        internal Expression Argument {

            get { return this.argument; }

        }

    }

 

Now I at least know what to turn those aggregates into.  But how do I get these aggregate expressions into the right place in the query tree?  What if I did nothing and just let the aggregates fall where they may?  What would it look like?  Would it even be legal SQL?

 

SELECT MAX(t5.OrderID)

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Oh, no. Danger Will Robinson! This is not legal SQL. Where does OrderID even come from? It’s not even projected out of the sub-query with the GROUP BY. And even if it were somehow projected, the result of the query would be all wrong.  It would give me a single maximum instead of one for each group.  Getting group-by & aggregates to work together is going to be tricky.

 

Is it even possible to solve it and maintain my layering approach?  What about correlated sub-queries?  I could form a sub-query that joins back to the original query based on the grouping expressions and have it contain the aggregate.

 

SELECT (

  SELECT MAX(t2.OrderID)

  FROM Orders AS t2

  WHERE ((t2.CustomerID IS NULL AND t5.CustomerID IS NULL) OR (t2.CustomerID = t5.CustomerID))

  ) AS c0

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Now that at least executes on the server. It might not be pretty and likely not efficient unless the server is really good at unscrambling my query, but it does execute and produce the correct results. So it is technically legal. But surely I can do better!

The GroupBy Operator

 

By now you are thinking, OMG, he’s going to stick us with a wacky solution like he did with nested queries!  Never fear, my friend.  Do not look at the man behind the current.  Cast your gaze ahead and all will be made clear. In my hands, I have the GroupBy operator.  Watch as it transforms into a SQL query inside the QueryBinder!

 

    protected override Expression VisitMethodCall(MethodCallExpression m) {

        if (m.Method.DeclaringType == typeof(Queryable) ||

            m.Method.DeclaringType == typeof(Enumerable)) {

            switch (m.Method.Name) {
                ...

                case "GroupBy":

                    if (m.Arguments.Count == 2) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            null,

                            null

                            );

                    }

                    else if (m.Arguments.Count == 3) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            null

                            );

                    }

                    else if (m.Arguments.Count == 4) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            (LambdaExpression)StripQuotes(m.Arguments[3])

                            );

                    }

                    break;
                ...

            }

        }

        return base.VisitMethodCall(m);

    }

 

As you can see, I’m handling three different variations of GroupBy.  The first one simply takes a single keySelector lambda (the thing to group by).  It is supposed to return a sequence of grouping’s that each have a group key value and a collection of elements that make up that group.  The second form is the same as the first except it also includes an elementSelector lambda that allows me to specify a map between an item in the source sequence and its shape of the elements in the group. The last form includes a resultSelector lambda that allows me to specify what the key and group collection turn into, instead of assuming they always form into instances of IGrouping<K,E>. With all these variations, the translation of group-by is going to get a bit involved, so bear with me.

 

Get ready. Get set. Open your eyes. Close them. Now open them again and really look. Go!

 

    protected virtual Expression BindGroupBy(Expression source, LambdaExpression keySelector, LambdaExpression elementSelector, LambdaExpression resultSelector) {

        ProjectionExpression projection = this.VisitSequence(source);

       

        this.map[keySelector.Parameters[0]] = projection.Projector;

        Expression keyExpr = this.Visit(keySelector.Body);           

 

        Expression elemExpr = projection.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = projection.Projector;

            elemExpr = this.Visit(elementSelector.Body);

        }

       

        // Use ProjectColumns to get group-by expressions from key expression

        ProjectedColumns keyProjection = this.ProjectColumns(keyExpr, projection.Source.Alias, projection.Source.Alias);

        IEnumerable<Expression> groupExprs = keyProjection.Columns.Select(c => c.Expression);

 

        // make duplicate of source query as basis of element subquery by visiting the source again

        ProjectionExpression subqueryBasis = this.VisitSequence(source);

 

        // recompute key columns for group expressions relative to subquery (need these for doing the correlation predicate)

        this.map[keySelector.Parameters[0]] = subqueryBasis.Projector;

        Expression subqueryKey = this.Visit(keySelector.Body);

       

        // use same projection trick to get group-by expressions based on subquery

        ProjectedColumns subqueryKeyPC = this.ProjectColumns(subqueryKey, subqueryBasis.Source.Alias, subqueryBasis.Source.Alias);

        IEnumerable<Expression> subqueryGroupExprs = subqueryKeyPC.Columns.Select(c => c.Expression);

        Expression subqueryCorrelation = this.BuildPredicateWithNullsEqual(subqueryGroupExprs, groupExprs);

       

        // compute element based on duplicated subquery

        Expression subqueryElemExpr = subqueryBasis.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = subqueryBasis.Projector;

            subqueryElemExpr = this.Visit(elementSelector.Body);

        }

       

        // build subquery that projects the desired element

        string elementAlias = this.GetNextAlias();

        ProjectedColumns elementPC = this.ProjectColumns(subqueryElemExpr, elementAlias, subqueryBasis.Source.Alias);

        ProjectionExpression elementSubquery =

            new ProjectionExpression(

                new SelectExpression(TypeSystem.GetSequenceType(subqueryElemExpr.Type), elementAlias, elementPC.Columns, subqueryBasis.Source, subqueryCorrelation),

                elementPC.Projector

                );

 

        string alias = this.GetNextAlias();

        // make it possible to tie aggregates back to this group-by

        GroupByInfo info = new GroupByInfo(alias, elemExpr);

        this.groupByMap.Add(elementSubquery, info);

 

        Expression resultExpr;

        if (resultSelector != null) {

            Expression saveGroupElement = this.currentGroupElement;

            this.currentGroupElement = elementSubquery;

            // compute result expression based on key & element-subquery

            this.map[resultSelector.Parameters[0]] = keyProjection.Projector;

            this.map[resultSelector.Parameters[1]] = elementSubquery;

            resultExpr = this.Visit(resultSelector.Body);

            this.currentGroupElement = saveGroupElement;

        }

        else {

            // result must be IGrouping<K,E>

            resultExpr = Expression.New(

                typeof(Grouping<,>).MakeGenericType(keyExpr.Type, subqueryElemExpr.Type).GetConstructors()[0],

                new Expression[] { keyExpr, elementSubquery }

                );

        }

 

        ProjectedColumns pc = this.ProjectColumns(resultExpr, alias, projection.Source.Alias);

 

        // make it possible to tie aggregates back to this group-by

        Expression projectedElementSubquery = ((NewExpression)pc.Projector).Arguments[1];

        this.groupByMap.Add(projectedElementSubquery, info);

 

        return new ProjectionExpression(

            new SelectExpression(TypeSystem.GetSequenceType(resultExpr.Type), alias, pc.Columns, projection.Source, null, null, groupExprs),

            pc.Projector

            );

    }

 

It starts off easy, following the same binding pattern as the other operators.  First I bind the key selector and get a key expression. Then if I have an element selector I bind that too, to get an element expression, otherwise I use the projector expression from the incoming source as the element expression. Got it?

 

Now I have to figure out what the grouping expressions are going to be. You might be thinking I already know what they are, and there’s only one of them. It’s the key expression, right? Well, not exactly. It turns out its common to need to group by more than one property or column of data. The only way to do that with LINQ is to construct an anonymous type with multiple parts. So a key might be an anonymous type with multiple interesting fields. How do I turn that into one or more expression that can be evaluated as part of a SQL group by clause?

 

It turns out I already have a handy function that does what I want. If I treat the key expression they same way I treat a projection expression, I can use the ColumnProjector to turn the expression into a list of column declarations, each with its own scalar expression.  Then I can simply throw away the column declarations and keep the expressions that define the columns and voila I have grouping expressions!  Don’t try this at home.  Err, I mean, go ahead and try this at home. It’s easy. Watch, I’ll do it again in a moment.

 

Moving along we come to some code trying to compute a ‘basis’. What’s that?  Well, it turns out SQL doesn’t really support a GroupBy operator the way it is defined in LINQ. The LINQ operator, by default, produces a sequence of groups with each group containing the elements that got designated as part of that group. SQL group-by can only produce aggregate computations over the groups, not the groups themselves.  Oh, dear, what to do, what to do?

 

The only way to solve this problem is to form a self join back to the data I want. If I take the result of the group-by and join it back to the original query (everything up until the group-by) matching group-by key values I get all the data back that SQL was not going to let me have.  That’s what the ‘subqueryBasis’ is.  It’s a rebinding of the original input sequence. Then we rebind the key again and do my ColumnProjector trick to get a new collection of group expressions. Now I can use both sets to form a predicate for a join.

 

Next, you can see that if I did specify an element selector I now bind it with respect to this new ‘basis’ and use the resulting expression as the projection coming out of this new branch of the query tree.

 

Alas, I am not yet done.  What about that third argument?  If it’s specified then I have another lambda I have to do something with and if it is not specified then I need to figure out how I’m going to return an IGrouping<K,E>.

 

Okay, that doesn’t turn out to be too difficult.  I created a Grouping<K,E> class that I can use in case I need to actually return this stuff all the way back to the calling code.

 

    public class Grouping<TKey, TElement> : IGrouping<TKey, TElement> {

        TKey key;

        IEnumerable<TElement> group;

 

        public Grouping(TKey key, IEnumerable<TElement> group) {

            this.key = key;

            this.group = group;

        }

 

        public TKey Key {

            get { return this.key; }

        }

 

        public IEnumerator<TElement> GetEnumerator() {

            return this.group.GetEnumerator();

        }

 

        IEnumerator IEnumerable.GetEnumerator() {