9. Operators as Legos – building up Outer Join

So far, we have followed the top-down, four step query writing methodology to arrive at the LINQ query. If you are able to complement this top-down approach with a bottom-up introspection of the query in Step 3, this will greatly improve your ability to write efficient queries. In this section, we will use this bottom-up approach to constructing a query to build up an Outer Join operation. First, let's understand how a StreamInsight query works.

image

StreamInsight enriches LINQ with temporal query capability through LINQ extension methods. On the client, the C# compiler runs HelloToll.cs through a StreamInsight LINQ Provider, which translates a LINQ query into query template, query binder, and query definition XML documents. The StreamInsight engine understands only this XML-based representations of the query– it generates the execution plan, loads and links the required assemblies that constitute a running query, and executes the query.

In its compiled form, a StreamInsight query is an event flow graph made up of operators and streams. Each Operator is a stateful or stateless machine that performs a specific operation on each event arriving via one or more input streams, generates new events as a result of this operation, and outputs the event via one or more of its output streams. Streams are efficient in-memory couriers of events from one operator to another.

For any given query, you can view this query plan using the StreamInsight Debugger. In the simplest use case, start a query trace, run the query, stop the trace, and then view the trace file in the debugger (we will not digress into Debugger setup here – please consult the documentation). Let's consider the query [TopK] for this exercise.

image

Figure 20 Query Graph of [TopK] as shown in the Event Flow Debugger

c:\Program Files\Microsoft StreamInsight 2.1\Tools>trace start c:\q9-trace.etl
The command completed successfully.

Click here to view code as image

Now execute the query from Visual Studio. Once the query completes…

c:\Program Files\Microsoft StreamInsight 2.1\Tools>trace stop
The command completed successfully.

Click here to view code as image

Next, start the debugger from the Start menu, and open the file c:\q9-trace.etl. The resulting query graph is shown in Figure 11. The plan shows events flowing from top to bottom through an operator chain IMPORT -> CLEANSEINPUT -> ALTERLIFETIME -> GROUPING (which contains an AGGREGATE and PROJECT) -> PROJECT -> TOP-K -> PROJECT -> CLEANSE. We have looked at all of the LINQ level constructs that correspond to these operators, except for CLEANSEINPUT and CLEANSE. Note that there are other powerful uses for the debugger – we will not digress into those capabilities now.

Now, imagine generating such a plan manually, using just the higher level constructs we have learned so far, to build either other operators or complete queries. This is the bottom-up approach to query building that we referred to earlier. The repertoire of primitive operators in StreamInsight is not very large, and you do not have to know all of them. What is important is the process of trying to compose a query by working bottom up to construct a query using a flow diagram substituting for the actual query plan.

In the earlier section on Joins, we left out one important join type for later: Outer Join. In this section, we will build this operator bottom up by combining other primitive operators such as inner join, multicast, and project - we can put them together like so many Lego pieces to compose the outer join query.

An Outer join derives from inner join, and has the following characteristics.

• The left outer join results in an output event that contains all the fields of the payload from the left stream, with matching values from the fields in the right stream, and NULL values for the fields where no matches are found.

• The right outer join contains all fields of the payload from the right stream, and matching values from the left stream, and NULL if no matches are found.

• The full outer join results in events that contain all the fields from payloads of both the streams, with NULL values for non-matching fields.

Let's consider a left outer join.

Step 1 Model input and output events from the application’s perspective.

We will use interval events for this example.

Step 2 Understand the desired query semantics by constructing sample input and output event tables.

We will first consider the payload values to illustrate the outer join. From a temporal perspective, the semantics are that of an inner join. Assume we have two streams with payloads shown side by side below.

Start Time

End Time

License Plate

Make

Model

 

Start Time

End Time

License Plate

Toll

TagId

12:01

12:03

XYZ 1003

Toyota

Camry

 

12:03

12:08

BAC 1005

5.50

567891234

12:02

12:04

XYZ 1001

Toyota

Camry

 

12:05

12:07

ABC 1004

5.00

456789123

12:03

12:07

ZYX 1002

Honda

Accord

 

 

 

 

 

 

12:03

12:08

BAC 1005

Toyota

Camry

 

 

 

 

 

 

12:05

12:07

ABC 1004

Ford

Taurus

 

 

 

 

 

 

12:07

12:09

NJB 1006

Ford

Taurus

 

 

 

 

 

 

Click here to view table as image

The result from a left outer join of the above two streams on equality of License Plates, and based on an intrinsic comparison of the timestamps done by the inner join semantics, is shown below.

Start Time

End Time

License Plate

Make

Model

Toll

TagId

12:01

12:03

XYZ 1003

Toyota

Camry

NULL

NULL

12:02

12:04

YXZ 1001

Toyota

Camry

NULL

NULL

12:03

12:07

ZYX 1002

Honda

Accord

NULL

NULL

12:03

12:08

BAC 1005

Toyota

Camry

5.00

456789123

12:05

12:07

ABC 1004

Ford

Taurus

5.50

567891234

12:07

12:09

NJB 1006

Ford

Taurus

NULL

NULL

Click here to view table as image

Step 3 Consider the different elements of query logic required to compute the output above.

Given that this operator is not available out of the box, we will emulate left outer join behavior with the primitives at hand.

• An inner join between the two streams will give us the matching rows from the two streams

• A Left Anti Join between the two streams will give all the rows in left stream that do NOT have matching rows in the right stream. Since the left stream is the reference stream, this will give us the payload fields (License Plate, Make, Model). Now we need to add the NULL columns to this payload – this is done very easily using a Project operation on this output.

• Combine the above two streams with a Union operator to get the output we want.

This is shown graphically in Figure 20 below.

image

Figure 20. Outer Join composed from primitives

Step 4 Compose the query as a streaming transformation of input to output.

We will use the same input stream to generate the left and right streams using suitable projections. See how the query matches the graph provided above.

// Simulate the left stream input from inputStream
var outerJoin_L = from e in inputStream
                  select new
                  {
                      LicensePlate = e.LicensePlate,
                      Make = e.Make,
                      Model = e.Model,
                  };
// Simulate the right stream input from inputStream – eliminate all events with Toyota as the vehicle
// These should be the rows in the outer joined result with NULL values for Toll and LicensePlate
var outerJoin_R = from e in inputStream
                  where e.Make != "Toyota"
                  select new
                  {
                      LicensePlate = e.LicensePlate,
                      Toll = e.Toll,
                      TollId = e.TollId,
                  };
// Inner join the two simulated input streams
var innerJoin = from left in outerJoin_L
                from right in outerJoin_R
                where left.LicensePlate == right.LicensePlate
                select new TollOuterJoin
                {
                    LicensePlate = left.LicensePlate,
                    Make = left.Make,
                    Model = left.Model,
                    Toll = right.Toll,
                    TollId = right.TollId,
                };
// Left anti join the two input simulated streams, and add the Project
var leftAntiJoin = outerJoin_L
    .LeftAntiJoin(outerJoin_R, left => left.LicensePlate, right => right.LicensePlate)
    .Select(left => new TollOuterJoin
        {
            LicensePlate = left.LicensePlate,
            Make = left.Make,
            Model = left.Model,
            Toll = null,
            TollId = null
        };
// Union the two streams to complete a Left Outer Join operation
var query = innerJoin.Union(leftAntiJoin);

Click here to view code as image

Such graphing of event flows, followed by query composition in LINQ, is the bread and butter of StreamInsight query development.