3. More Windows – Hop, Skip and Jump

image

A tumbling window has the advantage of simplicity – it splits time into buckets of equal width over which aggregates can be easily computed, and it is easy to reason about the events “belonging” to a particular bucket. But it is good to be aware of some of the limitations of this windowing construct.

There is an inherent latency in reporting of the results, equal to the size of the window. In the above example, assume that a burst of 50 vehicles arrives at the tolls at 12:09. This indicator of impending congestion will be reported at time 12:12 – three minutes later.

To track the movement of vehicles at a finer granularity of time, we can use a generalized version of tumbling window, called Hopping Window. It is a fixed sized window which hops along the timeline with a particular duration. The tumbling window is a special case of a hopping window where the hop size is equal to the window size.

Let's recast the first query in the HelloToll.cs example, [Tumbling Count], to use hopping windows instead. We will incorporate some precision into the problem statement (emphasized) based on what we have learned from Step 2 of the query writing process above. Rather than show the result every 3 minutes, we will show the result every minute, but still compute the count over a 3 minute window. So the new problem statement becomes:

[Hopping Count] Report the count of vehicles being processed at some time over a 3 minute window, with the window moving in one minute hops. Provide the counts as of the last reported result as of a point in time, reflecting the vehicles processed over the last 3 minutes.

Let's walk through the query development process again.

Step 1 Model input and output events from the application’s perspective. No change from [Tumbling Count].

Step 2 Understand the desired query semantics by constructing sample input and output CHTs.

The sample input events are the same as for [Tumbling Count]. The time plot for this input is shown below. The green lines above the time axis show the progression of a hopping window of size 3 minutes, with a hop size of 1 minute. For convenience, each window is annotated with the output event name associated with the window. The annotation is placed right above the window start.

For the output, we want the query to determine the output as of a point in time, showing the events processed over the last 3 minutes.

We will specify this interval-to-point transformation in the query logic.

The output events for this sample input are shown in the bottom track of the time plot. Just like in previous example, the output events are annotated with a set of events over which this output is computed.

image

Figure 7. Time Plot for Hopping Window of size 3 and hop size 1 [Hopping Count]

The result can be reasoned about in the same terms as we did for the tumbling window. Let's reiterate a few observations: There are no events produced for windows ending at 12:17 to 12:20, since there are no input events overlapping those and we do not generate events for 0 counts. Just like in the case of tumbling windows, e1 and e2 are not included in the window starting at 12:03, since intervals are open at the end.

Step 3 Gather elements of query logic to compute the output and develop an event flow graph.

The new elements in the query compared to [Tumbling Count] are:

- Window – we choose a hopping window of size 3 minutes, with a hop size of 1 minute – as per requirement.

- Interval to Point Event Conversion – Recall that the query engine always deals with interval events. We are required to report the output as of a point in time at the end of a window.

image

Figure 8. Query Graph for the [Hopping Count] query

Step 4 Compose the query as a streaming transformation of input to output.

The resulting query is an example of compile time query composition in StreamInsight. First, countStream is composed over events provided by inputStream – it defines a Count() computation over hopping windows of size of 3 minutes, and a hop size of 1 minute. Next, the result is composed over countStream by applying ToPointEventStream which converts each event in the stream from an interval to a point event. This has the effect of truncating the EndTime of the output event to StartTime + ε (one chronon).

var countStream = from win in inputStream.HoppingWindow(TimeSpan.FromMinutes(3), TimeSpan.FromMinutes(1)
                  select win.Count();
var query = countStream.ToPointEventStream();

Click here to view code as image

ToPointEventStream is syntactic sugar on top of an operator called AlterEventDuration, which is based on a more general operator called AlterEventLifeTime. We will see these operators in the upcoming examples.

image

Figure 9. Output from [Hopping Count]

The output from the complete query in HelloToll.cs is shown in Figure 9.

For now, it is clear that a hopping window finds use in scenarios where you need to segment time into equal size windows, determine the membership of events in these windows, and output the result in granules of time (hops) that are less than the window size. It should also be clear that a tumbling window is a special case of hopping window – where the window size and the hop size are the same.

Hopping windows have the advantage of predictable memory consumption over the stream’s lifetime, given that output is released at a fixed frequency. Hopping windows are the principal mechanism in the StreamInsight release for time-weighted computations (discussed later). HoppingWindow supports more overloads to specify alignment of the window to a reference timeline, and input and output policies for treating event membership in the windows.

On the flip side, hopping windows, like tumbling windows, are chatty in situations where the window or event intervals are long, and/or when events arrive in a sporadic manner. This is because it is the window’s properties – rather than that of the incoming event – that controls the output behavior. For example, in the output above, {o5, o6} and {o12, o13, o14} output the occurrence of the same event. A smaller hop size only exacerbates the problem.

More importantly, at high event rates, hopping windows defined for aggregations over groups of events can tend to be expensive in terms of state maintenance and computation time. In StreamInsight, only the built-in aggregates such as Sum, Count, and Average are incremental in nature. But any custom aggregation (via a feature called UDA – user defined aggregate) defined by you could be expensive, by virtue of having to repeat the computation over the same set of events that have already been seen in a previous window. Future support for incremental aggregates will remove this limitation – but for StreamInsight, it is good to be aware of this constraint.