A common problem in web analytics is sessionization or session reconstruction: taking a series of user events and splitting the series into a set of sessions [1]. Often, it is done based solely on the timestamps of the events: a session ends after a certain time of user inactivity. As an example, let’s take the following click stream:
userId | page | timestamp |
0001 | “index.html” | 2015-05-10 20:13:13 |
0001 | “search.html” | 2015-05-10 20:13:20 |
0001 | “search.html” | 2015-05-10 20:19:01 |
0001 | “checkout.html” | 2015-05-10 20:25:20 |
0002 | “index.html” | 2015-05-10 21:03:31 |
0002 | “search.html” | 2015-05-10 21:04:20 |
0001 | “index.html” | 2015-05-10 24:25:25 |
0001 | “search.html” | 2015-05-10 24:25:30 |
If we assume that a session ends after at least 15 minutes of user inactivity, the sessionized version of this data would look like this:
userId | session_begin | session_end |
0001 | 2015-05-10 20:13:13 | 2015-05-10 20:25:20 |
0002 | 2015-05-10 21:03:31 | 2015-05-10 21:04:20 |
0001 | 2015-05-10 24:25:25 | 2015-05-10 24:25:30 |
So how would you solve it in SQL? The solution to a similar problem, Packing date and time intervals, was posted here. Packing of intervals means that you want to merge all intervals that overlap into one contiguous interval problem. We can transform our original sessionization problem into interval-packing problem by turning every click event into a 15-minute interval. If two such intervals overlap, it means that the original click events were less than 15 minutes apart. So packing such intervals, and then substracting 15 minutes from each row in the result, will give us the sessions. Here is an SQL query that does that:
WITH ExtendedClicks AS ( SELECT userId, timestamp as starttime, DATEADD(MINUTE, 15, timestamp) as endtime FROM Clicks ), C1 AS ( SELECT userId, starttime AS ts, +1 AS type, NULL AS e, ROW_NUMBER() OVER(PARTITION BY userId ORDER BY starttime) AS s FROM ExtendedClicks UNION ALL SELECT userId, endtime AS ts, -1 AS type, ROW_NUMBER() OVER(PARTITION BY userId ORDER BY endtime) AS e, NULL AS s FROM ExtendedClicks ), C2 AS ( SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY userId ORDER BY ts, type DESC) AS se FROM C1 ), C3 AS ( SELECT userId, ts, FLOOR((ROW_NUMBER() OVER(PARTITION BY userId ORDER BY ts) - 1) / 2 + 1) AS grpnum FROM C2 WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0 ), ExtendedSessions AS ( SELECT userId, MIN(ts) AS starttime, MAX(ts) AS endtime FROM C3 GROUP BY userId, grpnum; ) SELECT userID, starttime, DATEADD(MINUTE, -15, endtime) as endtime FROM ExtendedSessions
In Rax, we introduced the temporal union operator, @\/
which solves the interval-packing problem in one line of code. The sessionization problem in Rax:
{[#:userId, $:page, |:interval]}: ExtendedClicks := project [.userId, .page, (|)[.timestamp, .timestamp + (^)"PT15M"]] Clicks; {[#:userId, |:interval]}: ExtendedSessions := project [.userId, .#1] @\/:[.userId] ExtendedClicks; {[#:userId, |:interval]}: Sessions := project [.userId, (|)[.interval.begin, .interval.end - (^)"PT15M"]] ExtendedSessions;
Which code do you prefer? Why?