How to use Spotless Data?

Session Rules

About Session Rules

Session rules clean up gaps and overlaps in time series session data. Time series session data is any data set consisting of a sequence of events, each with a defined start and stop. Common examples include:

  • User behaviours - web browsing, TV viewing, server logins
  • Schedules of events - for example TV EPG schedules

You can create a new session rule here, with the following basic options:

  • field_type - this specifies whether the field that the rule is linked to in the plan is the start or the stop of the session. This is the field that is moved when a session is truncated or extended.
  • other_field - this specifies the other field of the start/stop pair. For example, if the main field is the start, then other_field should point to the field containing the stop data.
  • key_field - the field used to determine whether the same entity is logging a session. For user behaviours, the key field would typically be an identifier of the user; for TV schedules it would be the channel name.

You then need to specify how you want Spotless to deal with any gaps or overlaps in the data:

  • overlaps_option
    • if this is set to ignore then overlaps are allowed in the data and Spotless does not correct them
    • if this is set to truncate then the overlapping session is truncated by moving the field selected in the plan
  • gaps_option
    • if this is set to ignore then gaps are allowed in the data and Spotless does not correct them
    • if this is set to extend then the session is extended by moving the field selected in the plan
    • if this is set to insert_new then a new record is inserted to fill the gap, with the same key as the original record. You can specify a template for the inserted record in the field template_for_new
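
The options above can be modelled on a single pair of consecutive sessions that share a key. The following is a minimal sketch, not Spotless's actual implementation: the function name resolve_pair and the dict layout are hypothetical, and the behaviour when field_type is start is an assumption inferred from the option descriptions.

```python
from datetime import datetime


def resolve_pair(prev, nxt, field_type, overlaps_option="ignore", gaps_option="ignore"):
    """Return the sessions that replace two consecutive sessions of one key.

    prev and nxt are dicts with 'start' and 'stop' datetimes; prev starts first.
    Illustrative model only (hypothetical helper, not part of Spotless).
    """
    prev, nxt = dict(prev), dict(nxt)  # work on copies, leave the inputs untouched

    # Which record and field get moved depends on the field the rule is linked to.
    # Assumption: a 'stop' rule moves the earlier session's stop, and a 'start'
    # rule moves the later session's start.
    if field_type == "stop":
        record, field, new_value = prev, "stop", nxt["start"]
    else:
        record, field, new_value = nxt, "start", prev["stop"]

    if prev["stop"] > nxt["start"]:  # overlap
        if overlaps_option == "truncate":
            record[field] = new_value
    elif prev["stop"] < nxt["start"]:  # gap
        if gaps_option == "extend":
            record[field] = new_value
        elif gaps_option == "insert_new":
            filler = {"start": prev["stop"], "stop": nxt["start"]}
            return [prev, filler, nxt]
    return [prev, nxt]
```

With both options left at ignore, the pair is returned unchanged.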

Examples:

Suppose you are cleaning the following EPG data:

Channel, Show, Start, Stop

HBO, Game of Thrones, 2016-12-01 13:00, 2016-12-01 14:00

HBO, Westworld, 2016-12-01 15:00, 2016-12-01 16:05

HBO, Game of Thrones, 2016-12-01 16:00, 2016-12-01 17:00

HBO, Game of Thrones, 2016-12-01 17:00, 2016-12-01 18:00

There are two problems with this file:

  1. There is a gap between the 1pm showing of Game of Thrones and the 3pm showing of Westworld. We want to insert a dummy record to fill that gap.
  2. There is an overlap between the 3pm showing of Westworld and the 4pm showing of Game of Thrones. We want to truncate that record.
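
The two problems can be found mechanically by comparing each record's stop time with the next record's start time for the same channel. This is a minimal sketch for illustration only: find_problems and its parameters are hypothetical names, and the column positions are assumed from the sample data above.

```python
import csv
from datetime import datetime
from io import StringIO

EPG = """\
HBO, Game of Thrones, 2016-12-01 13:00, 2016-12-01 14:00
HBO, Westworld, 2016-12-01 15:00, 2016-12-01 16:05
HBO, Game of Thrones, 2016-12-01 16:00, 2016-12-01 17:00
HBO, Game of Thrones, 2016-12-01 17:00, 2016-12-01 18:00
"""


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def find_problems(text, key_col=0, start_col=2, stop_col=3):
    """List (kind, begin, end) for every gap or overlap between consecutive sessions."""
    rows = [r for r in csv.reader(StringIO(text), skipinitialspace=True) if r]
    # Order sessions by key, then by start time, so neighbours are consecutive.
    rows.sort(key=lambda r: (r[key_col], parse(r[start_col])))
    problems = []
    for prev, nxt in zip(rows, rows[1:]):
        if prev[key_col] != nxt[key_col]:
            continue  # different entity, so the two sessions are unrelated
        prev_stop, next_start = parse(prev[stop_col]), parse(nxt[start_col])
        if prev_stop < next_start:
            problems.append(("gap", prev_stop, next_start))
        elif prev_stop > next_start:
            problems.append(("overlap", next_start, prev_stop))
    return problems


problems = find_problems(EPG)
for kind, begin, end in problems:
    print(kind, begin, end)
```

For the sample data this reports one gap from 14:00 to 15:00 and one overlap from 16:00 to 16:05.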

Assuming we want to truncate overlaps by moving the stop time, we would set up the rule to use the stop time and specify the start time as the “other field” in the rule.

Our rule is therefore set up as follows:

  • key_field: 0 - the first column, containing the channel
  • field_type: stop - we want to move the stop time
  • other_field: 2 - the third column, containing the start time
  • overlaps_option: truncate - we want to truncate any overlapping sessions by moving the stop time
  • gaps_option: insert_new - we want to insert a new dummy record
  • template_for_new: channel, unknown program, start, stop - this is the template giving default values for each field. When the template is applied, Spotless replaces the key field, start, and stop values with actual data

When we run this, Spotless will produce the following results. The second record is newly inserted, and the stop time of the Westworld record has been moved from 16:05 to 16:00:

HBO, Game of Thrones, 2016-12-01 13:00, 2016-12-01 14:00

HBO, Unknown program, 2016-12-01 14:00, 2016-12-01 15:00

HBO, Westworld, 2016-12-01 15:00, 2016-12-01 16:00

HBO, Game of Thrones, 2016-12-01 16:00, 2016-12-01 17:00

HBO, Game of Thrones, 2016-12-01 17:00, 2016-12-01 18:00
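
The results above can be reproduced with a short script. This is a sketch under stated assumptions rather than Spotless's own code: RULE is a hypothetical in-memory form of the rule whose keys mirror the option names, apply_rule is a hypothetical helper that only handles a rule linked to the stop field, and the template text is capitalised to match the output shown.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical representation of the rule; not Spotless's API.
RULE = {
    "key_field": 0,
    "field_type": "stop",
    "other_field": 2,
    "overlaps_option": "truncate",
    "gaps_option": "insert_new",
    "template_for_new": ["channel", "Unknown program", "start", "stop"],
}
STOP_COL = 3  # the column the rule is linked to in the plan (assumed)

ROWS = [
    ["HBO", "Game of Thrones", "2016-12-01 13:00", "2016-12-01 14:00"],
    ["HBO", "Westworld", "2016-12-01 15:00", "2016-12-01 16:05"],
    ["HBO", "Game of Thrones", "2016-12-01 16:00", "2016-12-01 17:00"],
    ["HBO", "Game of Thrones", "2016-12-01 17:00", "2016-12-01 18:00"],
]


def apply_rule(rows, rule, stop_col):
    """Truncate overlaps and fill gaps for a rule linked to the stop field."""
    key_col, start_col = rule["key_field"], rule["other_field"]
    ordered = sorted(
        (list(r) for r in rows),  # copy rows so the input is not modified
        key=lambda r: (r[key_col], datetime.strptime(r[start_col], FMT)),
    )
    cleaned = []
    for row in ordered:
        if cleaned and cleaned[-1][key_col] == row[key_col]:
            prev = cleaned[-1]
            prev_stop = datetime.strptime(prev[stop_col], FMT)
            start = datetime.strptime(row[start_col], FMT)
            if prev_stop > start and rule["overlaps_option"] == "truncate":
                prev[stop_col] = row[start_col]  # move the stop back to the next start
            elif prev_stop < start and rule["gaps_option"] == "insert_new":
                filler = list(rule["template_for_new"])
                filler[key_col] = row[key_col]
                filler[start_col] = prev[stop_col]
                filler[stop_col] = row[start_col]
                cleaned.append(filler)
        cleaned.append(row)
    return cleaned


result = apply_rule(ROWS, RULE, STOP_COL)
for row in result:
    print(", ".join(row))
```

The sketch does not handle a session that lies entirely inside an earlier one; truncating in that case would leave a stop time before its start.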