A summary of what I’ve learnt over the past four years setting up tracking for a web product, and some thoughts to consider when planning your own tracking framework.
It’s been a while since my last post, mostly because so much has happened since then, and partly because I haven’t been as disciplined as I would have liked. I started on my final major web analytics project with the company, revamping the tracking framework, while expanding the team and redesigning our analytics framework. All that while trying to finally get machine learning projects off the ground. To top it off, I decided to make my move after four years with the company. So here I am, one month into a new role in public healthcare analytics.
Disclaimer: I’m not an expert in the area of web tracking. I’m just here to share my limited experience with it over the last four years and what I think might work, at least in the near future. I hope these are helpful ideas for anyone planning to create a web/app product tracking framework. I’ve described most of it in broad terms, so let me know if you’d like more detail on any section. There may be other related details not mentioned, either because they were not relevant in our case or because I have overlooked them. Do also let me know about issues I may have missed, or better solutions you have come across!
So you’ve got a website/app up and running. You have big dreams for it. You know (in rough or exact terms) what success looks like. You’ve set the metrics to target. What’s next? You’ll need some way to track activity happening on your product in order to calculate those metrics. In many situations Google Analytics (free version) works fine. But occasionally you may need something more granular, customizable or private. There are a couple of paid options that give you more rein over your tracking, though these usually cost a fair bit.
Then there’s always the option of building tracking in-house. This gives you full granularity and control over how on-site activity is tracked. You no longer need to depend on a black box that is difficult or impossible to debug. As a hobby project this can be fun, but in production you might not want to adopt this option unless there are sufficient resources to maintain it. There were many things to consider, but we finally decided on keeping the capability in house, relying on Google Analytics as a backup. I won’t be going into the decision process/considerations here.
How things were
Running into issues
Wanting to be as comprehensive as possible, a huge number of fields were added to the schema. This bloated each postback and reduced the success rate of postbacks actually reaching the server. In many cases this was not easy to detect: on the client (device), data could be lost if there were network issues or if the user navigated away.
Next, once the data reaches the server it passes through several hops before reaching the data lake. At each of these nodes data could drop off as well, especially when maintaining a high throughput. I won’t go into how that was fixed, as it was more of an engineering issue that I did not have experience in; I think the engineers did superbly in addressing it over time. But broadly speaking, a publisher/subscriber model with acknowledgements at each handoff point helped minimize data loss through the pipeline significantly.
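The acknowledgement idea can be sketched in a few lines. This is a toy illustration, not the actual pipeline: a stage only drops its copy of a message after the next stage confirms receipt, so a failure mid-handoff means redelivery rather than loss.

```python
# Toy sketch of acknowledged handoffs between pipeline stages.
# A stage 'acks' (discards its copy) only after the next stage
# accepts the message; on failure, the message is requeued.
import queue

def relay(inbox: queue.Queue, deliver) -> int:
    """Move messages from inbox to `deliver`, acking only on success."""
    delivered = 0
    while not inbox.empty():
        msg = inbox.get()
        try:
            deliver(msg)      # hand off to the next stage
            delivered += 1    # ack: safe to drop our copy
        except Exception:
            inbox.put(msg)    # nack: requeue for a retry
            break
    return delivered
```

Real systems get this behaviour from their message broker's acknowledgement settings rather than hand-rolled loops, but the principle is the same.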
We knew data loss was happening, but it was only later, when we experimented with parallel tracking mechanisms, that we were able to quantify the amount of data being dropped.
Ballooning number of fields
So say there was a field ‘A’ that is applicable to a group of events. We could not rely on it being sent once, as it might be dropped (due to the above), so we added it to all applicable events. Multiply that a couple of times for different fields and events, and the data size to be transferred over the network compounds, worsening the issue of failing calls. This also caused problems for front-end engineers: it became onerous for them to add tracking to new pages, given the huge number of fields and configurations required.
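To make the duplication concrete, here is an illustrative (entirely made-up) pair of event payloads where the same session and device details ride along on every on-page event:

```python
# Illustrative only: the same session/device details repeated on every
# on-page event, compounding payload size. All field names are made up.
event_click = {
    "event": "click", "target": "buy_button",
    "session_id": "s-123", "locale": "en", "campaign": "spring",
    "device": "iPhone", "os": "iOS 16", "app_version": "4.2.0",
    # ...plus dozens more fields duplicated from the session/pageview
}
event_scroll = {
    "event": "scroll", "depth": 0.8,
    "session_id": "s-123", "locale": "en", "campaign": "spring",
    "device": "iPhone", "os": "iOS 16", "app_version": "4.2.0",
}
```

Only the first two keys of each payload are actually specific to the event; everything else is repeated context.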
User privacy management
Finally, in order to understand our users, user attributes had to be added somewhere. Cue the previous problem, and now you have the same user details across all events. To comply with upcoming privacy laws we had to be able to scrub them out, which was a major headache. Going through every event a record at a time, then editing the logs, was not feasible. Deleting records was also problematic, as it would make the data inconsistent across time periods. There needed to be a way to de-associate a user from an event without compromising the integrity of the data.
Reevaluating assumptions, creating solutions
Can the event be tracked some other way?
While server-side calls may also fail, they suffer much less from the network issues that devices face, such as changing networks or going beyond reception range. Whether the user has closed the window or app also does not affect the sending of data, if done on the server.
Many events involve interaction with the server to deliver content or responses. Through this, the server is already aware the event occurred, so the client does not need to fire an additional postback to track it. Thus the preferred way to track events is server-side, wherever possible.
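A minimal sketch of the idea, with hypothetical names throughout: since the server is already handling the request that delivers the content, it can log the event in the same breath, and no client postback is needed.

```python
# Sketch of server-side tracking: log the event while serving the
# request that caused it. `log_event`, `serve_article` and all field
# names are hypothetical, for illustration only.
import json
import time

def log_event(sink: list, name: str, **fields) -> None:
    """Append one tracking record to the event sink (here, a list)."""
    sink.append(json.dumps({"event": name, "ts": time.time(), **fields}))

def serve_article(article_id: str, sink: list) -> str:
    content = f"<article id='{article_id}'>...</article>"  # render response
    # The server knows this event happened; no client postback required.
    log_event(sink, "article_served", article_id=article_id)
    return content
```

In a real service the sink would be the message queue feeding the data lake, not an in-memory list.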
Limitations of server-side tracking
However, some actions taken by users on the site may not involve any interaction with the server. You may also want to track not just that an event happened, but when it happened from the user’s perspective: the server may generate the response at time X, but it may only be fully rendered at time Y. To capture these intricacies, client-side tracking is required.
Client-side, pixel-based tracking
In some cases the server can tell the event happened, but doesn’t know if and when the rendering happened for the user. These small details don’t require a huge JSON object to be sent over the network, just the couple of details missing from the server-side tracking. Pixel-based tracking, where parameters are appended to a very light request, can be used in such cases. Its completeness is generally much better than postback-based tracking.
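A pixel request is just a fetch of a tiny image with the missing details appended as query parameters. Here is a sketch of building such a URL; the endpoint and parameter names are assumptions for illustration:

```python
# Sketch of a pixel-tracking URL: the client requests a tiny image and
# the few details the server lacks ride along as query parameters.
# The endpoint and parameter names are hypothetical.
from urllib.parse import urlencode

def pixel_url(event_id: str, rendered_at_ms: int) -> str:
    params = urlencode({"eid": event_id, "t": rendered_at_ms})
    return f"https://tracking.example.com/pixel.gif?{params}"
```

Because the payload is only a handful of bytes in the URL, the request is far more likely to complete than a bulky JSON postback.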
Client-side, postback-based tracking
Some events happen without the server having a clue (no server interaction required). These events will probably become more common as browsers grow more powerful and can handle more on their own: things such as changing of layout (filtering/sorting) and, increasingly common nowadays, machine learning predictions done on the client. Rendering of cached content may also leave the server out of the loop. Finally, when more details need to be sent, pixel-based tracking may not be appropriate. For these reasons, client-side, postback-based tracking still forms a significant part of event tracking in the framework. That did not mean we had to keep putting up with the old issues.
Must I send every detail every time?
How then can we improve on it? Earlier I mentioned that the same details were sent multiple times across different events. This was due to the likelihood that the data could be dropped, leaving orphaned subsequent data without the crucial details. If we could address data leakage through the pipeline, then we could rely on sending each detail once.
For one, some of the technologies the software engineers had experimented with had reduced loss of data through the pipeline significantly (mentioned briefly above). The remaining major issue was the object being too big to send over a fragile network.
So it became a circular problem: the postback fails, so I must send all details in subsequent events; this makes the subsequent postbacks more likely to fail due to the ever-increasing postback size. Breaking out of this loop required a leap of faith.
A little on the new analytics framework: there are clients, sessions, pageviews, and on-page events, in that hierarchy. On-page events happen on a pageview. Those details relevant to the pageview should be sent on the pageview and not every on-page event. Similarly, many pageviews belong to a session, and the session details should not reside on the pageview. The session should be sent once with the details. If analysis of the pageview requires session details it can be joined later on. By having a clearly defined schema for tracking, together with improved technology and moving some tracking off the client, it became possible to significantly reduce the number of fields in each postback. This went from over a hundred fields to around 10, while maintaining the eventual comprehensiveness of data.
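The hierarchy above can be sketched with a few toy tables: session details are sent once, pageview details once, and on-page events stay tiny, with full context recovered by joins at analysis time. Table and field names here are illustrative, not the actual schema.

```python
# Toy illustration of the client > session > pageview > event hierarchy.
# Each level carries only its own details; joins recover full context.
sessions = [{"session_id": "s1", "locale": "en", "device": "iPhone"}]
pageviews = [{"pageview_id": "p1", "session_id": "s1", "url": "/home"}]
events = [{"event_id": "e1", "pageview_id": "p1", "action": "click"}]

def enrich_event(event: dict) -> dict:
    """Recover full context for one on-page event by joining upwards."""
    pv = next(p for p in pageviews if p["pageview_id"] == event["pageview_id"])
    sess = next(s for s in sessions if s["session_id"] == pv["session_id"])
    return {**sess, **pv, **event}
```

In practice these joins happen in the data warehouse on much larger tables, but the shape of the query is the same: events join to their pageview, pageviews join to their session.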
Must the client send this data?
Could we further lighten the load on the client? It turned out there was still some room to reduce the amount of data the client sends while maintaining the comprehensiveness of the data. For example, some processing was being done on the client to set fields that did not affect the user experience but were required as part of event tracking for analysis, such as parsing of device details.
The process of obtaining these details was handed over to the server. Each incoming postback would be ‘enriched’ by the server to maintain the level of detail. Basically, details that can be determined by the server should be determined by the server, rather than appended on the client, similar to the earlier point.
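As a sketch of what server-side enrichment might look like: the server derives device details from the request headers it already receives, so the client never has to compute or send them. The crude user-agent check and field names are purely illustrative.

```python
# Sketch of server-side enrichment: derive details from request
# headers instead of having the client parse and send them.
# The naive user-agent heuristic and field names are assumptions.
def enrich_postback(postback: dict, headers: dict) -> dict:
    ua = headers.get("User-Agent", "")
    enriched = dict(postback)  # don't mutate the original record
    enriched["device_family"] = "mobile" if "Mobile" in ua else "desktop"
    enriched["forwarded_for"] = headers.get("X-Forwarded-For", "unknown")
    return enriched
```

A production system would use a proper user-agent parsing library rather than substring checks, but the division of labour is the point: the postback arrives small, and the server fills in the rest.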
Must users’ details be on the event for us to understand our users?
This is a somewhat different point from the above. What we wanted was to adhere to upcoming policy changes and protect our users’ personal data, while making sure we could do so without incurring a huge long-term cost or headache.
What we came up with was a user hash. All events, instead of having a user id/identity tagged onto them, carry only a user hash. This can be used to identify the user, if required, via a users table. The users table is the only place users’ personal data is stored, with its access strictly guarded. For analytics purposes, the user hash lets us know when a group of events came from the same user, without knowing who exactly it is. For the approved use cases where the identity is required, it can be derived from the users table.
How a user hash can maintain both user privacy and data integrity
How is this different from just using the user’s id? The main consideration is what happens when a user asks to be deleted, or to have his/her history forgotten. If we used the user id on the events table, removing the id would leave us unable to tell that events A, B and C were done by the same, non-identified person. Furthermore, reprocessing the entire events table whenever users want to be removed/forgotten is computationally expensive, and deleting entire records would make the data inconsistent for analysis.
With the user hash, the events tables do not have to be modified whenever a user wishes to be deleted or forgotten. When a user is deleted from the users table, the previously associated hash will no longer be join-able to any user. If a user wishes to have his/her history forgotten, a new user hash is generated for the user. This results in earlier events not being join-able to the users table and thus not identifiable, while we can still tell that those earlier events (before the regeneration) were done by the same person. No reprocessing of the events table is required for us to ‘forget’ a user.
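One way the scheme could be sketched is with a per-user salt: events carry only the hash, the users table maps identity to salt, and rotating the salt ‘forgets’ the history without touching the events table. This is my own illustration of the idea, not the actual implementation; all names are assumptions.

```python
# Sketch of the user-hash idea with a per-user salt. Rotating the salt
# makes old event hashes un-join-able to the users table, 'forgetting'
# a user's history without reprocessing events. All names hypothetical.
import hashlib
import secrets

users: dict = {}  # user_id -> {"salt": ..., "email": ...}

def register(user_id: str, email: str) -> None:
    users[user_id] = {"salt": secrets.token_hex(8), "email": email}

def user_hash(user_id: str) -> str:
    """Hash stamped onto events instead of the raw user id."""
    salt = users[user_id]["salt"]
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()

def forget_history(user_id: str) -> None:
    # New salt => new hash. Old events still cluster as one anonymous
    # person, but no longer join back to this user's identity.
    users[user_id]["salt"] = secrets.token_hex(8)
```

Deleting the user's row from the users table achieves full deletion by the same mechanism: the hash on the events simply stops resolving to anyone.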
Tracking framework considerations
- Having a clearly defined schema, hierarchy, relationship and objective of each tracked event
- Sending each detail only once at the appropriate hierarchy
- Moving tracking/ processing to the server where feasible
- Ensuring no trace of users’ identifiable details on analytics events
There are many other considerations when designing a tracking framework, but here’s a simplified thought process on some of the items we spent more time deliberating on. Hope it’s useful, and I’m looking forward to learning from others in the comments below! The other part of the equation, the analytics framework, was also re-imagined and redesigned. I might delve into that next time.