💡 Well, almost everything. The 80/20 rule tends to override almost all rules after all!

This is a phrase that has been banging around in my head for a good part of the last few years. It seems every time I encounter a business requirement issue or just a general bug, this statement rears its head. And even though it may seem like an oversimplification, I find it holds true in the majority of situations I’ve encountered so far - especially if one considers that the vast majority of programs and web services have a data store of some kind behind them.

So with the intro out of the way, let’s run through some scenarios:

A problem with state

A big challenge, and an integral part of most modern web apps, is tracking state. This can take many forms, but in general it refers to knowing what an entity is doing at a particular time. Entity is a generalised term, so for the purpose of this scenario I’ll use a User as the example entity.

In order to determine the state of this user, we need to either ask questions of it or let it tell us - either way, the state needs to be kept in sync with the user entity. I find there are actually two requirements to do this accurately:

  1. The current state of the entity
  2. The past state of the entity

Imagine a scenario where an active user is constantly failing to log in to their account. If we track login attempts, either in a log or elsewhere, we would have the data to act on this. Analysing this data could allow us to respond in real time to a possible brute force attack (i.e. an attacker attempting passwords in quick succession) or be proactive and reach out to the user.

The data would say the user is currently logged out (the current state), but the past state events would show repeated attempts to log in. If we do not have this data, we would not know any of this - a data problem leading to a possible security breach.
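To make this concrete, here is a minimal sketch of what keeping both kinds of state could look like. It uses Python with an in-memory SQLite table purely so the snippet is self-contained - the table, column names and thresholds are my own assumptions, not a prescription:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE login_attempts (
        user_id      INTEGER NOT NULL,
        succeeded    INTEGER NOT NULL,   -- 0 = failed, 1 = succeeded
        attempted_at TEXT    NOT NULL    -- ISO 8601 timestamp (UTC)
    )
""")

def record_attempt(user_id: int, succeeded: bool) -> None:
    """Append an event - past attempts are never updated or deleted."""
    conn.execute(
        "INSERT INTO login_attempts (user_id, succeeded, attempted_at) VALUES (?, ?, ?)",
        (user_id, int(succeeded), datetime.now(timezone.utc).isoformat()),
    )

def looks_like_brute_force(user_id: int, window_minutes: int = 5, threshold: int = 10) -> bool:
    """The current state says 'logged out'; the past events tell the real story."""
    cutoff = (datetime.now(timezone.utc) - timedelta(minutes=window_minutes)).isoformat()
    (failures,) = conn.execute(
        "SELECT COUNT(*) FROM login_attempts "
        "WHERE user_id = ? AND succeeded = 0 AND attempted_at > ?",
        (user_id, cutoff),
    ).fetchone()
    return failures >= threshold
```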

Classification: Limited or lack of historical data

Error context

Consider a scenario where you have a product with an on-boarding procedure. Users that sign up are taken through a step-by-step process to set up their accounts further. The data needs to track the state of this procedure, but also record any errors encountered along the journey. This needs to be reflected in the data permanently, not transiently.

If that error context is never persisted, a user who gets stuck halfway through simply shows up as "incomplete", with no indication of where or why they stalled. What the company then sees is reduced conversion, angry customers and muted growth - all because of a data problem!

To extrapolate: errors and bugs happen, but we need the extra data to be recorded so as to give context to what led up to an error and possibly why it occurred. If we are not aware of the source or the magnitude of the problem, we can neither remedy it for the affected users nor adapt our system.

I can hear you shouting “but that’s what logging is for!”. And you’d probably be right, but I think there’s a limit to how much context can be contained in a log entry before it becomes unwieldy or starts to include sensitive info. The log entry should clearly identify the error and then point us to the place where it happened. From there we can use our recorded data context to piece together the entire scene of the crime.
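As a sketch of what that recorded context might look like, here is an illustrative Python/SQLite example. The step names, statuses and columns are hypothetical - the point is simply that failures are persisted next to the onboarding state rather than living only in a log:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE onboarding_events (
        user_id     INTEGER NOT NULL,
        step        TEXT    NOT NULL,   -- e.g. 'verify_email', 'add_payment_method'
        status      TEXT    NOT NULL,   -- 'started', 'completed' or 'failed'
        error_code  TEXT,               -- only populated when status = 'failed'
        occurred_at TEXT    NOT NULL
    )
""")

def record_step(user_id, step, status, error_code=None):
    """Persist every step transition, including failures, so the journey can be reconstructed later."""
    conn.execute(
        "INSERT INTO onboarding_events (user_id, step, status, error_code, occurred_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, step, status, error_code, datetime.now(timezone.utc).isoformat()),
    )

# A failed payment step is recorded permanently, not just logged and forgotten:
record_step(42, "add_payment_method", "started")
record_step(42, "add_payment_method", "failed", error_code="card_declined")
```

A single query over a table like this shows how many users stalled at which step and with which error, which a transient log line cannot do as easily.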

Classification: Lack of contextual data

Race conditions

It happens often enough that an action or an event occurs more than once, either by design or because of an error. The database is an excellent tool to guard against this if used correctly.

Most relational databases are ACID-compliant, so we can lean on the A here - atomicity. This means the database treats a transaction as a single, indivisible unit: either every statement in it is applied, or none of them are. We can use this to our advantage by having our application code write to multiple tables inside a single transaction. If there is a clash, the whole transaction fails, and the rest of the application can handle that failure in a predictable way.

Consider a scenario where incoming webhooks trigger updates on a set of related tables. Without the above strategy, two deliveries could write to the same tables at the same time, possibly causing duplicate data or worse.
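Here is a minimal sketch of that pattern. It uses Python’s built-in sqlite3 module so the snippet runs on its own; the orders/payments tables and the duplicate-delivery scenario are made up, and in production you would do the same against Postgres, MySQL or similar:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL UNIQUE, amount REAL NOT NULL);
""")

def handle_payment_webhook(order_id: int, amount: float) -> None:
    # 'with conn' wraps both writes in a single transaction:
    # either both tables are updated, or neither is.
    with conn:
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO payments (order_id, amount) VALUES (?, ?)",
            (order_id, amount),
        )

with conn:
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'pending')")

handle_payment_webhook(1, 99.90)

# A duplicate delivery of the same webhook violates the UNIQUE constraint,
# the whole transaction rolls back, and neither table is touched.
try:
    handle_payment_webhook(1, 99.90)
except sqlite3.IntegrityError as exc:
    print("duplicate delivery rejected:", exc)
```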

Let’s extend this further by considering what would happen if the data in those tables is read while it is being updated, especially if the update procedure is complex and takes a few seconds. The logic reading this data would get an inaccurate picture, and anything depending on it would inherit that inaccuracy - possibly leading to my least favourite category of bugs: logic errors.
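The read side deserves the same treatment: if the multi-table read happens inside its own transaction, it sees one consistent snapshot rather than a half-finished update. A hedged sketch, assuming Postgres via psycopg2 and the same hypothetical tables as above (connection details are placeholders):

```python
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

conn = psycopg2.connect("dbname=app user=app")  # placeholder connection string
conn.set_session(isolation_level=ISOLATION_LEVEL_REPEATABLE_READ)

def read_order_with_payments(order_id: int):
    """Both SELECTs run inside one transaction, so they see the same snapshot
    of the data even if a slow, multi-table update is in progress elsewhere."""
    with conn:  # commits (or rolls back) the transaction on exit
        with conn.cursor() as cur:
            cur.execute("SELECT status FROM orders WHERE id = %s", (order_id,))
            order = cur.fetchone()
            cur.execute("SELECT amount FROM payments WHERE order_id = %s", (order_id,))
            payments = cur.fetchall()
    return order, payments
```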

Classification: Not leveraging atomicity

Dynamic vs. static data

Consider the following scenario:

A user has a subscription with your service. This subscription is invoiced every month, and it is built from a table of transactions, perhaps a double-entry system. There might be other parts to this as well, such as definitions of transaction types and categorisation.

When building the invoices table, the first thought might be to use a many-to-many relationship to group individual transactions into an invoice. However, a problem soon arises if anything related to those transactions needs to change or is erroneously changed: the historical invoice silently changes along with it.

The solution, in my opinion, is to save the individual transactions inside the invoice as an embedded jsonb column. This preserves the data in the state it was in at the time the invoice was generated. Changes after the fact will not affect this specific invoice, and if need be a new invoice can be created to supersede this one. Just as we currently cannot travel back in time, certain parts of the system - finance in this case - have to follow the same rule.
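A rough sketch of what generating such an invoice could look like, assuming Postgres via psycopg2; the table and column names are invented for illustration:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder connection string

def generate_invoice(user_id: int, period_start: str, period_end: str) -> None:
    """Snapshot the billed transactions into the invoice row itself, so later
    edits to the transactions table cannot rewrite this invoice's history."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, description, amount_cents
                FROM transactions
                WHERE user_id = %s AND created_at >= %s AND created_at < %s
                """,
                (user_id, period_start, period_end),
            )
            line_items = [
                {"transaction_id": tid, "description": desc, "amount_cents": cents}
                for tid, desc, cents in cur.fetchall()
            ]
            cur.execute(
                """
                INSERT INTO invoices (user_id, period_start, period_end, line_items, total_cents)
                VALUES (%s, %s, %s, %s::jsonb, %s)
                """,
                (
                    user_id,
                    period_start,
                    period_end,
                    json.dumps(line_items),  # embedded snapshot of the transactions
                    sum(item["amount_cents"] for item in line_items),
                ),
            )
```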

Classification: Not preserving data in its original state / Rewriting history

That’s it for now. This is a rather wide topic, and I’ll probably be adding a part 2 at some point with some more real-world examples. Specifically, I’d like to illustrate how a seemingly non-data problem can actually be related to data - UX, for instance. To be continued!