How "live" do we need our data to be?
30 November 2020
By Ewen Orme | Head of Development
Often there is a general feeling that data should be as current as possible, to give a "live" view of whatever is being measured or analysed. But exactly how up to date are we talking?
At ZatPark, huge quantities of data continuously flow in and out of the system. The timing of these flows ranges from periodic pushes or pulls (monthly, daily, hourly, every 5 minutes) to more ad-hoc, real-time API interactions. The incredible time-sensitivity of the industry means that the team are constantly reconciling the needs of the customer with what is both technically possible and feasible within a given problem space. Our solution engineering process almost always has a component of "how often should this happen?"
If you consider 'data' as something raw, like a simple set of numbers, then give that data context and you have 'information'. For example, you may know how many vehicles come and go from a site over a period of time. If you take information and reason over it, you get 'knowledge': "I know that occupancy is highest on a Monday". The meaningfulness of data increases as it moves between these abstractions, but how much it matters depends heavily on the use case. In some situations, we may wish to wait for a more complete, fact-checked view before passing that visibility of information on to our customers. As we recommit to reporting and dashboarding feature sets, we strive to remove or mitigate artificial blocks to live data while ensuring it remains useful.
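To make the data, information, knowledge progression concrete, here is a minimal Python sketch. The timestamps are made-up illustrative values, the function names are hypothetical, and counting entries per weekday is only a simple stand-in for real occupancy; none of this reflects how ZatPark actually computes it.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Raw 'data': vehicle entry timestamps (illustrative values only).
entries = [
    "2020-11-23 08:05", "2020-11-23 08:40", "2020-11-23 09:10",
    "2020-11-24 08:15", "2020-11-24 12:30",
    "2020-11-25 09:00",
]


def entries_per_weekday(timestamps):
    """Give the raw timestamps context: how many entries on each weekday."""
    counts = Counter()
    for ts in timestamps:
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        counts[DAY_NAMES[parsed.weekday()]] += 1
    return counts


def busiest_day(counts):
    """Reason over the information: which weekday saw the most entries."""
    return counts.most_common(1)[0][0]


information = entries_per_weekday(entries)   # 'information'
knowledge = busiest_day(information)         # 'knowledge'
print(f"Busiest day in this sample: {knowledge}")
```

The same raw list supports both steps; what changes is how much context and reasoning has been applied to it before it is shown to anyone.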
We often find ourselves considering how much we trust the data entering the system: do we need time to process, massage or error check a raw stream into meaningful information? Do we trust ANPR reads, for example, enough to make judgement calls immediately as and when we receive them? For some things, like recording site occupancy, probably yes. But for ticketing, we'd rather wait for a wider view of the problem space and give our clever matching algorithms a shot at it first. We may also choose to wait for data "alignment": we need a chance to gather information from various sources, such as white list sessions, so that reads covered by them never make it to the checker, which reduces the cognitive load on the human.
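The idea of waiting for alignment before a read reaches a human can be sketched as below. This is an assumption-laden illustration rather than ZatPark's matching logic: the `Read` and `WhitelistSession` types, the `reads_for_checker` function and the 30 minute `SETTLE_DELAY` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Read:
    plate: str
    seen_at: datetime


@dataclass(frozen=True)
class WhitelistSession:
    plate: str
    start: datetime
    end: datetime


# Hypothetical: how long we let other data sources catch up before acting.
SETTLE_DELAY = timedelta(minutes=30)


def reads_for_checker(reads, sessions, now):
    """Return reads old enough to judge and not covered by a whitelist session."""
    ready = []
    for read in reads:
        if now - read.seen_at < SETTLE_DELAY:
            continue  # too fresh: related data may not have arrived yet
        covered = any(
            s.plate == read.plate and s.start <= read.seen_at <= s.end
            for s in sessions
        )
        if not covered:
            ready.append(read)
    return ready


now = datetime(2020, 11, 30, 12, 0)
reads = [
    Read("AB12CDE", datetime(2020, 11, 30, 10, 0)),   # whitelisted below
    Read("XY34ZZZ", datetime(2020, 11, 30, 10, 30)),  # no session, old enough
    Read("LM56NOP", datetime(2020, 11, 30, 11, 50)),  # still settling
]
sessions = [
    WhitelistSession("AB12CDE",
                     datetime(2020, 11, 30, 9, 0),
                     datetime(2020, 11, 30, 17, 0)),
]
to_check = reads_for_checker(reads, sessions, now)
```

Only the second read is passed on here: the first is covered by a session and the third is held back until the delay has elapsed.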
Another big question we face is when "real-time" is technically feasible. We can't rely on connectivity for mobile devices in the hands of operatives who may be patrolling underground or in an area of poor cellular reception. The processing overhead of in-depth reporting or dashboarding also takes a toll on our infrastructure if the data sources being used are under heavy contention from a continuous stream of incoming data. Without getting too technical, our database clusters are based around synchronous replication, which ensures each node is consistent before allowing a data write to proceed. This is great for ensuring all readers have the same up to date view of the data, but not so great at ensuring there are always resources free for reporting workloads.
We are currently looking to address this through two approaches. The first is to form our own data warehouse, where data resides only to support reporting workloads free of that contention and, once separated out, can also be logically translated and mutated into a more “fit for purpose” structure. In cases such as ANPR, it’s the amount of data rather than contention for it that makes reporting hard. In these cases, the second approach we are taking involves using Big Data techniques and technologies to house and process the data. For those interested, we’ve started using Google Bigtable and BigQuery. By facing the challenges of dealing with such large data quantities in a meaningful and timely fashion, the future looks exciting for giving customers better insight into their data.
Before I go, it’s worth noting that, outside of reporting, ensuring day-to-day system performance means, in some cases, making peace with “stale” data (albeit over short, insignificant periods of time); indeed, several system design paradigms are based around this. Embracing the fact that certain parts of the data change very infrequently allows us to reduce processing by utilising caching in ZatPark. We’ve seen great results in reducing database calls in recent months by leveraging the managed caching provided by our hosting partner, Google Cloud, as part of our continued focus on making the system feel as snappy as possible.
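For readers who want to see what "making peace with stale data" looks like in practice, here is a minimal in-process sketch of a time-to-live cache in Python. It is not ZatPark's implementation, which uses a managed caching service; the `TTLCache` class, the `load_site_config` loader, the 60 second lifetime and the returned values are all hypothetical.

```python
import time


class TTLCache:
    """Serve a cached value until it is older than ttl_seconds, then reload."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader   # called on a miss, e.g. a database query
        self._clock = clock     # injectable so the example is deterministic
        self._store = {}        # key -> (value, time it was loaded)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]       # possibly stale, but by at most ttl_seconds
        value = self._loader(key)
        self._store[key] = (value, now)
        return value


calls = []  # records every trip to the (pretend) database


def load_site_config(site_id):
    # Stand-in for a real database call; the returned values are made up.
    calls.append(site_id)
    return {"site_id": site_id, "max_stay_minutes": 120}


fake_now = [0.0]  # a controllable clock, used instead of real time
cache = TTLCache(ttl_seconds=60, loader=load_site_config,
                 clock=lambda: fake_now[0])

cache.get("site-1")   # miss: loads from the database
cache.get("site-1")   # hit: served from the cache
fake_now[0] = 61.0    # pretend just over a minute has passed
cache.get("site-1")   # expired: loads again
```

Three lookups cost two database calls in this run. The trade is that a change to the underlying record may go unseen for up to the chosen lifetime, which is acceptable only for data that changes infrequently.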