When we were building the 10,000ft reports feature, we were faced with a dilemma.

While the majority of reports take only milliseconds to generate (the median is 60 milliseconds at the time of this writing), a small but significant fraction of reports can take multiple seconds, and sometimes a full minute or more, due to the large volumes of data they encompass.

We really didn't want to subject our customers to the frustration of the rainbow spinner of death, so we turned to machine learning.

Because the time it takes to generate a report is approximately proportional to how much data the report encompasses, we knew it would be possible to predict the timing by measuring the volume of the data.

Then, we could use that prediction to replace the dreaded spinner of death with a reassuring progress bar.

## Generating Predictions

So, how do we generate these predictions? There are two relevant variables:

Let **x** equal the number of database records that a given report encompasses.

Let **y** equal the number of seconds the report takes to generate.

Given **x**, predict **y**. Problems of this form are well suited for simple linear regression.

Linear regression is a statistical technique for finding a line that best approximates a set of linearly correlated variables. Simple linear regression is a special case where you only have two variables, **x** and **y**.
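For intuition, here is what that fit looks like on a handful of made-up (record count, seconds) pairs — the numbers are illustrative, not our production data:

```ruby
# Fit y = slope * x + y_intercept by ordinary least squares.
# The data points are illustrative, not real report timings.
xs = [1_000.0, 2_000.0, 3_000.0, 4_000.0]   # records per report
ys = [0.5, 1.1, 1.4, 2.0]                   # seconds to generate

n      = xs.size
sum_x  = xs.sum
sum_y  = ys.sum
sum_xx = xs.sum { |x| x * x }
sum_xy = xs.zip(ys).sum { |x, y| x * y }

slope       = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
y_intercept = (sum_y / n) - slope * (sum_x / n)

predicted = slope * 5_000 + y_intercept  # ≈ 2.45 seconds for a 5,000-record report
```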

## Linear Regression in Ruby

When it comes to machine learning, Python and R are the de facto languages to use because of their rich ecosystems of ML libraries and the communities behind them. However, our application is built in Ruby, and we wanted to keep our prediction models within the context of our existing codebase.

While ML libraries for Ruby do exist (most notably those under the SciRuby umbrella), none of them quite fit our use case. So, we decided to create our own linear regression models from scratch in Ruby.

We created a Ruby class called `LinearRegressionModel` with two public methods, `train` and `predict`, and one private method, `forget`.

```ruby
class LinearRegressionModel < ApplicationRecord
  def predict(x)
    self.slope * x + self.y_intercept
  end

  def train(x, y)
    self.data_size += 1
    self.sum_x += x
    self.sum_y += y
    self.sum_xx += x * x
    self.sum_xy += x * y
    rise = (self.data_size * self.sum_xy) - (self.sum_x * self.sum_y)
    run = (self.data_size * self.sum_xx) - (self.sum_x * self.sum_x)
    self.slope = run == 0 ? 0 : rise / run
    self.y_intercept = (self.sum_y / self.data_size) - (self.slope * (self.sum_x / self.data_size))
    forget
  end

  private

  def forget
    if self.data_size > self.max_data_size
      self.sum_x = (self.sum_x / self.data_size) * self.max_data_size
      self.sum_y = (self.sum_y / self.data_size) * self.max_data_size
      self.sum_xx = (self.sum_xx / self.data_size) * self.max_data_size
      self.sum_xy = (self.sum_xy / self.data_size) * self.max_data_size
      self.data_size = self.max_data_size
    end
  end
end
```
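To make the mechanics easy to experiment with outside of Rails, here is a framework-free sketch of the same running-sums approach. The class name `TinyLinearModel` and the `max_data_size` default are ours for illustration, not part of the production model:

```ruby
# Plain-Ruby sketch of the incremental least-squares fit (no ActiveRecord).
class TinyLinearModel
  def initialize(max_data_size: 1_000)
    @max_data_size = max_data_size
    @n = 0
    @sum_x = @sum_y = @sum_xx = @sum_xy = 0.0
    @slope = @intercept = 0.0
  end

  def predict(x)
    @slope * x + @intercept
  end

  def train(x, y)
    @n += 1
    @sum_x += x
    @sum_y += y
    @sum_xx += x * x
    @sum_xy += x * y
    rise = @n * @sum_xy - @sum_x * @sum_y
    run  = @n * @sum_xx - @sum_x * @sum_x
    @slope = run.zero? ? 0.0 : rise / run
    @intercept = (@sum_y / @n) - @slope * (@sum_x / @n)
    forget
  end

  private

  # Scale the sums back down so recent data dominates.
  def forget
    return unless @n > @max_data_size
    scale = @max_data_size.to_f / @n
    @sum_x *= scale
    @sum_y *= scale
    @sum_xx *= scale
    @sum_xy *= scale
    @n = @max_data_size
  end
end

# Train on synthetic data lying on y = 0.002x + 0.1 (200 points,
# so the forgetting cap is never hit) and predict a new point.
model = TinyLinearModel.new
200.times { |i| x = (i + 1) * 100.0; model.train(x, 0.002 * x + 0.1) }
model.predict(10_000.0)  # recovers roughly 0.002 * 10_000 + 0.1 = 20.1
```

Because the slope and intercept are recomputed from running sums, each `train` call is O(1) no matter how much data the model has already seen.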

`predict` and `train` do exactly what you would expect. For every report, `predict` is called before the report generation begins, and `train` is called after it completes.

`train` takes an x and y data point, increments the data size, updates the running sums, and uses the standard least-squares formulas to calculate the slope and y intercept of the line: slope = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), and y intercept = (Σy / n) − slope · (Σx / n).

At the end of `train`, the `forget` method is called, which scales down the sums and data size to match the model's max data size.

We do this for two reasons.

The first reason is that the numbers have to be bounded in some way. Otherwise, the running sums would eventually grow so large that standard floating point format could no longer represent them precisely, and the accuracy of the models would quickly degrade as the numbers get rounded off at higher and higher orders of magnitude.
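This is easy to see with double-precision floats, which Ruby uses for `Float`: past 2**53, the gap between adjacent representable values is greater than 1, so adding a small increment to a huge sum does nothing at all:

```ruby
# Above 2**53, adjacent Float values are more than 1 apart, so a
# small increment to a huge running sum is silently lost.
big = 2.0**53
(big + 1.0) - big  # => 0.0: the increment vanished
```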

The second reason is that we want our models to stay flexible. This adjustment creates a helpful bias such that recent data is weighted more heavily than older data, hence the name `forget`. We update these models on the fly after every report, and we want the models to automatically adapt to changing circumstances.

For example, if we were to upgrade our hardware such that all database queries run faster, the prediction models would suddenly be overestimating the query times. But because the models are biased toward the most recent data, they will readjust relatively quickly as new data comes in.
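You can quantify how quickly that readjustment happens. Once the model is at its cap m, each new report rescales the old sums by m / (m + 1), so a data point seen k reports ago carries weight (m / (m + 1))**k — an exponential forgetting curve. With an illustrative cap of 1,000 (not our production setting):

```ruby
# With the model at its cap m, every new report rescales the old sums
# by m / (m + 1), so a point seen k reports ago has weight (m/(m+1))**k.
m = 1_000.0
half_life = Math.log(0.5) / Math.log(m / (m + 1))
half_life.round  # reports until an old data point's influence halves
```

So with a cap of 1,000, an observation's influence halves roughly every 700 reports, and stale data fades out on its own.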

## Making Adjustments

One thing we learned after observing these models in production was that larger reports have larger variance in query times. So, even though the predictions for these large reports are accurate on average, individual predictions can sometimes be off by a significant margin.

To address this, we made an adjustment to the front-end progress bar code to interpret the predictions pessimistically by adding extra buffer. We reasoned that it's better to under-promise and over-deliver than the other way around.

```javascript
const calculateWaitTime = (prediction) => {
  const exponentialBuffer = 0.01 * Math.pow(prediction, 2);
  const linearBuffer = 2 * prediction;
  const buffer = Math.min(exponentialBuffer, linearBuffer);
  return prediction + buffer;
};
```

The buffer is calculated as one percent of the square of the prediction, with an upper bound of two times the prediction. So, if the prediction is 10 seconds, then 1 second of buffer is added for a total of 11 seconds. If the prediction is 60 seconds, then an extra 36 seconds is added for a total of 96 seconds.
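A Ruby port of the same rule (for illustration only; the production version is the front-end JavaScript above) reproduces those numbers:

```ruby
# Ruby port of calculateWaitTime, for illustration only.
def calculate_wait_time(prediction)
  buffer = [0.01 * prediction**2, 2.0 * prediction].min
  prediction + buffer
end

calculate_wait_time(10)  # => 11.0
calculate_wait_time(60)  # => 96.0
```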

We calculate it this way because smaller predictions are accurate enough that they don't need much buffer. It's only the really large reports that need it.

## Setting Expectations

In an ideal world, there would be no need for a progress bar at all because all reports run super fast. But in the real world, computation takes time.

From a UX point of view, waiting for a process to complete is much easier to tolerate when you know what to expect. No one enjoys staring at a rainbow spinner of death. Now, thanks to linear regression, our users don’t have to.