Why do we need to estimate next sprint velocity?

In projects, Scrum Masters and delivery managers often need to estimate the team's velocity for the next sprint. Even with an agile approach, clients want to know what they can expect from the next release. A burndown chart built on a poorly estimated velocity can be misleading and set the client's expectations too high. From the team's side, it helps if the Scrum Master can assess how much the team can deliver, so that they can propose a sprint goal and stretch goals that challenge the team.

How to estimate next sprint velocity?

According to the Scrum Book, next sprint velocity can be predicted using a running (rolling) average of the last 3 sprints (http://scrumbook.org/value-stream/running-average-velocity.html). This approach can work if team capacity does not change much between sprints. To achieve better accuracy even when part of the team is on holiday, I propose a slightly different approach that uses the capacity to velocity ratio.

Capacity to velocity relation

To make sure that team capacity during a sprint is taken into account, I recommend calculating the ratio between delivered story points and team capacity for each closed sprint. This ratio should be a much more stable indicator over time than velocity itself.

$$ratio = \frac{sprint\ velocity}{team\ capacity\ during\ sprint}$$

Team capacity is not difficult to calculate. I add up the days spent on the project during the sprint by tech architects, developers and test engineers. In my opinion they all contribute to the team's output. For example, if a tester is on holiday, the rest of the team needs to perform the tests, which decreases the number of features delivered because they have to split their focus across more tasks. This is not a very granular approach, because the value delivered by individual team members may vary, but it should be sufficient for comparing estimation methods.
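The capacity and ratio calculation described above can be sketched in a few lines of plain Python. The function names and all the numbers are made up for illustration and are not taken from the project data.

```python
# Hypothetical example: person-days spent on the project during one sprint.
days_on_project = {
    "tech architect": 5,
    "developer 1": 10,
    "developer 2": 8,   # two days of holiday
    "test engineer": 10,
}


def team_capacity(days_by_person):
    """Team capacity is the sum of person-days of everyone who contributes."""
    return sum(days_by_person.values())


def capacity_velocity_ratio(velocity, capacity):
    """Story points delivered per person-day of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return velocity / capacity


capacity = team_capacity(days_on_project)
ratio = capacity_velocity_ratio(40, capacity)  # 40 delivered SP is a made-up value
print(capacity, round(ratio, 2))
```

Counting every role in the same unit keeps the calculation simple; weighting individual roles differently would be a possible refinement.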

Include bugs in velocity calculations

Including effort spent on defects in velocity is usually not recommended. I'd argue that it is worth doing double accounting and calculating velocity both with and without bugs (see: https://www.infoq.com/news/2011/09/bug-fixes-velocity). In my teams I ask developers to note how difficult a fix was once they have finished it. To achieve more accurate predictions, I added the bug-fixing effort, expressed in story points, to the sum of story points for delivered stories.
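As a rough illustration of this double accounting, the sketch below computes both velocities for a single sprint. The issue list and the helper name `sprint_velocities` are hypothetical.

```python
# Hypothetical sprint content: (issue type, story points).
# Bug fixes carry the points developers assigned after finishing the fix.
sprint_issues = [
    ("story", 5),
    ("story", 8),
    ("bug", 2),
    ("story", 3),
    ("bug", 1),
]


def sprint_velocities(issues):
    """Return (velocity without bugs, velocity with bugs) in story points."""
    stories = sum(points for kind, points in issues if kind == "story")
    bugs = sum(points for kind, points in issues if kind == "bug")
    return stories, stories + bugs


without_bugs, with_bugs = sprint_velocities(sprint_issues)
print(without_bugs, with_bugs)
```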

Libraries and methods that are used in the notebook:

In [138]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

import xgboost as xgb
import matplotlib
matplotlib.rcParams['figure.figsize'] = [10, 5]
from sklearn.model_selection import train_test_split
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [139]:
closedSprints = pd.read_csv('sprintsData.csv', delimiter=';')
closedSprints['ratio'] = round(closedSprints['velocity'] / closedSprints['capacity'], 2) 
closedSprints.head()
print("Number of sprint observations: {}".format(len(closedSprints)))
Number of sprint observations: 20

The training data comes from one of my projects. There are not many observations in the training dataset. Unfortunately, this is a common situation when it comes to predicting velocity. Let's see if it is enough to train prediction models.

Below is the function that plots a forecast along with the historic velocity. Additionally, I'd like to display an 80% prediction interval around the forecast. The assumption is that the errors are normally distributed.

If you're interested in prediction intervals and why they're necessary in planning, please refer to my previous post: Prediction intervals in Agile forecasting
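The 80% interval under the normality assumption boils down to a single multiplier applied to the error standard deviation. The sketch below shows where the value of about 1.28 comes from, using only the standard library. The function name `normal_prediction_interval` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def normal_prediction_interval(forecast, error_std, coverage=0.8):
    """Symmetric interval around a forecast, assuming zero-mean normal errors."""
    # For 80% coverage, 10% of the probability mass is left in each tail,
    # so we need the 90th percentile of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    lower = max(forecast - z * error_std, 0.0)  # velocity cannot be negative
    upper = forecast + z * error_std
    return lower, upper


z80 = NormalDist().inv_cdf(0.9)
print(round(z80, 2))

# Made-up forecast of 40 SP with an error standard deviation of 8 SP.
lower, upper = normal_prediction_interval(forecast=40, error_std=8)
print(round(lower, 1), round(upper, 1))
```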

In [140]:
# method used to visualise forecast with actual values and prediction interval.
def dispForecast(forecast, method):
    forecast['error'] = forecast['forecast'] - forecast['velocity']
    errorStd = round(forecast.error.std(), 2)
    forecast['80_up'] = np.clip(forecast['forecast'] + 1.28 * errorStd, a_min=0, a_max=None)
    forecast['80_down'] = np.clip(forecast['forecast'] - 1.28 * errorStd, a_min=0, a_max=None)
    
    _ = plt.xticks( forecast.index.values , forecast.sprintNo ) # location, labels
    _ = plt.plot( forecast['velocity'])
    _ = plt.plot( forecast['forecast'])
    
    
    _ = plt.fill_between(forecast.index.values, forecast['80_up'], forecast['80_down'],
                    color='orange', alpha=0.1)  # alpha must be a number, not a string

    
    _ = plt.title('Forecast method: {}'.format(method))
    _ = plt.xlabel('Sprint no')
    _ = plt.ylabel('Velocity with bugs')
    _ = plt.legend(['velocity', 'velocity forecast'])
    _ = plt.show()
    
    
    forecast.error.hist(bins=20)
    print('Error standard deviation for forecast method \'{}\': {}'.format(method, errorStd))
    print('Error mean for forecast method \'{}\': {}'.format(method, round(forecast.error.mean(), 2)))
    
    errors = forecast['error'].dropna()
    rmse = (errors ** 2).mean() ** 0.5
    print('Root mean square error: {}'.format(round(rmse, 2)))

3 last sprints velocity average

I'd like to use the average velocity of the last 3 sprints as a baseline against which to compare more sophisticated models. My assumption is that none of the following models will do worse than this.

In [141]:
data = closedSprints.copy()
# calculate rolling average of 3 last sprint velocities
data['forecast'] =  data['velocity'].shift().rolling(3, min_periods=1).mean()
dispForecast(data, '3 last sprints velocity')
Error standard deviation for forecast method '3 last sprints velocity': 8.17
Error mean for forecast method '3 last sprints velocity': 2.04
Root mean square error: 8.21

Comments

The error histogram is skewed to the right. The error could be reduced by subtracting the mean error of 2.04 from the initial forecast. As you can see, the error standard deviation is quite high, which makes our prediction interval quite wide. I'd argue that it is not very useful for planning purposes.
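The effect of subtracting the mean error can be illustrated with a small sketch. The forecasts and velocities below are made up, and the sign convention (error = forecast - velocity) follows dispForecast. Note that the correction is estimated on the same sprints it is applied to, so the improvement on future sprints may be smaller.

```python
import numpy as np

# Made-up forecasts and actual velocities, for illustration only.
forecast = np.array([32.0, 35.0, 30.0, 38.0, 36.0])
velocity = np.array([30.0, 31.0, 29.0, 37.0, 33.0])


def rmse(errors):
    """Root mean square of an array of errors."""
    return float(np.sqrt(np.mean(np.square(errors))))


error = forecast - velocity   # same sign convention as dispForecast
bias = float(error.mean())    # positive: forecasts are too optimistic
corrected = forecast - bias   # shift every forecast down by the bias

rmse_before = rmse(error)
rmse_after = rmse(corrected - velocity)
print(round(bias, 2), round(rmse_before, 2), round(rmse_after, 2))
```

After the shift, the remaining RMSE equals the (population) standard deviation of the original errors, so only the spread of the errors is left.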

3 last velocity to capacity ratio

The next approach takes team capacity into consideration, but it still uses a naive average. It should improve the forecast, but probably not by much.

In [142]:
data = closedSprints.copy()
# calculate rolling average of 3 last ratios
data['ratioForecast'] =  data['ratio'].shift().rolling(3, min_periods=3).mean()

# calculate velocity prediction using capacity and average ratio
data['forecast'] = round(data['capacity'] * data['ratioForecast'], 3)


dispForecast(data, 'Last 3 rolling average ratio')
Error standard deviation for forecast method 'Last 3 rolling average ratio': 7.36
Error mean for forecast method 'Last 3 rolling average ratio': 2.03
Root mean square error: 7.43

Comments

The error histogram for the ratio average does not resemble a normal distribution. In this case the calculated prediction interval might be inaccurate. The standard deviation is a bit lower than in the previous example but still quite large. I'd be cautious when using this method to estimate the velocity of upcoming sprints.
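When the errors do not look normal, one possible alternative (not used in this notebook) is to build the interval from empirical quantiles of the past errors instead of 1.28 standard deviations. The sketch below assumes the same sign convention as dispForecast; the function name and the error values are hypothetical, and with only about 20 sprints the quantiles themselves are noisy estimates.

```python
import numpy as np


def empirical_prediction_interval(forecast, past_errors, coverage=0.8):
    """Interval from empirical error quantiles (error = forecast - velocity)."""
    errors = np.asarray(past_errors, dtype=float)
    errors = errors[~np.isnan(errors)]  # early sprints may have no forecast
    tail = (1.0 - coverage) / 2.0
    low_err, high_err = np.quantile(errors, [tail, 1.0 - tail])
    # velocity = forecast - error, so a large positive error means low velocity.
    lower = max(forecast - high_err, 0.0)
    upper = max(forecast - low_err, 0.0)
    return float(lower), float(upper)


# Made-up, right-skewed errors from earlier sprints.
past_errors = [-3.0, -1.0, 0.0, 1.0, 2.0, 2.0, 4.0, 9.0, 12.0, np.nan]
lower, upper = empirical_prediction_interval(40.0, past_errors)
print(round(lower, 1), round(upper, 1))
```

Unlike the normal-based interval, this one does not have to be symmetric around the forecast, which suits a skewed error histogram better.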

Linear regression

Linear regression is one of the most popular machine learning techniques. In short, it tries to find a linear relation between variables, here between capacity and velocity.

In [143]:
data = closedSprints.copy()
# Create linear regression object
# The 'normalize' argument was removed in scikit-learn 1.2; predictions do not depend on it
regressor = linear_model.LinearRegression()


X = data[['capacity']]
Y = data['velocity']
# Train the model using the training sets
regressor.fit(X, Y)

data['forecast'] = regressor.predict(data[['capacity']])
#data['forecast'] = regressor.coef_ * data['capacity'] + regressor.intercept_

dispForecast(data, 'Linear regression')
Error standard deviation for forecast method 'Linear regression': 7.87
Error mean for forecast method 'Linear regression': -0.0
Root mean square error: 7.68

As you can see, the standard deviation (7.87) is a bit larger than in the previous example (7.36). Maybe the hypothesis that the relation between capacity and velocity is linear does not hold. Let's add squared capacity to the equation and see if the forecast improves.
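Written out, the extended model looks as follows. The coefficient symbols are introduced here only for illustration. The model is still linear in its coefficients, which is why the same LinearRegression class can fit it once the squared column is added as a second feature.

$$velocity = \beta_0 + \beta_1 \cdot capacity + \beta_2 \cdot capacity^2$$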

In [144]:
# The 'normalize' argument was removed in scikit-learn 1.2; predictions do not depend on it
regressor = linear_model.LinearRegression()

# add squared capacity
data['capacitySquared'] = data['capacity'] ** 2
X = data[['capacity', 'capacitySquared']]
Y = data['velocity']
# Train the model using the training sets
regressor.fit(X, Y)

data['forecast'] = regressor.predict(data[['capacity', 'capacitySquared']])
# The error column is computed inside dispForecast (forecast - velocity)

dispForecast(data, 'Linear regression with capacity squared')
Error standard deviation for forecast method 'Linear regression with capacity squared': 7.58
Error mean for forecast method 'Linear regression with capacity squared': 0.0
Root mean square error: 7.39

Comments

The standard deviation is a bit lower after adding squared capacity. In my opinion there is room for improvement here. I'd check the partial correlation of the errors to see whether we are missing any features that could benefit the model. Most probably, adding information such as the average size of stories, the number of stories, etc. would further improve the model.

XGBoost

In the next example I'd like to try the boosted trees method to tackle the problem. Again, I'd like to encourage you to learn how it works: Introduction to boosted trees. As usual, the code is quite simple.

In [145]:
data = closedSprints.copy()
X = data[['capacity']]
y = data['velocity']

regressor = xgb.XGBRegressor()
regressor.fit(X, y)
data['forecast'] = regressor.predict(X)

    
dispForecast(data, 'xgboost')
Error standard deviation for forecast method 'xgboost': 4.53
Error mean for forecast method 'xgboost': -0.02
Root mean square error: 4.42

Comments

And we have a winner. XGBoost achieved the best result without any tuning. There is a risk that the model is severely overfitted, because the forecast is evaluated on the same sprints the model was trained on, so I'd be a bit cautious with the results. On the other hand, I can see big potential to improve the results further by adding more features, as with linear regression.
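One way to probe the overfitting risk is to forecast each sprint using only the sprints that came before it and measure the error on those forecasts. The sketch below is a minimal walk-forward check under stated assumptions: the function names are hypothetical, the data is synthetic, and a simple least-squares line from NumPy stands in for the model. A wrapper that fits and predicts with an xgb.XGBRegressor could be passed in the same way.

```python
import numpy as np


def walk_forward_rmse(capacity, velocity, fit_predict, min_train=5):
    """RMSE of one-step-ahead forecasts, each made from earlier sprints only."""
    capacity = np.asarray(capacity, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    if len(velocity) <= min_train:
        raise ValueError("need more sprints than min_train")
    errors = []
    for i in range(min_train, len(velocity)):
        # Train on sprints 0..i-1, then forecast sprint i from its capacity.
        forecast = fit_predict(capacity[:i], velocity[:i], capacity[i])
        errors.append(forecast - velocity[i])
    return float(np.sqrt(np.mean(np.square(errors))))


def linear_fit_predict(train_capacity, train_velocity, next_capacity):
    """Stand-in model: least-squares line through the training sprints."""
    slope, intercept = np.polyfit(train_capacity, train_velocity, deg=1)
    return slope * next_capacity + intercept


# Synthetic data: velocity is exactly 0.5 * capacity + 2, so a line fits perfectly.
capacity = [40, 36, 44, 30, 42, 38, 45, 33]
velocity = [0.5 * c + 2 for c in capacity]
out_of_sample_rmse = walk_forward_rmse(capacity, velocity, linear_fit_predict)
print(out_of_sample_rmse < 1e-6)
```

If the out-of-sample RMSE of a model is much higher than the in-sample figure reported above, that is a strong hint of overfitting.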

Summary

| Model | RMSE | Error standard deviation |
|---|---|---|
| 3 last sprints velocity average | 8.21 | 8.17 |
| 3 last velocity to capacity ratio | 7.43 | 7.36 |
| Linear regression | 7.68 | 7.87 |
| XGBoost | 4.42 | 4.53 |

It seems that the relation between capacity and velocity is not linear, and the boosted trees approach gives the best results here. By adding new features, such as the average story size in the backlog and/or in progress before the sprint starts, I could most probably decrease the error and standard deviation even further. I hope this encourages you to experiment with your own Jira data to better estimate the velocity of your teams!