Removing the bubbles: solving bottlenecks in software product development
A challenge with software product development is visualising the work so that you can spot where there are delays in the process of converting ideas from “concept to cash”. This post shows how a cumulative flow diagram helped identify a pattern of queues over time. Removing these queues had many benefits, such as fewer errors, increased team communication and improved team capacity.
Make the work visible
The first task is making the work visible. In knowledge work, such as software development, it is difficult to see the work being done, which is why a visualisation approach such as kanban can be so useful. Here’s a view of a kanban board from an earlier client team:
The kanban board is useful for a “moment in time” view, but it’s not possible to easily see patterns that might develop over time. Looking at the kanban board on a particular day doesn’t make it easy to answer questions like these:
- How long have these work items been waiting in this column (stage)?
- How long does it usually take for work items in this stage of the process to complete?
- How often do we see queues in this step? How long do they last for?
- Are these queues a special event or do they happen regularly? (This touches on the difference between common and special cause that I’ve mentioned in an earlier blog post.)
To find these answers and look more clearly for patterns over time, we built a cumulative flow diagram (CFD, also called a ‘finger chart’) by counting the number of post-it notes in each stage (column) of the team’s process after each daily stand-up. Unlike my earlier post on using three forks and a hand-drawn chart to help a team improve, in this case we used an Excel spreadsheet.
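The daily counting can be sketched in a few lines of code. The stage names and counts below are made-up illustrations, not the team’s actual data; plotting one line per stage from the cumulative rows gives the classic ‘finger chart’, and a widening gap between two lines is a queue.

```python
# Minimal sketch of building cumulative flow data from daily stand-up
# counts. Stage names and numbers are illustrative, not real data.

STAGES = ["Backlog", "Development", "Functional Testing",
          "Acceptance Testing", "Done"]

# One entry per day: number of post-it notes in each column after stand-up.
daily_counts = [
    {"Backlog": 10, "Development": 3, "Functional Testing": 1,
     "Acceptance Testing": 0, "Done": 2},
    {"Backlog": 9, "Development": 3, "Functional Testing": 3,
     "Acceptance Testing": 1, "Done": 2},
    {"Backlog": 8, "Development": 2, "Functional Testing": 5,
     "Acceptance Testing": 1, "Done": 3},
]

def cfd_rows(counts, stages):
    """For each day, return the cumulative count of items that have
    reached each stage or any stage further downstream."""
    rows = []
    for day in counts:
        row = {}
        running = 0
        for stage in reversed(stages):  # Done first, Backlog last
            running += day[stage]
            row[stage] = running
        rows.append(row)
    return rows

for i, row in enumerate(cfd_rows(daily_counts, STAGES), start=1):
    print(f"Day {i}:", {s: row[s] for s in STAGES})
```

Feeding one such row per stand-up into a stacked area (or simple line) chart in Excel or any plotting tool reproduces the diagram discussed below.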
Visualise the work over time to better understand queues (‘bubbles’)
The cumulative flow diagram for this team made visible that there were consistent queues of work in the functional testing and acceptance testing stages over time. These queues appear as “bubbles” that develop in the cumulative flow diagram. See the stages highlighted in orange and red below (click the image for a larger version).
Do the detective work necessary to understand what causes the queues (‘bubbles’)
Around two-thirds of the way through the above chart (which covered about 36 weeks) we decided to focus on studying what was causing the queues to develop in functional and acceptance testing.
The functional testing involved someone other than the person who developed the functionality (user story) validating that it worked functionally (there were no obvious errors). Once functional testing was complete then the acceptance testing stage was performed by a business analyst or the product manager.
The team released to production every second Wednesday. On the middle Wednesday the person who did the functional testing switched to doing the integration testing: ensuring the features packaged for production worked individually and combined, and running a set of manual regression test scripts to make sure the new functionality hadn’t had any impact on the rest of the website. During the week spent on integration testing, no functional testing was done, which we believed was the cause of the queues, the orange bubbles on the chart.
Creating a new policy to reduce the queues (‘bubbles’)
We sat down with the person who performed the Functional and Integration Testing and mapped out the schedule of their work across the fortnight between releases (see the hand-drawn diagram we came up with below).
We also mapped out a new “policy” that described what the person doing testing did during the week spent integration testing:
While performing the Integration Testing in the week before the release, if there are any work items in the Functional Testing column, spend up to an hour each day doing them.
We experimented with the new policy for the last third of the cumulative flow chart. The CFD showed that the queue (bubble) in the Functional Testing (orange) step virtually disappeared, as did the queue in the Acceptance Testing (red) stage. The CFD not only highlighted the initial problem, it also validated that the experimental policy change resulted in an improvement (it allowed us to answer the critical question: “did the change we made to our process result in an improvement?”).
It’s the system!
This example demonstrates how changing the way the work is structured can produce improvements without changing the work that team members do. The queues were caused by the way the work was structured (i.e. the system we had designed), not by the work of the team members. It speaks to Deming’s ‘provocation’ that “95% of the variation [in how long the work takes] is due to the system and not the individuals”.
There were many benefits to the changes that we made above:
- Removing the queue in functional testing meant that if a problem was found then the developer got faster feedback. Getting feedback faster reduced the time it took a developer to “get their head back into the issue” and fix the problems. It also improved the communication between members of the team – the developers were more likely to speak to the person who did test at stand-up about the work that was coming because they knew it would be tested quickly, rather than potentially sitting in a queue waiting for a week.
- Reducing the bottleneck in Functional Testing also reduced the bottleneck in Acceptance Testing.
- The reduced “thrashing” from having issues discovered close to the release date meant the team’s capacity to do work increased.
- With fewer queues there was less pressure on team members. They felt less rushed, which improved both quality of life and quality of work, as well as team morale.
Hi, I’m Benjamin. I hope that you enjoyed the post. I’m a consultant and coach who helps IT teams and their managers create more effective business results. You can find out more about me and my services. Contact me for a conversation about your situation and how I could help.
Control / Capability Charts on a Kanban Software Development Project
Corey Ladas and David J Anderson have recently spoken about how lots of software/IT teams using kanban boards have created cumulative flow diagrams, but few of them have created control (or capability) charts or histograms.
On the software development project where I’m the Project Manager, we have started using control charts for several reasons:
- To better understand the variation over time in a number of key areas (how many items we deliver to production each fortnight, and how many days it takes from a developer starting a story until it is ready for production).
- To better understand how we can improve our process by separating common cause problems from special cause problems.
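As a concrete illustration of the mechanics, here is a minimal sketch of computing control limits for an individuals (XmR) chart, the usual choice for one-value-per-period data like releases per fortnight. The release counts are made-up sample data, not our project’s real figures; the 2.66 constant applied to the average moving range is the standard XmR scaling factor.

```python
# Sketch of XmR (individuals chart) control limits. Sample data only.

def xmr_limits(values):
    """Return (mean, lower control limit, upper control limit) using the
    standard XmR constant 2.66 applied to the average moving range."""
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return mean, mean - 2.66 * avg_mr, mean + 2.66 * avg_mr

# Jiras delivered per fortnightly release (illustrative)
released = [7, 9, 6, 8, 10, 7, 5, 8, 6, 7]
mean, lcl, ucl = xmr_limits(released)
print(f"mean={mean:.1f}  LCL={lcl:.1f}  UCL={ucl:.1f}")

# Points outside the limits are candidates for special-cause investigation.
print("points outside limits:", [x for x in released if not lcl < x < ucl])
```

Plotting the values against the mean and the two limit lines gives the control chart; points inside the limits are treated as common-cause variation.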
Control Chart 1: Features Released to Production
The first control chart we’ve used shows the number of Jiras (think stories) we have delivered per iteration, normalised for iteration length and the number of available developers.
I’ve split the periods on the charts to reflect pre- and post-Go Live.
What this chart shows me is that the system is in statistical control. None of the points are outside the Upper Control Limit. There is currently a slight downward trend in the number of Jiras we deliver (six consecutive releases are under the mean).
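The “six consecutive releases under the mean” observation is an example of a run rule: a streak of points on one side of the mean that, if it grows long enough (many rule sets use seven or eight), is itself treated as a special-cause signal even when no point breaches a control limit. A minimal sketch of checking for such a run, with made-up release counts:

```python
# Sketch of a run-rule check: longest streak of consecutive points
# below the mean. Release counts are illustrative sample data.

def longest_run_below(values, mean):
    """Length of the longest streak of consecutive values below the mean."""
    longest = current = 0
    for v in values:
        current = current + 1 if v < mean else 0
        longest = max(longest, current)
    return longest

released = [9, 10, 8, 11, 7, 6, 7, 6, 5, 7]
mean = sum(released) / len(released)
print("longest run below the mean:", longest_run_below(released, mean))
```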
As the Project Manager I’d like to achieve two things. First, I’d like to increase the amount of value we deliver to our end customers. This is tricky since a count of Jiras has only a very rough correlation with end-customer value, and I don’t want to encourage the team to focus on the numbers (e.g. I don’t want lots of small stories instead of larger ones, if the larger ones have value). Second, I’d like to reduce the variability over time. I’m puzzling over whether a sizing metric on the stories would help reduce some of the noise in the variation.
Given that the chart shows we’re in control, if we want to improve the amount we deliver each fortnight we probably need to look for common cause / system-wide influences, rather than special causes, such as asking the developers to “just work harder”. A lot of the delay recently has come from dealing with other teams within the IT department who have longer lead times than we do. From a systems perspective, improving our ability to work with other teams is where I think we’ll gain more throughput improvement than from focusing on improvements within our team.
Cycle Time: From Development to Ready for Release
The second area we have used Control Charts is looking at the time it takes for a Story to progress through the following states in our process (and on the Kanban board):
- Developer Design (Technical Design Discussion, How to Test written)
- Development Underway (Including Code Review if it wasn’t developed by pairing developers)
- Functional Testing (this is where stories wait for the tester to test them)
- Acceptance Testing (this is where stories wait for the BA to test them)
We started out with a histogram to understand the range of times it was taking tasks to cross these columns on the board.
The histogram gives a good overview of how long the tasks are taking, but it’s more interesting to see this in a control chart, since the Upper Control Limit helps highlight which tasks took excessively long and are likely due to special causes worth root-cause analysis. I prefer the timeline view of the control chart to the histogram since it shows whether things are changing over time and more clearly illustrates the outliers.
Here we can see that there are two tasks which were excessively long. These were special causes: one was due to working in a new way with another IT team, and the other was due to a “pile up” of work in progress with one developer who was tasked with performance and scalability testing the application (which required co-ordination with other groups to access the testing infrastructure). Removing these two outliers reveals some other tasks to investigate, again mainly those involving co-ordination with other teams.
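The remove-and-recompute step can be sketched as follows. The cycle times below are made-up illustrations (days from start of development to ready for release), and the 2.66 constant is the standard XmR scaling factor for an individuals chart; points above the UCL are the special-cause candidates, and recomputing the limits without them tightens the band and surfaces the next tasks worth investigating.

```python
# Sketch: flag cycle-time outliers against an XmR upper control limit,
# then recompute the limit with the special-cause points removed.
# Cycle times (in days) are illustrative sample data.

def xmr_mean_ucl(values):
    """Return (mean, upper control limit) for an individuals chart."""
    mean = sum(values) / len(values)
    avg_mr = (sum(abs(b - a) for a, b in zip(values, values[1:]))
              / (len(values) - 1))
    return mean, mean + 2.66 * avg_mr

cycle_times = [4, 5, 4, 5, 6, 4, 30, 5, 4, 5, 6, 5, 4, 34, 5, 4]
mean, ucl = xmr_mean_ucl(cycle_times)
outliers = [t for t in cycle_times if t > ucl]
print(f"UCL={ucl:.1f}, special-cause candidates: {outliers}")

# Recompute with the outliers removed to see what the routine process
# looks like; anything above the tighter limit is the next candidate
# for root cause analysis.
remaining = [t for t in cycle_times if t <= ucl]
mean2, ucl2 = xmr_mean_ucl(remaining)
print(f"recomputed UCL={ucl2:.1f}")
```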
The charts and histograms in these two areas suggest that the most productive improvements in our development system are going to come from working out ways of working better with other teams.