Removing the bubbles: solving bottlenecks in software product development
A challenge with software product development is visualising the work so that you can spot where there are delays in the process of converting ideas from “concept to cash”. This post shows how a cumulative flow diagram helped identify a pattern of queues over time. Removing these queues had many benefits, such as fewer errors, better communication within the team and increased team capacity.
Make the work visible
The first task is making the work visible. In knowledge work, such as software development, it is difficult to see the work being done, which is why a visualisation approach such as kanban can be so useful. Here’s a view of a kanban board from an earlier client team:
The kanban board is useful for a “moment in time” view, but it’s not possible to easily see patterns that might develop over time. Looking at the kanban board on a particular day doesn’t make it easy to answer questions like these:
- How long have these work items been waiting in this column (stage)?
- How long does it usually take for work items in this stage of the process to complete?
- How often do we see queues in this step? How long do they last?
- Are these queues a special event, or do they happen regularly? (This touches on the difference between common and special cause variation I’ve mentioned in an earlier blog post.)
To find these answers and look more clearly for patterns over time, we built a cumulative flow diagram (CFD, also called a ‘finger chart’) by counting the number of post-it notes in each stage (column) of the team’s process after each daily stand-up. Unlike my earlier post on using three forks and a hand-drawn chart to help a team improve, in this case we used an Excel spreadsheet.
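If you don’t have a spreadsheet handy, a few lines of code can do the same counting and charting. Below is a minimal sketch in Python (pandas and matplotlib), not the team’s actual spreadsheet: the daily_counts.csv file, its column names and the stage names are hypothetical stand-ins for whatever your own board uses.

```python
# A rough equivalent of the team's Excel spreadsheet, for illustration only.
# The file name, column names and stage names below are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# One row per daily stand-up: the number of cards counted in each column that day.
counts = pd.read_csv("daily_counts.csv", parse_dates=["date"], index_col="date")

# Assumed stage columns in the CSV, ordered so "done" stacks at the bottom.
stages = ["done", "acceptance_test", "functional_test", "in_dev", "backlog"]

# Stacking the daily counts reproduces the classic CFD (assuming cards only move
# forward): the top of each band is the cumulative number of items that have
# reached that stage or any later one, and the thickness of a band is the queue
# (work in progress) sitting in that stage on that day.
fig, ax = plt.subplots()
ax.stackplot(counts.index, [counts[s] for s in stages], labels=stages)
ax.set_xlabel("Date")
ax.set_ylabel("Number of work items")
ax.set_title("Cumulative flow diagram")
ax.legend(loc="upper left")
plt.show()
```

Stacking the stages with the “done” band at the bottom means the thickness of each band on a given day is simply the number of items sitting in that stage, which is exactly where the queues described below appear.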
Visualise the work over time to better understand queues (‘bubbles’)
The cumulative flow diagram for this team made it visible that consistent queues of work were building up in the functional testing and acceptance testing stages over time. These queues show up as “bubbles” that develop in the cumulative flow diagram. See the stages highlighted in orange and red below (click the image for a larger version).
Do the detective work necessary to understand what causes the queues (‘bubbles’)
Around two-thirds of the way through the above chart (which covered about 36 weeks) we decided to focus on studying what was causing the queues to develop in functional and acceptance testing.
Functional testing involved someone other than the person who developed the functionality (user story) validating that it worked functionally (there were no obvious errors). Once functional testing was complete, the acceptance testing stage was performed by a business analyst or the product manager.
The team were releasing to production every second Wednesday. On the middle Wednesday the person who did the functional testing switched to doing the integration testing: ensuring the features that were packaged to go to production worked individually and combined, as well as running a set of manual regression test scripts to make sure the new functionality hadn’t had any impact on the rest of the website. During the week spent on integration testing, no functional testing was done, which we believed was the cause of the queues (the orange bubbles on the chart).
Creating a new policy to reduce the queues (‘bubbles’)
We sat down with the person who performed the Functional and Integration Testing and mapped out the schedule of their work across the fortnight between releases (see the hand-drawn diagram we came up with below).
We also mapped out a new “policy” that described what the person doing the testing would do during the week spent on integration testing:
While performing the Integration Testing in the week before the release, if there are any work items in the Functional Testing column, spend up to an hour each day doing them.
We experimented with the new policy for the last third of the cumulative flow chart. The cumulative flow diagram showed that the queue (bubble) in the Functional Testing (orange) stage virtually disappeared, as did the queue in the Acceptance Testing (red) stage. The CFD not only highlighted the initial problem, it also validated that the experimental change we made to our policy resulted in an improvement. It allowed us to answer the critical question: “did the change we made to our process result in an improvement?”
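If the daily counts are kept, the same data can put a rough number on the improvement rather than leaving it to the eye. Here is a small sketch reusing the hypothetical daily_counts.csv from earlier; the policy date is also invented, since the real change happened roughly two-thirds of the way through the 36-week chart.

```python
# A rough before/after comparison of the Functional Testing queue, reusing the
# hypothetical daily_counts.csv above. The policy start date is invented for
# illustration.
import pandas as pd

counts = pd.read_csv("daily_counts.csv", parse_dates=["date"], index_col="date")
policy_start = pd.Timestamp("2012-06-01")  # hypothetical date the new policy began

before = counts.loc[counts.index < policy_start, "functional_test"]
after = counts.loc[counts.index >= policy_start, "functional_test"]

print(f"Average queue in Functional Testing before the policy: {before.mean():.1f}")
print(f"Average queue in Functional Testing after the policy:  {after.mean():.1f}")
print(f"Worst day before: {before.max()}, worst day after: {after.max()}")
```

A clear drop in both the average and the worst-case queue size after the policy date would back up what the shrinking bubble shows visually.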
It’s the system!
This example demonstrates how changing the way the work is structured can produce improvements without having to change the work that team members were doing. The queues were caused by the way the work was structured (i.e. the system we had designed) and not by the work of the team members. It speaks to Deming’s ‘provocation’ that “95% of the variation [in how long the work takes] is due to the system and not the individuals”.
Benefits
There were many benefits to the changes that we made above:
- Removing the queue in functional testing meant that if a problem was found, the developer got faster feedback. Getting feedback faster reduced the time it took a developer to “get their head back into the issue” and fix the problem. It also improved communication between members of the team – the developers were more likely to speak to the person who did the testing at stand-up about the work that was coming, because they knew it would be tested quickly rather than potentially sitting in a queue for a week.
- Reducing the bottleneck in Functional Testing also reduced the downstream bottleneck in Acceptance Testing.
- The reduced “thrashing” from issues being discovered close to the release date meant the team’s capacity to do work increased.
- Fewer queues meant less pressure on team members. Feeling less rushed improved the team’s quality of life, and less rushing led to better quality work and better morale.
The importance of understanding variation or how to avoid treating all contractors as thieves
Here’s a story of how managers detected a problem but, by not understanding the cause of the problem or the type of variation it represented, applied the wrong type of solution and made things worse for everyone:
Once upon a time, in a large financial institution with many thousands of people in its headquarters, a handful of hourly-paid contractors got their manager to sign their timesheets for times they did not work.
This was clearly fraud; the police were called and the contractors went to jail.
The senior managers looked for a way of making sure it would Never Happen Again.
They came up with a cunning plan! Connect the time clocks in the security gates with the electronic time tracking system for all contractors (yes, even those on day rates).
A little while later, some of the contractors began to change their behaviour. They started to watch the clocks themselves and only work the 40-hour weekly minimum. When they went out for a big lunch, they stayed out longer if they’d “done their time already this week”.
One clever team of contractors even worked out the rounding rules of the gate system: if they arrived by 9:14 in the morning, it would round their arrival time back to 9:00, saving them 14 minutes a day, or about 70 minutes of work each week. Some of them even set timers to go off around the end of the day so they didn’t stay a minute longer than they were being paid for!
This story highlights the importance of understanding the cause of a problem, and the type of variation it represents, before trying to solve it.
Common cause vs special cause
In this case, the small handful of hourly-paid contractors were not representative of the thousands of other full-time employees and contractors in the building. The fraud they committed was not a signal that something was wrong with all the people in the building, but rather with a tiny minority. Rather than seeing this problem as a signal of a special event with an identifiable cause (referred to as special cause variation), the management acted as if it were a problem with all contractors in the building (something that could have happened in any team at any time – referred to as common cause variation).
In a special cause situation it’s worth asking “is there a specific root cause that explains what happened here?” because it’s likely there are a small number of identifiable causes. In the absence of good data (such as a longitudinal plot of the data), a useful rule of thumb is to ask “if we replaced this bunch of people with another bunch of people, would the problem still occur?”. In this story, there were hundreds of other hourly-paid contractors in other teams who did not fabricate their timesheets, so the answer is probably ‘no’, indicating that this was likely a special cause situation. In a common cause situation there’s no point asking “what was the cause of this?” because there are multiple sources of variation (causes) all contributing to the problem.
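One common way to build the “longitudinal plot of data” mentioned above is an individuals (XmR) process behaviour chart, which flags points outside its natural limits as likely special causes. This is a general technique rather than anything from the story itself, and the weekly figures in the sketch below are invented purely for illustration.

```python
# A minimal individuals (XmR) process behaviour chart: one common way to
# separate special cause signals from common cause noise.
# The weekly figures below are invented for illustration.
weekly_values = [41, 43, 40, 44, 42, 39, 45, 41, 43, 40, 62, 42, 41]

mean = sum(weekly_values) / len(weekly_values)

# Average moving range between consecutive points.
moving_ranges = [abs(b - a) for a, b in zip(weekly_values, weekly_values[1:])]
avg_mr = sum(moving_ranges) / len(moving_ranges)

# Standard XmR limits: mean +/- 2.66 times the average moving range.
upper = mean + 2.66 * avg_mr
lower = mean - 2.66 * avg_mr

for week, value in enumerate(weekly_values, start=1):
    flag = ""
    if value > upper or value < lower:
        flag = "  <-- outside the limits: possible special cause"
    print(f"week {week:2d}: {value}{flag}")

print(f"mean = {mean:.1f}, natural limits = ({lower:.1f}, {upper:.1f})")
```

A single point well outside the limits is worth investigating as a special cause; points bouncing around inside them are the system’s routine variation and call for the kind of systemic study described next.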
The fix for a special cause situation is to go to the root cause and see if it can be prevented. Indeed, in this story it would have been useful to question the manager involved and understand what led him to sign timesheets for times that his team did not work. The fix for common cause variation (and most variation is common cause) is to go and study the situation, experiment, and look for patterns or trends in the data before making a change to the system.
Implementing the wrong type of fix is tampering and mostly makes things worse
As this story illustrates, applying a common cause solution to a special cause problem – “tampering”, as Deming called it – can lead to bad results. Making all contractors (even those on day rates) use the electronic time-keeping system sent the message that all contractors were thieves! And as Deming says, if you muck people around they will use their ingenuity to find ways around the system instead of working towards the purpose of the system. Applying a common cause solution to a special cause problem also reduces people’s intrinsic motivation, because it can seem unreasonable and unjust.
The story above is actually the reverse of the more common scenario, where managers treat what is a systemic problem as a special cause and blame an individual. There are many examples of this, such as setting sales targets in call centres (tip: most of the sales are the result of customers who want to buy phoning in, rather than the technique of the person who receives the call).
Have you seen examples of tampering, where the wrong type of fix was applied to a problem (such as yesterday’s blog post, where a manager tried to change the team’s process to cater for the behaviour of specific individuals)? Do you have stories of fixes that were applied to the whole system when there was a clear special cause that could have been prevented at its source (e.g. sign-offs in a deployment process)? Please share your story in the comments.
Image credit: flickr
Hi, I’m Benjamin. I hope that you enjoyed the post. I’m a consultant and coach who helps IT teams and their managers create more effective business results. You can find out more about me and my services. Contact me for a conversation about your situation and how I could help.