Improving Quality with Story Points
Why story point estimates go wrong — cognitive biases and group effects — and a story point 'tax' to drive quality upward.

What are story points
Story points are unitless numbers that express the relative time and effort needed to complete a software development task. The scale is typically not linear; for example, many teams pick from the Fibonacci sequence (e.g. 1, 2, 3, 5, 8, 13, …). The numbers are only meaningful within a specific team; they cannot be used to compare teams.
Let’s imagine we’re building a web application for reading and sending emails. A team of developers would split the work into distinct high-level chunks, typically called Stories. One of those pieces could be “Ability to compose new emails”. Let’s also assume that the team’s membership has not changed and that it previously worked on a similar feature for a bulletin board application, labeled “Ability to compose new threads”. The team could estimate that the new story will take the same amount of time and effort again (e.g. “5”). Alternatively, it could estimate a bit less (e.g. “3”) to account for code reuse and previous experience, or a bit more (e.g. “8”) because the estimate for the similar feature turned out to be too low.
Everyone on the team votes on a number for each work item without seeing the others’ votes, and then has the opportunity to convince their peers that their number is the right choice. After working through several of these tasks, the team arrives at its velocity (e.g. “30”): the number of story points it can complete in a fixed period of time (e.g. every two work-weeks). Ideally, this process eventually gives the team a way to estimate how much work it can take on in that period.
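As a rough illustration of the velocity arithmetic, the sketch below averages the points completed over the most recent periods and checks whether a candidate set of stories fits. The sprint history, the three-sprint window, and the function names are all assumptions made up for this example; teams compute velocity in different ways.

```python
from statistics import mean

# Hypothetical history: story points completed in each past two-week sprint.
completed_per_sprint = [28, 31, 30, 33, 28]


def velocity(history, window=3):
    """Average points completed over the most recent `window` sprints."""
    return mean(history[-window:])


def fits_in_sprint(estimates, history, window=3):
    """True if the summed estimates do not exceed the recent velocity."""
    return sum(estimates) <= velocity(history, window)


v = velocity(completed_per_sprint)  # mean of 30, 33, 28: about 30.3
small_plan = fits_in_sprint([5, 8, 8, 5, 3], completed_per_sprint)  # 29 points
large_plan = fits_in_sprint([13, 8, 8, 5], completed_per_sprint)    # 34 points
```

Here the 29-point plan fits under the recent velocity and the 34-point plan does not.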
What’s wrong
It turns out that software engineers are notoriously bad at estimating. Agile methodologies introduce story points to mitigate that, but in my experience there’s still room for improvement.
We can look for some answers in cognitive psychology, which studies how people think, including how they make decisions and solve problems. Specifically, there’s a list of cognitive biases to which we are all generally vulnerable.
One of these biases is Rosy Retrospection: remembering the past as better than it actually was. For example, psychologists interviewed three groups of people before, during, and after different vacations. Most followed a pattern of initial anticipation followed by mild disappointment. Yet some time later, most subjects rated the events more favorably than they had while experiencing them.
Many other biases also affect how we perceive and think about the past. Consistency Bias is incorrectly remembering one’s past attitudes and behavior as resembling present attitudes and behavior. Hindsight Bias is the inclination to see events that have already occurred as more predictable than they were before they took place. There’s also Egocentric Bias, which is recalling the past in a self-serving manner, for example remembering one’s code as higher quality than it actually was.
There are also biases linked to working in groups. One example is the Next-in-Line Effect: a person in a group has diminished recall of what others said immediately before or after that person’s own turn to speak. There’s also the Bandwagon Effect, which is the tendency to do (or believe) things because many other people do (or believe) the same.
A popular paper by cognitive psychologist George A. Miller states that the number of objects an average human can hold in working memory is 7 ± 2. A similar idea underlies David Allen’s Getting Things Done time-management method: you can’t keep many things in your mind at the same time, so you should write them down somewhere for future processing.
My takeaway from all of this is that a team will likely remember past events incorrectly, assuming these events were more obvious and predictable than they were. Trying to mitigate this with group discussions will likely introduce another set of biases, such as remembering a subset of what was said and agreeing with the majority. Meanwhile, everyone’s mind is only able to keep track of a small subset of considerations — even assuming the best-case scenario where the whole team is focusing all their mental bandwidth on a single activity.
The goal
The best-case scenario so far is high consistency: fairly good estimates for work of about the same quality. My goal is to add a way of pushing quality higher while preserving reasonably consistent estimates over time.
Possible solution: story point tax
Let’s list all the items that could impact the amount of time and effort it takes to finish a particular piece of work. Some examples:
- Stability of the build or continuous integration system.
- Existing unit test coverage.
- Expense of writing tests.
- Support-related issues.
- Time to merge into the release branch.
- Known vs unknown problem domain.
- Third-party dependencies.
- Legal work.
- Documentation work.
- Context switching.
- etc.
The idea is to assign some points, or fractions of a point, to each of these issues. Some are task-specific (e.g. legal work) while others can apply to every task (e.g. build health).
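One way to picture this is a base estimate plus the taxes that apply to it. The sketch below is only an illustration under assumed numbers: the tax values, the `Story` class, and the `taxed_estimate` helper are hypothetical, and a real team would agree on its own list and amounts.

```python
from dataclasses import dataclass, field

# Hypothetical taxes (in story points) that apply to every task.
GLOBAL_TAXES = {"unstable build": 0.5, "low test coverage": 1.0}


@dataclass
class Story:
    title: str
    base_points: int
    # Hypothetical taxes that apply only to this story, e.g. legal work.
    task_taxes: dict = field(default_factory=dict)


def taxed_estimate(story, global_taxes):
    """Base estimate plus every global and task-specific tax."""
    return (
        story.base_points
        + sum(global_taxes.values())
        + sum(story.task_taxes.values())
    )


story = Story("Ability to compose new emails", 5, {"legal work": 0.5})
total = taxed_estimate(story, GLOBAL_TAXES)  # 5 + (0.5 + 1.0) + 0.5 = 7.0
```

Keeping the base estimate and the taxes separate makes it visible how much of a story's cost comes from the work itself and how much from its surroundings.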
For example, imagine a team is inheriting legacy code with low test coverage and lengthy or hard-to-run tests. Everyone on that development team will likely see a negative impact on their time and effort. It means the system could break more easily or require a lot of manual, time-consuming testing to finish their work item correctly, thus lowering productivity. At some point, when test coverage improves, the “tax” or expense to productivity is lowered for that particular item.
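To make the effect of lowering a tax concrete, the sketch below compares the points a team spends per period on taxes that apply to every story, before and after test coverage improves. The six stories per sprint and the tax amounts are assumptions chosen only for illustration.

```python
def sprint_tax_burden(story_count, global_taxes):
    """Total points per sprint spent on taxes that apply to every story."""
    return story_count * sum(global_taxes.values())


# Hypothetical tax values before and after test coverage improves.
before = {"low test coverage": 1.0, "unstable build": 0.5}
after = {**before, "low test coverage": 0.5}

stories_per_sprint = 6  # assumed number of stories taken on per sprint
freed = (
    sprint_tax_burden(stories_per_sprint, before)
    - sprint_tax_burden(stories_per_sprint, after)
)  # 9.0 - 6.0 = 3.0 points per sprint
```

Under these assumed numbers, halving the coverage tax frees three points per sprint, roughly one small story.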
Having this list beforehand means the team no longer has to rely on its mental bandwidth to account for all of these issues. As the team strives to lower this “tax”, the overall quality of the product should increase from every perspective: planners, developers, and stakeholders. Having more data would also likely improve estimate accuracy.
The past-related biases are somewhat mitigated because the team keeps the list up to date as it goes, rather than reconstructing it from memory. Without such a list, an infrequent yet painful task tends to be partially forgotten and later remembered as easier than it was. One specific example is the expense of tackling a new problem domain at the start of a release: “It wasn’t that hard to learn all that new terminology”.
Having all team members individually and anonymously collaborate on the items that should be on the list and the “tax” amounts will likely help deal with the group-based biases. Because the initial list is not generated in an open discussion, it’s less likely most people will flock towards what the most vocal members of the team think should be there.
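A minimal sketch of how anonymous submissions might be combined is shown below. Using the median per item, and reporting how many members proposed each item, are assumptions made for this example and not a prescribed method; the submissions themselves are invented.

```python
from collections import defaultdict
from statistics import median

# Hypothetical anonymous submissions: one dict per team member, no names.
submissions = [
    {"unstable build": 0.5, "low test coverage": 1.0},
    {"low test coverage": 2.0, "context switching": 0.5},
    {"unstable build": 1.0, "low test coverage": 1.0},
]


def aggregate(submissions):
    """Map each proposed item to (median tax, number of proposers)."""
    proposed = defaultdict(list)
    for submission in submissions:
        for item, tax in submission.items():
            proposed[item].append(tax)
    return {
        item: (median(values), len(values))
        for item, values in proposed.items()
    }


summary = aggregate(submissions)
# summary["low test coverage"] is (1.0, 3): the single 2.0 vote does not
# pull the value up the way an average would.
```

The median limits the influence of one outlying vote, and the proposer count shows which items only one person raised, which can be a useful starting point for the later group discussion.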
Conclusion
Try to supplement story points with a list of things (taxes) that negatively impact performance. Have each team member contribute to the initial list individually and anonymously, then refine it in group discussions and keep it up to date. Use it to help drive quality. For example: “More stories could get done if we had better test coverage, so let’s improve test coverage. Did test coverage improve? Yes, so lower its tax value.”
