Net Objectives

Thursday, December 8, 2011

Lies, Damned Lies, and Code Coverage

As unit testing has gained a strong foothold in many development organizations, many teams are now laboring under a code coverage requirement.  75% - 80% of the code, typically, must be covered by unit tests.  Most popular Integrated Development Environments (IDEs) include tools for measuring this percentage, often as part of their testing framework.

Let’s ask a couple of questions, however:
  1. "What does code coverage actually measure?"
  2. "What does mandating a code coverage percentage get you?"
These two will yield another:
  3. "Is code coverage actually useful for anything?"


What does code coverage actually measure?

Test-related code coverage measures the percentage of code[1] that is executed when the suite of unit tests runs.  By demanding a high percentage of coverage, management is attempting to ensure quality; the premise being that if the code is invoked during the suite's execution it is therefore guaranteed to be correct.

But, consider this:

// pseudocode
class Foo {
    public ret someAlgorithm(par parameter){
        // some complex algorithm that should be tested
    }
    public ret someOtherAlgorithm(par parameter) {
        // some other complex algorithm that should be tested
    }
}

class FooTest {
    public void testOfFooBehavior() {
        Foo testFoo = new Foo();
        testFoo.someAlgorithm(Any.par());
        testFoo.someOtherAlgorithm(Any.par());
        assertTrue(true);
    }
}

Anyone want to run the code coverage on this?  It is going to clock at 100%, assuming the algorithms comprise single code execution paths.  If the paths branch, you might need a few more calls with different parameters (how many depends on the type of coverage you're aiming for), but the test will still always pass.  It’s a test of nothing (true always being, you know, true).
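To make the point concrete, here is a minimal runnable sketch of the same idea. The Discount class, its deliberate bug, and both test methods are hypothetical names invented for this illustration; they are not from any real code base. Both tests execute every line of the production method, so a line-coverage tool would report the same 100% for either, but only one of them checks anything.

```java
// Hypothetical example: all names here are assumptions made for illustration.
class DiscountTest {
    // Executes every line of Discount.apply(), yet verifies nothing.
    static boolean coverageOnlyTest() {
        Discount.apply(100.0);
        return true; // the equivalent of assertTrue(true)
    }

    // Executes the same line, but checks the result against the intended behavior.
    static boolean behaviorTest() {
        return Math.abs(Discount.apply(100.0) - 90.0) < 0.001;
    }

    public static void main(String[] args) {
        System.out.println("coverage-only test passes: " + coverageOnlyTest());
        System.out.println("behavior test passes: " + behaviorTest());
    }
}

class Discount {
    // Intended behavior: take 10% off the price.
    // Deliberate bug for the demonstration: it adds 10% instead.
    static double apply(double price) {
        return price * 1.10; // should be price * 0.90
    }
}
```

The coverage-only test passes against the buggy code; the behavior test does not. The coverage number cannot tell the two apart.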

Code coverage does not measure how much code is tested; it measures how many lines of code are executed. Now, I can hear you saying “yeah, but that’s a completely contrived example! Why would anyone do that?”

Even if the developers would not dare to do something so brazen, they might be tempted to write the simplest tests they could, perhaps using a tool that automatically generates a test-per-method to save time. These tests would simply reflect the current code's behavior, not the correct behavior of the system; what the system does, not what it needs to do.

Why would they do that?

Why indeed.  That gets us to question number two.

What does mandating a code coverage percentage get you?

There is an old adage in project management: “You get what you measure”[2].  Woe betide the organization that decides to pay its developers based on the number of bugs they fix per quarter. There will be a lot of bugs to fix in that code!  Or, more realistically, many teams have been compensated for the number of lines of code they generate.  Not surprisingly they have been writing lots and lots of unnecessary code. This is just human nature.

If developers are writing unit tests because “the boss says so” then they have no real professional or personal motivation driving the activity.  They’re doing it because they have to, not because they want to.  Thus, they will put in whatever effort they have to in order to increase their code coverage to the required level and not one bit more.  It becomes a “tedious thing I have to do before I can check in my code, period.”

At a recent Net Objectives conference a member of the audience[3] came up after a talk we gave on TDD and shared a piece of code he had found in a code base he had inherited.  It was a class with a single method that did something legitimate (but was, apparently, difficult to write a unit test for  -- maybe it had a void return).  But the developer had added a second method.  This second method created an integer i, incremented the integer 700 times (not in a loop, but literally 700 “i++;” increments), and returned the result.  His unit test then called this second method and asserted the return was 700.  Because this bogus method was so lengthy he got his 75% code coverage without calling the legitimate method at all.  How had he arrived at 700?  He probably started with a smaller number and kept copying-and-pasting the “i++;” until the coverage hit 75%.
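The trick works because a coverage percentage is only a ratio: executed lines divided by total lines. The sketch below shows that arithmetic under a simplifying assumption (plain line coverage, counting only the untested lines and the filler lines). The class and method names are hypothetical, and the numbers are illustrative; they are not taken from the code base in the story.

```java
// Illustrative arithmetic only; names and the simple line-coverage model are assumptions.
class CoveragePadding {
    // Smallest number of executed filler lines such that
    // filler / (filler + untestedLines) reaches targetPercent (1..99).
    static long fillerLinesNeeded(long untestedLines, int targetPercent) {
        long numerator = (long) targetPercent * untestedLines;
        long denominator = 100 - targetPercent;
        return (numerator + denominator - 1) / denominator; // ceiling division
    }

    public static void main(String[] args) {
        // For 10 untested lines and a 75% target, the answer is 30: 30 / (30 + 10) = 75%.
        System.out.println(fillerLinesNeeded(10, 75));
    }
}
```

At a 75% target, every untested line can be "paid for" with three executed lines that do nothing useful, which is why the metric alone cannot be trusted.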

Here again, this is a rather extreme case.  What’s not so extreme is leaving code in place that is actually never used (“dead code”) simply because it has tests: removing the code would mean removing its tests, and that would lower the code coverage.  Should we keep dead code just so we can keep the tests?

You get what you measure.

The only way to get developers to write the tests we really want them to write (and the only way to reliably get anyone to do anything, frankly) is to point out to them why they should care, what benefit will accrue to them if they spend time, energy, and passion to create them.  Most of the other topics we will write about in this work will, in one way or another, provide this motivation[4]. 

But... then is code coverage actually useful for anything? 

Yes, and here we will see an example of something that occurs repeatedly in TDD: using a tool for something other than it was intended for.  Something better.

Often in TDD, especially in the initial stage of developing some particular behavior of the system, we find ourselves less than certain about how to proceed, what exactly a requirement means, or just what the system’s code should do.  When we’re “in the weeds” we might choose to investigate the issue by writing a lot of small tests to work out the edges, boundaries, and permutations of a behavior in order to improve our understanding of it.  These “triangulation tests” [5] can be very useful, but they are often largely redundant.  Once we get the understanding we need, can write the proper test, and create the proper behavior in the system to get it to pass, we then will want to remove some, most, or all of the triangulation tests.

But... is it some, most, or all?  Here is where the code coverage measurement will help. Before removing a test that you believe to be redundant, run the coverage percentage and note it.  It should, in TDD, be 100% or very close to it.  Now remove the theoretically redundant test.  Finally, re-run the coverage percentage.  If it has slipped, even a little, then one of two things must be true: either the test you removed was not entirely redundant, or you have dead code somewhere in the system.  Either way now is the time to figure it out and fix it.
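The before-and-after comparison can be modeled in a few lines. This is only a sketch of the reasoning, not how any particular coverage tool works: it assumes we already know which production lines each test executes, and the names RedundancyCheck, coveredLines, and isRedundant are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: each test name maps to the set of production line numbers it executes.
class RedundancyCheck {
    // Number of distinct production lines executed by the whole suite.
    static int coveredLines(Map<String, Set<Integer>> linesHitByTest) {
        Set<Integer> hit = new HashSet<>();
        for (Set<Integer> lines : linesHitByTest.values()) {
            hit.addAll(lines);
        }
        return hit.size();
    }

    // True if removing the named test leaves the covered-line count unchanged.
    static boolean isRedundant(Map<String, Set<Integer>> linesHitByTest, String testName) {
        int before = coveredLines(linesHitByTest);
        Map<String, Set<Integer>> without = new HashMap<>(linesHitByTest);
        without.remove(testName);
        return coveredLines(without) == before;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> suite = new HashMap<>();
        suite.put("properTest", Set.of(1, 2, 3, 4));
        suite.put("triangulationA", Set.of(1, 2)); // fully overlapped by properTest
        suite.put("triangulationB", Set.of(4, 5)); // the only test reaching line 5
        System.out.println("A redundant: " + isRedundant(suite, "triangulationA"));
        System.out.println("B redundant: " + isRedundant(suite, "triangulationB"));
    }
}
```

In this model triangulationA can go, while removing triangulationB makes coverage slip. That slip is the signal to stop and ask whether line 5 is real behavior that needs a proper test or dead code that should be deleted.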

Here’s another use:  In TDD we usually find that the test suite, once we’re done developing the system, will serve other purposes.  One such purpose is this: if you come back to the system six months later, the suite of tests might be the best thing to read in order to get re-familiarized with the system.  If they all compile, and pass, then they are accurate to the system [6].  However, can we be certain that no one has added to the system without adding to the test suite?  Sure.  Run the code coverage.  If it’s not 100% then someone has enhanced the system without doing TDD, and you know it.

Developers who run code coverage for these purposes love their coverage tool.  And, as we’ll see, the kind of tests we’ll be learning about in this work will be the tests that developers love, care for, and always keep current to the system.  Because they help us to succeed.

----

[1] We know there are different types and levels of code coverage; this post is relevant for all of them. See http://en.wikipedia.org/wiki/Code_coverage for more on the subject.
[2] This is often attributed to Lord Kelvin, but he actually said “If you cannot measure it, your knowledge is meager and unsatisfactory.”  Tom Peters’s paraphrase is more to our point: “What gets measured gets done.”  Or we can go to Albert Einstein, who wrote on his wall: “Not everything that counts can be counted, and not everything that can be counted counts.”
[3] Paddy Healey is the gentleman.  
[4] This should not be read as a slam on developers, btw.  We’re often given bureaucratic tasks to complete in life, and it’s understandable that we have little energy for them.  We simply need to make sure our tests are not in that category!
[5] Much more on this in another blog.
[6] Much much more on this in another blog!

Monday, November 28, 2011

Redefining Test-Driven Development, Pt. 2



In part 1 we said “How you do something new is often influenced to a great extent by what you think you are doing.”  Let’s add that, similarly, changing the way you do something you are already doing can come from a new understanding of its nature.

Something development teams already do (or, in our opinion, really should be doing) is to write a specification of the system before they create it.  This specification comes from an analysis of requirements, and reflects the development team’s understanding of the business value of the system from the customer’s perspective and the technology used to create the solution.  “The spec” is then referred to throughout the development process as fundamental guidance for everything the team does.

Specifications have great value; this value, however, is not persistent.  

Let’s say you created a specification in a traditional way: you wrote a document, embedded some design diagrams, charts, and graphs, and so forth.  This would form an artifact that expressed your understanding of the system.

Let’s further say that you used this specification to work from, completed the development process, released the system, and moved on.  

Now, eighteen months later, the customer wants to make changes to the system.  You’ve been away from the system for quite a while, and you’re fuzzy on the details, so job one is to re-acquaint yourself with it.  Should you re-read that specification you created way back when?  You could, but how do you know it is still accurate?  Someone could easily have made changes to the system and not updated the spec accordingly.

We all know we should not do that, but as a practical matter it happens all the time.  People make changes with limited time and resources, and under pressure... and often they simply neglect the spec entirely, or they update it incompletely or incorrectly.

And even if you don’t have any reason to suspect this has happened, how can you know, really know for sure, that it has not?  The only way is to examine the system in detail and compare it to the spec.  If you have to do this, then what good did having the written spec really do you?

So, consider this, a typical unit test:

// pseudocode
public class AccountTest {
    public void testAccountAmortizesCorrectly() {
        double value = Any.value();
        int term = Any.term();
        int yearToWriteOff = Any.yearUpTo(term);

        Account testAccount = new Account(value, term);
        double expectedAmount = max(value/term, 100.00);

        double actualAmount = testAccount.amortize(yearToWriteOff);

        assertAreEqual(expectedAmount, actualAmount, 1);
    }
}

Look closely.  What does this tell you?

  1. There is an object called Account that can amortize itself
  2. Account takes a value and a term via its constructor
  3. Value is double, term is int, and neither is constrained (“Any”) [1]
  4. Amortize means “write off”
  5. All years amortize in the same way (“Any” again)
  6. You call an amortize() method and pass the year to write off (an int) to it
  7. The way you know how much to write off is value/term, but no less than 100.00 (the max() in the expected amount)
  8. We do not care about pennies (the tolerance for the assertion is 1)
Would you not say that this could serve, at least for the development team, as a specification?  It tells you how the system should work, how it is structured, the API specifics (both constructor and public methods), etc... everything that a traditional spec would record.
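For readers who want to run something, here is one minimal Java sketch that would satisfy the specification above. It is an assumption-laden illustration, not the authors' implementation: the Account body is invented to match what the test states, the check is written as a plain boolean method instead of using a real test framework, and fixed sample values stand in for the Any helper, which the post has not yet explained.

```java
// Hypothetical implementation that satisfies the specification shown above.
class Account {
    private final double value;
    private final int term;

    Account(double value, int term) {
        this.value = value;
        this.term = term;
    }

    // Every year is written off the same way: value/term, with a floor of 100.00.
    double amortize(int yearToWriteOff) {
        return Math.max(value / term, 100.00);
    }
}

class AccountSpec {
    // Mirrors testAccountAmortizesCorrectly; the tolerance of 1 means pennies are ignored.
    static boolean accountAmortizesCorrectly(double value, int term, int yearToWriteOff) {
        Account testAccount = new Account(value, term);
        double expectedAmount = Math.max(value / term, 100.00);
        double actualAmount = testAccount.amortize(yearToWriteOff);
        return Math.abs(expectedAmount - actualAmount) <= 1;
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration in place of Any.value() and friends.
        System.out.println(accountAmortizesCorrectly(12000.00, 10, 3));
        System.out.println(accountAmortizesCorrectly(500.00, 10, 1));
    }
}
```

Notice that nothing in the sketch needed information from outside the test: the constructor signature, the method name, the parameter types, and the rule itself were all read directly off the specification.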

Compare now, in the scenario where you’re coming back eighteen months later, this kind of specification to the document you would normally create.  You can run this “unit test” immediately, watch it compile (the APIs have not changed if it does), watch it pass (the behavior of the system has not changed if it does), and thus confirm that it is still accurate with no effort at all.  If we then further stipulate that every behavior of the system has a test like this, and we can run them all with a single click of the mouse, then we know our test suite is accurate to the code.  Now run your code coverage measurement... is it 100%?  Now you know that there is no additional behavior that has been added by someone else without that person adding such a unit test.

So, in TDD we do not write tests.  We write specifications.  Executable specifications.

Note that the testing framework itself (with just about every tool you’ll encounter) uses the term “assert.”  Look that one up:

Assert(v) to state with assurance, confidence, or force; state strongly or positively; affirm; aver: He asserted his innocence of the crime. [2]

Note this is not “check” or “examine to determine if” or “confirm”.  When we assert something we do not say “this should be true”; we say “this is true”.  It’s a statement of truth, not an investigation.  It is not a test, but a fact about the system.

This simple shift in thinking from “I am writing a test” to “I am writing a specification” changes so many things about how you’ll write them, what you write and won’t write, what qualities you will look for and emphasize, how you’ll name things... and on and on... that we won’t even try to enumerate them here.  We’ll write an entire posting just about this (Testing as Specification).

So, why do we still call them tests?  Two reasons.

  1. First, “Test-Driven Development” is the term we are stuck with.  Language is a living thing, a shared thing, and we cannot dictate on our own what things are called.  We’d love to call it what it is: “Behaviour-Based Analysis and Design”, and we think of it that way, but at the end of the day...
  2. Second, we’re not going to throw these executable specifications away when we’re done driving our development with them.  Why would we?  It took effort to make them, and we want to be able to refer to them later.  But you know what else they magically turn into at this point?  Tests!  We can use them to test against system regression when we need to refactor it.  These are regression tests we got for no extra effort, by the way.[3]

    So, does TDD add new work to the development team?  No.  We were going to write a specification anyway, we’re just doing it in a different way now.  A better way, because it will be written in cold, hard code (rather than vaguely in human language), and it will be automatically verifiable against the real system at any point we desire, with no effort on our part.

    And additionally, for free, it will produce a regression suite as well.  Most teams struggle mightily and do all sorts of shenanigans (see our upcoming blog Lies, Damned Lies, and Code Coverage) to achieve 75% to 80% code coverage.  We will have 100% [4] and we don’t have to do anything additional to get it.

    All this leaves is the third objection from part 1...  what about the maintenance burden we take on when we have to keep the test suite up to date?  What about new requirements that cause dozens or even hundreds of tests to break, and have to be repaired?

    Yes indeed, what about that?  Must have something to do with the word... Sustainable.

    Stay tuned.




    ----

    [1] We’ll talk about Any in a future blog
    [2] http://dictionary.reference.com/browse/assert
    [3] Not that we are saying our test suite will replace all traditional testing.  It will not.  But as a regression test suite it has a lot of value for developers and testers alike.
    [4] ...or very close to it.  Nothing is ever perfect, after all.

    Friday, November 18, 2011

    Redefining Test-Driven Development, Pt. 1


    How you do something new is often influenced to a great extent by what you think you are doing -- its precise nature, the steps and work-flows, and how it relates to other things that you already do and understand.  The term “Test-Driven Development”, while well-established in our industry, is perhaps an unfortunate choice of words to describe what we are doing, and thus how we choose to do it.  Here in part 1 we’ll examine the problem, and then later in part 2 we’ll suggest a solution.

    Let’s start with the word “test”.  This is a word we already have a definition for; typically we think of a test as an evaluation of something, or a judgement of something relative to a standard, or perhaps an action that determines the correctness or incorrectness of something.  Test is a verb: “I shall test this.”  It is also a noun: “Let’s conduct a test to find out if this works.”

    In any case, the presumption is that there is something that is either correct, or operates correctly, or does not.  Clearly this is a nonsensical idea if the thing to be tested does not actually exist yet. 

    In a typical TDD process, we write the test before we create the code we’re testing [1].  At the “testing point”, there is nothing to test.  Will the test fail?  Of course it will [2].  Something that does not exist can neither be right nor can it do the right thing.  So it would seem that we’re not really doing anything meaningful [3]. 

    Some of you are probably thinking: “The test won’t fail.  It won’t even compile!”  Very true, but this is only because our technology (typically) works the way it does.  In another technology (a dynamically typed language such as Python or JavaScript, for example) the reference is not checked until the code actually runs; at that point it might raise an error or, in JavaScript, a missing property might quietly evaluate to undefined.  This is one reason why we like statically typed languages and strict compilers.  However, note what the compiler is actually saying: “This makes no sense!  You’re trying to refer to something that does not exist!”

    All of this would seem to indicate that we have to do it the other way ‘round: that we’ve got to create the thing to be tested before we can create the test.  It’s just common sense.

    Then there is the notion of “driven”.  The notion of “test” is in conflict with the notion of “driven”.  If one activity drives another, then one would normally expect the driving activity to precede the driven activity, temporally.  If thing X happens which then causes thing Y, and if this causality can be proven, then we can say X drove Y.  But if the test must be created after the tested thing, then how can the test drive the tested?
     
    Finally we have “development”.  Development is the creation of something, usually from a plan or goal or set of principles.  If tests are to drive development, then they must cause it.  Thus they must constitute the plan or goal or set of principles.  But tests in the traditional software sense are not plans; they are an examination of the system to determine if it meets its success criteria.

    This confusion can cause lots of problems:
    1. People won’t get the point, and will reject the idea intellectually: “that makes no sense” 
    2. People will see this as “new work” for the team to do, and will thus slow the team down: “that will be wasteful”
    3. People will see the product (a collection of tests) as a new maintenance burden for the team: “that cannot be sustained over time”
    In other words, TDD tests would seem to constitute at best a tremendous added cost, and at worst a totally meaningless one.  This is categorically untrue, and we begin by re-defining what we’re doing.

    In TDD, as it turns out, we don’t write tests first.  In fact... in TDD we don’t write tests at all. 

    Stay tuned for part 2... :) 


    --- 

    [1] As we will see in future blogs, the test-first technique does not actually equate to TDD, but it is a very common approach, and very compatible with TDD. 

    [2] ...and what if it doesn’t?  What would that mean?  That’s the subject of another blog... 

    [3] I can tell you a priori that any test written before the thing it tests exists will fail, without even knowing what the test is about.  Therefore actually writing the test and watching it fail is not going to tell me something I didn’t already know.  So why do it?

    Wednesday, November 9, 2011

    Test Reflexology, Part 1 (second post)


    ...continued from previous post...


    Overly Protective Test
    Sometimes when examining a test we find it to be much larger in size than the production class. Oftentimes we can just split the test into multiple tests – but not in this case – remember our initial assumption is that the tests are as good as they can get. What could be the cause then? It could be because the test is overly protective.

    In a protective test we end up testing not only the specified behavior, but also that another behavior implemented by the tested unit does not interfere with the original behavior. For example, if unit X deals with a computation and with data caching, we will need to ensure that the results of the computation are independent of, for example, inserting a new item into the cache.

    The need for such a test is often a result of perfect hindsight – a bug.   For instance, we discover later that certain computations accidentally alter the items in the cache.

    Before fixing the bug, a developer practicing TDD will update the test to ensure that the secondary behavior never again interferes with the primary behavior. This is a good idea, but the need to create this overly protective test indicates a design issue – yes, you’ve guessed it; we have another problem with cohesion. The lack of cohesion leads to missing encapsulation[5] which has allowed the secondary behavior to couple unexpectedly to the primary behavior and affect it. The solution is to extract the behaviors into individual, encapsulated entities and prevent the coupling from occurring. Encapsulated entities cannot encroach on each other’s state.
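One way the extraction might look is sketched below. All three class names are hypothetical, and squaring stands in for whatever the real computation is; the point is only the shape of the design. The computation holds no state, the cache keeps its map private, and a thin coordinator composes the two, so neither behavior can reach into the other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Computation only: stateless, so no other behavior can disturb it.
class Calculator {
    int square(int x) {
        return x * x;
    }
}

// Caching only: the map is private and is never exposed to the computation.
class ResultCache {
    private final Map<Integer, Integer> entries = new HashMap<>();

    Optional<Integer> lookup(int key) {
        return Optional.ofNullable(entries.get(key));
    }

    void store(int key, int result) {
        entries.put(key, result);
    }

    int size() {
        return entries.size();
    }
}

// Thin coordinator that composes the two encapsulated behaviors.
class CachingCalculator {
    private final Calculator calculator = new Calculator();
    private final ResultCache cache = new ResultCache();

    int square(int x) {
        Optional<Integer> cached = cache.lookup(x);
        if (cached.isPresent()) {
            return cached.get();
        }
        int result = calculator.square(x);
        cache.store(x, result);
        return result;
    }

    int cachedEntries() {
        return cache.size();
    }
}
```

With this split, Calculator can be specified without mentioning the cache at all, and ResultCache without mentioning the computation, so the protective test has nothing left to protect against.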

    Once we see an overly protective test we are sure to see it many times, whenever the primary behavior is tested in conjunction with a different secondary behavior. This is done to ensure that none of the other activities of the unit interferes with the primary behavior we really want to test. The number of discrete test scenarios will increase geometrically because all behaviors need to be tested in conjunction with all of the other behaviors. Even if it were possible to create all these scenarios and test them in a reasonable period of time, there is an obvious redundancy in the tests which is highly undesirable[6].

    Combinatorial Scenarios
    We often see tests that do not test all the possible scenarios but rather a selected subset of the possible scenarios. This is because there are just too many of these scenarios to reasonably go through all of them. For example, let’s assume that during the week an employee is allowed to be late at most once. If we treat the week as five distinct days: Mon, Tue, Wed, Thu, and Fri, we will need to test the 5 scenarios where the employee is late once to prove that there is no action taken; we also need to test the 10 scenarios where the employee is late twice (one for each pair of days) to make sure that an action is taken. This test reeks of repetition and would either be very long or would require some parametrization or iteration built into it to reduce the number of redundant scenarios. Alternatively, we may choose to test only a subset of the scenarios, which leads to incomplete testing.

    The test is shedding a light on a problem, namely that we are not choosing the correct abstractions in our design. In the example above, we should have considered the week to be a collection of days and verified that if in that collection we have only one day of tardiness – no action is taken, and if two – an action is taken. In this case we only need 2 tests to guarantee that the behavior is correct. This makes the complexity of the test the same as the complexity of the code and proves that the correct abstraction in this case is of a collection of days rather than individual days.
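Here is a minimal sketch of what the collection-of-days abstraction might look like. The TardinessPolicy class and its actionRequired method are hypothetical names chosen for this illustration; the rule itself (late at most once per week) comes from the example above.

```java
import java.util.Collection;
import java.util.List;

// The week is modeled as a collection of days; which day is which does not matter.
class TardinessPolicy {
    // Each entry is true if the employee was late on that day.
    // Action is required once the employee has been late more than once.
    static boolean actionRequired(Collection<Boolean> lateByDay) {
        long lateDays = lateByDay.stream().filter(late -> late).count();
        return lateDays > 1;
    }

    public static void main(String[] args) {
        // The two specifying scenarios: one late day, then two late days.
        System.out.println(actionRequired(List.of(false, true, false, false, false)));
        System.out.println(actionRequired(List.of(true, false, false, true, false)));
    }
}
```

Because the code never asks which day was the late one, the two scenarios in main describe the behavior completely; there is no Monday-specific or Friday-specific path left for another scenario to exercise.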

    Stay Tuned for Test Reflexology, Part 2
    Part one has focused on how a given unit test can provide you with insights about the quality of your design.  Part two will extend this notion into the entire suite of tests; how the nature of the suite can also let you know when your design may be wanting.  Coming soon!


    -----

    [5] Inside the scope of a class, we really cannot encapsulate much.  “Private” means nothing.  Temporary method variables are really the only encapsulation available “inside the curly braces”.

    [6] This will be the subject of a future blog, we promise.