Tests & Dependencies
You’re probably mocking too much.
One of the most challenging parts of testing is dealing with granularity. On the integrated end of the spectrum, we can write an end-to-end test that drives a web application via Selenium, for instance. On the isolated end of the spectrum, we can test a single pure function like so:
expect(standardDeviation([1, 2, 4])).toBeCloseTo(1.24722, 4)
To the extent we want to increase granularity, we can approach the problem from the top or the bottom of the call stack. For instance, if a() calls b(), we can increase granularity by calling a() and mocking b(). Or we can isolate from the bottom up, by directly calling b(). We can even do both. Suppose b() calls c(); then we can directly call b() but mock c(), thus isolating b() from a() at the top and from c() at the bottom.
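Here is a minimal Jest sketch of those options. The functions and their wiring are hypothetical (real code would more likely use jest.mock at a module boundary rather than default parameters), but it shows isolation from the top and from the bottom:

// a() calls b(), which calls c(). All three are hypothetical.
function c(n) { return n + 1; }
function b(n, cFn = c) { return cFn(n) * 2; }
function a(n, bFn = b) { return bFn(n) - 3; }

// Top-down isolation: call a() but mock b().
test('a() with b() mocked', () => {
  const fakeB = jest.fn().mockReturnValue(10);
  expect(a(5, fakeB)).toEqual(7);          // only a()'s own logic runs
  expect(fakeB).toHaveBeenCalledWith(5);
});

// Bottom-up isolation: call b() directly and mock c(),
// isolating b() from a() at the top and from c() at the bottom.
test('b() with c() mocked', () => {
  const fakeC = jest.fn().mockReturnValue(4);
  expect(b(3, fakeC)).toEqual(8);
});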
Thesis
My thesis is that developers routinely make suboptimal decisions about the level of isolation, both from the top and from the bottom.
Goals
How do we define what is optimal? The best set of succinct test goals I’ve found is “Four Goals of a Good Test Suite” by the inimitable Matthew Kane Parker, head of engineering at Pivotal Labs. They are as follows (summaries mine):
- Fast: A slow test suite reduces our productivity.
- Clean: A complex test suite is hard to maintain, and makes it hard to identify the culprit when tests fail.
- Confidence: We should trust that if our tests are green, the code does what it’s supposed to, and we can deploy.
- Freedom: Tests should make it easier to refactor, not harder. This generally means they should focus on behavior over implementation.
Let’s see how these goals are served by various testing approaches. Consider our standardDeviation() function from the test code above. Suppose it looks like this:
function standardDeviation(values) {
  var avg = MathExtras.average(values);
  var squareDiffs = values.map(function(value) {
    var diff = value - avg;
    var sqrDiff = diff * diff;
    return sqrDiff;
  });
  var avgSquareDiff = MathExtras.average(squareDiffs);
  var stdDev = Math.sqrt(avgSquareDiff);
  return stdDev;
}
This code depends on both MathExtras.average() and Math.sqrt(). Yet it would be hard to imagine a reasonable person suggesting we mock out (technically, stub out—although I’ll use “mocking” and “stubbing” fairly interchangeably throughout) those two methods. What would be the benefit?
- The performance benefit would be negligible, or likely even negative. These aren’t exactly password-cracking functions.
- Because those methods are queries, not commands, we don’t have to add any duplicative code to our test to verify their side effects, so there are no DRY-ness savings to mocking either.
- Because our test calls all the way through, it gives us high confidence in our result.
- Because our test is completely agnostic to implementation details, it leaves us free to refactor without having to change our test. We could substitute FastMathLibrary.sqrt() and our tests would still happily pass. (Contrast this with the mocked-out sketch just below.)
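To drive the point home, here is roughly what mocking those two queries would look like, as a sketch that assumes standardDeviation and MathExtras are importable in the spec. Notice how the test ends up restating the math and mostly verifying its own stubs:

test('standardDeviation with its queries mocked (not recommended)', () => {
  jest.spyOn(MathExtras, 'average')
    .mockReturnValueOnce(7 / 3)    // average of the inputs
    .mockReturnValueOnce(14 / 9);  // average of the squared diffs
  jest.spyOn(Math, 'sqrt').mockReturnValue(1.24722);

  expect(standardDeviation([1, 2, 4])).toEqual(1.24722);

  jest.restoreAllMocks();
});

Nothing here gets faster, cleaner, or more trustworthy than the direct call in the original test.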
Now let’s contrast this with some code that would be better served by mocking.
def delete_user(user)
  File.delete(user_data_file_path(user))
  # Many lines of code we don't care about for this example...
end
If we let the File.delete() call through, it’ll raise an Errno::ENOENT (No such file or directory) error. But if we create a file in our test, then we have to add test code to assert it was deleted. One could reasonably complain that, aside from testing that we passed the right args to File.delete(), this bloats our test with duplicative testing of the File.delete() method that we already know works. Note, that’s not to say there’s anything wrong with testing File.delete(); it’s just that we don’t want to bloat or complicate our test in order to do so. This is a nuanced point that’s often missed, and one I’ll revisit. Finally, this is all to say nothing of the potential for polluting our filesystem any time we kill a running spec, as well as the additional risk of flaky-spec-inducing collisions when running specs in parallel. The case for mocking the call to File.delete() is strong.
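A JavaScript analog of the same idea, sketched with Jest and Node’s fs module (deleteUser and userDataFilePath are hypothetical stand-ins for the Ruby code above):

const fs = require('fs');
const { deleteUser, userDataFilePath } = require('./users'); // hypothetical module

test('deleteUser removes the user data file', () => {
  // Stub the command so we neither touch the real filesystem
  // nor have to create a file just to assert it disappeared.
  const unlink = jest.spyOn(fs, 'unlinkSync').mockImplementation(() => {});

  const user = { id: 42 };
  deleteUser(user);

  expect(unlink).toHaveBeenCalledWith(userDataFilePath(user));
  unlink.mockRestore();
});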
Real-World Anti-patterns
In my experience, developers often confuse these two use cases I just described, calling for mocks in functional/query cases like our standardDeviation() example, often with the justification that “this is a unit test—we don’t want to re-test Dependency.foo()”. As I mentioned above, there’s nothing inherently problematic about re-testing a dependency; you just don’t want to add too much complexity or performance loss to do it. I find that I’m more likely to mock commands than queries.
As for the notion of a “unit test”, this is a term that’s pervasively misused. The aforementioned Matthew Kane Parker explains in another excellent piece:
There’s a rumor circulating in agile engineering circles that the “unit” referred to in “unit test” is the unit of production code isolated for testing. Interestingly, that wasn’t what many meant by “unit” when the term “unit testing” first started being used.
Before SUnit (Smalltalk Unit, the first unit testing framework), there were already automated tests. Quality assurance engineers, for instance, often scripted tests in an effort to automate manual parts of their work.
The problem is, it was quite common to see order-dependency in those test suites. In other words, a test couldn’t run on its own, because it needed all of the tests before it to run first so that the system would be put into a specific state that the test would then act and assert on.
This led to all kinds of problems, as you might imagine.
Thus, one of the primary goals of SUnit was that each test could run in isolation from all of the other tests. In other words, a “unit test” is a test that’s isolated from all of the other tests in the test suite. The “unit” is the test itself!
But today, it’s quite common to see a very different definition of unit testing proffered and propagated. Many engineers believe that for every class, and for every public method of every class, they must create a corresponding “unit test.” Indeed, they believe that the “unit” is the production code — and that therefore each and every public method of the production code is a unit that must be tested.
This ties back in to the issue of granularity. In order for us to meet the goal of Freedom, we want our tests to be focused on behavior rather than implementation. We want them to be as integrated as is feasible—literally the exact opposite of granular.
What are the limits of feasibility? Aside from the caveats we’ve already looked at, another risk of going too integrated is that, while we have a robust safety net to tell us that something broke, it may be difficult to pinpoint what broke. There are complex competing concerns here, but it’s critical that we break free of the dogma around dependency mocking and “unit tests”. In particular, we want to abandon the false dichotomy of isolated vs. integrated specs, instead seeing the level of granularity as a dial setting on a wide spectrum.
Example
I once built a frontend feature whereby a user could enter a resource ID to view its edit history. I’ll describe it in the present tense for ease of writing.
Our React component makes a JSON XHR GET request via a function we’ll call wrapper(), which makes the AJAX call and performs some complex manipulation of the response data before returning it. We then dispatch a series of Redux actions, some of which include that processed response data.
My Jest spec injects a fake implementation of the dispatch() method, which pushes each action into a dispatchedActions array. My assertion then looks something like:
expect(dispatchedActions).toEqual([
  { type: 'START_SPINNER', payload: { ... } },
  // ...several more expected actions
])
The response JSON comes from a Rails controller action (the backend API), and one of the backend tests for that action saves the resulting JSON as a fixture file, which our frontend test loads in order to stub the AJAX response. This means our tests call through the wrapper functionality that processes the response data and dispatches the actions we want to test, and also call through the XHR library we use (Axios). In this way, we get an extremely high level of Confidence and Freedom.
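Putting the pieces together, the spec’s shape looks roughly like this. It’s a sketch: stubXhr, loadEditHistory, and the fixture path are hypothetical stand-ins, but the key property is that the stub sits beneath Axios, so wrapper() and Axios both run for real:

const fixture = require('./fixtures/edit_history.json'); // saved by the backend spec

test('loading edit history dispatches the expected actions', async () => {
  // Hypothetical helper that fakes the browser's XMLHttpRequest with the fixture.
  stubXhr('GET', '/api/resources/123/history', { status: 200, body: fixture });

  const dispatchedActions = [];
  const dispatch = (action) => dispatchedActions.push(action);

  await loadEditHistory('123', dispatch); // calls through wrapper() and Axios

  expect(dispatchedActions).toEqual([
    expect.objectContaining({ type: 'START_SPINNER' }),
    // ...several more expected actions, including the processed response data
  ]);
});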
The Wrong Boundary
More than one of my teammates suggested that instead of stubbing the XHR call at the level of the XHR API, we should stub the call to wrapper(), and also return a manually created JSON structure instead of the fixture generated by the real code in the backend spec. The argument centered on the idea that I was re-testing the behavior of wrapper as well as the Axios library. And that was seen as bad chiefly because those are already tested (indeed, we do not even own the Axios library).
In my view, the mistake here is that it costs us virtually nothing to call through Axios, which was merely an implementation detail that could have been swapped out for any other HTTP library. By mocking all the way down to the browser’s XHR call, my tests expanded the level of freedom to change implementation details without breaking.
UPDATE: I recently (in November 2022, years after originally writing this post) saw a team experience a major bug caused by upgrading the Axios library, which changed the interface/arguments. A stubbing/mocking approach likely wouldn’t have caught this; you’d just be asserting that you called Axios with the wrong arguments.
So those are the benefits, but what about the costs of my approach? To connect this example to our original goals, note that all of this code is functional in nature; there are no side effects to be tested. The Jest assertion looks exactly the same whether I use deep or shallow mocking — and there is no significant performance cost to the deeper integration. This is a win by all four test design goals enumerated by Matthew Kane Parker.
History
We stand on the shoulders of giants. And if you really want to gain a full appreciation of these principles, there’s some history and further reading you should know about.
In 2009, J.B. Rainsberger presented “Integration Tests are a Scam” to an audience at the Agile (North America) conference in Chicago. After some reflection, he rephrased this as “Integrated Tests are a Scam”. His point was about the unsuitability of tests which integrate multiple “interesting behaviors” of a system, which is primarily a matter of the combinatorial explosion of code paths as the scope of a test increases, as well as the slow feedback that comes from e.g. exercising our code via a web browser. We’ll come back to this.
In July 2011, Gary Bernhardt published his screencast “Test Isolation and Refactoring” (paywall), which dealt with the problem of mocking a dependency whose method signature changes (at least when using a dynamic language). Xavier Shay soon followed up with a blog post and video response about this problem he called “interface mocking” (essentially the very problem we just discussed above, with stubbing arguments to the Axios library). His solution was to create a gem called rspec-fire, which was subsequently subsumed by the verifying doubles feature in RSpec 3. The overarching point is that isolated specs come with the major liability that they don’t ensure the correct wiring of components at their boundaries.
The perils of mocking apparently remained active in Gary’s mind, for in May 2012, he published another screencast called “The Mock Obsession Problem” (paywall), in which he rewrites a mock-laden test to be “simpler, more direct, and with less indirection via mocks”. By July, his thoughts had evolved into a screencast called “Functional Core, Imperative Shell”.
By November of that year, Gary’s thoughts had gelled into the now legendary “Boundaries” talk, which he presented at the 2012 Software Craftsmanship North America (SCNA) conference. Less than six minutes in, Gary cites J.B. Rainsberger’s “Integration Tests are a Scam”. He then expounds on his thesis that isolated tests are good for testing code with few dependencies but many cases or “code paths”, whereas integrated tests are good for testing code with many dependencies but few paths. If you skip to the 7:20 mark, he further stresses that functional code is inherently isolated, without mocks. Thus if we place our complex code (“interesting behaviors” in the words of Rainsberger) at the end of the call stack, by extracting its dependencies, we can isolate from the top rather than from the bottom (i.e. via deeper method calls rather than via mocks, as noted above). The practical ramification of all this is that we can get a lot of benefits by isolating complex logic behind a functional core with an imperative shell.
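To make that concrete, here is a minimal sketch of the pattern (the names are hypothetical, not taken from the talk): the interesting, many-cased logic lives in a pure function we can test directly with no mocks, while a thin shell handles the I/O:

// Functional core: many code paths, no dependencies. Test it directly; no mocks needed.
function summarizeOrders(orders) {
  const total = orders.reduce((sum, order) => sum + order.amount, 0);
  return {
    count: orders.length,
    total: total,
    average: orders.length === 0 ? 0 : total / orders.length,
  };
}

// Imperative shell: few code paths, many dependencies. Cover it with a broader, integrated test.
async function printOrderSummary(fetchOrders, log) {
  const orders = await fetchOrders();            // I/O stays at the edges
  log(JSON.stringify(summarizeOrders(orders)));  // interesting behavior sits at the bottom of the call stack
}

test('summarizeOrders handles the empty case', () => {
  expect(summarizeOrders([])).toEqual({ count: 0, total: 0, average: 0 });
});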
I also recently noticed that David Heinemeier Hansson had already made this argument in a 2014 talk, which I had watched without really picking up on the segment where he addresses this very issue.
One Additional Caveat
So far, we’ve identified a couple of very specific cases for mocking dependencies:
- The dependency is a command rather than a query. (Our tests would have duplicative side-effects if we called through, going against DRY.)
- The dependency incurs a significant performance cost if we call through.
Behavioral vs. Semantic Functions
Another somewhat esoteric case that occurred to me is when the calling code cares about the semantic identity of the dependency rather than its behavior. For contrast, our standard deviation function cares about getting the right values out, regardless of which square root implementation we use.
But suppose a JSON API contains a value like { display_name: DisplayName.call(user) }. The DisplayName module is used in multiple places and has its own specs. From the perspective of our API, e.g. a Rails controller action, the details of the display name are irrelevant. It’s aware only that there’s the semantic notion of displaying a user name; it doesn’t care about the details of what formatting is done. This code cares about the semantic identity of DisplayName.call() rather than the behavior. There’s a good case for mocking here, so that our specs don’t change if the behavior of DisplayName.call() changes.
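The same idea sketched in Jest terms (the displayName module and serializeUser function are hypothetical analogs of the Ruby code, not real project code):

const displayName = require('./displayName');          // hypothetical shared module
const { serializeUser } = require('./userSerializer'); // hypothetical API serializer

jest.mock('./displayName');

test('the API exposes a display_name without caring how it is formatted', () => {
  displayName.call.mockReturnValue('stubbed name');

  expect(serializeUser({ id: 1 })).toEqual(
    expect.objectContaining({ display_name: 'stubbed name' })
  );
  // If the formatting rules inside displayName change, this spec stays green.
});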
Another case for mocking would be a function that involves randomness, like UUID.generate(). Though even then, there are other tricks, like re-seeding the random number generator in a global before block, as is typically done in Rails applications within the RSpec setup (Kernel.srand(config.seed)), to get “deterministic randomness”.
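A JavaScript analog of the mocking option, as a sketch (createRecord is a hypothetical function that calls crypto.randomUUID() internally):

const crypto = require('crypto');
const { createRecord } = require('./records'); // hypothetical module

test('createRecord assigns a fresh id', () => {
  jest.spyOn(crypto, 'randomUUID').mockReturnValue('00000000-0000-4000-8000-000000000000');

  const record = createRecord({ name: 'example' });

  expect(record.id).toEqual('00000000-0000-4000-8000-000000000000');
  jest.restoreAllMocks();
});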
Summary
Focus on the four goals of a good test suite. They are generally best served by more integrated tests, which we decompose when performance, DRY-ness, combinatorial explosion, or general clarity dictates. Avoid the pervasive myth that every public method must be “unit tested”, or that it’s inherently bad to “re-test dependencies” by calling through them. And most of all, learn from those who’ve gone before and encapsulated a lifetime of wisdom into the written and spoken word.