Flaky tests are like meme stocks — many people have them, but no one knows what to do with them. Today, we will change that by diving into some common causes and, more importantly, solutions for flickering tests in Elixir.
Elixir has many great primitives that let us run tests asynchronously, including immutable data, lightweight processes, and the Ecto SQL sandbox. Running tests asynchronously can greatly speed up your test suite, but can also increase the chance of flaky tests.
What Are Flaky Tests?
Flaky tests are tests that sometimes fail. They erode confidence in your test suite and are hard to fix because they are hard to reproduce. Often they imply a test is broken (rather than the code) and so are ignored or retried until they work.
Locally, this slows you down, but it especially hurts your CI — every failure means at least one rebuild. Anything that doubles the time it takes to deploy code is very annoying. The culture of "oh, just retry" is a broken window that risks further decline in your codebase.
Find and Replicate Flaky Tests in Elixir
Flaky tests are usually easy to spot (they'll be the ones that fail on CI when you update the README), but they are harder to replicate locally.
Here, ExUnit can help because it lets us run the tests in the same order as a previous run. Usually, we run tests in random order to encourage test isolation, but we can seed that randomness with a command-line option. Re-using the seed from a previous run triggers the tests in the same order each time. ExUnit outputs the seed it used here:
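At the end of a run, you will see something like this (the timings and seed value here are just illustrative):

```
Finished in 0.05 seconds (0.03s async, 0.02s sync)
3 tests, 0 failures

Randomized with seed 846778
```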
We can re-use that seed like this:
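Passing the seed back to `mix test` replays the same ordering (substitute the seed from your own failing run):

```
$ mix test --seed 846778
```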
However, this won't always reproduce the flake, especially if a database is the cause of the flakiness or if some resource constraint makes the flakiness more likely (CI is likely to have much less RAM, for example, than your dev machine). On top of that, the seed does not influence how quickly a test runs. If the tests run asynchronously, there is no guarantee that two tests will run at the same time again (even if they are triggered in the same order).
Imagine three tests are running concurrently: A, B, and C. The seed determines that the tests trigger in the order A, B, C. The first time these tests run, test A takes as long as the other two combined. A starts, then B triggers and finishes, C triggers and finishes, and finally A finishes.
If we rerun these tests with the same seed, even though they trigger in the same order, A might finish before C starts this time, for whatever reason. That might mean you won't reproduce the conditions needed for the test to flicker. Using a seed is a good first stab, but it might not work.
Running the tests repeatedly can help. Here is a bash function that will run the tests until there is a failure:
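A minimal sketch of such a function (the name `run_until_failure` is made up; pass it whatever command runs your suite):

```shell
# Run the given command repeatedly until it exits non-zero,
# then report how many attempts it took to hit a failure.
run_until_failure() {
  attempts=1
  while "$@"; do
    attempts=$((attempts + 1))
  done
  echo "Failed after $attempts attempt(s)"
}
```

For example: `run_until_failure mix test --seed 846778`.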
When you understand some of the common causes of flaky tests, you can often identify the problem just by looking. That's the level of intuition we want to build up here.
All flaky tests boil down to one thing: non-determinism. Non-determinism is when the same input can produce different results. We need to look out for non-determinism sneaking into our tests and think about how it can happen when tests run asynchronously.
Especially look out for global state. Global state, I hear you say? But Elixir is functional! There is no global state! Well... that's not quite true.
Let's take some of the most common causes of flaky tests in turn below.
8 Common Causes of Flaky Tests
1. Using Application.put_env in Asynchronous Tests
When configuring Elixir apps, we can read values from the config using functions like Application.fetch_env!/2. You might be tempted to set an application variable in the test setup to test behavior in different environments:
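A sketch of the tempting (but flawed) pattern; the app and key names are made up:

```elixir
defmodule MyApp.FeatureTest do
  use ExUnit.Case, async: true

  setup do
    # DON'T do this in an async test: application config is global,
    # so every concurrently running test sees this change.
    Application.put_env(:my_app, :feature_enabled?, true)
    on_exit(fn -> Application.delete_env(:my_app, :feature_enabled?) end)
  end

  test "the feature is on" do
    assert Application.fetch_env!(:my_app, :feature_enabled?)
  end
end
```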
Don't do this if your tests run asynchronously. Application.fetch_env/2 can be called from anywhere, meaning it is effectively global state. And worse than that, because you can call Application.put_env/3 anywhere, it is effectively global mutable state.
That means even if you reset the application, another asynchronous running test might read from the application after you have changed it (but before your test completes and changes it back). That test gets the wrong value and potentially fails, sometimes.
The Fix
Don't use Application.put_env in tests. If you have to, put it in a test with async: false and reset it using on_exit/1.
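A sketch of that fix, with the same made-up app and key names:

```elixir
defmodule MyApp.FeatureTest do
  # async: false, so no other test runs while the config is changed
  use ExUnit.Case, async: false

  setup do
    original = Application.get_env(:my_app, :feature_enabled?)

    Application.put_env(:my_app, :feature_enabled?, true)

    # Restore whatever was there before, even if the test fails
    on_exit(fn -> Application.put_env(:my_app, :feature_enabled?, original) end)
  end

  test "the feature is on" do
    assert Application.fetch_env!(:my_app, :feature_enabled?)
  end
end
```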
2. Incorrectly Configuring Ecto.Adapters.SQL.Sandbox
Usually, we configure Ecto so that each test runs in its own transaction. Each test runs concurrently in its own process, and each process opens its own transaction to the database.
This is great because it means that Ecto can simply roll back that transaction when the test finishes. This allows us to run our tests asynchronously (in Postgres at least) without worrying that the state of the database is anything other than what our test specifies it to be.
However, sometimes we need to see what other processes do to a database. Imagine we test some code that starts a task that writes to the database. The test process needs to see what that task's process does to the database.
We can do this by putting Ecto.Adapters.SQL.Sandbox in :shared mode, allowing "a process to share its connection with any other process automatically".
But remember that each asynchronous test runs in its own process. That means that in :shared mode, any test running simultaneously with another test will share the transaction to the database and see all of the changes the other test makes.
This is the first way we could introduce non-determinism in tests.
The Fix
If a test sets the Ecto.Adapters.SQL.Sandbox to :shared mode, never run it asynchronously, like so:
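A minimal sketch (MyApp.Repo and the test name are stand-ins for your own):

```elixir
defmodule MyApp.ReportTest do
  # Never async: in :shared mode, every other process sees
  # this test's database transaction.
  use ExUnit.Case, async: false

  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(MyApp.Repo)
    Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})
  end
end
```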
3. Having Non-unique Unique Data
Different databases implement transactions slightly differently. For now, I will talk about the most commonly used database in Elixir: Postgres.
Postgres
Postgres never lets a transaction see another's uncommitted changes (even though the SQL standard technically allows it!), so the default transaction isolation level doesn't cause problems in concurrent tests.
Unfortunately, two concurrently running tests can still interfere via a different concurrency control that Postgres employs: locks.
A lock is a concurrency control mechanism that ensures different commands can be executed safely while other commands are happening. For example, if we wish to truncate a table, it isn't a good idea to try to insert something into that table at the same time. Truncate "locks" the table, preventing anything else from happening to that table while it is being truncated.
You can use explicit locking — where you tell Postgres to take a specific kind of lock — but each command has its own appropriate level of automatic locking. If Postgres thinks two concurrently running commands will conflict, the second command will wait for the first to complete.
By far, the most common problem here is with unique data. Let's say we have a user table with a unique email address. Now, let's imagine we have tests 1 and 2, and both insert a user with the same email address, jeff@example.com, as part of the test setup.
When checking a unique index on insert, Postgres has to look at uncommitted transactions to decide whether the insert can continue. If it only checked the index against committed rows, two concurrent transactions could both pass the check and end up inserting non-unique rows.
If test 1 intends to insert a user with the email jeff@example.com, test 2 cannot do the same. But if, later on, test 1 never actually commits that insert, then test 2 can insert its user. That means Postgres can't know the answer to "can test 2 insert the user?" until test 1 has finished. So it takes a lock on the row, which basically says: "hey test 2, wait until test 1 has finished its transaction before you continue".
Even though they happen in different isolated transactions that never actually commit, when Postgres sees that one transaction has already "inserted" a row with the jeff@example.com email, it makes the next one wait for the first transaction to commit or roll back.
All this means that if you have data that should be unique but isn't across concurrently running tests, you will at best incur some performance penalty. This can be quite severe as it adds up across the codebase. You can end up with async tests effectively running synchronously!
But, if we add more tables with more unique columns that are not unique across tests, it gets even worse! Depending on the order in which data is set up, you can end up with deadlocks. Let's say our app has blogs with a unique title. Picture the following:
| Test 1 | Test 2 |
| --- | --- |
| inserts a user with email jeff@example.com | inserts a blog with title "Deadlocks!" |
| inserts a blog with title "Deadlocks!" | inserts a user with email jeff@example.com |
Test 1 inserts its user, and test 2 inserts its blog. Then test 2 attempts to insert the same user; to do that, it must take a lock and wait for test 1's transaction to finish. At this point, test 2 is waiting for test 1.
Meanwhile, test 1 continues and attempts to insert the blog, but can't because test 2 just did. So it waits to see if test 2 will commit or roll back. But test 2 is already waiting for test 1, so now each test is waiting for the other!
This is a deadlock. Usually, Postgres detects them automatically and one transaction gets rolled back, causing your test to flake.
The Fix
Sometimes you will hear advice like "ensure that the locks are acquired in a consistent order". This helps prevent a deadlock, but is tricky to do and would still incur the performance penalty mentioned.
The simplest golden rule is — if your data should be unique, make it unique across all tests:
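One way to do that, sketched as a hypothetical factory helper. System.unique_integer/1 returns an integer no other call in the running VM will receive, so concurrently running tests can never collide:

```elixir
defmodule MyApp.Factory do
  # Each call yields a distinct email, even across concurrent tests.
  def unique_email do
    "user-#{System.unique_integer([:positive])}@example.com"
  end
end
```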
Really consider whether your test data is unique as well. Using a UUID will make it unique. Picking a random number between 1 and 1,000 will not.
4. Writing to ETS or Persistent Term in Asynchronous Tests
ETS and persistent term are two data stores that come with Erlang. They are accessible from anywhere and, like all data stores, are stateful. That means if two tests running at the same time set themselves up by writing into ETS or persistent term, they can (and will) interfere with each other. For example, one test adds a record to a table and then deletes it, while another test asserts on the number of records in that same table.
The Fix
Your options are to mock :ets/:persistent_term (after all, we don't need to test that these do what they say; Erlang does that for us!) or to run the tests synchronously. Prefer the former! Mox is a great choice for this sort of work.
5. Relying On The Order Of Logs
Sometimes you may want to test that a particular log line has been emitted. The best way to do this is with ExUnit's capture log:
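For example (the log message here is a stand-in for whatever your code emits):

```elixir
import ExUnit.CaptureLog

test "logs a confirmation" do
  log =
    capture_log(fn ->
      # Stand-in for the real code under test
      Logger.info("payment processed")
    end)

  assert log =~ "payment processed"
end
```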
This takes a function and captures all of the logs emitted during the execution of that function. It concatenates all the logs together into one binary and returns it.
But logs can be emitted from anywhere, and capture_log will capture them dutifully, making them, in a sense, global. And because capture_log concatenates all logs together, each new log line changes the result. So, in effect, the returned binary is global and changing.
If this has set off warning bells: congratulations, you are right to be worried! If the tests are running asynchronously, the result of capture_log will differ depending on what other tests are running at the time. If you assert exactly what the captured logs should look like, you will have a bad time.
The same principle applies if you use something like a Ring Logger or any custom logging backend that buffers the logs.
The Fix
Never rely on log order when making assertions. Assume that the captured log can include any number of other log lines. Instead, use a regex or the =~ operator to match on the subset of the log that you care about:
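A sketch of both styles (the log content is illustrative):

```elixir
import ExUnit.CaptureLog

test "charging logs the user" do
  log =
    capture_log(fn ->
      # Stand-in for the real code under test
      Logger.info("charged user 42")
    end)

  # Match only the part you care about; other tests' logs may be mixed in
  assert log =~ "charged user"
  assert log =~ ~r/charged user \d+/
end
```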
That way, the rest of the logs do not interfere with your assertions.
6. Failing to Specify Order in the Database
The most common cause of flaky tests in the wild is assuming that a database will return results in a certain order when there is no such guarantee. In Postgres, for example, if an explicit order is not supplied, the results can come back in any order, even if they appear consistent most of the time. So all of these are bad ideas:
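Some sketches of the kinds of assertion to avoid (Repo, User, and the user variables are stand-ins):

```elixir
# All of these silently assume Postgres returns rows in insertion order:
assert Repo.all(User) == [user_1, user_2]

assert [%User{email: "jeff@example.com"} | _rest] = Repo.all(User)

[first_user, _second_user] = Repo.all(User)
assert first_user.id == user_1.id
```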
The Fix
Some possible solutions are:
- Specify the order in the database query:
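For example, assuming an Ecto schema User with an inserted_at timestamp:

```elixir
import Ecto.Query

users = Repo.all(from u in User, order_by: [asc: u.inserted_at])
```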
But wait, did you notice the problem above? If two records have the same inserted_at, the sort isn't guaranteed to be stable. You need to handle that possibility too; e.g., if the id is an auto-incrementing integer, you could:
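A sketch of one way to break ties, assuming id is an auto-incrementing integer:

```elixir
import Ecto.Query

# id breaks the tie when two rows share the same inserted_at
users = Repo.all(from u in User, order_by: [asc: u.inserted_at, asc: u.id])
```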
- Don't assert on order if it's not important:
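For instance, assert membership rather than position:

```elixir
users = Repo.all(User)

assert length(users) == 2
assert user_1 in users
assert user_2 in users
```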
Or even:
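Compare as sets, which ignores order entirely:

```elixir
assert MapSet.new(Repo.all(User)) == MapSet.new([user_1, user_2])
```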
7. Not Mocking Date/Time/Random ⏰
We've all been there. It's 2 PM on Wednesday, and for some reason, half the test suite fails when it all passed a minute ago. Yes, someone forgot to mock the date/time.
Imagine we have this function:
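A hypothetical example: a sale that runs until 5th November 2022.

```elixir
defmodule Sale do
  # The result depends on the wall clock: a hidden global input!
  def active? do
    Date.compare(Date.utc_today(), ~D[2022-11-05]) == :lt
  end
end
```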
Now we add a test that looks like this:
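```elixir
test "the sale is active" do
  # Passes every day... until 5th November 2022
  assert Sale.active?()
end
```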
Well, this will work until 5th November 2022 and then fail forever. The same idea applies to Time, Date, DateTime, and NaiveDateTime values, and to any function that uses randomness, such as Enum.take_random.
The Fix
Mock those modules! Mox is a great choice for this sort of work.
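One way to sketch it (all names here are hypothetical): define a behaviour for the clock, let production default to Date, and hand a Mox mock to the test:

```elixir
defmodule MyApp.Clock do
  @callback utc_today() :: Date.t()
end

defmodule Sale do
  # The clock is injected, so tests can pass a mock; Date itself
  # satisfies the callback in production.
  def active?(clock \\ Date) do
    Date.compare(clock.utc_today(), ~D[2022-11-05]) == :lt
  end
end

# In the test, assuming `Mox.defmock(MyApp.ClockMock, for: MyApp.Clock)`:
#
#     Mox.expect(MyApp.ClockMock, :utc_today, fn -> ~D[2022-10-31] end)
#     assert Sale.active?(MyApp.ClockMock)
```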
8. Using assert_received/2 Instead of assert_receive/3
In ExUnit you can assert that a test process receives a message using either assert_receive/3 or assert_received/2. This lets you write tests for functions that spin up processes and send messages to other ones, e.g.:
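A sketch with a made-up Echo module that just sends a message back:

```elixir
defmodule Echo do
  # send/2 delivers the message to `pid`'s mailbox and returns immediately
  def echo(pid, message), do: send(pid, message)
end

test "echoes the message back" do
  Echo.echo(self(), :hello)

  assert_received :hello
end
```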
But there is a subtle difference between the two assertions: assert_receive/3 allows a timeout, an amount of time to wait for the message to appear in the current process's mailbox. Usually, this is all very quick, so a timeout seems unnecessary, but send in Elixir is non-blocking.
The send inside Echo.echo happens, then we immediately continue with our test. The next step is to check the mailbox. In the right conditions (i.e., some performance blip), there is a small chance that assert_received/2 looks in the mailbox after the send happens but before the message reaches the current process's inbox. If this happens, we have a flaky test.
The Fix
Prefer assert_receive/3. Remember, the timeout is the maximum time it will wait: if the message gets there sooner, the test will finish sooner.
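The same test from above, made robust (100ms is an arbitrary upper bound):

```elixir
test "echoes the message back" do
  Echo.echo(self(), :hello)

  # Waits up to 100ms, but returns as soon as the message arrives
  assert_receive :hello, 100
end
```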
A Side-note: Should Tests Run Sync or Async?
You may have noticed that the chance of flickering tests decreases greatly if we make all tests synchronous. I do not recommend doing that. Running asynchronously greatly speeds up most test suites.
Similarly, mocking everything and writing only unit tests will also likely reduce the chance of flickering — but it's on us to decide whether such a testing strategy would give us enough confidence in our code.
One thing to note is that tests within a test file always run synchronously. If the file is marked to run async, the module might run at the same time that another module of tests runs. So, if you really need to have some synchronous tests, you can put them in their own module (in the same file). That allows most of the tests to run async:
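A sketch of that layout, with made-up module names:

```elixir
# test/my_app_test.exs
defmodule MyAppTest do
  use ExUnit.Case, async: true

  # Most tests live here and may run concurrently with other modules.
end

defmodule MyAppSyncTest do
  use ExUnit.Case, async: false

  # The few tests that need the suite to themselves go here.
end
```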
Wrap up
In this post, we've defined flaky tests and seen how to replicate them, before running through 8 common causes of flaky tests and their fixes.
Here is a summary of what you should avoid doing, for easy reference:
- Don't use Application.put_env in async tests.
- Don't use :shared mode on the Ecto SQL sandbox in async tests.
- Ensure unique data is unique across all tests.
- Never rely on the order of logs in a test.
- Consider mocking ETS/:persistent_term for testing.
- If you rely on database order, specify it.
- Mock dates, times, datetimes, and random.
- Prefer assert_receive/3 over assert_received/2.
I hope you've found this post useful, and happy coding!
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!