Etcher is an Electron app, based on a from-scratch implementation of a node.js engine to write to SD cards and USB flash drives, which is designed to work on all platforms. In case that was too easy, we set out to make it a pure-JS implementation so the result could be installed from npm without needing a compiler toolchain to be present. It’s safe to say we didn’t appreciate quite how tricky this would make things.
The challenges of writing to drives on Mac and Linux have not been trivial, but ultimately they have been overcome. Issues pop up on occasion and are quickly resolved. Windows, however, is another thing entirely. There's a separate story to be told about the amount of work needed to accomplish robust elevation on Windows, but even after we got that working reliably, we faced a number of issues that appeared at random points in the process and carried almost no information that could help debug them. The most severe of them came to be known as the "EPERM issue". This is what the error message looks like:
It’s safe to say that Juan, the maintainer of Etcher, spent a fair number of hours over several months staring at a window just like that one.
Even though the user has explicitly granted elevated privileges, Windows kicks the writer out, claiming some kind of violation occurred. The issue is not easy to reproduce, and seems to happen at random at any position in the image. Retrying the write often succeeds. In order to test this, we spawned several instances of a test script in parallel. The parallelism increased the odds of seeing the bug, so we had a somewhat reliable way to replicate it. However, a solution (or explanation) still evaded us.
We investigated a lot of theories. A seemingly promising hypothesis was that some other process, perhaps anti-virus software or Etcher’s own drive detection scanner, tried to access the drive while the write was in progress, and for some reason Windows decided to lock Etcher’s writer out. Users, however, reported that they didn’t have any such programs running, and even when we stopped the background scanning during writes, the issue persisted. We still spent significant time on this theory, since it’s hard to rule out some other process interfering in a modern OS like Windows, with all sorts of things going on. It wound up being a dead end.
We tried retrying the write of a block with increasing timeouts, reducing the block size after we hit the first EPERM, reducing the block size for the whole process, zeroing out a sector before writing to it, going back to the previous sectors after the first failure, and zeroing out the whole drive before starting to flash. Nothing worked.
Even more puzzling was that other writers didn’t seem to have the same issue. It turns out that native applications mitigate these kinds of issues because they use the Windows C/C++ APIs exposed by the operating system, which seem to handle all these edge cases and do all sorts of incantations for everything to work fine. Sadly, Windows doesn't provide much information on this topic and the code is closed-source, so if we insisted on a pure-JS approach, we would have to find our own way.
So, since all the guessing and flailing about wasn’t getting us anywhere, we decided to use some of that inductive reasoning everyone’s been raving about.
The first thing we checked was the Node.js documentation, to see if there was any mention of the EPERM error. The fs module documentation does not mention any such error, and the only reference we could find was that EPERM occurs when “An attempt was made to perform an operation that requires elevated privileges”. We were definitely running as an elevated process, so we had to look under the hood.
The EPERM is raised by the "write" syscall, which, given the small surface of our application and the place where the error happens, we can safely assume corresponds to "fs.write()".
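As a rough illustration of where the "write" syscall information comes from: Node.js attaches `code`, `syscall` and `errno` properties to system errors. The sketch below shows how such an error can be inspected. The helper names and the hand-built sample error are ours for illustration only and are not taken from Etcher's code.

```javascript
'use strict';

// Hypothetical helper: collects the fields Node.js attaches to system errors.
function describeSystemError(error) {
  return {
    code: error.code,       // symbolic name, e.g. 'EPERM'
    syscall: error.syscall, // the failing call, e.g. 'write'
    errno: error.errno,     // numeric value, platform-specific
  };
}

// Hypothetical predicate for the kind of failure discussed in this post.
function isWriteEperm(error) {
  return error instanceof Error &&
    error.code === 'EPERM' &&
    error.syscall === 'write';
}

// Hand-built error for illustration; a real one would come from fs.write().
const sample = Object.assign(
  new Error('EPERM: operation not permitted, write'),
  { code: 'EPERM', syscall: 'write', errno: -4048 } // errno value is illustrative
);

console.log(describeSystemError(sample), isWriteEperm(sample));
```

A check like this only tells us which call failed, not why, which is what sent us into the source code.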
Looking at the source code of Node.js, we see that fs.write() is implemented by calling the writeBuffer function of its native bindings, which in turn calls into libuv’s uv_fs_write function. Going further down the rabbit hole, we looked into how uv_fs_write is implemented for the Windows platform. Jumping through a few more function calls, we see that the platform-specific function used by libuv is WriteFile.
WriteFile must be returning the EPERM error that gets propagated all the way up to our program. Time to look at the MSDN documentation. There was a lot of information about how this function works, but we still couldn’t find any mention of the EPERM error. After reading more carefully, however, we noticed the following section:
A write on a disk handle will succeed if one of the following conditions is true:
- The sectors to be written to do not fall within a volume's extents.
- The sectors to be written to fall within a mounted volume, but you have explicitly locked or dismounted the volume by using
- The sectors to be written to fall within a volume that has no mounted file system other than RAW.
One more detail about our writing process is that before our module started writing data to the physical drive, it cleaned the drive using diskpart.exe, which causes the drive to lose any file system information and therefore allows us to write to it (since Windows permits writes to volumes that have no actual file system).
Some of you may have already figured it out. For us, it hit after looking at the list and thinking for a while. Given that we write in a linear fashion, the first chunk that we write contains the partition table. In some cases, Windows re-parsed the MBR, and after that point any call to WriteFile would violate rule #1. In the process of writing a filesystem to the raw drive, Etcher spelled its own doom.
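To make the reasoning concrete, here is a toy model of the three documented conditions. It is only a sketch: the function name, the shape of the volume objects, and the sector numbers are invented for illustration, and it says nothing about how Windows implements these checks internally.

```javascript
'use strict';

// Toy model of the three documented conditions (not Windows' actual logic).
// A volume is { start, end, fileSystem, locked, dismounted }, covering the
// sector range [start, end).
function diskWriteAllowed(sector, volumes) {
  const volume = volumes.find((v) => sector >= v.start && sector < v.end);

  // Condition 1: the sector is outside every volume's extents.
  if (!volume) return true;

  // Condition 2: the volume was explicitly locked or dismounted.
  if (volume.locked || volume.dismounted) return true;

  // Condition 3: the volume has no mounted file system other than RAW.
  return volume.fileSystem === 'RAW';
}

// Freshly cleaned drive: no partition table, hence no volumes at all.
const beforeTableIsParsed = [];

// Hypothetical state once the newly written partition table has been parsed.
const afterTableIsParsed = [
  { start: 2048, end: 100000, fileSystem: 'FAT32', locked: false, dismounted: false },
];

const earlyWrite = diskWriteAllowed(5000, beforeTableIsParsed); // allowed
const lateWrite = diskWriteAllowed(5000, afterTableIsParsed);   // refused

console.log({ earlyWrite, lateWrite });
```

In this model the same sector is writable before the partition table is noticed and refused afterwards, which matches the seemingly random failures in the middle of an image.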
To work around this, for Etcher 1.0.0-beta.15 we take the following approach during the writing phase: we omit the first chunk, which contains the partition table, but temporarily save it in memory. We then proceed to write all but the first chunk, and once all those writes complete, we finally write the first chunk.
This ensures that Windows will not detect any valid partition table until the end of the write, preventing it from revoking our write access to the volume.
Given that we clean the volume with diskpart.exe before starting the write, we don't need to be concerned with the existing data on the first sector when omitting the first chunk.
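The ordering described above can be sketched roughly as follows. This is a minimal illustration rather than Etcher's actual implementation: `writeDeferringFirstChunk` and its `writeAt(buffer, position)` callback are hypothetical names, and the "drive" here is just an in-memory buffer standing in for a real device handle.

```javascript
'use strict';

// Hypothetical helper: writes `image` in fixed-size chunks, holding back the
// first chunk (the one carrying the partition table) until all others are done.
// `writeAt(buffer, position)` is an assumed callback that performs one
// positional write (for example, a wrapper around fs.writeSync on a drive).
function writeDeferringFirstChunk(image, chunkSize, writeAt) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }

  // Keep the first chunk in memory instead of writing it right away.
  const firstChunk = image.subarray(0, chunkSize);

  // Write everything after the first chunk, in linear order.
  for (let position = chunkSize; position < image.length; position += chunkSize) {
    writeAt(image.subarray(position, position + chunkSize), position);
  }

  // Only now write the first chunk, so the partition table lands last.
  if (firstChunk.length > 0) {
    writeAt(firstChunk, 0);
  }
}

// Usage against an in-memory stand-in for a drive.
const image = Buffer.from('MBR-AAAABBBBCC');
const drive = Buffer.alloc(image.length);
const writeOrder = [];

writeDeferringFirstChunk(image, 4, (chunk, position) => {
  chunk.copy(drive, position);
  writeOrder.push(position);
});

console.log(writeOrder, drive.equals(image));
```

The final contents are identical to a plain linear write; only the order changes, so position 0 is touched last.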
A more elegant approach could be making use of the
FSCTL_DISMOUNT_VOLUME calls, however they only seem to be available
For now, we're happy to release the latest version of Etcher, which should hopefully resolve our EPERM issues, and head on to our next adventure. We hope you enjoyed this writeup. Our work in implementing Etcher has led us down a few of these rabbit holes, and hopefully we’ll write some more of them up when the opportunity arises.
Also, forgive us the plug, but if this kind of work sounds interesting, and you can imagine yourself working on a cross-platform open-source project remotely, we’re hiring!