Posted on May 22, 2018 by Jacek Galowicz, Werner Haas
This article is about the new variant 4 of the Spectre attack that works without misleading the branch predictor. Instead, it exploits an implementation detail of Intel’s memory disambiguation technique inside the CPU’s pipeline.
Jann Horn (Google Project Zero) and Ken Johnson (Microsoft Security Response Center) disclosed this vulnerability in February 2018, and Intel publicly announced it on May 21st. It is known as Spectre V4, Speculative Store Bypass (SSB), or CVE-2018-3639.
A short summary of what this CPU vulnerability means:
At first glance, this vulnerability bears similarities to CVE-2017-5753, the first variant of the Spectre attack. While Spectre tricked the CPU into following an incorrect path of instruction execution, this new attack requires a carefully constructed instruction sequence that tricks the CPU into speculatively using incorrect data for its operations. As is the case with the original Meltdown and Spectre vulnerabilities, the CPU will detect and fix its mistake. But the missteps leave traces behind which can lead attackers to hidden information.
In order to understand the attack scenario it is necessary to dive a bit deeper into the mode of operation of out-of-order execution. CPUs preserve the illusion of processing program instructions in the order in which they are written and keep all reordering internal. Memory, however, is an external component, so handling of memory accesses requires special attention. The corresponding functionality is denoted memory disambiguation.
Memory access instructions are commonly referred to as LOAD/STORE for reading from/writing to memory, respectively. Processors understanding the x86 instruction set decode the various MOV opcodes to their LOAD/STORE equivalents in the early stages of the command processing pipeline. Of course, the CPU has to respect dependencies between instructions when reordering them to improve performance.
Consider an illustrative example in which a STORE writes the value 123 to memory address 0x1000. A subsequent second STORE instruction copies the value that it finds at address 0x1000 (via a corresponding LOAD) to memory address 0x2000. After executing both STORE instructions, the memory cells at addresses 0x1000 and 0x2000 must both contain the value 123. The corresponding pseudo-code reads as follows:
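Expressed as a minimal C sketch, the two STOREs look like this. The function name and the two pointer parameters, which stand in for the cells at 0x1000 and 0x2000, are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: the caller passes two memory cells that stand in
 * for the addresses 0x1000 and 0x2000 from the example. */
void two_stores(volatile uint32_t *cell_0x1000, volatile uint32_t *cell_0x2000)
{
    /* First STORE: write the constant 123 to [0x1000]. */
    *cell_0x1000 = 123;

    /* Second STORE: LOAD the value at [0x1000], then write it to [0x2000].
     * The LOAD depends on the first STORE (read-after-write). */
    *cell_0x2000 = *cell_0x1000;
}
```

Whatever the hardware does internally, both cells must hold 123 once the function returns.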
This scenario is simple but becomes a bit more complex within the CPU’s pipeline: while both instructions can be in different stages of the pipeline at the same time, their effects on memory and registers are only written to the system after they have retired (i.e. left the pipeline). That means that the second STORE instruction may not see the new value at memory address 0x1000 in cache or memory, because the first STORE may not have left the pipeline yet. The CPU must somehow solve this problem; otherwise the program’s correctness is jeopardized.
This specific scenario is called a Read-After-Write (RAW) hazard. The Wikipedia article on memory disambiguation contains a full list of the possible LOAD/STORE hazards and how to resolve them.
As already stated, the chip specification promises that the CPU preserves the sequential program execution model. That means that no matter in what order it decides to execute instructions, the results in registers and memory will always be the same as if executed by a CPU that does not work in out-of-order manner. While using out-of-order execution to boost performance, the CPU cannot immediately write to memory as soon as a store operation completes because the new value would become visible immediately. Instead, the result is buffered until the instruction is due to retire in original order and its value can be committed to the architectural state. The corresponding hardware structure is commonly called store queue and naturally takes care of Write-After-Read (WAR) and Write-After-Write (WAW) dependencies. We are getting to resolving our RAW dependency now:
By the way, the name queue is too simplistic. Even without reordering of instructions, if a STORE and a following LOAD are in the pipeline at the same time, the LOAD might be executed before the STORE’s effect has been propagated to memory. Using the example from above, let us assume that the initial STORE of 123 to address 0x1000 is blocked from retirement because of another long-latency operation preceding it. That is, the value 123 can only be found in the store queue when the LOAD fetching the value for the second STORE gets executed. Hence, CPU architects usually call for a searchable structure instead of a plain first-in, first-out (FIFO) buffer and implement so-called store-to-load forwarding. That is, every LOAD searches for older STOREs to the same address before it consults the actual memory. In case of a hit, it uses the value from the store queue instead of fetching the data from memory. In our case, the LOAD is reading from address 0x1000 and indeed, the store queue holds a matching STORE with the value 123.
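Store-to-load forwarding can be modelled in software with a small searchable buffer. The following is a simplified sketch, not a description of any real hardware implementation; all type and function names, and the fixed capacity, are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SQ_CAPACITY 8  /* assumed size, real store queues differ */

struct store_entry {
    uintptr_t addr;   /* target address of the buffered STORE */
    uint32_t  value;  /* value waiting to be committed */
};

struct store_queue {
    struct store_entry entries[SQ_CAPACITY];
    size_t count;     /* entries[0] is the oldest STORE */
};

/* Buffer a STORE until retirement. Returns false if the queue is full. */
bool sq_push(struct store_queue *sq, uintptr_t addr, uint32_t value)
{
    if (sq->count == SQ_CAPACITY)
        return false;
    sq->entries[sq->count].addr = addr;
    sq->entries[sq->count].value = value;
    sq->count++;
    return true;
}

/* A LOAD searches the queue from youngest to oldest entry. On a hit the
 * buffered value is forwarded; otherwise the value from memory is used. */
uint32_t load_with_forwarding(const struct store_queue *sq,
                              uintptr_t addr, uint32_t memory_value)
{
    for (size_t i = sq->count; i-- > 0;) {
        if (sq->entries[i].addr == addr)
            return sq->entries[i].value;
    }
    return memory_value;
}
```

With a pending STORE of 123 to 0x1000 in the queue, a LOAD from 0x1000 receives 123 even though memory still holds the old value, while a LOAD from any other address falls through to memory.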
An analogous technique is used to tackle the last remaining issue of RAW dependencies (these are also called “true dependencies”). Similar to STOREs getting buffered in the store queue until instruction retirement, LOADs enter a load queue. STOREs then search for newer in-flight LOADs with the same address. The consequences in case of a hit are different, though: the value returned by the LOAD may already have been propagated to further dependent instructions, which means that the program might have been working with wrong values from this point on. This makes it necessary to reset the program flow to this point in the near past and continue executing from there with the correct values. This little “reset” scenario should be completely private to the pipeline and should not be visible from the outside.
Picking up the introductory example again: if the initial STORE to 0x1000 gets delayed instead of merely stalled while waiting for retirement, the store queue is empty by the time the LOAD executes. This means the LOAD will return an arbitrary value, which will also be used by the second STORE and thus end up in the store queue with address 0x2000. When the first STORE is finally processed, it finds its address 0x1000 in the load queue. Since the LOAD follows the initial STORE in original order, it will be squashed together with the final STORE, and the two instructions will be re-executed once the value 123 has been committed to memory.
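The check a late STORE performs against the load queue can be sketched in the same style. This is a simplified software model under stated assumptions: the names, the fixed capacity, and the use of a sequence number to represent original program order are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LQ_CAPACITY 8  /* assumed size */

struct load_entry {
    uintptr_t addr;  /* address the LOAD has already read from */
    unsigned  seq;   /* position of the LOAD in original program order */
};

struct load_queue {
    struct load_entry entries[LQ_CAPACITY];
    size_t count;
};

/* Record a LOAD that has executed but not yet retired. */
bool lq_record(struct load_queue *lq, uintptr_t addr, unsigned seq)
{
    if (lq->count == LQ_CAPACITY)
        return false;
    lq->entries[lq->count].addr = addr;
    lq->entries[lq->count].seq = seq;
    lq->count++;
    return true;
}

/* Called when a delayed STORE finally executes. Returns true if a LOAD that
 * comes later in program order has already read from the same address.
 * In that case *restart_seq is set to the earliest such LOAD, the point
 * from which execution has to be squashed and repeated. */
bool store_detects_violation(const struct load_queue *lq, uintptr_t store_addr,
                             unsigned store_seq, unsigned *restart_seq)
{
    bool found = false;
    for (size_t i = 0; i < lq->count; i++) {
        const struct load_entry *e = &lq->entries[i];
        if (e->addr == store_addr && e->seq > store_seq) {
            if (!found || e->seq < *restart_seq)
                *restart_seq = e->seq;
            found = true;
        }
    }
    return found;
}
```

In the running example the STORE to 0x1000 has sequence number 1 and the LOAD from 0x1000 has sequence number 2, so the check reports a violation and names instruction 2 as the restart point.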
In practice, because of the load and store queues, none of this is visible when looking at register and memory content during program execution. Unfortunately, other parts of the CPU are leaking the intermediate, temporarily wrongly executed part of the program, as we will see in the next section.
Given a basic understanding of CPU operation, let’s have a closer look at the new vulnerability. What we want is to read a value from a protected address. Assuming correct software, such an illegal access will be detected, of course, so we have to perform data-dependent operations in order to exfiltrate the value through a side channel. This could look as follows in C:
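A minimal sketch of such a guarded, data-dependent access follows. Only the names is_address_sane() and cache_trace come from the text; the public_data buffer, the guarded_read() function, the stride of 4096 bytes, and the concrete sanity rule are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_STRIDE 4096  /* assumed spacing: one page per possible byte value */

/* Probe array: which part of it ends up in the cache depends on the value read. */
static uint8_t cache_trace[256 * TRACE_STRIDE];

/* Hypothetical region that the program is allowed to read. */
static uint8_t public_data[16];

/* Assumed policy: an address is sane if it lies inside public_data. */
bool is_address_sane(const uint8_t *pointer)
{
    uintptr_t p  = (uintptr_t)pointer;
    uintptr_t lo = (uintptr_t)public_data;
    return p >= lo && p - lo < sizeof public_data;
}

uint8_t guarded_read(const uint8_t *pointer)
{
    uint8_t value = 0;
    if (is_address_sane(pointer)) {
        value = *pointer;
        /* Data-dependent access: touches one cache line selected by value. */
        volatile uint8_t sink = cache_trace[(size_t)value * TRACE_STRIDE];
        (void)sink;
    }
    return value;
}
```

Architecturally, a pointer outside public_data never gets dereferenced and guarded_read() returns 0. The attacks discussed here target what happens speculatively before the check is resolved.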
Spectre manipulates the branch predictor into believing is_address_sane() would be true, and by the time the CPU detects its mistake, the LOAD resulting from the array access has already left its marks in the cache hierarchy, although neither the loaded value nor cache_trace will appear modified.
Assuming bounds checking is working properly, we need a different way of tricking the CPU into working with an incorrect value for the pointer dereferencing, and RAW hazards provide us with an option: in case of a read-after-write dependency, a LOAD following a STORE is supposed to return the value of the preceding STORE. So the basic algorithm for the new attack is:
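A minimal C sketch of that sequence, using the step names init, write, read, and look-up that the following discussion refers to. The function name, its parameters, and the pointer comparison standing in for the address check are assumptions made for illustration. Run as written, it behaves correctly and leaks nothing, because it contains none of the delaying or timing measurement a real attack needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of the four steps. `slot` is the memory cell
 * that holds the pointer, `probe` is an array used for the look-up. */
uint8_t ssb_sequence(const uint8_t *volatile *slot,
                     const uint8_t *secret_addr,
                     const uint8_t *sane_addr,
                     const volatile uint8_t *probe, size_t stride)
{
    uint8_t value = 0;

    *slot = secret_addr;             /* init: slot points at the secret     */
    *slot = sane_addr;               /* write: overwrite with a sane value  */
    const uint8_t *pointer = *slot;  /* read: must return the sane value    */

    /* Stand-in for the address check. In correct, sequential execution
     * this condition always holds. */
    if (pointer == sane_addr) {
        value = *pointer;
        (void)probe[(size_t)value * stride];  /* look-up: data-dependent access */
    }
    return value;
}
```

The volatile qualifiers only keep the compiler from removing the two stores and the re-load; they have no influence on what the CPU does speculatively.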
If everything was executed sequentially, the if block would see a sane pointer value and the secret would remain preserved. In case of a successful attack, however, the hardware is too slow to propagate the STORE effects so the LOAD will execute speculatively with the initial pointer value to the secret data.
Note that this has nothing to do with the branch prediction logic manipulation that the earlier published Spectre attack performs. The if clause will always be true during correct execution!
This simple example would not work in practice without a few additional modifications, though. First, we need some time after the “init” step in order to ensure that the “read” step will find the pointer to the secret data. Second, we have to stimulate the RAW hazard, i.e. we have to delay the “write” step such that the CPU executes the “read” step and at least the array “look-up” step speculatively with the pointer value from the “init” step before it detects the mistake. This means we have to “disguise” the address of the write instruction so the CPU needs some time to figure out the RAW dependency. That can be achieved by calculating this address somehow from further dependencies.
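One conceivable way to make the address of the write depend on further operations is to obtain it through a chain of dependent pointer loads. The text does not say which method is used, so the following is only a sketch of the data-dependency idea; the function name and the chain layout are assumptions, and on its own this code does not guarantee any particular delay.

```c
#include <stddef.h>

/* Follows `hops` pointer links starting at `link`. Every LOAD needs the
 * result of the previous one, so the final address is known only after
 * the whole chain has been walked. */
void *resolve_through_chain(void **link, size_t hops)
{
    void *p = link;
    for (size_t i = 0; i < hops; i++) {
        p = *(void **)p;
    }
    return p;
}
```

A caller would build the chain so that the last link holds the address of the pointer cell, and perform the “write” step through the returned address instead of naming the cell directly.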
The Spectre paper describes so-called gadgets, that is, vulnerable sequences in the machine code of another program, which need to be used in order to access its secrets. Finding such gadgets is more difficult here, given the complex requirements with respect to the memory operations.
Leveraging interpreters in the victim appears to be a more promising approach, as it may allow attackers to create suitable gadget code themselves. Mitigations intended to harden address validation are of no avail in the face of this new attack, as correct program execution never uses incorrect addresses. Attempts at closing the side channel, however, still offer protection, as the data exfiltration mechanism is identical. Since they do not address the root cause, some doubts remain whether, for example, reducing timer precision, as done by the Spectre updates for web browsers, is truly effective.
On systems without specific mitigation features, so-called fencing instructions provide programmers with some means to influence the reordering of memory accesses. An MFENCE, for example, ensures that all load and store operations are completed before continuing. Hence, placing an MFENCE before the address check would prevent speculative execution with outdated values. Similarly, an LFENCE serialises all load operations, that is, all preceding memory reads have to be completed before another load may execute. Thus, starting the protected code section with an LFENCE after the address has been sanitised also guarantees that the checks were performed with correct, up-to-date values.
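The LFENCE placement described above could look roughly as follows in C, using the SSE2 intrinsic _mm_lfence() from emmintrin.h. The function name, its parameters, and the range check are assumptions made for illustration. On non-x86 targets the macro in this sketch expands to nothing, and whether a fence at this position is sufficient on a given microarchitecture should be checked against the vendor's guidance.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>                 /* _mm_lfence(), _mm_mfence() */
#define LOAD_FENCE() _mm_lfence()
#else
#define LOAD_FENCE() ((void)0)         /* sketch targets x86 only */
#endif

/* Reads the pointer stored in *slot and dereferences it only if it lies
 * inside the allowed buffer. The fence sits between check and use. */
uint8_t fenced_read(const uint8_t *const volatile *slot,
                    const uint8_t *allowed, size_t allowed_len)
{
    const uint8_t *pointer = *slot;
    uintptr_t p  = (uintptr_t)pointer;
    uintptr_t lo = (uintptr_t)allowed;

    if (p < lo || p - lo >= allowed_len)
        return 0;                      /* address rejected */

    LOAD_FENCE();                      /* earlier loads complete before the next load */
    return *pointer;
}
```

The MFENCE variant from the text would instead place _mm_mfence() before the pointer is read and checked.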
Disabling out-of-order execution of memory accesses completely would come with a noticeable performance impact. In case of critical code sections, however, programmers would likely appreciate more control over hardware behaviour so they can make a trade-off. Intel has already released a microcode update that provides a Speculative Store Bypass Disable (SSBD) bit that the programmer can set in the IA32_SPEC_CTRL Model-Specific Register (MSR). According to benchmark results that are not publicly available, the performance impact varies roughly between 2 and 8 percent.
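For orientation, Intel's documentation places IA32_SPEC_CTRL at MSR index 0x48 and SSBD at bit 2 of that register. The sketch below only computes the new register value; the helper name is hypothetical, and actually writing the MSR requires the privileged WRMSR instruction, so this step is left to the operating system.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL 0x48u               /* MSR index */
#define SPEC_CTRL_SSBD     (UINT64_C(1) << 2)  /* Speculative Store Bypass Disable */

/* Hypothetical helper: returns the value to write back to IA32_SPEC_CTRL,
 * leaving all other bits of the current value untouched. */
uint64_t spec_ctrl_set_ssbd(uint64_t current, bool disable_bypass)
{
    return disable_bypass ? (current | SPEC_CTRL_SSBD)
                          : (current & ~SPEC_CTRL_SSBD);
}
```

Kernel code would read the MSR, pass the value through such a helper around a critical section, and write the result back.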