Hijacking Bitcoin: routing attacks on cryptocurrencies
Hijacking Bitcoin: routing attacks on cryptocurrencies Apostolaki et al., IEEE Security and Privacy 2017
The Bitcoin network has more than 6,000 nodes, responsible for up to 300,000 daily transactions and 16 million bitcoins valued at roughly $17B.
Given the amount of money at stake, Bitcoin is an obvious target for attackers.
This paper introduces a new class of routing attacks on the network. These aren’t supposed to be feasible since Bitcoin is a vast peer-to-peer network using random flooding. However, look a little closer and you’ll find:
- The Internet infrastructure itself is vulnerable to routing manipulation (BGP hijacks), and
- Bitcoin is really quite centralised when viewed from a routing perspective.
Suppose for a minute it was practical to intercept or otherwise manipulate connections (you’ll soon see that it is):
As Bitcoin transactions are routed over the Internet – in clear text and without integrity checks – any third-party on the forwarding path can eavesdrop, drop, modify, inject, or delay Bitcoin messages such as blocks or transactions.
Seventeen billion dollars, clear text, no integrity checks!
It turns out that if you can hijack fewer than a hundred BGP prefixes (feasible) you can isolate about 50% of the mining power in the network. Once a collection of nodes is partitioned from the rest of the network, the network becomes more vulnerable to double spending attacks, transaction filtering, and selfish mining attacks.
Nodes representing merchants, exchanges, and other large entities are thus unable to secure their transactions, or may not be able to broadcast them to the network to begin with. The resulting longer-term loss of trust in Bitcoin security may trigger a loss of value for Bitcoin. Attackers may even short Bitcoin and gain from the resulting devaluation.
The authors also demonstrate delay attacks which are effective against individual targets, but not against the network as a whole as the partitioning attacks are.
The core building block: BGP hijacking
BGP (Border Gateway Protocol) is the routing protocol that controls how packets are forwarded in the Internet. Routes are associated with IP prefixes, and are exchanged between neighbouring networks (Autonomous Systems, AS). The origin AS makes the original route announcement, and this then propagates through the network hop by hop.
In BGP, the validity of route announcements is not checked. In effect, this means that any AS can inject forged information on how to reach one or more IP prefixes, leading other ASes to send traffic to the wrong location. These rogue advertisements, known as BGP “hijacks”, are a very effective way for an attacker to intercept traffic en route to a legitimate destination.
If you want to hijack the traffic for a particular prefix, you can either advertise the same prefix (in which case you will attract traffic that is ‘closer’ to you than the genuine owner), or you can advertise a more specific (longer) prefix (up to /24) in which case longest match (i.e., yours) always wins. By leaving at least one path from the attacker to the destination untouched, a BGP hijack can be turned into an interception.
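To make the longest-match point concrete, here's a minimal sketch (mine, not from the paper) of a forwarding decision; the prefixes and AS numbers are invented for illustration:

```python
import ipaddress

# Hypothetical routing table: prefix -> the AS that announced it.
# The /16 is the legitimate announcement; the /24 is an attacker's
# more-specific hijack covering part of the same address space.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "AS666 (attacker)",
    ipaddress.ip_network("203.0.0.0/16"): "AS111 (legitimate origin)",
}

def next_hop(destination: str) -> str:
    """Longest-prefix match: among all matching prefixes, the most
    specific (longest) one wins, regardless of who announced it."""
    dst = ipaddress.ip_address(destination)
    matching = [p for p in routes if dst in p]
    best = max(matching, key=lambda p: p.prefixlen)
    return routes[best]

print(next_hop("203.0.113.7"))   # -> AS666 (attacker): the /24 hijack wins
print(next_hop("203.0.42.9"))    # -> AS111: outside the hijacked /24
```

This is also why the /24 boundary matters later on: announcements longer than /24 are generally filtered, so a node whose prefix is already a /24 cannot be stolen away by an even more specific advertisement, which is exactly one of the countermeasures the authors suggest.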
Is it really that simple to hijack Internet traffic??? I mean, does this really happen in practice? Sure it does! As evidence, the authors analysed six months of BGP updates (that’s 4 billion of them) to look for hijacks.
We see that there are hundreds of thousands of hijack events each month. While most of these hijacks involve a single IP prefix, large hijacks involving between 300 and 30,000 prefixes are also seen every month.
That’s all pretty alarming in general, but here we’re just focused on the Bitcoin specific nodes. Each month, at least 100 Bitcoin nodes are victims of hijacks. In November 2015 as an example, 7.8% of the network nodes were hijacked.
The structure of the Bitcoin network
The vulnerability of the Bitcoin network overall to routing attacks depends on the routing characteristics of the Bitcoin network itself. They conduct a study (details in section VI) to uncover the Bitcoin network topology. The key findings are as follows:
- Unsurprisingly, there seems to be some kind of power law at play whereby a few ASes host most of the Bitcoin nodes. E.g., only 13 ASes host 30% of the entire Bitcoin network.
- Just a few ASes naturally intercept the majority of Bitcoin traffic. Hurricane Electric, Level3, and Telianet for example, can together intercept more than 60% of all possible Bitcoin connections.
- Over 90% of Bitcoin nodes are vulnerable to BGP hijacks. 93% of them have prefixes shorter than /24!
- Mining pools are multi-homed and all use at least two ASes to connect to the Bitcoin network. Multi-homing is one of the main protections against routing attacks.
- Bitcoin routing properties are stable over time – the same IPs are present on average for 15.2 consecutive days.
Partitioning attacks
The goal of a partitioning attack is to isolate some set of nodes P from the rest of the network, effectively partitioning the Bitcoin network into two disjoint components. It may be that not all of the nodes in P can be isolated (due to ‘leakage points’ from connections in P that can’t be intercepted – e.g., a peering agreement between mining pools using private connections). So ultimately we try to find the maximal subset of P that can be partitioned off.
The attacker first diverts the traffic destined to nodes in P by hijacking the most-specific prefixes hosting each of the IP address. Once on-path, the attacker intercepts the Bitcoin traffic (e.g., based on the TCP ports) and identifies whether the corresponding connections cross the partition she tries to create. If so, the attacker drops the packets. If not, meaning the connections are contained within P, she monitors the exchanged Bitcoin messages so as to detect “leakage points”. Leakage points are nodes currently within P, which maintain connections with nodes outside of P that the attacker cannot intercept…
These leakage points are dropped from P until the maximal set of nodes that can be isolated is left.
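Here's a minimal sketch of that pruning loop (my paraphrase of the idea, not the authors' code). It assumes a hypothetical predicate `can_intercept(a, b)` that answers whether the attacker sits on the path between two nodes after the hijack:

```python
def maximal_isolatable(P, connections, can_intercept):
    """Iteratively drop 'leakage points' from the target set P: nodes that
    keep a connection to the outside world that the attacker cannot cut.
    `connections` maps each node to its current peers."""
    P = set(P)
    changed = True
    while changed:
        changed = False
        for node in list(P):
            for peer in connections[node]:
                # A connection crossing the partition that the attacker
                # cannot intercept leaks blocks into P, so drop the node.
                if peer not in P and not can_intercept(node, peer):
                    P.discard(node)
                    changed = True
                    break
    return P   # the largest subset of the original P that can be isolated
```

Connections that stay inside P are left alone; connections crossing the cut that the attacker can intercept are simply dropped, which is the packet-dropping step described in the quote above.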
The authors demonstrate the practicality of the attack by hijacking their own nodes:
First, we performed a real BGP hijack against our Bitcoin nodes and show that it takes less than 2 minutes for an attacker to divert Bitcoin traffic. (Emphasis mine)
Second, we estimated the number of prefixes to hijack so as to isolate nodes with a given amount of mining power. We found that hijacking only 39 prefixes is enough to isolate a specific set of nodes which accounts for almost 50% of the overall mining power. (Emphasis mine)
Delay attacks
Delay attacks slow down the propagation of new blocks sent to a set of Bitcoin nodes, without disrupting their connections. The attack can be targeted at selected nodes, or in theory conducted network wide. The authors’ analysis though shows that network wide attacks are not really feasible. To carry out a delay attack, the attacker first intercepts traffic using BGP hijacking with a forwarding route left open, and then tampers with the incoming and/or outgoing protocol messages. By tampering with outgoing GETDATA requests, a requesting node will never receive the blocks it is asking for. By tampering with incoming BLOCK messages the content can be corrupted so that the victim considers it invalid. Each of these actions triggers a 20 minute timeout.
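To picture the tampering step, here's a rough sketch of what corrupting an outgoing GETDATA could look like for an on-path attacker. This is purely illustrative (it is not the authors' interception software) and assumes the standard Bitcoin P2P framing: a 4-byte network magic, a 12-byte command, a 4-byte payload length, and a 4-byte checksum equal to the first four bytes of double-SHA256 of the payload.

```python
import hashlib

def checksum(payload: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]

def tamper_getdata(msg: bytes) -> bytes:
    """Corrupt the inventory hash inside an intercepted 'getdata' message
    so the peer never serves the block the victim actually asked for."""
    magic, command = msg[:4], msg[4:16]
    length = int.from_bytes(msg[16:20], "little")
    payload = msg[24:24 + length]
    if not command.startswith(b"getdata"):
        return msg                        # leave other message types untouched
    # payload = varint count | 36-byte entries (4-byte type + 32-byte hash)
    corrupted = bytearray(payload)
    corrupted[-1] ^= 0x01                 # flip one bit of the last block hash
    corrupted = bytes(corrupted)
    # recompute the checksum so the tampered message still parses as valid
    return magic + command + msg[16:20] + checksum(corrupted) + corrupted
```

The victim then waits for a block that will never arrive, triggering the 20 minute timeout mentioned above, while the underlying connection itself stays healthy so no alarm is raised.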
We verified the practicality of delay attacks by implementing an interception software which we used against our own Bitcoin nodes. We show that intercepting 50% of a node’s connections is enough to keep the node uninformed for 63% of its uptime.
For 68% of the nodes in the Bitcoin network, there is at least one AS other than their provider that intercepts more than 50% of their connections.
Countermeasures
The authors recommend the following short-term measures:
- Increase the diversity of node connections, for example by ensuring that all Bitcoin nodes are multi-homed.
- Select Bitcoin peers in a route-aware way, adding extra random connections if the same AS appears in all paths.
- Monitor RTT to detect sudden changes (increases) and establish extra random connections as a protection mechanism (a minimal sketch of this idea follows the list).
- Monitor additional statistics as early warning signals of an attack: distribution of connections, request-response intervals, simultaneous disconnection of peers, and so on.
- Embrace the natural churn of the network to continually refresh connections.
- Use gateways (for mining pools) in different ASes.
- Prefer peers hosted in the same AS, and in /24 prefixes.
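As a flavour of the RTT-monitoring suggestion in the list above, here's a minimal sketch (mine, not from the paper): keep a moving baseline of each peer's round-trip time, and treat a sudden sustained jump, which could mean the connection is now being detoured through a hijacker, as a trigger to open extra random connections.

```python
from collections import deque

class RttMonitor:
    """Flag a peer whose RTT jumps well above its recent baseline."""
    def __init__(self, window=50, factor=2.0):
        self.samples = deque(maxlen=window)
        self.factor = factor

    def observe(self, rtt_ms: float) -> bool:
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(rtt_ms)
        # Suspicious if the new sample is `factor` times the recent average.
        return baseline is not None and rtt_ms > self.factor * baseline

monitor = RttMonitor()
for rtt in [40, 42, 39, 41, 40, 150]:   # last sample: path suddenly much longer
    if monitor.observe(rtt):
        print(f"RTT jumped to {rtt}ms - consider opening extra random connections")
```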
Longer term, Bitcoin can encrypt connections and/or use a Message Authentication Code to detect tampering. The authors also recommend separating control and data channels, negotiating a set of random ports on connection that will be used to exchange Bitcoin data. This forces an AS-level adversary to look at all traffic, not just port 8333, which would be very costly. Other suggestions include adding UDP heartbeats to detect interception, and ensuring that any block requests are made on multiple connections in parallel.
The last word
Our work underscores the importance of proposed modifications which argue for encrypting Bitcoin traffic or traffic exchanged among miners. Yet, we stress that not all routing attacks will be solved by such measures since attackers can still disrupt connectivity and isolate nodes by dropping Bitcoin packets instead of modifying them.
SoK: Cryptographically protected database search
SoK: Cryptographically protected database search Fuller et al., IEEE Security and Privacy 2017
This is a survey paper (Systematization of Knowledge, SoK) reviewing the current state of protected database search (encrypted databases). As such, it packs a lot of information into a relatively small space.
As we’ve seen before, there are a wide variety of cryptographic techniques in play, and we’re seeing the increasing emergence of commercial solutions:
In the commercial space, a number of established and startup companies offer products with protected search functionality, including Bitglass, Ciphercloud, CipherQuery, Crypteron, IQrypt, Kryptnostic, Google’s Encrypted BigQuery, Microsoft’s SQL Server 2016, Azure SQL Database, PreVeil, Skyhigh, StealthMine, and ZeroDB. While not all commercial systems have undergone thorough security analysis, their core ideas come from published work. For this reason, this paper focuses on systems with a published description.
I’m going to start by jumping to the heart of the paper (for me at least), which looks at the different mechanisms for protected database queries, and their strengths and weaknesses.
Approaches to protected search
The section provides systematic reviews of the different cryptographic approaches used across query types and an evaluation of known attacks against them. Due to length limitations, we focus on the Pareto frontier of schemes providing the currently best available combinations of functionality, performance, and security.
To give you an idea where we’re heading, here’s the big summary table, with schemes grouped by query type (equality, boolean, range, or other).
After query type, the schemes are broken down by the primary approach that they take to protection: legacy, custom, oblivious, or other.
Legacy compliant schemes
Legacy schemes insert ciphertexts into a traditional database, without changing the indexing and querying mechanisms. Such schemes are based on property-preserving encryption which produces ciphertexts preserving some property (e.g., equality and/or ordering) of the underlying plaintexts. “As a result, Legacy schemes immediately inherit decades of advances and optimizations in database management systems.” As we saw a couple of weeks ago though, in practice implementations based on existing database engines may leak a lot more than the cryptographic theory taken on its own would lead you to believe.
Within the legacy bucket, deterministic encryption (DET) preserves only equality, by applying a randomized but fixed permutation to all messages. Even without a querier making any queries though, the server learns everything about equalities between data items. Order-preserving encryption (OPE) preserves the relative order of the plaintexts, so that range queries can be performed over ciphertexts:
This approach requires no changes to a traditional database, but comes at the cost of quite significant leakage: roughly, in addition to revealing the order of data items, it also leaks the upper half of the bits of each message.
Mutable OPE only reveals the order of ciphertexts at the expense of added interactivity during insertion and query execution. Legacy approaches can be extended to perform boolean queries and joins by combining the results of equality or range queries.
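To make the legacy idea concrete, here's a toy sketch of deterministic encryption (mine, not from the paper): a fixed key and a fixed permutation (single-block AES here) mean equal plaintexts always yield equal ciphertexts, so an unmodified database index over the ciphertext column can answer equality queries, and the server learns every equality relation without seeing a single query.

```python
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = b"0" * 16  # toy fixed key; a real deployment would derive and protect this

def det_encrypt(value: str) -> bytes:
    """Toy DET: pad to a single AES block and encrypt under a fixed key.
    Equal plaintexts always map to equal ciphertexts (values <= 16 bytes only)."""
    block = value.encode().ljust(16, b"\x00")
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    return enc.update(block) + enc.finalize()

# The server stores only ciphertexts, yet a plain index matches them directly...
rows = [det_encrypt(name) for name in ["alice", "bob", "alice"]]
query = det_encrypt("alice")
print([i for i, c in enumerate(rows) if c == query])  # -> [0, 2]
# ...and even with no queries at all, the server sees that rows 0 and 2 are equal.
print(rows[0] == rows[2])                             # True
```

An order-preserving scheme plays the analogous trick for comparisons, which is what lets range queries run against an unmodified database, at the cost of the much larger leakage described in the quote above.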
Even the strongest legacy schemes reveal substantial information about queries and data to a dedicated attacker.
Custom schemes
Custom schemes include inverted indices and tree traversal. Inverted index schemes traditionally supported equality searches on single table databases via a reverse lookup mapping each keyword to a list of identifiers for database records containing the keywords. OSPIR-OXT extends this to support boolean queries using a second index. Other inverted index schemes have been introduced to support selection, projection, and Cartesian product at the expense of additional leakage.
Cash and Tessaro demonstrate that secure inverted indices must necessarily be slower than their insecure counterparts, requiring extra storage space, several non-local read requests, or large overall information transfer.
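For flavour, here's a toy version of the inverted-index idea (a drastic simplification of published schemes, not taken from the paper): keywords are replaced by pseudorandom search tokens (HMAC under a client-held key), so the server can do the reverse lookup without ever seeing a keyword in the clear, while still learning which records match each query.

```python
import hmac, hashlib
from collections import defaultdict

CLIENT_KEY = b"client-secret"   # toy key held only by the querier

def token(keyword: str) -> bytes:
    """Deterministic search token for a keyword (a PRF under the client's key)."""
    return hmac.new(CLIENT_KEY, keyword.encode(), hashlib.sha256).digest()

# Client-side index build: map token -> record ids, then ship it to the server.
records = {1: ["bitcoin", "routing"], 2: ["routing"], 3: ["bitcoin"]}
encrypted_index = defaultdict(list)
for rid, keywords in records.items():
    for kw in keywords:
        encrypted_index[token(kw)].append(rid)

# Query: the client sends only the token; the server returns matching record ids.
print(encrypted_index[token("bitcoin")])   # -> [1, 3]
```

Even this toy exhibits the leakage profile that matters later: the server observes which (opaque) token was queried and which record identifiers came back, exactly the access-pattern information the inference attacks below exploit.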
Tree traversal schemes execute queries by traversing a tree and returning the leaf nodes at which the query terminates (sort of). The main challenge is to hide the traversal pattern through the tree. Tree-based schemes have also been developed that can support range queries.
Oblivious schemes
Oblivious schemes aim to hide common results between queries. Oblivious RAM (ORAM) has been researched for decades, and the performance of ORAM schemes has steadily increased, with the Path ORAM scheme being a current favourite.
Applying ORAM techniques to protected search is still challenging.
Of interest here is 3PC-ORAM which shows that by adding a second non-colluding server you can build an ORAM scheme which is much simpler than previous constructions.
Other schemes
There have been several custom schemes supporting other query types such as ranking boolean query results, calculating inner products of fixed vectors, and computing the shortest distance on a graph.
These schemes mostly work by building encrypted indices out of specialized data structures for performing the specific query computation.
Oh, you want to update the data as well???
Not all secure schemes support updating the data, especially those in the custom and oblivious categories. Most legacy schemes allow immediate queries after insertion. For custom schemes that do support updates, new values are typically inserted into a less efficient side index. Periodically a refresh operation updates the main database structures to include these updates – this is expensive though so it can’t be done too frequently.
Attacks
Now that we’ve got a basic map of the territory, let’s take a look at the leakage inference attacks the different schemes are vulnerable to. There’s another big table for those too:
From the first ‘schemes’ table that we looked at, take the ‘S leakage’ (server leakage) column and then look up the values there in this table to find out which schemes are vulnerable to which attacks.
We stress that leakage inference is a new and rapidly evolving field. As a consequence, the attacks in Table III only cover a subset of leakage profiles included in Table II. Additionally, this section merely provides lower bounds on the impact of leakage because attacks only improve over time.
Attacks are divided into two broad groups depending on whether the attacker is trying to recover data stored at the server (data recovery), or to learn the set of queries asked by a user (query recovery).
The required leakage columns tell you what kind of information leakage the attack depends on. ‘Init’ here means that some data is leaked during the initial loading of the data, without any queries being made, and ‘Query’ refers to information leakage during querying.
Examples include the cardinality of a response set, the ordering of records in the database, and identifiers of the returned records. Some attacks require leakage on the entire dataset while others only require leakage on query responses.
All current attacks require some prior knowledge – information about the stored data or the queries made. The prior knowledge column indicates how much an attacker needs to know. Some attacks also require the attacker to be able to insert records.
The attack efficacy columns give an idea of runtime required, amount of prior knowledge, and how large a keyword universe the different attacks can deal with.
Most of the attacks rely on the following two facts: 1) different keywords are associated with different numbers of records, and 2) most systems reveal keyword identifiers for a record either at rest (e.g., DET) or when it is returned across multiple queries.
The assumed attacker models for inference attacks are also fairly weak: there is the semi-honest attacker who follows protocols but tries to learn as much additional information as they can from what they observe, and an attacker capable of data injection (i.e., they can take some action that ultimately ends up in a record being updated in the database).
The authors identify the following open problem, which seems to be saying “it’s an open problem to find real world use-cases where the known leakage attacks don’t matter!”
… it is important to understand what these leakage attacks mean in real-world application scenarios. Specifically, is it possible to identify appropriate real-world use-cases where the known leakage attacks do not disclose too much information? Understanding this will enable solutions that better span the security, performance, and functionality tradeoff space.
Full database solutions
The following table summarises protected search systems that have made the transition to full database solutions.
- CryptDB enables most DBMS functionality with a performance overhead of under 30%.
- Arx is built on top of MongoDB and reports a performance overhead of approximately 10%.
- BlindSeer reports slowdowns of between 20% and 300% for most queries.
- OSPIR-OXT can occasionally beat a MySQL system with a cold cache (a somewhat strange comparison!), but is an order of magnitude slower than MySQL with a warm cache.
- SisoSPIR reports a 500% slowdown compared to a baseline MySQL system on keyword equality and range queries.
The systems offer different performance/leakage trade-offs. Considering just those based on relational algebra for the moment:
CryptDB is the fastest and easiest to deploy. However, once a column is used in a query, CryptDB reveals statistics about the entire dataset’s value on this column. The security impact of this leakage should be evaluated for each application. Blind Seer and OSPIR-OXT also leak information to the server but primarily on data returned by the query. Thus, they are appropriate in settings where a small fraction of the database is queried. Finally, SisoSPIR is appropriate if a large fraction of the data is regularly queried. However, SisoSPIR does not support Boolean queries, which is limiting in practice.
It is an open problem to construct protected search systems that leverage distributed server architectures.
In conclusion
There’s a lot of information in this paper that I didn’t have space to cover, but the overall message is that the market is early and adopters need to be careful to understand what guarantees they’re really getting and if those are sufficient for their use cases. I’ll leave you with this thought from the authors’ own conclusions which I rather like, as it speaks to the value of not knowing something:
Governments and companies are finding value in lacking access to individual’s data. Proactively protecting data mitigates the (ever-increasing) risk of server compromise, reduces the insider threat, can be marketed as a feature, and frees developer’s time to work on other aspects of products and services.
I’m not quite so sure about how much time it frees up, but I agree with the other points.
Cloak and dagger: from two permissions to complete control of the UI feedback loop
Cloak and dagger: from two permissions to complete control of the UI feedback loop Fratantonio et al., IEEE Security and Privacy 2017
If you’re using Android, then ‘cloak and dagger’ is going to make for scary reading. It’s a perfect storm of an almost undetectable attack that can capture passwords, pins, and ultimately obtain all permissions (aka, a ‘god app’), that can leave behind almost no trace, and that may not require the user to view and (knowingly) accept even a single permissions dialog. Oh, and there’s no easy fix: “we responsibly disclosed our findings to Google; unfortunately, since these problems are related to design issues, these vulnerabilities are still unaddressed.”
At the core of the attack are two permissions: SYSTEM_ALERT_WINDOW and BIND_ACCESSIBILITY_SERVICE. If an app can gain both of these permissions, it’s game on:
… we demonstrate the devastating capabilities the “cloak and dagger” attacks offer an adversary by showing how to obtain almost complete control over the victim’s device. In this paper, we will demonstrate how to quietly mount practical, context-aware clickjacking attacks, perform (unconstrained) keystroke recording, steal user’s credentials, security PINs, and two factor authentication tokens, and silently install a God-mode app with all permissions enabled.
The SYSTEM_ALERT_WINDOW permission lets an app draw overlays on top of other apps, and is widely used. For example, the Facebook, Twitter, LastPass, and Skype apps all use it, as do about 10% of the top apps on Google Play Store. Fortunately for our attacker, the permission is automatically granted to any app in the official Play Store, and by targeting SDK level 23 or higher, the user will not be shown the list of required permissions at installation time (the idea is to defer asking until runtime when the permission is actually required for the first time). So that’s one permission down, one to go – and the user has no way of knowing.
We note that this behaviour seems to appear to be a deliberate decision by Google, and not an oversight. To the best of our understanding, Google’s rationale behind this decision is that an explicit security prompt would interfere too much with the user experience, especially because it is requested by apps used by hundreds of millions of users.
The BIND_ACCESSIBILITY_SERVICE permission grants an app the ability to discover UI widgets displayed on the screen, query the content of those widgets, and interact with them programmatically. It is intended to help make Android devices more accessible to users with disabilities. Because of the security implications, this permission needs to be manually enabled through a dedicated menu in the settings app. It can be done in just three clicks, but they have to be the right three clicks of course. If you have the SYSTEM_ALERT_WINDOW permission though, getting the user to make those three clicks is fairly easy.
The example attack app built by the authors presents a simple tutorial with three buttons: “Start Tutorial”, then “Next” and a final “OK” button. When the OK button is clicked, the app shows a short video that lasts about 20 seconds. Except things are not what they seem. The tutorial app UI is actually an on-top opaque overlay and the “Start Tutorial” button click actually clicks on the accessibility service entry, the “Next” button click actually clicks on the ‘on/off’ switch, while the “OK” button click is clicking on the real OK button from the Settings app (the overlay UI has a carefully crafted hole at this exact point). The SYSTEM_ALERT_WINDOW settings are such that if a click is sent through to the underlying window from an overlay, then the overlay is not told the whereabouts of that click. Traditionally that makes click-jacking attacks difficult because you don’t know where the user clicked – the event reports coordinates (0,0) – and hence you don’t know when to update your overlay (fake) UI in response to a click so that the user remains unsuspecting.
Our technique works by creating a full screen opaque overlay that catches all a user’s clicks, except for the clicks in a very specific position on the screen (the point where the attacker wants the user to click). This is achieved by actually creating multiple overlays so as to form a hole: the overlays around the hole capture all the clicks, while the overlay on the hole is configured to let clicks pass through… since there is only one hole, there is only one way for a user’s click to not reach the malicious app.
In other words, if you see an event with coordinates (0,0), you know that the user clicked in the hole!
There are four fundamental attack primitives which can be combined in many creative ways, once the two permissions are obtained:
- You can modify what the user sees (as in the click-jacking example above)
- You can know what is currently displayed
- You can inject user input
- You can know what the user inputs and when
Here are 12 example attacks:
- Context-aware clickjacking – we just saw this in action to fool the user into granting the BIND_ACCESSIBILITY_SERVICE permission; overlays are used to lure the user into clicking on target parts of the screen.
- Context-hiding – using overlays that cover the full screen apart from carefully designed ‘holes’, so that the user is completely fooled concerning the true context of the button or other UI element that they click on.
- Keystroke inference – it turns out that even an app that only has the SYSTEM_ALERT_WINDOW permission can learn what keys you touch on the onscreen keyboard, including any private messages, passwords etc. This works by creating several small transparent overlays, one on top of each key on the keyboard. The overlays don’t intercept any clicks, so these go to the underlying keyboard – however, they do receive MotionEvents for touches outside of their area. The FLAG_WINDOW_IS_OBSCURED flag is set to true if a click event passed through a different overlay before reaching its final destination. Combine this with the MotionEvents, and a careful stacking of the overlays so that each has a different Z-level, and you can tell based on the obscured flag which key was actually pressed.
- Keyboard app hijacking – normally when an app attempts to get the text of a password field, the getText method just returns an empty string, but with the BIND_ACCESSIBILITY_SERVICE permission the keyboard app itself is treated as a normal unprivileged app, and each of the key widgets generates accessibility events through which all keystrokes can be recorded, including passwords.
- Targeted password stealing – use the accessibility service to detect that the user just launched e.g. the Facebook app login activity. Now draw a visible overlay that looks like the username and password EditText widgets, and an overlay on top of the login button. When the user clicks on the login button overlay, record the credentials and then pass them through to the real Facebook app. At the end of the attack, the user is logged into their real Facebook account and is completely unsuspecting.
- Security PIN stealing – if you use a security PIN, the security screen pad generates accessibility events which an accessibility app can see even when the screen is locked.
- Phone screen unlocking – accessibility apps can also inject events while the phone is locked. If a secure lock screen is not used, an accessibility app can inject events to enter the PIN, unlock the phone, and then go on to do whatever it wants. All this can be done while the phone screen remains off. “It is simple to imagine a malware that would unlock the phone during the night, and it would go on the web and click ads…” If you use a swipe pattern though, there is no way for an accessibility app to inject swipes.
- Silent app installation – the initial malicious app with only the two permissions can contain within itself an APK for another app which requests all permissions. It is possible to install this app while stealthily covering the screen with an on-top overlay. Remember the 20 second video that played as part of our tutorial earlier? That’s the overlay which gives the attacking app 20 seconds to go through the god-app installation process behind it, completely unbeknownst to the user. The installed app can be configured to not appear in the launcher, and to have device admin privileges so that its uninstall button is disabled. Disguising it as a legitimate system app (e.g., “Google Parental Control”) should complete the deception. The full power of this is best understood when you see it in action:
- Enabling all permissions – with the app installed as in the previous example, it is possible to automatically click and enable all the requested permissions while a full screen overlay is on top.
- 2FA token stealer – “for SMS-based tokens, the accessibility service can be configured to be notified for any new notification, which contains the required token.” What if you’re using an authenticator app though? It turns out a malicious app can just open the activity displaying the tokens and read them off the screen – none of the views containing the tokens are protected in e.g., the Google Authenticator App. Why not combine this with #7 and grab a bunch of tokens at night with the screen off?
- Ad hijacking – an accessibility app can figure out where and when ads are shown in an app, and draw an invisible on-top overlay that sends clicks to an ad of the attacker’s choice, generating revenue.
- Browse the web – the accessibility service has full access to the browser, so you can make Chrome visit any page, click on ‘likes,’ post content on behalf of the user, and so on. “Moreover, since all HTML elements are nicely exposed, we believe it would be simple to extend the password stealer attacks we described earlier to web-based forms.”
A user study with 20 users showed that none of the users suspected anything at all while interacting with the tutorial in the app, and after those three clicks their devices were fully compromised and a God-mode app installed.
None of the users was able to even suspect that they were under attack. What is more worrisome is that none of the users actually managed to understand what happened even after we told them the app they played with was malicious and even after we gave complete and free access to the compromised device. We also stress that the subjects could not detect anything even when they had a chance to directly compare what happened against a “benign” baseline: In a real-world scenario, there would not be any baseline to use as reference.
Section IX in the paper outlines some possible defences that Google could implement to prevent such ‘cloak and dagger’ attacks, which centre around a set of system and API modifications. These would clearly take some time to roll out. In the short term, not automatically granting SYSTEM_ALERT_WINDOW permissions to Play Store apps, and more closely scrutinising apps that request both permissions (the authors had such a test app easily accepted) would both help.
We responsibly disclosed all our findings to the Android security team, which promptly acknowledged the security issues we reported. When available, the fixes will be distributed through an over-the-air (OTA) update as part of the monthly Android Security Bulletins. Unfortunately, even if [though?] we reported these problems several months ago, these vulnerabilities are still unpatched and potentially exploitable. This is due to the fact that the majority of presented attacks are possible due to inherent design issues outlined in Section V. Thus, it is challenging to develop and deploy security patches as they would require modifications of several core Android components.
If you want to see videos of some of the attacks in action, as well as a table indicating which versions of Android are affected (as of time of writing, the word ‘vulnerable’ appears in every cell!), then check out http://cloak-and-dagger.org/.
IoT goes nuclear: creating a ZigBee chain reaction
IoT goes nuclear: creating a ZigBee chain reaction Ronen et al., IEEE Security and Privacy 2017
You probably don’t need another reminder about the woeful state of security in IoT, but today’s paper choice may well give you further pause for thought about the implications. The opening paragraph sounds like something out of science fiction – except that it’s a demonstrated reality today:
Within the next few years, billions of IoT devices will densely populate our cities. In this paper, we describe a new type of threat in which adjacent IoT devices will infect each other with a worm that will rapidly spread over large areas, provided that the density of compatible IoT devices exceeds a certain critical mass.
The ZigBee protocol is the radio link of choice for many IoT devices since it is simple and widely available at low cost and with low power consumption. It has much lower bandwidth than WiFi, but that doesn’t matter for many IoT applications. The popular Philips Hue smart lamps use ZigBee for example. Suppose you could build a worm that jumps directly from one lamp to another using their ZigBee wireless connectivity and their physical proximity. If the install base of lamps in a city is sufficiently dense, you could take them all over in no time, with the worm spreading like a physical virus. The authors of today’s paper demonstrate exactly how to build such a worm:
… we developed and verified such an infection using the popular Philips Hue smart lamps as a platform… The attack can start by plugging in a single infected bulb anywhere in the city, and then catastrophically spread everywhere within minutes.
If plugging in an infected bulb is too much hassle, the authors also demonstrate how to take over bulbs by war-driving around in a car, or by war-flying a drone. (I get the impression they had fun with this one!). Based on percolation theory, we can estimate that if a city such as Paris has around 15,000 or more randomly located smart lamps then the attack will spread everywhere within the city.
What we demonstrate in this paper is that even IoT devices made by big companies with deep knowledge of security, which are protected by industry-standard cryptographic techniques, can be misused by hackers to rapidly cause city-wide disruptions which are very difficult to stop and investigate.
I just p3wned your city!
The whole paper is a wonderful tour de force that combines two main elements:
- A Correlation Power Analysis (CPA) attack against the CCM encryption mode used to encrypt and verify firmware updates, allowing the authors to encrypt, sign, and upload malicious OTA updates to infect lamps.
- A takeover attack that allows the authors to take control over lamps from long distances without using custom hardware.
Let’s look at these two components in turn, and then revisit the analysis that shows how easily such a worm can spread through a city given a relatively low density of installed bulbs. We’ll conclude by considering some of the attacks this enables, and another appeal for better IoT security.
Uploading malicious firmware
ZigBee provides an OTA update feature that the authors exploit. They started out by finding several different older Hue models which had previously had software updates. By plugging these in and letting them update, a recording of the update process can be made for analysis, which enabled discovery of the hex code that needs to be sent over the air to request an OTA update. The team then wrote a python script to impersonate a lightbulb and ask for an OTA update using the code.
The firmware image that we recorded was not dependent on our impersonated lightbulb MAC address. This brought us to the conclusion that the software is protected by a single key that is shared at least between all the lamps from a specific model.
(Bad idea, but very common!).
On this assumption of symmetric cryptography, recovering the encryption and authentication keys from any one Philips Hue lightbulb will allow a software update to be performed on any other lightbulb, uploading malicious code of the attacker’s choosing. (At least once the OTA firmware image structure has been reverse engineered, which the authors also did).
To get those keys, the authors developed a side-channel power analysis attack to break the Philips bootloader:
Side channel power analysis measures the power consumed by a digital device as it performs operations. With this attack it is possible to infer something about the data being processed by the device, which was first demonstrated as a method of breaking cryptographic algorithms by Kocher et al.
Section 6 of the paper provides the full details of how differential power analysis (DPA) and correlation power analysis (CPA) were used to recover the key bits. We don’t really have space to cover it here (and besides, I’m not sure I could do it justice), but it’s well worth reading if you’re interested, and a great demonstration of the weakness of ‘security by obscurity’ even when you think you’re doing everything else right.
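For a feel of what a CPA attack does in general (a generic toy, not the attack on the Philips bootloader): for each guess of a key byte, predict a leakage value such as the Hamming weight of an intermediate result, correlate the predictions with the measured power traces, and the correct guess stands out with the strongest correlation. The sketch below simulates noisy Hamming-weight leakage rather than using real traces, and skips the S-box step a real AES attack would target.

```python
import numpy as np

rng = np.random.default_rng(0)
SECRET = 0x3C          # the key byte the 'device' is hiding
N = 2000               # number of simulated power traces

# Simulated measurements: the device leaks the Hamming weight of
# (plaintext byte XOR key byte), plus Gaussian measurement noise.
hw = np.array([bin(x).count("1") for x in range(256)])
plaintexts = rng.integers(0, 256, N)
traces = hw[plaintexts ^ SECRET] + rng.normal(0, 1.0, N)

# CPA: for every key guess, correlate predicted leakage with the traces;
# the correct guess produces the strongest correlation.
correlations = [
    abs(np.corrcoef(hw[plaintexts ^ guess], traces)[0, 1])
    for guess in range(256)
]
print(f"recovered key byte: {int(np.argmax(correlations)):#04x}")   # -> 0x3c
```

In a real attack the nonlinear S-box makes wrong key guesses decorrelate much faster, which is precisely why CPA targets such intermediate values.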
Reaching bulbs from a distance
At this point, if we can communicate with a bulb, we can upload our firmware to it. Communication normally requires close physical proximity though:
[The ZigBee chips in the Hue lamps] will ignore any request to reset or to change their affiliation unless it is sent from a ZigBee transmitter which is only a few centimeters away from the lamp. Even though the attacker can try to spoof such a proximity test by using very high power transmitters, the fact that the received power decreases quadratically with the distance makes such brute force attacks very hard.
Section 7 in the paper details how the authors managed to bypass the proximity check in the Atmel ZigBee chip’s Touchlink implementation. It all boils down to an assumption in the code that a message with a zero-value transaction id has already been rejected as invalid. However, this check is actually only carried out for Scan Request messages, and not for any other message in the protocol.
This means that we can send any other message assuming zero value for the Transaction and Response Ids and it will be received and processed as a valid message by the light.
So the team send a unicast ‘Reset to Factory New Request’ command to the light with a zero Transaction Id. Upon receiving the message the light undergoes a factory reset and begins scanning for a new network to join by sending ZigBee Beacon request messages. “…we respond with a ZigBee Beacon message with the Association Permit flag set to true. This causes the light to start a ZigBee association process and join our network.”
For this part of the attack:
We install 3 Philips Hue lamps in offices at the first floor of our faculty building. We successfully tested our full attack from a car which was parked across the lawn at a distance of about 50 meters… We then successfully tested our attack while ‘wardriving’ the car at the far edge of the lawn.
Then they installed 5 smart bulbs in the third floor of an office building and mounted the attack kit onto a DJI drone. Successful takeover was achieved with a fly-by (warflying) and attack code that caused all the lamps to repeatedly signal SOS in Morse code while the drone hovered in front of the building.
Building a self-spreading worm is now possible by combining the ability to sign and encrypt arbitrary binaries with our knowledge of the long-range takeover attack vulnerability.
How many bulbs do you need before the infection takes over a whole city?
Once you can spread from bulb to bulb, all you need to do is infect one (or a small number) of seed bulb(s) and let nature take its course. Recall from some of the network theory we’ve looked at previously that in a random graph, once a certain number of edges have been added a phase change occurs in which a single giant connected component emerges. If we model this by assuming N smart lamps (nodes) placed randomly within a city, and an edge between any two lamps that are less than distance D apart, then the field of percolation theory can tell us the critical value of N (the percolation threshold) at which this giant connected component should emerge.
For the city of Paris, with a total area of 105 square kilometers, and assuming D = 100 meters we find that the critical mass of installed lamps in the whole city of Paris needs to be about N = 15,000.
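The back-of-the-envelope behind that number is pleasingly simple: for random geometric graphs (continuum percolation), a giant component appears once the mean number of neighbours per node reaches roughly 4.5. A sketch of the arithmetic, using the paper's figures of a 105 km² area and D = 100 meters:

```python
import math

CRITICAL_MEAN_DEGREE = 4.51   # percolation threshold for 2D random geometric graphs
area_km2 = 105                # Paris
d_km = 0.1                    # lamp-to-lamp infection range (100 m)

# mean degree = density * pi * D^2  =>  critical density = 4.51 / (pi * D^2)
critical_density = CRITICAL_MEAN_DEGREE / (math.pi * d_km ** 2)
critical_lamps = critical_density * area_km2
print(f"~{critical_lamps:,.0f} randomly placed lamps")   # ~15,000
```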
Since the Philips Hue smart lamps are very popular in Europe and especially in affluent areas such as Paris, there is a very good chance that this threshold has in fact already been exceeded, and thus the city is already vulnerable to massive infections via the ZigBee chain reaction described in this paper.
You took over my bulb, so what?
- A Bricking attack makes the attack irreversible such that any effect caused by the worm (blackout, constant flickering etc.) is permanent. Once the worm is in place, it can determine what further OTA updates to accept if any.
- Wireless network jamming uses the ZigBee band, which overlaps with WiFi. By going into ‘test mode’ which transmits a continuous wave signal without first checking for a clear channel, it would be possible to mount DoS attacks disrupting all WiFi or other 2.4GHz traffic.
- Data infiltration and exfiltration using Philips Hue lamps demonstrated by Ronen and Shamir at a rate of about 10KB per day. Using infected lamps similar covert channels can be created but at much higher rates.
- Let’s get a bit more serious… Epileptic seizures – it is possible to use the Philips Hue to trigger epileptic seizures, or to drive LEDs at frequencies that increase long-term discomfort in humans. Imagine this happening simultaneously across a whole city!
Can we please take IoT security a little more seriously now?
Our attacks exploit a specific implementation bug of the Touchlink commission protocol and a specific design for the OTA update process, but they are just an example of the way security in IoT is designed today. The Atmel code we reviewed was very well written and documented, but it is extremely difficult to implement complex state machines of such protocols without any bugs. The main problem is in the insecure design of the ZLL standard itself. We believe this will not be the last bug or attack found against ZLL commissioning… The sharp contrast between the open and inclusive manner in which the TLS 1.3 standard was designed and the secretive work on the ZigBee 3.0 specification that is still not open to the public, is a big part of the problem.
The attack is also another reminder that “security by obscurity has failed time after time,” and any reuse of keys across devices is a big security risk.
We should work together to use the knowledge we gained to protect IoT devices or we might face in the near future large scale attacks that will affect every part of our lives.
The password reset MitM attack
The password reset MitM attack Gelernter et al., IEEE Security and Privacy 2017
The Password Reset Man-in-the-Middle (PRMitM) attack is really very simple, but that doesn’t mean it’s not dangerous. It involves persuading the user to sign-up for an account for some service under the attacker’s control (maybe there’s an enticing free download for example), and then manipulating the registration flow such that the attacker is actually able to reset the password for the user’s account on some other system. An especially good target is the user’s email account. As we know, it’s all downhill from there. In addition to explaining the attack, and demonstrating it in user experiments with Google and Facebook users, the authors also provide a set of recommendations for implementing a password reset process to minimise the risks of PRMitM attacks.
We evaluated the PRMitM attacks on Google and Facebook users in several experiments, and found that their password reset process is vulnerable to the PRMitM attack. Other websites and some popular mobile applications are vulnerable as well… Since millions of accounts are currently vulnerable to the PRMitM attack, we also present a list of recommendations for implementing and auditing the password reset process.
The findings were reported to vulnerable vendors. Snapchat and Yahoo! fixed the vulnerability, Google, LinkedIn and Yandex indicated that they plan to fix it. Other less vulnerable websites (e.g. Facebook) indicated they would consider using the findings in the future but do not plan to apply fixes anytime soon.
I forgot my password
Sometimes users genuinely do forget their passwords, and then there needs to be a way to reset them. You can have the strongest password / authentication mechanism in the world of course, but if the password is easy to reset none of that matters. So how do you make a password reset process that’s at least as strong and secure as the regular authentication process?
If you’re using one of the ten most popular websites, here are the challenges you may be presented with during the reset process:
(The authors don’t discuss if/how this changes when you have two-factor authentication enabled – do many sites then require you to provide a backup code or recovery key for example? I didn’t have time to investigate all of the sites listed above).
CAPTCHAs are really only there to make it harder for attackers to automate going through the password reset process; they don’t provide any fundamental security. Security questions may relate to answers you provided during account creation, or to information you may know about your account (for example: which of these contacts is in your address book…). Phone codes refers to either a code sent to a registered phone number via SMS, or an automated call that may be placed to that number which reads you a code. The most common defense is sending a password reset link in an email; unfortunately, this mechanism is usually not relevant for email accounts themselves.
Let’s be honest, bar the email link, none of these mechanisms are particularly great: security questions are known to be problematic, and SMS is wide open – weaknesses in the SS7 protocol let anyone redirect your phone calls and text messages:
“It’s so simple, it’s almost embarrassing to call it a hack” – So hey, you should stop using texts for two-factor authentication, Wired 2016.
The authors of this paper also conducted a short experiment to see whether users would give ‘correct’ answers to security questions for low-value websites (e.g. “What is your mother’s maiden name?”). Almost 80% of users in the study provided the correct answer to the security question, indicating the problem of relying on security questions. You could make up answers of course, but then you’ll need a way of remembering those answers when you’ve forgotten your password…
The problems with security questions and the popularity of mobile phones has made the authentication using mobile devices a preferred option for password recovery. The most common way to authenticate a user via mobile phone is by sending a code to the device. The user then has to insert the received code into the website to reset the password.
Spear-phishing attacks have been targeting the SMS codes sent during password reset for at least a couple of years (see “Password recovery scam tricks users into handing over email account access“). If that spear-phishing attack isn’t quite subtle enough to dupe enough users, and hijacking SMS via SS7 seems like a bit of a hassle, then there’s always the PRMitM attack.
The basic PRMitM attack
Phishing attacks exploit the users; there is no bug in the design of the attacked website and the attacker exploits unwary users who ignore indications given to them by the browsers. On the other hand, PRMitM attacks exploit bugs in the design of the password-reset process.
The fundamental insight behind a PRMitM attack is that many of the challenges and steps in a typical registration process are similar to those in a password reset process. The attacker sets up a website that encourages a user to create an account. When the user registers they provide their email address, which the attacker uses to initiate a password reset process with the email provider. Every challenge sent by the email provider (e.g., security questions) is forwarded to the user and appears to be part of the registration process. The answers supplied by the user are then passed on to the email provider as part of the password reset flow.
It’s easy to see how that works with security questions, but what about SMS codes? As part of the fake registration process, the attacker asks the victim for their phone number, claiming that a code will be sent to it that must be entered to complete registration. Now as part of the password reset flow initiated by the attacker, the email provider sends a code via SMS to the user’s phone. The user enters this code on the attacker’s website, and the attacker in turn supplies it to the email provider. A similar approach works if the email provider reads a code to the user via a phone call.
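Here's a toy mock-up of that relay, just to make the message flow explicit (all the classes and helpers are invented for illustration; nothing here is from the paper):

```python
class EmailProviderReset:
    """Stand-in for the target provider's password-reset flow (toy only)."""
    def __init__(self, email, sms_code):
        self.email, self.sms_code, self.done = email, sms_code, False

    def challenges(self):
        yield f"We sent a code to the phone for {self.email}. Enter it:"

    def respond(self, answer):
        self.done = (answer == self.sms_code)

def prmitm_registration(victim_answers, start_reset):
    """Attacker's fake sign-up flow doubling as a password-reset proxy."""
    email = victim_answers("Enter your email to register")
    reset = start_reset(email)                       # attacker -> real provider
    for challenge in reset.challenges():
        # Re-present the provider's challenge as a harmless 'registration' step.
        reset.respond(victim_answers("To verify your phone: " + challenge))
    return reset.done                                # provider now accepts a new password

# Toy run: the victim faithfully types the SMS code they just received.
provider = EmailProviderReset("victim@example.com", sms_code="482913")
hacked = prmitm_registration(
    victim_answers=lambda prompt: "victim@example.com" if "email" in prompt else "482913",
    start_reset=lambda email: provider,
)
print(hacked)   # True - the attacker can now set a new password
```

The essential trick is that the victim is answering the provider's challenges while believing they are completing the attacker's sign-up form.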
Why are users fooled??
Firstly, there’s the simple fact that users are expecting a code to be sent to them as part of a registration process they initiated, so when a code arrives it is not surprising. Ideally at least 3 pieces of information should appear in the SMS message sent by a provider in addition to the code: (i) the name of the sending website, (ii) an explanation that the code is for password reset, and (iii) a warning to avoid using the code anywhere but the sending website. None of the evaluated websites’ messages included all of these elements:
- Yandex, Google, and Yahoo in Russian send just a code
- LinkedIn, Google and Yahoo (other languages) send website name and code, but no indication it is for password reset
- Facebook, Twitter, and Microsoft include website, code, and purpose (password reset)
- None of the tested sites include a warning about not entering the code anywhere else.
Even if the SMS did have all of these elements, the authors found that many users don’t even read the text or even open the message, but just quickly grab the code from the notification that pops up on the phone.
In a series of experiments run by the authors using a PRMitM attack against their Facebook accounts, most ‘victims’ (they weren’t actually hacked, but the researchers demonstrated they could) remained completely unaware that they had been hacked. The same thing happened with phone call based verifications when running PRMitM attacks against Google accounts.
If you look at messaging applications, you find similar vulnerabilities in their password reset mechanisms too:
Defences and guidelines
When security questions are directly related to the actions done by the user on the site (for example, Google asks questions about contacts, labels, and so on), they can’t be forwarded to users as legitimate-looking security questions for other sites.
Nevertheless, it is desirable to avoid relying on security questions, as they can be bypassed by attackers, especially if the attacker is related to the victim.
Better is not to send a code in clear text over SMS (and even better in my mind is not to use SMS, due to its vulnerabilities, but leaving that aside…), but instead to send a link via SMS. Asking a victim to copy and paste a link from an SMS into the attacker’s website would certainly look like an unnatural act. The authors call this scheme Link-via-SMS (LVS) based password reset.
A long link is better than just a long code. The natural user interaction with links is to press on them. On the other hand, there is always a chance that a user will just copy the code without reading the message. In our implementation of the LVS, the link refers the user to an interactive page that has an alert about the attempt to reset the user password.
In tests, using the LVS approach caused all participants in the study to detect and thwart the PRMitM attack.
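A rough sketch of the difference between the two SMS styles (hypothetical message formats, mine rather than the authors'): the bare code is trivially relayed by a victim, whereas the LVS message contains no code to copy, names the site and the purpose, and the link lands on an interactive page that warns about the reset.

```python
import secrets

def code_sms(site: str) -> str:
    # The style most sites use today: nothing stops the victim pasting
    # this code into an attacker's "registration" form.
    return f"{site}: your verification code is {secrets.randbelow(10**6):06d}"

def lvs_sms(site: str, user_id: str) -> str:
    # Link-via-SMS: a one-time link instead of a code. Resetting requires
    # clicking through an interactive warning page, not relaying digits.
    token = secrets.token_urlsafe(16)
    # store_reset_token(user_id, token)  # server-side bookkeeping, omitted here
    return (f"{site}: someone asked to RESET YOUR PASSWORD. If this was you, "
            f"confirm at https://{site}/reset/{token} - never share this link.")

print(code_sms("example.com"))
print(lvs_sms("example.com", "victim"))
```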
Websites should also always send a notification after both password reset and password change – via both email and SMS. Of the websites tested, only Google sends an SMS notification on password change.
In section IX of the paper you’ll find a quick checklist that can be used to audit and secure password reset procedures. If you must use SMS etc., these are worth checking out. I’ll leave you with the following reminder though:
Finally, although the recommendations of this section are given mainly in the perspective of the PRMitM attack, it is important to note that according to the NIST Digital Authentication Guidelines, due to other security problems it is not recommended to rely only on SMS or phone calls for authentication.
How they did it: an analysis of emission defeat devices in modern automobiles
How they did it: an analysis of emission defeat devices in modern automobiles Contag et al., IEEE Security and Privacy 2017
We’ll be looking at a selection of papers from the IEEE Security and Privacy 2017 conference over the next few days, starting with this wonderful tear down of the defeat devices used by Volkswagen Group (and Fiat Chrysler Automobiles as it turns out) in their ECUs. Through a static analysis tool the authors built, it is even possible to watch the evolution of defeat devices over a period of 8 years.
On September 18, 2015, the US Environmental Protection Agency (EPA) issued a notice of violation to the Volkswagen Group, accusing one of the world’s largest automakers of circumventing the EPA’s emissions tests, setting into motion the most expensive emissions scandal in history.
The use of so-called defeat devices is possible of course because pretty much everything that goes on in a modern car is controlled by software – over 70 electronic control units and about 100 million lines of code in a premium class automobile. At the heart is the Engine Control Unit, or ECU, which operates a closed loop control system between engine sensors and actuators.
… while some emission control measures, like the catalytic converter or particulate filters, are passive, many others require active control by the ECU, which must sometimes sacrifice performance or efficiency for compliance. These tradeoffs are particularly challenging for diesel engines.
The authors find that both the Volkswagen group and Fiat Chrysler Automobiles used an EDC17 diesel ECU manufactured by Bosch:
Notably, we find strong evidence that both defeat devices were created by Bosch and then enabled by Volkswagen and Fiat for their respective vehicles.
(The evidence strongly suggests that Bosch builds the ECU hardware and software, and then manufacturers configure the ECU for each vehicle model using configuration variables documented by Bosch).
To understand how the defeat devices work, we first need to take a look at the means used by modern cars to control emissions. After that we’ll take a look at how the Volkswagen and Fiat devices work, and finally we’ll touch upon the CurveDiff tool the authors wrote to automatically analyse firmware and detect certain kinds of defeat devices.
Controlling emissions
I learned a lot more about diesel engines and how they control emissions in this CS paper than I ever knew before!
In a diesel engine, air is first drawn into the combustion cylinder, and then fuel is injected, which ignites in the compressed air. The mixing therefore happens at the time of ignition and so can never be perfect. “This is responsible for many of the diesel engine’s distinctive characteristics, including the black smoke and heavy knocking sound known as ‘diesel knock.'” The black smoke (particulate matter) is a result of the incomplete combustion of the fuel. The second major pollutants are nitrogen oxides (NOx). The emissions standards limit the amount of particulate matter and NOx that can be emitted.
There are four emission control devices that help to regulate pollution:
- Exhaust Gas Recirculation (EGR) recirculates exhaust gas back into the engine intake. This significantly reduces NOx, but unfortunately it also increases the amount of particulate matter.
- An NOx Storage Catalyst (NSC) works by oxidizing NO to NO2 and then storing NO2 in the catalyst itself. After 30-300 seconds the catalyst capacity is exhausted and must be regenerated by switching the engine to a rich fuel-air mixture for 2 to 10 seconds. During this period, the engine is less efficient (reduced fuel economy).
- Selective Catalyst Reduction (SCR) is an NSC alternative that works by injecting urea (trade name AdBlue) into the exhaust stream to reduce NOx emissions. SCR is more effective than NSC, and usually used in diesel engines of size 3 litres and above. “The drawback of SCR is its increased complexity and the need to carry and replenish the urea fluid. Several Volkswagen vehicles implicated in the emission cheating scandal are reported to limit urea injection levels outside of a test cycle.”
- Diesel Particulate Filters (DPF) trap particulates (soot), reducing the amount of black smoke leaving the tailpipe. A DPF needs to be purged of accumulated particulates approximately every 500 km, in a regeneration cycle that lasts 10 to 15 minutes. This requires high exhaust temperatures only achieved at full load. Since this may not occur in regular driving (track day anyone?), the ECU may need to perform active regeneration in which engine operation is adjusted to increase exhaust temperature. This can only work if the vehicle is driven for longer distances. The vehicle needs servicing if regeneration does not happen.
… according to the New York Attorney General’s complaint, at normal load Volkswagen’s DPF could only last 50,000 miles before needing replacement, far short of the 120,000 miles standard Volkswagen was required to meet, compelling Volkswagen to reduce wear on the DPF.
Emissions tests are carried out on a chassis dynamometer, which holds the car in place while its drive wheel turns with varying resistance. Emissions are measured during the test and compared to emissions standards (EPA in the US, the more stringent CARB in California, and Euro 1 through 6 in Europe). To ensure repeatable and comparable measurements of exhaust emissions to evaluate emission compliance, the test goes through exactly the same cycle each time, and these cycles are known to the manufacturers.
How Volkswagen and Fiat cheat the emissions tests
A defeat device is a mechanism that causes a vehicle to behave differently during an emission test than on the road.
Essentially the ECU will have two modes: ‘test’ mode and ‘real driving’ mode. By monitoring the observed conditions, the ECU can determine which mode to operate in. You don’t need perfect detection, so long as you tune things such that when you get it wrong, you err on the side of thinking you’re undergoing an emissions test when in fact you’re not.
There are loads of pieces of information you can use to guess at whether or not you’re in a test (I think it would be super easy to train a highly accurate classifier!). For example, from 1991 to 1995 General Motors was caught using the fact that air conditioning was turned off (tests were done at the time with AC off). Whether the steering wheel is being turned seems to be another easy one. The defeat device found in the Fiat (as of the time of paper publication, Fiat Chrysler Automobiles had not acknowledged the device) uses a very simple test – if the engine has been running for more than 26 minutes and 40 seconds (the test duration) then it’s not in a test!
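To make that concrete, here’s a minimal sketch of what a timer-based check like Fiat’s amounts to. This is not the real ECU code (which is Bosch firmware driven by configuration parameters); the 1,600-second threshold is the only detail taken from the paper, everything else is illustrative:

```python
# Minimal sketch of a timer-based defeat device in the spirit of the Fiat
# check described above. Not real ECU code: the threshold is the only detail
# taken from the paper; the function and mode names are made up.

TEST_CYCLE_DURATION_S = 26 * 60 + 40   # 26 minutes 40 seconds = 1,600 s

def emissions_mode(engine_runtime_s: float) -> str:
    """Err on the side of assuming a test: stay in 'test' (clean) mode until
    the engine has been running for longer than the known test cycle."""
    if engine_runtime_s <= TEST_CYCLE_DURATION_S:
        return "test"           # full emission controls active
    return "real_driving"       # controls relaxed for performance / component life

assert emissions_mode(600) == "test"
assert emissions_mode(2000) == "real_driving"
```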
Volkswagen’s test detection mechanism is more sophisticated. Here’s the schematic from the Bosch function sheet:
The flip-flop (FF) labeled (1) signals whether or not the vehicle could currently be undergoing an emissions test. It is initially set to true if five conditions (2) all hold when the engine starts.
Since the pressure and temperature conditions are easily satisfied, the ECU essentially assumes it might be in a test whenever the engine is started normally. The ECU then monitors its sensors, and sets the car into normal driving mode (3) if any one of four deactivating conditions occurs (4):
- A configurable time period has elapsed since the acceleration pedal first exceeded a configurable threshold.
- The rev counter exceeds a configurable threshold.
- The ECU is configured not to perform emissions test detection (this would be a one-off factory setting as I understand it). This condition is never triggered.
- A fourth condition, added in a later revision of the firmware, measures how long the vehicle has been driving at a certain speed and the distance travelled in the current driving cycle. There are seven pairs of test cycle curves; in each pair, one curve gives the minimum distance travelled over time, and the other the maximum. Together they define a time-distance profile. If the car strays outside of this time-distance profile, a flip-flop is set.
If all seven flip-flops are set, then the vehicle has strayed outside the profile of each of the curves at least once since start. If this happens, the acoustic condition [indication that the car could be undergoing a test] is canceled. In the firmware we examined, these curves are used to define the precise time-distance profile of known emission tests.
(Emphasis mine).
The number of profiles checked has been increasing with time, from seven in 2009 with the version of the firmware discussed above, to ten today in VW part number 03L906012. The most recent three profiles also include a steering wheel angle check that resets the flip-flop for the given profile if the steering wheel angle deviates by more than 20 degrees from the neutral position.
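Pulling the curve-based mechanism together, here’s a small sketch of the logic as I read the description above. The interpolation helper, the profile data structure, and the way I’ve encoded the steering-wheel reset are all my own assumptions; in the real firmware each corridor is a pair of _CUR configuration curves evaluated by Bosch’s code:

```python
# Illustrative sketch of the curve-based test detection described above.
# Everything here is made up for illustration; in the firmware each profile
# is a pair of _CUR curves giving the min/max distance travelled over time
# for a known emissions test cycle.

def interp(curve, t):
    """Piecewise-linear lookup; curve is a list of (time_s, distance_m) points."""
    if t <= curve[0][0]:
        return curve[0][1]
    t0, d0 = curve[0]
    for t1, d1 in curve[1:]:
        if t <= t1:
            return d0 + (d1 - d0) * (t - t0) / (t1 - t0)
        t0, d0 = t1, d1
    return curve[-1][1]

class Profile:
    """One time-distance corridor; 'deviated' plays the role of its flip-flop."""
    def __init__(self, min_curve, max_curve, check_steering=False):
        self.min_curve = min_curve
        self.max_curve = max_curve
        self.check_steering = check_steering   # only the newest three profiles
        self.deviated = False                  # the per-profile flip-flop

    def update(self, t, distance, steering_angle_deg):
        if not (interp(self.min_curve, t) <= distance <= interp(self.max_curve, t)):
            self.deviated = True               # strayed outside this cycle's corridor
        # Per the description above, the newest profiles reset their flip-flop
        # when the steering wheel deviates more than 20 degrees from neutral.
        if self.check_steering and abs(steering_angle_deg) > 20:
            self.deviated = False

def possibly_in_test(profiles):
    """The 'acoustic condition' stays set until every profile has been deviated from."""
    return not all(p.deviated for p in profiles)
```

Presumably the real check runs continuously while the ‘could be a test’ flip-flop from the schematic is still set, with the deactivating conditions above able to clear it at any point.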
When the vehicle detects it is not in a test, the fuel injection behaviour is modified by adding a value based on engine speed (under test conditions, engine speed is ignored).
Exhaust Gas Recirculation is also altered when not under test.
The following table summarises all of the subsystems whose behaviour can be altered based on whether or not the vehicle thinks it is under test:
In the Fiat 500X, NSC regeneration is configured to happen more frequently in the first 26 minutes and 40 seconds of engine operation (test duration).
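A tiny sketch of what this kind of mode-dependent behaviour looks like; the interval values below are invented, and the only detail taken from the paper is the 1,600-second boundary:

```python
# Sketch of mode-dependent NSC regeneration scheduling, in the spirit of the
# Fiat 500X behaviour described above. The interval values are invented; the
# only detail taken from the paper is the 26 min 40 s boundary.

def nsc_regeneration_interval_s(engine_runtime_s: float) -> int:
    if engine_runtime_s <= 26 * 60 + 40:
        return 120    # regenerate often while a test could still be in progress
    return 600        # regenerate less often afterwards, saving fuel
```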
The CurveDiff tool for analysing firmware
Based on the insights obtained in our case studies, we designed a static analysis tool that helps us to identify a defeat device in a given firmware image. We implemented a prototype of this approach in a tool called CurveDiff for EDC17 ECUs that enables us to track the evolution and behavior of such a device across a large number of firmware images.
In the Bosch firmware, curve definitions are one of four different configuration parameter types, identified by the suffix _CUR. CurveDiff disassembles a firmware image, creates a static single assignment (SSA) form of the logic contained within it, and then analyses curve function invocations to match curve checks against known test cycles.
The reference test cycles as used for emissions testing are available either free of charge or for a small subscription fee. In most cases, the cycles are given in the form of two-dimensional data points, containing information about the elapsed time in seconds and the speed at that point.
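CurveDiff’s matching works on the disassembled firmware, but the final comparison step is easy to picture: integrate a published reference cycle’s (time, speed) points into a cumulative distance trace, and check whether a recovered min/max curve pair brackets it. A rough sketch, reusing the interp helper from the earlier profile sketch; the function names and tolerance are my own, not CurveDiff’s:

```python
# Rough sketch of matching a firmware curve pair against a reference test
# cycle. Reuses interp() from the earlier sketch. Names and the slack value
# are invented; CurveDiff's actual matching is described in the paper.

def cumulative_distance(cycle):
    """cycle: list of (time_s, speed_m_per_s) points -> list of (time_s, distance_m)."""
    out, dist = [(cycle[0][0], 0.0)], 0.0
    for (t0, v0), (t1, v1) in zip(cycle, cycle[1:]):
        dist += 0.5 * (v0 + v1) * (t1 - t0)   # trapezoidal integration of speed
        out.append((t1, dist))
    return out

def brackets_reference(min_curve, max_curve, reference_cycle, slack_m=50.0):
    """True if the firmware's min/max corridor encloses the reference cycle."""
    return all(
        interp(min_curve, t) - slack_m <= d <= interp(max_curve, t) + slack_m
        for t, d in cumulative_distance(reference_cycle)
    )
```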
Using CurveDiff, the team analysed 963 firmware images; the analysis completed successfully for 924 of them. 406 of the analysed images contained a defeat device, of which 333 contained at least one active profile. In at least 268 images, the test detection affects the EGR. Firmware images released on Dec 3rd 2014 are used in VW Passat cars, and include the refinement to the defeat device to detect steering wheel angle that we discussed previously.
This refinement of the defeat device is noteworthy given that at that point in time, CARB had already started to investigate emission abnormalities in Volkswagen cars.
Here’s a summary of the firmware images in which defeat devices were detected, and the cars that use them:
It will be interesting to see how the court case(s) progress!
Hardware is the new software
Hardware is the new software Baumann, HotOS’17
This is a very readable short paper that sheds an interesting light on what’s been happening with the Intel x86 instruction set architecture (ISA) of late. We’re seeing a sharp rise in the number and complexity of extensions, with some interesting implications for systems researchers (and for Intel!). We’re also seeing an increasing use of microcode blurring the role of the ISA as the boundary between hardware and software.
We argue that these extensions are now approaching software-like levels of complexity, yet carry all the attendant drawbacks of a hardware implementation and the slow deployment cycle that implies. We suspect that the current path may be unsustainable, and posit an alternative future with the ultimate goal of decoupling new ISA features from the underlying hardware.
(Note, the author is from Microsoft Research, not from Intel).
The old days: steady and contained growth
The 386 reference manual listed 96 instructions. New instructions were steadily added over time in areas such as floating point support, vector extensions, crypto accelerators, 64-bit, and so on. Crucially though:
… past system designers could, for the most part, ignore such changes… With the notable exception of 64-bit mode and virtualisation extensions, OS developers on x86 were occasionally given tweaks to improve performance or correct glaring shortcomings, but otherwise ignored. Even 64-bit mode didn’t substantially increase architectural complexity…
Why buy a new CPU?
With the slowing of Moore’s law, Intel’s traditional two year tick-tock development model is also slowing down. This presents an obvious question:
… absent improvements in microarchitecture they [CPUs] won’t be substantially faster, nor substantially more power efficient, and they will have about the same number of cores at the same price point as prior CPUs. Why would anyone buy a new CPU?
If you put yourself in Intel’s shoes for a moment, you can see why this might be troubling. What else can you do to make buying a new CPU attractive?
One reason to which Intel appears to be turning is features: if the new CPU implements an important ISA extension – say, one required by software because it is essential to security – consumers will have a strong reason to upgrade.
In other words, upping the pace of extension release is likely to be a deliberate Intel strategy to keep us buying. As a result…
… the last two years have seen a dramatic and rapid increase in ISA complexity (e.g., seen in the size of the architecture manual in the figure below), with more extensions on the way.
The chart above still follows Moore’s law (note the log-scale on the left-hand axis), with the recently announced slowdown in release cadence yet to appear. Using words in the manual as a proxy for complexity, we can also see a big jump in the 2015-16 period due to the extensions introduced with Skylake.
(Out of interest, I calculate that with 2 million words, and an average reading rate of 200wpm, it would take about a month of full-time reading – at 40 hr weeks – just to do one pass of the manual!).
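(The arithmetic behind that estimate, with the ~2 million word count read off the chart and the reading speed being my own assumption:)

```python
# Back-of-envelope check: ~2 million words at an assumed 200 words per minute,
# reading for 40-hour weeks.
words, wpm, hours_per_week = 2_000_000, 200, 40
weeks = words / wpm / 60 / hours_per_week
print(f"{weeks:.1f} weeks")   # ~4.2 weeks, i.e. about a month of full-time reading
```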
The following table summarises recent ISA extensions together with the number of new instructions they introduce.
The complexity increase is not just from the number of new instructions though, but more critically, from the interactions with existing instructions:
These [more recent] extensions introduce new system-level functionality, often change the semantics of existing instructions, and exhibit complex interactions with other extensions and prior architectural features.
SGX and CET examples
Baumann cites the SGX and CET extensions as prime examples. We’ve looked at SGX several times previously on The Morning Paper of course.
The compelling combination of features, along with strong physical security (memory encryption), has made SGX attractive to researchers and practitioners alike… However, SGX introduces substantial complexity: 26 instructions described by nearly 200 pages of English/pseudocode specification.
Of those 200 pages, nearly 20 are devoted to documenting SGX interactions with prior architectural features. And it turns out one of those interactions threatens to undermine the very security which SGX seeks to provide. This was new to me, despite having read a number of papers on systems exploiting SGX. If you’re interested, it’s worth chasing the reference here to dip into “Preventing page faults from telling your secrets” (CCS’16). The attack relies on a side-channel under the OS’s control (and hence assumed to be under the control of an attacker in the SGX threat model): page faults. By inducing a page fault on nearly every instruction, enough information is leaked to recover on average 27%, and in some cases up to 100%, of the secret bits of encryption keys in OpenSSL and libgcrypt (demonstrated attacks).
Perhaps ironically, the best of the known mitigations exploits a seemingly-unintended interaction with the transactional-memory extension: transactions abort rather than page fault, so the OS cannot observe transactional enclave memory accesses.
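To make the page-fault channel concrete, here’s a toy sketch of the underlying idea. Nothing here is SGX-specific or taken from the CCS’16 paper; it just shows how page-granularity fault observations can leak a secret-dependent memory access pattern:

```python
# Toy illustration of a page-fault side channel: if code touches different
# pages depending on a secret bit, an OS that observes which pages fault
# learns the bit without ever reading the (encrypted) memory itself.

PAGE = 4096
TABLE_BASE = 0x7000_0000          # page-aligned, purely illustrative

def enclave_step(secret_bit):
    """Secret-dependent access: bit 0 touches page 0, bit 1 touches page 1."""
    return TABLE_BASE + secret_bit * PAGE

def observed_page(addr):
    """All the attacker-controlled OS sees is the faulting page number."""
    return addr // PAGE

secret = [1, 0, 1, 1]
base_page = observed_page(TABLE_BASE)
recovered = [observed_page(enclave_step(b)) - base_page for b in secret]
assert recovered == secret        # the access pattern alone reveals the secret
```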
The CET (control-flow enforcement technology) extension sounds pretty cool to be honest. It defends against code-reuse attacks such as ROP by maintaining a shadow stack and indirect branch tracking. The shadow stack contains just return addresses and is not accessible to normal code. On a return, the addresses from the regular and shadow stacks are popped and compared – if they don’t match, there’s trouble afoot!
CET promises to add strong defenses to unsafe C/C++ code, at the cost of substantial architectural complexity… the main complexity arises from feature interaction. Control transfer in x86 is already very complex, including many forms of call and return, such as near and far calls to a different segment or privilege level. In total, nine instructions (some with many variants, such as JMP) are modified by CET.
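The shadow stack itself is hardware (with its own protections so normal code can’t write to it), but the matching rule is simple enough to model in a few lines. A toy sketch, not x86 semantics:

```python
# Toy model of the shadow-stack check: the CPU keeps a second, protected copy
# of each return address; a mismatch on return indicates the writable return
# address was tampered with (e.g., by a ROP chain).

class ControlFlowViolation(Exception):
    pass

call_stack, shadow_stack = [], []   # the normal stack vs. the protected copy

def on_call(return_address):
    call_stack.append(return_address)     # writable by (buggy/exploited) code
    shadow_stack.append(return_address)   # not accessible to normal code

def on_return():
    if call_stack.pop() != shadow_stack.pop():
        raise ControlFlowViolation("return address mismatch")

on_call(0x4018F0)
call_stack[-1] = 0xDEADBEEF               # attacker overwrites the return address...
try:
    on_return()
except ControlFlowViolation as e:
    print("blocked:", e)                  # ...and the mismatch is caught
```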
Can we sustain this pace?
We’re seeing a high rate of change accompanied by rapid growth in complexity of systems-level features with complex interactions. Yet, “a faithful implementation of x86 semantics is essential for x86-compatible processors, virtual machines, emulators, JIT compilers, dynamic translators, disassemblers, debuggers, profilers, and so on.”
… we have to question whether the core x86 promise of indefinite backwards compatibility across many implementations is sustainable.
When improvements come via new instructions, exploiting gains takes longer
The first SGX spec. was published in 2013, the first CPUs implementing it didn’t ship until 2015, and server-class CPUs with SGX support were still to appear as of early 2017. Add on the time for sufficient deployment, and we’re talking about a decade from initial spec. to widespread availability. At some point in this cycle, software developers will be able to start developing for SGX with a reasonable assumption of its availability.
This represents a difficult tradeoff for software developers; prior ISA extensions have also taken a long time to deploy, but they have generally only served to accelerate existing functionality; with a feature like SGX, the developer is faced with a stark choice: wait indefinitely for security, or deploy now without it.
Add this software lag, and we could be looking at well in excess of a decade for general exploitation of new features.
The boundary between hardware and software is blurring
A careful reading of Intel patents suggests that SGX instructions are implemented entirely in microcode. If new ISA features are new microcode, that opens up the possibility of supporting them on existing CPUs. For example, perhaps (poorer performing?) versions of (many?) new instructions could be implemented this way, making a faster path for adoption of new features. The licensing and revenue model for this is an open question (as is the interaction with the desire to sell new CPUs).
… now more than ever before it’s time to rethink the notion of an instruction set. It’s no longer the boundary between hardware and software, but rather just another translation layer in the stack.