They’re out there—Internet-enabled refrigerators and washing machines are here, although they’re not yet common household items and might take awhile to catch on. A large, fast, and growing market for Internet-enabled embedded devices is already here, however, and it includes vending machines, security and access devices, point of sale and remote data acquisition.
Secure Sockets Layer (SSL) is the de facto standard security protocol for securing transactions over the Internet. Other protocols exist but are less widely used and supported and don’t have as many application possibilities as SSL. SSL was designed to complement the TCP/IP sockets model, making any TCP/IP application a candidate for SSL encryption.
Perhaps the most common and effective way to interface with a remote embedded device over the Internet is with a web browser on a PC. If the embedded device can act as a web server, it can serve standard HTML pages, HTML forms, or Java applets that eliminate the need to develop complex, proprietary PC software and the accompanying maintenance, testing and support headaches. Every Internet user knows how to use a web browser, and every modern PC on the Internet has one. SSL security is built into every modern commercial web browser in the form of a secure HTTP client “HTTPS”. Whenever your browser shows you the message, “you are about to enter a secure web site”, you’re using HTTPS and SSL.
SSL is a computationally intensive protocol, particularly during the initial handshake to open a session. While many low-end microprocessors and microcontrollers can handle running an HTTP server, a secure HTTPS server requires something more than a Z180 or an 8051 to work, but it doesn’t necessarily need a fast 32-bit processor or even a 16-bit one. With some careful coding and some hardware tricks, a robust, secure web server can easily be made for under US$35 ($50) with an off-the shelf, Ethernet-enabled controller.
This article discusses the basics of SSL and Transport Layer Security (TLS, the Internet Engineering Task Force standard for SSL) and the problems inherent in constructing an SSL implementation for a system with very limited resources.
We’ll cover some basic cryptographic concepts here without delving into algorithmic details, which are thoroughly covered in the listed references.
Symmetric key cryptography is so named because the sender and receiver must have identical copies of the key and like algorithms in order to encrypt and decrypt messages. Figure 1 illustrates the symmetric nature of this type of encryption; both sides must have the key k to exchange the message M (encrypted using k to get the encrypted message E).
Some common algorithms for symmetric- key encryption include DES (data-encryption standard), 3DES (triple DES), AES (advanced encryption standard), and RC4 (Rivest Cipher four). AES is a newer algorithm intended to replace the aging DES and 3DES. RC4 is used extensively by SSL implementations due to its simple structure and excellent performance. (See Figs. 2 & 3.)
Figure 4 shows how SSL communicates using discrete messages called records. A record is defined by a header (containing protocol information, message type and length), a body (containing the message data itself), and a message authentication code (MAC), which is a hash of all the data in the message. The MAC is used to detect tampering or data corruption of the message after it’s received. All SSL data is wrapped in these records; the record layer is a vital component of any SSL implementation.
A new SSL session begins with a handshake consisting of a series of messages sent between the client and the server containing negotiable values for the session.
The first message sent, Client Hello, contains the client’s protocol version, what ciphersuites it supports, and some data to be used in the key derivation process. A ciphersuite is a predetermined combination of ciphers to be used in the session establishment process and in the session itself.
The server responds to the Client Hello message with its chosen ciphersuite and some data for the key derivation in a message called the Server Hello. This is followed by the server’s digital certificate and a message indicating the Server Hello is complete (this is to allow multiple certificates to be sent).
The client then uses the chosen public-key algorithm to encrypt a random chunk of data called the pre-master secret, which is used to generate the symmetric encryption keys later. The encrypted premaster secret is sent to the server in the Client Key Exchange message, followed by the Change Cipher Spec message, which indicates that all further records from the client will be encrypted. Next, the server and client generate the session keys simultaneously. These keys are simply symmetric encryption keys used by the chosen symmetric algorithm. The pre-master secret and the data from the Client Hello and Server Hello messages are used to generate the master secret, which is then used to generate two separate keys to encrypt outgoing messages (one for the server, and one for the client). Using two keys assures that if one key is compromised, the entire communications channel will not be compromised; only messages in one direction will be able to be decrypted using that key.
The client uses the newly generated keys to encrypt the last client handshake message, called Finished, which contains a hash of all the previous handshake messages. This hash is used to verify that no tampering or corruption occurred during the handshake. The Finished message also notifies the server that the client is ready to send and receive application data, encrypted using the symmetric algorithm.
Upon receiving the Finished message from the client, the server decrypts it, verifies the handshake message hash against its own hash of the handshake messages, and sends to the client its own Change Cipher Spec and Finished messages (which are handled by the client in the same manner as the client versions of these messages were handled by the server).
At this point, the session has begun and all data is hashed and encrypted using the chosen hash function and encryption algorithm. The MAC is used to validate the data, assuring with high confidence that the data was not corrupted or tampered with in transit.
When an error occurs during the handshake or session, that error needs to be transmitted to the other side of the communication channel. SSL does this with what are called Alerts: specialized messages encapsulated in SSL records that indicate what error occurred and the severity of the error. Most errors are considered fatal"the connection is terminated immediately upon receiving a fatal alert.
Once the application is finished transmitting data, either the client or the server sends a special alert called Close Notify, indicating that it's done communicating and the connection will close. This final step protects against what is called a Truncation Attack, where an attacker prevents all the data that was sent from being received. This is especially important in situations such as banking transactions, where all the data must be received.
Developing SSL implementations for smaller embedded systems is no small challenge, since there is no “embedded SSL” protocol. Encryption operations are expensive, both in terms of processor cycles and memory usage. The protocol was designed for powerful machines without resource constraints. However, a carefully implemented, viable SSL implementation can add as little as 50 kBytes to the code footprint of an application, and less than 20 kBytes of static data space.
Many embedded systems will need only server-side SSL, since they will be accessed using a web browser (a client). This allows us to eliminate the client-side SSL code, including the resource-intensive certificate-authentication code and root CA certificates. At 1 to 2 kByte for each certificate, this is a significant constant data space saving.
We can also take advantage of the structural similarities between SSL version 5 and TLS by writing cross-protocol code that works with both protocols. Almost 90 percent of the code can be shared between these two protocols.
The choice of supported algorithms also affects our code size. AES and 3DES are relatively complex cipher algorithms, whereas RC4 is very simple and requires only a few lines of C code. Not coincidentally, RC4 also has much better performance than either AES or DES. The TLS specification (RFC 2246) requires 3DES for TLS compliance, but in practice, all major commercial web browsers and SSL implementations support RC4, so by deviating from the spec a little bit and supporting only RC4, we can save several kilobytes of code space with no practical functional impact.
Obviously, writing in assembly code both saves code space and improves performance. The cipher and digest algorithms are prime candidates for assembly coding since they’re typically the performance bottleneck of the protocol, and the algorithms are fairly straightforward to port to assembly.
Digital certificates are needed for SSL communication, since they contain both the public key and authentication information. Unfortunately, all this information makes the certificates fairly large, on the order of 1 to 2 kBytes each. The SSL protocol specifies that any SSL client or server should be able to support multiple certificates. However, in practice only one certificate is actually needed for any communication. By restricting certificate storage to a single certificate per device, we can save some space in constant data space or a Flash file system.
Another possible savings is in certificate parsing. Certificates are stored in a derivative of the Abstract Syntax Notation (referred to as ASN.1). If we’re only using server-side SSL, we can eliminate the ASN.1 parsing code (which would be needed for client-side SSL) by separating out the public key from the certificate and storing it separately. Public keys are typically only 64 to 128 Bytes, so we can save quite a bit of code space with only a small loss of constant data space.
Minimizing data space
The best ways to reduce data size are to carefully construct message buffering algorithms and minimize buffer sizes. We need to store an entire message to be able to digest or verify it before passing it along to the TCP connection or the application, respectively. This means that the incoming message buffer must be at least 16 kByte in size (the maximum size of an SSL record, according to the SSL specification) so we can be sure to get an entire message.
The outgoing message buffer, however, can be reduced in size significantly, since we can control the size of records that our implementation sends out. It needs to be at least 2 kByte to hold the certificate message, which is between 1 and 2 kByte. A minimal buffer sacrifices some performance, since smaller records can be less efficient for throughput, but the RAM savings are significant. Buffer sizes can be easily parameterised, giving the programmer (and even the user) control over memory usage to fine tune the implementation.
The initial handshake requires more buffer space than the actual session does (it needs to derive the session keys and do the public-key operation), so we can temporarily allocate some memory for it. We can do this efficiently by implementing a simple stack in static memory from which we can allocate buffer space. The stack-based approach gives a dynamic memory allocation scheme without the need for the costly overhead of malloc() and free().
By encrypting and decrypting messages in-place, we can use just one buffer each for input and output rather than keeping separate buffers for encrypted and decrypted messages. This is possible because all SSL-supported cryptographic algorithms only work on one fixed-size block of data at a time, and the output data is the same size as the input (as well as being independent of all other output).
We can also assume that the encrypted and decrypted data are the same size for any given algorithm, so there will be no overflow of the buffer (all currently supported algorithms have this property). Figure 6 shows a circular buffer scheme that works well for decrypting received messages in place.
For incoming messages, we read the header into the buffer first (1), then we write the incoming encrypted data over the header (the footer containing the MAC is part of the encrypted data) (2). After the entire record is read into the buffer it is decrypted and verified (3), and finally the next record being read overwrites the record footer (4).
For outgoing messages, we reserve a number of bytes for both the header and the footer; these numbers are calculated from the ciphersuite chosen during the SSL handshake (and remain constant throughout the entire session). We need to reserve space for the header at the end of completed records so that we can build the header later. The header contains the length of the record, which cannot be known until the record data is complete. We reserve space for the footer (MAC), which also cannot be generated until the data is complete. Figure 5 shows the layout of the write buffer for outgoing messages. The record to the left is complete, consisting of a header, the data, and a footer, which contains the MAC. The record on the right is not yet finished, but contains the application data to be written. Once a completed record is written to the network layer, the buffer space is released.
Cryptography is notoriously computationally expensive and SSL was designed for maximum security, not optimal performance. But there are some tricks you can apply to make SSL run faster and more smoothly on low-end processors.
By far the most “expensive” operation in any SSL session is the public-key operation. Public-key algorithms such as RSA require thousands of operations on large multibyte numbers (512 to 2048 bits each). Beyond efficient coding, the only way to speed up these operations is with some type of hardware assist. A few companies make dedicated public-key encryption hardware, such as Atmel’s Trusted Platform Module, which provides an RSA-specific cryptography accelerator in hardware.
Another possibility is to use a processor with special instructions to accelerate the RSA algorithm. Multiple precision multiplication instructions can speed up RSA by as much as 10 times. Examples are the Rabbit 3000A and Rabbit 4000 processors (distributed by Dominion ), which have unsigned multiply add (UMA) and unsigned multiply subtract (UMS), which are used to greatly speed up exponentiation and modulus operations. The details of applying these instructions to RSA is beyond the scope of this article. (Alfred Menezes, et al, Handbook of Applied Cryptography, (CRC Press, 1996) is good reference for implementing RSA efficiently.)
With instructions like these, the initial handshake time can easily be reduced from as much as 30 to less than 3 s on a 44-MHz processor.
If performance is an issue, strong consideration must be given to some type of hardware assistance for public-key cryptography. Performance should be an issue if someone is interfacing to an embedded web server, because the user is likely to think the device is down if it takes more than a few seconds for initial authentication before the device can serve a web page.
One of the most important advanced features of SSL for optimising embedded SSL is session renegotiation. During session establishment, the client and server negotiate a Session ID, a large integer number used to uniquely identify the current session. The server and client both cache the session keys and other session information, and, after a session has ended, keep that information for some time. This information can be used to re-establish a previously completed session, avoiding the expensive public-key operation. The reduced overhead is definitely an advantage on embedded systems.
Unlike public-key cryptography, hashing algorithms lend themselves to easy programmatic optimisation. The algorithms such as MD5 and SHA-1 can easily be coded completely in assembly, and much of the code is redundant, allowing for fairly good code factoring. These algorithms are also comparatively faster than other cryptographic operations, so they will rarely be the cause of any performance bottleneck.
Random numbers are also important in the generation of session keys. Seeding a pseudo-random number generator (PRNG) with a constant number or the current time is not considered adequate in high-end security applications because it creates vulnerability to certain forms of attack. It may be adequate 999 of 1,000 times for lower-end applications, but for that 1,000th time when a skilled and motivated hacker attacks your web-enabled embedded system, something a bit more random, like network entropy, should be used. This works by using received TCP/IP or UDP packets to generate entropy from a real-time clock (the time at which a network packet is received is considered fairly random).
Very high security applications will go further and use real hardware random number generators (RNGs), but if a good PRNG is good enough for your PC, it ought to be good enough for most embedded applications as well.
A pre-emptive real-time operating system (RTOS) may use more resources than is desirable for a system that already must stretch its resources to handle a TCP/IP stack, a web server, SSL, stored web pages, and, of course, the application itself, so a cooperative multitasking method such as state machines may be preferable. HTTPS, SSL, and a TCP/IP stack can all be driven on an RTOS-less system using a single tick-type function, called as often as possible in “big loop” programs. If an RTOS is used, the same tick function could also be put in a high-priority task. There are a few complications, since HTTP is inherently a stateless protocol and SSL requires state to be preserved in order to function. These problems are not difficult to overcome.
The market for secure web-enabled embedded applications is growing. SSL is generally perceived as feasible only for desktops higher-end 32-bit embedded systems, but with some creativity and careful coding it’s perfectly usable on 8-bit microprocessors.
A viable 8-bit implementation of SSL will include:
SSL 2.0, at least the session-initiation portion
SSL 3.0, and preferably TLS 1.0
Multiprecision arithmetic for RSA, preferably with hardware acceleration.