
🤦 LFMM: Always Test the Update

This post should be rebranded to “LFRM: Learn From Rivian’s Mistakes” as we’ll be learning from a massive OTA mishap they recently experienced.

If you haven’t read about it, Rivian recently released an OTA update that may have bricked the infotainment system in their vehicles.

From the VP of Engineering:

We made an error with the 2023.42 OTA update – a fat finger where the wrong build with the wrong security certificates was sent out. We cancelled the campaign and we will restart it with the proper software that went through the different campaigns of beta testing.

An Educated Guess At Root Cause

What does this mean, exactly? Well, we can’t fully know without access to the engineering team but we can make an educated guess. Most chips nowadays offer a form of securely booting signed binaries. In the ESP32 world it’s the Secure Boot feature.

The idea is that you sign a binary with a private signing key. The signature is added to the final binary in a pre-specified location. At boot, the bootloader verifies that signature using the matching public key, which is anchored on the chip (in eFuse on the ESP32). On the ESP32 this ensures two things:

  1. The image to boot was signed with the secret, private key. This means you cannot sideload an unsigned, custom binary.
  2. The image has not been modified. The signature block in the ESP32 contains an image digest which is also validated so if the binary file was modified after signing it will not boot.
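
The two guarantees above can be modeled in a few lines of Python. This is a toy sketch, not the real ESP32 signature block layout: SignedImage, verify_image, and the signature_ok callback are hypothetical names, and the public-key signature check is left as a pluggable callable so the example needs only the standard library.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class SignedImage:
    payload: bytes    # application image bytes
    digest: bytes     # SHA-256 of the payload, recorded at signing time
    signature: bytes  # signature over the digest (opaque in this sketch)


def verify_image(image: SignedImage,
                 signature_ok: Callable[[bytes, bytes], bool]) -> bool:
    """Return True only if the image is unmodified and correctly signed."""
    # Guarantee 2: the payload must still hash to the recorded digest.
    if hashlib.sha256(image.payload).digest() != image.digest:
        return False
    # Guarantee 1: the signature over that digest must verify against the
    # trusted public key. The real check is delegated to signature_ok.
    return signature_ok(image.digest, image.signature)
```

An image that was changed after signing fails the first check, and an image signed with a different key fails the second, so either one is refused.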

Once Secure Boot is enabled on an ESP32 it is impossible to boot a binary image that is not signed with the correct key. Rivian implies that the fix may require physical access to the vehicle. It sounds like they may have OTA’d the infotainment system with a binary image that was signed with the wrong key (the “wrong security certificates” in Rivian’s wording).

Whether that’s true or not, however, is irrelevant. It’s something that can happen and is a catastrophic mistake that you will want to avoid. So let’s talk about how to avoid it in your production application.

ESP32 Anti-Brick Protection

This is just a quick note on the protections that exist on the ESP32 to save you from your own carelessness. Secure Boot has several anti-brick protections in place to prevent a device from becoming unusable from an OTA update.

First, when Secure Boot is enabled and an OTA update is downloaded to the ESP32, the device will refuse to mark the new OTA partition as the one to boot if its signature validation fails. This is your first line of defense.

Second, what happens if you manually flash a badly-signed binary image to one of the app partitions in an OTA configuration? When the secondary bootloader attempts to load that app partition it will fail signature validation and the bootloader will automatically try to boot one of the other available app partitions. It will do this until it finds one that will pass signature validation and load.

As long as you have one properly signed application partition on your ESP32 it will boot, thus protecting you from a bricked state.
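
The fallback behavior can be sketched as a small simulation. This is a simplified model, not the ESP-IDF bootloader: AppPartition and select_boot_partition are hypothetical names, the signature check is reduced to a boolean, and the wrap-around search order is an assumption that may differ from the order the real bootloader uses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AppPartition:
    name: str
    signature_valid: bool  # stand-in for the real signature verification


def select_boot_partition(partitions: Sequence[AppPartition],
                          preferred: int) -> Optional[AppPartition]:
    """Try the preferred slot first, then the remaining slots in order."""
    count = len(partitions)
    for offset in range(count):
        candidate = partitions[(preferred + offset) % count]
        if candidate.signature_valid:
            return candidate
    # No partition passed validation: nothing can boot.
    return None
```

With a badly signed ota_0 and a properly signed ota_1, asking for slot 0 returns ota_1. Only when every slot fails validation does the function return None, which corresponds to the bricked state.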

Production Takeaways

Here are the key takeaways to apply to your production setup:

  • Keep your signing key private. If your signing key is compromised it’s game over. Anyone can build firmware that your device will happily boot.
  • Always validate your signed binary. After signing your firmware you should have another process that validates the signature using the production public key. Just guessing, but this might have saved Rivian the trouble they are now in. It’s an automation that can go in your CI pipeline and is super easy to run. For the ESP32 this can be done with a call to espsecure.py verify_signature.
  • Automate your deployment process. A human should never be able to “fat finger” a release causing a bad build to go out. The build, signing, and deployment process should be automated end-to-end.
  • Test the update process. You should always have at least one device that is configured as a production unit. This unit is set up exactly as a customer would be using it and runs the exact same firmware as customers have. This unit is also the first unit that will receive the OTA update to ensure things work. In the case of Rivian, a single truck in this configuration would have caught this issue before customers were affected.
  • Have a rollback mechanism. If a firmware update fails you should have a way to roll back to the previous working version, either automatically or through a step the customer can perform.
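
The signature validation step from the takeaways could be wired into CI with a small gate like the one below. It is a minimal sketch under a few assumptions: Secure Boot V2 is in use, espsecure.py is on the PATH, and it exits non-zero when verification fails. The function names and the injectable run parameter are hypothetical, and the flag spelling should be checked against espsecure.py verify_signature --help for your esptool version.

```python
import subprocess
from typing import Callable, List, Sequence


def build_verify_command(public_key: str, firmware: str) -> List[str]:
    """Build the espsecure.py invocation that checks a signed image."""
    # Assumption: Secure Boot V2. Change --version to match your devices.
    return ["espsecure.py", "verify_signature",
            "--version", "2",
            "--keyfile", public_key,
            firmware]


def firmware_is_releasable(
        public_key: str,
        firmware: str,
        run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Return True only if the verification command exits with status 0."""
    # run is injectable so the gate can be exercised without esptool installed.
    return run(build_verify_command(public_key, firmware)) == 0
```

A release job would call firmware_is_releasable with the production public key and the signed binary, and stop the pipeline when it returns False, so a build signed with the wrong key never reaches the deployment step.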

Conclusion

OTA update failures that leave devices bricked or unusable are the absolute worst nightmare in the embedded product world. Personally, I’ve been there and I feel for the engineer or engineers involved in this mistake. That said, it’s our job to learn from mistakes, both our own and those of others. So turn Rivian’s nightmare into your motivation and do everything you can to prevent this kind of mishap in your product.

Have any update nightmare stories to share? Drop them in the comments below.


© Kevin Sidwar
