
ProdESP32 #8: Read the Errata

Story Time…

I wanted to add a super simple way for customers to reset the WiFi credentials on their puck holder devices. An easy approach, I thought, would be for them to unplug and plug it in 3 times in a row which would tell it to wipe the WiFi creds. The puck holder is awake for about 10 seconds before it goes back to deep sleep. When you plug it in it wakes up. I figured customers could just use that 10 seconds to do the plug/unplug dance. They could then use the app to set it back up. Easy peasy.

Until it wasn’t.

As part of my PCB design there is a PGOOD line from the charging IC that is pulled low when external power is applied to the USB port (power good). “I’ll just count the falling edges and when it’s time to go to sleep, if it’s 3 or more I’ll reset the WiFi credentials.”
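A minimal, hardware-free sketch of that plan is below. It is not the firmware from the product: the function names and the threshold macro are made up for illustration, and on a real ESP32 the handler body would be registered as the falling-edge interrupt for the PGOOD pin.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESET_EDGE_THRESHOLD 3u  /* "3 or more" falling edges, per the plan above */

/* Incremented once per falling edge; volatile because an ISR would write it. */
static volatile uint32_t pgood_falling_edges = 0;

/* Hypothetical handler body: on hardware this would run from the GPIO
 * falling-edge interrupt attached to the PGOOD pin. */
void on_pgood_falling_edge(void)
{
    pgood_falling_edges++;
}

/* Called just before going to sleep: should the WiFi credentials be wiped? */
bool should_reset_wifi_credentials(void)
{
    return pgood_falling_edges >= RESET_EDGE_THRESHOLD;
}
```

Note that this logic trusts every falling edge to be a real unplug/replug event, which is exactly the assumption that falls apart in the rest of the story.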

I whipped up the code and promptly found it didn’t work. It seemed like the interrupt for the falling edge was firing too much. After adding some debug logic I found the interrupt was running almost 90 times even though I was only unplugging and replugging in the device a couple of times 🤯.

“That can’t be” I thought. So I hooked up the logic analyzer. Sure enough, the PGOOD line was being pulled low….A LOT.

[Figure: A lot of bad PGOOD pulses. Caption: PGOOD being pulled low….a LOT]

Maybe there is something wrong with the charge IC. I reread the datasheet. Nothing stood out. As I looked at the logic analyzer output some more I noticed a pattern. Every time PGOOD was pulled low unexpectedly it was for 2us. Since the max sample rate of my logic analyzer is 500 kS/s, 2us is the finest resolution it can show, so the real pulses could be even shorter. I also noticed that a lot of the intervals between low pulses clocked in at almost exactly 102.4ms.

[Figure: Suspiciously Timed Pulses. Caption: Awfully consistent pulses of the PGOOD line]

That can’t be a coincidence. I pulled up my schematic to see if there were some traces nearby that could be noise. I thought that maybe the RX/TX lines might be nearby so I was looking at UART timings for 115200 baud.

I started adding pauses in my code to essentially halt the main thread processing. I started near the beginning of the program. No PGOOD noise. Ok, that’s good. That means something in the code later on triggers it. I kept moving the loop later and later in my application until finally I moved it just beyond the WiFi connection logic. BOOM! Unexpected PGOOD pulses. So it has something to do with WiFi. 🤔

Then, by pure luck, I noticed this gem in the UART log output.

I (2261) wifi:AP's beacon interval = 102400 us, DTIM period = 1

The beacon interval is 102400us!!!! Which, as we know, is 102.4ms. Almost exactly what the interval is on the bad PGOOD pulses!

A quick internet search yielded this:

[Figure: WiFi Beacon Interval. Caption: Standard beacon interval is 102.4ms!!!]

So the standard beacon interval for WiFi access points is 102.4ms. Well, we’re way beyond coincidence at this point. This has to be our smoking gun. The other, longer gaps between pulses are likely due to the sampling limitation of my logic analyzer. In other words, it just missed those PGOOD pulses.
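The arithmetic behind these numbers is easy to check. WiFi beacon intervals are configured in time units (TU) of 1024us, and 100 TU is the common default, which is where 102400us comes from. The same kind of calculation gives the 2us resolution of a 500 kS/s logic analyzer. The helper names below are made up for illustration.

```c
#include <stdint.h>

#define TU_MICROSECONDS 1024u  /* one 802.11 time unit (TU) in microseconds */

/* Convert a beacon interval expressed in TUs to microseconds. */
uint32_t beacon_interval_us(uint32_t interval_tu)
{
    return interval_tu * TU_MICROSECONDS;
}

/* Time between samples, in nanoseconds, for a given sample rate in samples/s. */
uint32_t sample_period_ns(uint32_t samples_per_second)
{
    return 1000000000u / samples_per_second;
}
```

With these, `beacon_interval_us(100)` gives 102400 and `sample_period_ns(500000)` gives 2000, matching the log line and the analyzer resolution mentioned earlier.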

I was already aware that you can’t use ADC2 and WiFi at the same time on the ESP32 but I was not aware of this fun piece of information in the docs.

Please do not use the interrupt of GPIO36 and GPIO39 when using ADC or Wi-Fi and Bluetooth with sleep mode enabled.

It then refers you to an errata that talks about ignoring input from GPIO36 when SAR ADC1 or SAR ADC2 are powered. It says the line will be pulled low for approximately 80ns but doesn’t elaborate on how often and makes no mention of the WiFi beacon interval.

At this point there is enough information to suggest this is a known limitation of GPIO36 when WiFi is running although it’s not as clearly documented as I would like.

What’s the Moral?

So besides being a fun nerd story, what’s the point and why am I putting it here on Production ESP32? How does this relate to building a production-quality product? Well, there are a few takeaways here.

Read the Errata

Every chip I’ve ever used has an errata. Some are brief, some are longer. As a maker or hobbyist you likely never cared about an errata datasheet or even knew they existed. But when it comes to a production product you have to know what’s in the errata and how it might affect your product. On that point I failed. That’s why this edition is filed under the Learn From My Mistakes category.

Now, to be fair to myself, this issue hasn’t affected my product until now. Or has it? I use the PGOOD signal to make firmware branch decisions on whether the device is plugged in or not. After seeing the logic analyzer captures, I need to improve the firmware code around testing PGOOD and will have to come up with another approach for my customers to reset their puck holders.
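One possible way to harden a level check against very short glitches is to sample the pin several times and only treat it as low if every sample agrees. This is a sketch under assumptions, not the fix that shipped: the function name is hypothetical, and the number of samples and their spacing would need tuning on real hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only if every sample reads low (0). `samples` holds pin levels
 * captured a short time apart, for example from repeated gpio_get_level()
 * calls on an ESP32. Sample count and spacing are assumptions to tune. */
bool pgood_is_sustained_low(const int *samples, size_t count)
{
    if (samples == NULL || count == 0) {
        return false;  /* no evidence, so do not claim power is present */
    }
    for (size_t i = 0; i < count; i++) {
        if (samples[i] != 0) {
            return false;  /* any high sample means the low was not sustained */
        }
    }
    return true;
}
```

A single spurious low in an otherwise high line is rejected this way, while a line genuinely held low by the charging IC still passes. It only mitigates the level-read case; it does nothing for edge interrupts on the affected pin.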

I’ve already made a note on the next hardware revision to move PGOOD to a different input pin.

Mistakes Will Be Made

Besides just knowing about and reading the errata, another takeaway here is that you will make mistakes. They may be hardware mistakes, firmware mistakes, or cloud implementation mistakes. Major production-quality ICs have errata. That means very accomplished teams of engineers have oversights. You will too.

The takeaway is that you need to be prepared to deal with them. That may be an emergency firmware change or it may be writing a difficult email to your customers explaining the situation. The point is, you should never bury your head in the sand and try to hide from them. Know they will happen and face them head on.

Always Find The Root Cause

This is a very important one. In my career I have found a huge difference between experienced engineers and newer engineers is the desire and ability to root cause issues. Inexperienced engineers are quick to find a workaround. To “just move on”. This is great for mitigation but is terrible for long-term stability of your project, whether firmware, hardware, or backend.

When you encounter a problem you must always drill down to the very bottom where you can say “I understand exactly what went wrong and why”. If you don’t, your temporary workaround could be another disaster waiting to happen. Remember, if you don’t know the root cause then you can’t be sure your solution truly fixed it. You may have just buried it deeper, or worse, created another problem.

Conclusion

Familiarizing yourself with the errata data and finding the root cause of problems link to the Deterministic and Maintainable Pillars of Production. One of the fastest ways to build a maintenance nightmare is to hack your way around every problem without understanding the root cause. Similarly, I created non-deterministic behavior with my PGOOD pin logic because I didn’t read the errata. Don’t make the same mistake I did.

Do you have a story where you discovered something crucial in the errata? Maybe you’ve got a great root cause tale? I’d love to hear about it in the comments.


© Kevin Sidwar
