new post: This Week in Glean: Crashes & a buggy Glean
This commit is contained in:
parent
a4b3a4c162
commit
7519015b67
152
_posts/2021-11-01-buggy-glean-and-a-data-incident.md
Normal file
152
_posts/2021-11-01-buggy-glean-and-a-data-incident.md
Normal file
|
@ -0,0 +1,152 @@
|
|||
---
|
||||
permalink: "/{{ year }}/{{ month }}/{{ day }}/crashes-and-a-buggy-glean"
|
||||
title: "This Week in Glean: Crashes & a buggy Glean"
|
||||
published_date: "2021-11-01 15:00:00 +0100"
|
||||
layout: post.liquid
|
||||
data:
|
||||
route: blog
|
||||
tags:
|
||||
- mozilla
|
||||
- rust
|
||||
---
|
||||
|
||||
(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.)
|
||||
All "This Week in Glean" blog posts are listed in the [TWiG index](https://mozilla.github.io/glean/book/appendix/twig.html)
|
||||
(and on the [Mozilla Data blog](https://blog.mozilla.org/data/category/glean/)).
|
||||
This article is [cross-posted on the Mozilla Data blog][datablog].
|
||||
|
||||
[datablog]: https://blog.mozilla.org/data/2021/11/01/this-week-in-glean-crashes-a-buggy-glean
|
||||
|
||||
---
|
||||
|
||||
In September I finally [landed work to ship Glean through GeckoView][geckoview-shipped].
|
||||
Contrary to what that post said Fenix did not actually _use_ Glean coming from GeckoView immediately due to another bug that took us [another few days][ac-shipping-glean] to land.
|
||||
Shortly after that was shipped in a Fenix Nightly release we received a crash report ([bug 1733757][bug]) pointing to code that we haven't touched in a long time.
|
||||
And yet the change of switching from a standalone Glean library to shipping Glean in GeckoView uncovered a crashing bug, that quickly rose to be the top crasher for Fenix for more than a week.
|
||||
|
||||
When I picked up that bug after the weekend I was still thinking that this would be _just_ a bug, which we can identify & fix and then get into the next Fenix release.
|
||||
But in data land nothing is ever _just_ a bug.
|
||||
|
||||
As I don't own an Android device myself I was initially restricted to the Android emulator,
|
||||
but I was unsuccessful in triggering that bug in my simple use of Fenix or in any tests I wrote.
|
||||
At some point I went as far as [leveraging physical devices in the Firebase test lab][phys-devices],
|
||||
still with no success of hitting the bug.
|
||||
Later that week I picked up an Android test device, hoping to find easy steps to reproduce the crash.
|
||||
Thankfully we also have quality assurance people running tests and simply _using_ the browser
|
||||
and reporting back steps to reproduce bugs and crashes.
|
||||
And so that's what one of them, Delia, did: [Providing simple steps to reproduce the crash][str].
|
||||
Simple here meant: open a website and leave the phone untouched for roughly 8-9 minutes.
|
||||
With those steps I was able to reproduce the crash in both the emulator and my test device as well.
|
||||
|
||||
As I was waiting for the crash to occur I started reading the code again.
|
||||
From the crash report I already knew the [exact place where the code panics][ffi-support-panic] and thus crashes the application,
|
||||
I just didn't know how it got there.
|
||||
While reading the code around the panic and the knowledge that it takes several minutes to get to the panic I finally stumbled upon a clue:
|
||||
The map implementation that was used internally has [a maximum capacity of elements it can store][ffi-support-maxcap].
|
||||
It even documents that it will panic[^1]!
|
||||
|
||||
```rust
|
||||
/// The maximum capacity of a [`HandleMap`]. Attempting to instantiate one with
|
||||
/// a larger capacity will cause a panic.
|
||||
///
|
||||
/// [...]
|
||||
pub const MAX_CAPACITY: usize = (1 << 15) - 1;
|
||||
```
|
||||
|
||||
I now knew that at some point Glean exhausts the map capacity,
|
||||
slowly but eventually leading to a panic[^2].
|
||||
Most of Glean is set up to not require dynamically allocating new things other than the data itself,
|
||||
but we never store the data itself within such a map.
|
||||
So where are we using a map and dynamically add new entries to it and potentially forget to remove them after use?
|
||||
|
||||
Luckily the stack trace from those crash reports gave me another clue:
|
||||
All crashes were caused by [labeled counters].
|
||||
Some instrument-the-code-paths-and-recompile cycles later I was able to pinpoint it to the exact metric that was causing the crash.
|
||||
|
||||
With all this information collected I was able to determine that while the panic was happening in a library we use,
|
||||
the problem was that Glean had the wrong assumptions about how the use of labeled counters in Kotlin maps to objects in a Rust map.
|
||||
On every subscript access to a labeled counter (`myCounter["label"]`) Glean would create a new object, store it in the map and return a handle to this object.
|
||||
The solution was to avoid creating new entries in that map on every access from a foreign language and instead cache created objects by a unique identifier
|
||||
and hand out already-created objects when re-accessed.
|
||||
This fix [was implemented in a few lines][glean-fix], accompanied by an extensive commit message as well as tests in all of the 3 major foreign-language Glean SDKs.
|
||||
|
||||
That fix was released in [Glean v42.0.1][glean-release], and shortly after in Fenix Nightly and Beta.
|
||||
|
||||
The bug was fixed and the crash prevented, but that wasn't the end of it.
|
||||
The code leading to the bug has been in Glean since at least March 2020
|
||||
and yet we only started seeing crashes with the GeckoView change.
|
||||
What was happening before? Did we miss these crashes for more than a year or were there simply no crashes?
|
||||
|
||||
To get a satisfying answer to these questions I had to retrace the issue in older Glean versions.
|
||||
I took older versions of Glean & Fenix from just before the GeckoView-Glean changes,
|
||||
added logging and built myself a Fenix to run in the emulator.
|
||||
And as quickly as that[^3] `logcat` spit out the following lines repeatedly:
|
||||
|
||||
```
|
||||
E ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""
|
||||
E glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"
|
||||
```
|
||||
|
||||
No crashes! It panics, but it doesn't crash in old Fenix versions.
|
||||
So why does it crash in new versions?
|
||||
It's a combination of things.
|
||||
|
||||
First Glean data recording happens in a thread to avoid blocking the main thread.
|
||||
Right now that thread is handled on the Kotlin side.
|
||||
A panic in the Rust code base will bubble up the stack and crash the thread, thus poisoning the internal lock of the map,
|
||||
but it will not abort the whole application by itself.
|
||||
Subsequent calls will run on a new thread, handled by the Kotlin side.
|
||||
When using a labeled counter again it will try to acquire the lock,
|
||||
detect that it is poisoned and panic once more.
|
||||
That panic is caught by the FFI layer of Glean and turned into a log message.
|
||||
|
||||
Second the standalone Glean library is compiled with [`panic=unwind`][panic-method], the default in Rust,
|
||||
which unwinds the stack on panics.
|
||||
If not caught the runtime will abort the current thread, writing a panic message to the error output.
|
||||
`ffi-support` however catches it, logs it and returns without crashing or aborting.
|
||||
Gecko on the other hand sets [`panic=abort`][gecko-panic].
|
||||
In this mode a panic will immediately terminate the current process (after writing the crash message to the error output),
|
||||
without ever trying to unwind the stack, giving no chance for the support library to catch it.
|
||||
The Gecko crash reporter is able to catch those hard aborts and send them as crash reports.
|
||||
As Glean is now part of the overall Gecko build all of Gecko's build flags will transitively apply to Glean and its dependencies, too.
|
||||
So when Glean is shipped as part of GeckoView it runs in `panic=abort` mode, leading to internal panics aborting the whole application.
|
||||
|
||||
That behavior by itself is fine: Glean should only panic in exceptional cases and we'd like to know about them.
|
||||
It's good to know that an application could continue running without Glean working correctly;
|
||||
we won't be able to record and send telemetry data, but at least we're not interrupting someone using the application.
|
||||
However unless we engineers run into those bugs and see the log we will not notice them and thus can't fix them.
|
||||
So ultimately this change in (crashing) behavior is acceptable (and wanted) going forward.
|
||||
|
||||
After fixing the initial bug and being able to answer why it only started crashing recently my work was still not done.
|
||||
We were likely not recording data in exceptional cases for quite some time, which is not acceptable for a telemetry library.
|
||||
I had to explore our data, estimate how many metrics for how many clients were affected, inform relevant stakeholders and plan further mitigations.
|
||||
But that part of the investigation is a story for another time.
|
||||
|
||||
_This bug investigation had help from a lot of people.
|
||||
Thanks to Mike Droettboom for coordination,
|
||||
Marissa Gorlick for pushing me to evaluate the impact on data & reaching the right people,
|
||||
Travis Long for help with the Glean code & speedy reviews,
|
||||
Alessio Placitelli for reviews on the Gecko side,
|
||||
Delia Pop for the initial steps to reproduce the bug
|
||||
& Christian Sadilek for help on the Android Components & Fenix side._
|
||||
|
||||
---
|
||||
|
||||
_Footnotes:_
|
||||
|
||||
[^1]: Except that it panics at a slightly different place than the explicit check in the code base would suggest.
|
||||
[^2]: That meant I could also _tune_ how quickly it crashed. A smaller maximum capacity means its reached more quickly, reducing my bug reproduction time significantly.
|
||||
[^3]: No, not after 9 minutes, but in just under 3 minutes after tuning the maximum map capacity, see [^2].
|
||||
|
||||
[geckoview-shipped]: /2021/09/17/glean-geckoview/
|
||||
[phys-devices]: https://fnordig.de/2021/10/14/fenix-physical-device-testing/
|
||||
[ac-shipping-glean]: https://github.com/mozilla-mobile/android-components/pull/11045
|
||||
[bug]: https://bugzilla.mozilla.org/show_bug.cgi?id=1733757
|
||||
[str]: https://github.com/mozilla-mobile/fenix/issues/21767
|
||||
[ffi-support-panic]: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L409
|
||||
[ffi-support-maxcap]: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L160-L166
|
||||
[labeled counters]: https://mozilla.github.io/glean/book/reference/metrics/labeled_counters.html
|
||||
[glean-fix]: https://github.com/mozilla/glean/commit/499309475f4f002bdd6c19db4aa051634760efe1#diff-fe013beaf77cb562dae30dd3e639f386b9c9f01d6bc6800538ccd72eb89ad68cR77-R107
|
||||
[glean-release]: https://github.com/mozilla/glean/releases/tag/v42.0.1
|
||||
[panic-method]: https://doc.rust-lang.org/cargo/reference/profiles.html#panic
|
||||
[gecko-panic]: https://searchfox.org/mozilla-central/rev/a9e0a3f5e5f7cde941d419db967997aaa1f06b0f/Cargo.toml#63
|
Loading…
Reference in a new issue