Frequent long janks on Android when loading certain pages
Categories
(Core :: Storage: Quota Manager, defect, P2)
Tracking
(NEW bug which should be worked on in the next release/iteration)
Performance Impact: high
People
(Reporter: bas.schouten, Unassigned)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
I can reliably reproduce this on Google when it is the first site being loaded in the session (i.e. after killing the app). The app appears to be stuck in some storage system.
Profile: https://share.firefox.dev/3zbzrSP
The Performance Impact Calculator has determined this bug's performance impact to be high. If you'd like to request re-triage, you can reset the Performance Impact flag to "?" or needinfo the triage sheriff.
Platforms: Android
Impact on site: Causes noticeable jank
Page load impact: Severe
Websites affected: Major
[x] Able to reproduce locally
Comment 1•1 year ago
The jank is happening inside a Storage.setItem call and is most likely the long quota manager initialization that is supposed to be addressed by bug 1671932.
Comment hidden (obsolete)
Reporter
Comment 3•1 year ago
(In reply to Jens Stutte [:jstutte] from comment #2)
> (In reply to Bas Schouten (:bas.schouten) from comment #0)
> > I can reliably reproduce this on Google when it is the first site being loaded in the session (i.e. after killing the app). The app appears to be stuck in some storage system.
> > Profile: https://share.firefox.dev/3zbzrSP
> A profile taken with all threads would definitely confirm if comment 1 is right.
This profile is taken with all threads :-).
Comment 4•1 year ago
Sorry, stupid me. So no, this is something different: a long setItem that does a lot of mozilla_dump_image calls.
Reporter
Comment 5•1 year ago
(In reply to Jens Stutte [:jstutte] from comment #4)
> Sorry, stupid me. So no, this is something different: a long setItem that does a lot of mozilla_dump_image calls.
Personally, I think that stack trace is nonsense. As far as I can tell, mozilla_dump_image is a graphics call that isn't actually reached by Storage.setItem :-s. The bottom of the stack looks real, though, and the Storage.setItem frame is from the pseudostack, so it is definitely correct.
Comment 6•1 year ago
(In reply to Bas Schouten (:bas.schouten) from comment #5)
> Personally, I think that stack trace is nonsense. As far as I can tell, mozilla_dump_image is a graphics call that isn't actually reached by Storage.setItem :-s. The bottom of the stack looks real, though, and the Storage.setItem frame is from the pseudostack, so it is definitely correct.
Yes, that looks very weird. But there seems to be activity on the QM IO thread already before that setItem starts, and that activity seems to return to the event loop level several times, so I would assume the initialization is already finished. But it is actually hard to tell with these garbled stacks.
Reporter
Comment 7•1 year ago
Profile with fixed symbols here: https://share.firefox.dev/3KQ7ouJ
Reporter
Comment 8•1 year ago
Looks like you were right about the cause: https://share.firefox.dev/3KQoxV2
Comment 9•1 year ago
(In reply to Bas Schouten (:bas.schouten) from comment #8)
> Looks like you were right about the cause: https://share.firefox.dev/3KQoxV2
Yeah, it has to do with initialization, but I am not sure I read it well: I see QM-related initialization, but distributed over several events. Is that how it looks these days when QM initializes?
Updated•1 year ago
Reporter
Comment 10•1 year ago
I noticed this is set to S3? Is there a way to resolve this without blowing away your profile? Would users normally be able to get out of this situation? Because otherwise this seems like a more severe issue.
Comment 11•1 year ago
Delete browsing data is the simplest mechanism available on Android, since the settings UI does not provide the more granular data-clearing mechanism available on desktop, which, most significantly, shows per-site data usage. In general, the slowdown will be correlated with the amount of data, so having a way to see which sites are using the most data is most useful. Fenix does provide a way to clear cookies and site data for any site from the lock icon, which is granular, but that is not useful here because there's no way to know which sites are the source of the problem.
One additional thing to note is that, as discussed with :mstange for his similar report in #DOM on matrix the other day, performance problems will be exacerbated after:
- Fenix updates, because the change in Build ID requires running a full initialization sweep.
- Killing Firefox without letting it shut down cleanly, because the cache will not be flushed. I guess a very interesting question is whether Fenix ever does cleanly shut down? Like, if Android wants to reclaim memory, does XPCOM shutdown get a chance to run, or does Fenix perform an unclean shutdown 100% of the time? Or is Fenix regularly shutting down XPCOM proactively when it goes idle, before it's killed by the OS? Which is to say, it's definitely the case that if you explicitly kill the app while using it, you are going to see worst-case init behavior. (But that is of course very normal user problem-solving behavior.)
Comment 12•1 year ago
The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on performance.
:aiunusov, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?
For more information, please visit BugBot documentation.
Comment 13•1 year ago
This bug now depends on bug 1671932, which already has its severity set to S2.
I'm not sure if all bugs that such an S2 bug blocks need to be S2 as well.
Updated•1 year ago
Reporter
Comment 14•1 year ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
> Delete browsing data is the simplest mechanism available on Android, since the settings UI does not provide the more granular data-clearing mechanism available on desktop [...]
> One additional thing to note is that [...] performance problems will be exacerbated after:
> - Fenix updates [...]
> - Killing Firefox without letting it shut down cleanly [...]
Before I do any of this, is there anything I can do to help diagnose -why- this is happening? A lot of users will probably just think 'man this browser can be so slow', rather than actually trying to figure out what the cause is or discovering a possible solution like this :-).
Comment 15•1 year ago
We know why this is happening and we are working hard on addressing the issue. The main meta bug for the effort is bug 1671932.
There are about 100 patches, most of them already accepted, and we are in the process of landing them.
The patches are attached to bug 1808294, bug 1866217, bug 1866402 and bug 1867997.
Comment 16•1 year ago
This is currently the last patch for the effort: https://phabricator.services.mozilla.com/D199082
It has a commit message that describes the changes from a high-level point of view.
Updated•9 months ago
Comment 18•6 months ago
This is probably the biggest contributor to slow "applink startup" scenarios on Android. For users with ad blockers, it delays the first load regardless of what page is loaded. For users without ad blockers, it delays the page load if the page accesses storage in its JS code.
It's also the type of issue that's hard to quantify with CI performance tests because the magnitude depends on how much stuff is stored in the profile.
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
> Killing Firefox without letting it shut down cleanly, because the cache will not be flushed. I guess a very interesting question is whether Fenix ever does cleanly shut down?
Firefox never shuts down cleanly on Android. Android always kills the process; there is no UI to cleanly shut down Firefox. Which means that, if what you're saying is true, we hit the slow path here on every startup.
Comment 19•6 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
> - Fenix updates, because the change in Build ID requires running a full initialization sweep.
What work is being done in this initialization sweep? What information is computed by it? Doing it asynchronously is not a satisfying solution; why does it need to be done at all?
Comment 20•6 months ago
(In reply to Markus Stange [:mstange] from comment #18)
> It's also the type of issue that's hard to quantify with CI performance tests because the magnitude depends on how much stuff is stored in the profile.
Is there telemetry measuring how long this takes for our Android users?
Comment 21•6 months ago
(In reply to Markus Stange [:mstange] from comment #19)
> What work is being done in this initialization sweep? What information is computed by it? Doing it asynchronously is not a satisfying solution; why does it need to be done at all?
Quota Manager tracks and enforces quota limits for the storage APIs exposed to content, clearing origin storage buckets as needed. To do this, it needs to know how much storage is being used by a given origin, as well as how much storage Firefox is using for storage in general. https://developer.mozilla.org/en-US/docs/Web/API/Storage_API/Storage_quotas_and_eviction_criteria is a nice MDN page covering the scenario in general.
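To make concrete what "knowing how much storage is being used" costs when no cached answer is available, here is a minimal sketch of an uncached usage scan. It is not the actual Gecko code; std::filesystem stands in for Gecko's file APIs and the name ScanOriginUsage is invented:

// Hypothetical sketch of an uncached usage scan. std::filesystem stands in
// for Gecko's file APIs; ScanOriginUsage is an invented name.
#include <cstdint>
#include <filesystem>

uint64_t ScanOriginUsage(const std::filesystem::path& aOriginDir) {
  uint64_t usage = 0;
  std::error_code ec;
  for (const auto& entry :
       std::filesystem::recursive_directory_iterator(aOriginDir, ec)) {
    if (entry.is_regular_file(ec)) {
      usage += entry.file_size(ec);  // one stat per file: O(files) I/O
    }
  }
  return usage;
}

Something like this has to run once per origin during a full sweep, which is why the caches and the incremental initialization discussed below matter so much on slow flash storage.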
(In reply to Markus Stange [:mstange] from comment #17)
> What's the next step here?
Jan is actively working on the async init mechanism. Much of the necessary work has already landed, but there's a little more still to happen. You can try the functionality out by flipping dom.quotaManager.temporaryStorage.incrementalOriginInitialization or setting the env var MOZ_ENABLE_INC_ORIGIN_INIT to 1.
Comment 22•6 months ago
Why does computing how much space an origin occupies involve so much file system work? Why do we need to compute this size rather than just store a number per origin? Are we worried some files got deleted out from under us and that we're applying too stringent limits if we trust the stored sizes? Have we investigated how Chrome addresses this problem?
Comment 23•6 months ago
I've filed bug 1952109 about the fact that, on Android, we might be hitting the expensive path on every startup.
Comment 24•6 months ago
Setting needinfo for the questions in comment 20 and comment 22.
Comment 25•6 months ago
(In reply to Markus Stange [:mstange] from comment #22)
> Why does computing how much space an origin occupies involve so much file system work?
This is handled on a per-Quota-Client basis. Bug 1866240, for example, changed the Cache API to aggressively cache individual stored Response blobs instead of doing disk scanning. Bug 1881058 tracks making a similar change to IndexedDB, which is currently, unfortunately, a bit scan-heavy in mozilla::dom::indexedDB::QuotaClient::GetUsageForOriginInternal, with the blob-size tallying in mozilla::dom::indexedDB::DatabaseFileManager::GetUsage likely being the greatest concern, since there are usually going to be O(1) IDB databases per origin but O(n) blobs, and a blob-heavy site might be very blob-heavy. Looking at mozilla::dom::fs::data::FileSystemDatabaseManager::RescanUsages, I suspect OPFS is a bit more scan-heavy, but since those files are mutable rather than immutable, that's harder to improve, although we could probably do some reasonable dirty-bit tracking there.
> Why do we need to compute this size rather than just store a number per origin? Are we worried some files got deleted out from under us and that we're applying too stringent limits if we trust the stored sizes?
The cache does store a number per origin. If we can't trust the cache, we re-derive the information. The primary concern is in-flight operations, or changes in accounting between versions of Firefox, including both changes in QuotaManager clients and changes in things like the public suffix list, which can impact the group accounting. A related concern is that if bugs in the accounting are allowed to accumulate, they could lead to wiping all QM-managed user data due to a mistaken impression of the state of storage pressure, or to filling up the user's disk (or at least hitting whatever limits are imposed on Firefox). That is particularly a problem since things like bug 1237527 will delete all the user's cookies as a result, and other databases might have similar problems.
That said, invalidating the cache on upgrades has been overly cautious given the rate of changes to DB schemas and quota accounting, which we've thankfully been able to keep largely stable; but as noted, that is not likely to be the problem facing Fennec/Android. It seems likely that after addressing async init and the IDB usage caching, it could be appropriate to teach QM how to un-dirty an origin as the last directory lock for it is released.
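Schematically, the cache-trust decision described above looks roughly like the following. The names are invented and the real cache also covers group accounting; ScanOriginUsage is the kind of full rescan sketched in comment 21:

// Hypothetical sketch of the cache-trust decision; the real QuotaManager
// also tracks group accounting and per-client versions.
#include <cstdint>
#include <filesystem>
#include <string>

uint64_t ScanOriginUsage(const std::filesystem::path&);  // see comment 21 sketch

struct CachedOriginUsage {
  uint64_t usage;        // the stored number per origin
  std::string buildId;   // build the entry was written under
  bool cleanShutdown;    // was the cache flushed by a clean shutdown?
};

uint64_t GetOriginUsage(const CachedOriginUsage* aCached,
                        const std::string& aCurrentBuildId,
                        const std::filesystem::path& aOriginDir) {
  // Trust the stored number only if nothing could have invalidated it.
  if (aCached && aCached->cleanShutdown &&
      aCached->buildId == aCurrentBuildId) {
    return aCached->usage;  // fast path
  }
  // Upgrade or unclean shutdown: re-derive the number from disk. On
  // Android, where shutdown is never clean, this path is hit every startup.
  return ScanOriginUsage(aOriginDir);
}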
> Have we investigated how Chrome addresses this problem?
A surprising number of their storage APIs were mapped into a single LevelDB instance per origin, which simplified their accounting. Their Cache API was implemented on top of their HTTP cache, which I think was more biased towards handling crashes. But a major factor here is just engineering resources.
Comment 26•6 months ago
It's also worth noting that https://searchfox.org/mozilla-central/source/toolkit/components/antitracking/PurgeTrackerService.sys.mjs (with high-level documentation at https://firefox-source-docs.mozilla.org/toolkit/components/antitracking/anti-tracking/cookie-purging/index.html) is probably also important for helping minimize the number of origins with data stored in QuotaManager. Searching on the pref, it seems like it's enabled, but it would not be surprising if the shorter process lifetimes on Android caused problems. Specifically, bug 1459974 is absolutely still a thing, and I see that the code does use the Storage Activity Service. So there might be an irony that Firefox desktop, which is more able to handle excessive numbers of origins, actually benefits from the purging and has fewer origins to deal with.
Comment 27•6 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #25)
> (In reply to Markus Stange [:mstange] from comment #22)
> > Why does computing how much space an origin occupies involve so much file system work?
> This is handled on a per-Quota-Client basis.
Which clients are causing the delay in Bas's profile? https://share.firefox.dev/4hexqWa
Comment 28•6 months ago
(In reply to Markus Stange [:mstange] from comment #27)
> Which clients are causing the delay in Bas's profile? https://share.firefox.dev/4hexqWa
Going from left-to-right on the flame graph:
- LocalStorage is 2,114 samples
- Cache API is 1,083 samples
- IndexedDB is 667 samples
- general directory traversal (attributed to QuotaManager helpers and not directly to any of the above quota clients) is 1,297 samples = 175 (leftmost column) + 588 + 318 + 216 (rightmost) samples
- Origin Metadata loading is 1,884 samples (these are the .metadata* files)
So it seems like the general problem here is the number of origins based on the distribution of samples, with LocalStorage being the dominant storage API in use.
Comment 29•6 months ago
Thanks. So it seems like we need to find a solution that deals well with a large number of origins.
Comment 30•6 months ago
(In reply to Markus Stange [:mstange] from comment #29)
> Thanks. So it seems like we need to find a solution that deals well with a large number of origins.
Async init should massively improve the situation, since it means we would only need to wait for the origins that content is blocked on to initialize first. But it is worth noting that we have enhancements in the backlog to improve pruning of origin directories in general. Bug 1927259 is part of FFXP-1602, which would help with this (it's in internal review), although to some extent the greater problem is understanding what storage the user is likely to actually care about. The PurgeTrackerService that I mention in comment 26 is part of this, since it can let us know which origins are simply understood to be trackers. A more meta problem is that we don't have a good source of OriginAttributes-aware site-usage/frecency data. Chrome definitely uses site-usage data to inform their quota decisions, but our Places database is OriginAttributes-agnostic, as I understand it, which means we can't use it without potentially introducing side channels that would let sites de-anonymize users. (And introducing new potential sources of such information has privacy implications that would need to be considered carefully.)
Comment 31•6 months ago
We have already found a solution for handling a large number of origins: asynchronous storage initialization (also known as incremental origin initialization). Once enabled, when data is accessed for an origin, only the corresponding eTLD+1 group of origins will need to be initialized, either from a cache or by scanning files on disk.
We also plan to do many other enhancements once the incremental origin initialization is done.
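As a toy illustration of the shape of incremental origin initialization (invented names; the real work is spread across the patches listed in comment 15), initialization state is keyed by group and computed only on first access:

// Hypothetical sketch of lazy, per-group initialization; names invented.
#include <map>
#include <string>

struct GroupState {};  // cached usage, directory locks, etc.

class IncrementalOriginInit {
  std::map<std::string, GroupState> mGroups;  // keyed by eTLD+1 group

  // Stand-in for initializing one group from its cache or by scanning
  // only that group's directories.
  GroupState InitializeGroup(const std::string& aGroup) { return {}; }

 public:
  GroupState& EnsureGroup(const std::string& aGroup) {
    auto it = mGroups.find(aGroup);
    if (it == mGroups.end()) {
      // First access: initialize just this group, not the whole repository.
      it = mGroups.emplace(aGroup, InitializeGroup(aGroup)).first;
    }
    return it->second;  // later accesses do no I/O
  }
};

The startup cost then scales with the number of groups a page actually touches rather than with the total number of origins in the profile.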
Comment 32•6 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #25)
> (In reply to Markus Stange [:mstange] from comment #22)
> > Why does computing how much space an origin occupies involve so much file system work?
> Looking at mozilla::dom::fs::data::FileSystemDatabaseManager::RescanUsages, I suspect OPFS is a bit more scan-heavy, but since those files are mutable rather than immutable, that's harder to improve, although we could probably do some reasonable dirty-bit tracking there.
OPFS tracks dirty shutdowns and only scans the files that were being written to at the time the application crashed. Once the storage initialization performance improvements land, it will do this lazily, when quota manager initialization requires the information. If there's a less heavy way to do this, we should definitely consider it.
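One conceivable shape for the dirty-bit tracking mentioned in comment 25, sketched here with invented names rather than the actual OPFS code: persist a dirty bit while a file is open for writing, clear it on clean close, and at init re-stat only files whose bit is still set:

// Hypothetical dirty-bit sketch for mutable files; names invented.
#include <cstdint>
#include <string>
#include <unordered_map>

class MutableFileUsageTracker {
  struct FileRecord {
    uint64_t cachedSize = 0;
    bool dirty = false;  // persisted; true while a write is in flight
  };
  std::unordered_map<std::string, FileRecord> mFiles;

  uint64_t StatFileSize(const std::string& aPath) { return 0; }  // stand-in

 public:
  void OnWriteBegin(const std::string& aPath) { mFiles[aPath].dirty = true; }

  void OnWriteEnd(const std::string& aPath, uint64_t aNewSize) {
    FileRecord& rec = mFiles[aPath];
    rec.cachedSize = aNewSize;
    rec.dirty = false;  // clean close: the cached size is trustworthy
  }

  uint64_t UsageAtInit() {
    uint64_t total = 0;
    for (auto& [path, rec] : mFiles) {
      if (rec.dirty) {  // crashed mid-write: re-stat just this file
        rec.cachedSize = StatFileSize(path);
        rec.dirty = false;
      }
      total += rec.cachedSize;
    }
    return total;
  }
};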
Comment 33•6 months ago
(In reply to Markus Stange [:mstange] from comment #22)
> Why does computing how much space an origin occupies involve so much file system work? Why do we need to compute this size rather than just store a number per origin? Are we worried some files got deleted out from under us and that we're applying too stringent limits if we trust the stored sizes? Have we investigated how Chrome addresses this problem?
Chrome has a QuotaManager SQLite database with the following schema:
CREATE TABLE meta(key LONGVARCHAR NOT NULL UNIQUE PRIMARY KEY,value LONGVARCHAR);
CREATE TABLE IF NOT EXISTS "buckets"(id INTEGER PRIMARY KEY AUTOINCREMENT, storage_key TEXT NOT NULL, host TEXT NOT NULL, type INTEGER NOT NULL, name TEXT NOT NULL, use_count INTEGER NOT NULL, last_accessed INTEGER NOT NULL, last_modified INTEGER NOT NULL, expiration INTEGER NOT NULL, quota INTEGER NOT NULL, persistent INTEGER NOT NULL, durability INTEGER NOT NULL) STRICT;
CREATE TABLE sqlite_sequence(name,seq);
CREATE UNIQUE INDEX buckets_by_storage_key ON buckets(storage_key, type, name);
CREATE INDEX buckets_by_host ON buckets(host, type);
CREATE INDEX buckets_by_last_accessed ON buckets(type, last_accessed);
CREATE INDEX buckets_by_last_modified ON buckets(type, last_modified);
CREATE INDEX buckets_by_expiration ON buckets(expiration);
I don't think they support checking that the amount of storage on disk matches what they write down in the database.
Comment 34•6 months ago
Yeah, their architecture is probably quite different. We just need to enable incremental origin initialization as soon as possible and address the Android-specific bug 1952109 (we already have some ideas on how to solve it quickly and relatively easily).
Updated•6 months ago
Comment 35•6 months ago
(In reply to Markus Stange [:mstange] from comment #20)
> (In reply to Markus Stange [:mstange] from comment #18)
> > It's also the type of issue that's hard to quantify with CI performance tests because the magnitude depends on how much stuff is stored in the profile.
> Is there telemetry measuring how long this takes for our Android users?
Florian migrated qm_quota_info_load_time_v0 to Glean in bug 1950710 (D241236), so we should have some Android telemetry for this soon.
Comment 36•1 month ago
What's the current status of this bug? I believe at some point the decision was made to shift focus from shipping L2 to making it so that we wouldn't run these full checks on every startup on Android. In the meantime I think we've enabled L2 on non-Android platforms. What's the current status on Android?
Comment 37•1 month ago
Not sure what you mean by not running full checks (is there a bug for that?), but L2 is now enabled on Android too; see bug 1978609.
The telemetry looks promising.
Comment 38•1 month ago
It would also be good to add telemetry to count the number of origins with zero usage (bug 1975258); removing such origins might improve perf as well.
Comment 39•1 month ago
Ooh, that's great news! I didn't realize you were still working on this.
With "full checks" I meant "the slow part"... or in other words, I was referring to bug 1952109.
Comment 40•1 month ago
(In reply to Markus Stange [:mstange] from comment #39)
> Ooh, that's great news! I didn't realize you were still working on this.
I'm only helping a bit in my free time.
With "full checks" I meant "the slow part"... or in other words, I was referring to bug 1952109.
Unfortunately, the slow path was the only thing available on Android, since the current L1 cache is persisted to disk only during shutdown, and a clean shutdown never happens on Android.
Comment 41•9 days ago
Here's a startup profile I got with simpleperf from my regular Firefox Nightly profile on my personal phone: https://share.firefox.dev/3UOT2zP
Does this indicate whether L2 is being used? 800 ms is still a rather significant delay; is there more we can do here?
Comment 42•9 days ago
Yeah, there’s incremental origin initialization, but it’s not enabled by default since it isn’t finished yet.
Comment 43•9 days ago
Yeah, we were hoping to get initialization time under 1s with relatively little dev effort using L2. An almost instant init time is only possible if we get L1 working on Android, or if we finish incremental origin initialization (or both).
Comment 45•9 days ago
(In reply to Markus Stange [:mstange] from comment #44)
> Andrew, what are the next steps here?
Unfortunately, we don't have resources to staff async init (bug 1671932) until H1. Please direct any staffing questions to manager people.
I've marked bug 1953860 as fixed and downgraded this bug from P1 to P2 since the situation is massively improved but is still something we plan to address further.
There is a mechanism that conservatively "archives" old origins; it could instead be changed to purge them, if someone has cycles to take a look at that, and it could potentially also be augmented to handle the 0-usage edge cases alluded to in comment 38. They'd want to look at the code around dom.quotaManager.longQuotaInfoLoadTimeThresholdMs and dom.quotaManager.unaccessedForLongTimeThresholdSec; specifically, this code goes and performs the archiving. But we wouldn't want to keep archiving things, since, as the functionality currently exists, that just bloats the profile and creates privacy problems.
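For anyone picking this up, the decision those two prefs drive is roughly the following (a schematic with invented names, not the actual code):

// Hypothetical schematic of the archiving decision; the two thresholds
// stand in for dom.quotaManager.longQuotaInfoLoadTimeThresholdMs and
// dom.quotaManager.unaccessedForLongTimeThresholdSec.
#include <cstdint>

bool ShouldArchiveOrigin(uint64_t aQuotaInfoLoadTimeMs,
                         uint64_t aLongLoadThresholdMs,
                         uint64_t aSecsSinceLastAccess,
                         uint64_t aUnaccessedThresholdSec) {
  // Act only when loading quota info was slow AND the origin has gone
  // unaccessed for a long time; the suggestion above is to purge rather
  // than archive such origins.
  return aQuotaInfoLoadTimeMs >= aLongLoadThresholdMs &&
         aSecsSinceLastAccess >= aUnaccessedThresholdSec;
}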