Gabriele Svelto [moved]

On Monday morning we (Mozilla) detected a very large crash spike affecting users on Linux, specifically on an older version of a Debian-based distribution.

It turned out to be an interesting bug involving the kernel and JavaScript code so let me tell you about it.

A thread 🧵

bugzilla.mozilla.org/show_bug. 1/6

bugzilla.mozilla.org: 1839669 - Google Images search reproducibly causes tab crash. RESOLVED (jdemooij) in Core - JavaScript Engine. Last updated 2023-06-24.

The crash started apparently out-of-the-blue, hitting thousands of Argentinian users on a Debian-based distro called Huayra, and specifically on version 5 which was based on Debian 10.

bugzilla.mozilla.org/show_bug.

Everybody seemed to crash while searching for images on Google. All versions of Firefox, even very old ones, were affected, suggesting that the change didn't happen on our side, but on Google's. 2/6

bugzilla.mozilla.org: 1839139 - Crash in [@ EnterBaseline] affecting users in the es-ar locale doing searches on Google. RESOLVED (nobody) in Core - JavaScript Engine: JIT. Last updated 2023-06-22.

A colleague analyzed Firefox's behavior at the point of the crash and realized that it happened during stack probing. The JIT touched the area that would hold the variables for the next JavaScript call and somehow hit an overflow.

bugzilla.mozilla.org/show_bug.

This is where things got weird: Google's code was allocating 20000 variables in a single frame. Ouch, that's probably some machine-generated code that got out of hand. Think twice before using ChatGPT to write code. 3/6

But why was it crashing? Linux automatically extends the stack and we had reserved more than enough space, something that we confirmed by looking at the memory map of the affected processes.

Well, it turns out that the Linux kernel used to have a check that prevented stack accesses that were too far from the stack pointer. Specifically, accesses more than 64KiB + 256 bytes below the stack pointer would crash instead of extending the stack.

github.com/torvalds/linux/blob 4/6

GitHub: linux/arch/x86/mm/fault.c at 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d · torvalds/linux
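
To make the numbers concrete, here is a minimal Python sketch of the old check as the thread describes it. It is a simplified model, not the kernel's code: the function names are made up for illustration, and the 8 bytes per variable slot is an assumption about how large a JavaScript value is on the stack.

```python
# Simplified model of the old x86 page-fault check described in the thread.
# All names here are hypothetical; the 8-byte slot size is an assumption.

CUSHION = 65536 + 256  # 64 KiB + 256 bytes


def old_kernel_extends_stack(fault_address: int, stack_pointer: int) -> bool:
    """Return True if the old check would let the stack grow for this access."""
    # The stack grows downward, so an access below the stack pointer has a
    # smaller address. Anything more than CUSHION bytes below was rejected.
    return fault_address + CUSHION >= stack_pointer


def frame_bytes(num_slots: int, slot_size: int = 8) -> int:
    """Size of a frame holding num_slots values (slot_size is an assumption)."""
    return num_slots * slot_size


sp = 0x7FFF_FFFF_0000               # arbitrary example stack pointer
small = sp - frame_bytes(1_000)     # touches 8,000 bytes below sp
huge = sp - frame_bytes(20_000)     # touches 160,000 bytes below sp

print(old_kernel_extends_stack(small, sp))  # True
print(old_kernel_extends_stack(huge, sp))   # False
```

Under those assumptions, a frame with 20000 slots reaches 160,000 bytes below the stack pointer, well past the 65,792-byte cushion, so touching its far end in one go would be treated as a bad access even though the address space was reserved.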

This was fixed in kernel 4.20 so users of more recent distros are unaffected, and we'll see if we can deploy a workaround to help users of older systems.

github.com/torvalds/linux/comm

It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control. 5/6

GitHub: x86/mm/fault: Allow stack access below %rsp · torvalds/linux@1d8ca3b. The current x86 page fault handler allows stack access below the stack pointer if it is no more than 64k+256 bytes. Any access beyond the 64k+ limit will cause a segmentation fault. The gcc -fstac...

@gabrielesvelto How did I miss hearing about TCP? I've been using 1p-isolate for years so it's not really relevant to me but I'd love to know the details on how they built a weaker 1p-isolate and whether it fully does the job or has exploitable weaknesses.

@gabrielesvelto thank you for the nice thread. I have a question on a different subject: will we see more code rolled out in Rust? Firefox still has a lot of memory bugs.

@Issa yes, we are continuously deploying Rust code into Firefox. My team has been replacing parts of Firefox with Rust this year, and also fixing bugs in Rust crates that we discovered while using Firefox

@gabrielesvelto sorry for my silly question, as I am not a technical guy. I still see Firefox having memory bugs, yet at the same time you say you are continuously deploying more Rust code. Could you explain this to me, sir Gabriele? We shouldn't see new memory bugs with Rust code.

@gabrielesvelto I wonder if it's intentional... I've been feeling conspiratorial about abject sabotage every time something doesn't work correctly in Firefox. Web apps failing in strange ways, like Samsara taking you to the wrong login page, or Plex just failing to authenticate. I suspect some Microsoft in the 90s is afoot at Google.

@MontgomeryGator I don't know about other services, but Google's services are deliberately optimized to work well for Chrome. I don't think this was deliberate sabotage but they most likely don't test much with Firefox.

@gabrielesvelto

And since we’re at it let’s shame Google for putting 20 thousand variables in a single function. Bad Google, no cookie.

I once worked on a game engine that used ODE as its physics layer. At the core of ODE collision detection and handling was a function that built a Jacobian matrix on the stack (using alloca) to compute the forces to apply to objects colliding to separate them. We crashed on touching the stack redzone in Windows when our engine ran as a plugin in Internet Explorer—not something we could fix easily on our end, since the size of a thread redzone is decided at compile time by the application configs (which, again, application is Internet Explorer).

Filed a ticket against ODE maintainers and their response was basically “We don’t consider that application domain to be a meaningful one to fix bugs in.” So we fixed it on our end by #define-ing alloca away to a heap allocation in a tiny buffer.

Point of this story is: no shame on Google. Google doesn’t consider the Firefox browser on old Linux configurations a meaningful application domain to fix bugs in. And if you can’t point to where in the JavaScript language spec it says 20,000 variables is disallowed… Shame on Mozilla for having a noncompliant JS implementation. ;)

At least it was easy to fix.

"It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control."

Oh yeah… That’s the nature of Internet software. It is interesting every time. :) I’ve had to get up from the keyboard and take a walk twice in my career, and the first time was when I realized if I’m going to be writing web software, that’s going to be, like, my whole career: stuff breaking because someone changed something somewhere that I was relying on for their own reasons. Internet software is like 1/3 technology and 2/3 social network effects.

@mtomczak this is sadly a very common occurrence for us. Just in the past two months we dealt with a couple of CPU bugs and an issue in a Rust crate that would only occur to people running Windows 7 installations w/o the SP1 installed on AVX-ready CPUs (yes, in 2023).

As for Google, they reverted the change before we contacted them, so chances are that it was either wildly inefficient or it also messed Chrome up.

@somecanuckchick yes, we had TCP for a while, I think more improvements were made over time but that's the link I had on hand

@gabrielesvelto my fear as a developer is that this is basically my life. With more and more inter-dependent code being used everywhere at some point you're going to end up working around the platform you're tied into itself.

Good on Mozilla for having the ability to inspect that this was happening and provide a backported fix.

@gabrielesvelto Compiling with stack clash protection will fix this 100% unless the extension is happening via JIT code rather than static native code.

@dalias we're building with -fstack-protector-strong and -fstack-clash-protection but this is happening in the JIT IIUC
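
For readers wondering why stack clash protection sidesteps the old kernel check, here is a rough Python model of the idea: grow the frame one page at a time and touch each page as you go, so no access ever lands far below the current stack pointer. This is only a sketch of the general technique under assumed values (4 KiB pages, a 160,000-byte frame); it is not SpiderMonkey's JIT code or the workaround Mozilla shipped, and the function names are invented.

```python
# Rough model of page-by-page stack probing. Names and sizes are assumptions.

PAGE_SIZE = 4096
CUSHION = 65536 + 256  # the old kernel limit quoted in the thread


def probe_page_by_page(stack_pointer: int, frame_size: int):
    """Yield (stack_pointer, touched_address) pairs while growing the frame.

    The stack pointer is lowered by at most one page before each touch, so
    every access lands at the current stack pointer, never far below it.
    """
    remaining = frame_size
    while remaining > 0:
        step = min(PAGE_SIZE, remaining)
        stack_pointer -= step
        remaining -= step
        yield stack_pointer, stack_pointer


def passes_old_check(fault_address: int, stack_pointer: int) -> bool:
    """Same simplified check as the old kernel behavior described above."""
    return fault_address + CUSHION >= stack_pointer


sp = 0x7FFF_FFFF_0000
probes = list(probe_page_by_page(sp, 160_000))

print(len(probes))                                              # 40
print(all(passes_old_check(addr, cur) for cur, addr in probes)) # True
print(passes_old_check(sp - 160_000, sp))                       # False
```

The last line is the failing case from the thread: touching the far end of the frame without moving the stack pointer first.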

@gabrielesvelto "JIT lacks stack clash protection" sounds like a CVE...

@gabrielesvelto Wait a sec, in 2023, there are active users, on a desktop PC, connected to the internet and running a Linux version older than 4.20? That's 5 years old at this point!!!

It's bullshit like this that makes me just want to stop Debian! How much engineering time is wasted because users or package maintainers refuse to update? It's not like they would lose features and it's free! But no, "stability" by dust gathering is still a thing...

Well, enjoy your crashes I guess?

@mupuf @gabrielesvelto I mean, first, Debian 10 is oldoldstable (as in, you're supposed to upgrade ASAP). Second, and more importantly, Debian doesn't have control over what a derivative distribution (like Huayra) does and when they update. Finally, buster-backports (packages.debian.org/buster-bac) contains a much newer kernel.

Debian itself is not at fault here.

packages.debian.org: Debian -- Details of package linux-image-amd64 in buster-backports

@chiraag @gabrielesvelto Agreed, and we are touching the root of the problem here: Distros like Debian pretend that they do what's best for stability and security, which is complete bullshit if you know a bit about software development. Derivatives of Debian just drank the Kool-Aid and don't feel any urgency to upgrade or migrate their users back to a new Debian when they become unmaintained.

IMO, kernels should be updated at a yearly cadence at the very least!

@mononoaware @gabrielesvelto You can point out issues and even be frustrated by what others decide to spend their time on, but no need to be insulting either to Debian users or developers!

If you really care about improving the Linux ecosystem, challenge people's preconceived notions, experiment with alternatives! Name calling is just weak and not the moral high ground you think it is...

@mupuf @gabrielesvelto blaming the victim is not a best practice.

@mupuf many users don't have the means or technical knowledge to upgrade their machines, or are simply not in control of them. In this particular instance chances are that many of the affected machines belonged to school/university deployments. We see this all the time.

Just last week we caught a bug in a Rust crate that only affected users on Windows 7 w/o the SP1 installed: bugzilla.mozilla.org/show_bug.

There are users on *those* configurations too.

bugzilla.mozilla.org: 1838108 - Crash in [@ core::ptr::const_ptr::impl$0::offset]. RESOLVED (dkeeler) in Core - Security: PSM. Last updated 2023-06-17.

@gabrielesvelto Oh, absolutely. I don't blame the users here, I blame the distros for failing them!

Debian’s model for stability and security doesn't work for desktops...

IMO, what desktop users need is a more rapid pace of updates, with staged rollout and a guaranteed ability to roll back (even automatically if the boot fails). Compared to maintaining all free software versions for 3 years or so, it is way less work and more stable/secure!

Servers can keep Debian's model, but with fewer pkgs.

@mupuf @gabrielesvelto feel free to step up and build such functionality into Debian.

@gabrielesvelto Google's JavaScript is produced by an optimizing compiler. It's machine-generated only in the sense that a compiler rearranged it in attempt to shrink code size or runtime.

@evmar yes, I'm sure this is some scalarization pass that accidentally turned a fixed-size array into a bunch of variables or something along those lines. The mention of ChatGPT was meant as a joke.
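
As a purely hypothetical illustration of that guess, the Python snippet below shows how a scalarization-style rewrite could turn one fixed-size array into one local variable per element. Nobody in the thread inspected the actual compiler pass, so the function name, the generated JavaScript, and the array length are all invented for the example.

```python
# Hypothetical illustration only: emit JavaScript source in which every
# element of a fixed-size array has become its own local variable.

def scalarize(array_name: str, length: int) -> str:
    """Return JavaScript source declaring one local per array element."""
    names = [f"{array_name}_{i}" for i in range(length)]
    decls = "".join(f"  var {name} = 0;\n" for name in names)
    return f"function generated() {{\n{decls}  return {names[-1]};\n}}\n"


source = scalarize("a", 20_000)
print(source.count("var "))  # 20000
```

A function like this is perfectly legal JavaScript, which is why the engine has to cope with a frame that needs room for every one of those locals.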

@gabrielesvelto I remember hearing from someone who worked on gmail that Firefox used to limit your DOM to 255 levels of nesting, after which point new children in the HTML would appear in the DOM as siblings.

They apparently had discovered this because they had generated such awful HTML that they hit that limit. 😬

@gabrielesvelto I haven't tried to look at the code, but my guess is that that was the result of relatively old functional language compilation techniques (lambda lifting comes to mind, but this isn't something I've thought much about since grad school), or maybe old-fashioned code generation like what Firefox uses for XPCOM and IPC, rather than any kind of “AI”.

@xlerb yes, that was meant to be a bit of a joke... But also a warning: if a compiler phase can go *this* bad, think how much crap "AI" could produce