NPM error message and fix - what's the explanation?

Hi all,

A few weeks ago, we were at KubeCon EU, and among the Ampere-related activities at the event, we had a very nice System76 Thelio Astra system at the OpenTelemetry Observatory. We’ve been doing quite a bit with OTel recently, and it has been great working with that community! The demo that they had prepared, however, had a few issues, and fixing them raised more questions (for me) than it answered. I have a hypothesis below, but would love to hear if anyone else has seen this issue and has a better explanation.

This demo is a reference application that uses multiple languages and runtimes to exercise OpenTelemetry, including a React.js front-end, which pulls a lot of JavaScript dependencies from npm. The application is designed to dispatch (using a Spring Boot dispatcher) parts of the Mandelbrot set to different Golang workers before rendering the result using React.js. The idea is to exercise many different cores to render quickly, and to give your observability platform good raw data to report on.

Unfortunately, when we ran npm build install on site, we got the delightfully helpful error message “Bus error (core dumped)”. We could not find the core to get a stack trace, so this was all we had to go on. Searching the Internet yielded a number of promising hits:

This error appears to happen more often on Arm64 nodes (lots of Mac links show up). The advice is universal, and typically arrives without any additional comment:

  • delete node_modules directory
  • delete the package-lock.json file
  • delete the _next directory (sometimes included in the advice)

And indeed this did fix our issue on site! Unfortunately, it involves downloading and rebuilding hundreds of megabytes of dependencies, which wasn’t ideal over conference Wi-Fi, but we got there!
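For anyone who wants to script that cleanup, here is a minimal Python sketch. The helper name clean_npm_state and the STALE_PATHS tuple are made up for illustration; the paths are simply the ones from the list above. It only deletes things, so npm install still has to be run afterwards.

```python
import shutil
from pathlib import Path

# Paths from the commonly quoted advice; "_next" is only sometimes mentioned.
STALE_PATHS = ("node_modules", "package-lock.json", "_next")


def clean_npm_state(project_dir: str) -> list[str]:
    """Delete the stale npm artifacts in project_dir and return what was removed."""
    removed = []
    for name in STALE_PATHS:
        target = Path(project_dir) / name
        if target.is_dir() and not target.is_symlink():
            shutil.rmtree(target)   # real directory: remove recursively
        elif target.exists() or target.is_symlink():
            target.unlink()         # regular file or symlink
        else:
            continue                # nothing there, nothing to report
        removed.append(name)
    return removed


if __name__ == "__main__":
    # Run from the front-end directory, then run "npm install" again.
    print("removed:", clean_npm_state("."))
```

Note that deleting package-lock.json lets npm resolve fresh versions, so the rebuilt tree may not match what the repository originally pinned.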

My question, though, is: what’s going on? And how should people head the issue off if it is a frequently occurring one on Arm64 systems?

My hypothesis is that the package-lock.json file included in the repository pins specific binary artifacts for npm to download, containing some kind of pre-compiled JavaScript modules that save rebuild time locally. If so, npm would end up downloading binary-incompatible compiled modules when they really need to be rebuilt locally (or npm needs some architecture awareness built in). Does that even make sense?
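One way to probe this hypothesis is to look at which lockfile entries are tied to a particular platform. The sketch below assumes a lockfile in the v2/v3 layout, where a top-level "packages" object maps install paths to metadata and platform-specific packages carry "os" and/or "cpu" arrays. The function name platform_specific_packages is hypothetical, and this only lists candidates; it does not prove that any of them caused the bus error.

```python
import json
from pathlib import Path


def platform_specific_packages(lock: dict) -> list[tuple[str, list[str], list[str]]]:
    """Return (install path, os list, cpu list) for entries restricted to a platform."""
    found = []
    for path, meta in lock.get("packages", {}).items():
        os_list = meta.get("os", [])
        cpu_list = meta.get("cpu", [])
        if os_list or cpu_list:
            found.append((path, os_list, cpu_list))
    return found


if __name__ == "__main__":
    lock_path = Path("package-lock.json")
    if lock_path.exists():
        lock = json.loads(lock_path.read_text())
        for path, os_list, cpu_list in platform_specific_packages(lock):
            print(f"{path}: os={os_list} cpu={cpu_list}")
```

If the lockfile lists packages for other architectures but nothing for linux/arm64, that would be at least consistent with the idea that the lockfile was generated on a different platform.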

Anyone else encountered this issue?

Dave.

Do you have a 64k page size on the system? I’ve seen this behavior with programs which don’t support 64k.

No, this was with 4K page size, the default Ubuntu 22.04 kernel I believe (possibly 24.04).
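For anyone wanting to check this on their own machine rather than going from memory, the page size can be read directly. A small sketch using Python's standard os.sysconf (Unix only); the function name page_size_bytes is just for illustration:

```python
import os


def page_size_bytes() -> int:
    """Return the memory page size the running kernel reports, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    size = page_size_bytes()
    print(f"page size: {size} bytes ({size // 1024}K)")
```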

Interesting, I just got home so I’ll see if I can reproduce this issue on my ARM hardware (M1 and Ampere).

I made this flake, then ran git add flake.nix and nix build .#frontend. It didn’t reproduce the issue.

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs";
    flake-parts.url = "github:hercules-ci/flake-parts";
    systems.url = "github:nix-systems/default-linux";
  };

  outputs = {
    self,
    nixpkgs,
    flake-parts,
    systems,
    ...
  }@inputs: flake-parts.lib.mkFlake { inherit inputs; } {
    systems = import inputs.systems;

    perSystem = { lib, pkgs, ... }: {
      packages.frontend = pkgs.stdenv.mkDerivation (finalAttrs: {
        pname = "otelbrot-frontend";
        version = "git+${inputs.self.shortRev or "dirty"}";

        src = lib.cleanSource inputs.self;
        sourceRoot = "${finalAttrs.src.name or "source"}/frontend";

        nativeBuildInputs = with pkgs; [
          nodejs
          npmHooks.npmConfigHook
        ];

        npmDeps = pkgs.fetchNpmDeps {
          name = "${finalAttrs.pname}-npm-deps-${finalAttrs.version}";
          inherit (finalAttrs) src sourceRoot;
          hash = "sha256-8KiyuD/u0tmJ/COX9YrU40quEE6IsP2cF7CjBxu9gDY=";
        };

        buildPhase = ''
          runHook preBuild
          npm run build
          runHook postBuild
        '';

        installPhase = ''
          runHook preInstall
          cp -r dist $out
          runHook postInstall
        '';
      });
    };
  };
}

Anyway … “just delete this and rebuild it” without explaining why that was causing the problem is one of my pet hates - tell me what was wrong! It’s even more infuriating when this works! This reminds me of old cars where you need to “jiggle” the key or pedals in some secret way to get them to start.

Heh yeah, I’m guessing something might’ve gotten corrupted. Maybe when the system shut down previously, things didn’t write correctly? :person_shrugging: Hard to say without having an actual copy of the exact state.

On a GH200 system with a 64K page size, I have to rebuild Node.js from source.

This Stack Overflow question describes core dumps, how to enable the core dump file (core dumps are typically disabled, as described there), and how to run gdb on the core dump file. This will show the stack trace of what crashed, which should tell you which application crashed. If the app doesn’t have symbols, the trace will only show hex addresses in place of function names, so you won’t be able to tell which function caused the crash, and it might not be very useful beyond identifying the specific app that crashed.

https://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-on-a-segmentation-fault
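Before chasing a missing core file, it can help to check whether the system would write one at all. The sketch below is Linux-specific and uses the standard resource module plus /proc/sys/kernel/core_pattern; the function name core_dump_status and the dictionary keys are made up for illustration. A soft limit of 0 means core files are disabled for the process, and a pattern starting with "|" means the kernel pipes the dump to a helper program instead of writing a file, which could be one reason a core never shows up in the working directory.

```python
import resource
from pathlib import Path


def core_dump_status() -> dict:
    """Summarise the core-dump settings that apply to this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is unavailable
    return {
        "soft_limit": soft,   # 0 disables core files; RLIM_INFINITY means no cap
        "hard_limit": hard,
        "enabled": soft != 0,
        "core_pattern": pattern,
        "piped_to_program": bool(pattern and pattern.startswith("|")),
    }


if __name__ == "__main__":
    for key, value in core_dump_status().items():
        print(f"{key}: {value}")
```

Limits are inherited by child processes, so the values that matter are the ones in the shell that launches npm.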
