{"id":6236,"date":"2026-06-05T15:33:09","date_gmt":"2026-06-05T15:33:09","guid":{"rendered":"https:\/\/uni.hi.is\/helmut\/?p=6236"},"modified":"2026-06-05T21:45:07","modified_gmt":"2026-06-05T21:45:07","slug":"fixing-the-ansible-reboot-freeze-on-nvidia-dgx-spark-clusters","status":"publish","type":"post","link":"https:\/\/uni.hi.is\/helmut\/2026\/06\/05\/fixing-the-ansible-reboot-freeze-on-nvidia-dgx-spark-clusters\/","title":{"rendered":"Fixing the Ansible Reboot No Network Connection Freeze on NVIDIA DGX Spark"},"content":{"rendered":"<p>This is my first AI-generated blog post: after hours of debugging an Ansible script that updates a DGX Spark (using <code class=\"\" data-line=\"\">fwupdmgr<\/code> and <code class=\"\" data-line=\"\">apt<\/code>) and includes a reboot, I involved AI to solve my problem.<\/p>\n<p>Before the automation, I finished the manual firmware and kernel update with a manual <\/p>\n<blockquote><p><code class=\"\" data-line=\"\">\/sbin\/reboot now<\/code><\/p><\/blockquote>\n<p>While this manual approach worked, the automated Ansible-based approach that I tried next froze the network connection: I could still log in via keyboard and mouse, but the network devices were not visible in the system logs. Only pulling the power plug brought the networking back. <\/p>\n<p>As an LLM helped me debug, it had all the context needed to create this blog post:<\/p>\n<p>When running updates via Ansible on the newer NVIDIA DGX Spark platforms, executing the standard <code class=\"\" data-line=\"\">ansible.builtin.reboot<\/code> module immediately following a major firmware and package upgrade can cause the target node to completely drop off the network. 
The system will stop responding to pings.<\/p>\n<p>If you check the logs via the physical console after a hard power cycle, you will find a quiet, catastrophic platform trace:<\/p>\n<blockquote>\n<p><code class=\"\" data-line=\"\">acpi NVDA8800:00: platform device creation failed: -16 (EBUSY)<\/code><\/p>\n<\/blockquote>\n<p>The primary PCIe Host Bridge Controller crashes, rendering the OS completely blind to PCI devices such as the network adapter. As a result, physical network interfaces (like the Realtek <code class=\"\" data-line=\"\">enP7s7<\/code>) vanish entirely from the system: running <code class=\"\" data-line=\"\">lspci<\/code> returns absolutely nothing.<\/p>\n<p>While DGX Spark forums catch the downstream symptoms of this fragility (for example, citing PCI DOE mailbox timeouts during warm resets), they usually miss the exact software trigger and offer no solution for Ansible-based reboot automation. Here is why it happens and how to fix your Ansible playbooks permanently.<\/p>\n<hr \/>\n<h2>The Root Cause: The SSH Handshake Trap<\/h2>\n<p>The standard <code class=\"\" data-line=\"\">ansible.builtin.reboot<\/code> module operates synchronously. It opens an SSH connection, triggers the reboot, and then holds that TCP socket open while polling the machine's status to watch it go down and verify when it returns online.<\/p>\n<p>Immediately following a massive <code class=\"\" data-line=\"\">apt dist-upgrade<\/code>, a Linux system sits in a highly volatile \"dirty state.\" Core shared libraries and network configurations are mid-mutation in memory.<\/p>\n<p>When <code class=\"\" data-line=\"\">systemd<\/code> attempts to dismantle the network driver stacks while Ansible is simultaneously fighting to hold that SSH socket open, a severe timing race condition triggers on the unified NVLink-C2C\/PCIe switch fabric. Because a warm, i.e. 
software reboot keeps the motherboard power sub-planes fully energized, the <code class=\"\" data-line=\"\">EBUSY (-16)<\/code> error flags stick in the hardware registers, locking out the PCIe bridges on the subsequent boot cycle.<\/p>\n<hr \/>\n<h2>The Solution: Staging and Fire-and-Forget Automation<\/h2>\n<p>To work around this platform trait, we have to implement two distinct changes in the Ansible script:<\/p>\n<ul>\n<li><strong>Rearrange the execution order:<\/strong> Assuming the script involves both firmware and package updates, stage firmware capsules <em>first<\/em> using <code class=\"\" data-line=\"\">fwupdmgr<\/code>, run <code class=\"\" data-line=\"\">apt<\/code> package upgrades <em>second<\/em>, and consolidate everything into a single reboot at the very end to minimize fabric disruptions.<\/li>\n<li><strong>Decouple the reboot execution:<\/strong> Force a low-level kernel storage flush, request an immediate hardware reset vector jump, and instruct Ansible to drop its hands off the connection before the network stack can stutter.<\/li>\n<\/ul>\n<p>With the reboot command wrapped in <code class=\"\" data-line=\"\">async: 10<\/code> and <code class=\"\" data-line=\"\">poll: 0<\/code>, Ansible hands the command off to the target without waiting for a result and instantly terminates the SSH connection that it established from the remote administration machine where you started the Ansible script. 
The fabric remains completely quiet, allowing the system to cycle cleanly.<\/p>\n<p>Here is the hardened, production-ready playbook snippet:<\/p>\n<pre><code class=\"\" data-line=\"\">    # ==========================================\n    # CONSOLIDATED DECOUPLED REBOOT\n    # ==========================================\n    # We use fire-and-forget (async\/poll) to drop the SSH connection\n    # BEFORE driver stacks cycle, preventing the PCIe fabric race condition.\n    - name: Data-safe sync and single hardware reset\n      ansible.builtin.shell: &quot;sync &amp;&amp; \/sbin\/reboot now&quot;\n      async: 10\n      poll: 0\n      when:\n        - reboot_required_file.stat.exists or firmware_staged | bool\n        - not ansible_check_mode\n\n    # Explicitly hold execution to accommodate firmware flashing and NVLink link-training cycles\n    - name: Wait for hardware initialization and fabric training to complete\n      ansible.builtin.wait_for_connection:\n        delay: 60\n        timeout: 1200\n      when:\n        - reboot_required_file.stat.exists or firmware_staged | bool\n        - not ansible_check_mode<\/code><\/pre>\n<hr \/>\n<h2>The Takeaway<\/h2>\n<p>The <code class=\"\" data-line=\"\">wait_for_connection<\/code> task above patiently pauses the automation: Ansible does not keep the network connection alive and gives the Grace Blackwell chip the full 60+ seconds it needs to process the firmware layers, serialize its memory planes, and finish calibrating and setting up the high-speed links. Once the links are up, Ansible safely re-authenticates.<\/p>\n<p>P.S. 
(Written by the human author, not the AI): Strictly speaking, you do not need the final <code class=\"\" data-line=\"\">ansible.builtin.wait_for_connection:<\/code> step (it even slows down your script), but it confirms that the DGX Spark regained network connectivity after the reboot.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is my first AI-generated blog post: after debugging for hours an Ansible script for updating (using fwupdmgr and apt) a Spark DGX that includes a reboot, I involved AI to solve my problem. Before the automation, I finished the manual firmware and kernel update with a manual \/sbin\/reboot now While this manual approach worked, [&hellip;]<\/p>\n","protected":false},"author":512,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[139469],"tags":[],"class_list":["post-6236","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/6236","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/users\/512"}],"replies":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/comments?post=6236"}],"version-history":[{"count":24,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/6236\/revisions"}],"predecessor-version":[{"id":6260,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/6236\/revisions\/6260"}],"wp:attachment":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/media?parent=6236"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/categories?post=6236"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/tags?post=6236"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}