﻿{"id":5856,"date":"2025-11-07T11:11:06","date_gmt":"2025-11-07T11:11:06","guid":{"rendered":"https:\/\/uni.hi.is\/helmut\/?p=5856"},"modified":"2026-03-25T13:54:08","modified_gmt":"2026-03-25T13:54:08","slug":"setting-up-a-spark-dgx","status":"publish","type":"post","link":"https:\/\/uni.hi.is\/helmut\/2025\/11\/07\/setting-up-a-spark-dgx\/","title":{"rendered":"Spark DGX"},"content":{"rendered":"<p>While systems based on the AMD Ryzen AI Max+ 395 are cheaper (and might in some cases even be faster), the NVIDIA Spark DGX systems have the advantage of providing the CUDA ecosystem. (However, if you just want to run LLMs, then Ollama and LM Studio support the AMD AI processors as well as CUDA.) Both systems suffer from the fact that standard DDR5 RAM is used (or, to be more precise, soldered LPDDR5X instead of upgradable DIMMs): if you have CUDA code that fits into the High-Bandwidth Memory (HBM) of a real graphics card, it will run ca. 5 times faster there than on this LPDDR5X RAM.<\/p>\n<p>I did some research on the Spark DGX:<\/p>\n<ul>\n<li>Comes with <em>no<\/em> <a href=\"https:\/\/www.naddod.com\/products\/13888.html?utm_source=blog&amp;utm_medium=476&amp;utm_campaign=inner&amp;utm_content=NADDOD%20200G%20QSFP56%20DAC%20cable\">200 Gb\/s QSFP56 Direct Attach Cable (DAC)<\/a> for clustering two of them without a 200 Gb\/s switch in between, i.e. you need to buy one separately, e.g. from <a href=\"https:\/\/www.fs.com\/de\/products\/115634.html\">here<\/a>. Currently, only a single 1:1 connection is supported for coupling two devices, but via a 200 Gb\/s switch you could perhaps connect more Sparks (if you have more...) over the 200 Gb\/s interface -- you can of course always use the 10 Gb\/s Ethernet ports connected to a switch.<\/li>\n<li><code class=\"\" data-line=\"\">enP7p1s0<\/code> is the Realtek-based 10 Gb\/s RJ45 Ethernet port. 
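(To verify the interface names and their state on your own unit, you can use the standard iproute2 one-liner <code class=\"\" data-line=\"\">ip -br link<\/code>, which lists all interfaces in brief format; this is not Spark-specific.) 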
The others (<code class=\"\" data-line=\"\">enp1s0f0np0 enp1s0f1np1 enP2p1s0f0np0 enP2p1s0f1np1<\/code>) are the 200 Gb\/s QSFP56 ports (<a href=\"https:\/\/www.reddit.com\/r\/LocalLLaMA\/comments\/1oieip0\/theoretically_scaling_beyond_2_dgx_sparks_in_a\/\">it seems that the CPU essentially consists of two parts, each with its own PCIe complex handling one 100 Gb\/s link, and therefore the 200 Gb\/s connection is reported as two 100 Gb\/s devices<\/a>).<\/li>\n<li>Setup is described here: <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/first-boot.html\">https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/first-boot.html<\/a><\/li>\n<li>During initial setup, the device creates a WiFi AP (the SSID is shown in a booklet that ships with the hardware) so that you can connect and create a user via a Web browser. It uses the Ethernet connection for installing OS updates. Once that setup has been finished, the WiFi AP gets disabled! You should be able to enable the AP again, e.g. using NetworkManager, but this will not NAT the WiFi AP client traffic via the Ethernet connection, i.e. while your WiFi AP client can access the DGX Spark, it cannot browse the Internet.<\/li>\n<li>From the documentation:<br \/>\n<blockquote><p>The machine name is your DGX Spark hostname with <code class=\"\" data-line=\"\">.local<\/code> appended, such as <code class=\"\" data-line=\"\">spark-xxxx.local<\/code>. You can find the default hostname on the Quick Start Guide that came in the box. The .local address uses mDNS (multicast DNS) to automatically locate your DGX Spark on the network without needing to know its IP address. This is particularly useful if your router periodically reassigns IP addresses.<\/p>\n<p>For <strong>Windows users:<\/strong> mDNS requires Bonjour Print Services from Apple. If you have iTunes or other Apple software installed, you likely already have this. Otherwise, you can download it from Apple\u2019s website. 
Alternatively, you can try using just the hostname without .local (such as spark-xxxx), though this method is less reliable on modern networks.<\/p>\n<p>Why <strong>.local might not work:<\/strong> .local hostnames may not work in enterprise networks with strict security policies, networks that block multicast traffic, or other restricted network environments.<\/p>\n<p><strong>Using an IP address instead<\/strong>: If .local hostnames do not work, you will need to use the IP address. To find the IP address, physically log in to your DGX Spark and click the <em>network icon<\/em> in the top right corner of the Ubuntu desktop. Select <em>Settings<\/em> from the dropdown menu, then navigate to the <em>Network<\/em> section. The IP address will be displayed under \"Realtek Ethernet\". Click on the settings icon to see its IPv4 and IPv6 address. Alternatively, you can log in to your router\u2019s administration console to view connected devices and their IP addresses.<\/p><\/blockquote>\n<\/li>\n<li>As the OS is Ubuntu 24.04, it uses <strong><a href=\"https:\/\/netplan.io\/\">Netplan<\/a><\/strong> for all network configuration.<br \/>\nTo change the hostname, you can either use the DGX Dashboard (see below, then <em>Settings<\/em>, <em>System<\/em>, <em>Edit Hostname<\/em>) or:<\/p>\n<p><code class=\"\" data-line=\"\">sudo hostnamectl set-hostname<\/code> <em>new_hostname<\/em><\/p>\n<p>E.g.: <code class=\"\" data-line=\"\">sudo hostnamectl set-hostname spark1<\/code><\/p>\n<p>Restart the device or its services for the change to take effect.<\/p>\n<p>Further netplan commands: <code class=\"\" data-line=\"\">netplan get<\/code>, <code class=\"\" data-line=\"\">netplan status --all<\/code><\/p>\n<p>To generate configuration files for all the involved network tools from the YAML file:<br \/>\n<code class=\"\" data-line=\"\">sudo netplan generate<\/code>. 
To try a configuration (it will revert after 120 s): <code class=\"\" data-line=\"\">netplan try<\/code>. To finally apply it persistently: <code class=\"\" data-line=\"\">netplan apply<\/code>\n<\/li>\n<li>Note: the shipped netplan file permissions are too open: run <code class=\"\" data-line=\"\">sudo chmod 600 \/etc\/netplan\/*<\/code> and then <code class=\"\" data-line=\"\">sudo netplan apply<\/code><\/li>\n<li>Talking about Ubuntu: <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-os-7-user-guide\/release_notes.html#latest-release\">Release notes<\/a> are available, but they refer both to the full-blown x86_64 DGX <em>and<\/em> the ARM64 DGX Spark, so search there for Spark. A <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/system-recovery.html\">recovery media archive file<\/a> is also available if you need to reinstall. In contrast to Ubuntu with its LTS releases, <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/dgx-os.html#release-cadence\">NVIDIA promises updates only for two years<\/a> and there are <a href=\"https:\/\/www.jeffgeerling.com\/blog\/2025\/dells-version-dgx-spark-fixes-pain-points\/#software\">concerns about NVIDIA supporting the Spark or other GB10 systems beyond a few years<\/a>.\n<\/li>\n<li>As this is standard Ubuntu, you will very likely need to do some variant of the <a href=\"https:\/\/github.com\/ME0094\/Ubuntu-22.04-LTS-Hardening-Commands\">usual hardening<\/a>.<\/li>\n<li>The <a href=\"https:\/\/build.nvidia.com\/spark\/connect-to-your-spark\/sync\">NVIDIA Sync tool<\/a> tunnels from your client machine via SSH to the DGX Spark. On first use, NVIDIA Sync asks for your username and password. On Linux systems, it then creates a password-less SSH key (so that in the future, NVIDIA Sync can tunnel without asking for a password) and copies the public key of that SSH key over to the DGX Spark (it will do so even if you already have an SSH key). 
On Mac or MS Windows systems, it does not seem to assume support for password-less SSH keys.\n<p>The <a href=\"https:\/\/build.nvidia.com\/spark\/connect-to-your-spark\/sync\">NVIDIA Sync documentation<\/a> contains some copy\/paste instructions for Ubuntu\/Debian on how to add it as an APT source: I do not like that they append it to <code class=\"\" data-line=\"\">sources.list<\/code> -- I would rather put it in a file of its own in <code class=\"\" data-line=\"\">sources.list.d\/<\/code>. Also, the provided <code class=\"\" data-line=\"\">deb<\/code> entry might need to be changed into <code class=\"\" data-line=\"\">deb [arch=amd64] <\/code>. After some time, I also had to update the GPG key from NVIDIA: see their curl line on the above NVIDIA page.<\/p>\n<li>There seems to be an issue with the file <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-jupyterlab-multi-user-mapping-requires-manual-yaml-edit-and-service-restart-enhancement-request\/350641\">\/opt\/nvidia\/dgx-dashboard-service\/jupyterlab_ports.yaml<\/a>. Restarting the system after a user has been added seems to help, though. The question is whether restarting some service using <code class=\"\" data-line=\"\">systemctl restart<\/code> would do the same job as rebooting.<br \/>\nUpdate: I tried <code class=\"\" data-line=\"\">systemctl restart dgx-dashboard.service<\/code> (which then asks for a sudo-enabled user to be used for that, incl. entering the password) and that did add the missing entry. 
Running it as <code class=\"\" data-line=\"\">sudo systemctl restart dgx-dashboard.service<\/code> did not ask for the user to be used.<\/li>\n<li>If there is a permission issue with Docker: <code class=\"\" data-line=\"\">sudo usermod -aG docker $USER<\/code><br \/>\n<code class=\"\" data-line=\"\">newgrp docker<\/code><\/li>\n<li><a href=\"https:\/\/build.nvidia.com\/spark\">Playbooks with first steps to play around<\/a>.\n<p>Start with <a href=\"https:\/\/build.nvidia.com\/spark\/dgx-dashboard\/instructions\">activating JupyterLab via the DGX Dashboard<\/a>. There seems to be an <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-jupyterlab-multi-user-mapping-requires-manual-yaml-edit-and-service-restart-enhancement-request\/350641\">issue with the file<\/a> <code class=\"\" data-line=\"\">\/opt\/nvidia\/dgx-dashboard-service\/jupyterlab_ports.yaml<\/code>, though. Also, the sample AI workload will generate a warning that the NVIDIA GB10 GPU has CUDA capability 12.1 while the minimum and maximum CUDA capability supported by PyTorch is 8.0 to 12.0: <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-playbook-pytorch-in-sample-code-not-supporting-cuda-12-1\/350762\">that warning can be ignored<\/a>.<\/li>\n<\/ul>\n<p>In general, the software quality delivered by NVIDIA is rather low: e.g. in December 2025, <a href=\"https:\/\/forums.developer.nvidia.com\/t\/broken-apt-update\/354897\/2\" target=\"_blank\">apt update was broken<\/a> and <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-spark-apt-update-failure-expkeysig\/354539\" target=\"_blank\">NVIDIA provided a fix only later<\/a>, which some report does not even work. 
In any case, this breaks unattended-upgrades and thus prevents automatic security updates.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While systems based on the AMD Ryzen AI Max+ 395 are cheaper (and might in some cases even be faster), the NVIDIA Spark DGX systems have the advantage of providing the CUDA ecosystem. (However, if you just want to run LLMs, then Ollama and LM Studio support the AMD AI processors as well as CUDA.) Both systems [&hellip;]<\/p>\n","protected":false},"author":512,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[139469],"tags":[],"class_list":["post-5856","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/users\/512"}],"replies":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/comments?post=5856"}],"version-history":[{"count":67,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856\/revisions"}],"predecessor-version":[{"id":6172,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856\/revisions\/6172"}],"wp:attachment":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/media?parent=5856"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/categories?post=5856"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/tags?post=5856"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}