﻿{"id":5856,"date":"2025-11-07T11:11:06","date_gmt":"2025-11-07T11:11:06","guid":{"rendered":"https:\/\/uni.hi.is\/helmut\/?p=5856"},"modified":"2026-03-25T13:54:08","modified_gmt":"2026-03-25T13:54:08","slug":"setting-up-a-spark-dgx","status":"publish","type":"post","link":"https:\/\/uni.hi.is\/helmut\/2025\/11\/07\/setting-up-a-spark-dgx\/","title":{"rendered":"Spark DGX"},"content":{"rendered":"<p>While systems based on the AMD Ryzen AI Max+ 395 are cheaper (and might in some cases even be faster), the NVIDIA Spark DGX systems have the advantage of providing the CUDA ecosystem. (However, if you just want to run LLMs, then Ollama and LM Studio support the AMD AI processors as well as CUDA.) Both systems suffer from the fact that standard DDR5 RAM is used (or, to be more precise, soldered LPDDR5X instead of upgradable DIMMs): if you have CUDA code that fits into the High-Bandwidth Memory (HBM) of a real graphics card, it will run ca. 5 times faster there than on this LPDDR5X RAM.<\/p>\n<p>I did some research on the Spark DGX:<\/p>\n<ul>\n<li>Comes with <em>no<\/em> <a href=\"https:\/\/www.naddod.com\/products\/13888.html?utm_source=blog&amp;utm_medium=476&amp;utm_campaign=inner&amp;utm_content=NADDOD%20200G%20QSFP56%20DAC%20cable\">200 Gb\/s QSFP56 Direct Attach Cable (DAC)<\/a> for clustering two of them without a 200 Gb\/s switch in between, i.e. you need to buy one separately, e.g. from <a href=\"https:\/\/www.fs.com\/de\/products\/115634.html\">here<\/a>. Currently, only a single 1:1 connection is supported for coupling two devices, but via a 200 Gb\/s switch you could perhaps connect more Sparks (if you have more...) over the 200 Gb\/s interface -- you can of course always use the 10 Gb\/s Ethernet ports connected to a switch.<\/li>\n<li><code class=\"\" data-line=\"\">enP7p1s0<\/code> is the Realtek-based 10 Gb\/s RJ45 Ethernet port. 
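(To verify the interface names and their state on your own unit, you can use the standard iproute2 one-liner <code class=\"\" data-line=\"\">ip -br link<\/code>, which lists all interfaces in brief format; this is not Spark-specific.) 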
The others (<code class=\"\" data-line=\"\">enp1s0f0np0 enp1s0f1np1 enP2p1s0f0np0 enP2p1s0f1np1<\/code>) are the 200 Gb\/s QSFP56 ports (<a href=\"https:\/\/www.reddit.com\/r\/LocalLLaMA\/comments\/1oieip0\/theoretically_scaling_beyond_2_dgx_sparks_in_a\/\">it seems that the CPU essentially consists of two parts, each with its own PCIe complex handling one 100 Gb\/s link, and therefore the 200 Gb\/s connection is reported as two 100 Gb\/s devices<\/a>).<\/li>\n<li>Setup is described here: <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/first-boot.html\">https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/first-boot.html<\/a><\/li>\n<li>During initial setup, the device creates a WiFi AP (the SSID is shown in a booklet that ships with the hardware) so that you can connect and create a user via a Web browser. It uses the Ethernet connection for installing OS updates. Once that setup has been finished, the WiFi AP gets disabled! You should be able to enable the AP again, e.g. using NetworkManager, but this will not NAT the WiFi AP client traffic via the Ethernet connection, i.e. while your WiFi AP client can access the DGX Spark, it cannot browse the Internet.<\/li>\n<li>From the documentation:<br \/>\n<blockquote><p>The machine name is your DGX Spark hostname with <code class=\"\" data-line=\"\">.local<\/code> appended, such as <code class=\"\" data-line=\"\">spark-xxxx.local<\/code>. You can find the default hostname on the Quick Start Guide that came in the box. The .local address uses mDNS (multicast DNS) to automatically locate your DGX Spark on the network without needing to know its IP address. This is particularly useful if your router periodically reassigns IP addresses.<\/p>\n<p>For <strong>Windows users:<\/strong> mDNS requires Bonjour Print Services from Apple. If you have iTunes or other Apple software installed, you likely already have this. Otherwise, you can download it from Apple\u2019s website. 
Alternatively, you can try using just the hostname without .local (such as spark-xxxx), though this method is less reliable on modern networks.<\/p>\n<p>Why <strong>.local might not work:<\/strong> .local hostnames may not work in enterprise networks with strict security policies, networks that block multicast traffic, or other restricted network environments.<\/p>\n<p><strong>Using an IP address instead<\/strong>: If .local hostnames do not work, you will need to use the IP address. To find the IP address, physically log in to your DGX Spark and click the <em>network icon<\/em> in the top right corner of the Ubuntu desktop. Select <em>Settings<\/em> from the dropdown menu, then navigate to the <em>Network<\/em> section. The IP address will be displayed under \"Realtek Ethernet\". Click on the settings icon to see its IPv4 and IPv6 address. Alternatively, you can log in to your router\u2019s administration console to view connected devices and their IP addresses.<\/p><\/blockquote>\n<\/li>\n<li>As the OS is Ubuntu 24.04, it uses <strong><a href=\"https:\/\/netplan.io\/\">Netplan<\/a><\/strong> for all network configuration.<br \/>\nTo change the hostname, you can either use the DGX Dashboard (see below, then <em>Settings<\/em>, <em>System<\/em>, <em>Edit Hostname<\/em>) or:<\/p>\n<p><code class=\"\" data-line=\"\">sudo hostnamectl set-hostname<\/code> <em>new_hostname<\/em><\/p>\n<p>E.g.: <code class=\"\" data-line=\"\">sudo hostnamectl set-hostname spark1<\/code><\/p>\n<p>Restart the device or its services for the change to take effect.<\/p>\n<p>Further netplan commands: <code class=\"\" data-line=\"\">netplan get<\/code>, <code class=\"\" data-line=\"\">netplan status --all<\/code><\/p>\n<p>To generate configuration files for all the involved network tools from the YAML file:<br \/>\n<code class=\"\" data-line=\"\">sudo netplan generate<\/code>. 
To try a configuration (it will revert after 120 s): <code class=\"\" data-line=\"\">netplan try<\/code>. To finally apply it persistently: <code class=\"\" data-line=\"\">netplan apply<\/code>\n<\/li>\n<li>Note: the shipped netplan file permissions are too open: run <code class=\"\" data-line=\"\">sudo chmod 600 \/etc\/netplan\/*<\/code> and then <code class=\"\" data-line=\"\">sudo netplan apply<\/code><\/li>\n<li>Talking about Ubuntu: <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-os-7-user-guide\/release_notes.html#latest-release\">Release notes<\/a> are available, but they refer both to the full-blown x86_64 DGX <em>and<\/em> the ARM64 DGX Spark, so search there for Spark. A <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/system-recovery.html\">recovery media archive file<\/a> is also available if you need to reinstall. In contrast to Ubuntu with its LTS releases, <a href=\"https:\/\/docs.nvidia.com\/dgx\/dgx-spark\/dgx-os.html#release-cadence\">NVIDIA promises updates only for two years<\/a> and there are <a href=\"https:\/\/www.jeffgeerling.com\/blog\/2025\/dells-version-dgx-spark-fixes-pain-points\/#software\">concerns about NVIDIA supporting the Spark or other GB10 systems beyond a few years<\/a>.\n<\/li>\n<li>As this is standard Ubuntu, you will very likely need to do some variant of the <a href=\"https:\/\/github.com\/ME0094\/Ubuntu-22.04-LTS-Hardening-Commands\">usual hardening<\/a>.<\/li>\n<li>The <a href=\"https:\/\/build.nvidia.com\/spark\/connect-to-your-spark\/sync\">NVIDIA Sync tool<\/a> tunnels from your client machine via SSH to the DGX Spark. On first use, NVIDIA Sync asks for your username and password. On Linux systems, it then creates a password-less SSH key (so that in the future, NVIDIA Sync can tunnel without asking for a password) and copies the public key of that SSH key over to the DGX Spark (it will do so even if you already have an SSH key). 
On Mac or MS Windows systems, it does not seem to assume support for password-less SSH keys.\n<p>The <a href=\"https:\/\/build.nvidia.com\/spark\/connect-to-your-spark\/sync\">NVIDIA Sync documentation<\/a> contains some copy\/paste instructions for Ubuntu\/Debian on how to add it as an APT source: I do not like that they append it to <code class=\"\" data-line=\"\">sources.list<\/code> -- I would rather put it in a file of its own in <code class=\"\" data-line=\"\">sources.list.d\/<\/code>. Also, the provided <code class=\"\" data-line=\"\">deb<\/code> entry might need to be changed into <code class=\"\" data-line=\"\">deb [arch=amd64] <\/code>. After some time, I also had to update the GPG key from NVIDIA: see their curl line on the above NVIDIA page.<\/p>\n<li>There seems to be an issue with the file <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-jupyterlab-multi-user-mapping-requires-manual-yaml-edit-and-service-restart-enhancement-request\/350641\">\/opt\/nvidia\/dgx-dashboard-service\/jupyterlab_ports.yaml<\/a>. Restarting the system after a user has been added seems to help, though. The question is whether restarting some service using <code class=\"\" data-line=\"\">systemctl restart<\/code> would do the same job as rebooting.<br \/>\nUpdate: I tried <code class=\"\" data-line=\"\">systemctl restart dgx-dashboard.service<\/code> (which then asks for a sudo-enabled user to be used for that, incl. entering the password) and that did add the missing entry. 
Running it as <code class=\"\" data-line=\"\">sudo systemctl restart dgx-dashboard.service<\/code> did not ask for the user to be used.<\/li>\n<li>If there is a permission issue with Docker: <code class=\"\" data-line=\"\">sudo usermod -aG docker $USER<\/code><br \/>\n<code class=\"\" data-line=\"\">newgrp docker<\/code><\/li>\n<li><a href=\"https:\/\/build.nvidia.com\/spark\">Playbooks with first steps to play around<\/a>.\n<p>Start with <a href=\"https:\/\/build.nvidia.com\/spark\/dgx-dashboard\/instructions\">activating JupyterLab via the DGX Dashboard<\/a>. There seems to be an <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-jupyterlab-multi-user-mapping-requires-manual-yaml-edit-and-service-restart-enhancement-request\/350641\">issue with the file<\/a> <code class=\"\" data-line=\"\">\/opt\/nvidia\/dgx-dashboard-service\/jupyterlab_ports.yaml<\/code>, though. Also, the sample AI workload will generate a warning that the NVIDIA GB10 GPU has CUDA capability 12.1 while the minimum and maximum CUDA capability supported by PyTorch is 8.0 to 12.0: <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-dashboard-playbook-pytorch-in-sample-code-not-supporting-cuda-12-1\/350762\">that warning can be ignored<\/a>.<\/li>\n<\/ul>\n<p>In general, the software quality delivered by NVIDIA is rather low: e.g. in December 2025, <a href=\"https:\/\/forums.developer.nvidia.com\/t\/broken-apt-update\/354897\/2\" target=\"_blank\">apt update was broken<\/a> and <a href=\"https:\/\/forums.developer.nvidia.com\/t\/dgx-spark-apt-update-failure-expkeysig\/354539\" target=\"_blank\">NVIDIA provided a fix only later<\/a>, which some report does not even work. 
In any case, this breaks unattended-upgrades and thus prevents automatic security updates.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While systems based on the AMD Ryzen AI Max+ 395 are cheaper (and might in some cases even be faster), the NVIDIA Spark DGX systems have the advantage of providing the CUDA ecosystem. (However, if you just want to run LLMs, then Ollama and LM Studio support the AMD AI processors as well as CUDA.) Both systems [&hellip;]<\/p>\n","protected":false},"author":512,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[139469],"tags":[],"class_list":["post-5856","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/users\/512"}],"replies":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/comments?post=5856"}],"version-history":[{"count":67,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856\/revisions"}],"predecessor-version":[{"id":6172,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/5856\/revisions\/6172"}],"wp:attachment":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/media?parent=5856"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/categories?post=5856"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/tags?post=5856"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}