Training Nanochat on DGX Spark
Picking up from where I left off, it was time to attempt training. The first thing I did was look for others who had tried this so I could learn from them. Specifically, I expected to pick up tips on torchrun parameters tuned for my hardware.

I found this discussion on the Nanochat GitHub: https://github.com/karpathy/nanochat/discussions/28 which led me to this discussion on the NVIDIA DGX forum: https://forums.developer.nvidia.com/t/anyone-got-nanochat-training-working-on-the-dgx-spark/348537/8

I started by grabbing the torchrun command line from afalk42's post, but I hit the same problem he described: torchrun would error out complaining that Triton 3.5.0 didn't know what an sm_121a is (it's the Blackwell GB10). I additionally needed to run a series of steps to manually update CUDA to 13.0.2 and then point Triton at the ptxas version installed with CUDA 13.0.2.

For posterity, the instructions contained therein are repeated here: Install CU...
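Pointing Triton at a newer ptxas is typically done via the TRITON_PTXAS_PATH environment variable. A minimal sketch of that last step, assuming CUDA 13.0.2 installs under /usr/local/cuda-13.0 (the exact path on your system may differ):

```shell
# Verify the ptxas shipped with the updated CUDA toolkit is the one you expect.
# NOTE: the install path below is an assumption; adjust it to wherever
# CUDA 13.0.2 actually landed on your machine.
/usr/local/cuda-13.0/bin/ptxas --version

# Tell Triton to use this ptxas instead of its bundled copy, which is too old
# to recognize the sm_121a target.
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas

# Any torchrun invocation in this shell will now compile Triton kernels
# with the updated ptxas.
```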