Training Nanochat on DGX Spark
Picking up from where I left off, it was time to attempt training. The first thing I did was look for others who had tried this so I could learn from them. Specifically, I was hoping to pick up tips on torchrun parameters tuned for my hardware. I found this discussion on the Nanochat GitHub:
https://github.com/karpathy/nanochat/discussions/28
That led me to this discussion on the NVIDIA DGX forum:
https://forums.developer.nvidia.com/t/anyone-got-nanochat-training-working-on-the-dgx-spark/348537/8
I started by grabbing the torchrun command line from afalk42's post, but I hit the same problem he described: torchrun would error out because Triton 3.5.0 doesn't know what an sm_121a is (it's the Blackwell GB10). I additionally needed to run a series of steps to manually update CUDA to 13.0.2 and then point Triton at the ptxas version installed with CUDA 13.0.2. For posterity, the instructions from that thread are repeated here:
Install CUDA 13.0.2
The next step in the nanochat instructions would be to run pre-training, but that step will fail because the default ptxas bundled with Triton 3.5.0 is the CUDA 12.8 version and doesn't know about the sm_121a gpu-name of the Blackwell GB10.
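If you want to confirm that on your own setup, you can ask the installed Triton package where its bundled ptxas lives and check which CUDA release it reports. This is just a sketch; it assumes a recent Triton wheel, which ships ptxas under backends/nvidia/bin inside the package directory:
python3 -c "import triton, os; print(os.path.join(os.path.dirname(triton.__file__), 'backends', 'nvidia', 'bin', 'ptxas'))"
# run the printed path with --version; on a stock install it reports a CUDA 12.x release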
At this time, you need to go to the NVIDIA Developer website and install CUDA 13.0.2 manually by following the steps here: CUDA Toolkit 13.0 Update 2 Downloads | NVIDIA Developer
In particular, this was the sequence that worked for me on the DGX Spark:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2404-13-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
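If you want to sanity-check the install before wiring it up, the toolkit lands in /usr/local/cuda-13.0 by default, and its ptxas and nvcc should both report a 13.0 release:
/usr/local/cuda-13.0/bin/ptxas --version
/usr/local/cuda-13.0/bin/nvcc --version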
And now you need to tell Triton to use the new ptxas version you just installed with the CUDA 13.0.2 toolkit:
# assuming CUDA 13.0 is installed at /usr/local/cuda-13.0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}
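These exports only apply to the current shell. To have them survive a new terminal or a reboot, one option (assuming bash and the default install path) is to append them to ~/.bashrc:
cat >> ~/.bashrc << 'EOF'
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}
EOF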
Once this was done, I could launch the training with the parameters afalk42 had outlined and pretraining commenced!
Here's the command line I ran:
torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=20
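While it runs, a quick way to see whether the GPU is actually being kept busy is to watch nvidia-smi (which ships with the driver on DGX OS) from another terminal:
watch -n 5 nvidia-smi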
At the time of writing, it has been running for 4 1/2 hours and is 2% complete (!). Extrapolating (4.5 hours ÷ 0.02 ≈ 225 hours), I estimate it will take a total of about 9 1/2 days to train. I don't really have a baseline for whether this is slow for the hardware, but I suspect it is. I'll dig into why at some point; for the moment, I'm going on a business trip soon anyway, and it should be done when I'm back. So I'll leave it running.
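If you're launching a multi-day run like this over SSH, it also needs to survive a disconnect. One minimal option is nohup (tmux or screen work just as well); the log filename here is just an example:
nohup torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=20 > base_train.log 2>&1 &
tail -f base_train.log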