
Training Nanochat on DGX Spark

Picking up from where I left off, it was time to attempt training. The first thing I did was look for others who had tried this so I could learn from them. Specifically, I was expecting to pick up tips on torchrun parameters tuned for my hardware. I found this discussion on the Nanochat GitHub: https://github.com/karpathy/nanochat/discussions/28 which led me to this discussion on the NVIDIA DGX forum: https://forums.developer.nvidia.com/t/anyone-got-nanochat-training-working-on-the-dgx-spark/348537/8 I started by grabbing the torchrun command line from afalk42's post. But I hit the same problem he described, where torchrun would error out because Triton 3.5.0 doesn't know what an sm_121a is (it's the Blackwell GB10). I needed to additionally run a series of steps to manually update CUDA to 13.0.2 and then point Triton to the ptxas version installed with CUDA 13.0.2. For posterity, the instructions contained therein are repeated here: Install CU...
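
To make that last step concrete, here is a minimal sketch of pointing Triton at the updated ptxas, assuming CUDA 13.0.2 landed in /usr/local/cuda-13.0 (that path is an assumption on my part; check where your install actually went). Recent Triton releases honor the TRITON_PTXAS_PATH environment variable, which tells Triton to use an external ptxas instead of its bundled one:

# Hypothetical install location; adjust to wherever CUDA 13.0.2 was installed on your Spark
export PATH=/usr/local/cuda-13.0/bin:$PATH
# Point Triton at the newer ptxas that understands sm_121a
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
# Sanity check: this should report a 13.0 release
ptxas --version

Set these in the same shell before launching torchrun so the Triton kernels compiled during training pick up the newer compiler.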

Comfy UI on DGX Spark

I took a brief break from Nanochat to try out some other workloads on the Spark. NVIDIA has a nice startup guide for the Spark here: https://build.nvidia.com/spark The first step is to verify prerequisites. Running the suggested steps in the Comfy UI guide from the link above, I found that I don't have the NVIDIA CUDA toolkit installed. That's rectified by opening a terminal session to the Spark from NVIDIA Sync and running (the other prerequisites were preinstalled for me, but you may want to check them from the link above):

sudo apt install nvidia-cuda-toolkit

After that, I pulled down the ComfyUI repository. I did this before the venv steps because I want the virtual environment in the ComfyUI directory, and cloning the repository creates it.

git clone https://github.com/comfyanonymous/ComfyUI.git

Then they have me set up a virtual environment using venv:

python3 -m venv comfyui-env
source comfyui-env/bin/activate

I'm not going off script on this one to use uv, but ...
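
For anyone following along, the rest of the flow from ComfyUI's own README is short. Here is a rough sketch of the remaining steps, run inside the activated venv; install the CUDA-enabled PyTorch build that NVIDIA's guide recommends for the Spark first, since picking the right wheel for the GB10 is the part I'd double-check against their instructions:

cd ComfyUI
# Install ComfyUI's Python dependencies into the active virtual environment
pip install -r requirements.txt
# Start the server; --listen binds it to all interfaces so the web UI is reachable from another machine
python main.py --listen

Then browse to port 8188 on the Spark to reach the UI.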

Andrej Karpathy's Nanochat

Right around the time that I decided to buy a DGX Spark and up my game in AI, Andrej Karpathy posted his magnum opus (so far), Nanochat (https://github.com/karpathy/nanochat), which he'd been working on for quite a while. I discovered it through my boss mentioning it to me; it had hit Hacker News et al. earlier that week. The idea is to perform all the steps from scratch (but using production tools like torch) to create an LLM that is similar in capability to GPT-2. I figured this is a great start to diving deeper into LLMs. The tagline for Nanochat is "The best ChatGPT that $100 can buy." The premise is that the target hardware can be rented for long enough to do the training for about $100. The target hardware is an 8xH100 machine. It goes through tokenization, pretraining (this is the compute-intensive part), midtraining, supervised finetuning, reinforcement learning, and even a lightweight web UI to invoke it. A more expensive to train version of the same mo...
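
For a sense of scale, the whole pipeline in the repo is driven from a single script, and the README suggests running it inside a detached screen session on the 8xH100 box so it can churn for a few hours unattended. A sketch from memory (check the README for the current invocation):

# Kick off the full "$100 speedrun" end to end and log the output so you can walk away
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh

On different hardware like the Spark, the interesting part is adjusting the torchrun launch inside that script, which is exactly what the "Training Nanochat on DGX Spark" post above digs into.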

I bought a DGX Spark

... and I'm using the expense of the device to hold myself accountable to learning, and I figured a blog would be a great way to record what I've learned and share it with others who may be on a similar journey. A little background about me: I know relatively little about AI. I've become a moderate daily user of ChatGPT, and I have been playing around with running local AI models. So, I'm a capable user of AI, but I haven't done much peeking under the hood. I'm primarily a Windows 11 user, but I prefer my MacBook Pro over any PC for the overall device experience. I also frequently use WSL for running Linux development tools, and I run a home server hosted on Unraid (I may talk more about that later). My day job is as a development director for a large software company, and I work in the games industry. On to the Spark: Today, Christmas came early and my DGX Spark was delivered two days before it was scheduled. I did the unboxing and plugged it in. Man is this de...