Andrej Karpathy's Nanochat

Right around the time I decided to buy a DGX Spark and up my game in AI, Andrej Karpathy posted his magnum opus (so far), a project he'd been working on for quite a while: Nanochat (https://github.com/karpathy/nanochat). I discovered it through my boss mentioning it to me; it hit Hacker News et al. earlier that week. The idea is to perform all the steps from scratch (though using production tools like torch) to create an LLM similar in capability to GPT-2. I figured this was a great starting point for diving deeper into LLMs.

The tagline for Nanochat is "The best ChatGPT that $100 can buy." The idea is that it can be trained for about $100 worth of rented hardware time; the target hardware is an 8xH100 machine. It goes through tokenization, pretraining (the compute-intensive part), midtraining, supervised finetuning, reinforcement learning, and even a lightweight web UI to invoke the result.

A more expensively trained version of the same model (about $800) is hosted online by Karpathy and can be found here if you'd like to try it out:

https://nanochat.karpathy.ai/

A little about that training hardware: an 8xH100 is a machine with eight NVIDIA H100 Tensor Core GPUs, each typically with 80 GB of HBM3 memory, plus a server-class CPU or two like the Intel Xeon 8480C. It's a beast with a combined 640 GB of VRAM. I'd like to train on my DGX Spark. I think it will work: I have an effective 96 GB of VRAM on the Spark, a single H100 has 80 GB, and somewhere I saw Andrej mention that the script can be changed to run on a single H100. So it'll just work, right? lol. It's worth noting that I do expect it to be very slow; the H100 outclasses my hardware by a lot in both tensor cores and memory bandwidth.

Karpathy describes the steps of the "speedrun.sh" script in a good amount of detail here: https://github.com/karpathy/nanochat/discussions/1. As I'm not renting a machine and can run the steps at my leisure, I embarked on them interactively. 

I'm aware of Python's dependency resolution challenges and of the virtual environments used to keep them in check. I've used venv and conda in the past, as directed by my physician. However, I wasn't aware of uv, which seems pretty nice. It's a new package and environment manager written in Rust that pins both the version of Python being used and the installed libraries. I followed the instructions to set up a virtual environment for nanochat and was able to complete several steps of the process without incident. First, I cloned the repo (Karpathy shows the SSH syntax; I used the HTTPS one):

git clone https://github.com/karpathy/nanochat.git

Then I created the uv environment, acquired Rust, built his custom tokenizer (written in Rust), and trained the tokenizer, all according to plan.

The next step is pretraining. This is the costly step, and I wanted to confirm that the training would use the latest Blackwell features in torch (Blackwell being the architecture in the 50xx series of GPUs and in the GB10 that's in the DGX Spark). Before I started, I had two concerns: whether I had the optimal torch installed for relatively new hardware, and how to properly configure the training for my hardware (as opposed to 8xH100 hardware).

Starting with the first concern, I began by answering the question "what driver do I have installed?", since I knew this influences what CUDA version the upstream software stack can use. (An aside: I only realized this recently while researching "What is the cu128 build of Comfy UI?" It turns out the answer is a version of Comfy UI built to take advantage of CUDA 12.8. That package updates the torch stack to use a later version of CUDA than the default build, which works on a much wider range of hardware. My understanding is that such a build exists because CUDA 12.8 supports 50xx-series features, including enhanced FP8 and new support for FP4 and INT4 data types, and the resulting higher throughput for quantized models.)

Anyhow, to figure out what driver support I've got, I knew from my experience running GPUs in my Unraid server that I needed to run nvidia-smi. That outputs something like this:
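In case the screenshot doesn't come through, the top banner of nvidia-smi output looks roughly like this (the version numbers here are illustrative, not my exact ones):

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05      Driver Version: 580.95.05      CUDA Version: 13.0 |
+-----------------------------------------------------------------------------+
```

The "CUDA Version" shown here is the maximum CUDA version the installed driver supports, not the version of any installed toolkit.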


Note "CUDA Version: 13.0" in the upper right, so we're good there: higher than the 12.8 I was expecting.

So what version of torch do I have, and does it support CUDA 12.8+? The best way seemed to be to check what Python sees inside my virtual environment, so I ran a quick script like this one (with some guidance from ChatGPT):

uv run python - <<'PY'
import torch
print(f"Torch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"CUDA Version: {torch.version.cuda}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
PY

This gave me what I needed: it told me that I did not, in fact, have CUDA support of any kind. Further digging had me run:

uv pip list | grep torch

That told me I had 2.9.0+cpu. It turns out that +cpu doesn't mean it additionally supports CPU; rather, it only supports CPU, so it's a good thing I checked. For whatever reason, the pyproject.toml file specifying torch >= 2.8.0 did not pick a build that worked with my card. I'm too new to this stack to understand why, but I can fix it, right?

My first thought was to uninstall torch and install the right one. Not knowing how uv resolves dependencies, I wondered whether I could have had the wrong torch installed globally, with the version spec simply selecting that one. Seems unlikely, but easy to test. I started by uninstalling torch and reinstalling it:

uv pip uninstall torch

uv pip install torch

That had the effect you might expect: still CPU only. After a bit more research, I found that there are official torch packages for my setup. Specifically, I found:

https://download.pytorch.org/whl/cu130/torch-2.9.0+cu130-cp310-cp310-manylinux_2_28_aarch64.whl

This version string says it's torch 2.9.0 (I picked this because that's the version that was installed before I started messing with it), cu130 means CUDA 13.0 (the first one I found greater than 12.8), and cp310 says it's compiled for Python 3.10, which is the version in my environment. The manylinux_2_28_aarch64 part means it's built for ARM64 Linux, matching the Spark's ARM-based CPU.

In the process, I learned that to resolve this URI from uv pip install I needed to replace the + with %2B. (I sneakily have the link above point to the corrected version.) I also learned (or maybe was reminded) that the part after the + in a version string (the "local version" in PEP 440 terms) doesn't contribute to version resolution, so +cu130 and +cpu are equivalent for resolution purposes. This leaves me wondering how Karpathy's torch >= 2.8.0 version spec would ever guarantee a CUDA build on the 8xH100 target machines. More learning for another day. Anyhow, uninstalling torch and then reinstalling it as such should do it:

uv pip uninstall torch

uv pip install https://download.pytorch.org/whl/cu130/torch-2.9.0%2Bcu130-cp310-cp310-manylinux_2_28_aarch64.whl
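As a quick sanity check on that resolution claim, here's a sketch using the packaging library (the reference implementation of PEP 440 that pip builds on): both local-version variants satisfy the original spec equally well, and the + does need percent-encoding in a URL.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version
from urllib.parse import quote

# Both wheels carry the same public version (2.9.0); the part after '+'
# is a PEP 440 "local version" label and doesn't affect spec matching.
spec = SpecifierSet(">=2.8.0")
print(Version("2.9.0+cpu") in spec)     # True
print(Version("2.9.0+cu130") in spec)   # True

# The '+' must be percent-encoded as %2B when it appears in a URL path:
print(quote("torch-2.9.0+cu130"))       # torch-2.9.0%2Bcu130
```

So as far as the >= 2.8.0 spec is concerned, the CPU and CUDA builds are interchangeable, which is consistent with what I was seeing.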

To my surprise, after watching it download and install, I was still seeing CPU only. ChatGPT and I figured this was due to uv "fixing" it for me, since the version spec for the virtual environment says I should have the other one (I guess the CPU build wins for my machine as a better match, for reasons). That also points to the proper fix: update the project dependencies to specify exactly the torch I want. To do that, I changed the line in the dependencies section of pyproject.toml that said torch >= 2.8.0 to:

torch @ https://download.pytorch.org/whl/cu130/torch-2.9.0%2Bcu130-cp310-cp310-manylinux_2_28_aarch64.whl
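In context, the dependencies array in pyproject.toml ends up looking roughly like this (other entries elided; only the torch line changes):

```toml
[project]
dependencies = [
    # ...other nanochat dependencies unchanged...
    "torch @ https://download.pytorch.org/whl/cu130/torch-2.9.0%2Bcu130-cp310-cp310-manylinux_2_28_aarch64.whl",
]
```

If I understand uv correctly, uv run re-syncs the environment against pyproject.toml automatically, which would explain why the next check picked up the new wheel without an explicit reinstall.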

Now my Python script reports what I wanted to see:

Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
/home/mikeott/src/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning:
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
  warnings.warn(
CUDA Device: NVIDIA GB10

The warning, I'm told, is just PyTorch noting that my GPU's CUDA compute capability (12.1, which is separate from the CUDA version itself) is newer than the 8.0 through 12.0 range this build was compiled to support. Some future version will support it, but for now this is the latest build I could find that works.

So now I have torch updated to a version that uses CUDA 13.0 and I'm ready to tackle adjusting the training parameters to run on my hardware. That will have to wait for another article.
