burtenshaw committed

Commit a224632 · 1 Parent(s): e3af5c6

improve structure and layout
app/src/content/article.mdx CHANGED
@@ -570,8 +570,6 @@ In `modular_nanochat.py`, we don't need to write this logic at all. As seen in t

It's very clear that Andrej Karpathy's implementation offers 10 times more to learn from than the transformers version, which inherits almost entirely from existing models and features. That said, we can still take more away from the inherited modular modeling implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.

- # Hands-on Tutorial
-
Ok. Let's cut the philosophy and see what we can do with `nanochat` in transformers.

<Inference />
app/src/content/chapters/inference.mdx CHANGED
@@ -1,4 +1,10 @@
- ## Inference on `nano` in Transformers
+ ## Example 1: Inference on nanochat in Transformers
+
+ <Sidenote>
+
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/datasets/nanochat-students/notebooks/blob/main/inference.ipynb)
+
+ </Sidenote>

The first bonus tutorial will help you do basic inference in `transformers`:

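In code, the basic flow of that tutorial looks roughly like this (a minimal sketch, reusing the model id and revision from the `inference.ipynb` added in this commit):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "karpathy/nanochat-d32"
revision = "refs/pr/1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# bfloat16 on GPU, float32 on CPU, as in the notebook
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    torch_dtype=torch.bfloat16 if device.type == "cuda" else torch.float32,
).to(device)

# Plain autoregressive completion
inputs = tokenizer("The Eiffel Tower stands in Paris and", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```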
app/src/content/chapters/sft.mdx CHANGED
@@ -1,7 +1,7 @@
import Sidenote from '../../components/Sidenote.astro'
import Note from '../../components/Note.astro'

- # [BONUS 2] Supervised Fine-tuning in `torch`
+ ## Example 2: Supervised Fine-tuning in torch

<Sidenote>

@@ -19,7 +19,7 @@ In this tutorial, we'll fine-tune the NanoChat model using pure PyTorch, giving

</Note>

- ## Import model and tokenizer
+ ### Import model and tokenizer

<Sidenote>

@@ -51,7 +51,7 @@ model = AutoModelForCausalLM.from_pretrained(

We use `bfloat16` precision on GPU to reduce memory usage while maintaining training stability. On CPU, we fall back to `float32` for compatibility.

- ## Setup LoRA
+ ### Setup LoRA

<Sidenote>

@@ -87,7 +87,7 @@ trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627

With LoRA, we're only training **0.06%** of the model's parameters—just over 1 million weights instead of 1.8 billion. This makes fine-tuning feasible on consumer hardware.

- ## Demo the model
+ ### Demo the model

<Sidenote>

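For reference, the LoRA setup those numbers come from (it appears verbatim in the `sft.ipynb` diff further down) is:

```python
from peft import LoraConfig, get_peft_model

# Rank-1 adapters on every attention and MLP projection: tiny, but enough
# to demonstrate the workflow on consumer hardware.
lora_config = LoraConfig(
    r=1,
    lora_alpha=2,
    lora_dropout=0.00,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "fc1", "fc2"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627
```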
@@ -179,7 +179,7 @@ Generated: The capital of France is Paris.<|assistant_end|>

Notice the special tokens: `<|bos|>`, `<|user_start|>`, `<|assistant_start|>`, etc. These delimiters help the model understand conversation structure.

- ## Dataset
+ ### Dataset

<Sidenote>

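Those delimiters come from the tokenizer's chat template; a sketch based on the notebook's TEST 2 cell (assuming the `tokenizer` and `device` from the setup above):

```python
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
]

# apply_chat_template wraps each turn in the special delimiter tokens
inputs = tokenizer.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(device)

print(tokenizer.decode(inputs["input_ids"][0]))
# <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>
```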
@@ -196,8 +196,6 @@ train_dataset = splits["train"]
eval_dataset = splits["test"]
```

- ### Process the Dataset
-
<Sidenote>

The [datasets map](https://huggingface.co/docs/datasets/process#map) function applies transformations efficiently with caching and multiprocessing support.
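Concretely, the mapping step looks like this (a sketch of the notebook's `format_example`: each conversation is rendered with the chat template once, up front, so training only sees token ids):

```python
max_length = 2048

def format_example(example):
    # Tokenize the full conversation; no generation prompt since we train on it
    formatted = tokenizer.apply_chat_template(
        example["messages"],
        add_generation_prompt=False,
        truncation=True,
        max_length=max_length,
        return_dict=True,
        return_tensors="pt",
    )
    return {
        "input_ids": formatted["input_ids"][0].tolist(),
        "attention_mask": formatted["attention_mask"][0].tolist(),
    }

train_dataset = train_dataset.map(format_example, remove_columns=train_dataset.column_names)
eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)
```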
@@ -236,31 +234,6 @@ eval_dataset = eval_dataset.select(range(min(len(eval_dataset), max_eval_example
eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)
```

- ## Training Configuration
-
- These hyperparameters control the training dynamics. We use conservative values that work well across different hardware:
-
- ```python
- train_batch_size = 2
- eval_batch_size = 2
- num_epochs = 1
- gradient_accumulation_steps = 4
- learning_rate = 1e-5
- weight_decay = 0.0
- warmup_ratio = 0.03
- logging_frequency = 10
- ```
-
- <Sidenote>
-
- **Gradient accumulation** simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Effective batch size = `train_batch_size × gradient_accumulation_steps` = 8.
-
- </Sidenote>
-
- Key configuration choices include using a low learning rate (`1e-5`), as LoRA generally requires smaller learning rates given that the base model weights are kept frozen. Additionally, gradient accumulation is employed to enable larger effective batch sizes, which helps when training on GPUs with limited memory.
-
- ## Create a DataLoader
-
<Sidenote>

PyTorch's [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) handles batching, shuffling, and parallel data loading automatically.
@@ -288,7 +261,30 @@ eval_loader = DataLoader(eval_dataset, batch_size=eval_batch_size, shuffle=False

Setting padding tokens to `-100` in labels tells PyTorch's cross-entropy loss to ignore them—we don't want to penalize the model for not predicting padding.

- ## Optimizer
+ ## Training
+
+ These hyperparameters control the training dynamics. We use conservative values that work well across different hardware:
+
+ ```python
+ train_batch_size = 2
+ eval_batch_size = 2
+ num_epochs = 1
+ gradient_accumulation_steps = 4
+ learning_rate = 1e-5
+ weight_decay = 0.0
+ warmup_ratio = 0.03
+ logging_frequency = 10
+ ```
+
+ <Sidenote>
+
+ **Gradient accumulation** simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Effective batch size = `train_batch_size × gradient_accumulation_steps` = 8.
+
+ </Sidenote>
+
+ Key configuration choices include using a low learning rate (`1e-5`), as LoRA generally requires smaller learning rates given that the base model weights are kept frozen. Additionally, gradient accumulation is employed to enable larger effective batch sizes, which helps when training on GPUs with limited memory.
+
+ ### Optimizer

<Sidenote>

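In the training loop itself, accumulation amounts to scaling and summing losses over several micro-batches before each optimizer step. A minimal sketch, following the loop in the accompanying `sft.ipynb` (names as defined above):

```python
for step, batch in enumerate(train_loader, start=1):
    batch = {key: value.to(device) for key, value in batch.items()}
    outputs = model(**batch)

    # Scale so the accumulated gradients average over the effective batch
    loss = outputs.loss / gradient_accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches

    if step % gradient_accumulation_steps == 0:
        optimizer.step()       # one update per 4 micro-batches => effective batch of 8
        scheduler.step()
        optimizer.zero_grad()
```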
@@ -306,7 +302,7 @@ optimizer = torch.optim.AdamW(
)
```

- ## Learning Rate Scheduler
+ ### Learning Rate Scheduler

<Sidenote>

@@ -323,8 +319,6 @@ warmup_steps = max(1, int(max_train_steps * warmup_ratio))
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, max_train_steps)
```

- ## The Training Loop
-
<Sidenote>

For distributed training across multiple GPUs, consider [Accelerate](https://huggingface.co/docs/accelerate/index), which wraps this loop with minimal code changes.
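For the curious, the adaptation the sidenote hints at looks roughly like this. A hedged sketch, not part of this commit; variable names follow the tutorial, and the calls are standard Accelerate API:

```python
from accelerate import Accelerator

# Accelerate owns device placement, gradient accumulation, and (when launched
# with `accelerate launch`) distribution across processes and GPUs.
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

for batch in train_loader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        accelerator.backward(outputs.loss)  # replaces loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```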
@@ -398,3 +392,63 @@ step=00040 | loss=1.7935 | lr=5.33e-06
step=00050 | loss=1.8029 | lr=6.67e-06
...
```
+
+ ## Example 3: Fine-tuning with TRL
+
+ <Sidenote>
+
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb)
+
+ </Sidenote>
+
+ Finally, we can implement the training loop above with TRL, which simplifies the code considerably and abstracts away a lot of the complexity (trading away some of the educational value), but brings all the bells and whistles of a production-ready solution.
+
+ We can define the training arguments like this:
+
+ ```python
+ from trl import SFTConfig
+
+ training_args = SFTConfig(
+     per_device_train_batch_size=1,  # Batch size per GPU
+     gradient_accumulation_steps=4,  # Gradients are accumulated over multiple steps → effective batch size = 1 × 4 = 4
+     warmup_steps=5,
+     # num_train_epochs=1,           # Number of full dataset passes. For shorter training, use `max_steps` instead (this case)
+     max_steps=30,
+     learning_rate=2e-4,             # Learning rate for the optimizer
+     optim="paged_adamw_8bit",       # Optimizer
+
+     # Logging / reporting
+     logging_steps=1,                # Log training metrics every N steps
+     report_to="trackio",            # Experiment tracking tool
+     trackio_space_id=output_dir,    # HF Space where the experiment tracking will be saved
+     output_dir=output_dir,          # Where to save model checkpoints and logs
+
+     max_length=1024,                # Maximum input sequence length
+     use_liger_kernel=True,          # Enable Liger kernel optimizations for faster training
+     activation_offloading=True,     # Offload activations to CPU to reduce GPU memory usage
+     gradient_checkpointing=True,    # Save memory by re-computing activations during backpropagation
+
+     # Hub integration
+     push_to_hub=True,               # Automatically push the trained model to the Hugging Face Hub
+     # The model will be saved under your Hub account in the repository named `output_dir`
+ )
+ ```
+
+ Then we can create the trainer object, and TRL will deal with data loading, batching, and the training loop:
+
+ ```python
+ from trl import SFTTrainer
+
+ trainer = SFTTrainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     peft_config=peft_config,
+ )
+ ```
+
+ Finally, we kick off training:
+
+ ```python
+ trainer_stats = trainer.train()
+ ```
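As a follow-up (a sketch, not part of this commit): with `push_to_hub=True` the trained adapter is uploaded to the Hub under `output_dir` when training finishes, and you can also save or push explicitly with the trainer's standard methods:

```python
trainer.save_model(output_dir)  # write the final model/adapter locally
trainer.push_to_hub()           # upload it to your Hugging Face Hub account
```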
inference.ipynb ADDED
@@ -0,0 +1,187 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b7eb261b",
6
+ "metadata": {
7
+ "id": "b7eb261b"
8
+ },
9
+ "source": [
10
+ "# NanoChat Easy - Inference\n"
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "markdown",
15
+ "id": "8b8a04a8",
16
+ "metadata": {
17
+ "id": "8b8a04a8"
18
+ },
19
+ "source": [
20
+ "## Import model and tokenizer\n"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "id": "3e48247c",
27
+ "metadata": {
28
+ "id": "3e48247c",
29
+ "outputId": "882fcf01-34fb-4123-e84c-deefdf477814"
30
+ },
31
+ "outputs": [
32
+ {
33
+ "name": "stderr",
34
+ "output_type": "stream",
35
+ "text": [
36
+ "/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
37
+ " from .autonotebook import tqdm as notebook_tqdm\n",
38
+ "`torch_dtype` is deprecated! Use `dtype` instead!\n"
39
+ ]
40
+ }
41
+ ],
42
+ "source": [
43
+ "import torch\n",
44
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
45
+ "\n",
46
+ "\n",
47
+ "model_id = \"karpathy/nanochat-d32\"\n",
48
+ "revision = \"refs/pr/1\"\n",
49
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
50
+ "\n",
51
+ "\n",
52
+ "tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)\n",
53
+ "model = AutoModelForCausalLM.from_pretrained(\n",
54
+ " model_id,\n",
55
+ " revision=revision,\n",
56
+ " torch_dtype=torch.bfloat16 if device.type == \"cuda\" else torch.float32,\n",
57
+ ").to(device)\n"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "markdown",
62
+ "id": "4810af1a",
63
+ "metadata": {
64
+ "id": "4810af1a"
65
+ },
66
+ "source": [
67
+ "## Demo the model\n"
68
+ ]
69
+ },
70
+ {
71
+ "cell_type": "code",
72
+ "execution_count": null,
73
+ "id": "b3e81aa9",
74
+ "metadata": {
75
+ "id": "b3e81aa9",
76
+ "outputId": "1cde7e69-7ff1-4bfe-aa9f-9ded20249d82"
77
+ },
78
+ "outputs": [
79
+ {
80
+ "name": "stdout",
81
+ "output_type": "stream",
82
+ "text": [
83
+ "================================================================================\n",
84
+ "TEST 1: Plain Autoregressive Prompt\n",
85
+ "================================================================================\n",
86
+ "Prompt: The Eiffel Tower stands in Paris and\n",
87
+ "\n",
88
+ "Generated: is one of the most famous landmarks in the world. It is located on the Champ de Mars in the heart of the city. The tower was built for the 1889 World's Fair. It was designed by the French engineer Gustave Eiffel and took 2 years to build. The Eiffel Tower stands 324 meters\n",
89
+ "================================================================================\n"
90
+ ]
91
+ }
92
+ ],
93
+ "source": [
94
+ "print(\"=\" * 80)\n",
95
+ "print(\"TEST 1: Plain Autoregressive Prompt\")\n",
96
+ "print(\"=\" * 80)\n",
97
+ "prompt = \"The Eiffel Tower stands in Paris and\"\n",
98
+ "test_inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n",
99
+ "\n",
100
+ "\n",
101
+ "with torch.no_grad():\n",
102
+ " test_outputs = model.generate(\n",
103
+ " **test_inputs,\n",
104
+ " max_new_tokens=64,\n",
105
+ " do_sample=False,\n",
106
+ " pad_token_id=tokenizer.pad_token_id,\n",
107
+ " )\n",
108
+ "\n",
109
+ "generated_tokens = test_outputs[0, test_inputs[\"input_ids\"].shape[1] :]\n",
110
+ "print(f\"Prompt: {prompt}\")\n",
111
+ "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens, skip_special_tokens=True)}\")\n",
112
+ "print(\"=\" * 80)\n"
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": null,
118
+ "id": "8e7b275c",
119
+ "metadata": {
120
+ "id": "8e7b275c",
121
+ "outputId": "719e986e-61b4-4fd5-db15-4a9ef8f97396"
122
+ },
123
+ "outputs": [
124
+ {
125
+ "name": "stdout",
126
+ "output_type": "stream",
127
+ "text": [
128
+ "================================================================================\n",
129
+ "TEST 2: Chat Template\n",
130
+ "================================================================================\n",
131
+ "Formatted prompt: <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>\n",
132
+ "Input IDs: [65527, 65528, 1442, 309, 261, 3429, 281, 4215, 63, 65529, 65530]\n",
133
+ "\n",
134
+ "Generated: The capital of France is Paris.<|assistant_end|>\n",
135
+ "================================================================================\n"
136
+ ]
137
+ }
138
+ ],
139
+ "source": [
140
+ "print(\"=\" * 80)\n",
141
+ "print(\"TEST 2: Chat Template\")\n",
142
+ "print(\"=\" * 80)\n",
143
+ "conversation = [\n",
144
+ " {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
145
+ "]\n",
146
+ "\n",
147
+ "inputs = tokenizer.apply_chat_template(\n",
148
+ " conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors=\"pt\"\n",
149
+ ").to(device)\n",
150
+ "\n",
151
+ "print(f\"Formatted prompt: {tokenizer.decode(inputs['input_ids'][0])}\")\n",
152
+ "print(f\"Input IDs: {inputs['input_ids'][0].tolist()}\")\n",
153
+ "\n",
154
+ "with torch.no_grad():\n",
155
+ " outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)\n",
156
+ "\n",
157
+ "generated_tokens = outputs[0, inputs[\"input_ids\"].shape[1] :]\n",
158
+ "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens)}\")\n",
159
+ "print(\"=\" * 80)\n"
160
+ ]
161
+ }
162
+ ],
163
+ "metadata": {
164
+ "colab": {
165
+ "provenance": []
166
+ },
167
+ "kernelspec": {
168
+ "display_name": ".venv",
169
+ "language": "python",
170
+ "name": "python3"
171
+ },
172
+ "language_info": {
173
+ "codemirror_mode": {
174
+ "name": "ipython",
175
+ "version": 3
176
+ },
177
+ "file_extension": ".py",
178
+ "mimetype": "text/x-python",
179
+ "name": "python",
180
+ "nbconvert_exporter": "python",
181
+ "pygments_lexer": "ipython3",
182
+ "version": "3.10.18"
183
+ }
184
+ },
185
+ "nbformat": 4,
186
+ "nbformat_minor": 5
187
+ }
sft.ipynb CHANGED
@@ -59,48 +59,6 @@
59
  ").to(device)\n"
60
  ]
61
  },
62
- {
63
- "cell_type": "markdown",
64
- "id": "c9a9c0a4",
65
- "metadata": {
66
- "id": "c9a9c0a4"
67
- },
68
- "source": [
69
- "## Setup LoRA\n"
70
- ]
71
- },
72
- {
73
- "cell_type": "code",
74
- "execution_count": null,
75
- "id": "dd9a698a",
76
- "metadata": {
77
- "id": "dd9a698a",
78
- "outputId": "0aae9ecc-7af9-436e-a95b-a4cd023997fd"
79
- },
80
- "outputs": [
81
- {
82
- "name": "stdout",
83
- "output_type": "stream",
84
- "text": [
85
- "trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627\n"
86
- ]
87
- }
88
- ],
89
- "source": [
90
- "from peft import LoraConfig, get_peft_model\n",
91
- "\n",
92
- "lora_config = LoraConfig(\n",
93
- " r=1,\n",
94
- " lora_alpha=2,\n",
95
- " lora_dropout=0.00,\n",
96
- " task_type=\"CAUSAL_LM\",\n",
97
- " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"fc1\", \"fc2\"]\n",
98
- ")\n",
99
- "\n",
100
- "model = get_peft_model(model, lora_config)\n",
101
- "model.print_trainable_parameters()\n"
102
- ]
103
- },
104
  {
105
  "cell_type": "markdown",
106
  "id": "4810af1a",
@@ -206,365 +164,12 @@
206
  "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens)}\")\n",
207
  "print(\"=\" * 80)\n"
208
  ]
209
- },
210
- {
211
- "cell_type": "markdown",
212
- "id": "44cb321a",
213
- "metadata": {
214
- "id": "44cb321a"
215
- },
216
- "source": [
217
- "## Dataset\n"
218
- ]
219
- },
220
- {
221
- "cell_type": "code",
222
- "execution_count": null,
223
- "id": "e1a75c14",
224
- "metadata": {
225
- "id": "e1a75c14"
226
- },
227
- "outputs": [],
228
- "source": [
229
- "raw_dataset = load_dataset(\"HuggingFaceTB/smoltalk2\", \"SFT\", split=\"OpenThoughts3_1.2M_think\")\n",
230
- "splits = raw_dataset.train_test_split(test_size=0.1, seed=42)\n",
231
- "train_dataset = splits[\"train\"]\n",
232
- "eval_dataset = splits[\"test\"]\n"
233
- ]
234
- },
235
- {
236
- "cell_type": "markdown",
237
- "id": "8b29399d",
238
- "metadata": {
239
- "id": "8b29399d"
240
- },
241
- "source": [
242
- "### Process the Dataset\n"
243
- ]
244
- },
245
- {
246
- "cell_type": "code",
247
- "execution_count": null,
248
- "id": "451542b4",
249
- "metadata": {
250
- "id": "451542b4",
251
- "outputId": "caa727dd-f9d8-4c67-d193-79bcc0836b49"
252
- },
253
- "outputs": [
254
- {
255
- "name": "stderr",
256
- "output_type": "stream",
257
- "text": [
258
- "Map: 0%| | 0/20000 [00:00<?, ? examples/s]"
259
- ]
260
- },
261
- {
262
- "name": "stderr",
263
- "output_type": "stream",
264
- "text": [
265
- "Map: 100%|██████████| 20000/20000 [06:27<00:00, 51.68 examples/s]\n",
266
- "Map: 100%|██████████| 1000/1000 [00:19<00:00, 52.12 examples/s]\n"
267
- ]
268
- }
269
- ],
270
- "source": [
271
- "max_length = 2048\n",
272
- "max_train_examples = 20000\n",
273
- "max_eval_examples = 1000\n",
274
- "\n",
275
- "def format_example(example):\n",
276
- " formatted = tokenizer.apply_chat_template(\n",
277
- " example[\"messages\"],\n",
278
- " add_generation_prompt=False,\n",
279
- " truncation=True,\n",
280
- " max_length=max_length,\n",
281
- " padding=False,\n",
282
- " return_dict=True,\n",
283
- " return_tensors=\"pt\",\n",
284
- " )\n",
285
- " return {\n",
286
- " \"input_ids\": formatted[\"input_ids\"][0].tolist(),\n",
287
- " \"attention_mask\": formatted[\"attention_mask\"][0].tolist(),\n",
288
- " }\n",
289
- "\n",
290
- "\n",
291
- "if max_train_examples is not None:\n",
292
- " train_dataset = train_dataset.select(range(min(len(train_dataset), max_train_examples)))\n",
293
- " train_dataset = train_dataset.map(format_example, remove_columns=train_dataset.column_names)\n",
294
- "else:\n",
295
- " train_dataset = train_dataset.map(format_example, remove_columns=train_dataset.column_names)\n",
296
- "\n",
297
- "if max_eval_examples is not None:\n",
298
- " eval_dataset = eval_dataset.select(range(min(len(eval_dataset), max_eval_examples)))\n",
299
- " eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)\n",
300
- "else:\n",
301
- " eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)\n"
302
- ]
303
- },
304
- {
305
- "cell_type": "markdown",
306
- "id": "ecd33dd7",
307
- "metadata": {
308
- "id": "ecd33dd7"
309
- },
310
- "source": [
311
- "## Training Configuration"
312
- ]
313
- },
314
- {
315
- "cell_type": "code",
316
- "execution_count": null,
317
- "id": "f9d837ee",
318
- "metadata": {
319
- "id": "f9d837ee"
320
- },
321
- "outputs": [],
322
- "source": [
323
- "train_batch_size = 2\n",
324
- "eval_batch_size = 2\n",
325
- "num_epochs = 1\n",
326
- "gradient_accumulation_steps = 4\n",
327
- "learning_rate = 1e-5\n",
328
- "weight_decay = 0.0\n",
329
- "warmup_ratio = 0.03\n",
330
- "logging_frequency = 10"
331
- ]
332
- },
333
- {
334
- "cell_type": "markdown",
335
- "id": "1cf11e96",
336
- "metadata": {
337
- "id": "1cf11e96"
338
- },
339
- "source": [
340
- "## Create a `DataLoader` 👴"
341
- ]
342
- },
343
- {
344
- "cell_type": "code",
345
- "execution_count": null,
346
- "id": "1bc4fa24",
347
- "metadata": {
348
- "id": "1bc4fa24"
349
- },
350
- "outputs": [],
351
- "source": [
352
- "def collate_fn(batch):\n",
353
- " batch_dict = {\n",
354
- " \"input_ids\": [record[\"input_ids\"] for record in batch],\n",
355
- " \"attention_mask\": [record[\"attention_mask\"] for record in batch],\n",
356
- " }\n",
357
- " padded = tokenizer.pad(batch_dict, padding=True, return_tensors=\"pt\")\n",
358
- " labels = padded[\"input_ids\"].clone()\n",
359
- " labels[padded[\"attention_mask\"] == 0] = -100\n",
360
- " padded[\"labels\"] = labels\n",
361
- " return padded\n",
362
- "\n",
363
- "\n",
364
- "TrainLoader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, collate_fn=collate_fn)\n",
365
- "EvalLoader = DataLoader(eval_dataset, batch_size=eval_batch_size, shuffle=False, collate_fn=collate_fn)\n"
366
- ]
367
- },
368
- {
369
- "cell_type": "markdown",
370
- "id": "f5965d1b",
371
- "metadata": {
372
- "id": "f5965d1b"
373
- },
374
- "source": [
375
- "## Optimizer"
376
- ]
377
- },
378
- {
379
- "cell_type": "code",
380
- "execution_count": null,
381
- "id": "f57c7be2",
382
- "metadata": {
383
- "id": "f57c7be2"
384
- },
385
- "outputs": [],
386
- "source": [
387
- "optimizer = torch.optim.AdamW(\n",
388
- " model.parameters(),\n",
389
- " lr=learning_rate,\n",
390
- " weight_decay=weight_decay,\n",
391
- ")\n"
392
- ]
393
- },
394
- {
395
- "cell_type": "markdown",
396
- "id": "215f8782",
397
- "metadata": {
398
- "id": "215f8782"
399
- },
400
- "source": [
401
- "# Learning Rate Scheduler"
402
- ]
403
- },
404
- {
405
- "cell_type": "code",
406
- "execution_count": null,
407
- "id": "034e2903",
408
- "metadata": {
409
- "id": "034e2903"
410
- },
411
- "outputs": [],
412
- "source": [
413
- "num_update_steps_per_epoch = max(len(TrainLoader) // gradient_accumulation_steps, 1)\n",
414
- "max_train_steps = num_epochs * num_update_steps_per_epoch\n",
415
- "warmup_steps = max(1, int(max_train_steps * warmup_ratio))\n",
416
- "scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, max_train_steps)\n"
417
- ]
418
- },
419
- {
420
- "cell_type": "markdown",
421
- "id": "0f0090b6",
422
- "metadata": {
423
- "id": "0f0090b6"
424
- },
425
- "source": [
426
- "# The Training Loop"
427
- ]
428
- },
429
- {
430
- "cell_type": "code",
431
- "execution_count": null,
432
- "id": "1540e30a",
433
- "metadata": {
434
- "id": "1540e30a",
435
- "outputId": "747badd7-18df-441f-8026-7aa4f30c2fd7"
436
- },
437
- "outputs": [
438
- {
439
- "name": "stderr",
440
- "output_type": "stream",
441
- "text": [
442
- "You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
443
- ]
444
- },
445
- {
446
- "name": "stdout",
447
- "output_type": "stream",
448
- "text": [
449
- "Epoch 1/1\n",
450
- "step=00010 | loss=1.7586 | lr=1.33e-06\n",
451
- "step=00020 | loss=1.8188 | lr=2.67e-06\n",
452
- "step=00030 | loss=1.8235 | lr=4.00e-06\n",
453
- "step=00040 | loss=1.7935 | lr=5.33e-06\n",
454
- "step=00050 | loss=1.8029 | lr=6.67e-06\n",
455
- "step=00060 | loss=1.8433 | lr=8.00e-06\n",
456
- "step=00070 | loss=1.8616 | lr=9.33e-06\n",
457
- "step=00080 | loss=1.8238 | lr=9.98e-06\n",
458
- "step=00090 | loss=1.7774 | lr=9.94e-06\n",
459
- "step=00100 | loss=1.8081 | lr=9.90e-06\n",
460
- "step=00110 | loss=1.7437 | lr=9.86e-06\n",
461
- "step=00120 | loss=1.7830 | lr=9.81e-06\n",
462
- "step=00130 | loss=1.8064 | lr=9.77e-06\n",
463
- "step=00140 | loss=1.8541 | lr=9.73e-06\n",
464
- "step=00150 | loss=1.8301 | lr=9.69e-06\n",
465
- "step=00160 | loss=1.7725 | lr=9.65e-06\n",
466
- "step=00170 | loss=1.7635 | lr=9.61e-06\n",
467
- "step=00180 | loss=1.7963 | lr=9.57e-06\n",
468
- "step=00190 | loss=1.7563 | lr=9.53e-06\n",
469
- "step=00200 | loss=1.6950 | lr=9.48e-06\n",
470
- "step=00210 | loss=1.7680 | lr=9.44e-06\n",
471
- "step=00220 | loss=1.8906 | lr=9.40e-06\n",
472
- "step=00230 | loss=1.7120 | lr=9.36e-06\n",
473
- "step=00240 | loss=1.8390 | lr=9.32e-06\n",
474
- "step=00250 | loss=1.7180 | lr=9.28e-06\n",
475
- "step=00260 | loss=1.7709 | lr=9.24e-06\n",
476
- "step=00270 | loss=1.7598 | lr=9.20e-06\n",
477
- "step=00280 | loss=1.7981 | lr=9.15e-06\n",
478
- "step=00290 | loss=1.7540 | lr=9.11e-06\n",
479
- "step=00300 | loss=1.7695 | lr=9.07e-06\n",
480
- "step=00310 | loss=1.7468 | lr=9.03e-06\n"
481
- ]
482
- },
483
- {
484
- "ename": "KeyboardInterrupt",
485
- "evalue": "",
486
- "output_type": "error",
487
- "traceback": [
488
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
489
- "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
490
- "Cell \u001b[0;32mIn[14], line 11\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m step, batch \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28menumerate\u001b[39m(TrainLoader, start\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m):\n\u001b[1;32m 10\u001b[0m batch \u001b[38;5;241m=\u001b[39m {key: value\u001b[38;5;241m.\u001b[39mto(device) \u001b[38;5;28;01mfor\u001b[39;00m key, value \u001b[38;5;129;01min\u001b[39;00m batch\u001b[38;5;241m.\u001b[39mitems()}\n\u001b[0;32m---> 11\u001b[0m outputs \u001b[38;5;241m=\u001b[39m \u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mbatch\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 12\u001b[0m loss \u001b[38;5;241m=\u001b[39m outputs\u001b[38;5;241m.\u001b[39mloss \u001b[38;5;241m/\u001b[39m gradient_accumulation_steps\n\u001b[1;32m 13\u001b[0m loss\u001b[38;5;241m.\u001b[39mbackward()\n",
491
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1771\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1772\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1773\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
492
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1779\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1780\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1782\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1783\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1784\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1786\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1787\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
493
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/peft/peft_model.py:1850\u001b[0m, in \u001b[0;36mPeftModelForCausalLM.forward\u001b[0;34m(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)\u001b[0m\n\u001b[1;32m 1848\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_enable_peft_forward_hooks(\u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[1;32m 1849\u001b[0m kwargs \u001b[38;5;241m=\u001b[39m {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs\u001b[38;5;241m.\u001b[39mitems() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mspecial_peft_forward_args}\n\u001b[0;32m-> 1850\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbase_model\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1851\u001b[0m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1852\u001b[0m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1853\u001b[0m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1854\u001b[0m \u001b[43m \u001b[49m\u001b[43mlabels\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlabels\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1855\u001b[0m \u001b[43m \u001b[49m\u001b[43moutput_attentions\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutput_attentions\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1856\u001b[0m \u001b[43m \u001b[49m\u001b[43moutput_hidden_states\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutput_hidden_states\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1857\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_dict\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1858\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1859\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1861\u001b[0m batch_size \u001b[38;5;241m=\u001b[39m _get_batch_size(input_ids, inputs_embeds)\n\u001b[1;32m 1862\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m attention_mask \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1863\u001b[0m \u001b[38;5;66;03m# concat prompt attention mask\u001b[39;00m\n",
494
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1771\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1772\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1773\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
495
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1779\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1780\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1782\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1783\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1784\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1786\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1787\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
496
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:222\u001b[0m, in \u001b[0;36mBaseTuner.forward\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 221\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, \u001b[38;5;241m*\u001b[39margs: Any, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Any):\n\u001b[0;32m--> 222\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mforward\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
497
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/utils/generic.py:757\u001b[0m, in \u001b[0;36mcan_return_tuple.<locals>.wrapper\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 755\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 756\u001b[0m return_dict \u001b[38;5;241m=\u001b[39m return_dict_passed\n\u001b[0;32m--> 757\u001b[0m output \u001b[38;5;241m=\u001b[39m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 758\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[1;32m 759\u001b[0m output \u001b[38;5;241m=\u001b[39m output\u001b[38;5;241m.\u001b[39mto_tuple()\n",
498
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:474\u001b[0m, in \u001b[0;36mNanoChatForCausalLM.forward\u001b[0;34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)\u001b[0m\n\u001b[1;32m 435\u001b[0m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[1;32m 436\u001b[0m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[1;32m 437\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Unpack[TransformersKwargs],\n\u001b[1;32m 449\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m CausalLMOutputWithPast:\n\u001b[1;32m 450\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124mr\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 451\u001b[0m \u001b[38;5;124;03m Example:\u001b[39;00m\n\u001b[1;32m 452\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 472\u001b[0m \u001b[38;5;124;03m >>> output = tokenizer.decode(generated_tokens, skip_special_tokens=True)\u001b[39;00m\n\u001b[1;32m 473\u001b[0m \u001b[38;5;124;03m ```\"\"\"\u001b[39;00m\n\u001b[0;32m--> 474\u001b[0m outputs: BaseModelOutputWithPast \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 475\u001b[0m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 476\u001b[0m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 477\u001b[0m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 478\u001b[0m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 479\u001b[0m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 480\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 481\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 482\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 483\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 485\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m outputs\u001b[38;5;241m.\u001b[39mlast_hidden_state\n\u001b[1;32m 486\u001b[0m slice_indices \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mslice\u001b[39m(\u001b[38;5;241m-\u001b[39mlogits_to_keep, \u001b[38;5;28;01mNone\u001b[39;00m) \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(logits_to_keep, \u001b[38;5;28mint\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m logits_to_keep\n",
499
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1771\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1772\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1773\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
500
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1779\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1780\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1782\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1783\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1784\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1786\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1787\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
501
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/utils/generic.py:927\u001b[0m, in \u001b[0;36mcheck_model_inputs.<locals>.wrapped_fn.<locals>.wrapper\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 924\u001b[0m monkey_patched_layers\u001b[38;5;241m.\u001b[39mappend((module, original_forward))\n\u001b[1;32m 926\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 927\u001b[0m outputs \u001b[38;5;241m=\u001b[39m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 928\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[1;32m 929\u001b[0m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[1;32m 930\u001b[0m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[1;32m 931\u001b[0m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[1;32m 932\u001b[0m kwargs_without_recordable \u001b[38;5;241m=\u001b[39m {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs\u001b[38;5;241m.\u001b[39mitems() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
502
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:401\u001b[0m, in \u001b[0;36mNanoChatModel.forward\u001b[0;34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, cache_position, **kwargs)\u001b[0m\n\u001b[1;32m 398\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39minitial_norm(hidden_states)\n\u001b[1;32m 400\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m decoder_layer \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlayers:\n\u001b[0;32m--> 401\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m \u001b[43mdecoder_layer\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 402\u001b[0m \u001b[43m \u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 403\u001b[0m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcausal_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 404\u001b[0m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 405\u001b[0m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 406\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 407\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 408\u001b[0m \u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mposition_embeddings\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 409\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 410\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 412\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnorm(hidden_states)\n\u001b[1;32m 414\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m BaseModelOutputWithPast(\n\u001b[1;32m 415\u001b[0m last_hidden_state\u001b[38;5;241m=\u001b[39mhidden_states,\n\u001b[1;32m 416\u001b[0m past_key_values\u001b[38;5;241m=\u001b[39mpast_key_values \u001b[38;5;28;01mif\u001b[39;00m use_cache \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 417\u001b[0m )\n",
503
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/modeling_layers.py:94\u001b[0m, in \u001b[0;36mGradientCheckpointingLayer.__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 91\u001b[0m logger\u001b[38;5;241m.\u001b[39mwarning_once(message)\n\u001b[1;32m 93\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_gradient_checkpointing_func(partial(\u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__call__\u001b[39m, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs), \u001b[38;5;241m*\u001b[39margs)\n\u001b[0;32m---> 94\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[38;5;21;43m__call__\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
504
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1771\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1772\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1773\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
505
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1779\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1780\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1782\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1783\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1784\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1786\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1787\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
506
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:279\u001b[0m, in \u001b[0;36mNanoChatDecoderLayer.forward\u001b[0;34m(self, hidden_states, attention_mask, position_ids, past_key_values, use_cache, cache_position, position_embeddings, **kwargs)\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\n\u001b[1;32m 268\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 269\u001b[0m hidden_states: torch\u001b[38;5;241m.\u001b[39mTensor,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 276\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: Unpack[TransformersKwargs],\n\u001b[1;32m 277\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m torch\u001b[38;5;241m.\u001b[39mTensor:\n\u001b[1;32m 278\u001b[0m residual \u001b[38;5;241m=\u001b[39m hidden_states\n\u001b[0;32m--> 279\u001b[0m hidden_states \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minput_layernorm\u001b[49m\u001b[43m(\u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 280\u001b[0m \u001b[38;5;66;03m# Self Attention\u001b[39;00m\n\u001b[1;32m 281\u001b[0m hidden_states, _ \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mself_attn(\n\u001b[1;32m 282\u001b[0m hidden_states\u001b[38;5;241m=\u001b[39mhidden_states,\n\u001b[1;32m 283\u001b[0m attention_mask\u001b[38;5;241m=\u001b[39mattention_mask,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 289\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[1;32m 290\u001b[0m )\n",
507
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773\u001b[0m, in \u001b[0;36mModule._wrapped_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1771\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_compiled_call_impl(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[1;32m 1772\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1773\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
508
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1779\u001b[0m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1780\u001b[0m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1782\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1783\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1784\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1786\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1787\u001b[0m called_always_called_hooks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m()\n",
509
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:53\u001b[0m, in \u001b[0;36mNanoChatRMSNorm.forward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 52\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, x):\n\u001b[0;32m---> 53\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_norm\u001b[49m\u001b[43m(\u001b[49m\u001b[43mx\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfloat\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39mtype_as(x)\n",
510
- "File \u001b[0;32m/fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:50\u001b[0m, in \u001b[0;36mNanoChatRMSNorm._norm\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 49\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m_norm\u001b[39m(\u001b[38;5;28mself\u001b[39m, x):\n\u001b[0;32m---> 50\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m x \u001b[38;5;241m*\u001b[39m torch\u001b[38;5;241m.\u001b[39mrsqrt(\u001b[43mx\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpow\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmean\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m-\u001b[39;49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkeepdim\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m \u001b[38;5;241m+\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39meps)\n",
511
- "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
512
- ]
513
- }
514
- ],
515
- "source": [
516
- "\n",
517
- "model.train()\n",
518
- "global_step = 0\n",
519
- "running_loss = 0.0\n",
520
- "running_steps = 0\n",
521
- "\n",
522
- "for epoch in range(num_epochs):\n",
523
- " print(f\"Epoch {epoch + 1}/{num_epochs}\")\n",
524
- " optimizer.zero_grad(set_to_none=True)\n",
525
- " for step, batch in enumerate(TrainLoader, start=1):\n",
526
- " batch = {key: value.to(device) for key, value in batch.items()}\n",
527
- " outputs = model(**batch)\n",
528
- " loss = outputs.loss / gradient_accumulation_steps\n",
529
- " loss.backward()\n",
530
- "\n",
531
- " running_loss += outputs.loss.float().item()\n",
532
- " running_steps += 1\n",
533
- "\n",
534
- " if step % gradient_accumulation_steps == 0 or step == len(TrainLoader):\n",
535
- " torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
536
- " optimizer.step()\n",
537
- " scheduler.step()\n",
538
- " optimizer.zero_grad(set_to_none=True)\n",
539
- " global_step += 1\n",
540
- "\n",
541
- " if global_step % logging_frequency == 0:\n",
542
- " current_lr = scheduler.get_last_lr()[0]\n",
543
- " mean_loss = running_loss / running_steps\n",
544
- " print(f\"step={global_step:05d} | loss={mean_loss:.4f} | lr={current_lr:.2e}\")\n",
545
- " running_loss = 0.0\n",
546
- " running_steps = 0\n",
547
- "\n",
548
- " train_loss = running_loss / running_steps if running_steps > 0 else float(\"nan\")\n",
549
- " print(f\"Training loss after epoch {epoch + 1}: {train_loss:.4f}\")\n",
550
- "\n",
551
- " model.eval()\n",
552
- " losses = []\n",
553
- " with torch.no_grad():\n",
554
- " for _, batch in enumerate(EvalLoader, start=1):\n",
555
- " batch = {key: value.to(device) for key, value in batch.items()}\n",
556
- " loss = model(**batch).loss\n",
557
- " losses.append(loss.float().item())\n",
558
- " model.train()\n",
559
- " val_loss = sum(losses) / len(losses) if losses else float(\"nan\")\n",
560
- "\n",
561
- " print(f\"Validation loss after epoch {epoch + 1}: {val_loss:.4f}\")\n",
562
- "\n",
563
- "print(\"Training complete.\")\n"
564
- ]
565
  }
566
  ],
567
  "metadata": {
568
  "kernelspec": {
569
  "display_name": ".venv",
570
  "language": "python",
@@ -581,11 +186,8 @@
581
  "nbconvert_exporter": "python",
582
  "pygments_lexer": "ipython3",
583
  "version": "3.10.18"
584
- },
585
- "colab": {
586
- "provenance": []
587
  }
588
  },
589
  "nbformat": 4,
590
  "nbformat_minor": 5
591
- }
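
The escaped JSON above is hard to read, so here is the removed training cell reconstructed as plain Python and wrapped in a function. The loop body mirrors the cell: gradient accumulation with loss scaling, gradient clipping, an LR scheduler stepped per optimizer step, periodic logging, and a per-epoch validation pass. Only the function signature, the lowercase argument names, and the default hyperparameter values are mine; the notebook binds `TrainLoader`, `EvalLoader`, `optimizer`, `scheduler`, and the hyperparameters in earlier cells.

```python
import torch


def train(model, train_loader, eval_loader, optimizer, scheduler, device,
          num_epochs=1, gradient_accumulation_steps=8, logging_frequency=10):
    """Gradient-accumulation fine-tuning loop, reconstructed from the removed
    notebook cell above. Default hyperparameter values are illustrative."""
    model.train()
    global_step = 0
    running_loss, running_steps = 0.0, 0

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        optimizer.zero_grad(set_to_none=True)
        for step, batch in enumerate(train_loader, start=1):
            batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**batch)
            # Scale the loss so accumulated gradients average over the window.
            (outputs.loss / gradient_accumulation_steps).backward()

            running_loss += outputs.loss.float().item()
            running_steps += 1

            # Step once per accumulation window (and on the final partial window).
            if step % gradient_accumulation_steps == 0 or step == len(train_loader):
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad(set_to_none=True)
                global_step += 1

                if global_step % logging_frequency == 0:
                    current_lr = scheduler.get_last_lr()[0]
                    mean_loss = running_loss / running_steps
                    print(f"step={global_step:05d} | loss={mean_loss:.4f} | lr={current_lr:.2e}")
                    running_loss, running_steps = 0.0, 0

        # Per-epoch validation pass with gradients disabled.
        model.eval()
        losses = []
        with torch.no_grad():
            for batch in eval_loader:
                batch = {key: value.to(device) for key, value in batch.items()}
                losses.append(model(**batch).loss.float().item())
        model.train()
        val_loss = sum(losses) / len(losses) if losses else float("nan")
        print(f"Validation loss after epoch {epoch + 1}: {val_loss:.4f}")

    print("Training complete.")
```
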
 
59
  ").to(device)\n"
60
  ]
61
  },
62
  {
63
  "cell_type": "markdown",
64
  "id": "4810af1a",
 
164
  "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens)}\")\n",
165
  "print(\"=\" * 80)\n"
166
  ]
167
  }
168
  ],
169
  "metadata": {
170
+ "colab": {
171
+ "provenance": []
172
+ },
173
  "kernelspec": {
174
  "display_name": ".venv",
175
  "language": "python",
 
186
  "nbconvert_exporter": "python",
187
  "pygments_lexer": "ipython3",
188
  "version": "3.10.18"
189
  }
190
  },
191
  "nbformat": 4,
192
  "nbformat_minor": 5
193
+ }
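
The second notebook's generation cell ends with the `tokenizer.decode(generated_tokens)` print visible in the diff above. As a rough, self-contained sketch of what such a cell does: the checkpoint id is a placeholder you would replace with the actual repo id, and the prompt and `max_new_tokens` are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/nanochat-checkpoint"  # placeholder: substitute the real repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Tokenize a prompt and move the input ids to the model's device.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(device)

# Greedy decoding; max_new_tokens is an illustrative budget.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Strip the prompt and decode only the newly generated tokens,
# mirroring the print statements in the notebook cell.
generated_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(f"\nGenerated: {tokenizer.decode(generated_tokens)}")
print("=" * 80)
```
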