Training Loop

The previous section organized text samples into batches of input_ids, labels, and attention_mask. In this section, such a batch enters the actual training loop: model forward pass, loss computation, backpropagation, optimizer update, learning-rate scheduling, logging, and checkpoint saving.

The pretraining loop of cookllm-bento can first be viewed as the following pipeline:

Figure: Pretrain Training Loop. How configs become a running Lightning training job.

1. Shell script: compose trainer, model and data configs (produces the fit command)
2. LightningCLI: instantiate PretrainModule and PretrainDataModule (produces objects)
3. DataLoader batch: input_ids, labels and attention_mask (produces a batch)
4. training_step: forward BentoLM and return language modeling loss (produces the loss)
5. Optimizer step: AdamW update after gradient accumulation (produces updated weights)
6. Callbacks: log metrics, validate, sample text and save checkpoints (produces logs)
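
The "AdamW update after gradient accumulation" stage means that gradients from several micro-batches are combined before a single weight update. The toy sketch below illustrates only that accumulation idea in plain Python. It assumes a one-parameter model y = w * x with mean squared error, invented data, and a plain gradient-descent update standing in for AdamW; none of these names come from cookllm-bento.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches):
    """Accumulate per-micro-batch gradients, each scaled by 1 / k."""
    k = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scaling by 1 / k keeps the sum equal to the full-batch mean gradient
        # when all micro-batches have the same size.
        total += grad_mse(w, micro_batch) / k
    return total


# Invented (x, y) pairs, split into two equally sized micro-batches.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]

w = 0.5
full_batch_grad = grad_mse(w, data)
accum_grad = accumulated_grad(w, micro_batches)

# One update after accumulation; plain gradient descent stands in for AdamW.
learning_rate = 0.01
w_next = w - learning_rate * accum_grad
```

With equally sized micro-batches, the accumulated gradient matches the gradient of the full batch, which is why accumulation can emulate a larger batch on limited memory. In Lightning, the number of micro-batches per update is set with the Trainer argument accumulate_grad_batches.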

Training Entrypoint

The pretraining entrypoint file is very thin:

tasks/entrypoints/main_pretrain.py

def main():
    LightningCLI(PretrainModule, PretrainDataModule, save_config_callback=None)

It mainly does three things:

Adds the project root to sys.path so the src package can be imported normally.
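Sets torch.set_float32_matmul_precision("medium"), which lets float32 matrix multiplications use faster reduced-precision internals (TF32 on Ampere and later GPUs, or bfloat16 where a fast kernel is available) at a small cost in precision.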
Uses LightningCLI to assemble PretrainModule, PretrainDataModule, and the Lightning Trainer into a single training task.
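
The three steps above can be sketched as a slightly fuller entrypoint. This is a minimal sketch, not the project's actual file. It assumes that tasks/ sits directly under the project root, that Lightning 2.x is installed (so LightningCLI lives in lightning.pytorch.cli), and that the two classes are importable from a hypothetical src.pretrain module. The helper project_root and the entry_file parameter are also made up for illustration.

```python
import sys
from pathlib import Path


def project_root(entry_file: str, levels_up: int = 2) -> Path:
    """Return the directory `levels_up` levels above the entry file's folder."""
    return Path(entry_file).resolve().parents[levels_up]


def main(entry_file: str) -> None:
    # 1. Put the project root on sys.path so `src` can be imported.
    root = str(project_root(entry_file))
    if root not in sys.path:
        sys.path.insert(0, root)

    # Imports are deferred until after the path fix above.
    import torch
    from lightning.pytorch.cli import LightningCLI  # assumption: Lightning 2.x layout

    # Hypothetical import path; the real module names are not shown in the text.
    from src.pretrain import PretrainDataModule, PretrainModule

    # 2. Allow reduced-precision internals for float32 matrix multiplications.
    torch.set_float32_matmul_precision("medium")

    # 3. Let LightningCLI build the module, data module and Trainer from
    #    the configs and command-line arguments, then run the subcommand.
    LightningCLI(PretrainModule, PretrainDataModule, save_config_callback=None)
```

A real script would call main(__file__) under an if __name__ == "__main__": guard. For tasks/entrypoints/main_pretrain.py, levels_up=2 climbs from the entrypoints folder through tasks to the project root.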

There is no complex hand-written argparse code here. Training parameters come mainly from YAML configs and command-line overrides, which is also the main mechanism for organizing different experiments later on.
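
To make the layering of configs and overrides concrete, the sketch below models it with plain dictionaries: later config layers win over earlier ones, and a dotted command-line style override wins over all of them. It is a toy model under stated assumptions, not how LightningCLI is implemented. The helpers deep_merge and apply_override, the model.learning_rate key, and all values are hypothetical; max_steps and accumulate_grad_batches are real Trainer arguments used here only as sample keys.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the later layer simply replaces the earlier value.
            merged[key] = copy.deepcopy(value)
    return merged


def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Return a copy of `config` with one dotted key, e.g. 'a.b', set to `value`."""
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return result


# Stand-ins for the contents of three YAML files (values invented).
trainer_cfg = {"trainer": {"max_steps": 1000, "accumulate_grad_batches": 8}}
model_cfg = {"model": {"learning_rate": 3e-4}}
experiment_cfg = {"trainer": {"max_steps": 200}}

config = {}
for layer in (trainer_cfg, model_cfg, experiment_cfg):
    config = deep_merge(config, layer)

# Stand-in for a command-line override such as --model.learning_rate=1e-4.
config = apply_override(config, "model.learning_rate", 1e-4)
```

Keeping shared defaults in base files and putting only the differences in a small experiment file or on the command line makes each experiment easy to reproduce and compare.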
