Monitoring and Validation

Once pretraining is running, the thing to watch is whether the experiment is progressing as expected: is training loss being recorded, is validation loss being written out reliably, does the learning rate follow its schedule, is sampled text being generated, and are checkpoints being saved?

cookllm-bento configures TensorBoard by default, and also provides a SwanLab logger. For actual training I prefer SwanLab: it suits remote machines, long-running experiments, and comparing multiple experiments; TensorBoard is better for a quick local glance.

Where the Logs Come From

The curves you see during training are not computed by the logger itself; they are recorded explicitly by the training code. For example, PretrainModule has two key logging points:

src/trainer/pretrain.py

self.log("train/loss", loss, prog_bar=True, on_step=True, on_epoch=False)
self.log("val/loss", loss, prog_bar=True, on_step=False, on_epoch=True, sync_dist=True)

Lightning hands these metrics to the currently configured logger. The default logger is TensorBoard; if you additionally pass configs/swanlab.yaml, the logger switches to SwanLab. LearningRateMonitor is a callback; it writes the learning rate to the logger at every step.
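To make the difference between the two self.log calls concrete, here is a minimal pure-Python sketch of the hand-off. MetricRecorder is a hypothetical stand-in, not Lightning's implementation, and it assumes the epoch-level value is a plain mean over batches: an on_step metric is forwarded immediately, while an on_epoch metric is buffered and forwarded once when the epoch ends.

```python
from collections import defaultdict


class MetricRecorder:
    """Toy stand-in for the self.log -> logger hand-off (not Lightning's code)."""

    def __init__(self):
        self.logged = []  # (name, step, value) rows that reach the "logger"
        self._epoch_buffer = defaultdict(list)

    def log(self, name, value, step, on_step, on_epoch):
        if on_step:
            # Step-level metrics are forwarded as soon as they are logged.
            self.logged.append((name, step, value))
        if on_epoch:
            # Epoch-level metrics are held back until the epoch ends.
            self._epoch_buffer[name].append(value)

    def end_epoch(self, step):
        # Assumption: reduce buffered values with a plain mean.
        for name, values in self._epoch_buffer.items():
            self.logged.append((name, step, sum(values) / len(values)))
        self._epoch_buffer.clear()


recorder = MetricRecorder()

# Mirrors train/loss: on_step=True, on_epoch=False -> one point per step.
for step, loss in enumerate([4.0, 3.0, 2.0]):
    recorder.log("train/loss", loss, step, on_step=True, on_epoch=False)

# Mirrors val/loss: on_step=False, on_epoch=True -> one point per validation pass.
for loss in [2.5, 3.5]:
    recorder.log("val/loss", loss, step=2, on_step=False, on_epoch=True)
recorder.end_epoch(step=2)

for row in recorder.logged:
    print(row)
```

This is why train/loss shows up as a dense curve while val/loss has only one point per validation pass.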

ModelCheckpoint is different from the logger. It is not responsible for plotting curves, but reads the already-recorded val/loss, then decides which checkpoints should be kept based on monitor: val/loss, mode: min, and save_top_k: 3.
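The retention rule implied by monitor: val/loss, mode: min, and save_top_k: 3 can be sketched in a few lines of plain Python. This is a simplified illustration rather than ModelCheckpoint's actual code; update_top_k and the checkpoint file names are hypothetical.

```python
def update_top_k(kept, path, val_loss, k=3):
    """Return the checkpoints to keep after a new (path, val_loss) arrives."""
    candidates = kept + [(val_loss, path)]
    # mode: min -> the lowest monitored value ranks first.
    candidates.sort(key=lambda item: item[0])
    # save_top_k: k -> anything ranked below position k is dropped.
    return candidates[:k]


# Hypothetical validation results, in the order they were produced.
history = [
    ("step100.ckpt", 3.2),
    ("step200.ckpt", 2.9),
    ("step300.ckpt", 3.0),
    ("step400.ckpt", 2.7),
    ("step500.ckpt", 3.1),
]

kept = []
for path, val_loss in history:
    kept = update_top_k(kept, path, val_loss)

kept_paths = [path for _, path in kept]
print(kept_paths)
```

Note that the most recent checkpoint (step500.ckpt here) is not kept when its val/loss is worse than the three best seen so far.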
