Data Pipeline

This section ties together the previous chapters: the pretraining data has already been split into Parquet shards, the tokenizer can already turn text into token ids, and the model has already been defined. What is still missing is the piece in between: the data pipeline that turns shards into batches the model can consume.

In cookllm-bento, the process of generating a pretraining batch can be summarized as:

Pretrain Data Pipeline

How a Parquet row becomes a training batch for next-token prediction.

Stage	What happens	Produces
Parquet shard	Read the text column from row groups	list[str]
Batch tokenization	Prepend BOS, encode texts in parallel	token ids
Truncate to max_length	Keep the first 512 tokens by default	sample
input_ids / labels	Same tokens; shifting happens inside the model	tensors
Batch padding	Pad input_ids, mask labels with -100	attention_mask
PretrainModule.training_step()	Forward the batch and log train/loss	loss
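The sample-building and padding stages can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors so it runs without extra dependencies. The function names, the `bos_id` and `pad_id` values, right-side padding, and the choice to count BOS toward `max_length` are all assumptions made for illustration, not necessarily how cookllm-bento implements them.

```python
IGNORE_INDEX = -100  # label value the loss skips, as named in the pipeline


def build_sample(token_ids, bos_id, max_length=512):
    """Prepend BOS, keep the first max_length tokens, copy ids into labels."""
    ids = ([bos_id] + list(token_ids))[:max_length]
    return {"input_ids": ids, "labels": list(ids)}


def collate(samples, pad_id):
    """Right-pad every sample to the longest one in the batch (assumed side)."""
    width = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for s in samples:
        n = len(s["input_ids"])
        pad = width - n
        batch["input_ids"].append(s["input_ids"] + [pad_id] * pad)
        batch["labels"].append(s["labels"] + [IGNORE_INDEX] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
    return batch


def shifted_pairs(input_ids, labels):
    """Model-side shift: the token at position t is paired with label t + 1."""
    return [
        (x, y)
        for x, y in zip(input_ids[:-1], labels[1:])
        if y != IGNORE_INDEX
    ]


# Hypothetical ids: BOS = 1, PAD = 0.
samples = [
    build_sample([5, 6, 7], bos_id=1, max_length=3),
    build_sample([8], bos_id=1),
]
batch = collate(samples, pad_id=0)
print(batch["input_ids"])       # [[1, 5, 6], [1, 8, 0]]
print(batch["labels"])          # [[1, 5, 6], [1, 8, -100]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Because `input_ids` and `labels` hold the same tokens, the dataset never has to build shifted copies; `shifted_pairs` only shows which (input, target) pairs the loss effectively sees once the model shifts internally and skips `-100`.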

This pipeline is handled by two files:

Level	File	Role
DataModule	`src/datamodule/pretrain_datamodule.py`	Resolve data paths, load the tokenizer, split train/val, create the DataLoader
Dataset	`src/dataset/pretrain.py`	Stream Parquet rows, tokenize them in batches, yield one training sample at a time
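The Dataset's role (stream texts, tokenize them in batches, yield one sample at a time) can be sketched as a generator. This is only an outline under stated assumptions: `texts` stands in for the text column read from Parquet row groups, `encode_batch` is a hypothetical callable standing in for the real tokenizer, and `iter_samples`, `batched`, and `toy_encode_batch` are illustrative names, not identifiers from `src/dataset/pretrain.py`.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an iterable into lists of at most `size` items, lazily."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def iter_samples(
    texts: Iterable[str],
    encode_batch: Callable[[List[str]], List[List[int]]],
    bos_id: int,
    max_length: int = 512,
    tokenizer_batch_size: int = 128,
) -> Iterator[Dict[str, List[int]]]:
    """Tokenize texts chunk by chunk, then yield one truncated sample each."""
    for chunk in batched(texts, tokenizer_batch_size):
        for ids in encode_batch(chunk):
            ids = ([bos_id] + ids)[:max_length]
            yield {"input_ids": ids, "labels": list(ids)}


def toy_encode_batch(chunk: List[str]) -> List[List[int]]:
    """Toy stand-in for a tokenizer: one id per character."""
    return [[ord(c) for c in text] for text in chunk]


for sample in iter_samples(["ab", "c", "def"], toy_encode_batch, bos_id=1,
                           max_length=3, tokenizer_batch_size=2):
    print(sample["input_ids"])
```

Sending texts to the tokenizer in chunks is what lets it encode them in parallel, while the generator keeps memory use bounded by one chunk rather than one whole shard.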

Configuration Entrypoint

The core fields of the data config used by the pretraining example are as follows:

configs/data/data_pretrain_sample.yaml

data:
  data_path: pretrain_data/fineweb_shards
  tokenizer_path: tokens
  max_length: 512
  batch_size: 96
  num_workers: 4
  tokenizer_batch_size: 128
  tokenizer_num_threads: 8
  val_files: 10

Config	Meaning
`data_path`	Directory containing the pretraining Parquet shards
`tokenizer_path`	Directory containing the tokenizer files; defaults to `tokens/` under the project root
`max_length`	Maximum number of tokens kept per sample
`batch_size`	Number of samples the DataLoader returns per batch
`num_workers`	Number of DataLoader worker processes
`tokenizer_batch_size`	Number of texts sent to the tokenizer in one call
`tokenizer_num_threads`	Number of threads the tokenizer uses internally for parallel encoding
`val_files`	Number of Parquet files randomly held out as the validation set
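To make `val_files` concrete, here is a minimal sketch of holding out a fixed number of shard files at random. The function name `split_shards`, the `seed` parameter, and the shard file names are hypothetical; the source only states that the files are chosen randomly, so the seeding shown here is an assumption added for reproducibility.

```python
import random


def split_shards(files, val_files, seed=42):
    """Randomly hold out `val_files` shards for validation; the rest train."""
    files = sorted(files)  # fixed order so the same seed gives the same split
    if not 0 <= val_files < len(files):
        raise ValueError("val_files must leave at least one training shard")
    rng = random.Random(seed)
    val = set(rng.sample(files, val_files))
    train = [f for f in files if f not in val]
    return train, sorted(val)


# Hypothetical shard names, for illustration only.
shards = [f"shard_{i:03d}.parquet" for i in range(50)]
train, val = split_shards(shards, val_files=10)
print(len(train), len(val))  # 40 10
```

Splitting at the file level keeps validation rows out of the training stream without reading any row contents. As a sizing note, with the sample config one DataLoader batch holds at most 96 × 512 = 49,152 token positions, padding included.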


