Determining protease transformer steps (iterations):
To calculate the number of steps (iterations) needed to run for 4 epochs, we use the following formula:
steps = (total_tokens_in_dataset * num_epochs) / tokens_per_fwdbwd
Given:
total_tokens_in_dataset = 589222838
num_epochs = 4
tokens_per_fwdbwd = 262144 (the default setting in modded-nanogpt)
Plugging in the values:
steps = (589222838 * 4) / 262144 = 2356891352 / 262144 ≈ 8990.8
So, we should run approximately 8991 steps (iterations) to complete 4 epochs. We rounded this up and chose to run 9k steps.
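
The arithmetic above can be checked with a short Python sketch. The variable names come from the values listed in this note; the helper `steps_for_epochs` is a hypothetical name introduced only for illustration and is not part of modded-nanogpt. It assumes every step consumes exactly `tokens_per_fwdbwd` tokens.

```python
def steps_for_epochs(total_tokens: int, num_epochs: int, tokens_per_step: int) -> int:
    """Smallest whole number of steps that covers num_epochs passes over the data.

    Hypothetical helper: assumes each step consumes exactly tokens_per_step tokens.
    """
    needed_tokens = total_tokens * num_epochs
    # Integer ceiling division, so no floating-point rounding is involved.
    return (needed_tokens + tokens_per_step - 1) // tokens_per_step


# Values taken from the note above.
total_tokens_in_dataset = 589_222_838
num_epochs = 4
tokens_per_fwdbwd = 262_144

steps = steps_for_epochs(total_tokens_in_dataset, num_epochs, tokens_per_fwdbwd)

# Epochs actually covered by the rounded-up choice of 9000 steps.
chosen_steps = 9000
epochs_at_chosen_steps = chosen_steps * tokens_per_fwdbwd / total_tokens_in_dataset

print(f"steps needed for {num_epochs} epochs: {steps}")
print(f"epochs covered by {chosen_steps} steps: {epochs_at_chosen_steps:.4f}")
```

Under these assumptions, 9000 steps covers slightly more than 4 epochs, so rounding up to 9k does not meaningfully change the training budget.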