Or simply…
… Tips and Tricks to Use When the wandb Logger Just. Won’t. DIE.
The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments and to monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:
The logger has been variously introduced on this very blog, so I won’t belabour the point. Its documentation is excellent: read it, then use it.
Excellent, that is, until you encounter the following (purely hypothetical!) scenario: You `scancel` a job, or it runs OOM or OOT, and SLURM promptly informs you in no uncertain terms that your favourite node has just gone into drain (for the third time this week), swiftly followed by IT, who provide damning evidence that it is your jobs leaving undead, un-killable `wandb` logger processes behind, meaning it is, in fact, your fault.
Scarcely has this news had time to earn you the undying adoration of your colleagues (😅😬), when IT follow it up with statements like “In normal operation `nvidia-smi` simply should not hang” (read: how the !?#!! did you !?#!! up low-level processes this badly?!), and “FYI this morning the server was so hung `ssh` no longer responded and I had to force a reboot”. In this entirely hypothetical (!!) scenario, you may find that the documentation offers you precious little information on what specific crime you ought to charge yourself with.
If you do find yourself facing something akin to this ENTIRELY HYPOTHETICAL (!!!) scenario, consider the following steps: Disable multi-process logging (`sync_dist=False`, `rank_zero_only=True`) to simplify debugging, then package up your programme logic in a `my_main()` function, and use the following idiom in your script:
```python
import wandb

if __name__ == "__main__":
    try:
        my_main()
    finally:
        # should be called even upon SIGTERM, to ensure a correct shutdown
        wandb.finish()
```
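As for the first half of that advice: the `sync_dist` and `rank_zero_only` flags come from PyTorch Lightning’s `self.log`, which is where I’m assuming you set them. A minimal sketch of what “disabling multi-process logging” might look like in practice (the module, layer, and metric names here are made up for illustration):

```python
import pytorch_lightning as pl
from torch import nn


class MyLitModule(pl.LightningModule):  # hypothetical module, purely for illustration
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).pow(2).mean()  # stand-in for your real loss
        # sync_dist=False: no cross-rank reduction of the logged value;
        # rank_zero_only=True: only rank zero talks to the logger, leaving a
        # single wandb process to reason about while you debug.
        self.log("train/loss", loss, sync_dist=False, rank_zero_only=True)
        return loss
```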
The thing to know here is that when SLURM sends your job a SIGTERM for whatever reason (rather than your programme logic terminating the job from within Python), any running `wandb` processes sometimes refuse to die. Either they are desperately trying to sync any last logs to the server, eventually exhausting SLURM’s patience, or they may actually be caught in a deadlock (if you’re running on multiple GPUs and have distributed logging calls). If you’re lucky, you can even manage to make it impossible to release the GPU resources you hold back to the system, meaning not even `nvidia-smi` can see the GPUs any more. In any case, Python does not exit, SLURM is faced with an un-killable job, panics, and sends the node into drain.
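One hedged suggestion while we’re on the subject of SIGTERM: a bare SIGTERM does not, by itself, run your `finally` blocks; something (Lightning, `wandb`, or you) has to turn the signal into a Python exception first. If you’d rather not rely on the frameworks for that, a sketch along the following lines, assuming nothing else in your stack needs to own the SIGTERM handler, makes the idiom above a little more robust:

```python
import signal
import sys


def _graceful_exit(signum, frame):
    # Raising SystemExit unwinds the stack, so the `finally` block in the
    # __main__ guard above gets its chance to call wandb.finish().
    sys.exit(128 + signum)


# Assumption: nothing else in your setup (e.g. Lightning's SLURM auto-requeue
# machinery) needs to own the SIGTERM handler. Register this before my_main().
signal.signal(signal.SIGTERM, _graceful_exit)
```

Lightning’s SLURM support does register signal handling of its own in some setups, so treat this as belt-and-braces rather than gospel, and check for clashes.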
If this is you, see if the above helps: make the changes above, `scancel` mid-job, and see if you still crash the node. If you don’t, try to re-enable multi-process logging and see if the fix still holds (I’m not entirely sure I trust that `wandb` is thread-safe, so proceed with caution). If you do still crash the node … happy debugging! May the odds be ever in your favour!