The Tale of the Undead Logger

A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.
Fear the Undead Logger all ye who enter here.
For he may strike, and drain the life out of nodes that you hold dear.
Among the smouldering embers of jobs you thought long dead,
he lingers on, to terrorise, and cause you frightful dread.
But hark ye all my tale to save you from much pain,
and fight ye not anew the battles I have fought in vain.

Or simply…

… Tips and Tricks to Use When the wandb Logger Just. Won’t. DIE.

The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic licence) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments and to monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:

A screenshot of the wandb logger dashboard homepage in dark mode, showing several colourful line graphs tracking training metrics across different model runs.
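And “integrating painlessly” really is a couple of lines. Here’s a minimal sketch; the project and run names are made up, and the usual wandb login is assumed:

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Hypothetical project/run names; swap in your own.
wandb_logger = WandbLogger(project="my-project", name="baseline-run")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=10)
# trainer.fit(model, datamodule=dm)  # your LightningModule and data go here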

The logger has been variously introduced on this very blog, so I won’t belabour the point. Its documentation is excellent: read it, then use it.

Excellent, that is, until you encounter the following (purely hypothetical!) scenario: You scancel a job, or it runs OOM or OOT, and SLURM promptly informs you in no uncertain terms that your favourite node has just gone into drain (for the third time this week), swiftly followed by IT, who provide damning evidence that it is your jobs leaving undead, un-killable wandb logger processes behind, meaning it is, in fact, your fault.
Scarcely has this news had time to earn you the undying adoration of your colleagues (😅😬), when IT follow it up with statements like “In normal operation nvidia-smi simply should not hang” (read: how the !?#!! did you !?#!! up low-level processes this badly?!), and “FYI this morning the server was so hung ssh no longer responded and I had to force a reboot”. In this entirely hypothetical (!!) scenario, you may find that the documentation offers you precious little information on what specific crime you ought to charge yourself with.

If you do find yourself facing something akin to this ENTIRELY HYPOTHETICAL (!!!) scenario, consider the following steps: Disable multi-process logging (sync_dist=False, rank_zero_only=True) to simplify debugging, then package up your programme logic in a my_main() function, and use the following idiom in your script (a sketch of the logging call itself follows just after):

import wandb

if __name__ == "__main__":
    try:
        my_main()
    finally:
        # wandb.finish() should be called even upon SIGTERM,
        # and ensures the run shuts down (and syncs) correctly.
        wandb.finish()
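As for the first of those steps, here’s a minimal sketch of what disabling multi-process logging can look like inside a LightningModule. MyModel, its layer sizes and the metric name are all hypothetical; the only part that matters is the self.log(...) call:

import pytorch_lightning as pl
import torch
import torch.nn.functional as F


class MyModel(pl.LightningModule):
    """Hypothetical module; the self.log(...) call is the only point here."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        # No cross-GPU reduction, and only rank zero talks to the logger,
        # so no logging call sits waiting on other ranks mid-shutdown.
        self.log("train_loss", loss, sync_dist=False, rank_zero_only=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)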

The thing to know here is that when SLURM sends your job a SIGTERM for whatever reason (rather than your programme logic terminating the job from within Python), any running wandb processes sometimes refuse to die.
Either they are desperately trying to sync any last logs to the server, eventually exhausting SLURM’s patience, or they may actually be caught in a deadlock (if you’re running on multiple GPUs and have distributed logging calls).
If you’re lucky, you can even manage to make it impossible for the GPU resources you hold to be released back to the system, meaning not even nvidia-smi can see the GPUs any more.
In any case, Python does not exit, SLURM is faced with an un-killable job, panics, and sends the node into drain.
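Whether that finally block actually runs on SIGTERM depends on something in your stack (Lightning, wandb, or your own code) turning the signal into a Python exception: left to itself, Python terminates on SIGTERM without running finally blocks at all. If you want to be explicit about it, one option is to install a small handler yourself. This is a sketch under that assumption, reusing the my_main() structure from above; Lightning’s SLURM integration may register signal handlers of its own, so treat it as something to test rather than a guaranteed fix.

import signal
import sys

import wandb


def _handle_sigterm(signum, frame):
    # Turn SLURM's SIGTERM into a normal Python exit (SystemExit), so the
    # try/finally wrapping my_main() still gets to call wandb.finish().
    sys.exit(128 + signum)


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _handle_sigterm)
    try:
        my_main()
    finally:
        wandb.finish()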

If this is you, see if the above helps: make the changes, scancel mid-job, and check whether you still crash the node. If you don’t, try re-enabling multi-process logging and see if the fix still holds (I’m not entirely sure I trust that wandb is thread-safe, so proceed with caution). If you do still crash the node … happy debugging! May the odds be ever in your favour!
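And to be concrete about “re-enable multi-process logging”: continuing the hypothetical MyModel sketch from above, it just means turning the distributed reduction back on, with the caution about thread safety very much still applying.

import torch.nn.functional as F


class MyModelSyncedLogging(MyModel):
    """Extends the hypothetical MyModel above, turning distributed logging back on."""

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        # sync_dist=True reduces the metric across GPUs again; if the node
        # survives another scancel with this in place, the fix is holding.
        self.log("val_loss", loss, sync_dist=True)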
