Monitoring worker-ng jobs

You will have noticed that when you have submitted a job using wsub, a directory is created with a name that starts with worker_, and ends in the job ID. Among other things, this directory contains files that allow you to monitor the progress of a running worker job, or analyze its performance once it is done.

Since a worker-ng job will typically run for several hours, it may be reassuring to monitor its progress. worker server keeps a log of its activity in the directory mentioned abovewhere the job was submitted. You can use the wsummarize command to get information, e.g., for job ID was 1234.

$ wsummarize  --dir=worker_1234/

This will give you an overview of the status of your work items, i.e., the number of

  • succesful items: the number of computations that finished with exit status 0;
  • failed items: the computations that finished with a non-zero exit status;
  • incomplete items: the number of items that are currnetly being executed.

To monitor progress "in real time", you can use the watch Linux command.

$ watch  -n 60  wsummarize  --dir=worker_1234/

This will summarize the status of the work items every 60 seconds. Note: use a reasonable value for the update period, this will cause load on the login node where you run this command.

The wsummarize command has various command line options to get a more detailed analysis of perfornmance issues. For instance, to get statistics on the walltime of your work items, you can use the --show_walltime_stats flag. This will give you descriptive statistics on the walltime of your work items such as the minimum and maximum, the average and median, as well as informaiton on the spread.

In order to detect problems with load balancing between the worker clients, you can use the --show_client_stats flag. This will provide you with the same descriptive statistics on the walltime, but grouped by client. In addition, you will get the total walltime for each client, a good measure for load balance.

Finally, the --show_all options will given the output of --show_walltime_stats and --show_client_stats in a single wsammarize invocation.