Analyzing VelmaParalleCmdLine Failed Runs

Overview

When VelmaParallelCmdLine is invoked, one of the first things it does is create a top-level results directory.
The directory name starts with MULTI_, plus the name specified in the simulation configuration .xml file's run_index parameter.
(Example: if run_index is MyConfiguration, the top-level directory is named MULTI_MyConfiguration.)

After creating the top-level results directory, VelmaParallelCmdLine configures a runtime logger and creates a file in the top-level results directory named VelmaParalleRunlog.txt. This top-level log file contains start/end messages for each sub-reach that VelmaParallelCmdLine attempts to run, along with whether each sub-reach ran to completion or failed. In all but the most abrupt runtime failures the VelmaParallelRunlog.txt file will contain log messages documenting the runtime behavior of the overall parallel run.

Additionally if any of the sub-reaches started and run by VelmaParallelCmdLine fail, VelmaParallelCmdLine creates a subdirectory named .\FAILED_Results directly underneath the top-level results directory. If your top-level results directory contains a .\FAILED_Results directory, at least one of your sub-reaches failed to run correctly to completion.

You can get a quick idea of how many runs failed, by looking at the contents of the .\FAILED_Results directory: it will contain a FAILED_Results__log.txt file (where is the cell index of a specfic sub-reach outlet) for each sub-reach that failed to run. The contents of these "fail-log" files are the WARNING and SEVERE log messages from the specific sub-reach .\Results_\GlobalStateLog.txt runtime log file.

Rule of Thumb: Look in VelmaParallelRunlog.txt for overview/summary information, and look in the fail-log files for the error details of specific, failed reaches.

Heuristics for Analyzing VelmaParallelCmdLine Failures

Things to check when VelmaParallelCmdLine doesn't run as as expected.

Is the expected top-level directory missing?

If no top-level directory exists, the VelmaParallelCmdLine invocation failed very quickly and abruptly. In these cases, the fault may be a mis-typed command line parameter. Check your console output for information or clues.
Here is is the console output for a VelmaParallelCmdLine with a mis-type command line parameter (maxProcessors instead of maxProcesses):

SEVERE 2021-10-12 15:04:02 VelmaParallelCmdLine: FAIL! Unrecognized optional argument="--maxProcessors=2"
CONFIG 2021-10-12 15:04:02 VelmaParallelCmdLine: FINISHED Parallel Processing outcome=FAILED finalExitStatus=1

Did you re-run an existing, completed result?

When you repeatly run VelmaSimulatorCmdLine with the same simulation configuration .xml file, it runs the configuratione multiple times, and generates a new results directory for each run. Each directory (after the initial run) has a sequence number (_) added as a suffix, to keep track of and distiguish each run.
But: VelmaParallelCmdLine does not behave like VelmaSimulatorCmdLine in this regard.
VelmaParallelCmdLine allows "restarting" a prior, partially-failed run (to avoid re-doing parts that initially succeeded). Because of this, it assumes that you want to re-run in the exact same results directory as before. If your first run was completely successful, there is no further work to do, and VelmaParallelCmdLine will exit almost immediately.

Check your console: if you see that the final message has outcome=COMPLETED and finalExitStatus=0, then you have likely tried to re-run a simulation that already ran completely correctly.
Here is an example of the type of full, final log message that characterizes this situation:

CONFIG 2021-10-12 14:29:04 VelmaParallelCmdLine: FINISHED Parallel Processing outcome=COMPLETED finalExitStatus=0

Are there dependency paradoxes?

When VelmaParallelCmdLine is (re)started for a pre-existing, partially-completed simulation result, it checks whether any of the completed reaches depending "pending" reaches (i.e.) reaches that haven't been run yet. This is a relatively rare situation, but if it occurs it is not something VelmaParallelCmdLine can resolve. The SEVERE message:

FAIL! Cannot run parallel VELMA with completed reach dependency paradoxes.

is written to the VelmaParallelRunlog.txt file, and VelmaParallelCmdLine stops its run.
You will have to resolve the situation manually before re-re-starting the simulation run. One way to do this is to delete any completed Results_* sub-directories from the simulation's top-level results directory. (You can move the Results_* folders to some other location outside of the top-level results directory-tree instead of deleting them if you want to preserve their contents.) The VelmaParallelRunlog.txt file should contain a SEVERE message for each completed/paradox reach that has this form:

FAIL! Dependency paradox: completed reach outlet= ...

Look for these messages to determine which completed/paradox Results_* sub-directories to delete or move.

Does the top-level directory have a FAILED_Results sub-directory?

When the .\FAILED_Results sub-directory exists, at least one of the simulation sub-reaches failed to run correctly to completion. The fail-log files in the sub-directory indicate the number of sub-reaches that failed, and the reach outlet-cell location that is part of each fail-log file name indicates the identity of specific failed reaches.

What additional details exist?

The reach outlet cell-index location numbers are consistent in all reporting. If there is a .\FAILED_Results\FAILED_Results__log.txt file, the same in its name will show up in .\Results_ directory, and in ... reach outlet= ... messages in the top-level VelmaParallelRunlog.txt file.
Look in the .\Reasults_\GlobalStateLog.txt file (or alternate named file specified by the globalStateLogFileName parameter) for additional reach-specific log messages. All warning and severe message should have been copied to the FAILED_Results__log.txt fail-log file, but the surrounding log message may provide additional context and insight if the problem isn't evident in the fail-log file.
You can also grep (or Select-String in Powershell) to locate all the messages associated with .
Here is an example using the Powershell command: (select-string "reach outlet=3590" VelmaParallelRunlog.txt).Line

CONFIG 2021-10-12 14:26:42 VelmaParallelCmdLine: Independent reach outlet=3590 ready to run
INFO 2021-10-12 14:26:42 VelmaParallelCmdLine: STARTED reach outlet=3590 process=java.lang.ProcessImpl@52cc8049 memory="-Xmx1g"
INFO 2021-10-12 14:26:48 VelmaParallelCmdLine: FINISHED reach outlet=3590 outcome=FAILED process=java.lang.ProcessImpl@52cc8049 status=1
WARNING 2021-10-12 14:26:48 VelmaParallelCmdLine: FAILED reach outlet=3590 prevents dependent reach outlets=[1735, 3236, 2263] from running.

The above gives a concise summary of reach outlet=3590's runtime history (but no details as to why it failed -- look in the fail-log file or reach-specific global state log file for those.) Note: the (...).Line PowerShell syntax emits each line found without line numbers or other context; remember that the above lines are very likely not contiguous in the original log file.

Log Message Notes and Descriptions

In VelmaParallelRunlog.txt

Log Message Format Notes

All log messages start with 3 whitespace-separated fields: a status keyword, a date, and a time, in that order.

The keyword is one of the following: CONFIG, INFO, WARNING, or SEVERE.
The date format is "YYYY-MM-DD"
The time format is "HH:mm:ss"

Example: CONFIG 2021-10-12 14:26:42 ...

The first message written by VelmaParallelCmdLine for a given invocation is always a CONFIG message with the text:

VelmaParallelCmdLine: STARTING Parallel Processing

When a VelmaParallelRunlog.txt file already exists, (e.g. for a re-start of an existing, partially-completed run) log messages are appended to the existing file.
Look for the STARTING Parallel Processing message to determine the start of separate runs when multiple runs are appended into one log file.

The last message written in a given invocation of VelmaParalleCmdLine is a CONFIG message with the text

VelmaParallelCmdLine: FINISHED Parallel Processing ...

and includes outcome= and finalExitStatus= values.
When outcome=FAILED and finalExitStatus=1, it is preceeded by 2 log messages summarizing which reaches failed, and which reaches were prevented from running.
Example:

[ . . . ]
SEVERE 2021-10-12 14:26:51 VelmaParallelCmdLine: These reach outlets failed to run correctly: [2434, 3590]
WARNING 2021-10-12 14:26:51 VelmaParallelCmdLine: The failed reach outlets prevented these reaches from running: [1735, 2263, 3236]
CONFIG 2021-10-12 14:26:51 VelmaParallelCmdLine: FINISHED Parallel Processing outcome=FAILED finalExitStatus=1

Exit Status Codes

Each reach that VelmaParallelCmdLine tries to run reports its exit status in an INFO message with the text:

... FINISHED reach outlet= outcome=... status=...

The outlet= value is the reach outlet cell index location.
The outcome= value is either COMPLETED or FAILED
The status= value is 0 when outcome=COMPLETED, otherwise it is != 0 (and is usually 1).

Exit status messages indicate failure by a nonzero value ("1" in nearly every case).
The only exit status message value indicating complete success is zero ("0").
This applies to indivual reach message status values as well as the overall finalExitStatus value reported as the last message of a given VelmaParallelCmdLine invocation.

In Reach-Specific "fail-log" files

Filename and Contents

As stated previously, the reach-specific fail-log files have a uniform naming format that includes the reach's outlet cell index location in the name. The contents are specific for that reach, and are a filtered selection of the error messages reported in the reach's own global state log file.

Sometimes, There is Nothing to Say

It's possible for a reach to fail without leaving anything for VelmaParallelCmdLine to write into a fail-log file.:

No global state log file exists in the reach-specific .\Results_ directory
The global state log file cannot be opened for reading
The global state log is empty, or contains no error messages.

In any of these cases, VelmaParallelCmdLine can write a fail-log file, but has nothing to write. This can happen if the reach-specific simulation run failed very quickly (e.g. during initialization, before its own logger mechanism started), if the cause of the failure emitted no error messages, or if the failure somehow locks the reach's global state log file for read-access.
VelmaParallelCmdLine does write a fail-log file in these situations, but the contents will only report that it cannot report any actual error messages.
Look for the following messages in the fail-log file:

Reach outlet= status indicated a failed run, but no error messages were found in the log file.

FAILURE INFO UNAVAILABLE: cannot find logfile=...

Which are indicators that this sort of failure has occurred.

Sometimes, There is No Place to Write

If creating the .\FAILED_Results directory itself fails, or if a fail-log file cannot be created, VelmaParallelCmdLine logs the failure in the VelmaParallelRunlog.txt file with a WARNING message that says:

Cannot create a failure-summary log for reach outlet= : encountered exception=...

where is the outlet cell index of the reach that failed.

**Exception Stack Traces Are Not Logged**

The fail-log files do not contain exception stack trace information. Although this information is emitted for some reach simulation failures, it is emitted directly to stderr, and is not "caught" by either the reach-specific or the top-level VelmaParallelRunlog logger mechanims. The only place to view exception stack trace information is the command line console, and this is complicated by the fact that all the reach simulation processes that VelmaParallelCmdLine starts echo logging and exception messages back to the console, in an asynchronous and interleaved fashion. To reproduce and capture the console messages (include exception stack traces) for a specific reach, consider re-running that reach by itself using the VelmaSimulatorAltCmdLine tool (mentioned in "Gathering Further Information" below).

Gathering Further Information

Determining the exact cause of a specific reach's failure to run correctly and completely may involve running that specific reach by itself. You can do that using your existing simulation configuration .xml and the (partially-completed) results directory tree that VelmaParallelCmdLine created by using the related VelmaSimulatorAltCmdline tool. Please refer to VelmaSimulatorAltCmdline's HowTo guide for further information.

Submit a Suggestion