FLUENT randomly crashes with Mellanox OFED 1.5.3 while reading case/data files, during iterations, or while writing case/data files.

- Parallel runs at lower core counts (~64 cores) typically run fine.

- Runs using 64-256 cores might fail randomly during I/O or iterations with the messages below:

fluent_mpi.14.0.0: Rank 0:0: MPI_Send: ibv_reg_mr() failed: addr 0x2ae4879a4e58, len 1057200

fluent_mpi.14.0.0: Rank 0:0: MPI_Send: Internal MPI error

- Higher core counts (>256 cores) result in startup failures with the messages below:

Runs using Platform MPI:

fluent_mpi.14.5.0: Rank 0:22: MPI_Init: Could not pin pre-pinned rdma region 0

fluent_mpi.14.5.0: Rank 0:22: MPI_Init: hpmp_rdmaregion_alloc() failed
fluent_mpi.14.5.0: Rank 0:22: MPI_Init: make_world_rdmaenvelope() failed
fluent_mpi.14.5.0: Rank 0:22: MPI_Init: Internal Error: Processes cannot connect to rdma device

Runs using the uDAPL protocol (Platform MPI or Intel MPI) fail with the messages below:

[1:[../../dapl_module_util.c:1550] error(0x30000): OpenIB-mlx4_0-1: could not create vc: DAT_INSUFFICIENT_RESOURCES()
[15:../../dapl_module_util.c:2033] error(0x30000): OpenIB-mlx4_0-1: could not create vc: DAT_INSUFFICIENT_RESOURCES()
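
To check whether a failed run matches this issue, the run output can be searched for the error strings quoted above. The following is a minimal sketch, not an official diagnostic: the function name `matches_ofed_signature` and the log file name in the usage comment are hypothetical, and the three patterns are taken directly from the messages in this note.

```shell
#!/bin/sh
# Return 0 if the given log file contains one of the error strings
# quoted in this note, 1 otherwise.
# -F treats each pattern as a fixed string, so "()" needs no escaping.
matches_ofed_signature() {
    grep -F -q \
        -e 'ibv_reg_mr() failed' \
        -e 'Could not pin pre-pinned rdma region' \
        -e 'DAT_INSUFFICIENT_RESOURCES' \
        "$1"
}

# Example (the log file name is a placeholder):
#   matches_ofed_signature fluent-run.log && echo "matches the OFED 1.5.3 signatures"
```

A match only indicates that the run shows one of these messages; it does not by itself confirm the installed OFED version.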

Source of the problem:
Mellanox OFED version 1.5.3-1.0.0 (MLNX_OFED_LINUX-1.5.3-1.0.0) has been identified as the root cause of these crashes.
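
To find out whether a node runs the affected release, the installed OFED version string can be compared against the name given above. This is a sketch under assumptions: the helper name `is_affected_ofed` is hypothetical, and it assumes that `ofed_info -s` prints a single line beginning with the release name (possibly followed by a colon).

```shell
#!/bin/sh
# Return 0 if the version string names the release identified in this note.
# The trailing * tolerates a suffix such as the colon that ofed_info may append.
is_affected_ofed() {
    case "$1" in
        MLNX_OFED_LINUX-1.5.3-1.0.0*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, run on a compute node:
#   is_affected_ofed "$(ofed_info -s)" && echo "affected OFED release installed"
```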

Customers running with the -pib option on such clusters will encounter random crashes at different stages of the simulation. The issue has not yet been fixed in later Mellanox OFED releases. The last known stable version is 1.5.2.

Customers using InfiniBand hardware from other vendors, such as QLogic, are not affected. This issue applies only to Mellanox hardware users.


Below are solutions/workarounds for this issue (until the actual issue is fixed by Mellanox):

1) Revert to OFED 1.5.2.
2) Change the value of `log_num_mtt` in the mlx4_core driver.

To do this, modify /etc/modprobe.conf and restart the driver:

echo 'options mlx4_core log_num_mtt=24' >> /etc/modprobe.conf
/etc/init.d/openibd restart

This must be done on every machine in the cluster.
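
After the driver restart it can be useful to confirm that the new value is in effect and to estimate how much memory it allows to be registered. The sketch below rests on two assumptions that are not stated in this note: that the mlx4_core driver exposes its parameters under /sys/module/mlx4_core/parameters/, and that the commonly cited relation 2^log_num_mtt * 2^log_mtts_per_seg * page size gives the upper bound on registrable memory. The function name `mtt_max_reg_bytes` is hypothetical.

```shell
#!/bin/sh
# Upper bound on registrable memory in bytes, using the relation
#   2^log_num_mtt * 2^log_mtts_per_seg * page_size
# Arguments: log_num_mtt, log_mtts_per_seg, page size in bytes.
mtt_max_reg_bytes() {
    echo $(( (1 << $1) * (1 << $2) * $3 ))
}

# Print the value the loaded driver is using, if the parameter is exposed.
param=/sys/module/mlx4_core/parameters/log_num_mtt
if [ -r "$param" ]; then
    echo "log_num_mtt is $(cat "$param")"
fi

# Example: mtt_max_reg_bytes 24 3 4096
# (log_mtts_per_seg=3 and a 4096-byte page are illustrative values;
#  check the actual values on the node.)
```

With log_num_mtt=24, an illustrative log_mtts_per_seg of 3 and 4096-byte pages, the relation gives 2^39 bytes (512 GiB). The estimate is normally sized to exceed the physical memory of the node.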




