Fluent Random Crashes with Mellanox OFED 1.5.3

Fluent crashes randomly while reading case/dat files, during iterations, or while writing case/dat files. The error messages to expect depend on the core count and the interconnect protocol.

Parallel runs using up to about 64 cores will typically run fine.

Runs using 64-256 cores might fail randomly during I/O or iterations with the following messages:

fluent_mpi.14.0.0: Rank 0:0: MPI_Send: ibv_reg_mr() failed: addr 0x2ae4879a4e58, len 1057200
fluent_mpi.14.0.0: Rank 0:0: MPI_Send: Internal MPI error

Runs using more than 256 cores will fail at startup. With Platform MPI, the messages are:

fluent_mpi.14.5.0: Rank 0:22: MPI_Init: Could not pin pre-pinned rdma region 0
fluent_mpi.14.5.0: Rank 0:22: MPI_Init: hpmp_rdmaregion_alloc() failed
fluent_mpi.14.5.0: Rank 0:22: MPI_Init: make_world_rdmaenvelope() failed
fluent_mpi.14.5.0: Rank 0:22: MPI_Init: Internal Error: Processes cannot connect to rdma device

Runs using the uDAPL protocol (Platform MPI or Intel MPI) fail with the following messages:

[1:[../../dapl_module_util.c:1550] error(0x30000): OpenIB-mlx4_0-1: could not create vc: DAT_INSUFFICIENT_RESOURCES()
[15:../../dapl_module_util.c:2033] error(0x30000): OpenIB-mlx4_0-1: could not create vc: DAT_INSUFFICIENT_RESOURCES()

Source of the problem:

Mellanox OFED version 1.5.3-1.0.0 ("MLNX_OFED_LINUX-1.5.3-1.0.0") has been found to be the root cause of these crashes. Customers running with the -pib option on such clusters will encounter random crashes at different stages of the simulation. The issue has not yet been fixed in later Mellanox OFED releases either. The last known stable version is 1.5.2.

Customers using InfiniBand hardware from other vendors, such as QLogic, are not affected. This issue applies only to Mellanox hardware users.
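To check whether a node is running the release named above, the installed OFED version string can be compared against it. The following is a minimal sketch, not an official diagnostic. It assumes the ofed_info utility that ships with OFED is on the PATH and that "ofed_info -s" prints a string such as "MLNX_OFED_LINUX-1.5.3-1.0.0:". The function name is_affected_ofed is hypothetical, and it matches only 1.5.3 builds; it does not attempt to recognize the later releases that are also reported as affected.

```shell
#!/bin/sh
# Return 0 if the given version string is a Mellanox OFED 1.5.3 build,
# 1 otherwise. (Hypothetical helper; only the 1.5.3 series is matched.)
is_affected_ofed() {
    case "$1" in
        MLNX_OFED_LINUX-1.5.3*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the installed version only if the tool is available on this node.
if command -v ofed_info >/dev/null 2>&1; then
    version=$(ofed_info -s)
    if is_affected_ofed "$version"; then
        echo "Affected OFED release installed: $version"
    else
        echo "OFED release not matched as affected: $version"
    fi
else
    echo "ofed_info not found; cannot determine the OFED version"
fi
```

Running this on each compute node shows which machines need one of the workarounds.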
The following solutions/workarounds are available until Mellanox fixes the underlying issue:
1) Revert to OFED 1.5.2.
2) Change the value of log_num_mtt in the mlx4_core driver.
To do this, add the option to /etc/modprobe.conf and then restart the driver:
echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.conf
This must be done on every machine in the cluster.
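Because the echo command appends a new line each time it is run, repeating it on a node leaves duplicate entries in the file. The sketch below shows one way to apply the setting only once and then confirm the value the loaded driver is using. It is an illustration under assumptions: the function name set_log_num_mtt is hypothetical, the sysfs path /sys/module/mlx4_core/parameters/log_num_mtt is assumed to be exposed by the installed driver, and the restart command in the comment is the usual OFED init script, which may differ on your system. The function does not change an existing line that sets a different value.

```shell
#!/bin/sh
# Append "options mlx4_core log_num_mtt=<value>" to the given modprobe
# config file unless that exact line is already present.
# (Hypothetical helper; an existing line with another value is left alone.)
set_log_num_mtt() {
    conf_file=$1
    value=$2
    line="options mlx4_core log_num_mtt=$value"
    if grep -qxF "$line" "$conf_file" 2>/dev/null; then
        return 0
    fi
    echo "$line" >> "$conf_file"
}

# Typical use as root on each node (left commented out on purpose):
#   set_log_num_mtt /etc/modprobe.conf 24
#   /etc/init.d/openibd restart    # assumed driver restart command

# After the restart, report the value the loaded driver actually uses.
param=/sys/module/mlx4_core/parameters/log_num_mtt
if [ -r "$param" ]; then
    echo "mlx4_core log_num_mtt is currently $(cat "$param")"
else
    echo "mlx4_core parameter not readable at $param"
fi
```

If the reported value is not 24 after the restart, the option was not picked up on that node.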