OpenMPI and Open-MX setup memo:


OS: CentOS 5.3/5.4

CUDA:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Jul_30_09:24:36_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221

CUDA SDK:
2.3

Network: Intel Gigabit ET Quad Port Server Adapter E1G44ET
Driver: igb

The adapter provides 4 ports: eth3, eth4, eth5, eth6.
For the best performance in the multi-rail setup, apply the following settings to every port:
ethtool -A eth3 autoneg on rx on tx on
ethtool -A eth4 autoneg on rx on tx on
ethtool -A eth5 autoneg on rx on tx on
ethtool -A eth6 autoneg on rx on tx on
ethtool -C eth3 rx-usecs 12
ethtool -C eth4 rx-usecs 12
ethtool -C eth5 rx-usecs 12
ethtool -C eth6 rx-usecs 12
With rx-usecs 12, the latency is about 12-13 usecs.
With rx-usecs 0, MPI performance becomes bumpy.
Enabling flow control (rx on, tx on) also reduces the bumpiness.
An MTU of 9000 is used for Open-MX.
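
The per-port commands above can be generated with a small loop instead of being typed out one by one. The following is a minimal sketch, not part of the original setup: the function name nic_tuning_cmds is hypothetical, and the "ip link set ... mtu 9000" line is an assumption about how the MTU is set (the memo only states that MTU 9000 is used). The function only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# nic_tuning_cmds PORT...: print the tuning commands for each given port.
# Nothing is executed here; the commands are only written to stdout.
nic_tuning_cmds() {
    for dev in "$@"; do
        # Pause-frame flow control in both directions (from the memo).
        echo "ethtool -A $dev autoneg on rx on tx on"
        # Interrupt coalescing: rx-usecs 12 (from the memo).
        echo "ethtool -C $dev rx-usecs 12"
        # Jumbo frames; the exact command is an assumption.
        echo "ip link set dev $dev mtu 9000"
    done
}

nic_tuning_cmds eth3 eth4 eth5 eth6
```

To apply the settings, the printed commands can be piped to sh in a root shell once they look right.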

Open-MX: 1.2.0
[http://open-mx.gforge.inria.fr/]

Configured with:
./configure --prefix=/opt/open-mx-1.2.0 --disable-mx-wire --disable-endian --disable-FMA

omx_perf result (single port, point-to-point connection):
===========
length         0:       13.834 us       0.00 MB/s        0.00 MiB/s
length         1:       12.902 us       0.08 MB/s        0.07 MiB/s
length         2:       13.061 us       0.15 MB/s        0.15 MiB/s
length         4:       13.097 us       0.31 MB/s        0.29 MiB/s
length         8:       12.903 us       0.62 MB/s        0.59 MiB/s
length        16:       13.091 us       1.22 MB/s        1.17 MiB/s
length        32:       13.392 us       2.39 MB/s        2.28 MiB/s
length        64:       13.934 us       4.59 MB/s        4.38 MiB/s
length       128:       15.848 us       8.08 MB/s        7.70 MiB/s
length       256:       18.922 us       13.53 MB/s       12.90 MiB/s
length       512:       24.026 us       21.31 MB/s       20.32 MiB/s
length      1024:       33.584 us       30.49 MB/s       29.08 MiB/s
length      2048:       53.032 us       38.62 MB/s       36.83 MiB/s
length      4096:       101.085 us      40.52 MB/s       38.64 MiB/s
length      8192:       174.576 us      46.93 MB/s       44.75 MiB/s
length     16384:       240.273 us      68.19 MB/s       65.03 MiB/s
length     32768:       372.839 us      87.89 MB/s       83.82 MiB/s
length     65536:       667.963 us      98.11 MB/s       93.57 MiB/s
length    131072:       1207.452 us     108.55 MB/s      103.52 MiB/s
length    262144:       2258.683 us     116.06 MB/s      110.68 MiB/s
length    524288:       4377.124 us     119.78 MB/s      114.23 MiB/s
length   1048576:       8607.746 us     121.82 MB/s      116.17 MiB/s
length   2097152:       17065.792 us    122.89 MB/s      117.19 MiB/s
length   4194304:       33994.376 us    123.38 MB/s      117.67 MiB/s
===========
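
The two bandwidth columns in the omx_perf table follow directly from the length and time columns, which makes it easy to sanity-check a row or compare it with the raw Gigabit Ethernet rate of 125 MB/s (1 Gbit/s divided by 8). The sketch below is only an illustration using awk; the function names omx_bw and pct_of_gbe are hypothetical and not part of Open-MX.

```shell
#!/bin/sh
# omx_bw BYTES USEC: print "MB/s MiB/s" for one omx_perf row.
# MB/s uses 10^6 bytes, MiB/s uses 2^20 bytes.
omx_bw() {
    awk -v b="$1" -v t="$2" 'BEGIN {
        printf "%.2f %.2f\n", b / t, (b / 1048576) / (t / 1000000)
    }'
}

# pct_of_gbe MBPS: percentage of the 125 MB/s raw Gigabit Ethernet rate.
pct_of_gbe() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", 100 * m / 125 }'
}

omx_bw 4194304 33994.376    # 4 MiB row: prints "123.38 117.67"
pct_of_gbe 123.38           # prints "98.7"
```

For the 4 MiB row this gives about 98.7% of the raw line rate on a single port.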

MPI: OpenMPI 1.4
[http://www.open-mpi.org/]

Configured with:
===========
./configure --prefix=/opt/openmpi-1.4-mx \
--with-memory-manager=none \
--disable-shared \
--disable-mpi-cxx \
--enable-static \
--enable-mpi-threads \
--with-threads=posix \
--with-mx=/opt/open-mx \
--with-mx-libdir=/opt/open-mx/lib64 \
CC=icc CXX=icpc F77=ifort FC=ifort \
CFLAGS="-O3 -xT -static -g -traceback -gcc -m64" \
FFLAGS="-O3 -xT -static -g -traceback -m64" \
FCFLAGS="-O3 -xT -static -g -traceback -m64" \
LD=ld
===========
The option "--with-memory-manager=none" is important when building with the Intel compiler. Without it, compilation stops in the opal/.../ptmalloc2 directory.
For the best performance in the multi-rail setup (quad-port Ethernet), set the following MCA BTL MX parameters in /opt/openmpi-1.4-mx/etc/openmpi-mca-params.conf:
===========
btl_mx_bandwidth = 500
btl_mx_latency = 25
btl_mx_bonding = 1
btl_base_warn_component_unused = 1
===========
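All 4 ports (eth3, eth4, eth5, eth6) are attached to the Open-MX driver. With these settings, OpenMPI over Open-MX performs very well: in the IMB results below, PingPong reaches 458 Mbytes/sec for 4 MiB messages, compared with about 118 MiB/s for a single port in omx_perf.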
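
Open MPI also accepts MCA parameters on the mpirun command line as "--mca <name> <value>", which can be handy for trying values before committing them to openmpi-mca-params.conf. The following is a rough sketch of a helper that turns "name = value" lines into such flags; the function name mca_args is hypothetical, and it assumes simple values without embedded blanks.

```shell
#!/bin/sh
# mca_args: read "name = value" lines on stdin and print the matching
# "--mca name value" flags on a single line.
mca_args() {
    awk -F= '
        /^[ \t]*#/ || /^[ \t]*$/ { next }   # skip comments and blank lines
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)          # strip blanks around the "="
            gsub(/[ \t]/, "", value)
            printf "%s--mca %s %s", sep, name, value
            sep = " "
        }
        END { print "" }'
}

# The four settings from the memo, as command-line flags.
mca_args <<'EOF'
btl_mx_bandwidth = 500
btl_mx_latency = 25
btl_mx_bonding = 1
btl_base_warn_component_unused = 1
EOF
```

The printed flags can then be placed between mpirun and the program name for a single run, leaving the system-wide file untouched.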

IMB-3.2 result:

% mpirun -np 4 -bynode ...
=============
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2, MPI-1 part    
#---------------------------------------------------
# Date                  : Mon Dec 21 11:52:09 2009
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.18-164.6.1.el5
# Version               : #1 SMP Tue Nov 3 16:12:36 EST 2009
# MPI Version           : 2.1
# MPI Thread Environment: MPI_THREAD_MULTIPLE


# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down 
# dynamically when a certain run time (per message size sample) 
# is expected to be exceeded. Time limit is defined by variable 
# "SECS_PER_SAMPLE" (=> IMB_settings.h) 
# or through the flag => -time 
  


# Calling sequence was: 

# ./IMB-MPI1 sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        13.78        13.80        13.79         0.00
            1         1000        13.54        13.56        13.55         0.14
            2         1000        13.47        13.47        13.47         0.28
            4         1000        13.33        13.33        13.33         0.57
            8         1000        13.42        13.43        13.42         1.14
           16         1000        13.74        13.74        13.74         2.22
           32         1000        14.08        14.08        14.08         4.34
           64         1000        14.77        14.77        14.77         8.26
          128         1000        17.08        17.08        17.08        14.29
          256         1000        20.11        20.12        20.12        24.27
          512         1000        25.45        25.46        25.46        38.36
         1024         1000        35.53        35.57        35.55        54.91
         2048         1000        56.03        56.07        56.05        69.67
         4096         1000       132.30       132.30       132.30        59.05
         8192         1000       139.91       139.91       139.91       111.68
        16384         1000       170.61       170.63       170.62       183.15
        32768         1000       262.18       262.21       262.20       238.36
        65536          640       329.08       329.10       329.09       379.82
       131072          320       474.95       474.98       474.97       526.33
       262144          160       801.22       801.29       801.25       624.00
       524288           80      1546.08      1546.25      1546.16       646.73
      1048576           40      3130.63      3130.82      3130.73       638.81
      2097152           20      6503.36      6504.00      6503.68       615.01
      4194304           10     15194.99     15195.80     15195.39       526.46

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 4 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        13.99        14.04        14.02         0.00
            1         1000        13.71        13.72        13.71         0.14
            2         1000        14.28        14.32        14.31         0.27
            4         1000        13.69        13.72        13.70         0.56
            8         1000        14.03        14.07        14.05         1.08
           16         1000        14.20        14.22        14.21         2.15
           32         1000        15.17        15.22        15.20         4.01
           64         1000        15.09        15.10        15.10         8.08
          128         1000        17.98        18.02        18.00        13.55
          256         1000        20.65        20.70        20.68        23.59
          512         1000        26.64        26.72        26.69        36.55
         1024         1000        35.88        35.97        35.92        54.30
         2048         1000        56.24        56.40        56.32        69.26
         4096         1000       137.73       137.77       137.75        56.71
         8192         1000       146.22       146.26       146.24       106.83
        16384         1000       162.60       162.71       162.67       192.06
        32768         1000       276.37       276.46       276.43       226.07
        65536          640       409.15       409.52       409.32       305.24
       131072          320       600.41       601.50       600.91       415.63
       262144          160      1470.24      1470.55      1470.40       340.01
       524288           80      2449.69      2449.81      2449.75       408.19
      1048576           40      4504.12      4557.88      4543.65       438.80
      2097152           20      8583.20      8798.05      8742.06       454.65
      4194304           10     17032.12     17245.01     17187.83       463.90


# All processes entering MPI_Finalize

----
# Exchange

#-----------------------------------------------------------------------------
# Benchmarking Exchange 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        16.34        16.34        16.34         0.00
            1         1000        15.32        15.32        15.32         0.25
            2         1000        16.00        16.02        16.01         0.48
            4         1000        15.26        15.28        15.27         1.00
            8         1000        16.24        16.24        16.24         1.88
           16         1000        15.76        15.76        15.76         3.87
           32         1000        16.99        17.02        17.00         7.17
           64         1000        16.97        16.99        16.98        14.37
          128         1000        20.32        20.33        20.33        24.02
          256         1000        22.57        22.58        22.57        43.25
          512         1000        28.52        28.55        28.54        68.40
         1024         1000        38.23        38.25        38.24       102.13
         2048         1000        60.81        60.85        60.83       128.39
         4096         1000       250.88       250.90       250.89        62.28
         8192         1000       267.87       267.87       267.87       116.66
        16384         1000       289.17       289.18       289.18       216.13
        32768         1000       492.19       492.20       492.19       253.96
        65536          640       638.92       638.93       638.92       391.28
       131072          320       936.69       936.71       936.70       533.78
       262144          160      1545.04      1545.13      1545.08       647.20
       524288           80      2596.69      2596.80      2596.74       770.18
      1048576           40      4702.63      4702.85      4702.74       850.55
      2097152           20      8945.35      8946.30      8945.82       894.22
      4194304           10     17617.20     17618.39     17617.80       908.14

#-----------------------------------------------------------------------------
# Benchmarking Exchange 
# #processes = 4 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        14.87        14.88        14.87         0.00
            1         1000        16.98        16.99        16.98         0.22
            2         1000        14.98        15.00        14.99         0.51
            4         1000        15.24        15.25        15.25         1.00
            8         1000        15.27        15.29        15.28         2.00
           16         1000        17.22        17.24        17.23         3.54
           32         1000        16.04        16.06        16.05         7.60
           64         1000        17.08        17.09        17.09        14.28
          128         1000        20.87        20.90        20.89        23.36
          256         1000        26.96        27.00        26.98        36.17
          512         1000        28.14        28.17        28.15        69.34
         1024         1000        44.22        44.27        44.25        88.23
         2048         1000        60.92        60.97        60.94       128.15
         4096         1000       291.34       291.45       291.40        53.61
         8192         1000       281.90       281.94       281.91       110.84
        16384         1000       334.74       334.84       334.78       186.66
        32768         1000       569.37       569.58       569.49       219.46
        65536          640       814.59       814.91       814.68       306.78
       131072          320      1297.89      1299.04      1298.46       384.90
       262144          160      2915.94      2922.26      2920.04       342.20
       524288           80      5002.27      5016.69      5009.51       398.67
      1048576           40      9265.12      9313.92      9289.56       429.46
      2097152           20     18330.41     18523.85     18427.12       431.88
      4194304           10     35378.00     36082.60     35730.58       443.43
----
# PingPong

#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        12.89         0.00
            1         1000        13.10         0.07
            2         1000        13.06         0.15
            4         1000        13.08         0.29
            8         1000        13.18         0.58
           16         1000        13.41         1.14
           32         1000        13.83         2.21
           64         1000        14.53         4.20
          128         1000        17.00         7.18
          256         1000        19.73        12.37
          512         1000        25.04        19.50
         1024         1000        35.25        27.70
         2048         1000        54.85        35.61
         4096         1000       131.15        29.78
         8192         1000       140.83        55.47
        16384         1000       152.95       102.16
        32768         1000       252.08       123.97
        65536          640       327.77       190.68
       131072          320       475.06       263.12
       262144          160       769.79       324.76
       524288           80      1301.54       384.16
      1048576           40      2354.76       424.67
      2097152           20      4473.95       447.03
      4194304           10      8727.20       458.34
----
# Barrier

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        14.23        14.24        14.24

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 4 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        26.22        26.23        26.23
----
# Allgather

#----------------------------------------------------------------
# Benchmarking Allgather 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1         1000        15.17        15.17        15.17
            2         1000        14.34        14.34        14.34
            4         1000        14.22        14.22        14.22
            8         1000        14.30        14.30        14.30
           16         1000        14.59        14.59        14.59
           32         1000        15.07        15.07        15.07
           64         1000        15.77        15.78        15.78
          128         1000        17.85        17.85        17.85
          256         1000        20.75        20.78        20.76
          512         1000        25.93        25.96        25.95
         1024         1000        36.01        36.05        36.03
         2048         1000        56.28        56.32        56.30
         4096         1000       134.17       134.19       134.18
         8192         1000       144.00       144.02       144.01
        16384         1000       158.85       158.86       158.85
        32768         1000       262.29       262.32       262.30
        65536          640       336.30       336.31       336.31
       131072          320       491.06       491.10       491.08
       262144          160       869.71       869.72       869.71
       524288           80      1504.08      1504.20      1504.14
      1048576           40      3322.63      3322.73      3322.68
      2097152           20      8086.30      8086.50      8086.40
      4194304           10     17404.29     17404.29     17404.29

#----------------------------------------------------------------
# Benchmarking Allgather 
# #processes = 4 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1         1000        27.08        27.09        27.08
            2         1000        28.05        28.06        28.06
            4         1000        27.72        27.75        27.74
            8         1000        26.80        26.82        26.81
           16         1000        28.25        28.27        28.26
           32         1000        30.26        30.27        30.27
           64         1000        33.26        33.29        33.28
          128         1000        37.50        37.51        37.51
          256         1000        46.38        46.41        46.39
          512         1000        61.92        61.98        61.95
         1024         1000        91.41        91.45        91.43
         2048         1000       188.71       188.75       188.73
         4096         1000       316.29       316.35       316.32
         8192         1000       425.82       425.98       425.90
        16384         1000       666.94       667.23       667.09
        32768         1000      1019.19      1019.59      1019.40
        65536          640      1452.81      1453.70      1453.24
       131072          320      2148.04      2148.46      2148.23
       262144          160      3162.96      3163.08      3163.00
       524288           80      6284.14      6284.39      6284.26
      1048576           40     12090.43     12097.53     12094.01
      2097152           20     25040.75     25041.10     25040.91
      4194304           10     43153.29     43158.01     43155.04
=============
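
The "Mbytes/sec" column in the IMB tables can be reproduced from the message size and the time column, which helps when comparing the multi-rail results with the single-port omx_perf numbers. IMB counts in units of 2^20 bytes and, in the rows checked here, uses 1 message per sample for PingPong, 2 for Sendrecv and 4 for Exchange, with t_max as the time for the latter two. The sketch below is an illustration only; the function names imb_mbytes and speedup are hypothetical.

```shell
#!/bin/sh
# imb_mbytes FACTOR BYTES USEC: IMB "Mbytes/sec" value (units of 2^20 bytes).
# FACTOR is the number of messages counted per sample:
# 1 for PingPong, 2 for Sendrecv, 4 for Exchange.
imb_mbytes() {
    awk -v f="$1" -v b="$2" -v t="$3" 'BEGIN {
        printf "%.2f\n", f * (b / 1048576) / (t / 1000000)
    }'
}

# speedup MULTI SINGLE: ratio of multi-rail to single-port bandwidth.
speedup() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.2f\n", m / s }'
}

imb_mbytes 1 4194304 8727.20     # PingPong, 4 MiB row:   prints 458.34
imb_mbytes 2 4194304 15195.80    # Sendrecv, 2 processes: prints 526.46
imb_mbytes 4 4194304 17618.39    # Exchange, 2 processes: prints 908.14
speedup 458.34 117.67            # PingPong vs. omx_perf single port: prints 3.90
```

By this measure the 4 MiB PingPong bandwidth over four ports is about 3.9 times the single-port omx_perf figure of 117.67 MiB/s.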