I have an HPC cluster made up of five machines (master, slave1, slave2, slave3 and slave4). I am trying to launch a run across the whole cluster:

mpirun -report-uri - -host master,slave1,slave2,slave3,slave4 --map-by node -np 50 hellompi

but I run into this error:

657129472.0;tcp://10.1.1.1,10.1.2.1,10.1.3.1,10.1.4.1:54761
[charlotte-ProLiant-DL380-Gen10-slave1:07172] [[10027,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
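
For reference, this is how one would verify the first bullet, i.e. that orted and mpirun are found in a non-interactive ssh session on each slave (just a sketch, output omitted; the hostnames are the ones from my cluster):

for h in slave1 slave2 slave3 slave4; do
    ssh "$h" 'echo $PATH; which orted; which mpirun'
done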

I am running Ubuntu. Firewalls (UFW) are disabled on every machine. SSH login works fine, including passwordless login. The mpirun version is the same on every machine. iptables is enabled.
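
Since iptables is still enabled even though UFW is off, this kind of check (a sketch, run on each machine; output not shown) would rule out leftover rules filtering the TCP connections that ORTE opens between the nodes:

sudo iptables -L -n -v    # list all rules with packet/byte counters
sudo iptables -S          # the same rules in iptables-save syntax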

The program I am trying to run is a simple Fortran hello world:

program hello
  include 'mpif.h'
  integer :: rank, size, ierror, nl
  character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)   ! total number of ranks
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)   ! rank of this process
  call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)  ! host this rank runs on
  print *, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello world'
  call MPI_FINALIZE(ierror)
end program hello
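
For completeness, this is roughly how such a binary would be built with the Open MPI Fortran wrapper; the source file name is only illustrative, and the resulting hellompi binary has to exist at the same path on every node:

mpif90 -o hellompi hellompi.f90              # Open MPI wrapper around gfortran
for h in slave1 slave2 slave3 slave4; do     # copy the binary to each slave
    scp hellompi "$h":~/
done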

If I run on just a pair of nodes, it works:

mpirun -report-uri - --mca oob_tcp_if_include 10.1.1.0/24 -host master,slave1 --map-by node -np 4 hellompi

4211277824.0;tcp://10.1.1.1:49281
node           0  of           4  on charlotte-ProLiant-DL380-Gen10-master: Hello world
node           2  of           4  on charlotte-ProLiant-DL380-Gen10-master: Hello world
node           1  of           4  on charlotte-ProLiant-DL380-Gen10-slave1: Hello world
node           3  of           4  on charlotte-ProLiant-DL380-Gen10-slave1: Hello world

The same works with master+slave2, master+slave3 and master+slave4.
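
Each slave sits on a different /24 of the master, so the equivalent all-node command would presumably have to list every subnet; as far as I know oob_tcp_if_include (and btl_tcp_if_include) accept comma-separated CIDR lists, so an untested sketch of that run would look like:

mpirun -report-uri - \
    --mca oob_tcp_if_include 10.1.1.0/24,10.1.2.0/24,10.1.3.0/24,10.1.4.0/24 \
    --mca btl_tcp_if_include 10.1.1.0/24,10.1.2.0/24,10.1.3.0/24,10.1.4.0/24 \
    -host master,slave1,slave2,slave3,slave4 --map-by node -np 50 hellompi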

On the master, ifconfig shows:

eno1      Link encap:Ethernet  HWaddr 54:80:28:57:0f:7e  
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      Packets reçus:0 erreurs:0 :0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B)
      Interruption:16 

eno2      Link encap:Ethernet  HWaddr 54:80:28:57:0f:7f  
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      Packets reçus:0 erreurs:0 :0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B)
      Interruption:17 

eno3      Link encap:Ethernet  HWaddr 54:80:28:57:0f:80  
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      Packets reçus:0 erreurs:0 :0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B)
      Interruption:16 

eno4      Link encap:Ethernet  HWaddr 54:80:28:57:0f:81  
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      Packets reçus:0 erreurs:0 :0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B)
      Interruption:17 

eno5      Link encap:Ethernet  HWaddr 80:30:e0:31:b1:68  
      inet adr:10.1.3.1  Bcast:10.1.3.255  Masque:255.255.255.0
      adr inet6: fe80::5cfb:416e:a702:7582/64 Scope:Lien
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      Packets reçus:1038 erreurs:0 :0 overruns:0 frame:0
      TX packets:966 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:186531 (186.5 KB) Octets transmis:106391 (106.3 KB)
      Interruption:32 Mémoire:e7800000-e7ffffff 

eno6      Link encap:Ethernet  HWaddr 80:30:e0:31:b1:6c  
      inet adr:10.1.4.1  Bcast:10.1.4.255  Masque:255.255.255.0
      adr inet6: fe80::9451:8431:7010:46/64 Scope:Lien
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      Packets reçus:873 erreurs:0 :0 overruns:0 frame:0
      TX packets:844 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:86934 (86.9 KB) Octets transmis:72778 (72.7 KB)
      Interruption:144 Mémoire:e8800000-e8ffffff 

ens2f0    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a8  
      inet adr:10.1.1.1  Bcast:10.1.1.255  Masque:255.255.255.0
      adr inet6: fe80::39c2:fdd5:930e:c253/64 Scope:Lien
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      Packets reçus:2195 erreurs:0 :0 overruns:0 frame:0
      TX packets:1425 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:1332614 (1.3 MB) Octets transmis:200100 (200.1 KB)
      Interruption:28 Mémoire:e3000000-e37fffff 

ens2f1    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:ac  
      inet adr:10.1.2.1  Bcast:10.1.2.255  Masque:255.255.255.0
      adr inet6: fe80::91f5:53ce:378a:686e/64 Scope:Lien
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      Packets reçus:1644 erreurs:0 :0 overruns:0 frame:0
      TX packets:1385 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:379968 (379.9 KB) Octets transmis:211904 (211.9 KB)
      Interruption:123 Mémoire:e4000000-e47fffff 

ens5f0    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a0  
      inet adr:10.0.0.2  Bcast:10.1.0.255  Masque:255.255.255.0
      adr inet6: fe80::52e5:a943:831d:35f5/64 Scope:Lien
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      Packets reçus:9821 erreurs:0 :0 overruns:0 frame:0
      TX packets:9230 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:983759 (983.7 KB) Octets transmis:2599111 (2.5 MB)
      Interruption:34 Mémoire:f0000000-f07fffff 

ens5f1    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a4  
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      Packets reçus:0 erreurs:0 :0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B)
      Interruption:165 Mémoire:f1000000-f17fffff 

lo        Link encap:Boucle locale  
      inet adr:127.0.0.1  Masque:255.0.0.0
      adr inet6: ::1/128 Scope:Hôte
      UP LOOPBACK RUNNING  MTU:65536  Metric:1
      Packets reçus:230476 erreurs:0 :0 overruns:0 frame:0
      TX packets:230476 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 lg file transmission:1000 
      Octets reçus:411579801 (411.5 MB) Octets transmis:411579801 (411.5 MB)

Each slave hangs off its own interface of the master: slave1 via 10.1.1.1 (ens2f0), slave2 via 10.1.2.1 (ens2f1), slave3 via 10.1.3.1 (eno5), slave4 via 10.1.4.1 (eno6).
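
Since the slaves live on four separate point-to-point subnets, the "no route found" bullet in the ORTE message seems relevant; a minimal check from, say, slave1 (a sketch using only the master addresses listed above) would be:

ip route                  # routing table on slave1
ip route get 10.1.2.1     # is there any route towards the master's ens2f1 address?
ping -c 1 10.1.3.1        # master's eno5 address, i.e. the link slave3 uses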

Information about my Open MPI version:

ompi_info
Package: Open MPI buildd@lgw01-57 Distribution
Open MPI: 1.10.2
Open MPI repo revision: v1.10.1-145-g799148f
Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
Open RTE repo revision: v1.10.1-145-g799148f
Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
OPAL repo revision: v1.10.1-145-g799148f
OPAL release date: Jan 21, 2016
MPI API: 3.0.0
Ident string: 1.10.2
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: lgw01-57
Configured by: buildd
Configured on: Thu Feb 25 16:33:01 UTC 2016
Configure host: lgw01-57
Built by: buildd
Built on: Thu Feb 25 16:40:59 UTC 2016
Built host: lgw01-57
C bindings: yes
C++ bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                      limitations in the gfortran compiler, does not
                      support the following: array subsections, direct
                      passthru (where possible) to underlying Open MPI's
                      C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 5.3.1
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
                      OMPI progress: no, ORTE progress: yes, Event lib:
                      yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
          dl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions: 
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
VampirTrace support: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA hwloc: external (MCA v2.0.0, API v2.0.0, Component
                       v1.10.2)
MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component
                      v1.10.2)
MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component
                      v1.10.2)
MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component
                      v1.10.2)
MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component
                      v1.10.2)
MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component
                      v1.10.2)
MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA odls: default (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA ras: gridengine (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component
                      v1.10.2)
MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component
                      v1.10.2)
MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)
MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component
                      v1.10.2)

Any ideas?

1 Answer

For anyone who runs into a similar problem: I partially solved it with the help of the OMPI team. Please see https://github.com/open-mpi/ompi/issues/6293 for the details. The issue has since been locked.
