From 6294a43ceb199d318766a0ee7b0db4f3d2e39b6e Mon Sep 17 00:00:00 2001
From: Zhumatiy Sergey
Date: Sat, 16 Apr 2022 01:46:04 +0300
Subject: [PATCH 1/6] First description

---
 Containerization-HPC.md | 144 ++++++++++++++++++++++++++++++----------
 1 file changed, 108 insertions(+), 36 deletions(-)

diff --git a/Containerization-HPC.md b/Containerization-HPC.md
index bf3cb8db..a2d760eb 100644
--- a/Containerization-HPC.md
+++ b/Containerization-HPC.md
@@ -6,39 +6,111 @@
 The aim of this task is to build an HPC compatible container (i.e.
 [Singularity](https://sylabs.io/guides/3.5/user-guide/introduction.html)) and
 test its performance in comparison with a native installation (no
 containerization) for a set of distributed memory calculations.
 
-# Requirements
-
-1. A working deployment pipeline - using any preferred tool such as SaltStack, Terraform, CloudFormation - for building out the computational infrastructure
-2. A pipeline for building the HPC compatible container
-3. A set of benchmarks for one or more HPC application on one or more cloud instance type
-
-# Expectations
-
-- The application may be relatively simple - e.g. Linpack, this is focused more on infrastructure
-- Repeatable approach (no manual setup "in console")
-- Clean workflow logic
-
-# Timeline
-
-We leave exact timing to the candidate. Should fit Within 5 days total.
-
-# User story
-
-As a user of this pipeline I can:
-
-- build an HPC-compatible container for an HPC executable/code
-- run test calculations to assert working state of this container
-- (optional) compare the behavior of this container with a OS native installation
-
-# Notes
-
-- Commit early and often
-
-# Suggestions
-
-We suggest:
-
-- using AWS as the cloud provider
-- using Exabench as the source of benchmarks: https://github.com/Exabyte-io/exabyte-benchmarks-suite
-- using CentOS or similar as operating system
-- using Salstack, or Terraform, for infrastructure management
+# Solution
+
+Here we use AWS ParallelCluster (https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html). An AWS account is needed. We assume a Linux machine and console tools; the AWS CLI is expected to be installed.
+
+**Note**: the AWS account must have the appropriate permissions (TO BE ADDED).
+
+## Steps
+
+1. Install the pcluster tool (v2 is used here): `pip3 install "aws-parallelcluster<3.0" --upgrade --user`
+2. Run the initial configuration: `aws configure; pcluster configure`. Slurm is the preferred scheduler
+3. Create a cluster: `pcluster create my-hpc-cluster`
+4. Log in to the cluster: `pcluster ssh my-hpc-cluster`
+5. Install Singularity from the GitHub releases page (https://github.com/sylabs/singularity/releases/): `wget https://github.com/sylabs/singularity/releases/download/v3.9.8/singularity-ce_3.9.8-focal_amd64.deb; sudo apt install ./singularity-ce_3.9.8-focal_amd64.deb`
+6. Copy and edit the recipe file (see [hpl1.def](hpl1.def))
+7. Build a Singularity image: `sudo singularity build my-image.sif my-recipe.def` (note the **sudo** usage).
+8. Run the application like this: `sbatch -n NUM_CPUS ./sing-run.sh --bind INPUT:/INPUT my-image.sif /path/to/app/in/image`. Here `--bind ...` is optional: it passes the input file into the container, can be specified multiple times, and can bind directories as well.
+
+`sing-run.sh` can be modified for custom needs; a worked example is given after the note below.
+
+**Important note**: do not use the stock OpenMPI; you should use the Amazon EFA toolkit and Amazon's OpenMPI build. See the example recipe file (hpl1.def) for details.
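+
+For example, a minimal sketch of an HPL run using the image and scripts attached below (the task count 72 is illustrative and matches the 9x8 process grid used in the results; `sing-run.sh` is a thin wrapper around `mpirun -n $SLURM_NPROCS singularity exec "$@"`):
+
+```
+# bind the local HPL.dat over /HPL.dat inside the image and run the bundled binary
+sbatch -n 72 ./sing-run.sh --bind HPL.dat:/HPL.dat hpl1.sif /bin/xhpl
+```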
+
+## Options
+
+You can prepare your Singularity image on another AWS instance or on your own computer, but you should optimize it for the target instance's processor (AVX-512 preferred).
+
+You can set up the cluster without Slurm and run the app via mpirun directly (a sketch is given under the Terraform approach notes below).
+
+You can set up the cluster without Slurm via Terraform and Salt/Ansible, but the details are left to later investigation.
+
+## Example for HPL
+
+Here is a commented hpl1.def file; use it as a sample for new images.
+
+```
+Bootstrap: docker
+# use Amazon Linux, which is based on CentOS. If you use Ubuntu or Alpine, adjust the package management commands.
+From: amazonlinux
+
+# files to be copied into the container
+%files
+    hpl-2.3.tar.gz
+    Make.t1
+
+# default environment: add the Amazon OpenMPI path
+%environment
+    export LC_ALL=C
+    export PATH=$PATH:/opt/amazon/openmpi/bin
+
+# prepare our image
+%post
+    export PATH=$PATH:/opt/amazon/openmpi/bin
+
+    # install gcc etc.
+    yum groupinstall 'Development Tools' -y
+
+    # install additional tools and the ATLAS library (for HPL)
+    yum install -y make tar curl sudo atlas atlas-devel
+
+    # Here is the Ubuntu option
+    # apt install -y atlas make tar openmpi-aws
+
+    # get and install the Amazon EFA tools. Important for MPI applications
+    curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.15.1.tar.gz
+    tar -xf aws-efa-installer-1.15.1.tar.gz && pushd aws-efa-installer
+    sudo ./efa_installer.sh -y
+    popd
+
+    #
+    # Here is the HPL-oriented part. You can use other commands to build and install your app
+    # build HPL in the /hpl directory, then we'll delete it
+    mkdir /hpl
+    tar xfz hpl-2.3.tar.gz -C /hpl
+    rm hpl-2.3.tar.gz
+    cp Make.t1 /hpl/hpl-2.3
+    cd /hpl/hpl-2.3
+
+    # use the custom-prepared makefile
+    make arch=t1
+
+    # move the compiled xhpl binary
+    mv bin/t1/xhpl /bin
+    cd ..
+
+    # clean up
+    rm -rf /hpl
+```
+
+## Testing results
+
+See [singularity-hpl.out](singularity-hpl.out) and [raw-hpl.out](raw-hpl.out) for the Singularity and the raw (native) HPL runs, respectively. A short summary:
+
+```
+RAW HPL:
+
+WR10L2L2        5000   512     9     8               1.73             4.8320e+01
+WR10L2L2       20000   512     9     8              22.08             2.4160e+02
+WR10L2L2       12128   512     9     8               7.07             1.6818e+02
+
+Singularity HPL:
+
+WR10L2L2        5000   512     9     8               1.79             4.6495e+01
+WR10L2L2       20000   512     9     8              22.52             2.3682e+02
+WR10L2L2       12128   512     9     8               7.09             1.6776e+02
+```
+
+## Terraform approach notes
+
+Here are my attempts to set up the cluster. See my Terraform config [...](...). Please note that the Amazon EFA tools must be installed on each node! A Salt config should be added.
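+
+For a cluster set up without a scheduler, the app can be launched via mpirun directly. A minimal sketch, assuming a hostfile listing the compute nodes and the HPL image from the example above (the EFA toolkit and Amazon OpenMPI must be present on every node):
+
+```
+# launch 72 MPI ranks across the nodes listed in ./hosts, no scheduler involved
+mpirun -n 72 -hostfile hosts singularity exec --bind HPL.dat:/HPL.dat hpl1.sif /bin/xhpl
+```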
From ab1a8b8c55788e4ddce0781d93f2afa519a54488 Mon Sep 17 00:00:00 2001 From: Zhumatiy Sergey Date: Fri, 15 Apr 2022 17:48:55 -0500 Subject: [PATCH 2/6] Add scripts, configs and results --- HPL.dat | 32 ++++++++ Make.t1 | 186 ++++++++++++++++++++++++++++++++++++++++++++ hpl1.def | 33 ++++++++ raw-hpl.out | 83 ++++++++++++++++++++ results.txt | 12 +++ singularity-hpl.out | 83 ++++++++++++++++++++ xhpl.sh | 5 ++ zhpl.sh | 4 + 8 files changed, 438 insertions(+) create mode 100644 HPL.dat create mode 100644 Make.t1 create mode 100644 hpl1.def create mode 100644 raw-hpl.out create mode 100644 results.txt create mode 100644 singularity-hpl.out create mode 100644 xhpl.sh create mode 100644 zhpl.sh diff --git a/HPL.dat b/HPL.dat new file mode 100644 index 00000000..2dc24dca --- /dev/null +++ b/HPL.dat @@ -0,0 +1,32 @@ +HPLinpack benchmark input file +Innovative Computing Laboratory, University of Tennessee +HPL.out output file name (if any) +6 device out (6=stdout,7=stderr,file) +1 # of problems sizes (N) +1000 16129 16128 Ns +1 # of NBs +512 384 640 768 896 960 NBs +0 PMAP process mapping (0=Row-,1=Column-major) +1 # of process grids (P x Q) +1 2 2 Ps +2 1 2 Qs +16.0 threshold +1 # of panel fact +0 1 2 PFACTs (0=left, 1=Crout, 2=Right) +1 # of recursive stopping criterium +2 8 NBMINs (>= 1) +1 # of panels in recursion +2 NDIVs +1 # of recursive panel fact. +0 1 2 RFACTs (0=left, 1=Crout, 2=Right) +1 # of broadcast +0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) +1 # of lookahead depth +1 0 DEPTHs (>=0) +1 SWAP (0=bin-exch,1=long,2=mix) +192 swapping threshold +1 L1 in (0=transposed,1=no-transposed) form +1 U in (0=transposed,1=no-transposed) form +1 Equilibration (0=no,1=yes) +8 memory alignment in double (> 0) + diff --git a/Make.t1 b/Make.t1 new file mode 100644 index 00000000..054be42a --- /dev/null +++ b/Make.t1 @@ -0,0 +1,186 @@ +# +# -- High Performance Computing Linpack Benchmark (HPL) +# HPL - 2.3 - December 2, 2018 +# Antoine P. Petitet +# University of Tennessee, Knoxville +# Innovative Computing Laboratory +# (C) Copyright 2000-2008 All Rights Reserved +# +# -- Copyright notice and Licensing terms: +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# +# 1. Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# 2. Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions, and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# +# 3. All advertising materials mentioning features or use of this +# software must display the following acknowledgement: +# This product includes software developed at the University of +# Tennessee, Knoxville, Innovative Computing Laboratory. +# +# 4. The name of the University, the name of the Laboratory, or the +# names of its contributors may not be used to endorse or promote +# products derived from this software without specific written +# permission. +# +# -- Disclaimer: +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +# ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +# A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE UNIVERSITY +# OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +# DATA OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# ###################################################################### +# +# ---------------------------------------------------------------------- +# - shell -------------------------------------------------------------- +# ---------------------------------------------------------------------- +# +SHELL = /bin/sh +# +CD = cd +CP = cp +LN_S = ln -s +MKDIR = mkdir +RM = /bin/rm -f +TOUCH = touch +# +# ---------------------------------------------------------------------- +# - Platform identifier ------------------------------------------------ +# ---------------------------------------------------------------------- +# +ARCH = t1 +# +# ---------------------------------------------------------------------- +# - HPL Directory Structure / HPL library ------------------------------ +# ---------------------------------------------------------------------- +# +TOPdir = /hpl/hpl-2.3 +INCdir = $(TOPdir)/include +BINdir = $(TOPdir)/bin/$(ARCH) +LIBdir = $(TOPdir)/lib/$(ARCH) +# +HPLlib = $(LIBdir)/libhpl.a +# +# ---------------------------------------------------------------------- +# - Message Passing library (MPI) -------------------------------------- +# ---------------------------------------------------------------------- +# MPinc tells the C compiler where to find the Message Passing library +# header files, MPlib is defined to be the name of the library to be +# used. The variable MPdir is only used for defining MPinc and MPlib. +# +MPdir = +# /usr/local/mpi +MPinc = +# -I$(MPdir)/include +MPlib = +# $(MPdir)/lib/libmpich.a +# +# ---------------------------------------------------------------------- +# - Linear Algebra library (BLAS or VSIPL) ----------------------------- +# ---------------------------------------------------------------------- +# LAinc tells the C compiler where to find the Linear Algebra library +# header files, LAlib is defined to be the name of the library to be +# used. The variable LAdir is only used for defining LAinc and LAlib. +# +LAdir = /usr/lib64/atlas/ +LAinc = +LAlib = $(LAdir)/libsatlas.so.3 +# +# ---------------------------------------------------------------------- +# - F77 / C interface -------------------------------------------------- +# ---------------------------------------------------------------------- +# You can skip this section if and only if you are not planning to use +# a BLAS library featuring a Fortran 77 interface. Otherwise, it is +# necessary to fill out the F2CDEFS variable with the appropriate +# options. **One and only one** option should be chosen in **each** of +# the 3 following categories: +# +# 1) name space (How C calls a Fortran 77 routine) +# +# -DAdd_ : all lower case and a suffixed underscore (Suns, +# Intel, ...), [default] +# -DNoChange : all lower case (IBM RS6000), +# -DUpCase : all upper case (Cray), +# -DAdd__ : the FORTRAN compiler in use is f2c. 
+#
+# 2) C and Fortran 77 integer mapping
+#
+# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
+# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
+# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
+#
+# 3) Fortran 77 string handling
+#
+# -DStringSunStyle  : The string address is passed at the string loca-
+#                     tion on the stack, and the string length is then
+#                     passed as an F77_INTEGER after all explicit
+#                     stack arguments, [default]
+# -DStringStructPtr : The address of a structure is passed by a
+#                     Fortran 77 string, and the structure is of the
+#                     form: struct {char *cp; F77_INTEGER len;},
+# -DStringStructVal : A structure is passed by value for each Fortran
+#                     77 string, and the structure is of the form:
+#                     struct {char *cp; F77_INTEGER len;},
+# -DStringCrayStyle : Special option for Cray machines, which uses
+#                     Cray fcd (fortran character descriptor) for
+#                     interoperation.
+#
+F2CDEFS      =
+#
+# ----------------------------------------------------------------------
+# - HPL includes / libraries / specifics -------------------------------
+# ----------------------------------------------------------------------
+#
+HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
+HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
+#
+# - Compile time options -----------------------------------------------
+#
+# -DHPL_COPY_L           force the copy of the panel L before bcast;
+# -DHPL_CALL_CBLAS       call the cblas interface;
+# -DHPL_CALL_VSIPL       call the vsip library;
+# -DHPL_DETAILED_TIMING  enable detailed timers;
+#
+# By default HPL will:
+#    *) not copy L before broadcast,
+#    *) call the BLAS Fortran 77 interface,
+#    *) not display detailed timing information.
+#
+HPL_OPTS     = -DHPL_CALL_CBLAS
+#
+# ----------------------------------------------------------------------
+#
+HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
+#
+# ----------------------------------------------------------------------
+# - Compilers / linkers - Optimization flags ---------------------------
+# ----------------------------------------------------------------------
+#
+CC           = mpicc
+CCNOOPT      = $(HPL_DEFS)
+CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
+#
+# On some platforms, it is necessary to use the Fortran linker to find
+# the Fortran internals used in the BLAS library.
+#
+LINKER       = mpicc
+LINKFLAGS    = $(CCFLAGS)
+#
+ARCHIVER     = ar
+ARFLAGS      = r
+RANLIB       = echo
+#
+# ----------------------------------------------------------------------
diff --git a/hpl1.def b/hpl1.def
new file mode 100644
index 00000000..61b6de75
--- /dev/null
+++ b/hpl1.def
@@ -0,0 +1,33 @@
+Bootstrap: docker
+From: amazonlinux
+
+%files
+    hpl-2.3.tar.gz
+    HPL.dat
+    Make.t1
+    /opt/amazon/openmpi
+
+%environment
+    export LC_ALL=C
+    export PATH=$PATH:/opt/amazon/openmpi/bin
+#/usr/lib64/openmpi/bin:/usr/lib/openmpi/bin
+
+%post
+    export PATH=$PATH:/opt/amazon/openmpi/bin
+    yum groupinstall 'Development Tools' -y
+    yum install -y atlas atlas-devel make tar wget curl sudo
+    #apt install -y atlas make tar openmpi-aws
+    curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.15.1.tar.gz
+    tar -xf aws-efa-installer-1.15.1.tar.gz && pushd aws-efa-installer
+    sudo ./efa_installer.sh -y
+    popd
+    mkdir /hpl
+    tar xfz hpl-2.3.tar.gz -C /hpl
+    rm hpl-2.3.tar.gz
+    cp Make.t1 /hpl/hpl-2.3
+    cd /hpl/hpl-2.3
+    make arch=t1
+    mv bin/t1/xhpl /bin
+    cd ..
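+    # clean up: the compiled binary was moved to /bin above, so the build tree is no longer needed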
+ rm -rf /hpl + diff --git a/raw-hpl.out b/raw-hpl.out new file mode 100644 index 00000000..b4735136 --- /dev/null +++ b/raw-hpl.out @@ -0,0 +1,83 @@ +================================================================================ +HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018 +Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK +Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK +Modified by Julien Langou, University of Colorado Denver +================================================================================ + +An explanation of the input/output parameters follows: +T/V : Wall time / encoded variant. +N : The order of the coefficient matrix A. +NB : The partitioning blocking factor. +P : The number of process rows. +Q : The number of process columns. +Time : Time in seconds to solve the linear system. +Gflops : Rate of execution for solving the linear system. + +The following parameter values will be used: + +N : 5000 20000 12128 +NB : 512 +PMAP : Row-major process mapping +P : 9 +Q : 8 +PFACT : Left +NBMIN : 2 +NDIV : 2 +RFACT : Left +BCAST : 1ring +DEPTH : 1 +SWAP : Spread-roll (long) +L1 : no-transposed form +U : no-transposed form +EQUIL : yes +ALIGN : 8 double precision words + +-------------------------------------------------------------------------------- + +- The matrix A is randomly generated for each test. +- The following scaled residual check will be computed: + ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N ) +- The relative machine precision (eps) is taken to be 1.110223e-16 +- Computational tests pass if scaled residuals are less than 16.0 + +================================================================================ +T/V N NB P Q Time Gflops +-------------------------------------------------------------------------------- +WR10L2L2 5000 512 9 8 1.73 4.8320e+01 +HPL_pdgesv() start time Fri Apr 15 04:41:38 2022 + +HPL_pdgesv() end time Fri Apr 15 04:41:39 2022 + +-------------------------------------------------------------------------------- +||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 3.59899598e-03 ...... PASSED +================================================================================ +T/V N NB P Q Time Gflops +-------------------------------------------------------------------------------- +WR10L2L2 20000 512 9 8 22.08 2.4160e+02 +HPL_pdgesv() start time Fri Apr 15 04:41:40 2022 + +HPL_pdgesv() end time Fri Apr 15 04:42:02 2022 + +-------------------------------------------------------------------------------- +||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.84323663e-03 ...... PASSED +================================================================================ +T/V N NB P Q Time Gflops +-------------------------------------------------------------------------------- +WR10L2L2 12128 512 9 8 7.07 1.6818e+02 +HPL_pdgesv() start time Fri Apr 15 04:42:02 2022 + +HPL_pdgesv() end time Fri Apr 15 04:42:09 2022 + +-------------------------------------------------------------------------------- +||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.42384393e-03 ...... PASSED +================================================================================ + +Finished 3 tests with the following results: + 3 tests completed and passed residual checks, + 0 tests completed and failed residual checks, + 0 tests skipped because of illegal input values. +-------------------------------------------------------------------------------- + +End of Tests. 
+================================================================================ diff --git a/results.txt b/results.txt new file mode 100644 index 00000000..563abea7 --- /dev/null +++ b/results.txt @@ -0,0 +1,12 @@ +ubuntu@compute-st-c59xlarge-1:~$ grep -A2 Gflops raw-hpl.out | grep -Ev Gflop\|-- + +The following parameter values will be used: +WR10L2L2 5000 512 9 8 1.73 4.8320e+01 +WR10L2L2 20000 512 9 8 22.08 2.4160e+02 +WR10L2L2 12128 512 9 8 7.07 1.6818e+02 +ubuntu@compute-st-c59xlarge-1:~$ grep -A2 Gflops singularity-hpl.out | grep -Ev Gflop\|-- + +The following parameter values will be used: +WR10L2L2 5000 512 9 8 1.79 4.6495e+01 +WR10L2L2 20000 512 9 8 22.52 2.3682e+02 +WR10L2L2 12128 512 9 8 7.09 1.6776e+02 diff --git a/singularity-hpl.out b/singularity-hpl.out new file mode 100644 index 00000000..bd79fbb7 --- /dev/null +++ b/singularity-hpl.out @@ -0,0 +1,83 @@ +================================================================================ +HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018 +Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK +Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK +Modified by Julien Langou, University of Colorado Denver +================================================================================ + +An explanation of the input/output parameters follows: +T/V : Wall time / encoded variant. +N : The order of the coefficient matrix A. +NB : The partitioning blocking factor. +P : The number of process rows. +Q : The number of process columns. +Time : Time in seconds to solve the linear system. +Gflops : Rate of execution for solving the linear system. + +The following parameter values will be used: + +N : 5000 20000 12128 +NB : 512 +PMAP : Row-major process mapping +P : 9 +Q : 8 +PFACT : Left +NBMIN : 2 +NDIV : 2 +RFACT : Left +BCAST : 1ring +DEPTH : 1 +SWAP : Spread-roll (long) +L1 : no-transposed form +U : no-transposed form +EQUIL : yes +ALIGN : 8 double precision words + +-------------------------------------------------------------------------------- + +- The matrix A is randomly generated for each test. +- The following scaled residual check will be computed: + ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N ) +- The relative machine precision (eps) is taken to be 1.110223e-16 +- Computational tests pass if scaled residuals are less than 16.0 + +================================================================================ +T/V N NB P Q Time Gflops +-------------------------------------------------------------------------------- +WR10L2L2 5000 512 9 8 1.79 4.6495e+01 +HPL_pdgesv() start time Fri Apr 15 04:40:17 2022 + +HPL_pdgesv() end time Fri Apr 15 04:40:19 2022 + +-------------------------------------------------------------------------------- +||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 3.36383588e-03 ...... PASSED +================================================================================ +T/V N NB P Q Time Gflops +-------------------------------------------------------------------------------- +WR10L2L2 20000 512 9 8 22.52 2.3682e+02 +HPL_pdgesv() start time Fri Apr 15 04:40:19 2022 + +HPL_pdgesv() end time Fri Apr 15 04:40:42 2022 + +-------------------------------------------------------------------------------- +||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.55503713e-03 ...... 
PASSED
+================================================================================
+T/V                N    NB     P     Q               Time                 Gflops
+--------------------------------------------------------------------------------
+WR10L2L2       12128   512     9     8               7.09             1.6776e+02
+HPL_pdgesv() start time Fri Apr 15 04:40:42 2022
+
+HPL_pdgesv() end time   Fri Apr 15 04:40:49 2022
+
+--------------------------------------------------------------------------------
+||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.31682621e-03 ...... PASSED
+================================================================================
+
+Finished 3 tests with the following results:
+ 3 tests completed and passed residual checks,
+ 0 tests completed and failed residual checks,
+ 0 tests skipped because of illegal input values.
+--------------------------------------------------------------------------------
+
+End of Tests.
+================================================================================
diff --git a/xhpl.sh b/xhpl.sh
new file mode 100644
index 00000000..13b78c5b
--- /dev/null
+++ b/xhpl.sh
@@ -0,0 +1,5 @@
+#!/bin/sh
+
+mpirun -n $SLURM_NNODES singularity exec "$@"
+#--bind HPL.dat:/HPL.dat ./hpl1.sif /bin/xhpl
+
diff --git a/zhpl.sh b/zhpl.sh
new file mode 100644
index 00000000..8534012a
--- /dev/null
+++ b/zhpl.sh
@@ -0,0 +1,4 @@
+#!/bin/sh
+
+mpirun -n $SLURM_NNODES ./xhpl
+

From 8f60b7808a4a972671f29319420a09bd5aad4c76 Mon Sep 17 00:00:00 2001
From: Zhumatiy Sergey
Date: Sat, 16 Apr 2022 01:55:43 +0300
Subject: [PATCH 3/6] Attached files description

---
 Containerization-HPC.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Containerization-HPC.md b/Containerization-HPC.md
index a2d760eb..0ae5502a 100644
--- a/Containerization-HPC.md
+++ b/Containerization-HPC.md
@@ -114,3 +114,16 @@ WR10L2L2       12128   512     9     8               7.09             1.6776e+02
 ## Terraform approach notes
 
 Here are my attempts to set up the cluster. See my Terraform config [...](...). Please note that the Amazon EFA tools must be installed on each node! A Salt config should be added.
+
+## Attached files
+
+- [hpl1.def](hpl1.def) - Singularity recipe to build an image
+- [HPL.dat](HPL.dat) - sample Linpack input data
+- [Make.t1](Make.t1) - makefile for Linpack (note the path to ATLAS; on Ubuntu it will be different)
+- [raw-hpl.out](raw-hpl.out) - output of the native HPL run
+- [singularity-hpl.out](singularity-hpl.out) - output of the Singularity HPL run
+- [results.txt](results.txt) - short results for comparison (native and Singularity HPL)
+- [xhpl.sh](xhpl.sh) - batch script for SLURM to run Singularity xhpl
+- [zhpl.sh](zhpl.sh) - batch script for SLURM to run native xhpl
+- [sing-run.sh](sing-run.sh) - batch script for SLURM to run any Singularity container (just add the Singularity options needed after `exec`).
+

From b5d260483ef65f2d9547d5c93d7e7260bb6c53ff Mon Sep 17 00:00:00 2001
From: Zhumatiy Sergey
Date: Fri, 15 Apr 2022 17:57:42 -0500
Subject: [PATCH 4/6] Add sing-run.sh, fix typos.
---
 sing-run.sh | 4 ++++
 xhpl.sh     | 3 +--
 zhpl.sh     | 2 +-
 3 files changed, 6 insertions(+), 3 deletions(-)
 create mode 100644 sing-run.sh

diff --git a/sing-run.sh b/sing-run.sh
new file mode 100644
index 00000000..c2c1c53b
--- /dev/null
+++ b/sing-run.sh
@@ -0,0 +1,4 @@
+#!/bin/sh
+
+mpirun -n $SLURM_NPROCS singularity exec "$@"
+
diff --git a/xhpl.sh b/xhpl.sh
index 13b78c5b..900cdb96 100644
--- a/xhpl.sh
+++ b/xhpl.sh
@@ -1,5 +1,4 @@
 #!/bin/sh
 
-mpirun -n $SLURM_NNODES singularity exec "$@"
-#--bind HPL.dat:/HPL.dat ./hpl1.sif /bin/xhpl
+mpirun -n $SLURM_NPROCS singularity exec --bind HPL.dat:/HPL.dat ./hpl1.sif /bin/xhpl
 
diff --git a/zhpl.sh b/zhpl.sh
index 8534012a..0f440cab 100644
--- a/zhpl.sh
+++ b/zhpl.sh
@@ -1,4 +1,4 @@
 #!/bin/sh
 
-mpirun -n $SLURM_NNODES ./xhpl
+mpirun -n $SLURM_NPROCS ./xhpl
 

From 9cfca8c1a2a3dd33d1346e7fcc2246168e5e10c1 Mon Sep 17 00:00:00 2001
From: Zhumatiy Sergey
Date: Sat, 16 Apr 2022 03:14:48 +0300
Subject: [PATCH 5/6] Update Containerization-HPC.md

---
 Containerization-HPC.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Containerization-HPC.md b/Containerization-HPC.md
index 0ae5502a..7f004513 100644
--- a/Containerization-HPC.md
+++ b/Containerization-HPC.md
@@ -113,7 +113,19 @@ WR10L2L2       12128   512     9     8               7.09             1.6776e+02
 
 ## Terraform approach notes
 
-Here are my attempts to set up the cluster. See my Terraform config [...](...). Please note that the Amazon EFA tools must be installed on each node! A Salt config should be added.
+Here are my attempts to set up the cluster. TODO: a Salt config should be added.
+
+Yes, after installing the EFA toolkit the Terraform-based cluster works too. The Terraform configs used are in the [terraform.tgz](terraform.tgz) file. I took them from https://github.com/bugbiteme/demo-tform-aws-vpc and slightly modified them for this task.
+
+The tasks below should be executed via SaltStack (or Ansible, or similar), but I did them manually. I need some time to remember how to use Salt :)
+
+After the Terraform run we need to collect the internal-network IP addresses of all our nodes (I don't know how to do this automatically yet), then create the Slurm configuration file (the final [slurm.conf](slurm.conf) is attached). Then we need to install the Slurm packages on the head and compute nodes, copy `slurm.conf` into `/etc/slurm-llnl/` (on all nodes), enable and start the slurmctld service on the head node and the slurmd service on the compute nodes.
+
+Then we need to share `/home` on the head node via NFS (put `/home 10.0.0.0/8(rw,no_root_squash)` into `/etc/exports` and run `exportfs -r`). After that, mount it on the compute nodes (put `head-node-ip:/home /home nfs rw,defaults,_netdev 0 0` into `/etc/fstab`, then run `mount /home`).
+
+Install the EFA tools on all nodes if we need to run native (not containerized) apps.
+
+Ok! Our cluster is ready.
 
 ## Attached files
 
@@ -126,4 +138,6 @@ Here are my attempts to set up the cluster. TODO: a Salt config should be added.
 - [xhpl.sh](xhpl.sh) - batch script for SLURM to run Singularity xhpl
 - [zhpl.sh](zhpl.sh) - batch script for SLURM to run native xhpl
 - [sing-run.sh](sing-run.sh) - batch script for SLURM to run any Singularity container (just add the Singularity options needed after `exec`).
+- [terraform.tgz](terraform.tgz) - Terraform configs
+- [slurm.conf](slurm.conf) - sample Slurm config file; the IP addresses must be replaced with appropriate ones

From c6b3a09f0b1c31ea6669d49b32577c217b2d5f48 Mon Sep 17 00:00:00 2001
From: Zhumatiy Sergey
Date: Fri, 15 Apr 2022 19:15:33 -0500
Subject: [PATCH 6/6] Add terraform and slurm configs

---
 slurm.conf    | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++
 terraform.tgz | Bin 0 -> 2216 bytes
 2 files changed, 151 insertions(+)
 create mode 100644 slurm.conf
 create mode 100644 terraform.tgz

diff --git a/slurm.conf b/slurm.conf
new file mode 100644
index 00000000..3a2b6d68
--- /dev/null
+++ b/slurm.conf
@@ -0,0 +1,151 @@
+# slurm.conf file generated by configurator.html.
+# Put this file on all nodes of your cluster.
+# See the slurm.conf man page for more information.
+#
+ClusterName=hpc1
+SlurmctldHost=ip-172-31-9-81
+DebugFlags=NO_CONF_HASH
+#ec2-3-238-188-80.compute-1.amazonaws.com
+#SlurmctldHost=
+#
+#DisableRootJobs=NO
+#EnforcePartLimits=NO
+#Epilog=
+#EpilogSlurmctld=
+#FirstJobId=1
+#MaxJobId=67043328
+#GresTypes=
+#GroupUpdateForce=0
+#GroupUpdateTime=600
+#JobFileAppend=0
+#JobRequeue=1
+#JobSubmitPlugins=lua
+#KillOnBadExit=0
+#LaunchType=launch/slurm
+#Licenses=foo*4,bar
+#MailProg=/bin/mail
+#MaxJobCount=10000
+#MaxStepCount=40000
+#MaxTasksPerNode=512
+MpiDefault=none
+#MpiParams=ports=#-#
+#PluginDir=
+#PlugStackConfig=
+#PrivateData=jobs
+ProctrackType=proctrack/cgroup
+#Prolog=
+#PrologFlags=
+#PrologSlurmctld=
+#PropagatePrioProcess=0
+#PropagateResourceLimits=
+#PropagateResourceLimitsExcept=
+#RebootProgram=
+ReturnToService=1
+SlurmctldPidFile=/var/run/slurmctld.pid
+SlurmctldPort=6817
+SlurmdPidFile=/var/run/slurmd.pid
+SlurmdPort=6818
+SlurmdSpoolDir=/var/spool/slurmd
+SlurmUser=slurm
+#SlurmdUser=root
+#SrunEpilog=
+#SrunProlog=
+StateSaveLocation=/var/spool/slurmctld
+SwitchType=switch/none
+#TaskEpilog=
+TaskPlugin=task/affinity
+#TaskProlog=
+#TopologyPlugin=topology/tree
+#TmpFS=/tmp
+#TrackWCKey=no
+#TreeWidth=
+#UnkillableStepProgram=
+#UsePAM=0
+#
+#
+# TIMERS
+#BatchStartTimeout=10
+#CompleteWait=0
+#EpilogMsgTime=2000
+#GetEnvTimeout=2
+#HealthCheckInterval=0
+#HealthCheckProgram=
+InactiveLimit=0
+KillWait=30
+#MessageTimeout=10
+#ResvOverRun=0
+MinJobAge=300
+#OverTimeLimit=0
+SlurmctldTimeout=120
+SlurmdTimeout=300
+#UnkillableStepTimeout=60
+#VSizeFactor=0
+Waittime=0
+#
+#
+# SCHEDULING
+#DefMemPerCPU=0
+#MaxMemPerCPU=0
+#SchedulerTimeSlice=30
+SchedulerType=sched/backfill
+SelectType=select/linear
+#SelectTypeParameters=
+#
+#
+# JOB PRIORITY
+#PriorityFlags=
+#PriorityType=priority/basic
+#PriorityDecayHalfLife=
+#PriorityCalcPeriod=
+#PriorityFavorSmall=
+#PriorityMaxAge=
+#PriorityUsageResetPeriod=
+#PriorityWeightAge=
+#PriorityWeightFairshare=
+#PriorityWeightJobSize=
+#PriorityWeightPartition=
+#PriorityWeightQOS=
+#
+#
+# LOGGING AND ACCOUNTING
+#AccountingStorageEnforce=0
+#AccountingStorageHost=
+#AccountingStoragePass=
+#AccountingStoragePort=
+AccountingStorageType=accounting_storage/none
+#AccountingStorageUser=
+#AccountingStoreFlags=
+#JobCompHost=
+JobCompLoc=/tmp/slurm-log
+#JobCompPass=
+#JobCompPort=
+JobCompType=jobcomp/filetxt
+#JobCompUser=
+#JobContainerType=job_container/none
+JobAcctGatherFrequency=30
+JobAcctGatherType=jobacct_gather/none
+SlurmctldDebug=info
+SlurmctldLogFile=/var/log/slurmctld.log
+SlurmdDebug=info
+SlurmdLogFile=/var/log/slurmd.log
+#SlurmSchedLogFile=
+#SlurmSchedLogLevel=
+#DebugFlags=
+#
+#
+# POWER SAVE SUPPORT FOR IDLE NODES (optional)
+#SuspendProgram=
+#ResumeProgram=
+#SuspendTimeout=
+#ResumeTimeout=
+#ResumeRate=
+#SuspendExcNodes=
+#SuspendExcParts= +#SuspendRate= +#SuspendTime= +# +# +# COMPUTE NODES +NodeName=node[1-2] NodeAddr=172.31.24.91,172.31.30.190 CPUs=16 State=UNKNOWN +PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP + diff --git a/terraform.tgz b/terraform.tgz new file mode 100644 index 0000000000000000000000000000000000000000..4dbc34ffaf7f693b435d7d9692f471cffd6adef4 GIT binary patch literal 2216 zcmV;Z2v_$XiwFP!000001MM5_ZsWLdfBPv2?fqzrvnN9d2_af(l)$HD_EDnc54re$Y)RywtO;>Na1P@?7 zpW?sc#muYzt6@G_ycth@f9_2uvnk+>$KGV}hIpNFWg$~3m4v)eOs*fpJJkM4VJntd z#E28K^hU^Q9nzp9{@6z&c2>tAH=tk&0%ED)|m^ z_o13*z=g>M$V+u>y9jRHh7;~w{vu10ELEMFfP?z)O`G~Z_ou!7p8|~hi37Vh;$aYq zIA&qWMI5M97S>?NR493Z?@`Y#L~=WNPgTlB@8Xn6G9>>H8KII99+Py#$d*FyPpE=B z8fC0dGqeU|$jMiCdp7;Oo3Jeb^&Mug;(sjg;Eii}2wWT18X#Y`j}~$OOEC z?GcCUxo2WoHdA>?`6p8B_>##k4ZtD&_u#pq{|kTK>;EYrmp^gnjWP-&+1ls|9+>9G z%kI%0x&x3gr&r)Ib!Y$%>HlO=>i^uEf&Tm6bkXboDIk|WL;$&K+ zc?^t^YhduoP`oV=KTO0RR>I&uz<433>VH?Uz(M_AH1q#cAC81x|4#sW{eSH@xBKP0 zB&^}k__l^nB)1RHDU#=pQgh10`9r*yUcOfkRsG*~R0}+y|C0&C{B{3t0zZ5GKLrd1 zqY=4~3=`gXPD9jARy@MQI0W|z53fm<=(ISt&>iKG)=D0eTL|=R5I$Jalm{EFaj{l=4h?k57(% zryUGs`+bi6Aq+m;?6!Y^{)F748Nu0!c*WP5WO^F!F3PMv&Vmx-%Mg`a#Vj2Z2@I#| zwm7=xXuTgZ9bAuna7J)JT)Q4_eKlNN0i;TakfZlg_)-B~E@608u1-FDP|aeLC5Y5_ zw>*@>saF(R*uhk8QCB2(uDYtZ2N`U9?&KL{^^^C}6ioxg!c6k?HdsrMB|xQ=A9gLw zx)x$e*Ek^~MCWfPJ7oQyk^r(>ZJf+Yr;(*zVZjD%D|A9yX)aDl}#%EtyJrO!M^82&s?q+D&^m-LI@U z*&&+=CoB^=wgqWbT5|GvbM6h}EYI{jf9JG~{CkVP0evdTDHg3NZaKcZADI8(?`RZm zz|`v%oO&JseyfVmTtvyXB4tDNiG}%E@)))+)4FCzA({WQF3rbRu{Ot82wM?+l8G?5 z&tfCqK5isdG*Z6-D`7>Gn~CueX2Q62$J>eTzy0C-!#isz`eg2}fW{4+PI7?9=YQdG z+~@yJ0ZsXLkpnos{y&5L()<6XfTza)FU@BU-9S_SyT}0?(EqXLPn!8ZZ`R-cI|-os z`wau9AN^bOelIWof?i@mxrE;|B{%H9KmQU%y3h}CSGd<#b_o(YU+LTd^!%FKfcwX4 zck0l82O^Y2BBYTC=4iYmSrQ4lG^i#Vx0vTYUJfb_cN(dn5(|LOU0f}u(OSTGn=QJH z-(Q~VJA!3TJz!v>L#IaK&ZHVx-7jwQJ@I;<4H*@S6;Jt2Upq`~N~6x4tI^Bf0d%!P z!K2)Ah2fSN%7bXC!7v5}x)(^o%+o&JHSZ6%GKx&vpuq%9t=ts|fk{}KN$y#D*YlRzK;f7#HY|IeNRI8^_i zj^_>g5AI*D|0e+q**|>u=kLgsWb`@@zA+@vLv|iR7_L}Cb&BBppQ>8yH}?T}gk_%3 zB4!F}|M?{WFeH_4c8gm>Ho(Qo6qAy!gxn5+*f8IUfH=aBkp@N4e?PPZzdQSZLCBZ# zQ0KAd>OZ6L9J|oZ>d_v${8@ZnK=n0)w`Cdc!D_h0S{McgWhK*=%i4#~Y>j~}f9iZ~ zx%AD?k7ZoHmqdkz*Pt@?Ds-^E$k%bOwh5qRz?0oPz=aF0A`xDfHxWX8MI^wvUQv~^ zZ%TW1NQ=p0oG(f6xp(!m691IKB=NAyZA-_w~zEiBq3$RkhH(Nx+QFQy* zM)~3WMg7u`I%3O!`X0GwOrL!lr-Mzi%A$zSFhm4BVY6{)q-P#RoW-eX+eE)wq*ZhJ zYW9$o&ixmeglO^qo;?M4F#k7i)_-ObtpE4^U-yuw?F#QK&)Ah_r}~LUW%KFk%1nX&Py*@4?XnILk~T?diWn!$1EiPPyhg!t6-x5 literal 0 HcmV?d00001