# Using Yambo in parallel

This module presents examples of how to run Yambo in a parallel environment.

## Prerequisites

Previous modules

You will need:

• The SAVE databases for bulk hBN
• The yambo executable

## Yambo Parallelism in a nutshell

Yambo implements a hybrid MPI+OpenMP parallel scheme, suited to run on large partitions of HPC machines. For instance, runs over several tens of thousands of cores on the Marconi KNL partition (CINECA, IT) have been achieved, with a computational power of about 3 PFlop/s, as well as large-scale runs on accelerated machines such as Piz-Daint at CSCS (CH), featuring NVIDIA GPUs.

• MPI is particularly suited to distribute both memory and computation, and has a very efficient implementation in Yambo, which needs very few communications but may be prone to load imbalance (see GW parallel strategies).
• OpenMP instead works within a shared-memory paradigm, meaning that multiple threads perform computation in parallel on the same data in memory (no memory replica, at least in principle).
• As a rule of thumb for Yambo, use MPI parallelism as much as possible (i.e. as much as memory allows), then resort to OpenMP parallelism to exploit more cores and computing power without increasing memory usage any further.
• GPU support is extremely efficient in Yambo, meaning that when you are running on GPU-accelerated machines it is always a good idea to exploit the GPUs. In general, one needs to set 1 MPI task for each available GPU card, and also use OpenMP threads to fully exploit the capabilities of the host. While a very relevant speed-up can be achieved, device memory (e.g. 12, 16, 32 GB on currently available devices) may become a problem and needs to be controlled (e.g. by increasing the number of requested nodes and further distributing data across them).
• The number of MPI tasks and OpenMP threads per task can be controlled in a standard way, e.g.:
 $ export OMP_NUM_THREADS=4
 $ mpirun -np 12 yambo -F yambo.in -J label


resulting in a run exploiting up to 48 threads (this fits best on 48 physical cores, although hyper-threading, i.e. more threads than cores, can be used to keep the computing units as busy as possible).
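The thread count quoted above is simply the number of MPI tasks times the number of OpenMP threads per task. A minimal sketch of that arithmetic in POSIX shell; the helper name `total_threads` is hypothetical and not part of Yambo:

```shell
#!/bin/sh
# Hypothetical helper (not part of Yambo): total number of threads used by a
# hybrid run, i.e. MPI tasks times OpenMP threads per task.
# The function body runs in a subshell, so its variables do not leak.
total_threads() (
    mpi_tasks=$1
    omp_threads=$2
    echo $(( mpi_tasks * omp_threads ))
)

# The example above: 12 MPI tasks, each with OMP_NUM_THREADS=4.
total_threads 12 4   # prints 48
```

Comparing this number with the physical cores available in the allocation is a quick way to spot accidental oversubscription.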

• TIP: In order to control the number of OpenMP threads via the OMP_NUM_THREADS environment variable, make sure that the thread-related variables in the yambo input file (e.g. X_Threads, DIP_Threads, SE_Threads) are set to zero or not set. When set to a non-zero value, these variables override the value given to OMP_NUM_THREADS.
• A fine tuning of the parallel structure of Yambo (both MPI and OpenMP) can be obtained by operating on specific input variables (run level dependent), which can be activated, during input generation, via the flag  -V par (verbosity high on parallel variables).
• Yambo can take advantage of parallel dense linear algebra (e.g. via [ScaLAPACK], SLK in the following). Control is provided by input variables (see e.g. RPA response in parallel).

• When running in parallel, one report file is written, while multiple log files are dumped (one per MPI task, by default), and stored in a newly created ./LOG folder. When running with thousands of MPI tasks, the number of log files can be reduced by setting, in the input file, something like:
 NLogCPUs = 4         # [PARALLEL] Live-timing CPUs (0 for all)
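Before launching a run, the thread-related variables mentioned in the TIP above can be checked mechanically. The following is a sketch only: the helper name `nonzero_threads` is hypothetical, and it assumes that each variable sits on its own line in the form `Name = value  # comment`, as in the input fragments shown in this module.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yambo): print the *_Threads variables of an
# input file that are set to a non-zero value, i.e. those that would take
# precedence over OMP_NUM_THREADS.
nonzero_threads() (
    input_file=$1
    awk -F'=' '
        { sub(/#.*/, "") }               # drop trailing comments
        NF >= 2 {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)      # strip blanks around the name
            gsub(/[ \t]/, "", value)     # strip blanks around the value
            if (name ~ /_Threads$/ && value + 0 != 0) print name
        }' "$input_file"
)
```

Calling `nonzero_threads yambo.in` prints nothing when all thread variables are zero or absent, so an empty output means OMP_NUM_THREADS is in control.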


In the following we give direct examples of parallel setup for some among the most standard Yambo kernels.

## HF run in parallel

Full theory and instructions about how to run HF (or, better, exchange self-energy) calculations are given in the Hartree Fock module.

Concerning parallelism, the quantities to be computed are:

Basically, for every orbital nk we want to compute the exchange contribution to the qp correction. In order to do this, we need to perform a sum over q vectors and a sum over occupied bands (to build the density matrix).
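For reference, the sums just described can be written explicitly. The expression below is a sketch, assuming the standard plane-wave form of the exchange self-energy; the symbols f (occupations), v (bare Coulomb interaction) and rho (matrix elements between Bloch states) are introduced here for illustration and are not defined elsewhere in this module.

```latex
\Sigma^{x}_{n\mathbf{k}}
  = -\sum_{m} \int_{BZ} \frac{d\mathbf{q}}{(2\pi)^{3}} \sum_{\mathbf{G}}
    v(\mathbf{q}+\mathbf{G})\,
    \left| \rho_{nm}(\mathbf{k},\mathbf{q},\mathbf{G}) \right|^{2}
    f_{m\,\mathbf{k}-\mathbf{q}},
\qquad
v(\mathbf{q}+\mathbf{G}) = \frac{4\pi}{|\mathbf{q}+\mathbf{G}|^{2}},
\qquad
\rho_{nm}(\mathbf{k},\mathbf{q},\mathbf{G})
  = \langle n\mathbf{k} |\, e^{i(\mathbf{q}+\mathbf{G})\cdot\mathbf{r}} \,| m\,\mathbf{k}-\mathbf{q} \rangle
```

In this form nk labels the qp correction being computed, the sum over m is restricted to occupied bands by the occupation factor, and q is the transferred momentum.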

• Generate the input file asking for parallel verbosity:
 $ yambo -x -V par -F hf.in
• By default, OpenMP acts on spatial degrees of freedom (both direct and reciprocal space) and takes care of FFTs. Do not forget to set the OMP_NUM_THREADS variable (to 1 to avoid OpenMP parallelism):
 $ export OMP_NUM_THREADS=1
or
 $ export OMP_NUM_THREADS=<integer_larger_than_one>
• By default, the MPI parallelism will distribute both computation and memory over the bands in the inner sum (b). Without editing the input file, simply run:
 $ mpirun -np 4 yambo -F hf.in -J <run_label>

• Alternatively, MPI parallelism can work over three different levels q,qp,b at the same time:
 q       parallelism over transferred momenta (q in Eq. above)
 qp      parallelism over qp corrections to be computed  (nk in Eq.)
 b       parallelism over (occupied) density matrix (or Green's function) bands  (m in Eq.)


Taking the case of hBN, in order to exploit this parallelism over 8 MPI tasks, set e.g.:

 SE_CPU= "  1 2  4"               # [PARALLEL] CPUs for each role
 SE_ROLEs= "q qp b"               # [PARALLEL] CPUs roles (q,qp,b)


Then run as

 $ mpirun -np 8 yambo -F hf.in -J run_mpi8_omp1

Note that the product of the numbers in SE_CPU needs to be equal to the total number of MPI tasks, otherwise yambo will switch back to default parallelism (a warning is provided in the log).

Having a look at the report file, r-run_mpi8_omp1_HF_and_locXC here, one finds:

 [01] CPU structure, Files & I/O Directories
 ===========================================

 * CPU-Threads     :8(CPU)-1(threads)-1(threads@SE)
 * CPU-Threads     :SE(environment)-  1 2  4(CPUs)-q qp b(ROLEs)
 * MPI CPU         :  8
 * THREADS    (max):  1
 * THREADS TOT(max):  8

Now we can inspect the files created in the ./LOG directory, e.g. ./LOG/l-run_mpi8_omp1_HF_and_locXC_CPU_1, where we find:

 <---> P0001: [01] CPU structure, Files & I/O Directories
 <---> P0001: CPU-Threads:8(CPU)-1(threads)-1(threads@SE)
 <---> P0001: CPU-Threads:SE(environment)-  1 2  4(CPUs)-q qp b(ROLEs)
 ...
 <---> P0001: [PARALLEL Self_Energy for QPs on 2 CPU] Loaded/Total (Percentual):70/140(50%)
 <---> P0001: [PARALLEL Self_Energy for Q(ibz) on 1 CPU] Loaded/Total (Percentual):14/14(100%)
 <---> P0001: [PARALLEL Self_Energy for G bands on 4 CPU] Loaded/Total (Percentual):3/10(30%)
 <---> P0001: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):84/140(60%)

providing the details of memory and computation distributions for the different levels. This report is from processor 1 (as indicated by the _CPU_1 suffix). Similar pieces of information are provided for all CPUs. If different CPUs show very different distribution levels, it is likely that load imbalance occurs (e.g., try with 6 MPI tasks parallel over bands).

For Yambo version <= 4.1.2, Yambo may not be able to find a proper default parallel structure (try e.g. the above example asking for 20 MPI tasks, while you only have 8 bands to parallelise over). In these cases the calculation crashes and an error message is provided in the report file:

 [ERROR]Impossible to define an appropriate parallel structure

In these cases it is then mandatory to specify a proper parallel structure. This has been mostly overcome from Yambo v. 4.2 onwards.

## RPA response in parallel

Full theory and instructions about how to run independent particle (IP) or RPA linear response calculations are given e.g. in Optics at the independent particle level or Dynamic screening (PPA).

IP linear response

Concerning parallelism, let us focus at first on a q-resolved quantity like the IP optical response:

• Here, in order to obtain the dielectric function, either in the optical limit (q->0) or at finite q, one basically needs to sum over valence-conduction transitions (v,c) for each k point in the BZ. The calculation can be made easily parallel over these indexes.
• This is done at the MPI level (distributing both memory and computation), and also by OpenMP threads, just distributing the computational load.
• In this very specific case, OpenMP also requires some extra memory workspace (i.e. by increasing the number of OMP threads, the memory usage will also increase).
• After initialisation (see Initialization for info) generate the input file by issuing:
 $ yambo -o c -V par -F yambo_IP.in


Edit some of the variables by setting:

Chimod= "IP"                 # [X] IP/Hartree/ALDA/LRC/BSfxc
% QpntsRXd
1 | 1 |                   # [Xd] Transferred momenta
%


These changes select the independent particle (IP) response function as the quantity to be calculated, and ask for the dielectric function at the first q-point (index 1), which is always Gamma.

• Also set the following OpenMP-related variables to zero (or simply delete them):
 X_Threads = 0


This is needed to control the number of OpenMP threads via the OMP_NUM_THREADS environment variable. To avoid OpenMP parallelism, set

export OMP_NUM_THREADS=1

• Now you are ready to run. Using the default parallelism (over 8 MPI tasks), issue:
 $ mpirun -np 8 yambo -F yambo_IP.in -J IP_mpi8
• By inspecting the log files (one per MPI task in the ./LOG folder), we see that the parallelism over conduction states has been used:
 <---> P0001: CPU structure provided for the Response_G_space_Zero_Momentum ENVIRONMENT is incomplete. Switching to defaults
 <---> P0001: [LA] SERIAL linear algebra
 <---> P0001: [PARALLEL Response_G_space_Zero_Momentum for K(ibz) on 1 CPU] Loaded/Total (Percentual):14/14(100%)
 <---> P0001: [PARALLEL Response_G_space_Zero_Momentum for CON bands on 8 CPU] Loaded/Total (Percentual):12/92(13%)
 <---> P0001: [PARALLEL Response_G_space_Zero_Momentum for VAL bands on 1 CPU] Loaded/Total (Percentual):8/8(100%)
• Alternatively, edit the input file and set the parallelism fine-tuning variables:
 X_q_0_CPU= " 1 4 2"          # [PARALLEL] CPUs for each role
 X_q_0_ROLEs= "k c v"         # [PARALLEL] CPUs roles (k,c,v)
Run as before:
 $ mpirun -np 8 yambo -F yambo_IP.in -J IP_mpi8_tuned

• Fine tuning is useful when running over a large number of MPI tasks. Care must be taken when setting the parameters: in this case, for instance, we need to make sure that the numbers of conduction and valence states are multiples of 4 and 2, respectively. Yambo can handle cases where they are not, but these can lead to load imbalance and imperfect filling of the cores, especially for valence states, which are usually fewer than conduction states.
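The divisibility requirement can be checked with plain integer arithmetic before editing the input. A minimal sketch; the helper name `even_split` is hypothetical, and the band counts (92 conduction, 8 valence) are the ones reported in the log excerpt above:

```shell
#!/bin/sh
# Hypothetical helper (not part of Yambo): report whether `items` (e.g. bands)
# divide evenly among `tasks` MPI tasks, and how many items the most loaded
# task would hold.
even_split() (
    items=$1
    tasks=$2
    if [ $(( items % tasks )) -eq 0 ]; then
        echo "even: $(( items / tasks )) per task"
    else
        echo "uneven: up to $(( items / tasks + 1 )) per task"
    fi
)

even_split 92 4   # conduction bands, tuned layout   -> even: 23 per task
even_split 8 2    # valence bands, tuned layout      -> even: 4 per task
even_split 92 8   # conduction bands, default layout -> uneven: up to 12 per task
```

The last case is consistent with the 12/92 share shown for the default layout in the log above.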

RPA linear response

Here, after the calculation of the independent particle linear response, a matrix inversion is performed to obtain the RPA response.

• After initialization (see Initialization for info) generate the input file by issuing:
 $ yambo -o c -V par -F yambo_RPA.in
and edit some of the variables by setting:
 Chimod= "Hartree"            # [X] IP/Hartree/ALDA/LRC/BSfxc
 NGsBlkXd= 6 Ry               # [Xd] Response block size
 % QpntsRXd
 1 | 1 |                      # [Xd] Transferred momenta
 %
• Run and inspect the log files: after the calculation of 'Xo', the code computes 'X', performing some dense linear algebra operations, which can be made parallel, using ScaLAPACK, by setting, e.g.:
 X_q_0_nCPU_LinAlg_INV = 8    # [PARALLEL] CPUs for Linear Algebra
Note that the largest square number smaller than 8 (i.e. 4, in this case) will be used to form the ScaLAPACK grid (2x2, here).
• Parallelism fine tuning works as for the IP response.

In case multiple or all q points need to be computed (as in GW calculations), the parallelism variables to be used are:

 X_all_q_ROLEs= "g q k c v"   # [PARALLEL] CPUs roles (g,q,k,c,v)
 X_all_q_CPU= "1 2 1 4 1"     # [PARALLEL] CPUs for each role
 X_all_q_nCPU_LinAlg_INV= 8   # [PARALLEL] CPUs for Linear Algebra

adding one extra MPI level of parallelism related to the distribution over q points. Finally, the different possible parallelism strategies for the dielectric constant are:

 g       parallelism over G-vectors
 q       parallelism over transferred momenta (q in Eq. above)
 c/v     parallelism over conduction/valence bands
 k       parallelism over k-points

Notice that parallelization over c, v and k reduces the amount of memory used in the calculation.

## GW in parallel

In a complete GW run, several runlevels (dipoles, RPA linear response for all q, exchange self-energy, and correlation self-energy) are run in a row to eventually compute the GW quasi-particle (QP) corrections. How to run linear response and the exchange self-energy in parallel has already been discussed above. Here we focus on the parallelisation of the correlation self-energy.

The correlation part of the self-energy in a plane wave representation reads:

According to this equation, for each nk matrix element to be computed (qp corrections), a sum over q vectors in the Brillouin zone and a sum over (b) bands (m in Eq.) have to be performed, together with sums over G,G' reciprocal space variables. This gives rise to 3 levels of MPI parallelism (qp, q, b), which can be combined with OpenMP to take care of spatial degrees of freedom.

As for the exchange self-energy, the MPI parallelism levels that can be controlled are given by:

 q       parallelism over transferred momenta (q in Eq. above)
 qp      parallelism over qp corrections to be computed (nk in Eq.)
 b       parallelism over (occupied) density matrix (or Green's function) bands (m in Eq.)

Additionally, the variable SE_Threads can be set to fine-tune the OpenMP parallelism (leave it set to zero to control it via the OMP_NUM_THREADS environment variable).

To generate the input file, type:

 yambo -g n -p p -F gw.in

To run in parallel with default parallelism do:

 export OMP_NUM_THREADS=1     # (or an integer larger than 1 to use OpenMP parallelism)
 mpirun -np 8 yambo -F gw.in -J run_mpi8_omp1

Taking the case of hBN, in order to exploit this parallelism over 8 MPI tasks, set e.g.:

 SE_CPU= " 1 2 4"             # [PARALLEL] CPUs for each role
 SE_ROLEs= "q qp b"           # [PARALLEL] CPUs roles (q,qp,b)

Then run as

 $ mpirun -np 8 yambo -F gw.in -J run_mpi8_omp1_tuned


Note that the product of the numbers in SE_CPU needs to be equal to the total number of MPI tasks, otherwise yambo will switch back to default parallelism (a warning is provided in the log).
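The constraint on SE_CPU can be verified before submitting a job. A minimal sketch in POSIX shell; the helper names `cpu_product` and `check_layout` are hypothetical and not part of Yambo:

```shell
#!/bin/sh
# Hypothetical helpers (not part of Yambo): check that the product of the
# entries of a *_CPU string matches the number of MPI tasks.
cpu_product() (
    product=1
    for n in $1; do              # unquoted on purpose: split on blanks
        product=$(( product * n ))
    done
    echo "$product"
)

check_layout() (
    cpu_string=$1
    mpi_tasks=$2
    if [ "$(cpu_product "$cpu_string")" -eq "$mpi_tasks" ]; then
        echo "ok"
    else
        echo "mismatch"
    fi
)

check_layout " 1 2  4" 8   # prints ok
check_layout " 1 2  4" 6   # prints mismatch
```

The same check applies to the other `*_CPU` strings used in this module, such as X_q_0_CPU and X_all_q_CPU.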

Having a look at the report file, r-run_mpi8_omp1_HF_and_locXC here, one finds:

[01] CPU structure, Files & I/O Directories
===========================================

* CPU-Threads     :8(CPU)-1(threads)-1(threads@SE)
* CPU-Threads     :SE(environment)-  1 2  4(CPUs)-q qp b(ROLEs)
* MPI CPU         :  8


Now we can inspect the files created in the ./LOG directory, e.g. ./LOG/l-run_mpi8_omp1_HF_and_locXC_CPU_1, where we find:

<---> P0001: [01] CPU structure, Files & I/O Directories
<---> P0001: CPU-Threads:SE(environment)-  1 2  4(CPUs)-q qp b(ROLEs)
...
<---> P0001: [PARALLEL Self_Energy for QPs on 2 CPU] Loaded/Total (Percentual):70/140(50%)
<---> P0001: [PARALLEL Self_Energy for Q(ibz) on 1 CPU] Loaded/Total (Percentual):14/14(100%)
<---> P0001: [PARALLEL Self_Energy for G bands on 4 CPU] Loaded/Total (Percentual):3/10(30%)
<---> P0001: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):84/140(60%)


providing the details of memory and computation distributions for the different levels. This report is from processor 1 (as indicated by the _CPU_1 suffix). Similar pieces of information are provided for all CPUs. If different CPUs show very different distribution levels, it is likely that load imbalance occurs (e.g., try with 6 MPI tasks parallel over bands).
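To see why 6 MPI tasks over bands would be less balanced than 4, one can list how many of the 10 bands each task would receive. This is only an illustration: it assumes a plain block distribution, which may differ from the scheme Yambo actually uses, and the helper name `block_shares` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yambo): number of items assigned to each of
# `tasks` MPI tasks under a plain block distribution (an assumption made for
# illustration; the actual distribution in Yambo may differ).
block_shares() (
    items=$1
    tasks=$2
    base=$(( items / tasks ))     # items every task receives
    extra=$(( items % tasks ))    # leftover items, one each to the first tasks
    i=0
    shares=""
    while [ "$i" -lt "$tasks" ]; do
        if [ "$i" -lt "$extra" ]; then
            shares="$shares $(( base + 1 ))"
        else
            shares="$shares $base"
        fi
        i=$(( i + 1 ))
    done
    echo "${shares# }"            # drop the leading blank
)

block_shares 10 4   # prints 3 3 2 2
block_shares 10 6   # prints 2 2 2 2 1 1
```

With 4 tasks the most loaded task holds 3 of 10 bands, in line with the 3/10 share in the log above; with 6 tasks some tasks hold twice as many bands as others, so the lightly loaded ones wait idle.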

For Yambo version <= 4.1.2, Yambo may not be able to find a proper default parallel structure (try e.g. the above example asking for 20 MPI tasks, while you only have 8 bands to parallelise over). In these cases the calculation crashes and an error message is provided in the report file:

 [ERROR]Impossible to define an appropriate parallel structure

In these cases it is then mandatory to specify a proper parallel structure. This has been mostly overcome from Yambo v. 4.2 onwards.