Strange behaviour when changing X_all_q parallelization

Post by ncolonna » Tue Aug 28, 2018 11:47 am

Dear Yambo community,

I'm new to Yambo. I've downloaded and successfully installed yambo-4.2.3, linked against
intel2018, netcdf/4.6.1, hdf5/1.10.1 and netcdf-fortran/4.4.4. The setup and report are in the tar.gz file
available from the link at the end of the post (I was not able to upload the file: "Sorry, the board attachment quota has been reached").

I was running a G0W0 calculation for silicon and noticed very strange behaviour when changing the parallelization
strategy for X. I ran several calculations, varying the number of MPI processes used to parallelize over
the Brillouin zone (k) and the conduction bands (c):
X_all_q_CPU= "1 $nk $nc 1" # [PARALLEL] CPUs for each role
X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
These are pure MPI runs with 28 processes split between k-points and conduction bands (the node I'm running on has 2
Intel Broadwell processors with 14 cores each, i.e. 28 cores per machine):
nk=1 nc=28
nk=2 nc=14
nk=4 nc=7
nk=7 nc=4
nk=14 nc=2
nk=28 nc=1
(and ncpu=1 nk=1 nc=1 as a reference)
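
For anyone who wants to reproduce the scan, the layouts listed above are simply all the (nk, nc) pairs with nk * nc = 28. A minimal sketch in Python that enumerates them and prints the two corresponding input lines is given below. The helper names cpu_layouts and x_all_q_lines are hypothetical (they are not part of Yambo), and the q and v roles are assumed to stay fixed at 1, as in the runs described here.

```python
def cpu_layouts(ncpu):
    """Return every (nk, nc) pair with nk * nc == ncpu, in ascending nk."""
    return [(nk, ncpu // nk) for nk in range(1, ncpu + 1) if ncpu % nk == 0]


def x_all_q_lines(nk, nc):
    """Format the two X_all_q input lines for one layout.

    Assumption: the q and v roles are kept at 1 CPU each, as in the post.
    """
    return (
        f'X_all_q_CPU= "1 {nk} {nc} 1" # [PARALLEL] CPUs for each role\n'
        f'X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)'
    )


if __name__ == "__main__":
    # One block of input lines per layout of the 28 MPI processes.
    for nk, nc in cpu_layouts(28):
        print(f"# nk={nk} nc={nc}")
        print(x_all_q_lines(nk, nc))
```

Each printed block would go into the input file of a separate run folder, so that the databases of different layouts never mix.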

Some of these runs end successfully (nk=1,2,4).
The nk=7 and nk=14 runs get stuck somewhere: the slurm job keeps running, and logging into the nodes I see that all the CPUs are at 100%
and there is no memory issue. I tried to debug with dbg (on one particular instance of yambo.x) and, from what I understood,
the code is stuck in some MPI call.
Only the nk=28 run ends with an error:
[ERROR] STOP signal received while in :[04] Dynamic Dielectric Matrix (PPA)
[ERROR] File ./run//ndb.dip_iR_and_P; Variable NOT DEFINED; NetCDF: HDF error

Even stranger, the successful runs give different results for the QP energies (see qp_report.dat).

I'm pretty sure I didn't mess up the databases (I created a separate folder for each run).
I also did a serial run for reference (with a version of the code compiled without MPI and OpenMP).
Some of the successful runs give the same results as the serial one (interestingly, all the runs with
no parallelization over k...). See qp_report.dat.

It would be great if you could have a look and tell me what you think about it.

Thank you,

Nicola S. Colonna
Post-doctoral Research Scientist
THEOS STI IMX EPFL
ME D2 1426
Station 9
CH-1015 Lausanne (Switzerland)

Link to the tar file (contains the outputs of all the runs, the submission script and the config files):
https://drive.google.com/open?id=1nB4Er ... guVrxgmR-d

Re: Strange behaviour when changing X_all_q parallelization

Post by Daniele Varsano » Tue Aug 28, 2018 12:02 pm

Dear Nicola,
thank you very much for reporting this; we will have a careful look.
So that we can reproduce the errors and problems if needed, could you also post your QE input files and pseudopotentials?

Thanks a lot,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Re: Strange behaviour when changing X_all_q parallelization

Post by ncolonna » Tue Aug 28, 2018 3:29 pm

Dear Daniele,

thanks for the very fast reply!
Here are the PWSCF inputs and the pseudopotential:
https://drive.google.com/open?id=1ZyRKH ... TWPOh33gGm

Best,

Nicola

