an error message at the GW level

Postby zaabar foudil » Sat Mar 03, 2018 10:35 pm

I ran a GW calculation (yambo-4.1.2) for MoSe2 monolayers on a cluster using 2 nodes * 2 tasks (4 CPUs in total). The calculation works well, but it takes a lot of time. Since the number of nodes on our cluster is limited, I tried to run the same calculation with 16 CPUs (4 nodes * 4 tasks). Two days after launching this job I encountered the following error message:
mpirun noticed that process rank 0 with PID 25006 on node ibnbadis15 exited on signal 9 (Killed).
Is there any way to overcome it?
my best regards
foudil zaabar
university of bejaia
Algeria

Re: an error message at the GW level

Postby Daniele Varsano » Sun Mar 04, 2018 9:38 am

Dear foudil zaabar,
this message does not come from yambo but from your queue system. It is possible that your job reached the maximum wall time allowed. You can check whether this is the case by looking at the standard error or standard output of the job, if any was written.
In any case, if you post your input, report and log files we can have a look to see whether there is some problem and/or a possibility to optimize the calculation.
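If the queue system is Slurm (an assumption at this point in the thread), the accounting records usually tell whether a job hit its wall time or its memory limit. The following is a minimal sketch, assuming job accounting is enabled on the cluster. The helper names `job_summary` and `explain_state` are illustrative only, and the `OUT_OF_MEMORY` state is reported only by reasonably recent Slurm versions.

```shell
# Show how Slurm recorded the end of a job (needs job accounting enabled).
# $1 is the numeric job ID printed by sbatch.
job_summary() {
  sacct -j "$1" --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
}

# Map a Slurm job state to the likely reason behind "signal 9 (Killed)".
# This mapping is a rough guide, not an exhaustive list of states.
explain_state() {
  case "$1" in
    TIMEOUT)       echo "wall-time limit reached" ;;
    OUT_OF_MEMORY) echo "memory limit exceeded" ;;
    CANCELLED*)    echo "cancelled by a user or administrator" ;;
    *)             echo "check stderr/stdout and the yambo log files" ;;
  esac
}
```

Calling `job_summary` with your own job ID prints one line per job step. Comparing Elapsed with Timelimit, and MaxRSS with ReqMem, shows which limit (if any) was reached.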
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Re: an error message at the GW level

Postby zaabar foudil » Sun Mar 04, 2018 10:10 pm

Dear Daniele
Thank you for your help,
Attached are all the files you asked for. I put the slurm script in the same file as the input (yambo.in).
thank you in advance
zaabar foudil
university of bejaia
Algeria

Re: an error message at the GW level

Postby Daniele Varsano » Mon Mar 05, 2018 8:29 am

Dear Zaabar,

It is possible that your job died for memory reasons.
Please note that your run is using more than 5 Gb and the code seems to have died after 18 seconds.
You can check the allocated memory in the log file:
Code: Select all
<18s> P0008: [M  5.096 Gb] Alloc wf_disk ( 0.025)

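To see how close a run gets to the node memory, the largest value reported in the log files can be extracted automatically. This is a small sketch that assumes the memory lines have the same `[M  x.xxx Gb]` form as the one quoted above; `peak_mem_gb` is a hypothetical helper name, not a yambo tool.

```shell
# Print the largest "[M x.xxx Gb]" value found in the given log files.
# Assumes lines of the form:
#   <18s> P0008: [M  5.096 Gb] Alloc wf_disk ( 0.025)
peak_mem_gb() {
  awk '
    BEGIN { max = 0 }
    match($0, /\[M +[0-9.]+ Gb\]/) {
      s = substr($0, RSTART, RLENGTH)  # the bracketed memory field
      gsub(/[^0-9.]/, "", s)           # keep only the number
      if (s + 0 > max) max = s + 0
    }
    END { if (max > 0) printf "%.3f\n", max }
  ' "$@"
}
```

It can be called as `peak_mem_gb LOG/*`, assuming the log files sit in the LOG folder. If the reported figure is per MPI task, as the process label suggests, the memory needed on one node is roughly that value times the number of tasks placed on the node.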
Possible strategies to solve the issue are:
1) #SBATCH --mem=30000
Is this the maximum memory available on the node? Check that and set it to the maximum.
2) Reduce FFTGvecs in your calculation. You can always do that, but not by too much.
Add the variable to your input, e.g.:
Code: Select all
FFTGvecs=70000

You can probably reduce it even further, but remember to check the accuracy of the final results with respect to the value you set there.
3) If this still does not solve the problem, you can think about splitting your calculation into multiple lighter runs:
e.g.
Run1:
Code: Select all
%QPkrange                    # [GW] QP generalized Kpoint/Band indices
  1|127 | 43|46|
%

Run2:
Code: Select all
%QPkrange                    # [GW] QP generalized Kpoint/Band indices
  1|127 | 47|50|
%


These will generate two QP databases that can be merged with the ypp utility.
If you do so, take care not to overwrite these databases, e.g. by adding the following line at the end of your slurm script:
mv ./SAVE/ndb.QP ./SAVE/ndb.QP_1 (for the first run)
mv ./SAVE/ndb.QP ./SAVE/ndb.QP_2 (for the second run)

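The splitting in strategy 3 can be scripted when more than two runs are needed. The sketch below only generates the band chunks and the corresponding %QPkrange blocks, and renames the QP database after a run. The function names are hypothetical, the k-point and band numbers are the ones from the example above, and the database path ./SAVE/ndb.QP is taken from the mv commands above.

```shell
# Print consecutive "first last" band chunks covering first..last.
band_chunks() (
  lo=$1; last=$2; size=$3
  [ "$size" -ge 1 ] || exit 1          # avoid an endless loop
  while [ "$lo" -le "$last" ]; do
    hi=$(( lo + size - 1 ))
    [ "$hi" -gt "$last" ] && hi=$last  # clip the final chunk
    echo "$lo $hi"
    lo=$(( hi + 1 ))
  done
)

# Print a %QPkrange block for k-points k1..k2 and bands b1..b2.
qpkrange_block() {
  printf '%%QPkrange\n  %s|%s | %s|%s|\n%%\n' "$1" "$2" "$3" "$4"
}

# Rename the QP database of the run that just finished so the next
# run cannot overwrite it. Refuses to clobber an existing copy.
save_qp_db() (
  idx=$1; dir=${2:-./SAVE}
  if [ -e "$dir/ndb.QP_$idx" ]; then
    echo "refusing to overwrite $dir/ndb.QP_$idx" >&2
    exit 1
  fi
  mv "$dir/ndb.QP" "$dir/ndb.QP_$idx"
)
```

For instance, `band_chunks 43 50 4` prints the two ranges used in Run1 and Run2, and `qpkrange_block 1 127 43 46` prints the block to paste into the input of the first run. Calling `save_qp_db 1` at the end of the first slurm script plays the role of the first mv command.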

Here are other suggestions not related to the memory issue:
1)
CUTGeo= "Z" is not a valid keyword; replace it with
Code: Select all
CUTGeo= "box Z" 

With the current setting, the Coulomb cutoff technique is not actually used in your calculation.
2)
Code: Select all
% BndsRnXp
   1 | 130 |                 # [Xp] Polarization function bands
%
% GbndRnge
   1 | 130 |                 # [GW] G[W] bands range
%

You cannot use more than 100 bands, as that is the maximum number available in your database (nscf calculation). If you need more bands, you need to calculate them with QE (nscf).

3) You can also consider using terminators to accelerate the convergence with respect to bands, by adding these two lines to the input:
Code: Select all
GTermKind= "BG"              # [GW] GW terminator ("none","BG" Bruneval-Gonze)
GTermEn= 40.81708      eV      # [GW] GW terminator energy (only for kind="BG")





Best,
Daniele

Re: an error message at the GW level

Postby zaabar foudil » Sun Mar 18, 2018 7:48 pm

Dear Daniele
Thank you very much for your advice. I started the calculations following the different instructions you gave me, and so far they work well.

my best regards
zaabar foudil
university of bejaia
Algeria

Re: an error message at the GW level

Postby zaabar foudil » Wed Mar 28, 2018 2:24 pm

Dear Daniele
I want to run the previous calculations (GW, BSE) that I showed you on an EC2 instance (Amazon cloud). I installed yambo-4.1.2 (parallel version) and it works well with the command mpirun -n 8 ... .... /bin/yambo: it produces the LOG folder and the other output files (r_optique). However, when I close the terminal in which I typed the command (mpirun -n ....), or when the connection drops, the calculation stops.
Is there any script or a solution to overcome that?

my best regards
zaabar foudil
university of bejaia
Algeria

Re: an error message at the GW level

Postby Daniele Varsano » Wed Mar 28, 2018 2:38 pm

Dear Zaabar,
I am not at all familiar with the Amazon cloud environment, but from what you describe you can try to run the job as:
nohup mpirun -n 8 ... .... /bin/yambo &

Have a look e.g. here for the nohup usage:
https://www.cyberciti.biz/tips/nohup-ex ... rompt.html

or just search "nohup" in google.
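A slightly more complete version of the same idea is sketched here: the output is redirected to a log file and the process ID is printed, so the run can be followed or stopped after reconnecting. The function name `run_detached` and the log file name are assumptions for illustration.

```shell
# Launch a command detached from the terminal, so that closing the
# session (SIGHUP) does not stop it. All output goes to the log file
# given as first argument; the PID of the background process is printed.
run_detached() {
  log=$1
  shift
  nohup "$@" > "$log" 2>&1 < /dev/null &
  echo "$!"
}
```

It would be used as `run_detached run.log mpirun -n 8 ...` followed by the rest of your usual yambo command line. After logging in again, `tail -f run.log` follows the output and `kill` with the printed PID stops the run. Terminal multiplexers such as screen or tmux are a common alternative.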

Best,
Daniele

Re: an error message at the GW level

Postby zaabar foudil » Wed Mar 28, 2018 3:41 pm

Dear Daniele


I tested the "nohup" command and the problem is now solved.
Thank you very much.

my best regards
foudil zaabar
university of bejaia
Algeria