How to restart in BSE Kernel loop?

Postby wufeng » Wed Jan 03, 2018 1:51 am

Dear all,
Due to the walltime limit, I must split a large BSE calculation into several parts. The screening has been done, and the BS matrix build step (Kernel loop) will take a long time. However, when I try to restart the calculation, it either restarts from the beginning (as if no ndb.BS_Q1_CPU_* files were present) or hangs. Are any specific settings required to guarantee the restart? All I did was rerun with the same input file after the job was killed by the job management system.

I found that this question has been asked before (http://www.yambo-code.org/forum/viewtopic.php?f=13&t=796), but I could not work out what exactly should be done to restart correctly in the BSE kernel step. Thanks very much.
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States

Re: How to restart in BSE Kernel loop?

Postby Daniele Varsano » Wed Jan 03, 2018 10:36 am

Dear Feng Wu,
unfortunately, at the moment restarting the BS kernel is problematic; we are working on it.
If you post your report and log files, we can have a look and see whether we can suggest an optimal parallelisation strategy so that the calculation finishes in a reasonable wall time.

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Re: How to restart in BSE Kernel loop?

Postby wufeng » Thu Jan 04, 2018 1:27 am

Dear Daniele,
Thanks for the information.

A tarball of log files is attached. This run uses 4 processors per node on 8 nodes. Because of the memory limit, not all cores are used (instead, each CPU runs 8 OpenMP threads), and k-eh-t = 1-32-1.
I would really appreciate any advice you could give on this.


Best,
Feng

Re: How to restart in BSE Kernel loop?

Postby Daniele Varsano » Thu Jan 04, 2018 3:32 pm

Dear Feng,
from the log you attached, it looks like the calculation finished correctly.
Anyway, as you can see from the warning message in the log file:

<03s> P0001: [WARNING] n_eh_CPU > 1 in a system with symmetries and k-points is not efficient. Try distributing first "k" and "t"

it is more efficient to parallelize over the k-points first. The report is not attached, but it seems you have 4 k-points, right? Parallelizing over k should also distribute the memory, and if that is the case you can use more of the CPUs on your nodes.
In any case, keeping the same number of cores, a strategy such as k-eh-t = 4-1-8 should perform much better.
Just out of curiosity: your BSE matrix looks huge, and you do not have many k-points. How many conduction and valence bands are you including? Perhaps you can reduce the number of valence bands included in the BSE?
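
As a rough illustration of the layout suggested above, the sketch below checks that a k-eh-t choice multiplies out to the number of MPI tasks (4 per node on 8 nodes in this thread) and prints the corresponding input lines. The variable names BS_ROLEs and BS_CPU are assumed here to be the Yambo parallel input variables; please verify them against an input file generated by your own version before relying on this.

```python
from math import prod


def bs_parallel_settings(roles, cpus, total_mpi_tasks):
    """Return input lines for a BSE role layout.

    The names BS_ROLEs and BS_CPU are an assumption; check them against
    the input file generated by your Yambo version.
    """
    if len(roles) != len(cpus):
        raise ValueError("one CPU count is needed per role")
    used = prod(cpus)
    if used != total_mpi_tasks:
        layout = "-".join(str(c) for c in cpus)
        raise ValueError(f"{layout} uses {used} tasks, expected {total_mpi_tasks}")
    return [
        'BS_ROLEs= "' + " ".join(roles) + '"',
        'BS_CPU= "' + " ".join(str(c) for c in cpus) + '"',
    ]


if __name__ == "__main__":
    total = 4 * 8  # 4 MPI tasks per node on 8 nodes, as in the run above
    for line in bs_parallel_settings(("k", "eh", "t"), (4, 1, 8), total):
        print(line)
```

The same check accepts the original 1-32-1 layout, since both choices use all 32 tasks; only the distribution over roles differs.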

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
User avatar
Daniele Varsano
 
Posts: 1973
Joined: Tue Mar 17, 2009 2:23 pm

Re: How to restart in BSE Kernel loop?

Postby wufeng » Thu Jan 04, 2018 7:34 pm

Dear Daniele,
This is a pretty large system, with 1280 valence electrons, so there really are a lot of bands. I would like to find the limit of the system size we can run on this cluster. Thanks for your advice; I will try it.

I have another question about the k-point parallelization: I once tried it and found that some processes ran significantly longer than others. For example, eh-only runs for 18 hours on all processors, while k-only runs for 4/8/16/24 hours on different processors according to the LOG files. So the total CPU time is much lower with k-point parallelization (18*4 > 4+8+16+24), but the wall time is actually longer (24 > 18). I am sorry that I cannot find the LOG files now. Is this behaviour expected?
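
The arithmetic above can be written out as a small sketch. It assumes four process groups with the run times quoted in this post; the function name and the balance measure (total CPU time divided by the number of groups times the wall time) are chosen here only for illustration.

```python
def summarize(hours_per_process):
    """Return (total CPU hours, wall hours, balance) for one run.

    Wall time is set by the slowest process; balance is 1.0 when every
    process takes the same time and drops as the load becomes uneven.
    """
    total_cpu = sum(hours_per_process)
    wall = max(hours_per_process)
    balance = total_cpu / (len(hours_per_process) * wall)
    return total_cpu, wall, balance


if __name__ == "__main__":
    runs = {
        "eh-only": [18, 18, 18, 18],  # 18 hours on all processors
        "k-only": [4, 8, 16, 24],     # 4/8/16/24 hours on different processors
    }
    for name, hours in runs.items():
        total_cpu, wall, balance = summarize(hours)
        print(f"{name}: total {total_cpu} h, wall {wall} h, balance {balance:.2f}")
```

With these numbers the k-only run does less total work but waits on its slowest process, which is why its wall time is longer.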

Best,
Feng

Re: How to restart in BSE Kernel loop?

Postby Daniele Varsano » Fri Jan 05, 2018 9:39 am

Dear Feng Wu,
There may be some imbalance, but I would not expect such a large discrepancy. Maybe something went wrong there, but without the report and logs it is hard to say.

Best,
Daniele

Re: How to restart in BSE Kernel loop?

Postby wufeng » Sat Jan 06, 2018 12:59 am

Dear Daniele,
Thanks. I will report back if this can be reproduced.


Feng

Re: How to restart in BSE Kernel loop?

Postby Davide Sangalli » Sat Jan 06, 2018 11:30 pm

Dear Feng Wu,
a few more comments.

The imbalance you find in the "k"-only parallelization is not so strange to me.
Yambo distributes the points in the IBZ while the BSE matrix is computed in the whole BZ.
The reason is that some matrix elements can be symmetry related.

As Daniele pointed out, and you indeed found out, the parallelization over "eh" is balanced but not efficient.

I think instead the parallelization over "t"-only could be both balanced and efficient.
The maximum number of processors you can use for that is roughly N*(N+1)/2, with N=nk*neh, where
nk = number of k-points in the full BZ
neh = number of cores used for the eh parallelization.
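
A minimal sketch of this estimate follows. The function name and the example values of nk and neh are hypothetical, since the number of k-points in the full BZ is not given in this thread.

```python
def max_t_processors(nk_full_bz, neh):
    """Rough upper bound N*(N+1)/2 on processors for "t" parallelization,
    with N = nk * neh as defined above."""
    if nk_full_bz < 1 or neh < 1:
        raise ValueError("nk_full_bz and neh must be positive integers")
    n = nk_full_bz * neh
    return n * (n + 1) // 2  # N*(N+1) is always even, so the division is exact


if __name__ == "__main__":
    # Hypothetical values, chosen only to show how quickly the bound grows.
    for nk, neh in [(8, 1), (8, 4)]:
        print(f"nk={nk}, neh={neh}: up to about {max_t_processors(nk, neh)} processors")
```

Since the bound grows quadratically with N, even a small number of k-points leaves room for many processors in the "t" role.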

Finally, I see you are also using 8 threads. I am not fully sure of the effect.
The OpenMP parallelism has not been tested much with BSE.

Hope it helps.
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
http://www.ism.cnr.it/en/davide-sangalli-cv/
http://www.max-centre.eu/

Re: How to restart in BSE Kernel loop?

Postby wufeng » Fri Jan 12, 2018 12:29 am

Thanks very much for the details. I have attached my LOG files for 3 different parallelization settings.

    1. k=4 eh=16: times range from 8h27min to 1d06h45min
    2. eh=64: times range from 16h50min to 18h43min
    3. eh=16 t=4: times range from 15h50min to 19h28min
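
To compare the three settings, the sketch below converts the quoted durations to hours and prints the wall time and the slowest-to-fastest ratio. It assumes that each pair of values is the shortest and longest per-process time of a run; the helper name is made up for this example.

```python
import re


def to_hours(text):
    """Convert a duration such as '1d06h45min' or '8h27min' to hours."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)min)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return 24 * days + hours + minutes / 60


if __name__ == "__main__":
    # (fastest, slowest) times as listed above for each setting.
    runs = {
        "k=4 eh=16": ("8h27min", "1d06h45min"),
        "eh=64": ("16h50min", "18h43min"),
        "eh=16 t=4": ("15h50min", "19h28min"),
    }
    for name, (fastest, slowest) in runs.items():
        lo, hi = to_hours(fastest), to_hours(slowest)
        print(f"{name}: wall {hi:.2f} h, slowest/fastest = {hi / lo:.2f}")
```

The wall time of each run is set by its slowest process, so the ratio gives a quick measure of how unevenly the work was distributed.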


Also, OpenMP had no effect on the BSE part in my other tests.

Re: How to restart in BSE Kernel loop?

Postby Davide Sangalli » Fri Jan 12, 2018 12:00 pm

Thanks for the report.
Did you try distributing over "t" only?

Best,
D.