You will learn how to speed up your calculations and how to improve their convergence rate. This tutorial should take about 1.5 hours. Some useful references: Levitt, Bottin, Knyazev. To run the examples in parallel, use a parallel launcher such as mpirun. Before continuing, you might want to work in a different subdirectory, as for the other tutorials. All the input files can be found in the Input directory in the directory dedicated to this tutorial. You might have to adapt them to the path of the working directory.
You can compare your results with reference output files located in Refs. When the size of the system increases to a hundred atoms or more, it is usually impossible to perform ab initio calculations with a single computing core.
This is because the size of the basis sets used to solve the problem (plane waves, bands, …) increases linearly, quadratically, or even exponentially with the system size. The computational resources (memory and computing time) then become the limiting factors. The tests are performed on a gold-atom system; in this tutorial the plane-wave cutoff energy is strongly reduced, for practical reasons.
Then you have to choose between two strategies. A weight is assigned to each distribution of processors. If we just focus on npband and npfft, we see that, for 108 processes, the recommended distribution is 18x6. But somehow, you did it without understanding how you got the result… In this part we will try to recover the previous process distribution, but with a better understanding.
As shown above, the pair npband x npfft of input variables can take various values: 108x1, 54x2, 36x3, 27x4, 18x6, 12x9 or 9x12. The timing of each calculation can be retrieved with a simple grep on the output files. As far as the timing is concerned, the best distributions are then the ones proposed in section 2; 27x4 seems to be the best one. Up to now, we have not learned more than before. We have so far only considered the timing of 10 electronic steps. Each block of eigenvectors is concurrently diagonalized by the npband processes, one block after the other.
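The distributions listed above are simply the factorizations of the total number of band/FFT processes (108 in this example). A small illustrative sketch (the helper name is mine, not an ABINIT routine):

```python
# Enumerate the possible (npband, npfft) distributions for a fixed
# number of MPI processes: npband * npfft must equal that number.
def band_fft_distributions(nproc):
    """Return all (npband, npfft) pairs whose product is nproc."""
    return [(b, nproc // b) for b in range(1, nproc + 1) if nproc % b == 0]

# With 108 processes we recover the pairs quoted in the text,
# among them (108, 1), (54, 2), (36, 3), (27, 4), (18, 6), (12, 9), (9, 12).
pairs = band_fft_distributions(108)
print(pairs)
```

The weighting that ABINIT applies to pick one of these pairs is a separate matter; this only shows where the candidate list comes from.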
When the npband value is modified, the size of the block changes accordingly (it is exactly equal to npband), and the solutions of the eigensolver are modified. One calculation can be the quickest if we look at the time needed by one iteration, but the slowest in the end because many more steps are performed.
In order to see this, we can have a look at the convergence rate at the end of the calculations, i.e. at the last iterations of the SCF loops. The last column indicates the convergence of the potential residual. You can see that this quantity is smallest when npband is highest.
This result is expected: the convergence is better when the size of the block is larger. But this best convergence is obtained for the 108x1 distribution… for which the worst timing is measured. So, you face a dilemma. The calculation with the smallest number of iterations (the best convergence) is not the one with the best timing per iteration (the best efficiency), and vice versa… The best choice is a compromise. In the following we will choose the 27x4 pair, because it definitely offers more guarantees concerning both the convergence and the timing.
Note: You could check that the convergence is not changed when the npfft value is modified. We have seen in the previous section that the best convergence is obtained when the size of the block is the largest. This size was exactly equal to the npband value. It was only possible to increase the block size by increasing the number of MPI processes.
Is it possible to do better? The answer is yes! The input variable named bandpp (BANDs Per Process) makes it possible to increase the block size without changing the number of processes dedicated to bands. How does this work? As previously, each block of bands is diagonalized by npband MPI processes in parallel.
But, if bandpp is activated, each process handles bandpp bands sequentially. The block size — exactly equal to the number of bands handled by the band processes — is now equal to npband x bandpp. Accordingly, the block size can be modified (usually increased) by adjusting the value of bandpp, without changing the number of MPI processes.
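The block-size rule above can be written out as a one-line sanity check; this is an illustrative helper under the stated rule, not ABINIT code:

```python
def lobpcg_block_size(npband, bandpp):
    """Size of the eigenvector block diagonalized by the LOBPCG solver:
    each of the npband MPI band processes handles bandpp bands sequentially."""
    return npband * bandpp

# The same block size (54) can be reached with different distributions:
print(lobpcg_block_size(54, 1))  # 54 band processes, 1 band each -> 54
print(lobpcg_block_size(27, 2))  # 27 band processes, 2 bands each -> 54
```

This is exactly why, in the next section, reducing npband while increasing bandpp proportionally leaves the convergence unchanged.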
A comparison of these two files shows that the convergence is better in the second case. Conclusion: for a given number of processors, it is possible to improve the convergence by increasing bandpp.
Use the same input file, with the indicated changes. As you can see, the two calculations give exactly the same convergence rate. This was expected: in both cases the block sizes are equal to 54, and the number of FFT processors (npfft) does not affect the convergence.
It is possible to adjust the distribution of processes, without changing the convergence, by reducing npband and increasing bandpp proportionally. However, as you can see in the previous calculations, the CPU time per iteration increases when bandpp increases (note that the 2nd run performed fewer iterations than the first one). Where does this CPU time consumption come from? As previously explained, each MPI process handles bandpp bands sequentially.
Thus the sequential part of the code grows when bandpp increases. Using only MPI parallelism, the timing of a single electronic step therefore increases with bandpp, but the convergence rate is better. For even values of bandpp, the real wavefunctions are associated in pairs in the complex FFTs, reducing their cost by a factor of 2. In modern supercomputers, the computing units (CPU cores) are no longer uniformly distributed: they are grouped in nodes, within which they share the same memory access.
In so-called many-core architectures, the CPU cores on a single node can be numerous. You could continue to use them as if they did not share memory (using MPI only), but this is not the most efficient way to take advantage of the computer.
The best practice is to use hybrid parallelism, mixing distributed-memory parallelism (MPI, between nodes) and shared-memory parallelism (OpenMP, inside a node). As you will see, this also has consequences on the performance of the iterative diagonalization algorithm (LOBPCG). The best is to suppress it from the input file. As you can see, the new output file shows a larger computing time for process 0: disappointing? Not really: you have to keep in mind that this timing is for one MPI process, adding up the timings of all the OpenMP tasks of that process.
In the pure MPI case, we thus have 96 sec. This is better! The best way to confirm this is to look at the Wall Time accumulated over all tasks at the end of the output file. Each block of bands is diagonalized by npband MPI processes in parallel.
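The arithmetic behind that comparison can be sketched as follows; the numbers below are illustrative, not the ones measured in this run:

```python
def aggregated_cpu_time(wall_time_s, nthreads):
    """With OpenMP, the time reported for one MPI process sums the work of
    all its threads, so it can exceed the elapsed (wall-clock) time."""
    return wall_time_s * nthreads

# Illustrative values: a 30 s wall time with 4 threads shows up as
# 120 s of aggregated CPU time for that single MPI process.
print(aggregated_cpu_time(30.0, 4))  # 120.0
```

So a "larger" per-process timing in the hybrid run is not a regression: divide by the thread count (or compare accumulated wall times) before comparing against the pure MPI case.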
As previously, each process handles bandpp bands, but now using the OpenMP tasks. This means that bandpp x npband bands are computed in parallel using nthreads x npband tasks (bandpp thus has to be a multiple of nthreads). This is in principle more efficient than in the pure MPI case.
Note, however, that there are subtle differences with the pure MPI case. Important note: when using threads, bandpp has to be a multiple of the number of threads.
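That divisibility constraint is easy to check before launching a run; this validation helper is mine, for illustration only:

```python
def bands_per_thread(bandpp, nthreads):
    """When OpenMP threads are used, bandpp must be a multiple of the
    number of threads so each thread gets a whole number of bands."""
    if bandpp % nthreads != 0:
        raise ValueError(
            f"bandpp={bandpp} is not a multiple of nthreads={nthreads}")
    return bandpp // nthreads

print(bands_per_thread(4, 2))  # valid: 2 bands per thread
# bands_per_thread(3, 2) would raise ValueError
```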
How do we choose the number of threads? It strongly depends on the computer architecture! A computer is made of nodes. On each node there are sockets, each containing a given number of CPU cores. All the cores of a node can access the RAM of all its sockets, but this access is faster on their own socket.
The number of threads thus has to be a divisor of the total number of CPU cores in the node, but it is better to choose a divisor of the number of cores in a socket. Can we do better? In principle, yes: compare the timings you obtain. There is no difficulty in adding processes at this level. To test the full parallelism, we restart from the same input file as in section 3 and add a denser k-point grid. In this case, the system has 4 k-points in the irreducible Brillouin zone (IBZ), so the calculation can be parallelized over at most 4 k-point MPI processes.
This is done using the npkpt input variable. Indeed, the time needed here is slightly longer (10 sec), so the speedup is quasi-linear in vtowfk. The problem comes from the parts outside vtowfk, which are not parallelized; their cost is negligible at small process counts, but no longer when you parallelize over hundreds of processors.
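With k-point parallelism added on top, the total number of MPI processes is just the product of the three levels. A toy helper (the function name is mine):

```python
def total_mpi_processes(npkpt, npband, npfft):
    """The KGB parallelization distributes work over k-points, bands and
    FFTs; the three levels multiply."""
    return npkpt * npband * npfft

# 4 k-points on top of the 27x4 band/FFT distribution chosen earlier:
print(total_mpi_processes(4, 27, 4))  # 432 MPI processes
```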
The time spent in vtowfk corresponds to the parallelized fraction of the run. The speedup of a program using multiple processes in parallel computing is limited by the time needed for its sequential fraction (Amdahl's law); hence the achievable speedup is bounded. In our case, the part above the loop over k-points is not parallelized by the KGB parallelization.
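The bound just mentioned is Amdahl's law: if a fraction s of the run is sequential, the speedup on n processes cannot exceed 1/s. A small numerical illustration (the serial fraction below is made up for the example, not measured from this run):

```python
def amdahl_speedup(serial_fraction, nproc):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nproc)

# Even with only 1% sequential work, the speedup saturates near 1/0.01 = 100:
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.01, n), 1))
```

This is why the parts outside vtowfk, negligible on a few processes, dominate when you parallelize over hundreds of them.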