# Install PiNN
```
!pip install git+https://github.com/Teoroo-CMC/PiNN
```
## Data list dataset
Suppose your dataset can be represented as a list, and each data point can be accessed separately with some function. The list dataset descriptor helps you transform your reader function into a dataset loader, with handy options to split your dataset. The list can be a list of filenames of structures, or identifiers used to retrieve your data points, e.g. IDs from an online database.

The advantage of this approach is that you only need to write the reader for a single data point, and you get TensorFlow dataset objects with reasonably optimized IO. It is also easy to later convert your dataset into the TFRecord format if you need to train on the cloud or further improve performance.
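The split options exposed by the loader boil down to partitioning the list itself, which is cheap because each data point is addressed by its index. As a rough illustration of the idea (the function name and the ratio format below are made up for this sketch; they are not the PiNN API), one could split a list like this:

```python
import random

def split_list(datalist, ratios, seed=0):
    """Deterministically split a list into named subsets.

    `ratios` maps subset names to relative weights,
    e.g. {'train': 8, 'test': 2} for an 80/20 split.
    """
    indices = list(range(len(datalist)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    total = sum(ratios.values())
    splits, start = {}, 0
    for i, (name, weight) in enumerate(ratios.items()):
        # the last subset takes the remainder to avoid rounding gaps
        if i == len(ratios) - 1:
            stop = len(indices)
        else:
            stop = start + round(len(indices) * weight / total)
        splits[name] = [datalist[j] for j in indices[start:stop]]
        start = stop
    return splits

subsets = split_list(list(range(10)), {'train': 8, 'test': 2})
```

Because the shuffle is seeded, the same call always yields the same partition, which is what you want when resuming training on a fixed train/test split.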
We'll demonstrate with a list of ASE atoms.
```python
from ase import Atoms

datalist = [Atoms(elem) for elem in ['Cu', 'Ag', 'Au']]
```
For the purpose of training ANN potentials, you typically need to provide the elements, coordinates, and potential energy of a structure. Your reader function should take one list element as input and return a dictionary consisting of:

- `'elems'`: the elements, with shape `[n_atoms]`
- `'coord'`: the coordinates, with shape `[n_atoms, 3]`
- `'e_data'`: the potential energy, a single number
Once you have your reader function, decorate it with the `list_loader` decorator to transform it into a dataset loader.
```python
from pinn.io import list_loader

@list_loader()
def load_ase_list(atoms):
    datum = {'elems': atoms.numbers,
             'coord': atoms.positions,
             'e_data': 0.0}
    return datum
```
That's it, you've got your customized dataset!
```python
import tensorflow as tf

dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
```
```
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}
```
## Force and cell
By default, the list loader expects the reader to return the elements, coordinates, and total energy of each structure. It is also common to have nuclear forces and periodic boundary conditions (PBC) in the training data. The default behavior of `list_loader` can be changed with the `pbc` and `force` options.
```python
import numpy as np

@list_loader(pbc=True, force=True)
def load_ase_list(atoms):
    data = {'elems': atoms.numbers,
            'coord': atoms.positions,
            'cell': atoms.cell[:],  # get the full cell from ASE
            'f_data': np.zeros_like(atoms.positions),
            'e_data': 0.0}
    return data
```
```python
dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
```
```
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}
```
## Dataset Spec
For even more complex dataset structures, you can instead supply a dataset specification to build your list loader. Note that the dataset is always expected to be a dictionary of tensors. For example, we add a molecular weight entry to the dataset here. The spec should provide the shape and datatype of each entry; when a dimension is unknown in advance, e.g. the number of atoms, use `None` as that dimension.
```python
ds_spec = {
    'elems': {'dtype': tf.int32, 'shape': [None]},
    'coord': {'dtype': tf.float32, 'shape': [None, 3]},
    'e_data': {'dtype': tf.float32, 'shape': []},
    'mw_data': {'dtype': tf.float32, 'shape': []}}

@list_loader(ds_spec=ds_spec)
def load_ase_list(atoms):
    data = {'elems': atoms.numbers,
            'coord': atoms.positions,
            'e_data': 0.0,
            'mw_data': atoms.get_masses().sum()}
    return data
```
```python
dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
```
```
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 63.546}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 107.8682}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 196.96657}
```
## Loading trajectories
It is rather common to have trajectories as training data. However, trajectories are harder to handle than lists, since it is not trivial to know how many frames they contain or how they should be split.

One solution is to load all the data into memory at once. A more sophisticated solution is to quickly scan through the dataset and record a list of "positions" that can later be used to read a particular frame. You might want to look into `pinn.io.runner` or `pinn.io.cp2k` if you would like to implement something like that.
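To make the "positions" idea concrete, here is a minimal sketch for the XYZ trajectory format (the helper names are invented for this example and are not part of PiNN): one pass over the file records the byte offset where each frame starts, after which any frame can be read directly with a seek, without parsing everything before it.

```python
def scan_xyz(fname):
    """Scan an XYZ trajectory once, returning the byte offset of each frame."""
    offsets = []
    with open(fname, 'rb') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            n_atoms = int(line)           # first line of a frame: atom count
            offsets.append(pos)
            for _ in range(n_atoms + 1):  # skip the comment line + atom lines
                f.readline()
    return offsets

def read_frame(fname, offset):
    """Read a single frame at a known offset; returns (elems, coords)."""
    with open(fname, 'rb') as f:
        f.seek(offset)
        n_atoms = int(f.readline())
        f.readline()                      # comment line
        elems, coords = [], []
        for _ in range(n_atoms):
            fields = f.readline().split()
            elems.append(fields[0].decode())
            coords.append([float(x) for x in fields[1:4]])
    return elems, coords
```

The list of offsets plays the role of the data list from earlier sections: it can be shuffled and split like any other list, and a reader built on `read_frame` can then be wrapped with `list_loader` in the usual way.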