Customizing dataset

# Install PiNN
!pip install tensorflow==2.9
!pip install git+https://github.com/Teoroo-CMC/PiNN

Data list dataset

Suppose your dataset can be represented as a list and each data point can be accessed separately with some function.

The list_loader decorator transforms your reader function into a dataset loader, with handy options to split your dataset. The list can contain filenames of structures, or identifiers used to retrieve data points, e.g. IDs from an online database.

The advantage of this approach is that you only need to write the reader for one data point, and you get TensorFlow dataset objects with reasonably optimized IO. Later, it's also easy to convert your dataset into the TFRecord format if you need to train on the cloud or want to improve performance further.
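The idea behind this approach can be sketched in plain Python: a loader is essentially a function that lazily maps a per-item reader over a list, so IO happens one data point at a time. This is a hypothetical toy illustration, not PiNN's actual implementation (which builds a tf.data.Dataset):

```python
def toy_list_loader(reader):
    """Turn a per-item reader into a loader over a list (toy sketch)."""
    def loader(data_list):
        for item in data_list:
            yield reader(item)  # read one data point at a time
    return loader

@toy_list_loader
def read_point(x):
    # Stand-in for reading one structure; returns a dict of "tensors".
    return {'elems': [x], 'e_data': 0.0}

data = list(read_point([29, 47, 79]))
```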

We'll demonstrate with a list of ASE atoms.

from ase import Atoms
datalist = [Atoms(elem) for elem in ['Cu', 'Ag', 'Au']]

For the purpose of training ANN potentials, you typically need to provide the elements, coordinates, and potential energy of a structure.

Your reader function should take one list element as input, and return a dictionary consisting of:

  • 'elems': the atomic numbers, with shape [n_atoms]
  • 'coord': the coordinates of shape [n_atoms, 3]
  • 'e_data': a single number
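
For example, a valid datum for a hypothetical three-atom structure could look like this (plain lists shown for clarity; the values here are placeholders):

```python
# A hypothetical datum matching the expected keys and shapes:
datum = {
    'elems': [8, 1, 1],                      # shape [n_atoms]
    'coord': [[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]],              # shape [n_atoms, 3]
    'e_data': -76.4,                         # a single number
}
```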

Once you have a reader function, decorate it with the list_loader decorator to transform it into a dataset loader.

from pinn.io import list_loader

@list_loader()
def load_ase_list(atoms):
    datum = {'elems': atoms.numbers,
             'coord': atoms.positions,
             'e_data': 0.0}
    return datum

That's it, you've got your customized dataset!

import tensorflow as tf

dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}

Force and cell

By default, the list loader expects the reader to return the elements, coordinates and total energy of each structure. It is also common to have atomic forces and periodic boundary conditions (pbc) in the training data.

The default behavior of list_loader can be changed with the pbc and force options.

@list_loader(pbc=True, force=True)
def load_ase_list(atoms):
    import numpy as np
    data = {'elems': atoms.numbers,
            'coord': atoms.positions,
            'cell': atoms.cell[:], # get full cell from ASE
            'f_data': np.zeros_like(atoms.positions),
            'e_data': 0.0}
    return data

dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'cell': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32), 'f_data': array([[0., 0., 0.]], dtype=float32)}

Dataset Spec

For even more complex dataset structures, you can instead supply a dataset specification to build your list loader. Note that the dataset is always expected to be a dictionary of tensors.

For example, we add a molecular-weight entry to the dataset here. The spec dict should provide the shape and datatype of each entry. When a dimension is unknown, e.g. the number of atoms, use None as that dimension.

ds_spec = {
    'elems': {'dtype':  tf.int32,   'shape': [None]},
    'coord': {'dtype':  tf.float32, 'shape': [None, 3]},
    'e_data': {'dtype': tf.float32, 'shape': []},
    'mw_data': {'dtype': tf.float32, 'shape': []}}

@list_loader(ds_spec=ds_spec)
def load_ase_list(atoms):
    data = {'elems': atoms.numbers,
            'coord': atoms.positions,
            'e_data': 0.0,
            'mw_data': atoms.get_masses().sum()}
    return data
dataset = load_ase_list(datalist)
for tensors in dataset.as_numpy_iterator():
    print(tensors)
{'elems': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 63.546}
{'elems': array([47], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 107.8682}
{'elems': array([79], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 196.96657}

Loading trajectories

It's rather common to have trajectories as training data. However, trajectories are harder to handle than lists: it is not obvious in advance how many data points a file contains, or how they should be split.

One solution is to load all the data into memory at once. A more sophisticated solution is to scan through the dataset quickly, recording a list of file "positions" that can later be used to read any particular frame directly.
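The position-scanning idea can be sketched with the standard library alone. This is a toy example on a minimal xyz-like trajectory held in an io.StringIO buffer (a real implementation would operate on a file on disk); the format and helper names are illustrative, not part of PiNN:

```python
import io

# A toy "trajectory": three single-atom frames in a minimal xyz-like format
# (atom count, comment line, then one line per atom).
traj_text = (
    "1\nframe 0\nCu 0.0 0.0 0.0\n"
    "1\nframe 1\nAg 0.0 0.0 0.0\n"
    "1\nframe 2\nAu 0.0 0.0 0.0\n"
)

def index_frames(f):
    """Scan once through the file, recording the offset of each frame."""
    positions = []
    while True:
        pos = f.tell()
        header = f.readline()
        if not header:
            break
        positions.append(pos)
        n_atoms = int(header)
        for _ in range(n_atoms + 1):  # skip the comment and atom lines
            f.readline()
    return positions

def read_frame(f, pos):
    """Seek to a recorded position and parse a single frame."""
    f.seek(pos)
    n_atoms = int(f.readline())
    comment = f.readline().strip()
    atoms = [f.readline().split() for _ in range(n_atoms)]
    return comment, atoms

f = io.StringIO(traj_text)
positions = index_frames(f)        # one cheap scan over the whole file
comment, atoms = read_frame(f, positions[1])  # random access to frame 1
```

With such an index, each frame can be read on demand, so the trajectory can be split and iterated like a list without loading everything into memory.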

You might want to look into pinn.io.runner or pinn.io.cp2k if you would like to implement something like that.
