Data loaders
In PiNN, the dataset is represented with the TensorFlow Dataset class. Several
dataset loaders are implemented in PiNN to load data from common formats.
Starting from v1.0, PiNN provides a canonical data loader, `pinn.io.load_ds`, that
handles datasets in different formats; see below for the API documentation and
available formats.
TFRecord
The tfrecord format is a serialized format for efficient data reading in
TensorFlow. PiNN can save datasets in the TFRecord format. When PiNN writes the
dataset, it creates a `.yml` file that records the data structure of the dataset,
and a `.tfr` file that holds the data. For example:
```python
from glob import glob
from pinn.io import load_ds, write_tfrecord

filelist = glob('/home/yunqi/datasets/QM9/dsgdb9nsd/*.xyz')
train_set = load_ds(filelist, fmt='qm9', splits={'train': 8, 'test': 2})['train']
write_tfrecord('train.yml', train_set)
train_ds = load_ds('train.yml')
```
We advise you to convert your dataset into the TFRecord format for training. The advantage of this format is that it allows for the storage of preprocessed data and batched datasets.
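For instance, a minimal sketch of storing a batched dataset (the batch size and file names are arbitrary choices for illustration; `sparse_batch` is introduced below):

```python
from pinn.io import load_ds, sparse_batch, write_tfrecord

# batch the dataset before writing, so that the preprocessed,
# batched data can be consumed directly during training
ds = load_ds('train.yml').apply(sparse_batch(100))
write_tfrecord('train_batched.yml', ds)
```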
Splitting the dataset
It is a common practice to split the dataset into subsets for validation in
machine learning tasks. PiNN dataset loaders support a `splits` option to do
this. `splits` is a dictionary specifying the subsets and their relative
ratios. The dataset loader will return a dictionary of datasets with the
corresponding ratios. For example:
```python
from pinn.io import load_ds

dataset = load_ds(files, fmt='qm9', splits={'train': 8, 'test': 2})
train = dataset['train']
test = dataset['test']
```
Here `train` and `test` will be `tf.data.Dataset` objects to be consumed
by our models. The loaders also accept a `seed` parameter to make the split
reproducible; the default seed is 0.
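For example, a reproducible three-way split (the subset names and seed below are arbitrary choices):

```python
from pinn.io import load_ds

# the keys of `splits` are free to choose; the same seed always
# reproduces the same partitioning of the data
dataset = load_ds(files, fmt='qm9',
                  splits={'train': 8, 'vali': 1, 'test': 1}, seed=42)
```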
Batching the dataset
Most TensorFlow operations (caching, repeating, shuffling) can be
directly applied to the dataset. However, to handle datasets with
different numbers of atoms in each structure, which is often the case,
we use a special `sparse_batch` operation to create mini-batches of
the data in a sparse form. For example:
```python
from pinn.io import load_ds, sparse_batch

dataset = load_ds(filename)
batched = dataset.apply(sparse_batch(100))
```
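The sparse batches compose with the usual `tf.data` transformations; a typical input pipeline might look like the following sketch (buffer and batch sizes are illustrative):

```python
from pinn.io import load_ds, sparse_batch

# shuffle with a buffer, create sparse mini-batches, then cache the
# preprocessed batches in memory and repeat over training epochs
train = load_ds('train.yml')
train = train.shuffle(1000).apply(sparse_batch(100)).cache().repeat()
```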
Custom format
To be able to shuffle and split the dataset, PiNN requires the dataset to be
represented as a list of data. In the simplest case, the dataset could be a list
of structure files, each containing one structure and its labels (a sample). PiNN
provides a `list_loader` decorator which turns a function that reads a single
sample into a function that transforms a list of samples into a dataset. For
example:
```python
from pinn.io import list_loader

@list_loader()
def load_file_list(filename):
    # read a single file here
    coord = ...
    elems = ...
    e_data = ...
    datum = {'coord': coord, 'elems': elems, 'e_data': e_data}
    return datum
```
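The decorated function then behaves like the built-in loaders, accepting a list of samples along with the standard splitting options; a short usage sketch (the file pattern is hypothetical):

```python
from glob import glob

# each file is one sample; the decorator handles listing and splitting
filelist = glob('my_dataset/*.dat')
dataset = load_file_list(filelist, splits={'train': 8, 'test': 2})
```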
An example notebook on preparing a custom dataset is here.
Available formats
| Format | Loader | Description |
|---|---|---|
| `tfr` | `load_tfrecord` | See TFRecord |
| `runner` | `load_runner` | Loader for datasets in the RuNNer format |
| `ase` | `load_ase` | Load files with the `ase.io.read` function |
| `qm9` | `load_qm9` | An xyz-like file format used in the QM9[^1] dataset |
| `ani` | `load_ani` | HDF5-based format used in the ANI-1[^2] dataset |
| `cp2k` | `load_cp2k` | Loader for CP2K output (experimental) |
API documentation
pinn.io.load_ds
This loader tries to guess the format when `dataset` is a string:

- `load_tfrecord` if it ends with '.yml'
- `load_runner` if it ends with '.data'
- otherwise, try to load it with `load_ase`

If `fmt` is specified, the loader will use the corresponding dataset loader.
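For instance (the file names below are hypothetical):

```python
from pinn.io import load_ds

ds_tfr = load_ds('train.yml')     # dispatched to load_tfrecord
ds_runner = load_ds('input.data') # dispatched to load_runner
ds_traj = load_ds('md.traj')      # falls back to load_ase
```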
Parameters:
| Parameter | Description |
|---|---|
| `dataset` | a file or input for a loader according to `fmt` |
| `fmt` | dataset format, see available formats |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
| `**kwargs` | extra arguments to loaders |
Source code in pinn/io/__init__.py
```python
def load_ds(dataset, fmt='auto', splits=None, shuffle=True, seed=0, **kwargs):
    """This loader tries to guess the format when dataset is a string:

    - `load_tfrecord` if it ends with '.yml'
    - `load_runner` if it ends with '.data'
    - otherwise, try to load it with `load_ase`

    If the `fmt` is specified, the loader will use the corresponding dataset loader.

    Args:
        dataset: a file or input for a loader according to `fmt`
        fmt (str): dataset format, see available formats.
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
        **kwargs: extra arguments to loaders
    """
    loaders = {'tfr': load_tfrecord,
               'runner': load_runner,
               'ase': load_ase,
               'qm9': load_qm9,
               'ani': load_ani,
               'cp2k': load_cp2k}
    if fmt == 'auto':
        if dataset.endswith('.yml'):
            return load_tfrecord(dataset, splits=splits, shuffle=shuffle, seed=seed)
        if dataset.endswith('.data'):
            return load_runner(dataset, splits=splits, shuffle=shuffle, seed=seed)
        else:
            return load_ase(dataset, splits=splits, shuffle=shuffle, seed=seed)
    else:
        return loaders[fmt](dataset, splits=splits, shuffle=shuffle, seed=seed, **kwargs)
```
pinn.io.load_tfrecord
Load tfrecord dataset.

Note that the splits given by `load_tfrecord` should be consistent with other
loaders, but the order of data points will not be shuffled. Make sure to
use a large shuffle buffer when splits given by `load_tfrecord` are used in
training.
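For example (file name, buffer size, and batch size are illustrative):

```python
from pinn.io import load_ds, sparse_batch

# the tfrecord split preserves the stored data order, so shuffle with
# a buffer large enough to cover a good fraction of the dataset
train = load_ds('data.yml', splits={'train': 8, 'test': 2})['train']
train = train.shuffle(10000).apply(sparse_batch(100))
```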
Parameters:
| Parameter | Description |
|---|---|
| `dataset` | filename of the .yml metadata file to be loaded |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
Source code in pinn/io/tfr.py
```python
def load_tfrecord(dataset, splits=None, shuffle=True, seed=0):
    """Load tfrecord dataset.

    Note that the splits given by load_tfrecord should be consistent with other
    loaders, but the orders of data points will not be shuffled. Make sure to
    use a large shuffling buffer when splits given by `load_tfrecord` is used in
    training.

    Args:
        dataset (str): filename of the .yml metadata file to be loaded.
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
    """
    import sys, yaml
    import numpy as np
    import tensorflow as tf
    from pinn.io.base import split_list
    from tensorflow.python.lib.io.file_io import FileIO
    # dataset
    with FileIO(dataset, 'r') as f:
        ds_spec = yaml.safe_load(f)
    format_dict = ds_spec['format']
    dtypes = {k: format_dict[k]['dtype'] for k in format_dict.keys()}
    shapes = {k: format_dict[k]['shape'] for k in format_dict.keys()}
    feature_dict = {k: tf.io.FixedLenFeature([], tf.string) for k in dtypes}

    def parser(example):
        return tf.io.parse_single_example(example, feature_dict)

    def converter(tensors):
        tensors = {k: tf.io.parse_tensor(v, dtypes[k])
                   for k, v in tensors.items()}
        [v.set_shape(shapes[k]) for k, v in tensors.items()]
        return tensors

    tfr = '.'.join(dataset.split('.')[:-1] + ['tfr'])
    dataset = tf.data.TFRecordDataset(tfr).map(parser).map(converter)
    # tfr splitter
    if splits is None:
        return dataset
    else:
        n_sample = ds_spec['info']['n_sample']
        splits = split_list(np.int64(list(range(n_sample))),
                            splits=splits, shuffle=shuffle, seed=seed)
        splitted = {k: tf.data.Dataset.zip((
            dataset, tf.data.Dataset.range(n_sample))).filter(
                lambda d, i: tf.reduce_any(tf.equal(v, i))).map(
                    lambda d, i: d)
                    for k, v in splits.items()}
        return splitted
```
pinn.io.load_ase
Loads an ASE trajectory
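A short usage sketch (the trajectory file name is hypothetical):

```python
from pinn.io import load_ds

# any format readable by ase.io.read works here
dataset = load_ds('md.traj', fmt='ase', splits={'train': 9, 'test': 1})
```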
Parameters:
| Parameter | Description |
|---|---|
| `dataset` | a filename or trajectory |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
Source code in pinn/io/ase.py
```python
def load_ase(dataset, splits=None, shuffle=True, seed=0):
    """
    Loads an ASE trajectory

    Args:
        dataset (str or ase.io.trajectory): a filename or trajectory
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
    """
    from ase.io import read
    if isinstance(dataset, str):
        dataset = read(dataset, index=':')

    ds_spec = _ase_spec(dataset[0])

    @list_loader(ds_spec=ds_spec)
    def _ase_loader(atoms):
        datum = {
            'elems': atoms.numbers,
            'coord': atoms.positions,
        }
        if 'cell' in ds_spec:
            datum['cell'] = atoms.cell[:]
        if 'e_data' in ds_spec:
            datum['e_data'] = atoms.get_potential_energy()
        if 'f_data' in ds_spec:
            datum['f_data'] = atoms.get_forces()
        if 'q_data' in ds_spec:
            datum['q_data'] = atoms.get_charges()
        if 'd_data' in ds_spec:
            datum['d_data'] = atoms.get_dipole_moment()
        return datum

    return _ase_loader(dataset, splits=splits, shuffle=shuffle, seed=seed)
```
pinn.io.load_runner
Loads RuNNer-formatted trajectories
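For example (hypothetical file names):

```python
from pinn.io import load_runner

# a single file name or a list of file names can be given
dataset = load_runner(['run1.data', 'run2.data'],
                      splits={'train': 8, 'test': 2})
```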
Parameters:
| Parameter | Description |
|---|---|
| `flist` | one or a list of RuNNer-formatted trajectory file(s) |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
Source code in pinn/io/runner.py
```python
def load_runner(flist, splits=None, shuffle=True, seed=0):
    """
    Loads runner formatted trajectory

    Args:
        flist (str): one or a list of runner formatted trajectory(s)
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
    """
    if isinstance(flist, str):
        flist = [flist]
    frame_list = []
    for fname in flist:
        frame_list += _gen_frame_list(fname)
    return _frame_loader(frame_list, splits=splits, shuffle=shuffle, seed=seed)
```
pinn.io.load_qm9
Loads the QM9 dataset

QM9 provides a variety of labels, but typically we only train on one
target, e.g. U0. A `label_map` option is offered to choose the output
dataset structure; by default, it only takes "U0" and maps it to "e_data",
i.e. `label_map={'e_data': 'U0'}`.

Other available labels are

```python
['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
 'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']
```

Descriptions of those tags can be found in QM9's description file.
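For instance, to train on a different QM9 target (a sketch; `filelist` stands for a list of QM9 .xyz files, and extra keyword arguments are forwarded by `load_ds` to the loader):

```python
from pinn.io import load_ds

# map the HOMO-LUMO gap to 'e_data' instead of the default 'U0'
dataset = load_ds(filelist, fmt='qm9', label_map={'e_data': 'gap'})
```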
Parameters:
| Parameter | Description |
|---|---|
| `flist` | list of QM9-formatted data files |
| `label_map` | dictionary mapping labels to output datasets |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
Source code in pinn/io/qm9.py
````python
def load_qm9(flist, label_map={'e_data': 'U0'}, splits=None, shuffle=True, seed=0):
    """Loads the QM9 dataset

    QM9 provides a variety of labels, but typically we are only
    training on one target, e.g. U0. A ``label_map`` option is
    offered to choose the output dataset structure, by default, it
    only takes "U0" and maps that to "e_data",
    i.e. `label_map={'e_data': 'U0'}`.

    Other available labels are

    ```python
    ['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
     'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']
    ```

    Descriptions of those tags can be found in QM9's description file.

    Args:
        flist (list): list of QM9-formatted data files.
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
        label_map (dict): dictionary mapping labels to output datasets
    """
    import numpy as np
    import tensorflow as tf
    from pinn.io.base import list_loader
    from ase.data import atomic_numbers

    _labels = ['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo', 'gap',
               'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']
    _label_ind = {k: i for i, k in enumerate(_labels)}

    @list_loader(ds_spec=_qm9_spec(label_map))
    def _qm9_loader(fname):
        with open(fname) as f:
            lines = f.readlines()
        elems = [atomic_numbers[l.split()[0]] for l in lines[2:-3]]
        coord = [[i.replace('*^', 'E') for i in l.split()[1:4]]
                 for l in lines[2:-3]]
        elems = np.array(elems, np.int32)
        coord = np.array(coord, float)
        data = {'elems': elems, 'coord': coord}
        for k, v in label_map.items():
            data[k] = float(lines[1].split()[_label_ind[v]])
        return data

    return _qm9_loader(flist, splits=splits, shuffle=shuffle, seed=seed)
````
pinn.io.load_ani
Loads the ANI-1 dataset
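A short usage sketch (the file name is hypothetical; note that this loader takes a `split` argument rather than `splits`):

```python
from pinn.io import load_ani

dataset = load_ani(['ani_gdb_s01.h5'], split={'train': 8, 'test': 2})
```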
Parameters:
Parameter | Description |
---|---|
filelist |
filenames of ANI-1 h5 files. |
split |
key-val pairs specifying the ratio of subsets |
shuffle |
shuffle the dataset (only used when splitting) |
seed |
random seed for shuffling |
cycle_length |
number of parallel threads to read h5 file |
Source code in pinn/io/ani.py
```python
def load_ani(filelist, split=False, shuffle=True, seed=0, cycle_length=4):
    """Loads the ANI-1 dataset

    Args:
        filelist (list): filenames of ANI-1 h5 files.
        split (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
        cycle_length (int): number of parallel threads to read h5 file
    """
    import h5py
    import numpy as np
    import tensorflow as tf
    from pinn.io.base import split_list
    ds_spec = {
        'elems': {'dtype': tf.int32, 'shape': [None]},
        'coord': {'dtype': tf.keras.backend.floatx(), 'shape': [None, 3]},
        'e_data': {'dtype': tf.keras.backend.floatx(), 'shape': []}}
    # Load the list of samples
    sample_list = []
    for fname in filelist:
        store = h5py.File(fname)
        k1 = list(store.keys())[0]
        samples = store[k1]
        for k2 in samples.keys():
            sample_list.append((fname, '{}/{}'.format(k1, k2)))

    # Generate dataset from sample list
    def generator_fn(samplelist): return tf.data.Dataset.from_generator(
        lambda: _ani_generator(samplelist), output_signature=ds_spec).interleave(
            lambda x: tf.data.Dataset.from_tensor_slices(x),
            cycle_length=cycle_length)

    # Generate nested dataset
    subsets = split_list(sample_list, split=split, shuffle=shuffle, seed=seed)
    splitted = map_nested(generator_fn, subsets)
    return splitted
```
pinn.io.load_cp2k
This is an experimental loader for CP2K data.

It takes data from different sources: the CP2K output file and .dat files, which are specified in the `files` dictionary. A list of keys is used to specify the data to read and where it is read from.
| key | data source | provides |
|---|---|---|
| `force` | `files['out']` | `f_data` |
| `energy` | `files['out']` | `e_data` |
| `stress` | `files['out']` | `coord`, `elems` |
| `cell_dat` | `files['cell_dat']` | `cell` |
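A usage sketch (the file names are hypothetical; the keys follow the table above):

```python
from pinn.io import load_cp2k

files = {'out': 'cp2k.out', 'cell_dat': 'cell.dat'}
dataset = load_cp2k(files, keys=['energy', 'force', 'cell_dat'])
```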
Parameters:
| Parameter | Description |
|---|---|
| `files` | input files |
| `keys` | data to read |
| `splits` | key-val pairs specifying the ratio of subsets |
| `shuffle` | shuffle the dataset (only used when splitting) |
| `seed` | random seed for shuffling |
Source code in pinn/io/cp2k.py
```python
def load_cp2k(files, keys, splits=None, shuffle=True, seed=0):
    """This is an experimental loader for CP2K data

    It takes data from different sources, the CP2K output file and dat files,
    which will be specified in the files dictionary. A list of "keys" is used to
    specify the data to read and where it is read from.

    | key        | data source         | provides         |
    |------------|---------------------|------------------|
    | `force`    | `files['out']`      | `f_data`         |
    | `energy`   | `files['out']`      | `e_data`         |
    | `stress`   | `files['out']`      | `coord`, `elems` |
    | `cell_dat` | `files['cell_dat']` | `cell`           |

    Args:
        files (dict): input files
        keys (list): data to read
        splits (dict): key-val pairs specifying the ratio of subsets
        shuffle (bool): shuffle the dataset (only used when splitting)
        seed (int): random seed for shuffling
    """
    from pinn.io import list_loader
    ds_spec = {}
    for key in keys:
        for name in provides[key]:
            ds_spec.update({name: formats[name]})
    all_list = _gen_list(files, keys)

    @list_loader(ds_spec=ds_spec)
    def _frame_loader(i):
        results = {}
        for k, v in all_list.items():
            results.update(loaders[k](v[i]))
        return results

    return _frame_loader(list(range(len(all_list['coord']))),
                         splits=splits, shuffle=shuffle, seed=seed)
```
[^1]: R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1:140022, August 2014. doi:10.1038/sdata.2014.22.
[^2]: J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1, a data set of 20 million calculated off-equilibrium conformations for organic molecules. Sci. Data, 4:170193, December 2017. doi:10.1038/sdata.2017.193.