# Data loaders
In PiNN, a dataset is represented with the TensorFlow `tf.data.Dataset` class. Several dataset loaders are implemented in PiNN to load data from common formats. Starting from v1.0, PiNN provides a canonical data loader, `pinn.io.load_ds`, that handles datasets in different formats; see below for the API documentation and available formats.
## TFRecord
The TFRecord format is a serialized format for efficient data reading in TensorFlow. PiNN can save datasets in the TFRecord format. When PiNN writes the dataset, it creates a `.yml` file that records the data structure of the dataset, and a `.tfr` file that holds the data. For example:
```python
from glob import glob
from pinn.io import load_ds, write_tfrecord

filelist = glob('/home/yunqi/datasets/QM9/dsgdb9nsd/*.xyz')
train_set = load_ds(filelist, fmt='qm9', splits={'train': 8, 'test': 2})['train']
write_tfrecord('train.yml', train_set)
train_ds = load_ds('train.yml')
```
We advise you to convert your dataset into the TFRecord format for training. The advantage of this format is that it allows preprocessed and batched datasets to be stored.
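For instance, a dataset can be batched with the `sparse_batch` operation described below and then written, so that the stored records are already batched. A minimal sketch reusing `train_set` from the example above (the file name `train_batched.yml` is our own choice):

```python
from pinn.io import load_ds, sparse_batch, write_tfrecord

# write the batched dataset; the .yml metadata records the batched structure
batched = train_set.apply(sparse_batch(100))
write_tfrecord('train_batched.yml', batched)

# the reloaded dataset keeps the batched structure
train_ds = load_ds('train_batched.yml')
```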
## Splitting the dataset
It is a common practice in machine learning tasks to split the dataset into subsets for validation. PiNN dataset loaders support a `splits` option to do this. `splits` can be a dictionary specifying the subsets and their relative ratios. The dataset loader will then return a dictionary of datasets with the corresponding ratios. For example:
```python
from pinn.io import load_ds

dataset = load_ds(files, fmt='qm9', splits={'train': 8, 'test': 2})
train = dataset['train']
test = dataset['test']
```
Here `train` and `test` will be `tf.data.Dataset` objects to be consumed by our models. The loaders also accept a `seed` parameter to make the split reproducible; the default seed is `0`.
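Passing the same seed reproduces the same split across runs; a minimal sketch using the `files` list from the example above (the seed value here is arbitrary):

```python
# identical seeds yield identical train/test membership
ds_a = load_ds(files, fmt='qm9', splits={'train': 8, 'test': 2}, seed=42)
ds_b = load_ds(files, fmt='qm9', splits={'train': 8, 'test': 2}, seed=42)
```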
## Batching the dataset
Most TensorFlow operations (caching, repeating, shuffling) can be applied directly to the dataset. However, to handle datasets with different numbers of atoms in each structure, which is often the case, we use a special `sparse_batch` operation to create mini-batches of the data in a sparse form. For example:
```python
from pinn.io import sparse_batch

dataset = load_ds(filename)
batched = dataset.apply(sparse_batch(100))
```
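Because the result is still a `tf.data.Dataset`, the usual TensorFlow operations compose with `sparse_batch`; a sketch (the buffer and batch sizes are arbitrary):

```python
# cache parsed structures, shuffle each epoch, then form sparse mini-batches
pipeline = (dataset
            .cache()
            .shuffle(1000)
            .apply(sparse_batch(100))
            .repeat())
```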
## Custom format
To be able to shuffle and split the dataset, PiNN requires the dataset to be represented as a list of data. In the simplest case, the dataset could be a list of structure files, each containing one structure and its label (a sample). PiNN provides a `list_loader` decorator which turns a function that reads a single sample into a function that transforms a list of samples into a dataset. For example:
```python
from pinn.io import list_loader

@list_loader()
def load_file_list(filename):
    # read a single file here
    coord = ...
    elems = ...
    e_data = ...
    datum = {'coord': coord, 'elems': elems, 'e_data': e_data}
    return datum
```
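The decorated `load_file_list` can then be called like the built-in loaders; a hypothetical usage (the file pattern is an assumption):

```python
from glob import glob

# 'my_data/*.dat' is a hypothetical file pattern
filelist = glob('my_data/*.dat')
dataset = load_file_list(filelist, splits={'train': 8, 'test': 2})
```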
An example notebook on preparing a custom dataset is here.
## Available formats
| Format | Loader | Description |
|--------|--------|-------------|
| tfr | `load_tfrecord` | See [TFRecord](#tfrecord) |
| runner | `load_runner` | Loader for datasets in the RuNNer format |
| ase | `load_ase` | Load files with the `ase.io.iread` function |
| qm9 | `load_qm9` | An xyz-like file format used in the QM9[^1] dataset |
| ani | `load_ani` | HDF5-based format used in the ANI-1[^2] dataset |
| cp2k | `load_cp2k` | Loader for CP2K output (experimental) |
| deepmd-kit | `load_deepmd` | Loader for DeePMD-kit input |
## API documentation
### pinn.io.load_ds
This loader tries to guess the format when `dataset` is a string:

- `load_tfrecord` if it ends with `'.yml'`
- `load_runner` if it ends with `'.data'`
- otherwise, try to load it with `load_ase`

If `fmt` is specified, the loader will use the corresponding dataset loader.
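For example (all file names here are hypothetical):

```python
from pinn.io import load_ds

ds_tfr = load_ds('train.yml')      # guessed: load_tfrecord
ds_runner = load_ds('input.data')  # guessed: load_runner
ds_ase = load_ds('md.traj')        # fallback: load_ase
```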
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset` | `Dataset` | a file or input for a loader according to `fmt` | *required* |
| `fmt` | `str` | dataset format, see available formats | `'auto'` |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets | `None` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
| `**kwargs` | `dict` | extra arguments to loaders | `{}` |
Source code in `pinn/io/__init__.py`.
### pinn.io.load_tfrecord
Load a TFRecord dataset.

Note that the splits given by `load_tfrecord` will be the same as those of other dataset loaders. However, the sequence is not guaranteed with `shuffle=True`.
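For example, reloading the `train.yml` written in the TFRecord example above and splitting it on the fly:

```python
from pinn.io import load_tfrecord

subsets = load_tfrecord('train.yml', splits={'train': 9, 'vali': 1})
train, vali = subsets['train'], subsets['vali']
```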
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset` | `str` | filename of the `.yml` metadata file to be loaded | *required* |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets | `None` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
Source code in `pinn/io/tfr.py`.
### pinn.io.load_ase
Loads an ASE trajectory.
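Anything readable by `ase.io.iread` can be loaded; a sketch with a hypothetical trajectory file:

```python
from pinn.io import load_ase

# 'md.traj' is a hypothetical ASE trajectory file
dataset = load_ase('md.traj', splits={'train': 9, 'test': 1})
```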
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset` | `str \| trajectory` | a filename or trajectory | *required* |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets | `None` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
Source code in `pinn/io/ase.py`.
### pinn.io.load_runner
Loads RuNNer-formatted trajectories.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| `flist` | `str` | one or a list of RuNNer-formatted trajectory(s) |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) |
| `seed` | `int` | random seed for shuffling |
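A usage sketch with a hypothetical RuNNer file:

```python
from pinn.io import load_runner

# 'input.data' is a hypothetical RuNNer-formatted file
dataset = load_runner('input.data', splits={'train': 8, 'test': 2})
```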
Source code in `pinn/io/runner.py`.
### pinn.io.load_qm9
Loads the QM9 dataset.

QM9 provides a variety of labels, but typically we train on only one target, e.g. U0. A `label_map` option is offered to choose the output dataset structure; by default, it only takes "U0" and maps it to "e_data", i.e. `label_map={'e_data': 'U0'}`.

Other available labels are `['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo', 'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']`.

Descriptions of those tags can be found in QM9's description file.
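For example, to train on the HOMO-LUMO gap instead of U0 (the choice of target here is only illustrative):

```python
from pinn.io import load_qm9

dataset = load_qm9(flist, label_map={'e_data': 'gap'})
```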
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `flist` | `list` | list of QM9-formatted data files | *required* |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets | `None` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
| `label_map` | `dict` | dictionary mapping labels to output datasets | `{'e_data': 'U0'}` |
Source code in `pinn/io/qm9.py`.
### pinn.io.load_ani
Loads the ANI-1 dataset.
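A usage sketch, assuming the ANI-1 HDF5 files sit in a hypothetical `ANI-1_release/` directory:

```python
from glob import glob
from pinn.io import load_ani

filelist = glob('ANI-1_release/*.h5')
dataset = load_ani(filelist, split={'train': 8, 'test': 2})
```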
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `filelist` | `list` | filenames of ANI-1 h5 files | *required* |
| `split` | `dict` | key-val pairs specifying the ratio of subsets | `False` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
| `cycle_length` | `int` | number of parallel threads to read h5 file | `4` |
Source code in `pinn/io/ani.py`.
### pinn.io.load_cp2k
This is an experimental loader for CP2K data.

It takes data from different sources: the CP2K output file and `.dat` files, which are specified in the `files` dictionary. A list of `keys` specifies which data to read and where it is read from (see the table below; a usage sketch follows it).
| key | data source | provides |
|-----|-------------|----------|
| `force` | `files['out']` | `f_data` |
| `energy` | `files['out']` | `e_data` |
| `stress` | `files['out']` | `s_data` |
| `coord` | `files['coord']` | `coord`, `elems` |
| `cell_dat` | `files['cell_dat']` | `cell` |
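A usage sketch with hypothetical file names:

```python
from pinn.io import load_cp2k

# hypothetical CP2K output and cell files
files = {'out': 'cp2k.out', 'cell_dat': 'cell.dat'}
dataset = load_cp2k(files, keys=['energy', 'force', 'cell_dat'])
```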
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `files` | `dict` | input files | *required* |
| `keys` | `list` | data to read | *required* |
| `splits` | `dict` | key-val pairs specifying the ratio of subsets | `None` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
Source code in `pinn/io/cp2k.py`.
### pinn.io.load_deepmd
This is a loader for DeePMD-kit input data. It takes a dict of keys and file paths, or a directory path which contains the data files. If `type_map` is provided, it will be used to convert the type IDs to atomic numbers.
| key | data source | provides |
|-----|-------------|----------|
| `coord` | `path/coord.raw` | `coord` |
| `force` | `path/force.raw` | `f_data` |
| `energy` | `path/energy.raw` | `e_data` |
| `virial` | `path/virial.raw` | `s_data` |
| `box` | `path/box.raw` | `cell` |
| `type` | `path/type.raw` | `elems` |
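A usage sketch with a hypothetical data directory; the type map (type 0 → O, type 1 → H) is an assumed convention for a water dataset:

```python
from pinn.io import load_deepmd

# 'deepmd_data/' is a hypothetical directory of .raw files
dataset = load_deepmd('deepmd_data/', type_map={0: 8, 1: 1})
```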
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `files` | `dict \| Path \| str` | input files | *required* |
| `type_map` | `dict \| Path \| str` | mapping of type id to atomic number | `None` |
| `pbc` | `bool` | flag of periodic boundary condition | `True` |
| `shuffle` | `bool` | shuffle the dataset (only used when splitting) | `True` |
| `seed` | `int` | random seed for shuffling | `0` |
Source code in `pinn/io/deepmd.py`.
[^1]: R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules," Sci. Data 1, 140022 (2014).
[^2]: J.S. Smith, O. Isayev, and A.E. Roitberg, "ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules," Sci. Data 4, 170193 (2017).