Quick tour with QM9

This notebook showcases a simple example of training a neural network potential on the QM9 dataset with PiNN.

# Install PiNN & download QM9 dataset
!pip install tensorflow==2.9
!pip install git+https://github.com/Teoroo-CMC/PiNN
!mkdir -p /tmp/dsgdb9nsd && curl -sSL https://ndownloader.figshare.com/files/3195389 | tar xj -C /tmp/dsgdb9nsd
import os, warnings
import tensorflow as tf
from glob import glob
from ase.collections import g2
from pinn.io import load_qm9, sparse_batch
from pinn import get_model, get_calc
# CPU is used for documentation generation, feel free to use your GPU!
os.environ['CUDA_VISIBLE_DEVICES'] = '' 
# We heavily use indexed slices to do sparse summations,
# which causes tensorflow to complain, 
# we believe it's safe to ignore this warning.
index_warning = 'Converting sparse IndexedSlices'
warnings.filterwarnings('ignore', index_warning)

Getting the dataset

PiNN adapts TensorFlow's dataset API to handle different datasets.

For this and the following notebooks, the QM9 dataset (https://doi.org/10.6084/m9.figshare.978904) is used. To follow the notebooks, download the dataset and change the directory in the code accordingly.

The dataset will be automatically split into subsets according to the splits argument. Note that to use the dataset with the estimator, the dataset must be supplied as an input function rather than as a dataset object, as shown below.

filelist = glob('/tmp/dsgdb9nsd/*.xyz')
# The input functions must build a fresh dataset on each call,
# hence the lambdas instead of dataset objects.
dataset = lambda: load_qm9(filelist, splits={'train': 8, 'test': 2})
train = lambda: dataset()['train'].repeat().shuffle(1000).apply(sparse_batch(100))
test = lambda: dataset()['test'].repeat().apply(sparse_batch(100))
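
To sanity-check the input pipeline, one can pull a single batch and inspect it; the exact keys and shapes depend on the loader, so treat this as an illustrative check:

# Illustrative sanity check: fetch one sparse batch and print tensor shapes.
for batch in train().take(1):
    print({k: v.shape for k, v in batch.items()})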

Defining the model

In PiNN, models are defined at two levels: the model and the network.

  • A model (model_fn) defines the training target, the loss, and the training details.
  • A network defines the structure of the neural network.

In this example, we will use the potential model with the PiNet network. The configuration of a model is stored in a nested dictionary, as shown below. The available options for the network and the model can be found in the documentation.

!rm -rf /tmp/PiNet_QM9
params = {'model_dir': '/tmp/PiNet_QM9',
          'network': {
              'name': 'PiNet',
              'params': {
                  'depth': 4,                    # number of interaction blocks
                  'rc': 4.0,                     # cutoff radius (Å)
                  'atom_types': [1, 6, 7, 8, 9]  # H, C, N, O, F
              },
          },
          'model': {
              'name': 'potential_model',
              'params': {
                  'learning_rate': 1e-3          # learning rate for the optimizer
              }
          }
}
model = get_model(params)

Configuring the training process

The defined model is in fact a tf.estimator.Estimator object, so the training process can be easily controlled:

train_spec = tf.estimator.TrainSpec(input_fn=train, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=test, steps=100)
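
For longer runs you may want to evaluate periodically instead of once. The values below are illustrative; start_delay_secs and throttle_secs are standard tf.estimator.EvalSpec options:

# Illustrative variant: train longer and evaluate at most every 10 minutes.
long_train_spec = tf.estimator.TrainSpec(input_fn=train, max_steps=100000)
periodic_eval_spec = tf.estimator.EvalSpec(input_fn=test, steps=100,
                                           start_delay_secs=60,  # delay before the first evaluation
                                           throttle_secs=600)    # minimum seconds between evaluations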

Train and evaluate

tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
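
The estimator writes checkpoints and event files to model_dir, so the training curves can be followed with TensorBoard; in a notebook environment the magics below should work:

# Optional: visualize the losses logged under the model directory.
%load_ext tensorboard
%tensorboard --logdir /tmp/PiNet_QM9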

Using the model

The trained model can be used as an ASE calculator.

from ase.collections import g2
from pinn import get_calc
params = {'model_dir': '/tmp/PiNet_QM9',  # must point to the trained model
          'network': {
              'name': 'PiNet',
              'params': {
                  'depth': 4,
                  'rc': 4.0,
                  'atom_types': [1, 6, 7, 8, 9]
              },
          },
          'model': {
              'name': 'potential_model',
              'params': {
                  'learning_rate': 1e-3
              }
          }
}

calc = get_calc(params)
calc.properties = ['energy']
atoms = g2['C2H4']
atoms.calc = calc  # attach the calculator (set_calculator is deprecated in ASE)
atoms.get_forces(), atoms.get_potential_energy()
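
The same calculator can be attached to other structures as well, for example another molecule from the g2 collection (bearing in mind that the model is barely trained, so the numbers will be rough):

# Reuse the trained calculator for a different molecule.
atoms2 = g2['H2O']
atoms2.calc = calc
atoms2.get_potential_energy()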

Conclusion

You have trained your first PiNN model, though the accuracy is not yet satisfying (RMSE = 21 Hartree!). The training is also slow, since it is limited by the I/O and the preprocessing of the data.

We will show in the following notebooks that:

  • Proper scaling of the energy improves the accuracy of the model.
  • The training speed can be improved by caching and pre-processing the data (see the sketch after this list).
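
As a minimal preview of the second point, the standard tf.data Dataset.cache transformation can keep the pre-processed batches in memory after the first epoch; this is only a sketch, the following notebooks show the full workflow:

# Minimal sketch: cache pre-processed batches so that later epochs skip the
# preprocessing cost; cache() must come before repeat() to be effective.
cached_train = lambda: dataset()['train'].apply(sparse_batch(100)).cache().repeat()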