Extrapolation and Interpolation in Machine Learning Modeling with Fast Food and astartes

Machine learning is a groundbreaking tool for tackling high-dimensional datasets with complex correlations that humans struggle to comprehend. An important nuance of ML is the difference between using a model for interpolation or extrapolation - roughly, inference within the space of the training data versus prediction beyond it. This work will demonstrate visually what interpolation and extrapolation mean in the context of machine learning using astartes, a Python package that makes both easy to tackle in ML modeling. The intended audience for this notebook is data scientists, programmers, and those generally interested in ML. Extensive knowledge of Python programming is not required, though some beginner experience with pandas, scikit-learn, and ML concepts is helpful for contextualizing the information presented.

Many different sampling approaches are made available with astartes, both interpolative and extrapolative. Some of the implemented algorithms are extensions of one another while others are entirely unique and quite complex. But what do they actually look like when it comes to splitting data into groups? And how do the different sampling approaches affect model performance?

For this notebook, we will use a very tangible dataset - the Burger King® menu - and subject it to the various sampling algorithms present in astartes and then visualize what the results look like. The features for the dataset are the grams of protein, fat, and carbohydrates present in each menu item at the restaurant, and the target is the number of calories in that item. We know a priori that there is an underlying correlation in this data: each macronutrient contributes a different number of calories per gram, and by calculating a weighted sum of the grams of each, we can easily find the total calories. Our goal is for the model to learn this simple correlation.
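As a quick sanity check of that correlation, here is the weighted sum written out with the standard 4, 4, and 9 calories per gram for protein, carbohydrates, and fat (a hypothetical item is used here, and published menu values are rounded, so real items will be off by a few calories):

protein, carbs, fat = 28, 49, 40  # grams for a hypothetical burger-sized item
calories = 4 * protein + 4 * carbs + 9 * fat
print(calories)  # 668 - this weighted sum is the relationship we hope the model will learn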

The dataset used in this work was retrieved from this link and is in the public domain (see the CC0 license).

Part 0. Perusing the Menu

Let’s start by loading the dataset and generating our first plot of its contents. To get the data into this notebook, we will use pandas, one of the ubiquitous Python packages for machine learning. You can read more about Pandas in their documentation. The data itself is in the csv file format and made available in the astartes GitHub repository.

import pandas as pd

with open("burger-king-menu.csv", "r") as f:
    menu = pd.read_csv(f)
menu.describe()
Calories Fat Calories Fat (g) Saturated Fat (g) Trans Fat (g) Cholesterol (mg) Sodium (mg) Total Carb (g) Dietary Fiber (g) Sugars (g) Protein (g) Weight Watchers
count 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000 77.000000
mean 501.428571 278.311688 30.967532 9.805195 0.636364 101.753247 993.246753 35.181818 1.779221 6.636364 20.909091 497.064935
std 307.612685 184.393762 20.535966 8.118431 1.128682 97.958659 613.426403 20.716588 1.690713 6.973463 17.145033 302.238070
min 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 12.000000
25% 260.000000 140.000000 16.000000 3.500000 0.000000 25.000000 470.000000 26.000000 1.000000 1.000000 12.000000 252.000000
50% 430.000000 250.000000 28.000000 8.000000 0.000000 70.000000 1010.000000 30.000000 1.000000 6.000000 17.000000 416.000000
75% 700.000000 380.000000 42.000000 14.000000 0.500000 175.000000 1420.000000 49.000000 2.000000 10.000000 28.000000 690.000000
max 1220.000000 750.000000 84.000000 33.000000 4.500000 390.000000 2840.000000 110.000000 9.000000 40.000000 71.000000 1192.000000

As you can see from the output of menu.describe(), this dataset has 77 total items (rows) and more columns than we need, so let’s drop many of the columns. We will also rename some of the remaining columns to be easier to use, and we will hold onto the labels ("Category") for later comparison to the splits that astartes creates.

menu.drop(
    [
        "Item",
        "Fat Calories",
        "Saturated Fat (g)",
        "Trans Fat (g)",
        "Cholesterol (mg)",
        "Sodium (mg)",
        "Dietary Fiber (g)",
        "Sugars (g)",
        "Weight Watchers",
    ],
    axis=1,
    inplace=True,
)
menu.rename(
    columns={
        "Total Carb (g)": "Carbohydrates",
        "Fat (g)": "Fat",
        "Protein (g)": "Protein",
    },
    inplace=True,
)
menu
Category Calories Fat Carbohydrates Protein
0 Burgers 660.0 40.0 49.0 28.0
1 Burgers 740.0 46.0 50.0 32.0
2 Burgers 790.0 51.0 50.0 35.0
3 Burgers 900.0 58.0 49.0 48.0
4 Burgers 980.0 64.0 50.0 52.0
... ... ... ... ... ...
72 Breakfast 40.0 0.0 11.0 0.0
73 Breakfast 140.0 15.0 1.0 1.0
74 Breakfast 80.0 8.0 2.0 0.0
75 Breakfast 150.0 15.0 3.0 0.0
76 Breakfast 90.0 6.0 8.0 0.0

77 rows × 5 columns

Now that we have the data ready to go, let’s plot it in 3D and see what it looks like. To do so, we will use plotly. First let’s write a nice wrapper function that will take in our data points and give back a nice 3D scatter plot that we can rotate and zoom in on. If you’re curious, go ahead and read through the code below - otherwise just remember that we have a function scatter3d that makes a nice looking 3D scatter plot.

scatter3d Source Code
import plotly
import plotly.express as px
import numpy as np

plotly.offline.init_notebook_mode()


def scatter3d(menu, idxs=None):
    """Generates a nicely formatted 3D scatter plot of the Menu data
    optionally showing train/validation/test splits.

    All of the logic related to `idxs` will be used later on when we
    partition our data into training, validation, and testing sets.

    Args:
        menu (pd.df):  menu dataframe
        idxs (tuple of np.array, optional): train, validation, and
            test indexes. Defaults to None.
    """
    plot_args = dict(
        data_frame=menu,
        x="Protein",
        y="Carbohydrates",
        z="Fat",
        color="Calories",
        opacity=1,
        range_color=[10, 1220],
        height=800,
        width=800,
    )

    # if the user specified indexes, change their size and marker to indicate them
    if idxs:
        # build lists of the size and split name
        size_array, name_array = [1] * 77, [1] * 77
        for split, split_size, split_name in zip(
            idxs, (20, 2, 5), ("Training", "Validation", "Testing")
        ):
            for idx in split:
                size_array[idx] = split_size
                name_array[idx] = split_name
        # pass these through to plotly call
        plot_args["symbol"], plot_args["size"] = "Split", "MarkerSize"
        menu["MarkerSize"], menu["Split"] = np.array(size_array), np.array(name_array)

    # actual call to plotly
    fig = px.scatter_3d(**plot_args)

    # add a legend for different split types
    if idxs:
        fig.update_layout(
            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
        )
        fig.update_traces(marker=dict(line=dict(width=0)))

        # make the markers consistent between plots
        symbols = {"Training": "circle", "Validation": "diamond", "Testing": "square"}
        for i, d in enumerate(fig.data):
            fig.data[i].marker.symbol = symbols[fig.data[i].name]

    # customize the colors
    fig.update_layout(
        dict(
            plot_bgcolor="rgba(0, 0, 0, 0)",
            paper_bgcolor="rgba(0, 0, 0, 0)",
        )
    )
    axis_args = dict(
        backgroundcolor="rgba(0, 0, 0,0)",
        gridcolor="grey",
        showbackground=True,
        zerolinecolor="grey",
    )
    fig.update_layout(scene=dict(xaxis=axis_args, yaxis=axis_args, zaxis=axis_args))

    # render the plot
    plotly.offline.iplot(fig)
scatter3d(menu)

Hint: Try dragging this plot around to change your perspective and get a better look at the data!

We can immediately make some interesting conclusions just from a quick visual inspection of this plot. As expected, items with a larger total amount of macronutrients have more calories (because the items are just physically bigger). Items that have a lot of fat also have more calories than those with a lot of carbohydrates, which aligns with our understanding - fat has 9 calories per gram, whereas carbohydrates have only 4 calories per gram.

Part 1. Comparing Interpolative Splits

This section will focus on sampling approaches that attempt to enforce interpolation. If you have a large dataset and are confident that no new data will be outside the feature space of your previous examples and/or you only intend to use your final model for inference, this is a great approach.

Each of these algorithms works in generally the same way conceptually - we pick one or two points to start with and put them in the training set, and then continually take more points from the dataset sequentially until the training, validation, and testing sets are full. The method by which we make this choice is what differentiates these algorithms.

Random

The first approach that almost every machine learning researcher would use is random splitting. Although this method does not strictly enforce interpolation, random sampling often creates splits that have very similar distributions. The spread of values for the features and targets in the testing and validation set will most likely be very similar to those for the training set, i.e. between all of the training points in the plot.

Here’s how you do random sampling using astartes, and the plot afterward shows what the resulting training, testing, and validation set look like. If you are already familiar with machine learning with sklearn, this function call should look pretty familiar. The only big difference is that we now specify which sampling approach we use, and we can return the indices belonging to the training, validation, and testing sets directly!

from astartes import train_val_test_split

Tip: to install astartes on your own machine, run pip install astartes.

train_val_test_split returns a training, validation, and testing set (in that order) for every array that is passed in. In this case, we pass in a feature array and a target array, so we get 6 arrays back. We have also enabled the option to return_indices, which will add 3 more arrays to the return statement that indicate which indices from the feature array belong to which split. This is handy for re-splitting the data later without having to save the splits themselves, and in our case for plotting what the splits look like.

The call itself contains the feature array X, the target array y, the choice of sampler (if we had not given one, it would default to "random" anyway), the size of the training, validation, and testing sets, and finally the flag to turn on return_indices. We could have also passed in a random_state to control the exact splitting results (astartes uses 42 by default to ensure reproducibility between runs) or labels if we wanted astartes to provide labels_train etc. for us.

(
    random_X_train,
    random_X_val,
    random_X_test,
    random_y_train,
    random_y_val,
    random_y_test,
    random_idxs_train,
    random_idxs_val,
    random_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="random",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.49. Requested test size of 0.25, got 0.26. 

astartes warns us that the actual result of the sampling does not perfectly match what we requested (with 77 data points, the requested fractions cannot be matched exactly), but it is close enough!

scatter3d(menu, (random_idxs_train, random_idxs_val, random_idxs_test))

As shown in the plot, nearly all of the points in the testing and validation sets lie in between points that are in the training set. When it comes time for the model to make predictions on these smaller sets, it will have already seen some very similar points.

Of note, though, is that there are some points in the testing set that are totally outside of the training set. This is just the way the random sampling happened to divide the data - were we to specify a different random state in astartes, we would get a different result. The persistent issue with random sampling, though, is that this is always a risk, particularly on small datasets.

Let’s explore some other samplers that try to ensure interpolation is strictly enforced.

Kennard-Stone

This algorithm seeks to remedy the above problems by very intentionally sampling through the feature space that is occupied by our foods. You can read the original paper by Kennard and Stone at this link, but I’ll try to give a more high-level, intuitive explanation.

We start by calculating how far away each point in the dataset is from all the other points. This is referred to as the pairwise distance matrix, or simply the distance matrix. The way that we actually quantify distance is a complicated question. In cases like this, it’s easy to visualize the Euclidean distance between each point in the dataset and use that as our metric. But in higher-dimensional datasets it can be much harder to pick a reasonable metric. Some common metrics you might see in the literature are the various L-norms (of which Euclidean distance is a member), Manhattan distance, or cosine distance.
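For reference, each of those metrics is a one-liner with scipy (shown purely for illustration - this is not necessarily how astartes computes distances internally):

from scipy.spatial.distance import pdist, squareform

features = menu[["Protein", "Carbohydrates", "Fat"]].to_numpy()
euclidean_dist = squareform(pdist(features, metric="euclidean"))  # L2 norm
manhattan_dist = squareform(pdist(features, metric="cityblock"))  # L1 norm
cosine_dist = squareform(pdist(features, metric="cosine"))        # 1 - cosine similarity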

Side Note: There’s another class of distance metrics that operate on binary features (only 0s and 1s, rather than arbitrary values like we have here) called similarity metrics. These operate in the opposite ‘direction’ as distance metrics, i.e. a similarity of 1 would be a distance of 0 and vice versa (if distance is normalized), but can still be used in applications like this.

The Kennard-Stone algorithm uses this distance matrix to find the two members of the dataset that are the furthest apart in space. This first pair is placed into the training set. From then on, it successively adds points to the training set based on which remaining point has the maximum minimum distance. For each point in the dataset, we start by identifying which point already in the training set it is closest to (the minimum part of max-min distance). The next point to be added to the training set is the one for which this value is the largest (the maximum part of max-min). In doing so, we uniformly sample our way inwards from the ‘edges’ of the data towards the middle, only adding points to the training set that are the least close to the existing points.
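Here is a minimal sketch of that max-min selection loop, written out with numpy and scipy for clarity (an illustration of the idea only, not the astartes implementation):

import numpy as np
from scipy.spatial.distance import cdist


def kennard_stone_indices(X, n_select):
    dist = cdist(X, X)  # pairwise Euclidean distance matrix
    # start with the two points that are farthest apart
    selected = list(np.unravel_index(dist.argmax(), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # for every candidate, find its distance to the closest already-selected point...
        min_dists = dist[np.ix_(remaining, selected)].min(axis=1)
        # ...then add the candidate whose closest selected point is farthest away
        next_point = remaining[int(min_dists.argmax())]
        selected.append(next_point)
        remaining.remove(next_point)
    return selected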

Let’s see what this actually looks like in 3D space on our menu.

(
    ks_X_train,
    ks_X_val,
    ks_X_test,
    ks_y_train,
    ks_y_val,
    ks_y_test,
    ks_idxs_train,
    ks_idxs_val,
    ks_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="kennard_stone",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.49. Requested test size of 0.25, got 0.26. 
scatter3d(menu, (ks_idxs_train, ks_idxs_val, ks_idxs_test))

Kennard-Stone started by selecting the two items which are furthest from one another on the plot - the very top yellow point and the bottom purple one. From there on it picked points that were furthest from the training set, moving its way inward on the data.

Compared to random sampling, we still get a reasonably good look at all the data with our training set. In the lower-calorie region, we have fewer ‘clumps’ of points than we did with random sampling. Of note, though, is that we still missed some of the outliers. While they are far away in space from the initialization points, they weren’t quite far enough to be sampled into the training set. This is a known shortcoming of the Kennard-Stone algorithm, though it is sometimes just a function of small datasets.

Joint Sample Partitioning in X and Y (SPXY) seeks to remedy this problem by also considering the value of the target space - not just the feature space. Let’s check it out.

SPXY

The SPXY algorithm is an extension to the Kennard-Stone algorithm that also considers the values of the target for each data point in the feature space. It works by calculating the distance matrix for the features and the target (implicitly requiring the target to be comparable in this manner, i.e. a number or vector) and then normalizing each before summing them together. This “augmented” distance matrix is then carried forward into the same maximum minimum distance iterative process described above.
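That augmentation step can be sketched in a few lines (again just an illustration assuming Euclidean distances, not the astartes implementation):

import numpy as np
from scipy.spatial.distance import cdist


def spxy_distance_matrix(X, y):
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    dist_x = cdist(X, X)  # distances in feature space
    dist_y = cdist(y, y)  # distances in target space
    # normalize each matrix by its maximum, then sum them into one "augmented" matrix
    return dist_x / dist_x.max() + dist_y / dist_y.max()

This augmented matrix is then fed through the same max-min selection loop sketched in the Kennard-Stone section above.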

For a comprehensive mathematical description of the procedure, check out the original paper here. Otherwise, let’s see how it performs visually.

(
    spxy_X_train,
    spxy_X_val,
    spxy_X_test,
    spxy_y_train,
    spxy_y_val,
    spxy_y_test,
    spxy_idxs_train,
    spxy_idxs_val,
    spxy_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="spxy",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.49. Requested test size of 0.25, got 0.26. 
scatter3d(menu, (spxy_idxs_train, spxy_idxs_val, spxy_idxs_test))

Now all the points in the testing and validation sets lie within the training set. Our training samples are also more evenly distributed and ‘diffuse’ than with pure Kennard-Stone. If you are trying to enforce interpolation, SPXY is a great option whenever it is mathematically feasible.

Choosing an Interpolation Algorithm

If you have a sufficiently large dataset, you can most likely throw caution to the wind, use random sampling, and never notice a performance difference when using your model. This requires you to have lots of data (there are no good heuristics for what constitutes “lots”), at the very least more examples than features.

If you are operating in a low data regime or you explicitly want to try and enforce interpolation, use the SPXY sampler if your target can be compared with distance metrics and the Kennard-Stone sampler otherwise.

To the latter point about interpolation, though - the authors encourage users who use ML for discovery not to focus on interpolation. One of the most critical functions of a well-trained ML model is its ability to extrapolate (or lack thereof). If we already have tons of data that completely covers all possible feature and target space, we don’t need to do any original discovery because we already have all the possibilities. Since interpolation is inherently an easier task, model performance on these data splits tends to be quite good. However, if the goal is to apply these models to new types of data (such as new regions of chemical space for drug or materials discovery, or predicting the number of calories for a new menu item in the toy example here), interpolation splits will tend to yield overly optimistic results. For this goal, having a separate testing set is not enough; this set must be different from the data used for training so we can evaluate how the model performs when given new/different data. Thus, we need to measure how well your model can extrapolate. To learn how to do so, read Part 2, with some additional discussion found in the references below:

  • Bryce Meredig et al. “Can Machine Learning Identify the Next High-Temperature Superconductor? Examining Extrapolation Performance for Materials Discovery”. In: Mol. Syst. Des. Eng. 3.5 (2018), pp. 819–825.
  • Samantha Durdy et al. “Random Projections and Kernelised Leave One Cluster Out Cross Validation: Universal Baselines and Evaluation Tools for Supervised Machine Learning of Material Properties”. In: Digital Discovery 1 (2022), pp. 763–778.

Part 2. Comparing Extrapolative Splits

This next class of algorithms takes a fundamentally different approach to data partitioning. Interpolation samplers operate one data point at a time (sometimes two, for the first step only) and serially select from the dataset to build the training, validation, and testing sets. When enforcing extrapolation, we instead start by clustering the data. This process uses distance metrics to determine groups of points, called clusters, that are close to one another in space. Every cluster is then meant to represent points that are ‘similar’ to one another and that likely have similar target values since they occupy similar feature space.

After these clusters are determined by some algorithm, we assign different clusters to the training, validation, and testing sets. In doing so, we can guarantee that the model will not see all of the possible feature space in the training set. When we go to tune its performance on the validation set and then evaluate performance on the testing set, we will be measuring the model’s capacity to extrapolate to clusters of data points that are dissimilar to those in the training set. This is a much harder task, and we might expect lower performance when extrapolating with a model that was intended for interpolation, but we will get to that in Part 3. For now, let’s compare how different algorithms enforce extrapolation.

K-Means

The K-Means algorithm attempts to cluster data into k clusters such that every point in the dataset belongs to the cluster with the closest average value. The ideal configuration of clusters is the one which minimizes the variance within each cluster. In plain English, K-Means tries to group all the data based on which points ‘look like’ one another to the greatest extent possible.

The precise solution to this problem is easy to verify yet difficult to directly calculate (NP-Hard for the computer science theorists in the crowd). Packages like sklearn provide implementations that quickly get very close, which I invite the curious to read more about here.
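To see what the clustering step looks like on its own, you could run sklearn’s KMeans directly on our features (a standalone illustration - astartes handles both the clustering and the subsequent assignment of clusters to splits for us):

from sklearn.cluster import KMeans

features = menu[["Protein", "Carbohydrates", "Fat"]].to_numpy()
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(features)
print(kmeans.labels_)  # one cluster id (0 through 4) per menu item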

astartes uses K-Means clustering as one way of enforcing extrapolation. This function call also shows a new feature of astartes - the hopts argument, which stands for hyperparameter options. This dictionary contains keyword arguments that are passed to the sampling algorithm when it executes. For the K-Means approach, we are going to customize the number of clusters that it makes. Each sampler has different hopts and you can find them in the astartes documentation.

The clustering algorithms also add a new group of return values to train_val_test_split, the cluster labels. Each point in the data will be assigned a cluster, and these labels can be compared to the known labels for the data (in this case the Category of food item) to qualitatively describe how performant the clustering approach was.

(
    kmeans_X_train,
    kmeans_X_val,
    kmeans_X_test,
    kmeans_y_train,
    kmeans_y_val,
    kmeans_y_test,
    kmeans_clusters_train,
    kmeans_clusters_val,
    kmeans_clusters_test,
    kmeans_idxs_train,
    kmeans_idxs_val,
    kmeans_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="kmeans",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
    hopts=dict(n_clusters=5),
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.62. Requested validation size of 0.25, got 0.14. Requested test size of 0.25, got 0.14. 

Note: One shortfall of clustering algorithms on small datasets is that it can be difficult to satisfy your exact requested split sizes if the clusters that are identified happen to be too large. Watch out for helpful warnings from astartes!

scatter3d(menu, (kmeans_idxs_train, kmeans_idxs_val, kmeans_idxs_test))

It’s clear that the K-Means sampling grouped together many of our low-calorie examples into large clusters and assigned them to training, forcing hard extrapolation into the high-calorie region. Let’s see what the clusters look like:

from collections import Counter

print(
    "Training Data: ",
    Counter(menu["Category"].iloc[kmeans_idxs_train]),
    kmeans_clusters_train,
)
print(
    "Testing Data:",
    Counter(menu["Category"].iloc[kmeans_idxs_test]),
    kmeans_clusters_test,
)
print(
    "Validation Data:",
    Counter(menu["Category"].iloc[kmeans_idxs_val]),
    kmeans_clusters_val,
)
Training Data:  Counter({'Breakfast': 27, 'Burgers': 11, 'Chicken': 10}) [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
Testing Data: Counter({'Burgers': 8, 'Breakfast': 2, 'Chicken': 1}) [3 3 2 2 2 2 2 2 2 2 2]
Validation Data: Counter({'Burgers': 7, 'Chicken': 7, 'Breakfast': 4}) [4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]

The training data is composed of two large clusters (labeled 0 and 1) that are primarily the breakfast category. The testing and validation data are composed of mostly burgers and a mixture of burgers and chicken, respectively. This is the key thrust of extrapolation: we will mostly be showing the model breakfast food items when training, but when we evaluate the performance, we ask it to look at different clusters that contain a higher proportion of the other food categories. As programmers we know that they should all follow the same underlying distribution, and if the model is truly ‘learning’ it should be able to predict successfully for each category.

Sphere Exclusion

The Sphere Exclusion algorithm is conceptually easy to understand and produces distinct and highly tunable results compared to other approaches. To begin, we select an arbitrary starting point in the dataset, which becomes the first cluster center. From there, we group all the points in the dataset that are within the distance_cutoff of that center into the cluster. A new unassigned point is then chosen as the next center, and this procedure is repeated until all data has been assigned to a cluster; a minimal sketch of this loop follows below.
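Here is that loop written out in plain numpy (an illustrative sketch only - the astartes sampler differs in details such as how the cutoff is normalized and how centers are chosen):

import numpy as np
from scipy.spatial.distance import cdist


def sphere_exclusion_clusters(X, distance_cutoff):
    dist = cdist(X, X)  # pairwise distances between all points
    labels = np.full(len(X), -1)  # -1 means "not yet assigned to a cluster"
    current_cluster = 0
    while (labels == -1).any():
        center = np.flatnonzero(labels == -1)[0]  # next unassigned point becomes a center
        members = (dist[center] <= distance_cutoff) & (labels == -1)
        labels[members] = current_cluster  # sweep nearby unassigned points into this cluster
        current_cluster += 1
    return labels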

Because users directly control the distance_cutoff, the size of the resulting clusters can be tuned carefully. With a larger cutoff, more data will be grouped into each cluster (and vice versa). Let’s see visually how it differs from the other samplers:

(
    spex_X_train,
    spex_X_val,
    spex_X_test,
    spex_y_train,
    spex_y_val,
    spex_y_test,
    spex_clusters_train,
    spex_clusters_val,
    spex_clusters_test,
    spex_idxs_train,
    spex_idxs_val,
    spex_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="sphere_exclusion",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
    hopts=dict(
        # normalized between zero and one
        distance_cutoff=0.25,
    ),
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.57. Requested validation size of 0.25, got 0.23. Requested test size of 0.25, got 0.23. 
scatter3d(menu, (spex_idxs_train, spex_idxs_val, spex_idxs_test))

Rotate this plot around and zoom in near the ‘edges’ of the data in space. Sphere Exclusion does a fantastic job of sampling this region of the dataset, placing the center-most data points into the training set and forcing the model to extrapolate in multiple directions.

If we look at the actual labels that are in each of the sets (the clusters that the data belong to):

print("Training Clusters:", spex_clusters_train)
print("Validation Clusters:", spex_clusters_val)
print("Testing Clusters:", spex_clusters_test)
Training Clusters: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0]
Validation Clusters: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Testing Clusters: [4 4 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3]

we can see that the largest, center-most cluster was sorted into the training set and the remaining points went into smaller clusters around it. There is good separation between the different clusters in space, and we expect this extrapolation to be more challenging than typical interpolation.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a highly-cited algorithm that can be difficult to control but incredibly useful for data clustering. When data is distributed non-linearly in feature space, DBSCAN excels at grouping together those points without including non-members. Another unique feature is the ‘noise’ label that DBSCAN will apply to some data that it deems to be outliers, often useful for empirical scientific data.

The eps parameter shown in the hopts dictionary is absolutely critical to the function of the algorithm. It determines what constitutes noise in the dataset, and setting it inappropriately can cause the entire dataset to be either marked as noise or grouped into a single cluster. astartes will warn users when an improperly configured call to DBSCAN leads to improperly filled training, testing, or validation sets.
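To build intuition for eps before handing things over to astartes, you can run sklearn’s DBSCAN directly on the features (purely illustrative; astartes takes care of the clustering and the split assignment for us):

from sklearn.cluster import DBSCAN

features = menu[["Protein", "Carbohydrates", "Fat"]].to_numpy()
clustering = DBSCAN(eps=8.8).fit(features)
print(clustering.labels_)  # cluster id per item; -1 marks points considered noise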

Let’s visualize the splits themselves:

(
    dbscan_X_train,
    dbscan_X_val,
    dbscan_X_test,
    dbscan_y_train,
    dbscan_y_val,
    dbscan_y_test,
    dbscan_clusters_train,
    dbscan_clusters_val,
    dbscan_clusters_test,
    dbscan_idxs_train,
    dbscan_idxs_val,
    dbscan_idxs_test,
) = train_val_test_split(
    menu[["Protein", "Carbohydrates", "Fat"]].to_numpy(),
    menu["Calories"].to_numpy(),
    sampler="dbscan",
    train_size=0.5,
    val_size=0.25,
    test_size=0.25,
    return_indices=True,
    hopts=dict(
        eps=8.8,
    ),
)
/home/jackson/mambaforge/envs/usrse23/lib/python3.11/site-packages/astartes/main.py:334: ImperfectSplittingWarning:

Actual train/test split differs from requested size. Requested train size of 0.50, got 0.62. Requested validation size of 0.25, got 0.22. Requested test size of 0.25, got 0.22. 
scatter3d(menu, (dbscan_idxs_train, dbscan_idxs_val, dbscan_idxs_test))

Part of the data assigned to testing and validation lies well outside of the training set, while other points are much closer but still separated in space. This may not be as challenging an extrapolation task as we saw from the other algorithms, but it will still give a more informative performance measure than interpolation.

Choosing an Extrapolative Sampler

All extrapolative samplers in astartes should provide a different and interesting challenge for model evaluation, and the authors encourage users to try more than one sampler in their modeling. The three shown in this demonstration are broadly applicable and good starting points, and a complete list of implemented samplers can be found on the astartes GitHub.

In general, though, the K-Means algorithm is a tried-and-true method for the efficient clustering of data and is recommended by the authors as an excellent starting (and likely ending) point for extrapolative sampling. For large datasets it may not be possible to use other approaches - Sphere Exclusion, for example, requires calculating the entire pairwise distance matrix, and without sufficient memory it may not be possible to do so. Because K-Means never needs to hold this matrix in memory, it will be faster and, in some cases, the only option.

For applications where Sphere Exclusion or DBSCAN have been found to be highly effective, it is of course prudent to try those approaches directly. As mentioned above, DBSCAN is known to effectively identify non-linear clusters in space and is therefore appropriate for ML on such datasets.

Part 3. Comparing Model Performance

Since we know from our understanding of the problem that the underlying trend in this data is linear, it would make the most sense to train a simple linear model. However, as a demonstration, we are going to use a different modeling approach that is common in the ML space: Random Forest Regression. This approach aggregates huge numbers of decision trees into an ensemble (forest) and uses their collective predictions as a means of estimation. Random Forest Regression has a well-earned reputation for robustness and flexibility and is often a first attempt model in ML studies.

Here we will defer to the default configuration options for the Random Forest, but I invite you to check out the sklearn docs for a complete description of what can be modified.

Interpolation

Let’s start by organizing our data into structures that are easy to loop through:

algorithm_name = (
    "Random",
    "Kennard-Stone",
    "SPXY",
)
training_data = (
    (random_X_train, random_y_train),
    (ks_X_train, ks_y_train),
    (spxy_X_train, spxy_y_train),
)
validation_data = (
    (random_X_val, random_y_val),
    (ks_X_val, ks_y_val),
    (spxy_X_val, spxy_y_val),
)
testing_data = (
    (random_X_test, random_y_test),
    (ks_X_test, ks_y_test),
    (spxy_X_test, spxy_y_test),
)

and then train some models to compare the performance across different interpolation approaches.

First, let’s get the boilerplate setup out of the way:

from sklearn.ensemble import RandomForestRegressor

# tabulate provides nicely formatted tabular output
from tabulate import tabulate

Now for each split approach, we train a RandomForestRegressor on the training set and check its performance on the validation set. From there, we would normally manually try to improve the performance on the validation set by identifying important features, changing the representation, and/or tuning the model hyperparameters. To keep things brief, we are just going to use a for loop to find the best number of n_estimators. After this is done, we follow best practices in ML by using the testing set (also known as the holdout set) for a final round of checking to see if our model is (1) accurate and (2) not over-fit to the training and validation sets. For more on these best practices, see the references below:

  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  • Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media, Inc., 2016.
  • Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Inc., 2019.
  • Bharath Ramsundar et al. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O’Reilly Media, Inc., 2019.

def train_models(names, training_sets, validation_sets, testing_sets):
    headers = [
        "Sampler",
        "Baseline Validation R^2",
        "Optimized Validation R^2",
        "Optimized Testing R^2",
    ]
    table_entries = []
    for name, training, validation, testing in zip(
        names, training_sets, validation_sets, testing_sets
    ):
        table_row = [name]
        # baseline model with default hyperparameters
        rfr = RandomForestRegressor(random_state=42)
        rfr.fit(*training)
        table_row.append(rfr.score(*validation))

        # tune this model by sweeping the number of trees in the forest
        tuned_rfr, high_score = None, None
        for n_estimators in np.arange(10, 100, step=10):
            rfr = RandomForestRegressor(random_state=42, n_estimators=n_estimators)
            rfr.fit(*training)
            if high_score is None or rfr.score(*validation) > high_score:
                high_score = rfr.score(*validation)
                tuned_rfr = rfr
        table_row.append(tuned_rfr.score(*validation))

        # evaluate the tuned model on the held-out testing set
        table_row.append(tuned_rfr.score(*testing))
        table_entries.append(table_row)

    print(tabulate(table_entries, headers=headers, floatfmt=".3f"))
train_models(algorithm_name, training_data, validation_data, testing_data)
Sampler          Baseline Validation R^2    Optimized Validation R^2    Optimized Testing R^2
-------------  -------------------------  --------------------------  -----------------------
Random                             0.971                       0.971                    0.930
Kennard-Stone                      0.988                       0.988                    0.975
SPXY                               0.991                       0.992                    0.975

SPXY did the best job in preparing the training data to make predictions easier. During training, the model had seen the absolute limits of what the possible feature and target space could look like and had no problems inferring other results.

In all cases, the models performed better on the validation set than on the testing set, but testing performance was still quite good. We struggled to make any improvements when tuning because the models were already quite accurate!

At this point, one might assume they are done with modeling and have successfully done ML - and that might be the case if the model is only ever used for interpolation. If the goal is to extrapolate into new space and drive discovery, we should also try this modeling approach on the extrapolative training, testing, and validation sets.

Extrapolation

Let’s repeat the above process with the splits we generated earlier. This is not a perfect comparison to the interpolative samplers because our clustering resulted in larger-than-requested training sets, but if anything that should be advantageous for the models.

algorithm_name = (
    "K-Means",
    "Sphere Exclusion",
    "DBSCAN",
)
training_data = (
    (kmeans_X_train, kmeans_y_train),
    (spex_X_train, spex_y_train),
    (dbscan_X_train, dbscan_y_train),
)
validation_data = (
    (kmeans_X_val, kmeans_y_val),
    (spex_X_val, spex_y_val),
    (dbscan_X_val, dbscan_y_val),
)
testing_data = (
    (kmeans_X_test, kmeans_y_test),
    (spex_X_test, spex_y_test),
    (dbscan_X_test, dbscan_y_test),
)

train_models(algorithm_name, training_data, validation_data, testing_data)
Sampler             Baseline Validation R^2    Optimized Validation R^2    Optimized Testing R^2
----------------  -------------------------  --------------------------  -----------------------
K-Means                              -1.278                      -1.267                  -12.565
Sphere Exclusion                     -1.495                      -1.147                    0.631
DBSCAN                                0.628                       0.656                    0.949

Disaster strikes.

What we see in the above table is a total failure of the model, with an exception for the DBSCAN test set where we got lucky. We took the exact same modeling approach as before, but now our results are terrible.

As you may have guessed, Random Forest Regression is inherently incapable of extrapolation. This is a fundamental limitation of the method and of decision trees in general.
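You can convince yourself of this with a quick toy example (not part of the original analysis above): fit a forest to a simple linear trend and ask it to predict beyond the training range, and the predictions simply flatten out near the largest value seen during training.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 2 * X_demo.ravel()  # a perfectly linear trend
rf_demo = RandomForestRegressor(random_state=42).fit(X_demo, y_demo)
print(rf_demo.predict([[5.0], [20.0], [100.0]]))  # ~10 for 5, but ~20 for both 20 and 100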

It may then seem disingenuous to show the table above and laud astartes as the real fix. But the point made here is that, without rigorously checking (1) whether any new data that a model is used to predict on is actually within the training data (i.e. interpolative) and (2) what the performance of the model is in extrapolation, the user might be inadvertently using a model that is this inaccurate.

Bear in mind that deciding whether new data is interpolative or not is rarely as cut-and-dried as it is here. In this simple 3D example, we can plot the entire dataset to see for ourselves where the sets are, but in high-dimensional problems this is not an option.
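One rough heuristic for high-dimensional problems (a sketch for illustration, not a feature of astartes) is to compare a new point’s distance to its nearest training neighbor against the nearest-neighbor distances observed within the training set itself:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# fit on the K-Means training features from earlier
nn = NearestNeighbors(n_neighbors=2).fit(kmeans_X_train)
# typical nearest-neighbor distance within the training set (column 0 is the self-distance)
train_nn_dists = nn.kneighbors(kmeans_X_train)[0][:, 1]
threshold = np.percentile(train_nn_dists, 95)

new_item = np.array([[71.0, 110.0, 84.0]])  # a hypothetical, very large menu item
new_dist = nn.kneighbors(new_item, n_neighbors=1)[0][0, 0]
print("likely extrapolation" if new_dist > threshold else "likely interpolation")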

So then, let’s take this newfound knowledge about decision trees and move on to a new model. Let’s use something we might expect to extrapolate better - the infamous neural network (in this case a Multi-layer Perceptron):

from sklearn.neural_network import MLPRegressor


table_entries = []
for name, training, testing, validation in zip(
    algorithm_name, training_data, testing_data, validation_data
):
    table_row = [name]
    mlpr = MLPRegressor(
        random_state=42,
        max_iter=1000,
        # prevent over-fitting to the training data
        early_stopping=True,
    )
    mlpr.fit(*training)
    table_row.append(mlpr.score(*validation))

    # tune this model
    tuned_mlpr, high_score = None, None
    for validation_fraction in (0.12, 0.08, 0.04):
        mlpr = MLPRegressor(
            random_state=42,
            max_iter=1000,
            early_stopping=True,
            validation_fraction=validation_fraction,
        )
        mlpr.fit(*training)
        if high_score is None or mlpr.score(*validation) > high_score:
            high_score = mlpr.score(*validation)
            tuned_mlpr = mlpr
    table_row.append(tuned_mlpr.score(*validation))
    
    # test this tuned model
    table_row.append(tuned_mlpr.score(*testing))
    table_entries.append(table_row)


headers = ["Sampler", "Baseline Validation R^2","Optimized Validation R^2", "Optimized Testing R^2"]
print(tabulate(table_entries, headers=headers, floatfmt=".3f"))
Sampler             Baseline Validation R^2    Optimized Validation R^2    Optimized Testing R^2
----------------  -------------------------  --------------------------  -----------------------
K-Means                               0.904                       0.934                    0.900
Sphere Exclusion                      0.806                       0.822                    0.986
DBSCAN                                0.984                       0.986                    0.999

Now that’s more what we might expect to see. Our Neural Network with extrapolative splits is not as performant as the Random Forest with interpolative splits, but now we have a more realistic expectation of what our model is truly capable of doing. Perhaps more importantly, we can tune the model to perform better in extrapolation and then use it for new discovery. This step is left as an exercise to the curious programmer.

For completeness, what would happen if we used a linear model (which we know matches the underlying trend)?

from sklearn.linear_model import LinearRegression


headers = ["Sampler","Validation R^2", "Testing R^2"]
table_entries = []
for name, training, testing, validation in zip(
    algorithm_name, training_data, testing_data, validation_data
):
    table_row = [name]
    linreg = LinearRegression()
    linreg.fit(*training)
    table_row.append(linreg.score(*validation))
    # skip optimization since there is nothing to optimize
    table_row.append(linreg.score(*testing))
    table_entries.append(table_row)

print(tabulate(table_entries, headers=headers, floatfmt=".3f"))
Sampler             Validation R^2    Testing R^2
----------------  ----------------  -------------
K-Means                      0.989          0.984
Sphere Exclusion             0.997          1.000
DBSCAN                       0.992          1.000

Nearly perfect performance across the board, in some cases even perfect within our number of significant figures.

We can even check what the coefficients of the model are to see if they match our knowledge (4 cal/g carbohydrate or protein & 9 cal/g fat):

print(
    "Protein={:.2f} cal/g Carbohydrates={:.2f} cal/g Fat={:.2f} cal/g".format(
        *linreg.coef_
    )
)
Protein=3.91 cal/g Carbohydrates=3.94 cal/g Fat=9.13 cal/g

…which they do almost perfectly. It’s easy to throw fancy models at problems, but it pays to understand the trends behind the data wherever possible!

Part 4. Conclusions

Hopefully this demonstration has proved that interpolation and extrapolation are of utmost concern for rigorous machine learning modeling. We saw in Part 0 how data can be spread out in feature space, and that we can often draw important conclusions from just these initial human checks. In Part 1 we compared different approaches to enforcing interpolative splitting for inference with machine learning. Graphically, interpolation tries to sample the limits of feature space so that the model is never ‘surprised’ by the testing or validation sets. Part 2 demonstrated extrapolation and the relationship between the clusters the algorithms find and the labels we assign to the data. Finally, in Part 3 the consequences of not addressing the interpolation versus extrapolation question were brought to light when our model crashed and burned after being asked to extrapolate. Until this point it has been difficult to properly address this issue for lack of a software tool. The authors hope that astartes can help bridge this gap.

Presented at USRSE2023, Chicago, IL, October 16-18, 2023.