astartes.samplers.extrapolation package

Submodules

astartes.samplers.extrapolation.dbscan module

class astartes.samplers.extrapolation.dbscan.DBSCAN(X, y, labels, configs)

Bases: AbstractSampler

astartes.samplers.extrapolation.kmeans module

class astartes.samplers.extrapolation.kmeans.KMeans(X, y, labels, configs)

Bases: AbstractSampler

astartes.samplers.extrapolation.molecular_weight module

This sampler partitions the data based on molecular weight. It first sorts the molecules by molecular weight and then places the smallest molecules in the training set, the next smallest in the validation set if applicable, and finally the largest molecules in the testing set.
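The ordering described above can be illustrated with a minimal sketch. The helper `split_by_weight` and the precomputed weight table are hypothetical and are not part of astartes; the weights are approximate values in g/mol.

```python
# Approximate molecular weights in g/mol, assumed to be precomputed.
weights = {
    "glucose": 180.16,
    "methane": 16.04,
    "caffeine": 194.19,
    "benzene": 78.11,
    "ethanol": 46.07,
}


def split_by_weight(weights, n_train, n_val=0):
    """Hypothetical helper: the lightest molecules go to training, the heaviest to testing."""
    ordered = sorted(weights, key=weights.get)  # names sorted by ascending weight
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


train, val, test = split_by_weight(weights, n_train=3)
```

With no validation set requested, the three lightest molecules form the training set and the two heaviest form the testing set.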

class astartes.samplers.extrapolation.molecular_weight.MolecularWeight(X, y, labels, configs)

Bases: TargetProperty

astartes.samplers.extrapolation.optisim module

The Optimizable K-Dissimilarity Selection (OptiSim) algorithm, as originally described by Clark (https://pubs.acs.org/doi/full/10.1021/ci970282v), adapted to work for arbitrary distance metrics.

The original algorithm:

  1. Initialization

     • Take a featurized dataset and select an arbitrary starting data point for the selection set.

     • Treat the remaining data as ‘candidates’.

     • Create an empty ‘recycling bin’.

     • Create an empty subsample set.

     • Create an empty selection set.

  2. Remove a random point from the candidates.

     • If its similarity to any member of the selection set is greater than a given cutoff (or, conversely, if it is within a cutoff distance), recycle it; otherwise, add it to the subsample set.

  3. Repeat Step 2 until one of two conditions is met:

     a. The subsample reaches the pre-determined maximum size K, or

     b. The candidates are exhausted.

  4. If Step 3 resulted in condition b, move all data from the recycling bin back to the candidates and go to Step 2.

  5. If the subsample is empty, quit (all remaining candidates are similar; the most dissimilar data points have already been identified).

  6. Pick the most dissimilar point in the subsample (relative to the data points already in the selection set) and add it to the selection set.

  7. Move the remaining points in the subsample to the recycling bin.

  8. If size(selection set) is sufficient, quit. Otherwise, go to Step 2.

As suggested in the original paper, the members of the selection set are then used as cluster centers, and we assign every element in the dataset to belong to the cluster containing the selection set member to which it is the most similar. To implement this step, use scipy.spatial.distance.cdist.
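The cluster assignment step can be sketched with cdist as follows. This is an illustration rather than the astartes implementation; the helper name `assign_to_centers` and the toy data are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist


def assign_to_centers(X, center_indices, metric="euclidean"):
    """Hypothetical helper: label each row of X with its nearest selected center."""
    # distances has shape (n_samples, n_centers)
    distances = cdist(X, X[center_indices], metric=metric)
    # The label is the position of the closest center in center_indices.
    return np.argmin(distances, axis=1)


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 4.9]])
labels = assign_to_centers(X, center_indices=[0, 2])
```

Because cdist accepts a `metric` argument, the same assignment works for distance metrics other than Euclidean.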

Note: as written, this algorithm could loop forever if the subsample is not full and every remaining candidate is within the cutoff and therefore cannot be added. A stop condition may be needed here, unless emptying the recycling bin somehow resolves this. It is also possible that a partially filled subsample never occurs after a full pass over the dataset, since it is more probable that ALL of the points would be rejected and the subsample would be empty.

A likely fix is to check whether any more points can fit into the subsample, and exit if none can.
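One way to realize such a stop condition is sketched below. This is not the astartes implementation: the function `optisim_select`, its parameters, the Euclidean distance, and the maximin reading of "most dissimilar" are all assumptions made for illustration. The recycling bin is emptied back into the candidates at most once per round, so each round ends after every remaining point has been examined.

```python
import numpy as np


def optisim_select(X, n_select, max_subsample_size, distance_cutoff, seed=0):
    """Hypothetical OptiSim-style selection that is guaranteed to terminate."""
    rng = np.random.default_rng(seed)

    def min_dist_to_selection(i, selection):
        return min(float(np.linalg.norm(X[i] - X[s])) for s in selection)

    start = int(rng.integers(len(X)))  # arbitrary starting point
    selection = [start]
    candidates = [i for i in range(len(X)) if i != start]
    recycling = []
    while len(selection) < n_select:
        subsample, refilled = [], False
        while len(subsample) < max_subsample_size:
            if not candidates:
                if refilled or not recycling:
                    break  # stop condition: nothing new is left to examine
                candidates, recycling = recycling, []
                refilled = True
                continue
            point = candidates.pop(int(rng.integers(len(candidates))))
            if min_dist_to_selection(point, selection) < distance_cutoff:
                recycling.append(point)  # too similar to an already selected point
            else:
                subsample.append(point)
        if not subsample:
            break  # all remaining points are within the cutoff
        best = max(subsample, key=lambda p: min_dist_to_selection(p, selection))
        selection.append(best)
        subsample.remove(best)
        recycling.extend(subsample)  # unchosen points may be reconsidered later
    return selection


points = np.array([[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [20, 0]], dtype=float)
selected = optisim_select(points, n_select=3, max_subsample_size=2, distance_cutoff=1.0)
```

Requesting more points than the cutoff allows simply returns the smaller selection instead of looping.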

class astartes.samplers.extrapolation.optisim.OptiSim(X, y, labels, configs)

Bases: AbstractSampler

get_dist(i, j)

Calculates pdist and returns distance between two samples

move_item(item, source_set, destintation_set)

Moves item from source_set to destination_set

rchoose(set)

Choose a random element from a set with self._rng

astartes.samplers.extrapolation.scaffold module

This sampler partitions the data based on the Bemis-Murcko scaffold function as implemented in RDKit.

Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.

Landrum, G. et al. RDKit: Open-Source Cheminformatics; 2006; https://www.rdkit.org.

The goal is to cluster molecules that share the same scaffold. These clusters are later assigned to the training, validation, and testing sets, producing splits that measure extrapolation by testing on scaffolds that do not appear in the training set.
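The grouping idea can be sketched without RDKit by assuming the scaffold SMILES strings have already been computed. The helpers `group_by_scaffold` and `split_groups`, and the greedy largest-group-first assignment, are illustrative assumptions and do not describe how astartes assigns clusters.

```python
from collections import defaultdict


def group_by_scaffold(scaffolds):
    """Map each unique scaffold string to the indices of molecules that have it."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    return dict(groups)


def split_groups(groups, train_fraction=0.8):
    """Assign whole scaffold groups to train or test (greedy, largest group first)."""
    n_total = sum(len(members) for members in groups.values())
    train, test = [], []
    for _, members in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(train) + len(members) <= train_fraction * n_total:
            train.extend(members)
        else:
            test.extend(members)  # the group is never divided between sets
    return train, test


# Assumed precomputed scaffolds: benzene, pyridine, and cyclohexane rings.
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1"]
groups = group_by_scaffold(scaffolds)
train, test = split_groups(groups, train_fraction=0.7)
```

Because groups are assigned whole, no scaffold in the testing set also appears in the training set.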

class astartes.samplers.extrapolation.scaffold.Scaffold(X, y, labels, configs)

Bases: AbstractSampler

generate_bemis_murcko_scaffold(include_chirality=False)

Compute the Bemis-Murcko scaffold for an RDKit molecule.

Params:

mol: A smiles string or an RDKit molecule.

include_chirality: Whether to include chirality.

Returns:

Bemis-Murcko scaffold

scaffold_to_smiles(mols)

Computes scaffold for each smiles string and returns a mapping from scaffolds to sets of smiles.

Params:

mols: A list of smiles strings or RDKit molecules.

Returns:

A dictionary mapping each unique scaffold to all smiles (or smiles indices) which have that scaffold.

str_to_mol()

Converts an InChI or SMILES string to an RDKit molecule.

Params:

string: The InChI or SMILES string.

Returns:

An RDKit molecule.

astartes.samplers.extrapolation.sphere_exclusion module

The Sphere Exclusion clustering algorithm.

This re-implementation draws on a post from the RDKit blog, abstracted to work for arbitrary feature vectors: http://rdkit.blogspot.com/2020/11/sphere-exclusion-clustering-with-rdkit.html

It also draws on this paper: https://www.daylight.com/cheminformatics/whitepapers/ClusteringWhitePaper.pdf

Instead of Tanimoto similarity, whose values lie between zero and one, it uses Euclidean distance so that vectors with arbitrary values can be processed.
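A simple leader-style variant of sphere exclusion with Euclidean distance might look like the sketch below. The function `sphere_exclusion` and its `radius` parameter are assumptions for illustration, and the way astartes picks sphere centers may differ.

```python
import numpy as np


def sphere_exclusion(X, radius, seed=0):
    """Hypothetical sketch: cluster points into spheres of a fixed Euclidean radius."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), -1)  # -1 marks points that are not yet assigned
    cluster = 0
    for center in rng.permutation(len(X)):
        if labels[center] != -1:
            continue  # already excluded by an earlier sphere
        distances = np.linalg.norm(X - X[center], axis=1)
        # Claim every unassigned point inside the sphere around this center.
        labels[(distances <= radius) & (labels == -1)] = cluster
        cluster += 1
    return labels


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = sphere_exclusion(points, radius=1.0)
```

Unlike Tanimoto similarity, Euclidean distance is unbounded, so a sensible `radius` depends on the scale of the features.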

class astartes.samplers.extrapolation.sphere_exclusion.SphereExclusion(X, y, labels, configs)

Bases: AbstractSampler

astartes.samplers.extrapolation.target_property module

This sampler partitions the data based on the regression target y. It first sorts the data by y value and then assigns the smallest (or largest) y values to the training set, the next smallest (or largest) y values to the validation set, and the largest (or smallest) y values to the testing set.
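The sort-then-slice behavior can be sketched with NumPy as follows. The function `sorted_split` and its `descending` flag are hypothetical names used only for illustration and are not the astartes API.

```python
import numpy as np


def sorted_split(y, train_size=0.6, val_size=0.2, descending=False):
    """Hypothetical helper: split indices by sorted target value."""
    order = np.argsort(y)  # indices from smallest to largest y
    if descending:
        order = order[::-1]  # train on the largest values instead
    n_train = int(round(train_size * len(order)))
    n_val = int(round(val_size * len(order)))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


y = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5, 8.0, 7.0])
train, val, test = sorted_split(y)
```

With the default ascending order, every training target is no larger than every validation target, which in turn is no larger than every testing target.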

class astartes.samplers.extrapolation.target_property.TargetProperty(X, y, labels, configs)

Bases: AbstractSampler

astartes.samplers.extrapolation.time_based module

class astartes.samplers.extrapolation.time_based.TimeBased(X, y, labels, configs)

Bases: AbstractSampler

Module contents