astartes.samplers.extrapolation package
Submodules
astartes.samplers.extrapolation.dbscan module
- class astartes.samplers.extrapolation.dbscan.DBSCAN(X, y, labels, configs)
Bases:
AbstractSampler
astartes.samplers.extrapolation.kmeans module
- class astartes.samplers.extrapolation.kmeans.KMeans(X, y, labels, configs)
Bases:
AbstractSampler
astartes.samplers.extrapolation.molecular_weight module
This sampler partitions the data based on molecular weight. It first sorts the molecules by molecular weight and then places the smallest molecules in the training set, the next smallest in the validation set if applicable, and finally the largest molecules in the testing set.
- class astartes.samplers.extrapolation.molecular_weight.MolecularWeight(X, y, labels, configs)
Bases:
TargetProperty
astartes.samplers.extrapolation.optisim module
The Optimizable K-Dissimilarity Selection (OptiSim) algorithm, as originally described by Clark (https://pubs.acs.org/doi/full/10.1021/ci970282v), adapted to work for arbitrary distance metrics.
The original algorithm:
1. Initialization
   - Take a featurized dataset and select an arbitrary starting data point for the selection set.
   - Treat the remaining data as ‘candidates’.
   - Create an empty ‘recycling bin’.
   - Create an empty subsample set.
   - Create an empty selection set.
2. Remove a random point from the candidates.
   - If it has a similarity greater than a given cutoff to any of the members of the selection set, recycle it (or conversely, if it is within a cutoff distance).
   - Otherwise, add it to the subsample set.
3. Repeat Step 2 until one of two conditions is met:
   a. The subsample reaches the pre-determined maximum size K, or
   b. The candidates are exhausted.
4. If Step 3 resulted in condition b, move all data from recycling bin and go to Step 2.
5. If subsample is empty, quit (all remaining candidates are similar, the most dissimilar data points have already been identified).
6. Pick the most dissimilar (relative to data points already in selection set) point in the subsample and add it to the selection set.
7. Move the remaining points in the subsample to the recycling bin.
8. If size(selection set) is sufficient, quit. Otherwise, go to Step 2.
As suggested in the original paper, the members of the selection set are then used as cluster centers, and we assign every element in the dataset to belong to the cluster containing the selection set member to which it is the most similar. To implement this step, use scipy.spatial.distance.cdist.
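The steps above, followed by the cluster assignment, can be sketched in plain Python. This is a minimal illustration rather than the astartes implementation: the helper names `optisim_select` and `assign_clusters` are hypothetical, Euclidean distance via `math.dist` stands in for an arbitrary metric (and for `scipy.spatial.distance.cdist`) to keep the example dependency-free, and the rule of refilling from the recycling bin at most once per round is an assumed stop condition.

```python
import math
import random


def optisim_select(points, n_select, max_subsample, cutoff, seed=0):
    """Return indices of a diverse selection set (hypothetical helper)."""
    rng = random.Random(seed)
    candidates = list(range(len(points)))
    rng.shuffle(candidates)
    selection = [candidates.pop()]  # Step 1: arbitrary starting point
    recycling, subsample = [], []

    def min_dist(i):
        # Distance from point i to its nearest selected point.
        return min(math.dist(points[i], points[s]) for s in selection)

    while len(selection) < n_select:
        refilled = False
        # Steps 2-3: draw candidates until the subsample is full.
        while len(subsample) < max_subsample:
            if not candidates:
                # Step 4, with an assumed stop condition: refill from the
                # recycling bin at most once per round.
                if refilled or not recycling:
                    break
                candidates, recycling = recycling, []
                rng.shuffle(candidates)
                refilled = True
            i = candidates.pop()
            if min_dist(i) < cutoff:
                recycling.append(i)  # too close to an already selected point
            else:
                subsample.append(i)
        if not subsample:
            break  # Step 5: every remaining point is too similar
        best = max(subsample, key=min_dist)  # Step 6: most dissimilar point
        subsample.remove(best)
        selection.append(best)
        recycling.extend(subsample)  # Step 7
        subsample = []
    return selection


def assign_clusters(points, centers):
    """Label each point with the position of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]


# Three well-separated groups of three points each (made-up data).
points = [(0, 0), (0.1, 0), (0, 0.1),
          (5, 5), (5.1, 5), (5, 5.1),
          (10, 0), (10.1, 0), (10, 0.1)]
selected = optisim_select(points, n_select=3, max_subsample=3, cutoff=1.0)
labels = assign_clusters(points, selected)
```

With this toy data the cutoff is larger than any within-group distance and smaller than any between-group distance, so the three selected points come from three different groups and each group ends up in its own cluster.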
This algorithm seems like it might enter an infinite loop if the subsample is not filled and all of the remaining candidates are within the cutoff and cannot be added. A stop condition might be needed here, unless emptying the recycling bin somehow resolves this. It is also possible that a partially filled subsample never occurs after looking at the full dataset, since it is more probable that ALL the points would be rejected and the subsample would be empty.
The likely fix is to check whether any more points can fit into the subsample, and exit if none can.
- class astartes.samplers.extrapolation.optisim.OptiSim(X, y, labels, configs)
Bases:
AbstractSampler
- get_dist(i, j)
Calculates pdist and returns distance between two samples
- move_item(item, source_set, destintation_set)
Moves item from source_set to destination_set
- rchoose(set)
Choose a random element from a set with self._rng
astartes.samplers.extrapolation.scaffold module
This sampler partitions the data based on the Bemis-Murcko scaffold function as implemented in RDKit.
Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.
Landrum, G. et al. RDKit: Open-Source Cheminformatics; 2006; https://www.rdkit.org.
The goal is to cluster molecules that share the same scaffold. These clusters are later assigned to the training, validation, and testing sets, producing splits that measure extrapolation by testing on scaffolds that are not in the training set.
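The assignment of whole scaffold clusters to splits can be illustrated with a short sketch. It assumes the scaffold of each molecule has already been computed; the scaffold strings below are hand-written placeholders standing in for RDKit output, the helper `split_by_group` is hypothetical, and the greedy largest-group-first rule is one common strategy, not necessarily the one astartes uses.

```python
from collections import defaultdict


def split_by_group(group_keys, n_train, n_val):
    """Assign whole groups to train/val/test (hypothetical helper)."""
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)
    # Largest groups first, so the most common scaffolds land in training.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train, val, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= n_train:
            train.extend(members)
        elif len(val) + len(members) <= n_val:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


# Placeholder scaffold labels, one per molecule (assumed precomputed).
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1"]
train, val, test = split_by_group(scaffolds, n_train=3, n_val=1)
print(train, val, test)  # [0, 1, 2] [3] [4]
```

Because each group is placed as a unit, no scaffold appears in more than one split, which is what makes the test set an extrapolation check.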
- class astartes.samplers.extrapolation.scaffold.Scaffold(X, y, labels, configs)
Bases:
AbstractSampler
- generate_bemis_murcko_scaffold(include_chirality=False)
Compute the Bemis-Murcko scaffold for an RDKit molecule.
- Params:
mol: A SMILES string or an RDKit molecule.
include_chirality: Whether to include chirality.
- Returns:
Bemis-Murcko scaffold
- scaffold_to_smiles(mols)
Computes the scaffold for each SMILES string and returns a mapping from scaffolds to sets of SMILES.
- Params:
mols: A list of SMILES strings or RDKit molecules.
- Returns:
A dictionary mapping each unique scaffold to all SMILES (or SMILES indices) which have that scaffold.
- str_to_mol()
Converts an InChI or SMILES string to an RDKit molecule.
- Params:
string: The InChI or SMILES string.
- Returns:
An RDKit molecule.
astartes.samplers.extrapolation.sphere_exclusion module
The Sphere Exclusion clustering algorithm.
This re-implementation draws on this post from the RDKit blog, abstracted to work for arbitrary feature vectors: http://rdkit.blogspot.com/2020/11/sphere-exclusion-clustering-with-rdkit.html It also draws on this paper: https://www.daylight.com/cheminformatics/whitepapers/ClusteringWhitePaper.pdf
Instead of Tanimoto similarity, which ranges between zero and one, it uses Euclidean distance so that arbitrary-valued vectors can be processed.
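A minimal sketch of sphere exclusion with Euclidean distance is shown here for orientation. It is not the astartes code: the helper `sphere_exclusion` is hypothetical, and taking cluster centers in input order is an assumption made for simplicity (other implementations pick centers differently).

```python
import math


def sphere_exclusion(points, radius):
    """Label each point with a cluster index (hypothetical helper)."""
    labels = [None] * len(points)
    cluster = 0
    for i, center in enumerate(points):
        if labels[i] is not None:
            continue  # already inside an earlier sphere
        labels[i] = cluster
        # Claim every still-unlabeled point inside the sphere around center.
        for j in range(i + 1, len(points)):
            if labels[j] is None and math.dist(center, points[j]) <= radius:
                labels[j] = cluster
        cluster += 1
    return labels


# Made-up 2D feature vectors; any dimensionality works with math.dist.
points = [(0, 0), (0.5, 0), (3, 3), (3.2, 3), (10, 10)]
print(sphere_exclusion(points, radius=1.0))  # [0, 0, 1, 1, 2]
```

The radius plays the role that a similarity threshold plays in the Tanimoto-based version, but it is unbounded, so it has to be chosen relative to the scale of the feature vectors.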
- class astartes.samplers.extrapolation.sphere_exclusion.SphereExclusion(X, y, labels, configs)
Bases:
AbstractSampler
astartes.samplers.extrapolation.target_property module
This sampler partitions the data based on the regression target y. It first sorts the data by y value and then constructs the training set from the smallest (or largest) y values, the validation set from the next smallest (or largest) y values, and the testing set from the largest (or smallest) y values.
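The sort-then-slice idea can be sketched as follows. The function `split_by_target` and its parameters are hypothetical names for illustration, not the astartes API. The MolecularWeight sampler, which derives from TargetProperty, applies the same ordering with molecular weight in place of y.

```python
def split_by_target(y, n_train, n_val, descending=False):
    """Return index lists for train/val/test ordered by target value."""
    # Indices of y sorted by value; descending=True puts the largest first.
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=descending)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


y = [3.2, 0.5, 9.1, 1.7, 6.4]  # made-up target values
print(split_by_target(y, n_train=3, n_val=1))
# ([1, 3, 0], [4], [2])
print(split_by_target(y, n_train=3, n_val=1, descending=True))
# ([2, 4, 0], [3], [1])
```

In the ascending case every test target is at least as large as every training target, so the model is evaluated outside the range it was trained on.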
- class astartes.samplers.extrapolation.target_property.TargetProperty(X, y, labels, configs)
Bases:
AbstractSampler
astartes.samplers.extrapolation.time_based module
- class astartes.samplers.extrapolation.time_based.TimeBased(X, y, labels, configs)
Bases:
AbstractSampler