astartes.utils package

Submodules

astartes.utils.aimsim_featurizer module

class astartes.utils.aimsim_featurizer.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Bases: object

An abstraction of a molecule

mol_graph

Graph-level information of molecule. Implemented as an RDKIT mol object.

Type: RDKIT mol object

mol_text

Text identifier of the molecule.

Type: str

mol_property_val

Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc.

Type: float

descriptor

Vector representation of a molecule. Commonly a fingerprint.

Type: Descriptor object

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None): Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.

get_descriptor_val(): Get the descriptor value as a numpy array.

match_fingerprint_from(reference_mol): Generate the same fingerprint as the reference_mol.

get_similarity_to(target_mol, similarity_measure): Get the similarity to target_mol using a similarity_measure of choice.

get_name(): Get the mol_text attribute.

get_mol_property_val(): Get mol_property_val attribute.

draw(fpath=None, **kwargs): Draw the molecule.

is_same(source_molecule, target_molecule): Static method used to check equivalence of two molecules.

__init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Constructor

Parameters:

mol_graph (RDKIT mol object) – Graph-level information of molecule. Implemented as an RDKIT mol object. Default is None.
mol_text (str) – Text identifier of the molecule. Default is None. Identifiers can be: (1) the name of the molecule, or (2) a SMILES string representing the molecule.
mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc. Default is None.
mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be numpy array or list. Default is None.
mol_src (str) – Source file or SMILES string to load the molecule from. Acceptable files: (1) a .pdb file, or (2) a .txt file with a SMILES string in the first column, first row and (optionally) a property in the second column, first row. Default is None. If provided, loading mol_graph from it is attempted.
mol_smiles (str) – SMILES string for the molecule. If provided, mol_graph is loaded from it. If mol_text is not set as a keyword argument, this string is used to set it.

draw(fpath=None, **kwargs)

Draw the molecule graph.

Parameters:

fpath (str) – Path of the file in which to store the image. If None, the image is displayed in a Tkinter window. Default is None.
kwargs (keyword arguments) – Arguments to modify plot properties.

get_descriptor_val()

Get value of molecule descriptor.

Returns: Value(s) of the descriptor.
Return type: np.ndarray

get_mol_property_val()

get_name()

get_similarity_to(target_mol, similarity_measure)

Get a similarity metric to a target molecule

Parameters:

target_mol (AIMSim.ops Molecule) – Target molecule. Similarity score is with respect to this molecule
similarity_measure (AIMSim.ops SimilarityMeasure) – metric used.

Returns:

Similarity coefficient by the chosen method.

Return type:

similarity_score (float)

Raises:

NotInitializedError – If target_mol has an uninitialized descriptor. See note.

static is_same(source_molecule, target_molecule)

Check if the target_molecule is a duplicate of source_molecule.

Parameters:

source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.
target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns:

True if the molecules are the same.

Return type:

bool

match_fingerprint_from(reference_mol)

If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule (self). If this fails because the mol_graph attribute is absent, a ValueError is raised.

Parameters:

reference_mol (AIMSim.ops Molecule) – Target molecule. Fingerprint of this molecule is used as the reference.

Raises:

ValueError – If the fingerprint cannot be calculated because the mol_graph attribute is absent.

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Sets molecular descriptor attribute.

Parameters:

arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.
fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.
fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.

astartes.utils.aimsim_featurizer.featurize_molecules(molecules, fingerprint, fprints_hopts)

Call AIMSim’s Molecule to featurize the molecules according to the arguments.

Parameters:

molecules (np.array) – SMILES strings or RDKit molecule objects.
fingerprint (str) – The molecular fingerprint to be used.
fprints_hopts (dict) – Hyperparameters for AIMSim.

Returns:

X array (featurized molecules)

Return type:

np.array

astartes.utils.array_type_helpers module

astartes.utils.array_type_helpers.convert_to_array(obj: object, name: str)

Attempt to convert obj named name to a numpy array, with appropriate warnings and exceptions.

Parameters:

obj (object) – The item to attempt to convert.
name (str) – Human-readable name for printing.

astartes.utils.array_type_helpers.panda_handla(X, y, labels)

Helper function to deal with supporting Pandas data types in astartes

Parameters:

X (Dataframe) – Features with column names
y (Series) – Targets
labels (Series) – Labels for data

Returns:

Empty if no pandas types, metadata-filled otherwise

Return type:

dict

astartes.utils.array_type_helpers.return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices, output_is_pandas)

Convenience function to return the requested arrays appropriately.

Parameters:

sampler_instance (sampler) – The fit sampler instance.
train_idxs (np.array) – Indices of the data to use in train.
val_idxs (np.array) – Indices of the data to use in val.
test_idxs (np.array) – Indices of the data to use in test.
return_indices (bool) – Return indices after the value arrays.
output_is_pandas (dict) – metadata about casting to pandas.

Returns:

Either many arrays or indices in arrays.

Return type:

np.array

Notes

This function copies and pastes a lot of code when it could instead use some loop over (X, y, labels, sampler_instance.get_clusters()) but such an implementation is more error prone. This is long and not the prettiest, but it is definitely doing what we want.

astartes.utils.exceptions module

Exceptions used by astartes

exception astartes.utils.exceptions.InvalidConfigurationError(message=None)

Bases: RuntimeError

Used when user-requested split/data would not work.

__init__(message=None)

exception astartes.utils.exceptions.InvalidModelTypeError(message=None)

Bases: RuntimeError

Used when user-provided model is invalid.

__init__(message=None)

exception astartes.utils.exceptions.MoleculesNotInstalledError(message=None)

Bases: RuntimeError

Used when attempting to featurize molecules without the optional molecules dependencies installed.

__init__(message=None)

exception astartes.utils.exceptions.SamplerNotImplementedError(message=None)

Bases: RuntimeError

Used when attempting to call a non-existent sampler.

__init__(message=None)

exception astartes.utils.exceptions.UncastableInputError(message=None)

Bases: RuntimeError

Used when X, y, or labels cannot be cast to a np.array.

__init__(message=None)
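
The exceptions above share one shape: a RuntimeError subclass whose constructor takes an optional message. The sketch below shows how calling code might raise and catch such an error. It does not import astartes: InvalidConfigurationError here is a local stand-in with the documented name and signature, and check_split_sizes and describe_split are hypothetical helpers invented for illustration.

```python
class InvalidConfigurationError(RuntimeError):
    """Local stand-in mirroring the documented signature; not the astartes class."""

    def __init__(self, message=None):
        self.message = message
        super().__init__(message)


def check_split_sizes(n_samples, test_size):
    """Hypothetical validator: reject a split that would leave the test set empty."""
    n_test = int(n_samples * test_size)
    if n_test == 0:
        raise InvalidConfigurationError(
            f"test_size={test_size} yields no test samples for {n_samples} points"
        )
    return n_test


def describe_split(n_samples, test_size):
    """Return a short status string instead of letting the error propagate."""
    try:
        return f"ok: {check_split_sizes(n_samples, test_size)} test samples"
    except RuntimeError as err:
        # Catching the documented base class also catches the subclass.
        return f"failed: {err}"
```

Because every exception listed here derives from RuntimeError, a caller can handle them one by one or with a single `except RuntimeError` clause, as the sketch does.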

astartes.utils.fast_kennard_stone module

astartes.utils.fast_kennard_stone.fast_kennard_stone(ks_distance: ndarray) → ndarray

Implements the Kennard-Stone algorithm

Parameters: ks_distance (np.ndarray) – Distance matrix
Returns: Indices in order of Kennard-Stone selection
Return type: np.ndarray
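
For orientation, the classical Kennard-Stone rule can be sketched as follows: seed the selection with the two points that are farthest apart, then repeatedly add the point whose distance to its nearest already-selected point is largest. This is a minimal, dependency-free illustration using nested lists, not the astartes implementation (which operates on a np.ndarray and is presumably optimized). The name kennard_stone_order and the lowest-index tie-breaking are assumptions of this sketch.

```python
def kennard_stone_order(dist):
    """Return all indices in Kennard-Stone selection order.

    `dist` is a square, symmetric distance matrix given as nested lists.
    """
    n = len(dist)
    if n == 0:
        return []
    if n == 1:
        return [0]

    # Seed with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = set(range(n)) - {first, second}

    # min_dist[k] = distance from point k to its nearest selected point.
    min_dist = [min(dist[k][first], dist[k][second]) for k in range(n)]

    while remaining:
        # Pick the remaining point farthest from the selected set
        # (ties go to the lowest index because of the sort).
        nxt = max(sorted(remaining), key=lambda k: min_dist[k])
        order.append(nxt)
        remaining.remove(nxt)
        for k in remaining:
            min_dist[k] = min(min_dist[k], dist[k][nxt])
    return order


# Four points on a line; distances are absolute differences.
points = [0.0, 1.0, 4.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
order = kennard_stone_order(dist)
```

In this example the two end points (indices 0 and 3) are selected first, followed by index 2 and then index 1.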

astartes.utils.sampler_factory module

class astartes.utils.sampler_factory.SamplerFactory(sampler)

Bases: object

__init__(sampler)

Initialize SamplerFactory and copy a lowercased ‘sampler’ into an attribute.

Parameters: sampler (string) – The desired sampler.

get_sampler(X, y, labels, hopts)

Instantiate (which also performs fitting) and return the sampler.

Parameters:

X (np.array) – Feature array.
y (np.array) – Target array.
labels (np.array) – Label array.
hopts (dict) – Hyperparameters for the sampler.

Raises:

SamplerNotImplementedError – Raised when a non-existent or not yet implemented sampler is requested.

Returns:

The fit sampler instance.

Return type:

astartes.sampler
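
The pattern described above (store a lowercased name, then instantiate the matching class on request) can be sketched as a small name-to-class registry. Everything in this block is illustrative: ToySamplerFactory, FirstNSampler, ReverseSampler, and the registry contents are hypothetical and do not reflect the samplers astartes ships or how it looks them up internally.

```python
class SamplerNotImplementedError(RuntimeError):
    """Local stand-in for the documented exception."""


class FirstNSampler:
    """Hypothetical sampler: 'fits' on construction by recording indices in order."""

    def __init__(self, X, y, labels, hopts):
        self.order = list(range(len(X)))


class ReverseSampler:
    """Hypothetical sampler: records indices in reverse order."""

    def __init__(self, X, y, labels, hopts):
        self.order = list(range(len(X) - 1, -1, -1))


class ToySamplerFactory:
    """Sketch of a name-to-class factory; the registry contents are invented."""

    _registry = {"first_n": FirstNSampler, "reverse": ReverseSampler}

    def __init__(self, sampler):
        # Lowercase once so lookups are case-insensitive.
        self.sampler = sampler.lower()

    def get_sampler(self, X, y, labels, hopts):
        try:
            sampler_class = self._registry[self.sampler]
        except KeyError:
            raise SamplerNotImplementedError(
                f"Sampler '{self.sampler}' is not implemented."
            ) from None
        # Instantiating the class is what performs the fit in this sketch.
        return sampler_class(X, y, labels, hopts)


sampler = ToySamplerFactory("Reverse").get_sampler([10, 20, 30], None, None, {})
```

Keeping the lookup in one place means an unknown name fails with a single, descriptive error rather than at some later point in the split.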

astartes.utils.user_utils module

astartes.utils.user_utils.display_results_as_table(error_dict)

Helper function to print a dictionary as a neat table.

astartes.utils.user_utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})

Helper function to train a sklearn model using the provided data and provided sampler types.

Parameters:

X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.
y (np.array, pd.Series) – Targets corresponding to X, must be of same size.
train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.
val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.
test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.
random_state (int, optional) – The random seed used throughout astartes.
samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.
print_results (bool, optional) – whether to print the resulting dictionary as a neat table
additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions

Returns:

nested dictionary with the format of

{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}

Return type:

dict
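
To make the nested return structure concrete, the sketch below walks a dictionary of that shape and reduces each metric to a single number for one split. The results literal is hand-written placeholder data, not output from astartes, and summarize is a hypothetical helper. What each list holds is not specified above, so the sketch simply averages whatever values are present.

```python
from statistics import mean

# Hand-written placeholder shaped like the documented return value.
results = {
    "random": {
        "mae": {"train": [0.10], "val": [0.20], "test": [0.30]},
        "rmse": {"train": [0.15], "val": [0.25], "test": [0.35]},
        "R2": {"train": [0.90], "val": [0.80], "test": [0.70]},
    },
}


def summarize(results_dict, split="test"):
    """Collapse each metric's list for one split into its mean."""
    return {
        sampler: {
            metric: mean(by_split[split]) for metric, by_split in metrics.items()
        }
        for sampler, metrics in results_dict.items()
    }


summary = summarize(results)
```

The outer key is the sampler name, the middle key is the metric, and the inner key is the split, so comparing samplers on the test split is a matter of reading `summary[sampler][metric]`.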

astartes.utils.warnings module

Warnings used by astartes

exception astartes.utils.warnings.ConversionWarning(message=None)

Bases: RuntimeWarning

Used when passed data is not a numpy array.

__init__(message=None)

exception astartes.utils.warnings.ImperfectSplittingWarning(message=None)

Bases: RuntimeWarning

Used when a sampler cannot match requested splits.

__init__(message=None)

exception astartes.utils.warnings.NoMatchingScaffold(message=None)

Bases: Warning

Used when an RDKit molecule does not match any Bemis-Murcko scaffold and returns an empty string.

__init__(message=None)

exception astartes.utils.warnings.NormalizationWarning(message=None)

Bases: RuntimeWarning

Used when a requested split does not sum to 1.

__init__(message=None)
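
Since these are warning categories rather than errors, calling code can record or filter them with the standard warnings module. The sketch below is a minimal illustration under stated assumptions: NormalizationWarning is a local stand-in with the documented base class, and normalize_split_sizes is a hypothetical helper whose rescaling behavior is invented here, not taken from astartes.

```python
import warnings


class NormalizationWarning(RuntimeWarning):
    """Local stand-in mirroring the documented base class; not the astartes class."""


def normalize_split_sizes(train_size, val_size, test_size):
    """Hypothetical helper: rescale the sizes to sum to 1, warning if they did not."""
    total = train_size + val_size + test_size
    if abs(total - 1.0) > 1e-9:
        warnings.warn(
            f"Requested split sizes sum to {total}, not 1; rescaling.",
            NormalizationWarning,
        )
    return train_size / total, val_size / total, test_size / total


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sizes = normalize_split_sizes(2.0, 1.0, 1.0)

categories = [w.category for w in caught]
```

The same mechanism supports `warnings.simplefilter("error", NormalizationWarning)` for callers who would rather treat a mis-specified split as a hard failure.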

Module contents

astartes.utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})

Helper function to train a sklearn model using the provided data and provided sampler types.

Parameters:

X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.
y (np.array, pd.Series) – Targets corresponding to X, must be of same size.
train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.
val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.
test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.
random_state (int, optional) – The random seed used throughout astartes.
samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.
print_results (bool, optional) – whether to print the resulting dictionary as a neat table
additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions

Returns:

nested dictionary with the format of

{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}

Return type:

dict