astartes.utils package

Submodules

astartes.utils.aimsim_featurizer module

class astartes.utils.aimsim_featurizer.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Bases: object

An abstraction of a molecule

mol_graph

Graph-level information of the molecule. Implemented as an RDKit mol object.

Type:

RDKit mol object

mol_text

Text identifier of the molecule.

Type:

str

mol_property_val

Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc.

Type:

float

descriptor

Vector representation of a molecule. Commonly a fingerprint.

Type:

Descriptor object

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.

get_descriptor_val()

Get the descriptor value as a NumPy array.

match_fingerprint_from(reference_mol)

Generate the same fingerprint as the reference_mol.

get_similarity_to(target_mol, similarity_measure)

Get the similarity to target_mol using a similarity_measure of choice.

get_name()

Get the mol_text attribute.

get_mol_property_val()

Get mol_property_val attribute.

draw(fpath=None, **kwargs)

Draw the molecule.

is_same(source_molecule, target_molecule)

Static method used to check equivalence of two molecules.

__init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Constructor

Parameters:
  • mol_graph (RDKit mol object) – Graph-level information of the molecule. Implemented as an RDKit mol object. Default is None.

  • mol_text (str) – Text identifier of the molecule, either (1) the name of the molecule or (2) a SMILES string representing the molecule. Default is None.

  • mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc. Default is None.

  • mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be numpy array or list. Default is None.

  • mol_src (str) – Source file or SMILES string to load the molecule from. Acceptable files are a .pdb file, or a .txt file with a SMILES string in the first column, first row and (optionally) a property in the second column, first row. Default is None. If provided, an attempt is made to load mol_graph from it.

  • mol_smiles (str) – SMILES string for the molecule. If provided, mol_graph is loaded from it. If mol_text is not set as a keyword argument, this string is used to set it.

draw(fpath=None, **kwargs)

Draw the molecule graph.

Parameters:
  • fpath (str) – Path of the file in which to store the image. If None, the image is displayed in a Tkinter window. Default is None.

  • kwargs (keyword arguments) – Arguments to modify plot properties.

get_descriptor_val()

Get value of molecule descriptor.

Returns:

value(s) of the descriptor.

Return type:

np.ndarray

get_mol_property_val()

Get mol_property_val attribute.

get_name()

Get the mol_text attribute.

get_similarity_to(target_mol, similarity_measure)

Get a similarity metric to a target molecule

Parameters:
  • target_mol (AIMSim.ops Molecule) – Target molecule. Similarity score is with respect to this molecule

  • similarity_measure (AIMSim.ops SimilarityMeasure) – metric used.

Returns:

Similarity coefficient by the chosen method.

Return type:

similarity_score (float)

Raises:

NotInitializedError – If target_mol has an uninitialized descriptor. See note.

static is_same(source_molecule, target_molecule)

Check if the target_molecule is a duplicate of source_molecule.

Parameters:
  • source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.

  • target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns:

True if the molecules are the same.

Return type:

bool

match_fingerprint_from(reference_mol)

If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule (self). If this fails because the mol_graph attribute is absent, a ValueError is raised.

Parameters:
  • reference_mol (AIMSim.ops Molecule) – Target molecule. Fingerprint of this molecule is used as the reference.

Raises:

ValueError

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Sets molecular descriptor attribute.

Parameters:
  • arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.

  • fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.

  • fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.

astartes.utils.aimsim_featurizer.featurize_molecules(molecules, fingerprint, fprints_hopts)

Call AIMSim’s Molecule to featurize the molecules according to the arguments.

Parameters:
  • molecules (np.array) – SMILES strings or RDKit molecule objects.

  • fingerprint (str) – The molecular fingerprint to be used.

  • fprints_hopts (dict) – Hyperparameters for AIMSim.

Returns:

X array (featurized molecules)

Return type:

np.array

astartes.utils.array_type_helpers module

astartes.utils.array_type_helpers.convert_to_array(obj: object, name: str)

Attempt to convert obj named name to a numpy array, with appropriate warnings and exceptions.

Parameters:
  • obj (object) – The item to attempt to convert.

  • name (str) – Human-readable name for printing.

astartes.utils.array_type_helpers.panda_handla(X, y, labels)

Helper function to deal with supporting Pandas data types in astartes

Parameters:
  • X (Dataframe) – Features with column names

  • y (Series) – Targets

  • labels (Series) – Labels for data

Returns:

Empty if no pandas types, metadata-filled otherwise

Return type:

dict

astartes.utils.array_type_helpers.return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices, output_is_pandas)

Convenience function to return the requested arrays appropriately.

Parameters:
  • sampler_instance (sampler) – The fit sampler instance.

  • train_idxs (np.array) – Indices of the data to use in train.

  • val_idxs (np.array) – Indices of the data to use in val.

  • test_idxs (np.array) – Indices of the data to use in test.

  • return_indices (bool) – Return indices after the value arrays.

  • output_is_pandas (dict) – metadata about casting to pandas.

Returns:

Either many arrays or indices in arrays.

Return type:

np.array

Notes

This function copies and pastes a lot of code when it could instead use some loop over (X, y, labels, sampler_instance.get_clusters()) but such an implementation is more error prone. This is long and not the prettiest, but it is definitely doing what we want.
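As a rough illustration of what returning arrays from index sets involves, the sketch below slices each supplied sequence by the train, val, and test indices. It deliberately uses the compact loop that the note mentions as an alternative. The helper name `split_by_indices` and the output ordering are assumptions made for this example, not the astartes implementation, and plain lists stand in for NumPy arrays.

```python
def split_by_indices(arrays, train_idxs, val_idxs, test_idxs, return_indices=False):
    """Slice every non-None sequence in `arrays` into train/val/test pieces.

    Hypothetical helper: the name and the output ordering are assumptions.
    """
    out = []
    for data in arrays:
        if data is None:  # e.g. labels were not supplied
            continue
        for idxs in (train_idxs, val_idxs, test_idxs):
            if len(idxs):  # skip empty splits such as an unused validation set
                out.append([data[i] for i in idxs])
    if return_indices:
        # Indices are appended after the value arrays.
        out.extend(list(idxs) for idxs in (train_idxs, val_idxs, test_idxs) if len(idxs))
    return out


X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [10, 11, 12, 13, 14]
pieces = split_by_indices([X, y, None], [0, 1, 2], [], [3, 4], return_indices=True)
```

Here `pieces` holds the X and y slices for the non-empty splits first, followed by the index lists themselves.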

astartes.utils.exceptions module

Exceptions used by astartes

exception astartes.utils.exceptions.InvalidConfigurationError(message=None)

Bases: RuntimeError

Used when user-requested split/data would not work.

__init__(message=None)

exception astartes.utils.exceptions.InvalidModelTypeError(message=None)

Bases: RuntimeError

Used when user-provided model is invalid.

__init__(message=None)

exception astartes.utils.exceptions.MoleculesNotInstalledError(message=None)

Bases: RuntimeError

Used when attempting to featurize molecules without the required molecule-featurization dependencies installed.

__init__(message=None)

exception astartes.utils.exceptions.SamplerNotImplementedError(message=None)

Bases: RuntimeError

Used when attempting to call a non-existent sampler.

__init__(message=None)

exception astartes.utils.exceptions.UncastableInputError(message=None)

Bases: RuntimeError

Used when X, y, or labels cannot be cast to a np.array.

__init__(message=None)
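All of these exceptions derive from RuntimeError and accept an optional message, so a caller can catch one specific failure or fall back to catching RuntimeError. The sketch below defines a stand-in class with the documented signature instead of importing astartes; `load_sampler_name` and its list of known names are hypothetical and exist only to trigger the error.

```python
class SamplerNotImplementedError(RuntimeError):
    """Stand-in mirroring the documented signature (message=None)."""

    def __init__(self, message=None):
        self.message = message
        super().__init__(message)


def load_sampler_name(name, known=("random", "kennard_stone")):
    """Hypothetical lookup that rejects unknown sampler names."""
    if name.lower() not in known:
        raise SamplerNotImplementedError(f"Sampler '{name}' is not available.")
    return name.lower()


try:
    load_sampler_name("does_not_exist")
except RuntimeError as err:  # the base class also catches the specific subclass
    caught = err
```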

astartes.utils.fast_kennard_stone module

astartes.utils.fast_kennard_stone.fast_kennard_stone(ks_distance: ndarray) → ndarray

Implements the Kennard-Stone algorithm

Parameters:

ks_distance (np.ndarray) – Distance matrix

Returns:

Indices in order of Kennard-Stone selection

Return type:

np.ndarray
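For orientation, the Kennard-Stone procedure starts from the two points that are farthest apart and then repeatedly adds the point whose distance to its nearest already-selected point is largest. The sketch below is a plain-Python reference version of that idea operating on a list-of-lists distance matrix. It is not the vectorized astartes implementation, and the function name `kennard_stone_order` is made up for this example.

```python
def kennard_stone_order(dist):
    """Return indices in Kennard-Stone selection order for a square distance matrix."""
    n = len(dist)
    if n < 2:
        return list(range(n))
    # Seed the selection with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # min_dist[k] is the distance from k to its nearest already-selected point.
    min_dist = {k: min(dist[k][first], dist[k][second]) for k in remaining}
    while remaining:
        # Pick the candidate that is farthest from everything selected so far.
        nxt = max(sorted(remaining), key=lambda k: min_dist[k])
        selected.append(nxt)
        remaining.remove(nxt)
        for k in remaining:
            min_dist[k] = min(min_dist[k], dist[k][nxt])
    return selected


# Four points on a line; the distance matrix holds absolute differences.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
order = kennard_stone_order(dist)  # -> [0, 3, 2, 1]
```

The two extremes (indices 0 and 3) come first, then index 2, which is farther from its nearest selected neighbor than index 1 is.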

astartes.utils.sampler_factory module

class astartes.utils.sampler_factory.SamplerFactory(sampler)

Bases: object

__init__(sampler)

Initialize SamplerFactory and copy a lowercased ‘sampler’ into an attribute.

Parameters:

sampler (string) – The desired sampler.

get_sampler(X, y, labels, hopts)

Instantiate (which also performs fitting) and return the sampler.

Parameters:
  • X (np.array) – Feature array.

  • y (np.array) – Target array.

  • labels (np.array) – Label array.

  • hopts (dict) – Hyperparameters for the sampler.

Raises:

SamplerNotImplementedError – Raised when a non-existent or not yet implemented sampler is requested.

Returns:

The fit sampler instance.

Return type:

astartes.sampler
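The factory behavior described above (store a lowercased name, instantiate the matching sampler, raise if the name is unknown) can be sketched as follows. Everything here is illustrative: `SketchSamplerFactory`, `EvenOddSampler`, and the registry dictionary are hypothetical stand-ins, and the exception is redefined locally so the example does not need astartes.

```python
class SamplerNotImplementedError(RuntimeError):
    """Stand-in for the astartes exception of the same name."""


class EvenOddSampler:
    """Hypothetical sampler: 'fitting' happens in the constructor."""

    def __init__(self, X, y, labels, hopts):
        self.hopts = hopts
        self.train_idxs = [i for i in range(len(X)) if i % 2 == 0]
        self.test_idxs = [i for i in range(len(X)) if i % 2 == 1]


class SketchSamplerFactory:
    """Illustrative factory; the registry below is an assumption."""

    _registry = {"even_odd": EvenOddSampler}

    def __init__(self, sampler):
        self.sampler = sampler.lower()  # store a lowercased copy

    def get_sampler(self, X, y, labels, hopts):
        try:
            sampler_class = self._registry[self.sampler]
        except KeyError:
            raise SamplerNotImplementedError(
                f"Sampler '{self.sampler}' is not implemented."
            ) from None
        # Instantiating the sampler also fits it.
        return sampler_class(X, y, labels, hopts)


fitted = SketchSamplerFactory("Even_Odd").get_sampler([[1], [2], [3]], [0, 1, 0], None, {})
```

Because the name is lowercased on construction, "Even_Odd" and "even_odd" resolve to the same sampler in this sketch.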

astartes.utils.user_utils module

astartes.utils.user_utils.display_results_as_table(error_dict)

Helper function to print a dictionary as a neat table.

astartes.utils.user_utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})

Helper function to train a sklearn model using the provided data and provided sampler types.

Parameters:
  • X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.

  • y (np.array, pd.Series) – Targets corresponding to X, must be of same size.

  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • random_state (int, optional) – The random seed used throughout astartes.

  • samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.

  • print_results (bool, optional) – whether to print the resulting dictionary as a neat table

  • additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions

Returns:

nested dictionary with the format of
{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}

Return type:

dict
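To make the shapes of samplers_hopts and additional_metrics concrete, the sketch below builds both arguments. The hyperparameter name "n_clusters", the metric `max_abs_error`, and the assumption that metrics are called as metric(y_true, y_pred) are illustrative choices, not confirmed by this page. The call itself is left as a comment because it needs astartes, scikit-learn, and real data.

```python
def max_abs_error(y_true, y_pred):
    """User-provided metric: largest absolute deviation between targets and predictions."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


samplers = ["random", "kmeans"]

# Outer keys name the sampler; inner dicts hold that sampler's hyperparameters.
# "n_clusters" is an assumed hyperparameter name used for illustration.
samplers_hopts = {"kmeans": {"n_clusters": 4}}

# Name (str) -> metric (callable), assumed to be called as metric(y_true, y_pred).
additional_metrics = {"max_abs_error": max_abs_error}

# The call would then look roughly like this (not executed here):
# results = generate_regression_results_dict(
#     sklearn_model, X, y,
#     samplers=samplers,
#     samplers_hopts=samplers_hopts,
#     additional_metrics=additional_metrics,
# )
# Following the documented return format, results["random"]["mae"]["test"]
# would hold the test-set entry for the random sampler.
```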

astartes.utils.warnings module

Warnings used by astartes

exception astartes.utils.warnings.ConversionWarning(message=None)

Bases: RuntimeWarning

Used when passed data is not a numpy array.

__init__(message=None)

exception astartes.utils.warnings.ImperfectSplittingWarning(message=None)

Bases: RuntimeWarning

Used when a sampler cannot match requested splits.

__init__(message=None)

exception astartes.utils.warnings.NoMatchingScaffold(message=None)

Bases: Warning

Used when an RDKit molecule does not match any Bemis-Murcko scaffold and returns an empty string.

__init__(message=None)

exception astartes.utils.warnings.NormalizationWarning(message=None)

Bases: RuntimeWarning

Used when a requested split does not add to 1.

__init__(message=None)
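Since these are ordinary Python warning classes, they can be recorded or filtered with the standard warnings module. The sketch below uses a stand-in NormalizationWarning with the documented signature and a hypothetical `normalize_split_sizes` helper. Rescaling the sizes to sum to 1 is an assumption about what a caller might do when the warning fires, not a description of astartes internals.

```python
import warnings


class NormalizationWarning(RuntimeWarning):
    """Stand-in mirroring the documented signature (message=None)."""

    def __init__(self, message=None):
        self.message = message
        super().__init__(message)


def normalize_split_sizes(train_size, val_size, test_size):
    """Hypothetical helper: rescale sizes to sum to 1, warning if they did not."""
    total = train_size + val_size + test_size
    if abs(total - 1.0) > 1e-9:
        warnings.warn(
            NormalizationWarning(f"Requested splits sum to {total}, rescaling to 1.")
        )
        return tuple(size / total for size in (train_size, val_size, test_size))
    return train_size, val_size, test_size


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not suppressed
    sizes = normalize_split_sizes(0.8, 0.2, 0.2)
```

To silence a category instead of recording it, `warnings.simplefilter("ignore", NormalizationWarning)` inside the same context manager has the opposite effect.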

Module contents

astartes.utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})

Helper function to train a sklearn model using the provided data and provided sampler types.

Parameters:
  • X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.

  • y (np.array, pd.Series) – Targets corresponding to X, must be of same size.

  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • random_state (int, optional) – The random seed used throughout astartes.

  • samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.

  • print_results (bool, optional) – whether to print the resulting dictionary as a neat table

  • additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions

Returns:

nested dictionary with the format of
{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}

Return type:

dict