astartes.utils package
Submodules
astartes.utils.aimsim_featurizer module
- class astartes.utils.aimsim_featurizer.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)
Bases:
object
An abstraction of a molecule
- mol_graph
Graph-level information of molecule. Implemented as an RDKit mol object.
- Type:
RDKit mol object
- mol_text
Text identifier of the molecule.
- Type:
str
- mol_property_val
Some property associated with the molecule. This is typically the response being studied, for example boiling point or selectivity.
- Type:
float
- descriptor
Vector representation of a molecule. Commonly a fingerprint.
- Type:
Descriptor object
- set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)
Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.
- get_descriptor_val()
Get the descriptor value as a numpy array.
- match_fingerprint_from(reference_mol)
Generate the same fingerprint as the reference_mol.
- get_similarity_to(target_mol, similarity_measure)
Get the similarity to target_mol using a similarity_measure of choice.
- get_name()
Get the mol_text attribute.
- get_mol_property_val()
Get mol_property_val attribute.
- draw(fpath=None, **kwargs)
Draw the molecule.
- is_same(source_molecule, target_molecule)
Static method used to check equivalence of two molecules.
- __init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)
Constructor
- Parameters:
mol_graph (RDKit mol object) – Graph-level information of molecule. Implemented as an RDKit mol object. Default is None.
mol_text (str) – Text identifier of the molecule. Default is None. Identifiers can be: (1) the name of the molecule, or (2) a SMILES string representing the molecule.
mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied, for example boiling point or selectivity. Default is None.
mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be a numpy array or list. Default is None.
mol_src (str) – Source file or SMILES string to load the molecule from. Acceptable files are a .pdb file, or a .txt file with the SMILES string in the first column of the first row and (optionally) the property in the second column of the first row. Default is None. If provided, an attempt is made to load mol_graph from it.
mol_smiles (str) – SMILES string for the molecule. If provided, mol_graph is loaded from it. If mol_text is not set as a keyword argument, this string is used to set it.
- draw(fpath=None, **kwargs)
Draw the molecule graph.
- Parameters:
fpath (str) – Path of the file in which to store the image. If None, the image is displayed in a Tkinter window. Default is None.
kwargs (keyword arguments) – Arguments to modify plot properties.
- get_descriptor_val()
Get value of molecule descriptor.
- Returns:
value(s) of the descriptor.
- Return type:
np.ndarray
- get_mol_property_val()
- get_name()
- get_similarity_to(target_mol, similarity_measure)
Get a similarity metric to a target molecule.
- Parameters:
target_mol (AIMSim.ops Molecule) – Target molecule. The similarity score is computed with respect to this molecule.
similarity_measure (AIMSim.ops SimilarityMeasure) – Similarity metric used.
- Returns:
similarity_score, the similarity coefficient by the chosen method.
- Return type:
float
- Raises:
NotInitializedError – If target_mol has an uninitialized descriptor. See note.
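To make the idea of a similarity score between two descriptors concrete, the sketch below computes a Tanimoto coefficient between two binary fingerprints. It is a standalone illustration and not AIMSim's implementation: the function name `tanimoto_similarity` and the example bit vectors are hypothetical, and Tanimoto is only one commonly used measure for fingerprints.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints, and bits set in at least one of them.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen for this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Hypothetical fingerprints for a source and a target molecule.
source_fp = [1, 0, 1, 1, 0, 0]
target_fp = [1, 1, 1, 0, 0, 0]
score = tanimoto_similarity(source_fp, target_fp)
print(score)
```

Here two bits are shared out of four bits set in either fingerprint, so the score is 0.5; identical fingerprints score 1.0.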
- static is_same(source_molecule, target_molecule)
Check if the target_molecule is a duplicate of source_molecule.
- Parameters:
source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.
target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.
- Returns:
True if the molecules are the same.
- Return type:
bool
- match_fingerprint_from(reference_mol)
If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule. If this fails because the mol_graph attribute is absent, a ValueError is raised.
- Parameters:
reference_mol (AIMSim.ops Molecule) – Target molecule. The fingerprint of this molecule is used as the reference.
- Raises:
ValueError – If the fingerprint cannot be calculated because the mol_graph attribute is absent.
- set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)
Sets molecular descriptor attribute.
- Parameters:
arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.
fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.
fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.
- astartes.utils.aimsim_featurizer.featurize_molecules(molecules, fingerprint, fprints_hopts)
Call AIMSim’s Molecule to featurize the molecules according to the arguments.
- Parameters:
molecules (np.array) – SMILES strings or RDKit molecule objects.
fingerprint (str) – The molecular fingerprint to be used.
fprints_hopts (dict) – Hyperparameters for AIMSim.
- Returns:
X array (featurized molecules)
- Return type:
np.array
astartes.utils.array_type_helpers module
- astartes.utils.array_type_helpers.convert_to_array(obj: object, name: str)
Attempt to convert obj named name to a numpy array, with appropriate warnings and exceptions.
- Parameters:
obj (object) – The item to attempt to convert.
name (str) – Human-readable name for printing.
- astartes.utils.array_type_helpers.panda_handla(X, y, labels)
Helper function for supporting pandas data types in astartes.
- Parameters:
X (DataFrame) – Features with column names
y (Series) – Targets
labels (Series) – Labels for data
- Returns:
Empty if no pandas types, metadata-filled otherwise
- Return type:
dict
- astartes.utils.array_type_helpers.return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices, output_is_pandas)
Convenience function to return the requested arrays appropriately.
- Parameters:
sampler_instance (sampler) – The fit sampler instance.
train_idxs (np.array) – Indices of the samples in the training set.
val_idxs (np.array) – Indices of the samples in the validation set.
test_idxs (np.array) – Indices of the samples in the test set.
return_indices (bool) – Return indices after the value arrays.
output_is_pandas (dict) – Metadata about casting to pandas.
- Returns:
Either many arrays or indices in arrays.
- Return type:
np.array
Notes
This function copies and pastes a lot of code when it could instead use some loop over (X, y, labels, sampler_instance.get_clusters()) but such an implementation is more error prone. This is long and not the prettiest, but it is definitely doing what we want.
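The loop-based alternative mentioned in the note can be sketched as follows. This is an illustration only, not the astartes implementation: `split_by_indices` is a hypothetical helper, plain lists stand in for arrays, and the output ordering (all splits of each array first, indices last) is an assumption based on the description of `return_indices`.

```python
def split_by_indices(arrays, train_idxs, val_idxs, test_idxs, return_indices=False):
    """Slice each array into train/val/test parts, optionally appending the indices."""
    splits = (train_idxs, val_idxs, test_idxs)
    out = []
    for arr in arrays:
        if arr is None:  # e.g. no labels were supplied
            continue
        for idxs in splits:
            out.append([arr[i] for i in idxs])
    if return_indices:
        # Indices come after the value arrays.
        out.extend(list(idxs) for idxs in splits)
    return out


X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0, 10, 20, 30, 40]
result = split_by_indices([X, y, None], [0, 1, 2], [3], [4], return_indices=True)
X_train, X_val, X_test, y_train, y_val, y_test, i_train, i_val, i_test = result
```

The trade-off the note describes is visible here: the loop is compact, but the position of each returned item depends on which inputs were None, which is easy to get wrong.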
astartes.utils.exceptions module
Exceptions used by astartes
- exception astartes.utils.exceptions.InvalidConfigurationError(message=None)
Bases:
RuntimeError
Used when user-requested split/data would not work.
- __init__(message=None)
- exception astartes.utils.exceptions.InvalidModelTypeError(message=None)
Bases:
RuntimeError
Used when user-provided model is invalid.
- __init__(message=None)
- exception astartes.utils.exceptions.MoleculesNotInstalledError(message=None)
Bases:
RuntimeError
Used when attempting to featurize molecules without the required molecule-featurization dependencies installed.
- __init__(message=None)
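The exceptions above share one shape: a RuntimeError subclass with an optional message. The sketch below shows how such a class can be defined and handled. The names `ExampleConfigurationError` and `check_split_sizes` are hypothetical stand-ins, and the validation rule is invented for illustration; it is not a statement of when astartes raises its own exceptions.

```python
import math


class ExampleConfigurationError(RuntimeError):
    """Stand-in with the documented shape: RuntimeError subclass, optional message."""

    def __init__(self, message=None):
        self.message = message
        super().__init__(message)


def check_split_sizes(train_size, val_size, test_size):
    """Hypothetical check: reject fractions that do not sum to one."""
    total = train_size + val_size + test_size
    if not math.isclose(total, 1.0):
        raise ExampleConfigurationError(f"split sizes sum to {total}, expected 1.0")


try:
    check_split_sizes(0.8, 0.3, 0.1)
    caught = None
except RuntimeError as err:  # the subclass is caught through its base class
    caught = err
```

Because the class derives from RuntimeError, callers can catch either the specific type or the base class.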
astartes.utils.fast_kennard_stone module
- astartes.utils.fast_kennard_stone.fast_kennard_stone(ks_distance: ndarray) → ndarray
Implements the Kennard-Stone algorithm
- Parameters:
ks_distance (np.ndarray) – Distance matrix
- Returns:
Indices in order of Kennard-Stone selection
- Return type:
np.ndarray
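For readers unfamiliar with Kennard-Stone selection, the following dependency-free sketch shows the basic idea on a precomputed distance matrix: seed with the two most distant points, then repeatedly pick the point whose nearest already-selected neighbour is farthest away. It is a plain reference version under stated assumptions, not the optimized routine documented above; `kennard_stone_order` is a hypothetical name and nested lists stand in for an ndarray.

```python
def kennard_stone_order(dist):
    """Return indices in Kennard-Stone selection order for a square distance matrix."""
    n = len(dist)
    if n < 2:
        return list(range(n))
    # Seed with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    # min_dist[k] is the distance from point k to its nearest selected point.
    min_dist = [min(dist[k][first], dist[k][second]) for k in range(n)]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Pick the remaining point farthest from the selected set (ties: lowest index).
        nxt = max(sorted(remaining), key=lambda k: min_dist[k])
        selected.append(nxt)
        remaining.remove(nxt)
        for k in remaining:
            min_dist[k] = min(min_dist[k], dist[k][nxt])
    return selected


# Four points on a line; pairwise absolute differences form the distance matrix.
points = [0.0, 1.0, 4.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
order = kennard_stone_order(dist)
print(order)  # [0, 3, 2, 1]
```

The two extremes (indices 0 and 3) are chosen first, followed by the point at 4.0, which is farther from both extremes than the point at 1.0 is from its nearest neighbour.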
astartes.utils.sampler_factory module
- class astartes.utils.sampler_factory.SamplerFactory(sampler)
Bases:
object
- __init__(sampler)
Initialize SamplerFactory and copy a lowercased ‘sampler’ into an attribute.
- Parameters:
sampler (string) – The desired sampler.
- get_sampler(X, y, labels, hopts)
Instantiate (which also performs fitting) and return the sampler.
- Parameters:
X (np.array) – Feature array.
y (np.array) – Target array.
labels (np.array) – Label array.
hopts (dict) – Hyperparameters for the sampler.
- Raises:
SamplerNotImplementedError – Raised when a non-existent or not yet implemented sampler is requested.
- Returns:
The fit sampler instance.
- Return type:
astartes.sampler
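The factory behaviour described above (store a lowercased name, then instantiate and thereby fit the matching sampler, or raise if it is unknown) can be sketched as below. Every name in this block is a hypothetical stand-in: `ExampleSamplerFactory`, `FirstNSampler`, the registry, and `ExampleSamplerNotImplementedError` are assumptions for illustration and do not reflect how astartes registers its samplers.

```python
class ExampleSamplerNotImplementedError(RuntimeError):
    """Stand-in for the error raised when a sampler name is unknown."""


class FirstNSampler:
    """Toy sampler: 'fitting' just records the sample indices in their original order."""

    def __init__(self, X, y, labels, hopts):
        self.hopts = hopts
        self.indices = list(range(len(X)))


class ExampleSamplerFactory:
    # Hypothetical registry mapping lowercase names to sampler classes.
    _registry = {"first_n": FirstNSampler}

    def __init__(self, sampler):
        # Copy a lowercased version of the requested name into an attribute.
        self.sampler = sampler.lower()

    def get_sampler(self, X, y, labels, hopts):
        try:
            sampler_class = self._registry[self.sampler]
        except KeyError:
            raise ExampleSamplerNotImplementedError(
                f"sampler '{self.sampler}' is not implemented"
            ) from None
        # Instantiating the class also performs the fit.
        return sampler_class(X, y, labels, hopts)


fitted = ExampleSamplerFactory("FIRST_N").get_sampler([[1], [2], [3]], [1, 2, 3], None, {})
```

Lowercasing in the constructor makes the lookup case-insensitive, so "FIRST_N" and "first_n" resolve to the same sampler.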
astartes.utils.user_utils module
- astartes.utils.user_utils.display_results_as_table(error_dict)
Helper function to print a dictionary as a neat table.
- astartes.utils.user_utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})
Helper function to train a sklearn model using the provided data and provided sampler types.
- Parameters:
X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.
y (np.array, pd.Series) – Targets corresponding to X, must be of same size.
train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.
val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.
test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.
random_state (int, optional) – The random seed used throughout astartes.
samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.
print_results (bool, optional) – whether to print the resulting dictionary as a neat table
additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions
- Returns:
Nested dictionary with the format:
{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}
- Return type:
dict
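The nested layout of the returned dictionary can be illustrated with a small, dependency-free sketch that computes the three metrics by hand for one sampler. It does not call astartes or scikit-learn: the metric helpers, the example targets, and the `"random"` key are assumptions, and whether each entry holds a single score or a list of scores is also an assumption (lists are used here to match the documented format).

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    # Coefficient of determination; assumes y_true is not constant.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


metrics = {"mae": mae, "rmse": rmse, "R2": r2}
# Hypothetical (true, predicted) targets for each split of one sampler.
splits = {
    "train": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
    "val": ([5.0, 7.0], [6.0, 6.0]),
    "test": ([8.0, 10.0], [8.0, 12.0]),
}

# Build the sampler -> metric -> split nesting, then fill in the scores.
results = {"random": {name: {split: [] for split in splits} for name in metrics}}
for name, metric in metrics.items():
    for split, (y_true, y_pred) in splits.items():
        results["random"][name][split].append(metric(y_true, y_pred))
```

Additional metrics would slot in as further keys alongside 'mae', 'rmse', and 'R2', in the same way the `additional_metrics` mapping of name to function is described above.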
astartes.utils.warnings module
Warnings used by astartes
- exception astartes.utils.warnings.ConversionWarning(message=None)
Bases:
RuntimeWarning
Used when passed data is not a numpy array.
- __init__(message=None)
- exception astartes.utils.warnings.ImperfectSplittingWarning(message=None)
Bases:
RuntimeWarning
Used when a sampler cannot match requested splits.
- __init__(message=None)
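Both warnings above are RuntimeWarning subclasses with an optional message, so they work with the standard `warnings` machinery. The sketch below shows how such a warning can be emitted and then captured by a caller. `ExampleConversionWarning` and `as_list` are hypothetical stand-ins, and converting to a list (rather than a numpy array) is a simplification to keep the example dependency-free.

```python
import warnings


class ExampleConversionWarning(RuntimeWarning):
    """Stand-in with the documented shape: RuntimeWarning subclass, optional message."""

    def __init__(self, message=None):
        self.message = message
        super().__init__(message)


def as_list(obj, name):
    """Hypothetical converter: warn when obj is not already a list, then convert it."""
    if isinstance(obj, list):
        return obj
    warnings.warn(
        ExampleConversionWarning(f"{name} is {type(obj).__name__}, converting to list"),
        stacklevel=2,
    )
    return list(obj)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not suppressed
    converted = as_list((1, 2, 3), "X")
```

Callers who want such a warning treated as an error can use `warnings.simplefilter("error", ExampleConversionWarning)` instead of recording it.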
Module contents
- astartes.utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})
Helper function to train a sklearn model using the provided data and provided sampler types.
- Parameters:
X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.
y (np.array, pd.Series) – Targets corresponding to X, must be of same size.
train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.
val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.
test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.
random_state (int, optional) – The random seed used throughout astartes.
samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying the sampler and the values being another dictionary with the corresponding hyperparameters. Defaults to {}.
print_results (bool, optional) – whether to print the resulting dictionary as a neat table
additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics such as those in sklearn.metrics or user-provided functions
- Returns:
Nested dictionary with the format:
{
    sampler: {
        'mae': {'train': [], 'val': [], 'test': []},
        'rmse': {'train': [], 'val': [], 'test': []},
        'R2': {'train': [], 'val': [], 'test': []},
    },
}
- Return type:
dict