Transitioning from ``sklearn`` to ``astartes``
======================================================

Step 1. Installation
--------------------

``astartes`` has been designed to rely on (1) as few packages as possible and (2) packages which are already likely to be installed in a Machine Learning (ML) Python workflow (i.e. NumPy and scikit-learn). Because of this, ``astartes`` should be compatible with your *existing* workflow, such as a conda environment.

To install ``astartes`` for general ML use (the sampling of arbitrary vectors): **\ ``pip install astartes``\ **

For users in cheminformatics, ``astartes`` has an optional add-on that includes featurization as part of the sampling. To install, type **\ ``pip install 'astartes[molecules]'``\ **. With this extra install, ``astartes`` uses ``AIMSim`` to encode SMILES strings as feature vectors. The SMILES strings are parsed into molecular graphs using RDKit and then sampled with a single function call: ``train_test_split_molecules``.

* If your workflow already has a featurization scheme in place (i.e. you already have a vector representation of your chemical of interest), you can directly use ``train_test_split`` (though we invite you to explore the many molecular descriptors made available through ``AIMSim``).

Step 2. Changing the ``import`` Statement
---------------------------------------------

In one of the first few lines of your Python script, you have the line ``from sklearn.model_selection import train_test_split``. To switch to using ``astartes``\ , change this line to ``from astartes import train_test_split``.

That's it! You are now using ``astartes``. If you were just calling ``train_test_split(X, y)``\ , your script should now work in exactly the same way as ``sklearn`` with no changes required.

.. code-block:: python

   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
       random_state=42,
   )

*becomes*

.. code-block:: python

   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
       random_state=42,
   )

Note that the call itself is unchanged; only the ``import`` statement differs. But we encourage you to try one of our many other samplers (see below)!

Step 3. Specifying an Algorithmic Sampler
-----------------------------------------

By default (for interoperability), ``astartes`` will use a random sampler to produce train/test splits, but the real value of ``astartes`` lies in the algorithmic samplers it implements. Check out the README for a complete list of available algorithms and how to call and customize them.

If your existing call to ``train_test_split`` looks like this:

.. code-block:: python

   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
   )

and you want to try out using Kennard-Stone sampling, switch it to this:

.. code-block:: python

   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
       sampler="kennard_stone",
   )

That's it!

Step 4. Passing Keyword Arguments
---------------------------------

All of the arguments to ``sklearn``\ 's ``train_test_split`` can still be passed to ``astartes``\ ' ``train_test_split``\ :

.. code-block:: python

   X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
       X,
       y,
       labels,
       train_size=0.75,
       test_size=0.25,
       sampler="kmeans",
       hopts={"n_clusters": 4},
   )

Some samplers have tunable hyperparameters that allow you to more finely control their behavior. To do this with Sphere Exclusion, for example, switch your call to this:

.. code-block:: python

   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
       sampler="sphere_exclusion",
       hopts={"distance_cutoff": 0.15},
   )
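Putting Steps 2-4 together, a minimal end-to-end sketch might look like the following. The synthetic NumPy data and the 80/20 split sizes are illustrative assumptions; only ``train_test_split``\ , ``sampler``\ , ``train_size``\ , and ``test_size`` come from the examples above.

.. code-block:: python

   import numpy as np

   from astartes import train_test_split

   # illustrative synthetic data: 100 samples with 5 features each
   X = np.random.rand(100, 5)
   y = np.random.rand(100)

   # Kennard-Stone sampling is deterministic, so no random_state is needed here
   X_train, X_test, y_train, y_test = train_test_split(
       X,
       y,
       train_size=0.8,
       test_size=0.2,
       sampler="kennard_stone",
   )

   print(X_train.shape, X_test.shape)  # expect (80, 5) and (20, 5)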
Step 5. Useful ``astartes`` Features
----------------------------------------

``return_indices``\ : Improve Code Clarity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are circumstances where the indices of the train/test data can be useful (for example, if ``y`` or ``labels`` are large, memory-intensive objects), and there is no way to directly return these indices in ``sklearn``. ``astartes`` will return the split arrays themselves by default, but it can also return the indices for the user to manipulate according to their needs:

.. code-block:: python

   X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
       X,
       y,
       labels,
       return_indices=False,
   )

*could instead be*

.. code-block:: python

   X_train, X_test, y_train, y_test, labels_train, labels_test, indices_train, indices_test = train_test_split(
       X,
       y,
       labels,
       return_indices=True,
   )

If ``y`` or ``labels`` are large, memory-intensive objects, it can be beneficial to *not* pass them to ``train_test_split`` and instead split the existing lists later using the returned indices.

``train_val_test_split``\ : More Rigorous ML
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Behind the scenes, ``train_test_split`` is actually just a one-line function that calls the real workhorse of ``astartes``\ , ``train_val_test_split``\ :

.. code-block:: python

   def train_test_split(
       X: np.array,
       ...
       return_indices: bool = False,
   ):
       return train_val_test_split(
           X, y, labels, train_size, 0, test_size, sampler, hopts, return_indices
       )

The call to ``train_val_test_split`` is identical to ``train_test_split`` and supports all the same samplers and hyperparameters, except for one additional keyword argument, ``val_size``\ :

.. code-block:: python

   def train_val_test_split(
       X: np.array,
       y: np.array = None,
       labels: np.array = None,
       train_size: float = 0.8,
       val_size: float = 0.1,
       test_size: float = 0.1,
       sampler: str = "random",
       hopts: dict = {},
       return_indices: bool = False,
   ):

When called, this will return *three* arrays from ``X``\ , ``y``\ , and ``labels`` (or three arrays of indices, if ``return_indices=True``\ ) rather than the usual two, according to the values given for ``train_size``\ , ``val_size``\ , and ``test_size`` in the function call:

.. code-block:: python

   X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
       X,
       y,
       train_size=0.8,
       val_size=0.1,
       test_size=0.1,
   )

For truly rigorous ML modeling, the validation set should be used for hyperparameter tuning and the test set held out until the *very final* change has been made to the model, to get a true sense of its performance. For better or for worse, this is *not* the current standard for ML modeling, but the authors believe it should be.

Custom Warnings: ``ImperfectSplittingWarning`` and ``NormalizationWarning``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the event that your requested train/validation/test split is not mathematically possible given the dimensions of the input data (i.e. you request 50/25/25 but have 101 data points), ``astartes`` will warn you at runtime that this has occurred. ``sklearn`` simply moves on quietly, and while this is fine *most* of the time, the authors felt it prudent to warn the user.

When given a train/validation/test split, ``astartes`` will also check that the requested sizes are normalized (i.e. sum to 1), normalize them if they are not, and warn the user at runtime. This will hopefully help prevent head-scratching hours of debugging.
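If you want to see these warnings for yourself, a minimal sketch using Python's built-in ``warnings`` module is shown below. The synthetic data and the deliberately un-normalized 0.7/0.2 split are illustrative assumptions; the exact warning classes and messages emitted will depend on the ``astartes`` version you have installed.

.. code-block:: python

   import warnings

   import numpy as np

   from astartes import train_test_split

   # illustrative synthetic data: 50 samples with 3 features each
   X = np.random.rand(50, 3)
   y = np.random.rand(50)

   with warnings.catch_warnings(record=True) as caught:
       warnings.simplefilter("always")
       # 0.7 + 0.2 does not sum to 1, so astartes should normalize the
       # requested sizes and warn us rather than failing silently
       X_train, X_test, y_train, y_test = train_test_split(
           X,
           y,
           train_size=0.7,
           test_size=0.2,
           sampler="random",
       )

   # inspect whatever warnings astartes raised during the split
   for w in caught:
       print(w.category.__name__, "-", w.message)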