Adding a model to Helix¶
This page describes what is required for a new model to be included in Helix. We will not accept models that do not meet this specification.
Model specifications¶
Any model you add to Helix must conform to scikit-learn's API.
This is to ensure that all models in Helix behave consistently without requiring checks on the capabilities of each individual model. This in turn keeps the complexity of the code base down.
Please see the scikit-learn docs for details about their API.
Currently we only support supervised machine learning algorithms in the form of classifiers or regressors. Classifiers must implement the ClassifierMixin and regressors must implement the RegressorMixin. Both must implement the BaseEstimator class.
class MyClassifier(ClassifierMixin, BaseEstimator):
    ...

class MyRegressor(RegressorMixin, BaseEstimator):
    ...
You must then override the fit and predict methods with the logic needed to fit your model and make predictions on data, respectively.
For classifiers, you must also override the predict_proba method, which returns the probabilities for each class for each prediction. This is not a requirement of scikit-learn but of Helix.
class MyClassifier(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds

    def predict_proba(self, X):
        # perform prediction logic and estimate class probabilities
        probs = ...
        return probs
class MyRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds
See the scikit-learn documentation on developing estimators for more information.
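To see the contract these methods must satisfy in practice, here is a small, self-contained illustration using an existing scikit-learn classifier. LogisticRegression stands in for your own class, and the synthetic dataset is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, purely for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)  # fit returns the estimator itself
preds = clf.predict(X)                # class labels, shape (n_samples,)
probs = clf.predict_proba(X)          # class probabilities, shape (n_samples, n_classes)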
Your model must be saveable as a pickle file (.pkl). This is how Helix persists the models it trains.
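Helix's own persistence code lives elsewhere; the sketch below simply shows how you can check locally that your model survives a pickle round trip. The file name and the stand-in LogisticRegression model are illustrative assumptions:

import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data
X, y = make_classification(n_samples=50, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Write the fitted model to a .pkl file, then read it back
with open("my_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("my_model.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model should behave identically to the original
assert (restored.predict(X) == model.predict(X)).all()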
Hyperparameter tuning¶
Your models must be designed with manual and automatic hyperparameter tuning in mind. Make sure your model’s hyperparameters are set in the constructor of your class.
class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2
As per the notes in BaseEstimator, do not include *args or **kwargs in the constructor signature. The hyperparameters must be set exhaustively.
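Setting the hyperparameters as plain attributes in the constructor is what makes BaseEstimator's get_params and set_params work, which both manual tuning and GridSearchCV rely on. A quick check you can run yourself, continuing the MyNewModel example above with the scikit-learn mixins imported:

from sklearn.base import BaseEstimator, ClassifierMixin


class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2


# get_params / set_params are inherited from BaseEstimator and work because
# each constructor argument is stored under an attribute of the same name
model = MyNewModel(param1=2, param2=0.9)
print(model.get_params())   # {'param1': 2, 'param2': 0.9}
model.set_params(param1=3)  # returns the estimator with param1 updated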
Automatic hyperparameter search (AHPS)¶
For your model to work with AHPS, it needs to have tunable hyperparameters. If there aren't any, it will not work with GridSearchCV, which is how we perform AHPS.
You will need to create a search grid of your hyperparameters. AHPS uses this to find the best combination of hyperparameters for a model trained on a given dataset.
MY_MODEL_GRID = {
    "param1": [1, 2, 3],
    "param2": [1.0, 1.1, 1.2]
}
Add your grid to helix/options/search_grids.py.
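For context, below is a rough sketch of how such a grid is consumed by GridSearchCV, which is what Helix uses for AHPS. SVR and its small grid are stand-ins for your model and MY_MODEL_GRID, and the synthetic data is illustrative; Helix's actual wiring lives in its own training code:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# A small grid over two SVR hyperparameters (stand-in for MY_MODEL_GRID)
grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

# GridSearchCV tries every combination with cross-validation and keeps the best
search = GridSearchCV(SVR(), param_grid=grid, cv=3)
search.fit(X, y)
print(search.best_params_)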
You can make your grid as large or small as you like, and it is not necessary to include every hyperparameter. However, the more hyperparameters you include and the more values you add, the longer the training process will take to complete, so please consider the user experience when deciding how many parameters to tune and how many values to test.
Manual hyperparameter search¶
You will need to create a form for users to input the values they wish to use for the hyperparameters of your model.
For each field, it would be helpful to include a help message explaining what the hyperparameter is and what it does.
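A single field might look like the following minimal Streamlit sketch; the widget label, default value and help text are placeholders, and the full form component for a model is shown later in this guide:

import streamlit as st

param1 = st.number_input(
    "param1",
    value=1,
    help="Explain here what param1 is and how it affects the model.",
)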
How to integrate your model into Helix¶
The following examples will show you how you can integrate your new model into Helix and make it available for users.
Register your model name¶
In helix/options/enums.py, edit the ModelNames enum by adding your model name.
class ModelNames(StrEnum):
    LinearModel = "linear model"
    RandomForest = "random forest"
    XGBoost = "xgboost"
    SVM = "svm"
    ...
    MyNewModel = "my new model"
Making your model available to Helix¶
If your model is a classifier, edit CLASSIFIERS in helix/options/choices/ml_models.py by adding your model like so:
# import your model
CLASSIFIERS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LogisticRegression,
    ModelNames.RandomForest: RandomForestClassifier,
    ModelNames.XGBoost: XGBClassifier,
    ModelNames.SVM: SVC,
    ...
    ModelNames.MyNewModel: MyModel
}
If your model is a regressor, edit REGRESSORS in helix/options/choices/ml_models.py by adding your model like so:
# import your model
REGRESSORS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LinearRegression,
    ModelNames.RandomForest: RandomForestRegressor,
    ModelNames.XGBoost: XGBRegressor,
    ModelNames.SVM: SVR,
    ...
    ModelNames.MyNewModel: MyModel
}
Create the form component¶
# helix/components/forms/forms_ml_opts.py
import streamlit as st

from helix.options.search_grids import MY_MODEL_GRID


# Create the form component for MyModel
def _my_model_opts(use_hyperparam_search: bool) -> dict:
    model_types = {}
    if not use_hyperparam_search:
        st.write("Options")
        param1 = st.number_input(
            "param1",
            value=1,
            help="""
            The first hyperparameter to my model.
            The bigger it is, the more accurate the model.
            """,
        )
        param2 = st.number_input(
            "param2",
            value=0.1,
            max_value=1.0,
            min_value=0.0,
            help="""
            The second hyperparameter to my model.
            The closer the value to 1.0, the smarter it is.
            """,
        )
        params = {
            "param1": param1,
            "param2": param2,
        }
        st.divider()
    else:
        params = MY_MODEL_GRID
    model_types["MY_MODEL"] = {
        "use": True,
        "params": params,
    }
    return model_types
Add the form component to the main form¶
To make your model selectable by the user, edit ml_options_form in helix/components/forms/forms_ml_opts.py as shown below.
# helix/components/forms/forms_ml_opts.py
def ml_options_form():
    ...
    # Look for this to find where the models are made available
    st.subheader("Select and configure which models to train")
    ...
    # Add this underneath to make your model available
    if st.toggle("My Model", value=False):
        my_model_type = _my_model_opts(use_hyperparam_search)
        model_types.update(my_model_type)
Documentation¶
Please add your new model to the user documentation. To do this, edit the "Options" subsection of "Selecting models to train" in docs/users/train_models.md. This is a Markdown file; please see this Markdown guide for information on how to write using Markdown.
If you do not document your model, your model will not be added to Helix.
Your explanation must include the hyperparameters and what each one does to your model. It should also include a brief explanation of the theory behind the model and link to any relevant papers or documentation concerning the model.
Example¶
My New Model
My model uses a super cool algorithm that optimises 2 parameters, param1 and param2, to asymptotically approach Artificial General Intelligence (AGI). The paper can be found at [link here].
param1: The first hyperparameter to my model. The bigger it is, the more accurate the model.
param2: The second hyperparameter to my model. The closer the value to 1.0, the smarter it is.
Testing¶
You must test that your model works with Helix for it to be included. Helix uses pytest and streamlit's testing framework.
You must add a test for both automatic hyperparameter search and manual hyperparameter tuning.
What to test¶
What you are testing, in this case, is not the performance of the model in terms of some metric like accuracy or R^2, but whether your model is properly integrated into Helix. Your tests should check the following:
That there are no errors or exceptions when running the model
That it creates the model directory in the experiment
That it creates the expected .pkl file
That it creates the plot directory for the experiment and that that directory is not empty, i.e. you get the performance plots
That you get the file with the expected predictions
That you get the file with the model metrics
How to add tests¶
You should add your tests to tests/pages/test_4_Train_Models.py.
Generally, you will write 2 test functions: one to test your model with automatic hyperparameter search, and one to test it with manual hyperparameter tuning. Take the tests for the SVM models as an example: you will find 2 tests, test_auto_svm and test_manual_svm. You might call your tests test_auto_<model_name> and test_manual_<model_name>.
Testing AHPS¶
This test simulates the user setting up the model to be trained with GridSearchCV. This test should take one parameter called new_regression_experiment (or new_classification_experiment if training a classifier) of type str.
Below is test_auto_svm as an example:
def test_auto_svm(new_regression_experiment: str):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    k = 3
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Set the number of k-folds
    k_input = get_element_by_label(at, "number_input", "k")
    k_input.set_value(k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
You should be able to create a copy of this example and rename it to test_auto_<my_model>. Then, replace the 2 lines underneath where it says "# Select SVM" with the following:
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
Substitute "My Model" with the actual name of your model.
Testing manual hyperparameter tuning¶
This test simulates the user setting up the model to be trained without AHPS. This test should take 3 parameters called new_regression_experiment (or new_classification_experiment if testing a classifier) of type str, data_split_method of type DataSplitMethods, and holdout_or_k of type int.
Below is test_manual_svm as an example. The decorator above the function signature doesn't need to be altered; it causes the test to run the page with bootstrapping and cross-validation.
@pytest.mark.parametrize(
    "data_split_method,holdout_or_k",
    [
        (DataSplitMethods.Holdout.capitalize(), 3),
        (DataSplitMethods.KFold.capitalize(), 3),
    ],
)
def test_manual_svm(
    new_regression_experiment: str, data_split_method: DataSplitMethods, holdout_or_k: int
):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Unselect AHPS, which is on by default
    ahps_toggle = get_element_by_key(
        at, "toggle", ExecutionStateKeys.UseHyperParamSearch
    )
    ahps_toggle.set_value(False).run()
    # Select the data split method
    data_split_selector = get_element_by_label(at, "selectbox", "Data split method")
    data_split_selector.select(data_split_method).run()
    # Set the number of bootstraps / k-folds
    if holdout_input := get_element_by_label(
        at, "number_input", "Number of bootstraps"
    ):
        holdout_input.set_value(holdout_or_k).run()
    if k_input := get_element_by_label(at, "number_input", "k"):
        k_input.set_value(holdout_or_k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
Similar to the automatic hyperparameter search test, you should only need to edit the lines underneath where it says “# Select SVM”, replacing them with the following:
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
Substitute "My Model" with the actual name of your model.
Running the tests¶
The tests will run when you open a pull request to Helix, and they will re-run every time you push to that PR. You can also run them manually:
uv run pytest
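While iterating locally, it can be quicker to run only the training-page tests, or just the tests for your model. For example, using pytest's -k flag to filter tests by name (substitute my_model for the name you used in your test functions):
uv run pytest tests/pages/test_4_Train_Models.py -k my_model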
Be patient: the tests can take several minutes to complete. Also be aware that your changes may affect other tests.