Adding a model to Helix¶
This page describes what is required for a new model to be included in Helix. We will not accept models that do not meet this specification.
Model specifications¶
Any model you add to Helix must conform to scikit-learn's API.
This is to ensure that all models in Helix behave consistently without requiring checks on the capabilities of each individual model. This in turn keeps the complexity of the code base down.
Please see the scikit-learn docs for details about their API.
Currently we only support supervised machine learning algorithms in the form of classifiers or regressors. Classifiers must implement the ClassifierMixin and regressors must implement the RegressorMixin. Both must implement the BaseEstimator class.
class MyClassifier(ClassifierMixin, BaseEstimator):
    ...

class MyRegressor(RegressorMixin, BaseEstimator):
    ...
You must then override the fit and predict methods with the logic needed to fit your model and make predictions on data, respectively.
For classifiers, you must also override the predict_proba method, which returns the probabilities for each class for each prediction. This is not a requirement of scikit-learn but of Helix.
class MyClassifier(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds

    def predict_proba(self, X):
        # perform prediction logic and estimate class probabilities
        probs = ...
        return probs
class MyRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds
See the scikit-learn documentation on developing estimators for more information.
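To see the contract these methods must satisfy in practice, here is a small, self-contained illustration using an existing scikit-learn classifier. LogisticRegression stands in for your own class, and the synthetic dataset is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, purely for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)  # fit returns the estimator itself
preds = clf.predict(X)                # class labels, shape (n_samples,)
probs = clf.predict_proba(X)          # class probabilities, shape (n_samples, n_classes)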
Your model must be saveable as a pickle file (.pkl). This is how Helix persists the models it trains.
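Helix's own persistence code lives elsewhere; the sketch below simply shows how you can check locally that your model survives a pickle round trip. The file name and the stand-in LogisticRegression model are illustrative assumptions:

import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data
X, y = make_classification(n_samples=50, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Write the fitted model to a .pkl file, then read it back
with open("my_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("my_model.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model should behave identically to the original
assert (restored.predict(X) == model.predict(X)).all()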
Hyperparameter tuning¶
Your models must be designed with manual and automatic hyperparameter tuning in mind. Make sure your model’s hyperparameters are set in the constructor of your class.
class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2
As per the notes in BaseEstimator, do not include *args or **kwargs in the constructor signature. The hyperparameters must be set exhaustively.
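Setting the hyperparameters as plain attributes in the constructor is what makes BaseEstimator's get_params and set_params work, which both manual tuning and GridSearchCV rely on. A quick check you can run yourself, continuing the MyNewModel example above with the scikit-learn mixins imported:

from sklearn.base import BaseEstimator, ClassifierMixin


class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2


# get_params / set_params are inherited from BaseEstimator and work because
# each constructor argument is stored under an attribute of the same name
model = MyNewModel(param1=2, param2=0.9)
print(model.get_params())   # {'param1': 2, 'param2': 0.9}
model.set_params(param1=3)  # returns the estimator with param1 updated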
Automatic hyperparameter search (AHPS)¶
For your model to work with AHPS, it needs to have tunable hyperparameters. If there aren't any, it will not work with GridSearchCV, which is how we perform AHPS.
You will need to create a search grid of your hyperparameters. AHPS uses this to find the best combination of hyperparameters for a model trained on a given dataset.
MY_MODEL_GRID = {
    "param1": [1, 2, 3],
    "param2": [1.0, 1.1, 1.2]
}
Add your grid to helix/options/search_grids.py.
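For context, below is a rough sketch of how such a grid is consumed by GridSearchCV, which is what Helix uses for AHPS. SVR and its small grid are stand-ins for your model and MY_MODEL_GRID, and the synthetic data is illustrative; Helix's actual wiring lives in its own training code:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# A small grid over two SVR hyperparameters (stand-in for MY_MODEL_GRID)
grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

# GridSearchCV tries every combination with cross-validation and keeps the best
search = GridSearchCV(SVR(), param_grid=grid, cv=3)
search.fit(X, y)
print(search.best_params_)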
You can make your grid as large or small as you like, and it is not necessary to include every hyperparameter. However, the more hyperparameters you include and the more values you add, the longer the training process will take to complete, so please consider the user experience when deciding how many parameters to tune and how many values to test.
Manual hyperparameter search¶
You will need to create a form for users to input the values they wish to use for the hyperparameters of your model.
For each field, it would be helpful to include a help message explaining what the hyperparameter is and what it does.
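A single field might look like the following minimal Streamlit sketch; the widget label, default value and help text are placeholders, and the full form component for a model is shown later in this guide:

import streamlit as st

param1 = st.number_input(
    "param1",
    value=1,
    help="Explain here what param1 is and how it affects the model.",
)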
How to integrate your model into Helix¶
The following examples will show you how you can integrate your new model into Helix and make it available for users.
Register your model name¶
In helix/options/enums.py, edit the ModelNames enum by adding your model name.
class ModelNames(StrEnum):
    LinearModel = "linear model"
    RandomForest = "random forest"
    XGBoost = "xgboost"
    SVM = "svm"
    ...
    MyNewModel = "my new model"
Making your model available to Helix¶
If your model is a classifier, edit CLASSIFIERS in helix/options/choices/ml_models.py by adding your model like so:
# import your model
CLASSIFIERS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LogisticRegression,
    ModelNames.RandomForest: RandomForestClassifier,
    ModelNames.XGBoost: XGBClassifier,
    ModelNames.SVM: SVC,
    ...
    ModelNames.MyNewModel: MyModel
}
If your model is a regressor, edit REGRESSORS in helix/options/choices/ml_models.py by adding your model like so:
# import your model
REGRESSORS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LinearRegression,
    ModelNames.RandomForest: RandomForestRegressor,
    ModelNames.XGBoost: XGBRegressor,
    ModelNames.SVM: SVR,
    ...
    ModelNames.MyNewModel: MyModel
}
Create the form component¶
# helix/components/forms/forms_ml_opts.py
import streamlit as st

from helix.options.search_grids import MY_MODEL_GRID


# Create the form component for MyModel
def _my_model_opts(use_hyperparam_search: bool) -> dict:
    model_types = {}
    if not use_hyperparam_search:
        st.write("Options")
        param1 = st.number_input(
            "param1",
            value=1,
            help="""
            The first hyperparameter to my model.
            The bigger it is, the more accurate the model.
            """,
        )
        param2 = st.number_input(
            "param2",
            value=0.1,
            max_value=1.0,
            min_value=0.0,
            help="""
            The second hyperparameter to my model.
            The closer the value to 1.0, the smarter it is.
            """,
        )
        params = {
            "param1": param1,
            "param2": param2,
        }
        st.divider()
    else:
        params = MY_MODEL_GRID
    model_types["MY_MODEL"] = {
        "use": True,
        "params": params,
    }
    return model_types
Add the form component to the main form¶
To make your model selectable by the user, edit ml_options_form in helix/components/forms/forms_ml_opts.py as shown below.
# helix/components/forms/forms_ml_opts.py
def ml_options_form():
    ...
    # Look for this to find where the models are made available
    st.subheader("Select and configure which models to train")
    ...
    # Add this underneath to make your model available
    if st.toggle("My Model", value=False):
        my_model_type = _my_model_opts(use_hyperparam_search)
        model_types.update(my_model_type)
Documentation¶
Please add your new model to the user documentation. To do this, edit the "Options" subsection of "Selecting models to train" in docs/users/train_models.md. This is a Markdown file; please see this Markdown guide for information on how to write using Markdown.
If you do not document your model, your model will not be added to Helix.
Your explanation must include the hyperparameters and what each one does to your model. It should also include a brief explanation of the theory behind the model and link to any relevant papers or documentation concerning the model.
Example¶
My New Model
My model uses a super cool algorithm that optimises 2 parameters, param1 and param2, to asymptotically approach Artificial General Intelligence (AGI). The paper can be found at [link here].
param1: The first hyperparameter to my model. The bigger it is, the more accurate the model.
param2: The second hyperparameter to my model. The closer the value to 1.0, the smarter it is.
Testing¶
You must test that your model works with Helix for it to be included. Helix uses pytest and streamlit's testing framework.
You must add a test for both automatic hyperparameter search and manual hyperparameter tuning.
What to test¶
What you are testing, in this case, is not the performance of the model in terms of some metric like accuracy or R^2, but whether your model is properly integrated into Helix. Your tests should check the following:
That there are no errors or exceptions when running the model
That it creates the model directory in the experiment
That it creates the expected .pkl file
That it creates the plot directory for the experiment and that that directory is not empty, i.e. you get the performance plots
That you get the file with the expected predictions
That you get the file with the model metrics
How to add tests¶
You should add your tests to tests/pages/test_4_Train_Models.py.
Generally, you will write 2 test functions: one to test your model with automatic hyperparameter search, and one to test it with manual hyperparameter tuning. Take the tests for the SVM models as an example: you will find 2 tests, test_auto_svm and test_manual_svm. You might call your tests test_auto_<model_name> and test_manual_<model_name>.
Testing AHPS¶
This test simulates the user setting up the model to be trained with GridSearchCV. This test should take one parameter called new_regression_experiment (or new_classification_experiment if training a classifier) of type str.
Below is test_auto_svm as an example:
def test_auto_svm(new_regression_experiment: str):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    k = 3
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Set the number of k-folds
    k_input = get_element_by_label(at, "number_input", "k")
    k_input.set_value(k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
You should be able to create a copy of this example and rename it to test_auto_<my_model>. Then, replace the 2 lines underneath where it says "# Select SVM" with the following:
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
Substitute "My Model" with the actual name of your model.
Testing manual hyperparameter tuning¶
This test simulates the user setting up the model to be trained without AHPS. This test should take 3 parameters called new_regression_experiment (or new_classification_experiment if testing a classifier) of type str, data_split_method of type DataSplitMethods, and holdout_or_k of type int.
Below is test_manual_svm as an example. The decorator above the function signature doesn't need to be altered; it causes the test to run the page with bootstrapping and cross-validation.
@pytest.mark.parametrize(
    "data_split_method,holdout_or_k",
    [
        (DataSplitMethods.Holdout.capitalize(), 3),
        (DataSplitMethods.KFold.capitalize(), 3),
    ],
)
def test_manual_svm(
    new_regression_experiment: str, data_split_method: DataSplitMethods, holdout_or_k: int
):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Unselect AHPS, which is on by default
    ahps_toggle = get_element_by_key(
        at, "toggle", ExecutionStateKeys.UseHyperParamSearch
    )
    ahps_toggle.set_value(False).run()
    # Select the data split method
    data_split_selector = get_element_by_label(at, "selectbox", "Data split method")
    data_split_selector.select(data_split_method).run()
    # Set the number of bootstraps / k-folds
    if holdout_input := get_element_by_label(
        at, "number_input", "Number of bootstraps"
    ):
        holdout_input.set_value(holdout_or_k).run()
    if k_input := get_element_by_label(at, "number_input", "k"):
        k_input.set_value(holdout_or_k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
Similar to the automatic hyperparameter search test, you should only need to edit the lines underneath where it says “# Select SVM”, replacing them with the following:
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
Substitute "My Model" with the actual name of your model.
Running the tests¶
The tests will run when you open a pull request to Helix, and they will re-run every time you push to that PR. You can also run them manually:
uv run pytest
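While iterating locally, it can be quicker to run only the training-page tests, or just the tests for your model. For example, using pytest's -k flag to filter tests by name (substitute my_model for the name you used in your test functions):
uv run pytest tests/pages/test_4_Train_Models.py -k my_model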
Be patient: the tests can take several minutes to complete. Also be aware that your changes may affect other tests.