# Adding a model to Helix

This page sets out what is required for a new model to be included in Helix. We will **not** accept models that do not meet this specification.

## Model specifications

Any model you add to Helix **must** conform to `scikit-learn`'s API. This ensures that all models in Helix behave consistently without requiring checks on the capabilities of each individual model, which in turn keeps the complexity of the code base down. Please see the `scikit-learn` docs for details about their [API](https://scikit-learn.org/stable/developers/develop.html).

Currently, we only support supervised machine learning algorithms in the form of classifiers or regressors. **Classifiers** must implement the [`ClassifierMixin`][ClassifierMixin] and **regressors** must implement the [`RegressorMixin`][RegressorMixin]. **Both** must inherit from the [`BaseEstimator`][BaseEstimator] class.

```python
class MyClassifier(ClassifierMixin, BaseEstimator):
    ...


class MyRegressor(RegressorMixin, BaseEstimator):
    ...
```

You **must** then implement the `fit` and `predict` methods with the logic needed to fit your model and make predictions on data, respectively. For classifiers, you **must** also implement the `predict_proba` method, which returns the probability of each class for each prediction. This is not a requirement of `scikit-learn`, but it is a requirement of Helix.

```python
class MyClassifier(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds

    def predict_proba(self, X):
        # perform prediction logic and estimate class probabilities
        probs = ...
        return probs


class MyRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y):
        # perform fitting logic
        ...
        return self

    def predict(self, X):
        # perform prediction logic
        preds = ...
        return preds
```

See [here](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator) for more information.

Your model **must** be saveable as a pickle file (`.pkl`). This is how Helix persists the models it trains.
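A quick way to check this requirement before submitting is to round-trip your model through `pickle`. Below is a minimal sketch; `MeanRegressor` is a hypothetical stand-in for your own model class:

```python
import pickle

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Toy stand-in for your model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
model = MeanRegressor().fit(X, y)

# Round-trip through pickle, the format Helix uses to persist trained models
with open("my_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("my_model.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model should predict exactly as the original did
assert (restored.predict(X) == model.predict(X)).all()
```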
## Hyperparameter tuning

Your models **must** be designed with *manual* and *automatic* hyperparameter tuning in mind. Make sure your model's hyperparameters are set in the constructor of your class.

```python
class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2
```

As per the notes in [`BaseEstimator`][BaseEstimator], do not include `*args` or `**kwargs` in the constructor signature. The hyperparameters must be set exhaustively and explicitly.
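The reason for this rule is that `scikit-learn` discovers hyperparameters by inspecting the constructor signature. As a quick sanity check, `get_params` and `set_params` (inherited from `BaseEstimator`) should see every hyperparameter. A minimal sketch, repeating the hypothetical `MyNewModel` above:

```python
from sklearn.base import BaseEstimator, ClassifierMixin


class MyNewModel(ClassifierMixin, BaseEstimator):
    def __init__(self, param1: int, param2: float):
        self.param1 = param1
        self.param2 = param2


model = MyNewModel(param1=2, param2=0.5)

# get_params() inspects __init__, which is why *args/**kwargs are
# disallowed: scikit-learn could not discover them.
print(model.get_params())  # {'param1': 2, 'param2': 0.5}

# set_params() is how GridSearchCV applies each candidate combination.
model.set_params(param1=3)
```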
### Automatic hyperparameter search (AHPS)

For your model to work with AHPS, it needs to have tunable hyperparameters. If there aren't any, it will **not** work with [`GridSearchCV`][GridSearchCV], which is how we perform AHPS.

You will need to create a search grid of your hyperparameters. AHPS uses this to find the best combination of hyperparameters for a model trained on a given dataset.

```python
MY_MODEL_GRID = {
    "param1": [1, 2, 3],
    "param2": [1.0, 1.1, 1.2]
}
```

Add your grid to `helix/options/search_grids.py`. You can make your grid as large or as small as you like, and it is not necessary to include every hyperparameter. However, the more hyperparameters you include and the more values you add, the longer the training process will take to complete, so please consider the user experience when deciding how many hyperparameters to tune and how many values to test.
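To illustrate how a grid is consumed, here is a minimal sketch of what [`GridSearchCV`][GridSearchCV] does with one. Helix wires this up internally; the `SVC` estimator, `EXAMPLE_GRID`, and the generated dataset below are illustrative stand-ins only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid over two SVC hyperparameters
EXAMPLE_GRID = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Every combination in the grid is fitted and scored with cross-validation
search = GridSearchCV(SVC(probability=True), EXAMPLE_GRID, cv=3)
search.fit(X, y)

print(search.best_params_)     # the winning combination from the grid
print(search.best_estimator_)  # a model refitted with those values
```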
### Manual hyperparameter search

You will need to create a form for users to input the values they wish to use for the hyperparameters of your model. For each field, it is helpful to include a help message explaining what the hyperparameter is and what it does.

### How to integrate your model into Helix

The following examples show how to integrate your new model into Helix and make it available to users.

#### Register your model name

In `helix/options/enums.py`, edit the `ModelNames` enum by adding your model name.

```python
class ModelNames(StrEnum):
    LinearModel = "linear model"
    RandomForest = "random forest"
    XGBoost = "xgboost"
    SVM = "svm"
    ...
    MyNewModel = "my new model"
```

#### Making your model available to Helix

If your model is a classifier, edit `CLASSIFIERS` in `helix/options/choices/ml_models.py` by adding your model like so:

```python
# import your model
CLASSIFIERS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LogisticRegression,
    ModelNames.RandomForest: RandomForestClassifier,
    ModelNames.XGBoost: XGBClassifier,
    ModelNames.SVM: SVC,
    ...
    ModelNames.MyNewModel: MyModel,
}
```

If your model is a regressor, edit `REGRESSORS` in `helix/options/choices/ml_models.py` by adding your model like so:

```python
# import your model
REGRESSORS: dict[ModelNames, type] = {
    ModelNames.LinearModel: LinearRegression,
    ModelNames.RandomForest: RandomForestRegressor,
    ModelNames.XGBoost: XGBRegressor,
    ModelNames.SVM: SVR,
    ...
    ModelNames.MyNewModel: MyModel,
}
```

#### Create the form component

```python
# helix/components/forms/forms_ml_opts.py
import streamlit as st

from helix.options.search_grids import MY_MODEL_GRID


# Create the form component for MyModel
def _my_model_opts(use_hyperparam_search: bool) -> dict:
    model_types = {}
    if not use_hyperparam_search:
        st.write("Options")
        param1 = st.number_input(
            "param1",
            value=1,
            help="""
            The first hyperparameter to my model.
            The bigger it is, the more accurate the model.
            """,
        )
        param2 = st.number_input(
            "param2",
            value=0.1,
            max_value=1.0,
            min_value=0.0,
            help="""
            The second hyperparameter to my model.
            The closer the value to 1.0, the smarter it is.
            """,
        )
        params = {
            "param1": param1,
            "param2": param2,
        }
        st.divider()
    else:
        params = MY_MODEL_GRID

    model_types["MY_MODEL"] = {
        "use": True,
        "params": params,
    }
    return model_types
```

#### Add the form component to the main form

To make your model selectable by the user, edit `ml_options_form` in `helix/components/forms/forms_ml_opts.py` as shown below.

```python
# helix/components/forms/forms_ml_opts.py
def ml_options_form():
    ...
    # Look for this to find where the models are made available
    st.subheader("Select and configure which models to train")
    ...
    # Add this underneath to make your model available
    if st.toggle("My Model", value=False):
        my_model_type = _my_model_opts(use_hyperparam_search)
        model_types.update(my_model_type)
```

## Documentation

Please add your new model to the user documentation. To do this, edit the **"Options"** subsection of **"Selecting models to train"** in `docs/users/train_models.md`. This is a Markdown file; please see this [Markdown guide](https://www.markdownguide.org/getting-started/) for information on how to write Markdown.

**If you do not document your model, your model will not be added to Helix.**

Your explanation **must** list the hyperparameters and explain what each one does to your model. It should also include a brief explanation of the theory behind the model and link to any relevant papers or documentation concerning the model.

### Example

> - **My New Model**
>
>> My model uses a super cool algorithm that optimises 2 parameters, `param1` and `param2`, to asymptotically approach Artificial General Intelligence (AGI).
>
>> The paper can be found at [link here].
>> - param1: The first hyperparameter to my model. The bigger it is, the more accurate the model.
>> - param2: The second hyperparameter to my model. The closer the value to 1.0, the smarter it is.

## Testing

You **must** test that your model works with Helix for it to be included. Helix uses [`pytest`](https://docs.pytest.org/en/stable/index.html) and `streamlit`'s [testing framework](https://docs.streamlit.io/develop/concepts/app-testing/get-started). You **must** add tests for both automatic hyperparameter search and manual hyperparameter tuning.

### What to test

What you are testing here is not the performance of the model in terms of some metric like accuracy or R², but whether your model is properly integrated into Helix. Your tests should check the following:

- That there are no errors or exceptions when running the model
- That it creates the model directory in the experiment
- That it creates the expected `.pkl` file
- That it creates the plot directory for the experiment and that the directory is not empty, i.e. you get the performance plots
- That you get the file with the expected predictions
- That you get the file with the model metrics

### How to add tests

You should add your tests to `tests/pages/test_4_Train_Models.py`. Generally, you will write 2 test functions: one to test your model with automatic hyperparameter search, and one to test it with manual hyperparameter tuning.

Take the tests for the SVM models as a reference. You will find 2 tests: `test_auto_svm` and `test_manual_svm`. You might call your tests `test_auto_` and `test_manual_` followed by your model's name.

#### Testing AHPS

This test simulates the user setting up the model to be trained with [`GridSearchCV`][GridSearchCV]. The test should take one parameter called `new_regression_experiment` (or `new_classification_experiment` if training a classifier) of type `str`.

Below is `test_auto_svm` as an example:

```python
def test_auto_svm(new_regression_experiment: str):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    k = 3
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Set the number of k-folds
    k_input = get_element_by_label(at, "number_input", "k")
    k_input.set_value(k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
```

You should be able to create a copy of this example and rename it to `test_auto_` followed by your model's name. Then, replace the 2 lines underneath where it says `# Select SVM` with the following:

```python
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
```

Substitute "My Model" with the actual name of your model.

#### Testing manual hyperparameter tuning

This test simulates the user setting up the model to be trained without AHPS. The test should take three parameters: `new_regression_experiment` (or `new_classification_experiment` if testing a classifier) of type `str`, `data_split_method` of type `DataSplitMethods`, and `holdout_or_k` of type `int`.

Below is `test_manual_svm` as an example. The decorator above the function signature doesn't need to be altered; it causes the test to run the page with both bootstrapping and cross-validation.
```python
@pytest.mark.parametrize(
    "data_split_method,holdout_or_k",
    [
        (DataSplitMethods.Holdout.capitalize(), 3),
        (DataSplitMethods.KFold.capitalize(), 3),
    ],
)
def test_manual_svm(
    new_regression_experiment: str,
    data_split_method: DataSplitMethods,
    holdout_or_k: int,
):
    # Arrange
    exp_dir = helix_experiments_base_dir() / new_regression_experiment
    expected_model_dir = ml_model_dir(exp_dir)
    expected_plot_dir = ml_plot_dir(exp_dir)
    expected_preds_file = ml_predictions_path(exp_dir)
    expected_metrics_file = ml_metrics_path(exp_dir)
    at = AppTest.from_file("helix/pages/4_Train_Models.py", default_timeout=120)
    at.run()

    # Act
    # Select the experiment
    exp_selector = get_element_by_key(
        at, "selectbox", ViewExperimentKeys.ExperimentName
    )
    exp_selector.select(new_regression_experiment).run()
    # Unselect AHPS, which is on by default
    ahps_toggle = get_element_by_key(
        at, "toggle", ExecutionStateKeys.UseHyperParamSearch
    )
    ahps_toggle.set_value(False).run()
    # Select the data split method
    data_split_selector = get_element_by_label(at, "selectbox", "Data split method")
    data_split_selector.select(data_split_method).run()
    # Set the number of bootstraps / k-folds
    if holdout_input := get_element_by_label(
        at, "number_input", "Number of bootstraps"
    ):
        holdout_input.set_value(holdout_or_k).run()
    if k_input := get_element_by_label(at, "number_input", "k"):
        k_input.set_value(holdout_or_k).run()
    # Select SVM
    svm_toggle = get_element_by_label(at, "toggle", "Support Vector Machine")
    svm_toggle.set_value(True).run()
    # Leave hyperparameters on their default values
    # Leave save models and plots as true to get the outputs
    # Click run
    button = get_element_by_label(at, "button", "Run Training")
    button.click().run()

    # Assert
    assert not at.exception
    assert not at.error
    assert expected_model_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".pkl"), map(str, expected_model_dir.iterdir()))
    )  # directory is not empty
    assert expected_plot_dir.exists()
    assert list(
        filter(lambda x: x.endswith(".png"), map(str, expected_plot_dir.iterdir()))
    )  # directory is not empty
    assert expected_preds_file.exists()
    assert expected_metrics_file.exists()
```

Similar to the automatic hyperparameter search test, you should only need to edit the lines underneath where it says `# Select SVM`, replacing them with the following:

```python
my_model_toggle = get_element_by_label(at, "toggle", "My Model")
my_model_toggle.set_value(True).run()
```

Substitute "My Model" with the actual name of your model.

### Running the tests

The tests will run when you open a pull request to Helix, and they will re-run every time you push to that PR. You can also run them manually:

```bash
uv run pytest
```

Be patient: the tests can take several minutes. Also be aware that your changes may affect other tests.

[BaseEstimator]: https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator
[ClassifierMixin]: https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html
[RegressorMixin]: https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html
[GridSearchCV]: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
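While developing, you may not want to run the whole suite every time. `pytest`'s `-k` option can filter tests by name; for example, assuming you named your tests `test_auto_my_model` and `test_manual_my_model`:

```bash
# Run only the tests whose names match "my_model"
uv run pytest tests/pages/test_4_Train_Models.py -k "my_model"
```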