helix.services package

Subpackages

Submodules

helix.services.configuration module

helix.services.configuration.load_data_options(path: Path) DataOptions

Load the data options from the JSON file given in path.

Parameters:

path (Path) – The path to the JSON file containing the data options.

Returns:

The data options.

Return type:

DataOptions

helix.services.configuration.load_data_preprocessing_options(path: Path) PreprocessingOptions

Load data preprocessing options from the JSON file at the given path.

Parameters:

path (Path) – The path to the JSON file containing the options.

Returns:

The data preprocessing options.

Return type:

PreprocessingOptions

helix.services.configuration.load_execution_options(path: Path) ExecutionOptions

Load experiment execution options from the JSON file at the given path.

Parameters:

path (Path) – The path to the JSON file containing the options.

Returns:

The execution options.

Return type:

ExecutionOptions

helix.services.configuration.load_fi_options(path: Path) FeatureImportanceOptions | None

Load feature importance options.

Parameters:

path (Path) – The path to the feature importance options file.

Returns:

The feature importance options.

Return type:

FeatureImportanceOptions | None

helix.services.configuration.load_fuzzy_options(path: Path) FuzzyOptions | None

Load fuzzy options.

Parameters:

path (Path) – The path to the fuzzy options file.

Returns:

The fuzzy options.

Return type:

FuzzyOptions | None

helix.services.configuration.load_ml_options(path: Path) MachineLearningOptions | None

Load machine learning options from the JSON file at the given path.

Parameters:

path (Path) – The path to the JSON file containing the options.

Returns:

The machine learning options.

Return type:

MachineLearningOptions | None

helix.services.configuration.load_plot_options(path: Path) PlottingOptions

Load plotting options from the JSON file at the given path.

Parameters:

path (Path) – The path to the JSON file containing the options.

Returns:

The plotting options.

Return type:

PlottingOptions

helix.services.configuration.save_options(path: Path, options: Options)

Save options to a JSON file at the specified path.

Parameters:
  • path (Path) – The path to the JSON file.

  • options (Options) – The options to save.
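
The load and save functions in this module all round-trip an options object through a JSON file. A minimal sketch of that pattern follows, assuming the options classes behave like dataclasses. `DemoOptions`, `save_demo_options` and `load_demo_options` are hypothetical stand-ins for illustration and are not part of helix.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoOptions:
    """Hypothetical stand-in for an options class such as PlottingOptions."""

    plot_title: str
    dpi: int = 150


def save_demo_options(path: Path, options: DemoOptions) -> None:
    """Write the options to a JSON file (illustrative, not the helix code)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(asdict(options), f, indent=4)


def load_demo_options(path: Path) -> DemoOptions:
    """Rebuild the options object from the JSON file."""
    with open(path) as f:
        return DemoOptions(**json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    options_file = Path(tmp) / "plot_options.json"
    original = DemoOptions(plot_title="Feature importance", dpi=300)
    save_demo_options(options_file, original)
    restored = load_demo_options(options_file)
```

Several of the loaders are typed as returning `None`, presumably when the options file is absent, so callers may want to check for `None` before using the result.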

helix.services.data module

class helix.services.data.DataBuilder(data_path: str, random_state: int, normalisation: str, logger: object = None, data_split: DataSplitOptions | None = None, problem_type: str = None)

Bases: object

Data builder class

ingest()

class helix.services.data.TabularData(X_train: list[pandas.core.frame.DataFrame], X_test: list[pandas.core.frame.DataFrame], y_train: list[pandas.core.frame.DataFrame], y_test: list[pandas.core.frame.DataFrame])

Bases: object

X_test: list[DataFrame]
X_train: list[DataFrame]
y_test: list[DataFrame]
y_train: list[DataFrame]

helix.services.data.ingest_data(exec_opts: ExecutionOptions, data_opts: DataOptions, _logger: Logger) TabularData

Load data from disk if it is not already in the Streamlit cache; otherwise return the cached value. This behaviour is controlled by the @st.cache_data decorator on the function.

Parameters:
  • exec_opts (ExecutionOptions) – The execution options.

  • data_opts (DataOptions) – The data options.

  • _logger (Logger) – The logger.

Returns:

The ingested data.

Return type:

TabularData

helix.services.data.read_data(data_path: Path, _logger: Logger) DataFrame

Read a data file into memory from a ‘.csv’ or ‘.xlsx’ file.

Parameters:
  • data_path (Path) – The path to the file to be read.

  • _logger (Logger) – The logger.

Raises:

ValueError – The data file wasn’t a ‘.csv’ or ‘.xlsx’ file.

Returns:

The data read from the file.

Return type:

pd.DataFrame
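
The extension check described above can be sketched as a dispatch on the file suffix. This is an illustration under assumptions, not the helix implementation: `read_table` is a hypothetical name, the suffix comparison is made case-insensitive by choice, and the `.xlsx` branch needs an Excel engine such as openpyxl installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(data_path: Path) -> pd.DataFrame:
    """Read a '.csv' or '.xlsx' file, rejecting any other extension."""
    suffix = data_path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(data_path)
    if suffix == ".xlsx":
        # pandas delegates to an installed Excel engine for this branch.
        return pd.read_excel(data_path)
    raise ValueError(f"Unsupported file type {suffix!r}; expected '.csv' or '.xlsx'")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("a,b\n1,2\n3,4\n")
    df = read_table(csv_path)

    # Any other extension is rejected before the file is opened.
    try:
        read_table(Path(tmp) / "data.txt")
        rejected = False
    except ValueError:
        rejected = True
```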

helix.services.data.save_data(data_path: Path, data: DataFrame, logger: Logger)

Save data to either a ‘.csv’ or ‘.xlsx’ file.

Parameters:
  • data_path (Path) – The path to save the data to.

  • data (pd.DataFrame) – The data to save.

  • logger (Logger) – The logger.

Raises:

ValueError – The data file wasn’t a ‘.csv’ or ‘.xlsx’ file.

helix.services.experiments module

helix.services.experiments.create_experiment(save_dir: Path, plotting_options: PlottingOptions, execution_options: ExecutionOptions, data_options: DataOptions)

Create an experiment on disk with its global plotting options, execution options and data options saved as JSON files.

Parameters:
  • save_dir (Path) – The path to where the experiment will be created.

  • plotting_options (PlottingOptions) – The plotting options to save.

  • execution_options (ExecutionOptions) – The execution options to save.

  • data_options (DataOptions) – The data options to save.

helix.services.experiments.delete_previous_fi_results(experiment_path: Path)

Delete previous feature importance results.

Parameters:

experiment_path (Path) – The path to the experiment.

helix.services.experiments.find_previous_fi_results(experiment_path: Path) bool

Find previous feature importance results.

Parameters:

experiment_path (Path) – The path to the experiment.

Returns:

Whether previous feature importance results exist.

Return type:

bool

helix.services.experiments.get_experiments(base_dir: Path | None = None) list[str]

Get the list of experiments in the Helix experiment directory.

If base_dir is not specified, the default from helix_experiments_base_dir is used.

Parameters:
  • base_dir (Path | None, optional) – Specify a base directory for experiments. Defaults to None.

Returns:

The list of experiments.

Return type:

list[str]
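
One way to picture this listing is as the names of the subdirectories of the base directory. The sketch below assumes exactly that layout (one subdirectory per experiment); `list_experiments` and `demo_default_base_dir` are hypothetical names, and the default location used here is invented for the example rather than taken from helix.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def demo_default_base_dir() -> Path:
    """Hypothetical default location; helix defines its own."""
    return Path.home() / "demo_experiments"


def list_experiments(base_dir: Optional[Path] = None) -> List[str]:
    """Return the names of the experiment subdirectories, sorted by name."""
    if base_dir is None:
        base_dir = demo_default_base_dir()
    if not base_dir.exists():
        return []
    # Assumption: every subdirectory is one experiment; plain files are skipped.
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "exp_b").mkdir()
    (base / "exp_a").mkdir()
    (base / "notes.txt").write_text("not an experiment")
    experiments = list_experiments(base)
```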

helix.services.logs module

helix.services.logs.get_logs(log_dir: Path) str

Get the latest log file for the latest run to display.

Parameters:

log_dir (Path) – The directory to search for the latest logs.

Raises:

NotADirectoryError – log_dir does not point to a directory.

Returns:

The text of the latest log file.

Return type:

str
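
A minimal sketch of this behaviour, assuming that "latest" means the most recently modified file and that log files end in `.log` (neither is confirmed by the description above). `read_latest_log` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def read_latest_log(log_dir: Path) -> str:
    """Return the text of the most recently modified '.log' file in log_dir."""
    if not log_dir.is_dir():
        raise NotADirectoryError(f"{log_dir} is not a directory")
    log_files = list(log_dir.glob("*.log"))
    if not log_files:
        return ""
    latest = max(log_files, key=lambda p: p.stat().st_mtime)
    return latest.read_text()


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    older = log_dir / "run_1.log"
    newer = log_dir / "run_2.log"
    older.write_text("first run")
    newer.write_text("second run")
    # Set explicit modification times so the ordering is deterministic.
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    latest_text = read_latest_log(log_dir)

    # Passing a file instead of a directory raises NotADirectoryError.
    try:
        read_latest_log(older)
        raised = False
    except NotADirectoryError:
        raised = True
```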

helix.services.metrics module

helix.services.metrics.find_mean_model_index(full_metrics: dict, aggregated_metrics: dict, metric_name: str) int

Find the index of the model with the mean of the metric.

helix.services.metrics.get_metrics(problem_type: ProblemTypes, logger: object = None) dict

Get the metrics functions for a given problem type.

For classification:

  • Accuracy

  • F1

  • Precision

  • Recall

  • ROC AUC

For regression:

  • R2

  • MAE

  • RMSE

Parameters:
  • problem_type (ProblemTypes) – Whether the problem is classification or regression.

  • logger (object, optional) – The logger. Defaults to None.

Raises:

ValueError – When you give an incorrect problem type.

Returns:

A dict of score names and functions.

Return type:

dict
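
The returned dict can be thought of as a lookup from score name to a callable taking true and predicted values. The sketch below illustrates that shape with one pure-Python metric per problem type. The plain strings stand in for `ProblemTypes`, and the key names and `demo_get_metrics` are assumptions made for this example.

```python
from typing import Callable, Dict, Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def demo_get_metrics(problem_type: str) -> Dict[str, Callable]:
    """Return a name -> function mapping for the given problem type."""
    if problem_type == "classification":
        return {"accuracy": accuracy}
    if problem_type == "regression":
        return {"MAE": mean_absolute_error}
    raise ValueError(f"Unknown problem type: {problem_type!r}")


# Evaluate every metric for a problem type in one pass.
y_true, y_pred = [1, 0, 1, 1], [1, 0, 0, 1]
scores = {
    name: fn(y_true, y_pred)
    for name, fn in demo_get_metrics("classification").items()
}
```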

helix.services.ml_models module

helix.services.ml_models.get_model(model_type: type, model_params: dict = None) MlModel

Get a new instance of the requested machine learning model.

If the model is to be used in a grid search, specify model_params=None.

Parameters:
  • model_type (type) – The Python type (constructor) of the model to instantiate.

  • model_params (dict, optional) – The parameters to pass to the model constructor. Defaults to None.

Returns:

A new instance of the requested machine learning model.

Return type:

MlModel
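
The two calling modes can be sketched as follows. This assumes that `model_params=None` results in the model being built with its constructor defaults, leaving hyperparameters for the grid search to set; `DemoModel` and `make_model` are hypothetical names used only for illustration.

```python
from typing import Optional


class DemoModel:
    """Hypothetical model class standing in for a real estimator."""

    def __init__(self, n_estimators: int = 100, max_depth: Optional[int] = None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth


def make_model(model_type: type, model_params: Optional[dict] = None):
    """Instantiate model_type, unpacking model_params when they are given."""
    if model_params is None:
        # Grid-search case: build with defaults and let the search set values.
        return model_type()
    return model_type(**model_params)


configured = make_model(DemoModel, {"n_estimators": 10, "max_depth": 3})
for_grid_search = make_model(DemoModel)
```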

helix.services.ml_models.get_model_type(model_type: str, problem_type: ProblemTypes) type

Fetch the appropriate type for a given model name based on the problem type.

Parameters:
  • model_type (str) – The kind of model.

  • problem_type (ProblemTypes) – Type of problem (classification or regression).

Raises:

ValueError – If a model type is not recognised or unsupported.

Returns:

The constructor for a machine learning model class.

Return type:

type

helix.services.ml_models.load_models(path: Path) dict[str, list]

Load pre-trained machine learning models.

Parameters:

path (Path) – The path to the directory where the models are saved.

Returns:

The pre-trained models.

Return type:

dict[str, list]

helix.services.ml_models.load_models_to_explain(path: Path, model_names: list) dict[str, list]

Load pre-trained machine learning models.

Parameters:
  • path (Path) – The path to the directory where the models are saved.

  • model_names (list) – The names of the models to explain.

Returns:

The pre-trained models.

Return type:

dict[str, list]

helix.services.ml_models.models_exist(path: Path) bool

helix.services.ml_models.save_model(model, path: Path)

Save a machine learning model to the given file path.

Parameters:
  • model (_type_) – The model to save. Must be picklable.

  • path (Path) – The file path to save the model.
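
Since the model must be picklable, saving presumably comes down to a pickle dump. A minimal sketch with the standard library follows; `save_pickled` and `load_pickled` are hypothetical helper names, and a plain dict stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickled(model, path: Path) -> None:
    """Pickle the model to the given file path, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_pickled(path: Path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A dict stands in for a trained model in this example.
demo_model = {"weights": [0.1, 0.2], "bias": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "demo.pkl"
    save_pickled(demo_model, model_path)
    reloaded = load_pickled(model_path)
```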

helix.services.ml_models.save_model_predictions(predictions: DataFrame, path: Path)

Save the predictions of the models to the given file path.

Parameters:
  • predictions (DataFrame) – The predictions to save.

  • path (Path) – The file path to save the predictions.

helix.services.ml_models.save_models_metrics(metrics: dict, path: Path)

Save the statistical metrics of the models to the given file path.

Parameters:
  • metrics (dict) – The metrics to save.

  • path (Path) – The file path to save the metrics.

helix.services.plotting module

helix.services.plotting.plot_auc_roc(y_classes_labels: ndarray, y_score_probs: ndarray, model_name: str, set_name: str, directory: Path, plot_opts: PlottingOptions) None

Plot the ROC curve for a multi-class classification model.

Parameters:
  • y_classes_labels (numpy.ndarray) – The true labels of the classes.

  • y_score_probs (numpy.ndarray) – The predicted probabilities of the classes.

  • model_name (str) – The name of the model.

  • set_name (str) – The name of the set (train or test).

  • directory (Path) – The directory path to save the plot.

  • plot_opts (PlottingOptions) – The plotting options.

helix.services.plotting.plot_bar_chart(df: DataFrame, sort_key: Any, plot_opts: PlottingOptions, title: str, x_label: str, y_label: str, n_features: int = 10, error_bars: DataFrame | None = None) Figure

Plot a bar chart of the top n features from the given dataframe.

Parameters:
  • df (pd.DataFrame) – The data to be plotted.

  • plot_opts (PlottingOptions) – The options for styling the plot.

  • sort_key (str) – The key by which to sort the data. This can be the name of a column.

  • title (str) – The title of the plot.

  • x_label (str) – The label for the X axis.

  • y_label (str) – The label for the Y axis.

  • n_features (int, optional) – The top number of features to plot. Defaults to 10.

  • error_bars (pd.DataFrame | None, optional) – Error bars for the plot. Defaults to None.

  • directory (Path | None, optional) – The directory to save the plot. Defaults to None.

  • model_name (str | None, optional) – The name of the model. Defaults to None.

  • set_name (str | None, optional) – The name of the set (train or test). Defaults to None.

Returns:

The bar chart of the top n features.

Return type:

Figure

helix.services.plotting.plot_beta_coefficients(coefficients: ndarray, feature_names: list, plot_opts: PlottingOptions, model_name: str, dependent_variable: str | None = None, standard_errors: ndarray | None = None, is_classification: bool = False) Figure

Create a bar plot of model coefficients with different colors for positive/negative values.

Parameters:
  • coefficients (np.ndarray) – The model coefficients. For logistic regression, uses first class coefficients

  • feature_names (list) – Names of the features

  • plot_opts (PlottingOptions) – Plot styling options

  • model_name (str) – Name of the model for the plot title

  • dependent_variable (str | None, optional) – Name of the dependent variable. Defaults to None.

  • standard_errors (np.ndarray | None, optional) – Standard errors of coefficients. Defaults to None.

  • is_classification (bool, optional) – Whether this is a classification model. Defaults to False.

Returns:

The coefficient plot

Return type:

Figure

helix.services.plotting.plot_confusion_matrix(y_true: ndarray, y_pred: ndarray, model_name: str, set_name: str, directory: Path, plot_opts: PlottingOptions) None

Plot the confusion matrix for a multi-class or binary classification model.

Parameters:
  • y_true (np.ndarray) – The true labels.

  • y_pred (np.ndarray) – The predicted labels.

  • model_name (str) – The name of the model.

  • set_name (str) – The name of the set (train or test).

  • directory (Path) – The directory path to save the plot.

  • plot_opts (PlottingOptions) – The plotting options.

helix.services.plotting.plot_global_shap_importance(shap_values: DataFrame, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure

Produce a bar chart of global SHAP values.

Parameters:
  • shap_values (pd.DataFrame) – The DataFrame containing the global SHAP values.

  • plot_opts (PlottingOptions) – The plotting options.

  • num_features_to_plot (int) – The number of top features to plot.

  • title (str) – The plot title.

Returns:

The bar chart of global SHAP values.

Return type:

Figure

helix.services.plotting.plot_lime_importance(df: DataFrame, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure

Plot LIME importance.

Parameters:
  • df (pd.DataFrame) – The LIME data to plot

  • plot_opts (PlottingOptions) – The plotting options.

  • num_features_to_plot (int) – The top number of features to plot.

  • title (str) – The title of the plot.

Returns:

The LIME plot.

Return type:

Figure

helix.services.plotting.plot_local_shap_importance(shap_values: Explainer, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure

Plot a beeswarm plot of the local SHAP values.

Parameters:
  • shap_values (shap.Explainer) – The SHAP explainer to produce the plot from.

  • plot_opts (PlottingOptions) – The plotting options.

  • num_features_to_plot (int) – The number of top features to plot.

  • title (str) – The plot title.

Returns:

The beeswarm plot of local SHAP values.

Return type:

Figure

helix.services.plotting.plot_permutation_importance(df: DataFrame, plot_opts: PlottingOptions, n_features: int, title: str) Figure

Plot a bar chart of the top n features in the feature importance dataframe, with the given title and styled with the given options.

Parameters:
  • df (pd.DataFrame) – The dataframe containing the permutation importance.

  • plot_opts (PlottingOptions) – The options for how to configure the plot.

  • n_features (int) – The top number of features to plot.

  • title (str) – The title of the plot.

Returns:

The bar chart of the top n features.

Return type:

Figure

helix.services.plotting.plot_scatter(y, yp, r2: float, set_name: str, dependent_variable: str, model_name: str, plot_opts: PlottingOptions) Figure

Create a scatter plot comparing predicted vs actual values.

Parameters:
  • y (_type_) – True y values.

  • yp (_type_) – Predicted y values.

  • r2 (float) – R-squared between y and yp.

  • set_name (str) – “Train” or “Test”.

  • dependent_variable (str) – The name of the dependent variable.

  • model_name (str) – Name of the model.

  • plot_opts (PlottingOptions) – Options for styling the plot.

Returns:

The scatter plot figure

Return type:

Figure

helix.services.preprocessing module

helix.services.preprocessing.convert_nominal_to_numeric(data: DataFrame) DataFrame

Convert all nominal (categorical) columns in a DataFrame to numeric values. This function identifies all object or category type columns in the input DataFrame and converts them to numeric representations using pandas’ factorize method. Each unique category is assigned a unique integer value.

Parameters:

data (pd.DataFrame) – The input DataFrame containing columns to be converted.

Returns:

A DataFrame with all categorical columns converted to numeric values.

Return type:

pd.DataFrame
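
The conversion described above can be sketched with `select_dtypes` and `pandas.factorize`. This is an illustration of the technique, not the helix source; `nominal_to_numeric` is a hypothetical name.

```python
import pandas as pd


def nominal_to_numeric(data: pd.DataFrame) -> pd.DataFrame:
    """Replace object and category columns with integer codes."""
    converted = data.copy()
    nominal_columns = converted.select_dtypes(include=["object", "category"]).columns
    for column in nominal_columns:
        # factorize assigns one integer per distinct value.
        codes, _uniques = pd.factorize(converted[column])
        converted[column] = codes
    return converted


df = pd.DataFrame(
    {
        "colour": pd.Series(["red", "blue", "red"], dtype=object),
        "grade": pd.Series(["low", "high", "high"], dtype="category"),
        "size": [1.5, 2.0, 3.5],
    }
)
encoded = nominal_to_numeric(df)
```

Note that `pandas.factorize` encodes missing values as -1 by default, so columns with gaps may need handling before or after conversion. The integer codes also carry no ordinal meaning.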

helix.services.preprocessing.find_non_numeric_columns(data: DataFrame | Series) List[str]

Find non-numeric columns in a DataFrame or check if a Series contains non-numeric values.

Parameters:

data (Union[pd.DataFrame, pd.Series]) – The DataFrame or Series to check.

Returns:

If data is a DataFrame, returns a list of non-numeric column names.

If data is a Series, returns [“Series”] if it contains non-numeric values, else an empty list.

Return type:

List[str]

helix.services.preprocessing.normalise_independent_variables(normalisation_method: str, X)

Normalise the independent variables based on the selected method.

Parameters:
  • normalisation_method (str) – The normalisation method to use.

  • X (pd.DataFrame) – The independent variables to normalise.

Returns:

The normalised independent variables.

Return type:

pd.DataFrame

helix.services.preprocessing.run_feature_selection(preprocessing_opts: PreprocessingOptions, data: DataFrame) DataFrame

Run feature selection on the data based on the selected methods.

Parameters:
  • preprocessing_opts (PreprocessingOptions) – The preprocessing options, including the feature selection methods to use.

  • data (pd.DataFrame) – The data to perform feature selection on.

Returns:

The processed data.

Return type:

pd.DataFrame

helix.services.preprocessing.run_preprocessing(data: DataFrame, experiment_path: Path, config: PreprocessingOptions) DataFrame

helix.services.preprocessing.transform_dependent_variable(transformation_y_method: str, y)

Transform the dependent variable based on the selected method.

Parameters:
  • transformation_y_method (str) – The transformation method to use.

  • y (pd.Series) – The dependent variable to transform.

Returns:

The transformed dependent variable.

Return type:

pd.Series

helix.services.statistical_tests module

helix.services.statistical_tests.create_normality_test_table(data: DataFrame) DataFrame | None

Create a dataframe with normality test results for numerical columns.

Parameters:

data – Input DataFrame containing the data to test

Returns:

DataFrame containing normality test results for each numerical column, or None if no valid columns are found

helix.services.statistical_tests.kolmogorov_smirnov_test(data: ndarray | list, reference_dist: str = 'norm') Tuple[float, float]

Perform Kolmogorov-Smirnov test to determine if a sample comes from a reference distribution. By default, tests against a normal distribution.

Parameters:
  • data – Input array of observations to test. Can be a numpy array or a list.

  • reference_dist – String specifying the reference distribution. Default is ‘norm’ for normal distribution. Other options include: ‘uniform’, ‘expon’, etc.

Returns:

A tuple containing:

  • statistic: The test statistic

  • p_value: The p-value for the hypothesis test

Return type:

Tuple[float, float]

Note

  • Null hypothesis: the data comes from the specified distribution

  • If p-value < alpha (typically 0.05), reject the null hypothesis (data does not come from the specified distribution)

  • If p-value >= alpha, fail to reject the null hypothesis (data may come from the specified distribution)
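
A sketch of how such a test can be run with `scipy.stats.kstest`, which may or may not be what the helix function calls internally; `ks_test` is a hypothetical wrapper name. When the reference distribution is given by name only, SciPy compares against that distribution with its default parameters (for 'norm', mean 0 and standard deviation 1), so data on another scale would usually be standardised first.

```python
import numpy as np
from scipy import stats


def ks_test(data, reference_dist: str = "norm"):
    """Return (statistic, p_value) of a one-sample Kolmogorov-Smirnov test."""
    result = stats.kstest(np.asarray(data, dtype=float), reference_dist)
    return float(result.statistic), float(result.pvalue)


rng = np.random.default_rng(seed=0)
uniform_sample = rng.uniform(0.0, 1.0, size=500)

# Uniform data tested against a standard normal: the null is rejected.
stat_norm, p_norm = ks_test(uniform_sample, "norm")

# The same data tested against the distribution it was drawn from.
stat_unif, p_unif = ks_test(uniform_sample, "uniform")
```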

helix.services.statistical_tests.shapiro_wilk_test(data: ndarray | list) Tuple[float, float]

Perform Shapiro-Wilk test for normality on the input data.

The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

Parameters:

data – Input array of observations to test for normality. Can be a numpy array or a list.

Returns:

A tuple containing:

  • statistic: The test statistic

  • p_value: The p-value for the hypothesis test

Return type:

Tuple[float, float]

Note

  • Null hypothesis: the data is normally distributed

  • If p-value < alpha (typically 0.05), reject the null hypothesis (data is not normally distributed)

  • If p-value >= alpha, fail to reject the null hypothesis (data may be normally distributed)
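
The decision rule in the note can be sketched with `scipy.stats.shapiro`. The helper names below are hypothetical, and 0.05 is just the conventional alpha mentioned above.

```python
import numpy as np
from scipy import stats


def shapiro_test(data):
    """Return (statistic, p_value) of the Shapiro-Wilk normality test."""
    statistic, p_value = stats.shapiro(np.asarray(data, dtype=float))
    return float(statistic), float(p_value)


def normality_rejected(p_value: float, alpha: float = 0.05) -> bool:
    """True when the null hypothesis of normality is rejected at level alpha."""
    return p_value < alpha


rng = np.random.default_rng(seed=0)
uniform_sample = rng.uniform(0.0, 1.0, size=500)  # clearly non-normal
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)

uniform_stat, uniform_p = shapiro_test(uniform_sample)
normal_stat, normal_p = shapiro_test(normal_sample)
```

Failing to reject for a sample does not prove normality; it only means the test found no evidence against it at the chosen alpha.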

helix.services.weights_init module

helix.services.weights_init.kaiming_init(m: Module, nonlinearity: str = 'relu') None

Initializes the weights of Linear layers using Kaiming initialization.

Parameters:
  • m (torch.nn.Module) – The module to initialize.

  • nonlinearity (str) – The nonlinearity used in the network (e.g. 'relu', 'leaky_relu'). Defaults to 'relu'.

Returns:

None

helix.services.weights_init.normal_init(m: Module, mean: float = 0.0, std: float = 0.02) None

Initializes the weights of Linear layers using a normal distribution.

Parameters:
  • m (torch.nn.Module) – The module to initialize.

  • mean (float) – The mean of the normal distribution. Defaults to 0.0.

  • std (float) – The standard deviation of the normal distribution. Defaults to 0.02.

Returns:

None

helix.services.weights_init.xavier_init(m: Module) None

Initializes the weights of Linear layers using Xavier initialization.

Parameters:

m (torch.nn.Module) – The module to initialize.

Returns:

None
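
For orientation, the scale each scheme gives the weights can be computed by hand. The sketch below uses the normal-distribution variants with fan-in scaling as defined in PyTorch's `torch.nn.init` documentation. Whether the helix functions use the normal or uniform variants is not stated here, so treat this as an assumption; both function names are hypothetical.

```python
import math


def kaiming_normal_std(
    fan_in: int, nonlinearity: str = "relu", negative_slope: float = 0.01
) -> float:
    """Standard deviation for Kaiming normal init: gain / sqrt(fan_in)."""
    if nonlinearity == "relu":
        gain = math.sqrt(2.0)
    elif nonlinearity == "leaky_relu":
        gain = math.sqrt(2.0 / (1.0 + negative_slope**2))
    else:
        raise ValueError(f"Unsupported nonlinearity: {nonlinearity!r}")
    return gain / math.sqrt(fan_in)


def xavier_normal_std(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Standard deviation for Xavier normal init: gain * sqrt(2 / (fan_in + fan_out))."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


# A Linear layer with 8 inputs and 8 outputs.
kaiming_std = kaiming_normal_std(fan_in=8)
xavier_std = xavier_normal_std(fan_in=8, fan_out=8)
```

With PyTorch, an initialiser that takes a single module argument is typically applied to every layer of a network via `model.apply(...)`.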

Module contents