helix.services package¶
Subpackages¶
- helix.services.feature_importance package
- helix.services.machine_learning package
Submodules¶
helix.services.configuration module¶
- helix.services.configuration.load_data_options(path: Path) DataOptions ¶
Load the data options from the JSON file given in path.
- Parameters:
path (Path) – The path to the JSON file containing the data options.
- Returns:
The data options.
- Return type:
DataOptions
- helix.services.configuration.load_data_preprocessing_options(path: Path) PreprocessingOptions ¶
Load data preprocessing options from the given path. The path will be to a json file containing the options.
- Parameters:
path (Path) – The path to the JSON file containing the options.
- Returns:
The data preprocessing options.
- Return type:
PreprocessingOptions
- helix.services.configuration.load_execution_options(path: Path) ExecutionOptions ¶
Load experiment execution options from the given path. The path will be to a json file containing the options.
- Parameters:
path (Path) – The path to the JSON file containing the options.
- Returns:
The execution options.
- Return type:
ExecutionOptions
- helix.services.configuration.load_fi_options(path: Path) FeatureImportanceOptions | None ¶
Load feature importance options.
- Parameters:
path (Path) – The path to the feature importance options file.
- Returns:
The feature importance options.
- Return type:
FeatureImportanceOptions | None
- helix.services.configuration.load_fuzzy_options(path: Path) FuzzyOptions | None ¶
Load fuzzy options.
- Parameters:
path (Path) – The path to the fuzzy options file.
- Returns:
The fuzzy options.
- Return type:
FuzzyOptions | None
- helix.services.configuration.load_ml_options(path: Path) MachineLearningOptions | None ¶
Load machine learning options from the given path. The path will be to a json file containing the options.
- Parameters:
path (Path) – The path to the JSON file containing the options.
- Returns:
The machine learning options.
- Return type:
MachineLearningOptions | None
- helix.services.configuration.load_plot_options(path: Path) PlottingOptions ¶
Load plotting options from the given path. The path will be to a json file containing the plot options.
- Parameters:
path (Path) – The path to the JSON file containing the options.
- Returns:
The plotting options.
- Return type:
PlottingOptions
- helix.services.configuration.save_options(path: Path, options: Options)¶
Save options to a json file at the specified path.
- Parameters:
path (Path) – The path to the json file.
options (Options) – The options to save.
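The loaders and save_options above describe a JSON round trip. A minimal sketch of that pattern follows, assuming the options classes behave like dataclasses. `DemoOptions`, `save_options_sketch` and `load_options_sketch` are hypothetical names used for illustration only, not part of Helix.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoOptions:
    """Hypothetical options class, used only for this illustration."""

    data_path: str
    random_state: int = 42


def save_options_sketch(path: Path, options: DemoOptions) -> None:
    """Serialise the options to a JSON file at `path`."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(options), f, indent=4)


def load_options_sketch(path: Path) -> DemoOptions:
    """Read the JSON file at `path` back into an options object."""
    with open(path, "r", encoding="utf-8") as f:
        return DemoOptions(**json.load(f))


# Round trip: what is saved should be what is loaded.
with tempfile.TemporaryDirectory() as tmp:
    options_file = Path(tmp) / "demo_options.json"
    original = DemoOptions(data_path="data.csv", random_state=7)
    save_options_sketch(options_file, original)
    restored = load_options_sketch(options_file)
```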
helix.services.data module¶
- class helix.services.data.DataBuilder(data_path: str, random_state: int, normalisation: str, logger: object = None, data_split: DataSplitOptions | None = None, problem_type: str = None)¶
Bases:
object
Data builder class
- ingest()¶
- class helix.services.data.TabularData(X_train: list[pandas.core.frame.DataFrame], X_test: list[pandas.core.frame.DataFrame], y_train: list[pandas.core.frame.DataFrame], y_test: list[pandas.core.frame.DataFrame])¶
Bases:
object
- X_test: list[DataFrame]¶
- X_train: list[DataFrame]¶
- y_test: list[DataFrame]¶
- y_train: list[DataFrame]¶
- helix.services.data.ingest_data(exec_opts: ExecutionOptions, data_opts: DataOptions, _logger: Logger) TabularData ¶
Load data from disk if the data is not in the streamlit cache, else return the stored value. This behaviour is controlled by the decorator on the function signature (@st.cache_data).
- Parameters:
exec_opts (ExecutionOptions) – The execution options.
data_opts (DataOptions) – The data options.
_logger (Logger) – The logger.
- Returns:
The ingested data.
- Return type:
TabularData
- helix.services.data.read_data(data_path: Path, _logger: Logger) DataFrame ¶
Read a data file into memory from a ‘.csv’ or ‘.xlsx’ file.
- Parameters:
data_path (Path) – The path to the file to be read.
_logger (Logger) – The logger.
- Raises:
ValueError – The data file wasn’t a ‘.csv’ or ‘.xlsx’ file.
- Returns:
The data read from the file.
- Return type:
pd.DataFrame
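The extension check described for read_data can be sketched as follows. This is an illustration under assumptions, not Helix's implementation: `read_data_sketch` is a hypothetical name, logging is left out, and the `.xlsx` branch relies on an Excel engine such as openpyxl being installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_data_sketch(data_path: Path) -> pd.DataFrame:
    """Read a '.csv' or '.xlsx' file, rejecting any other extension."""
    suffix = data_path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(data_path)
    if suffix == ".xlsx":
        # Requires an Excel engine (for example openpyxl) at runtime.
        return pd.read_excel(data_path)
    raise ValueError(f"Unsupported file type {suffix!r}; expected '.csv' or '.xlsx'")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    csv_path.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")
    frame = read_data_sketch(csv_path)

    # An unsupported extension is rejected before any file access.
    try:
        read_data_sketch(Path(tmp) / "demo.txt")
        raised = False
    except ValueError:
        raised = True
```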
- helix.services.data.save_data(data_path: Path, data: DataFrame, logger: Logger)¶
Save data to either a ‘.csv’ or ‘.xlsx’ file.
- Parameters:
data_path (Path) – The path to save the data to.
data (pd.DataFrame) – The data to save.
logger (Logger) – The logger.
- Raises:
ValueError – The data file wasn’t a ‘.csv’ or ‘.xlsx’ file.
helix.services.experiments module¶
- helix.services.experiments.create_experiment(save_dir: Path, plotting_options: PlottingOptions, execution_options: ExecutionOptions, data_options: DataOptions)¶
Create an experiment on disk with its global plotting options, execution options and data options saved as JSON files.
- Parameters:
save_dir (Path) – The path to where the experiment will be created.
plotting_options (PlottingOptions) – The plotting options to save.
execution_options (ExecutionOptions) – The execution options to save.
data_options (DataOptions) – The data options to save.
- helix.services.experiments.delete_previous_fi_results(experiment_path: Path)¶
Delete previous feature importance results.
- Parameters:
experiment_path (Path) – The path to the experiment.
- helix.services.experiments.find_previous_fi_results(experiment_path: Path) bool ¶
Find previous feature importance results.
- Parameters:
experiment_path (Path) – The path to the experiment.
- Returns:
Whether previous feature importance results exist.
- Return type:
bool
- helix.services.experiments.get_experiments(base_dir: Path | None = None) list[str] ¶
Get the list of experiments in the Helix experiment directory.
If base_dir is not specified, the default from helix_experiments_base_dir is used.
- Parameters:
base_dir (Path | None, optional) – Specify a base directory for experiments. Defaults to None.
- Returns:
The list of experiments.
- Return type:
list[str]
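One plausible reading of get_experiments is "list the sub-directories of the base directory". The sketch below makes that assumption explicit. `get_experiments_sketch` and `default_base_dir_sketch` are hypothetical, and the default location shown is invented for the example; the real default comes from helix_experiments_base_dir.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def default_base_dir_sketch() -> Path:
    """Hypothetical stand-in for helix_experiments_base_dir."""
    return Path.home() / "HelixExperimentsDemo"


def get_experiments_sketch(base_dir: Optional[Path] = None) -> List[str]:
    """Return the names of the directories found directly under base_dir."""
    if base_dir is None:
        base_dir = default_base_dir_sketch()
    if not base_dir.exists():
        # Assumption: a missing base directory simply means no experiments.
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "experiment_b").mkdir()
    (base / "experiment_a").mkdir()
    (base / "notes.txt").write_text("not an experiment", encoding="utf-8")
    found = get_experiments_sketch(base)
    missing = get_experiments_sketch(base / "does_not_exist")
```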
helix.services.logs module¶
- helix.services.logs.get_logs(log_dir: Path) str ¶
Get the text of the most recent log file from the latest run, for display.
- Parameters:
log_dir (Path) – The directory to search for the latest logs.
- Raises:
NotADirectoryError – log_dir does not point to a directory.
- Returns:
The text of the latest log file.
- Return type:
str
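A minimal sketch of the behaviour described for get_logs, assuming "latest" means the most recently modified file. `get_logs_sketch` is a hypothetical name, and returning an empty string for an empty directory is an assumption of this example.

```python
import os
import tempfile
from pathlib import Path


def get_logs_sketch(log_dir: Path) -> str:
    """Return the text of the most recently modified file in log_dir."""
    if not log_dir.is_dir():
        raise NotADirectoryError(f"{log_dir} is not a directory")
    log_files = [p for p in log_dir.iterdir() if p.is_file()]
    if not log_files:
        # Assumption: no logs yet means there is nothing to display.
        return ""
    latest = max(log_files, key=lambda p: p.stat().st_mtime)
    return latest.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    old_log = log_dir / "run_1.log"
    new_log = log_dir / "run_2.log"
    old_log.write_text("first run", encoding="utf-8")
    new_log.write_text("second run", encoding="utf-8")
    # Set explicit modification times so the ordering is unambiguous.
    os.utime(old_log, (1_000_000_000, 1_000_000_000))
    os.utime(new_log, (2_000_000_000, 2_000_000_000))
    latest_text = get_logs_sketch(log_dir)

    try:
        get_logs_sketch(new_log)  # a file, not a directory
        not_a_dir_raised = False
    except NotADirectoryError:
        not_a_dir_raised = True
```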
helix.services.metrics module¶
- helix.services.metrics.find_mean_model_index(full_metrics: dict, aggregated_metrics: dict, metric_name: str) int ¶
Find the index of the model with the mean of the metric.
- helix.services.metrics.get_metrics(problem_type: ProblemTypes, logger: object = None) dict ¶
Get the metrics functions for a given problem type.
For classification: Accuracy, F1, Precision, Recall and ROC AUC.
For regression: R2, MAE and RMSE.
- Parameters:
problem_type (ProblemTypes) – Whether the problem is classification or regression.
logger (object, optional) – The logger. Defaults to None.
- Raises:
ValueError – When you give an incorrect problem type.
- Returns:
A dict of score names and functions.
- Return type:
dict
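The "dict of score names and functions" returned by get_metrics can be pictured with the sketch below. It is an illustration only: the problem type is a plain string rather than a ProblemTypes member, the metric functions are hand-written, and the classification branch includes just accuracy for brevity. All names ending in `_sketch` are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def r2_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy_sketch(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def get_metrics_sketch(problem_type: str) -> Dict[str, Callable]:
    """Map metric names to functions for the given problem type."""
    if problem_type == "regression":
        return {"R2": r2_sketch, "MAE": mae_sketch, "RMSE": rmse_sketch}
    if problem_type == "classification":
        # F1, precision, recall and ROC AUC are omitted from this sketch.
        return {"accuracy": accuracy_sketch}
    raise ValueError(f"Unknown problem type: {problem_type!r}")


metrics = get_metrics_sketch("regression")
scores = {
    name: fn([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
    for name, fn in metrics.items()
}

try:
    get_metrics_sketch("clustering")
    bad_type_raised = False
except ValueError:
    bad_type_raised = True
```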
helix.services.ml_models module¶
- helix.services.ml_models.get_model(model_type: type, model_params: dict = None) MlModel ¶
Get a new instance of the requested machine learning model.
If the model is to be used in a grid search, specify model_params=None.
- Parameters:
model_type (type) – The Python type (constructor) of the model to instantiate.
model_params (dict, optional) – The parameters to pass to the model constructor. Defaults to None.
- Returns:
A new instance of the requested machine learning model.
- Return type:
MlModel
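The constructor-plus-parameters pattern described for get_model can be sketched as below. `DemoModel` and `get_model_sketch` are hypothetical; the sketch only shows how a type and an optional parameter dict combine into a new instance, with None meaning "use the constructor defaults".

```python
from typing import Optional


class DemoModel:
    """Hypothetical model class standing in for a real estimator."""

    def __init__(self, depth: int = 3, learning_rate: float = 0.1):
        self.depth = depth
        self.learning_rate = learning_rate


def get_model_sketch(model_type: type, model_params: Optional[dict] = None):
    """Instantiate model_type, passing model_params as keyword arguments if given."""
    if model_params is None:
        # For example when a grid search will supply the parameters later.
        return model_type()
    return model_type(**model_params)


default_model = get_model_sketch(DemoModel)
tuned_model = get_model_sketch(DemoModel, {"depth": 5})
```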
- helix.services.ml_models.get_model_type(model_type: str, problem_type: ProblemTypes) type ¶
Fetch the appropriate type for a given model name based on the problem type.
- Parameters:
model_type (str) – The name of the kind of model.
problem_type (ProblemTypes) – Type of problem (classification or regression).
- Raises:
ValueError – If a model type is not recognised or unsupported.
- Returns:
The constructor for a machine learning model class.
- Return type:
type
- helix.services.ml_models.load_models(path: Path) dict[str, list] ¶
Load pre-trained machine learning models.
- Parameters:
path (Path) – The path to the directory where the models are saved.
- Returns:
The pre-trained models.
- Return type:
dict[str, list]
- helix.services.ml_models.load_models_to_explain(path: Path, model_names: list) dict[str, list] ¶
Load the pre-trained machine learning models that are to be explained.
- Parameters:
path (Path) – The path to the directory where the models are saved.
model_names (list) – The names of the models to explain.
- Returns:
The pre-trained models.
- Return type:
dict[str, list]
- helix.services.ml_models.models_exist(path: Path) bool ¶
- helix.services.ml_models.save_model(model, path: Path)¶
Save a machine learning model to the given file path.
- Parameters:
model – The model to save. Must be picklable.
path (Path) – The file path to save the model.
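Since the model must be picklable, saving can be sketched with the standard pickle module. This is an assumption about the mechanism, not a statement of how Helix stores models. `save_model_sketch` and `load_model_sketch` are hypothetical names, and a plain dict stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_sketch(model, path: Path) -> None:
    """Pickle the model to the given file path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model_sketch(path: Path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A dict stands in for a trained model in this example.
demo_model = {"coefficients": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "models" / "demo_model.pkl"
    save_model_sketch(demo_model, model_file)
    reloaded_model = load_model_sketch(model_file)
```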
- helix.services.ml_models.save_model_predictions(predictions: DataFrame, path: Path)¶
Save the predictions of the models to the given file path.
- Parameters:
predictions (DataFrame) – The predictions to save.
path (Path) – The file path to save the predictions.
- helix.services.ml_models.save_models_metrics(metrics: dict, path: Path)¶
Save the statistical metrics of the models to the given file path.
- Parameters:
metrics (dict) – The metrics to save.
path (Path) – The file path to save the metrics.
helix.services.plotting module¶
- helix.services.plotting.plot_auc_roc(y_classes_labels: ndarray, y_score_probs: ndarray, model_name: str, set_name: str, directory: Path, plot_opts: PlottingOptions) None ¶
Plot the ROC curve for a multi-class classification model.
- Parameters:
y_classes_labels (numpy.ndarray) – The true labels of the classes.
y_score_probs (numpy.ndarray) – The predicted probabilities of the classes.
model_name (string) – The name of the model.
set_name (string) – The name of the set (train or test).
directory (Path) – The directory path to save the plot.
plot_opts (PlottingOptions) – The plotting options.
- helix.services.plotting.plot_bar_chart(df: DataFrame, sort_key: Any, plot_opts: PlottingOptions, title: str, x_label: str, y_label: str, n_features: int = 10, error_bars: DataFrame | None = None) Figure ¶
Plot a bar chart of the top n features from the given dataframe.
- Parameters:
df (pd.DataFrame) – The data to be plotted.
plot_opts (PlottingOptions) – The options for styling the plot.
sort_key (str) – The key by which to sort the data. This can be the name of a column.
title (str) – The title of the plot.
x_label (str) – The label for the X axis.
y_label (str) – The label for the Y axis.
n_features (int, optional) – The number of top features to plot. Defaults to 10.
error_bars (pd.DataFrame | None, optional) – Error bars for the plot. Defaults to None.
directory (Path | None, optional) – The directory to save the plot. Defaults to None.
model_name (str | None, optional) – The name of the model. Defaults to None.
set_name (str | None, optional) – The name of the set (train or test). Defaults to None.
- Returns:
The bar chart of the top n features.
- Return type:
Figure
- helix.services.plotting.plot_beta_coefficients(coefficients: ndarray, feature_names: list, plot_opts: PlottingOptions, model_name: str, dependent_variable: str | None = None, standard_errors: ndarray | None = None, is_classification: bool = False) Figure ¶
Create a bar plot of model coefficients with different colors for positive/negative values.
- Parameters:
coefficients (np.ndarray) – The model coefficients. For logistic regression, uses first class coefficients
feature_names (list) – Names of the features
plot_opts (PlottingOptions) – Plot styling options
model_name (str) – Name of the model for the plot title
dependent_variable (str | None, optional) – Name of the dependent variable. Defaults to None.
standard_errors (np.ndarray | None, optional) – Standard errors of coefficients. Defaults to None.
is_classification (bool, optional) – Whether this is a classification model. Defaults to False.
- Returns:
The coefficient plot
- Return type:
Figure
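The idea of colouring bars by coefficient sign can be sketched with Matplotlib as follows. The function name, colours and figure size are assumptions made for this example; plot_beta_coefficients itself takes PlottingOptions and may style the figure differently.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np


def plot_coefficients_sketch(coefficients, feature_names, title):
    """Horizontal bar plot with one colour for positive and one for negative values."""
    colours = ["tab:blue" if c >= 0 else "tab:red" for c in coefficients]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.barh(feature_names, coefficients, color=colours)
    ax.axvline(0, color="black", linewidth=0.8)  # reference line at zero
    ax.set_xlabel("Coefficient value")
    ax.set_title(title)
    fig.tight_layout()
    return fig


coefs = np.array([0.8, -0.3, 0.1])
fig = plot_coefficients_sketch(coefs, ["age", "dose", "weight"], "Demo coefficients")
```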
- helix.services.plotting.plot_confusion_matrix(y_true: ndarray, y_pred: ndarray, model_name: str, set_name: str, directory: Path, plot_opts: PlottingOptions) None ¶
Plot the confusion matrix for a multi-class or binary classification model.
- Parameters:
y_true (np.ndarray) – The true labels.
y_pred (np.ndarray) – The predicted labels.
model_name (str) – The name of the model.
set_name (str) – The name of the set (train or test).
directory (Path) – The directory path to save the plot.
plot_opts (PlottingOptions) – The plotting options.
- helix.services.plotting.plot_global_shap_importance(shap_values: DataFrame, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure ¶
Produce a bar chart of global SHAP values.
- Parameters:
shap_values (pd.DataFrame) – The DataFrame containing the global SHAP values.
plot_opts (PlottingOptions) – The plotting options.
num_features_to_plot (int) – The number of top features to plot.
title (str) – The plot title.
- Returns:
The bar chart of global SHAP values.
- Return type:
Figure
- helix.services.plotting.plot_lime_importance(df: DataFrame, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure ¶
Plot LIME importance.
- Parameters:
df (pd.DataFrame) – The LIME data to plot
plot_opts (PlottingOptions) – The plotting options.
num_features_to_plot (int) – The top number of features to plot.
title (str) – The title of the plot.
- Returns:
The LIME plot.
- Return type:
Figure
- helix.services.plotting.plot_local_shap_importance(shap_values: Explainer, plot_opts: PlottingOptions, num_features_to_plot: int, title: str) Figure ¶
Plot a beeswarm plot of the local SHAP values.
- Parameters:
shap_values (shap.Explainer) – The SHAP explainer to produce the plot from.
plot_opts (PlottingOptions) – The plotting options.
num_features_to_plot (int) – The number of top features to plot.
title (str) – The plot title.
- Returns:
The beeswarm plot of local SHAP values.
- Return type:
Figure
- helix.services.plotting.plot_permutation_importance(df: DataFrame, plot_opts: PlottingOptions, n_features: int, title: str) Figure ¶
Plot a bar chart of the top n features in the feature importance dataframe, with the given title and styled with the given options.
- Parameters:
df (pd.DataFrame) – The dataframe containing the permutation importance.
plot_opts (PlottingOptions) – The options for how to configure the plot.
n_features (int) – The top number of features to plot.
title (str) – The title of the plot.
- Returns:
The bar chart of the top n features.
- Return type:
Figure
- helix.services.plotting.plot_scatter(y, yp, r2: float, set_name: str, dependent_variable: str, model_name: str, plot_opts: PlottingOptions) Figure ¶
Create a scatter plot comparing predicted vs actual values.
- Parameters:
y – True y values.
yp – Predicted y values.
r2 (float) – R-squared between y and yp.
set_name (str) – “Train” or “Test”.
dependent_variable (str) – The name of the dependent variable.
model_name (str) – Name of the model.
plot_opts (PlottingOptions) – Options for styling the plot.
- Returns:
The scatter plot figure
- Return type:
Figure
helix.services.preprocessing module¶
- helix.services.preprocessing.convert_nominal_to_numeric(data: DataFrame) DataFrame ¶
Convert all nominal (categorical) columns in a DataFrame to numeric values. This function identifies all object or category type columns in the input DataFrame and converts them to numeric representations using pandas’ factorize method. Each unique category is assigned a unique integer value.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing columns to be converted.
- Returns:
A DataFrame with all categorical columns converted to numeric values.
- Return type:
pd.DataFrame
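The factorize-based conversion described above can be sketched like this. `convert_nominal_to_numeric_sketch` is a hypothetical name; the sketch follows the description (object or category columns, pandas' factorize) but is not Helix's code. The example column is created with an explicit object dtype so the behaviour does not depend on the pandas version's default string type.

```python
import pandas as pd


def convert_nominal_to_numeric_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Replace each object/category column with integer codes from pd.factorize."""
    converted = data.copy()  # leave the caller's DataFrame untouched
    nominal_cols = converted.select_dtypes(include=["object", "category"]).columns
    for col in nominal_cols:
        # factorize assigns codes in order of first appearance: 0, 1, 2, ...
        codes, _uniques = pd.factorize(converted[col])
        converted[col] = codes
    return converted


df = pd.DataFrame(
    {
        "colour": pd.Series(["red", "blue", "red"], dtype="object"),
        "grade": pd.Categorical(["low", "high", "low"]),
        "size": [1.0, 2.5, 3.0],
    }
)
numeric_df = convert_nominal_to_numeric_sketch(df)
```

Note that the integer codes carry no ordering information; they only distinguish categories.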
- helix.services.preprocessing.find_non_numeric_columns(data: DataFrame | Series) List[str] ¶
Find non-numeric columns in a DataFrame or check if a Series contains non-numeric values.
- Parameters:
data (Union[pd.DataFrame, pd.Series]) – The DataFrame or Series to check.
- Returns:
If data is a DataFrame, a list of the non-numeric column names.
If data is a Series, [“Series”] if it contains non-numeric values, else an empty list.
- Return type:
List[str]
- helix.services.preprocessing.normalise_independent_variables(normalisation_method: str, X)¶
Normalise the independent variables based on the selected method.
- Parameters:
normalisation_method (str) – The normalisation method to use.
X (pd.DataFrame) – The independent variables to normalise.
- Returns:
The normalised independent variables.
- Return type:
pd.DataFrame
- helix.services.preprocessing.run_feature_selection(preprocessing_opts: PreprocessingOptions, data: DataFrame) DataFrame ¶
Run feature selection on the data based on the selected methods.
- Parameters:
preprocessing_opts (PreprocessingOptions) – The preprocessing options, including the feature selection methods to use.
data (pd.DataFrame) – The data to perform feature selection on.
- Returns:
The processed data.
- Return type:
pd.DataFrame
- helix.services.preprocessing.run_preprocessing(data: DataFrame, experiment_path: Path, config: PreprocessingOptions) DataFrame ¶
- helix.services.preprocessing.transform_dependent_variable(transformation_y_method: str, y)¶
Transform the dependent variable based on the selected method.
- Parameters:
transformation_y_method (str) – The transformation method to use.
y (pd.Series) – The dependent variable to transform.
- Returns:
The transformed dependent variable.
- Return type:
pd.Series
helix.services.statistical_tests module¶
- helix.services.statistical_tests.create_normality_test_table(data: DataFrame) DataFrame | None ¶
Create a dataframe with normality test results for numerical columns.
- Parameters:
data – Input DataFrame containing the data to test
- Returns:
DataFrame containing normality test results for each numerical column, or None if no valid columns are found
- helix.services.statistical_tests.kolmogorov_smirnov_test(data: ndarray | list, reference_dist: str = 'norm') Tuple[float, float] ¶
Perform Kolmogorov-Smirnov test to determine if a sample comes from a reference distribution. By default, tests against a normal distribution.
- Parameters:
data – Input array of observations to test. Can be a numpy array or a list.
reference_dist – String specifying the reference distribution. Default is ‘norm’ for normal distribution. Other options include: ‘uniform’, ‘expon’, etc.
- Returns:
A tuple (statistic, p_value), where statistic is the test statistic and p_value is the p-value for the hypothesis test.
- Return type:
Tuple[float, float]
Note
Null hypothesis: the data comes from the specified distribution
If p-value < alpha (typically 0.05), reject the null hypothesis (data does not come from the specified distribution)
If p-value >= alpha, fail to reject the null hypothesis (data may come from the specified distribution)
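The decision rule in the note can be tried out with SciPy's `scipy.stats.kstest`. That SciPy is the backend is an assumption based on the distribution names ('norm', 'uniform', 'expon'); `ks_test_sketch` is a hypothetical wrapper. When a distribution is given by name with no extra arguments, SciPy compares against its default parameters, so 'norm' means the standard normal.

```python
import numpy as np
from scipy import stats


def ks_test_sketch(data, reference_dist: str = "norm"):
    """Return (statistic, p_value) of a one-sample Kolmogorov-Smirnov test."""
    sample = np.asarray(data, dtype=float)
    result = stats.kstest(sample, reference_dist)
    return float(result.statistic), float(result.pvalue)


rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=1.0, size=500)  # clearly not normal

alpha = 0.05
ks_statistic, ks_p_value = ks_test_sketch(skewed_sample, "norm")
# p-value < alpha: reject the null hypothesis that the sample is standard normal.
reject_normality = ks_p_value < alpha
```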
- helix.services.statistical_tests.shapiro_wilk_test(data: ndarray | list) Tuple[float, float] ¶
Perform Shapiro-Wilk test for normality on the input data.
The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
- Parameters:
data – Input array of observations to test for normality. Can be a numpy array or a list.
- Returns:
A tuple (statistic, p_value), where statistic is the test statistic and p_value is the p-value for the hypothesis test.
- Return type:
Tuple[float, float]
Note
Null hypothesis: the data is normally distributed
If p-value < alpha (typically 0.05), reject the null hypothesis (data is not normally distributed)
If p-value >= alpha, fail to reject the null hypothesis (data may be normally distributed)
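The same decision rule applies to the Shapiro-Wilk test. A short sketch using `scipy.stats.shapiro`, assuming SciPy is the backend; `shapiro_wilk_sketch` is a hypothetical wrapper, not the Helix function.

```python
import numpy as np
from scipy import stats


def shapiro_wilk_sketch(data):
    """Return (statistic, p_value) of the Shapiro-Wilk normality test."""
    sample = np.asarray(data, dtype=float)
    result = stats.shapiro(sample)
    return float(result.statistic), float(result.pvalue)


rng = np.random.default_rng(0)
exponential_sample = rng.exponential(scale=1.0, size=500)  # strongly skewed

alpha = 0.05
sw_statistic, sw_p_value = shapiro_wilk_sketch(exponential_sample)
# p-value < alpha: reject the null hypothesis of normality.
not_normal = sw_p_value < alpha
```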
helix.services.weights_init module¶
- helix.services.weights_init.kaiming_init(m: Module, nonlinearity: str = 'relu') None ¶
Initializes the weights of Linear layers using Kaiming initialization.
- Parameters:
m (torch.nn.Module) – The module to initialize.
nonlinearity (str) – The nonlinearity used in the network (e.g. 'relu', 'leaky_relu'). Defaults to "relu".
- Returns:
None
- helix.services.weights_init.normal_init(m: Module, mean: float = 0.0, std: float = 0.02) None ¶
Initializes the weights of Linear layers using a normal distribution.
- Parameters:
m (torch.nn.Module) – The module to initialize.
mean (float) – The mean of the normal distribution. Defaults to 0.0.
std (float) – The standard deviation of the normal distribution. Defaults to 0.02.
- Returns:
None
- helix.services.weights_init.xavier_init(m: Module) None ¶
Initializes the weights of Linear layers using Xavier initialization.
- Parameters:
m (torch.nn.Module) – The module to initialize.
- Returns:
None
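Initialisers with this signature (a module in, None out) are typically passed to `torch.nn.Module.apply`, which calls them on every submodule. The sketch below shows that usage with a hypothetical `kaiming_init_sketch`. Choosing `kaiming_normal_` over `kaiming_uniform_` and zeroing the bias are assumptions of this example, not confirmed details of the Helix functions.

```python
import torch
from torch import nn


def kaiming_init_sketch(m: nn.Module, nonlinearity: str = "relu") -> None:
    """Apply Kaiming initialisation to Linear layers and leave other modules alone."""
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity=nonlinearity)
        if m.bias is not None:
            nn.init.zeros_(m.bias)  # assumption: biases start at zero


torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# apply() visits every submodule, so each Linear layer is initialised once.
net.apply(lambda module: kaiming_init_sketch(module, nonlinearity="relu"))
```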