Package 'shapr'

Title: Prediction Explanation with Dependence-Aware Shapley Values
Description: Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values provide the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating Shapley values do, however, assume feature independence. This package implements methods that account for any feature dependence, and thereby produces more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository.
Authors: Martin Jullum [cre, aut], Lars Henry Berge Olsen [aut], Annabelle Redelmeier [aut], Jon Lachmann [aut], Nikolai Sellereite [aut], Anders Løland [ctb], Jens Christian Wahl [ctb], Camilla Lingjærde [ctb], Norsk Regnesentral [cph, fnd]
Maintainer: Martin Jullum <[email protected]>
License: MIT + file LICENSE
Version: 1.0.1.9000
Built: 2025-01-22 10:22:59 UTC
Source: https://github.com/norskregnesentral/shapr

Help Index


Explain the output of machine learning models with dependence-aware (conditional/observational) Shapley values

Description

Computes dependence-aware Shapley values for observations in x_explain from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.

Usage

explain(
  model,
  x_explain,
  x_train,
  approach,
  phi0,
  iterative = NULL,
  max_n_coalitions = NULL,
  group = NULL,
  n_MC_samples = 1000,
  seed = 1,
  verbose = "basic",
  predict_model = NULL,
  get_model_specs = NULL,
  prev_shapr_object = NULL,
  asymmetric = FALSE,
  causal_ordering = NULL,
  confounding = NULL,
  extra_computation_args = list(),
  iterative_args = list(),
  output_args = list(),
  ...
)

Arguments

model

Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of which models explain supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs, see details for more information.

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.

approach

Character vector of length 1 or one less than the number of features. All elements should be either "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.

phi0

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.

iterative

Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If TRUE, the Shapley values are estimated iteratively, which provides sufficiently accurate Shapley value estimates faster. First an initial number of coalitions is sampled, then bootstrapping is used to estimate the variance of the Shapley values. A convergence criterion is used to determine whether the variances of the Shapley values are sufficiently small. If the variances are too high, we estimate the number of required samples to reach convergence, and add that many more coalitions. The process is repeated until the variances are below the threshold. Specifics related to the iterative process and convergence criterion are set through iterative_args.

max_n_coalitions

Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE). If iterative = FALSE it represents the number of feature/group coalitions to use directly. The quantity refers to the number of unique feature coalitions if group = NULL, and group coalitions if group != NULL. max_n_coalitions = NULL corresponds to max_n_coalitions=2^n_features.

group

List. If NULL, regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed. group then has length equal to the number of groups. Each list element contains a character vector with the features included in the corresponding group. See Jullum et al. (2021) for more information on group-wise Shapley values.
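
For illustration, a grouping of the four airquality features used in the examples below could be specified like this (the group names are arbitrary):

group_list <- list(
  weather = c("Solar.R", "Wind", "Temp"),
  calendar = c("Month")
)
# Passed to explain() via the group argument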

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with largest weights) that is used; see also the empirical.eta argument in setup_approach.empirical().

seed

Positive integer. Specifies the seed to set before any code involving randomness is run. If NULL, no seed is set in the calling environment.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of the strings can be used. E.g. verbose = c("basic", "vS_details") will display basic information plus details about the v(S)-estimation process.

predict_model

Function. The prediction function used when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function must have two arguments, model and newdata which specify, respectively, the model and a data.frame/data.table to compute predictions for. The function must give the prediction as a numeric vector. NULL (the default) uses functions specified internally. Can also be used to override the default function for natively supported model classes.
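
As a minimal sketch (assuming a model class whose predict() method returns something coercible to numeric; the function name is illustrative):

predict_model_custom <- function(model, newdata) {
  # Must return one numeric prediction per row of newdata
  as.numeric(predict(model, newdata = as.data.frame(newdata)))
}
# Then pass predict_model = predict_model_custom to explain()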

get_model_specs

Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as argument and provides a list with 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each feature.

factor_levels

Character vector with the levels for any categorical features.

If NULL (the default) internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.
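
A minimal sketch of such a function, assuming an lm-like model object that stores its model frame with the response in the first column (the function name and internal structure are illustrative):

get_model_specs_custom <- function(model) {
  x <- model$model[, -1, drop = FALSE] # drop the response column
  list(
    labels = names(x),
    classes = sapply(x, function(col) class(col)[1]),
    factor_levels = lapply(x, levels) # NULL for non-factor features
  )
}
# Then pass get_model_specs = get_model_specs_custom to explain()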

prev_shapr_object

shapr object or string. If an object of class shapr is provided, or a string with a path to where intermediate results are stored, then the function will use the previous object to continue the computation. This is useful if the computation is interrupted or you want higher accuracy than already obtained, and therefore want to continue the iterative estimation. See the general usage vignette for examples.

asymmetric

Logical. If FALSE (default), explain computes regular symmetric Shapley values. If TRUE, explain computes asymmetric Shapley values based on the (partial) causal ordering given by causal_ordering. That is, explain only uses the feature combinations/coalitions that respect the causal ordering when computing the asymmetric Shapley values. If asymmetric is TRUE and confounding is NULL (default), then explain computes asymmetric conditional Shapley values as specified in Frye et al. (2020). If confounding is provided, i.e., not NULL, then explain computes asymmetric causal Shapley values as specified in Heskes et al. (2020).

causal_ordering

List. Only applicable for asymmetric and/or causal explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

confounding

Logical vector. Only applicable for causal explanations. confounding is a vector of logicals specifying whether confounding is assumed or not for each component in the causal_ordering. If NULL (default), then no assumption about the confounding structure is made and explain computes asymmetric/symmetric conditional Shapley values, depending on the value of asymmetric. If confounding is a single logical, i.e., FALSE or TRUE, then this assumption is set globally for all components in the causal ordering. Otherwise, confounding must be a vector of logicals of the same length as causal_ordering, indicating the confounding assumption for each component. When confounding is specified, explain computes asymmetric/symmetric causal Shapley values, depending on the value of asymmetric. The approach cannot be regression_separate or regression_surrogate, as the regression-based approaches are not applicable to the causal Shapley value methodology.
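
As a sketch of how asymmetric, causal_ordering and confounding combine in a call (the ordering and confounding pattern are purely illustrative):

# Asymmetric causal Shapley values: features 1 and 2 precede features 3 and 4,
# with confounding assumed only within the first component
explain_asym_causal <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  asymmetric = TRUE,
  causal_ordering = list(c(1, 2), c(3, 4)),
  confounding = c(TRUE, FALSE)
)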

extra_computation_args

Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for description of the arguments and their default values.

iterative_args

Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for description of the arguments and their default values.

output_args

Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for description of the arguments and their default values.

...

Arguments passed on to setup_approach.categorical, setup_approach.copula, setup_approach.ctree, setup_approach.empirical, setup_approach.gaussian, setup_approach.independence, setup_approach.regression_separate, setup_approach.regression_surrogate, setup_approach.timeseries, setup_approach.vaeac

categorical.joint_prob_dt

Data.table. (Optional) Containing the joint probability distribution for each combination of feature values. NULL means it is estimated from the x_train and x_explain.

categorical.epsilon

Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain feature value combinations occur in x_explain but NOT in x_train, then epsilon is used as the proportion of times that these combinations occur in the training data. In theory, this proportion should be zero, but that causes an error later in the Shapley computation.

internal

List. Not used directly, but passed through from explain().

ctree.mincriterion

Numeric scalar or vector. Either a scalar or vector of length equal to the number of features in the model. The value is equal to 1 - alpha, where alpha is the nominal level of the conditional independence tests. If it is a vector, this indicates which value to use when conditioning on various numbers of features. The default value is 0.95.

ctree.minsplit

Numeric scalar. Determines the minimum sum of weights a node must have to be considered for splitting. The default value is 20.

ctree.minbucket

Numeric scalar. Determines the minimum sum of weights in a terminal node required for a split. The default value is 7.

ctree.sample

Boolean. If TRUE (default), then the method always samples n_MC_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_MC_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is more than n_MC_samples, the method samples n_MC_samples observations (with replacement). This means that there will always be sampling in the leaf unless ctree.sample = FALSE and the number of observations in the node is less than n_MC_samples.

empirical.type

Character. (default = "fixed_sigma") Should be equal to one of "independence", "fixed_sigma", "AICc_each_k", or "AICc_full". "independence" is deprecated; use approach = "independence" instead. "fixed_sigma" uses a fixed bandwidth (set through empirical.fixed_sigma) in the kernel density estimation. "AICc_each_k" and "AICc_full" optimize the bandwidth using the AICc criterion, with respectively one bandwidth per coalition size and one bandwidth for all coalition sizes.

empirical.eta

Numeric scalar. Needs to satisfy 0 < eta <= 1. The default value is 0.95. Represents the minimum proportion of the total empirical weight that the data samples should cover. If e.g. eta = .8, we choose the K samples with the largest weights so that the sum of their weights accounts for 80% of the total weight. eta is the η parameter in equation (15) of Aas et al. (2021).

empirical.fixed_sigma

Positive numeric scalar. The default value is 0.1. Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions. Only used when empirical.type = "fixed_sigma"

empirical.n_samples_aicc

Positive integer. Number of samples to consider in the AICc optimization. The default value is 1000. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.eval_max_aicc

Positive integer. Maximum number of iterations when optimizing the AICc. The default value is 20. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.start_aicc

Numeric. Start value of the sigma parameter when optimizing the AICc. The default value is 0.1. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.cov_mat

Numeric matrix. (Optional) The covariance matrix of the data generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.

gaussian.mu

Numeric vector. (Optional) Containing the mean of the data generating distribution. NULL means it is estimated from the x_train.

gaussian.cov_mat

Numeric matrix. (Optional) Containing the covariance matrix of the data generating distribution. NULL means it is estimated from the x_train.

regression.model

A tidymodels object of class model_spec. Default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model, which will be parsed and evaluated. For example, "parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = 'ranger', mode = 'regression')" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.tune_values

Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try. The column names must match the names of the tunable parameters specified in regression.model. If regression.tune_values is a function, then it should take one argument x, which is the training data for the current coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the coalition. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.vfold_cv_para

Either NULL (default) or a named list containing the parameters to be sent to rsample::vfold_cv(). See the regression vignette for several examples.
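
A sketch of a cross-validated tuning setup along the lines of the regression vignette (the model, grid and number of folds are illustrative):

explain_separate_tuned <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  phi0 = p,
  approach = "regression_separate",
  regression.model = parsnip::decision_tree(
    tree_depth = hardhat::tune(), engine = "rpart", mode = "regression"
  ),
  regression.tune_values = dials::grid_regular(dials::tree_depth(), levels = 4),
  regression.vfold_cv_para = list(v = 5)
)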

regression.recipe_func

Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.surrogate_n_comb

Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this number of unique coalitions separately for each training observation. That is, on average, all coalitions should be equally trained.

timeseries.fixed_sigma

Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2.

timeseries.bounds

Numeric vector of length two. Specifies the lower and upper bounds of the timeseries. The default is c(NULL, NULL), i.e. no bounds. If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.

vaeac.depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

vaeac.lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

vaeac.activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

vaeac.n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best-performing one is picked after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2), and training continues with that one.

vaeac.epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes vaeac.extra_parameters$epochs_initiation_phase, where the default is 2.

vaeac.extra_parameters

Named list with extra parameters to the vaeac approach. See vaeac_get_extra_para_default() for description of possible additional parameters and their default values.
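
A sketch of a vaeac call with reduced training effort for quick experimentation (requires the torch package; the parameter values are illustrative, not recommendations):

explain_vaeac <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = p,
  n_MC_samples = 1e2,
  vaeac.epochs = 10,
  vaeac.n_vaeacs_initialize = 1
)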

Details

The shapr package implements kernelSHAP estimation of dependence-aware Shapley values with eight different Monte Carlo-based approaches for estimating the conditional distributions of the data. These are all introduced in the general usage vignette (from R: vignette("general_usage", package = "shapr")). Moreover, Aas et al. (2021) gives a general introduction to dependence-aware Shapley values and to the three approaches "empirical", "gaussian" and "copula", and also discusses "independence". Redelmeier et al. (2020) introduces the approach "ctree". Olsen et al. (2022) introduces the "vaeac" approach. The approach "timeseries" is discussed in Jullum et al. (2021). shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate", as described in Olsen et al. (2024). It is also possible to combine the different approaches; see the general usage vignette for more information.

The package also supports the computation of causal and asymmetric Shapley values as introduced by Heskes et al. (2020) and Frye et al. (2020). Asymmetric Shapley values were proposed by Frye et al. (2020) as a way to incorporate causal knowledge in the real world by restricting the possible feature combinations/coalitions when computing the Shapley values to those consistent with a (partial) causal ordering. Causal Shapley values were proposed by Heskes et al. (2020) as a way to explain the total effect of features on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr.

The package allows for parallelized computation with progress updates through the tightly connected future::future and progressr::progressr packages. See the examples below. For iterative estimation (iterative = TRUE), intermediate results may also be printed to the console (according to the verbose argument). Moreover, the intermediate results are written to disk. This, combined with batched computation of the v(S) values, enables fast and accurate estimation of the Shapley values in a memory-friendly manner.

Value

Object of class c("shapr", "list"). Contains the following items:

shapley_values_est

data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not devoted to any of the features (given by the argument phi0).

shapley_values_sd

data.table with the standard deviations of the Shapley values, reflecting the uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.

internal

List with the different parameters, data, functions and other output used internally.

pred_explain

Numeric vector with the predictions for the explained observations

MSEv

List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage for details.

timing

List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.

Author(s)

Martin Jullum, Lars Henry Berge Olsen

Examples

## Not run: 

# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# Split data into test- and training data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)

x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]

# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)

# Explain predictions
p <- mean(data_train[, y_var])

# (Optionally) enable parallelization via the future package
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("multisession", workers = 2)
}


# (Optionally) enable progress updates within every iteration via the progressr package
if (requireNamespace("progressr", quietly = TRUE)) {
  progressr::handlers(global = TRUE)
}

# Empirical approach
explain1 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)

# Gaussian approach
explain2 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2
)

# Gaussian copula approach
explain3 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "copula",
  phi0 = p,
  n_MC_samples = 1e2
)

# ctree approach
explain4 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "ctree",
  phi0 = p,
  n_MC_samples = 1e2
)

# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = approach,
  phi0 = p,
  n_MC_samples = 1e2
)

# Print the Shapley values
print(explain1$shapley_values_est)

# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(explain1)
  plot(explain1, plot_type = "waterfall")
}

# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))

explain_groups <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  group = group_list,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)
print(explain_groups$shapley_values_est)

# Separate and surrogate regression approaches with linear regression models.
explain_separate_lm <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  phi0 = p,
  approach = "regression_separate",
  regression.model = parsnip::linear_reg()
)

explain_surrogate_lm <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  phi0 = p,
  approach = "regression_surrogate",
  regression.model = parsnip::linear_reg()
)

# Iterative estimation
# For illustration purposes only. By default not used for such small dimensions as here

# Gaussian approach
explain_iterative <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2,
  iterative = TRUE,
  iterative_args = list(initial_n_coalitions = 10)
)

## End(Not run)

Explain a forecast from time series models with dependence-aware (conditional/observational) Shapley values

Description

Computes dependence-aware Shapley values for observations in explain_idx from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.

Usage

explain_forecast(
  model,
  y,
  xreg = NULL,
  train_idx = NULL,
  explain_idx,
  explain_y_lags,
  explain_xreg_lags = explain_y_lags,
  horizon,
  approach,
  phi0,
  max_n_coalitions = NULL,
  iterative = NULL,
  group_lags = TRUE,
  group = NULL,
  n_MC_samples = 1000,
  seed = 1,
  predict_model = NULL,
  get_model_specs = NULL,
  verbose = "basic",
  extra_computation_args = list(),
  iterative_args = list(),
  output_args = list(),
  ...
)

Arguments

model

Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of which models explain supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs, see details for more information.

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.

train_idx

Numeric vector. The row indices in y and xreg denoting points in time to use when estimating the conditional expectations in the Shapley value formula. If train_idx = NULL (default), all indices not selected for explanation will be used.

explain_idx

Numeric vector. The row indices in y and xreg denoting points in time to explain.

explain_y_lags

Numeric vector. Denotes the number of lags that should be used for each variable in y when making a forecast.

explain_xreg_lags

Numeric vector. If xreg != NULL, denotes the number of lags that should be used for each variable in xreg when making a forecast.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

approach

Character vector of length 1 or one less than the number of features. All elements should be either "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.

phi0

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.

max_n_coalitions

Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE). If iterative = FALSE it represents the number of feature/group coalitions to use directly. The quantity refers to the number of unique feature coalitions if group = NULL, and group coalitions if group != NULL. max_n_coalitions = NULL corresponds to max_n_coalitions=2^n_features.

iterative

Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If TRUE, the Shapley values are estimated iteratively, which provides sufficiently accurate Shapley value estimates faster. First an initial number of coalitions is sampled, then bootstrapping is used to estimate the variance of the Shapley values. A convergence criterion is used to determine whether the variances of the Shapley values are sufficiently small. If the variances are too high, we estimate the number of required samples to reach convergence, and add that many more coalitions. The process is repeated until the variances are below the threshold. Specifics related to the iterative process and convergence criterion are set through iterative_args.

group_lags

Logical. If TRUE all lags of each variable are grouped together and explained as a group. If FALSE all lags of each variable are explained individually.

group

List. If NULL, regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed. group then has length equal to the number of groups. Each list element contains a character vector with the features included in the corresponding group. See Jullum et al. (2021) for more information on group-wise Shapley values.

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with largest weights) that is used; see also the empirical.eta argument in setup_approach.empirical().

seed

Positive integer. Specifies the seed to set before any code involving randomness is run. If NULL, no seed is set in the calling environment.

predict_model

Function. The prediction function used when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function must have two arguments, model and newdata which specify, respectively, the model and a data.frame/data.table to compute predictions for. The function must give the prediction as a numeric vector. NULL (the default) uses functions specified internally. Can also be used to override the default function for natively supported model classes.

get_model_specs

Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as argument and provides a list with 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each feature.

factor_levels

Character vector with the levels for any categorical features.

If NULL (the default) internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of the strings can be used. E.g. verbose = c("basic", "vS_details") will display basic information plus details about the v(S)-estimation process.

extra_computation_args

Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for description of the arguments and their default values.

iterative_args

Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for description of the arguments and their default values.

output_args

Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for description of the arguments and their default values.

...

Arguments passed on to setup_approach.categorical, setup_approach.copula, setup_approach.ctree, setup_approach.empirical, setup_approach.gaussian, setup_approach.independence, setup_approach.timeseries, setup_approach.vaeac

categorical.joint_prob_dt

Data.table. (Optional) Containing the joint probability distribution for each combination of feature values. NULL means it is estimated from the x_train and x_explain.

categorical.epsilon

Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain feature value combinations occur in x_explain but NOT in x_train, then epsilon is used as the proportion of times that these combinations occur in the training data. In theory, this proportion should be zero, but that causes an error later in the Shapley computation.

internal

List. Not used directly, but passed through from explain().

ctree.mincriterion

Numeric scalar or vector. Either a scalar or vector of length equal to the number of features in the model. The value is equal to 1 - alpha, where alpha is the nominal level of the conditional independence tests. If it is a vector, this indicates which value to use when conditioning on various numbers of features. The default value is 0.95.

ctree.minsplit

Numeric scalar. Determines the minimum sum of weights a node must have to be considered for splitting. The default value is 20.

ctree.minbucket

Numeric scalar. Determines the minimum sum of weights in a terminal node required for a split. The default value is 7.

ctree.sample

Boolean. If TRUE (default), then the method always samples n_MC_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_MC_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is more than n_MC_samples, the method samples n_MC_samples observations (with replacement). This means that there will always be sampling in the leaf unless ctree.sample = FALSE and the number of observations in the node is less than n_MC_samples.

empirical.type

Character. (default = "fixed_sigma") Should be equal to one of "independence", "fixed_sigma", "AICc_each_k", or "AICc_full". "independence" is deprecated; use approach = "independence" instead. "fixed_sigma" uses a fixed bandwidth (set through empirical.fixed_sigma) in the kernel density estimation. "AICc_each_k" and "AICc_full" optimize the bandwidth using the AICc criterion, with respectively one bandwidth per coalition size and one bandwidth for all coalition sizes.

empirical.eta

Numeric scalar. Needs to satisfy 0 < eta <= 1. The default value is 0.95. Represents the minimum proportion of the total empirical weight that the data samples should cover. If e.g. eta = .8, we choose the K samples with the largest weights so that the sum of their weights accounts for 80% of the total weight. eta is the η parameter in equation (15) of Aas et al. (2021).

empirical.fixed_sigma

Positive numeric scalar. The default value is 0.1. Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions. Only used when empirical.type = "fixed_sigma"

empirical.n_samples_aicc

Positive integer. Number of samples to consider in the AICc optimization. The default value is 1000. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.eval_max_aicc

Positive integer. Maximum number of iterations when optimizing the AICc. The default value is 20. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.start_aicc

Numeric. Start value of the sigma parameter when optimizing the AICc. The default value is 0.1. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.cov_mat

Numeric matrix. (Optional) The covariance matrix of the data generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.

gaussian.mu

Numeric vector. (Optional) Containing the mean of the data generating distribution. NULL means it is estimated from the x_train.

gaussian.cov_mat

Numeric matrix. (Optional) Containing the covariance matrix of the data generating distribution. NULL means it is estimated from the x_train.

timeseries.fixed_sigma

Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2.

timeseries.bounds

Numeric vector of length two. Specifies the lower and upper bounds of the timeseries. The default is c(NULL, NULL), i.e. no bounds. If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.

vaeac.depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

vaeac.lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

vaeac.activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

vaeac.n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best-performing one is picked after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2), and training continues with that one.

vaeac.epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes vaeac.extra_parameters$epochs_initiation_phase, where the default is 2.

vaeac.extra_parameters

Named list with extra parameters to the vaeac approach. See vaeac_get_extra_para_default() for description of possible additional parameters and their default values.

Details

This function explains a forecast of length horizon. The argument train_idx is analogous to x_train in explain(); however, it just contains the time indices of where in the data the forecast should start for each training sample. In the same way, explain_idx defines the time index (indices) which will precede a forecast to be explained.

As any autoregressive forecast model will require a set of lags to make a forecast at an arbitrary point in time, explain_y_lags and explain_xreg_lags define how many lags are required to "refit" the model at any given time index. This allows the different approaches to work in the same way they do for time-invariant models.

See the forecasting section of the general usage vignette for further details.

Value

Object of class c("shapr", "list"). Contains the following items:

shapley_values_est

data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not devoted to any of the features (given by the argument phi0).

shapley_values_sd

data.table with the standard deviations of the Shapley values, reflecting the uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.

internal

List with the different parameters, data, functions and other output used internally.

pred_explain

Numeric vector with the predictions for the explained observations

MSEv

List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage for details.

timing

List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.

Author(s)

Jon Lachmann, Martin Jullum

Examples

## Not run: 
# Load example data
data("airquality")
data <- data.table::as.data.table(airquality)

# Fit an AR(2) model.
model_ar_temp <- ar(data$Temp, order = 2)

# Calculate the zero prediction values for a three step forecast.
p0_ar <- rep(mean(data$Temp), 3)

# Empirical approach, explaining forecasts starting at T = 152 and T = 153.
explain_forecast(
  model = model_ar_temp,
  y = data[, "Temp"],
  train_idx = 2:151,
  explain_idx = 152:153,
  explain_y_lags = 2,
  horizon = 3,
  approach = "empirical",
  phi0 = p0_ar,
  group_lags = FALSE
)

## End(Not run)

Gets the default values for the extra estimation arguments

Description

Gets the default values for the extra estimation arguments

Usage

get_extra_comp_args_default(
  internal,
  paired_shap_sampling = isFALSE(internal$parameters$asymmetric),
  kernelSHAP_reweighting = "on_all_cond",
  compute_sd = isFALSE(internal$parameters$exact),
  n_boot_samps = 100,
  max_batch_size = 10,
  min_n_batches = 10
)

Arguments

internal

List. Not used directly, but passed through from explain().

paired_shap_sampling

Logical. If TRUE, paired versions of all sampled coalitions are also included in the computation. That is, if there are 5 features and, e.g., the coalition (1, 3, 5) is sampled, then its complement (2, 4) is also used when computing the Shapley values. This is done to reduce the variance of the Shapley value estimates. TRUE is the default and is recommended for highest accuracy. For asymmetric Shapley values (asymmetric = TRUE), FALSE is the default and the only legal value.

kernelSHAP_reweighting

String. How to reweight the sampling frequency weights in the kernelSHAP solution after sampling. The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates. The options are one of 'none', 'on_N', 'on_all', 'on_all_cond' (default). 'none' means no reweighting, i.e. the sampling frequency weights are used as is. 'on_N' means the sampling frequencies are averaged over all coalitions with the same original sampling probabilities. 'on_all' means the original sampling probabilities are used for all coalitions. 'on_all_cond' means the original sampling probabilities are used for all coalitions, while adjusting for the probability that they are sampled at least once. 'on_all_cond' is preferred as it performs the best in simulation studies, see Olsen & Jullum (2024).

compute_sd

Logical. Whether to estimate the standard deviations of the Shapley value estimates. This is TRUE whenever sampling based kernelSHAP is applied (either iteratively or with a fixed number of coalitions).

n_boot_samps

Integer. The number of bootstrapped samples (i.e. samples with replacement) from the set of all coalitions used to estimate the standard deviations of the Shapley value estimates.

max_batch_size

Integer. The maximum number of coalitions to estimate simultaneously within each iteration. A larger number requires more memory, but may have a slight computational advantage.

min_n_batches

Integer. The minimum number of batches to split the computation into within each iteration. Larger numbers give more frequent progress updates. If parallelization is applied, this should be set no smaller than the number of parallel workers.
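
These arguments are not set by calling get_extra_comp_args_default() directly, but are passed as a named list via extra_computation_args in explain(). A sketch (the values are illustrative):

explain_custom_comp <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  extra_computation_args = list(
    kernelSHAP_reweighting = "on_all_cond",
    n_boot_samps = 200
  )
)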

Author(s)

Martin Jullum


Function to specify arguments of the iterative estimation procedure

Description

Function to specify arguments of the iterative estimation procedure

Usage

get_iterative_args_default(
  internal,
  initial_n_coalitions = ceiling(min(200, max(5, internal$parameters$n_features,
    (2^internal$parameters$n_features)/10), internal$parameters$max_n_coalitions)),
  fixed_n_coalitions_per_iter = NULL,
  max_iter = 20,
  convergence_tol = 0.02,
  n_coal_next_iter_factor_vec = c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10))
)

Arguments

internal

List. Not used directly, but passed through from explain().

initial_n_coalitions

Integer. Number of coalitions to use in the first estimation iteration.

fixed_n_coalitions_per_iter

Integer. The number of coalitions to use in each iteration. NULL (default) means it is set adaptively, based on the estimated number of coalitions required to reach the convergence threshold.

max_iter

Integer. Maximum number of estimation iterations

convergence_tol

Numeric. The t variable in the convergence threshold formula on page 6 in the paper Covert and Lee (2021), 'Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression' https://arxiv.org/pdf/2012.01536. Smaller values requires more coalitions before convergence is reached.

n_coal_next_iter_factor_vec

Numeric vector. In each iteration, the number of coalitions required to reach convergence in the next iteration is estimated. The number of coalitions actually used in the next iteration is set to this estimate multiplied by n_coal_next_iter_factor_vec[i] for iteration i. It is wise to start with smaller factors to avoid using too many coalitions due to uncertain estimates in the first iterations.

Details

The function sets default values for the iterative estimation procedure, according to the function defaults. If the argument iterative of explain() is FALSE, it sets parameters corresponding to the use of a non-iterative estimation procedure.
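
In practice, these arguments are passed as a named list via iterative_args in explain(), as in this sketch (the values are illustrative):

explain_iter_custom <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  iterative = TRUE,
  iterative_args = list(
    initial_n_coalitions = 10,
    convergence_tol = 0.01,
    max_iter = 30
  )
)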

Author(s)

Martin Jullum


Gets the default values for the output arguments

Description

Gets the default values for the output arguments

Usage

get_output_args_default(
  keep_samp_for_vS = FALSE,
  MSEv_uniform_comb_weights = TRUE,
  saving_path = tempfile("shapr_obj_", fileext = ".rds")
)

Arguments

keep_samp_for_vS

Logical. Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in internal$output). Not used for approach="regression_separate" or approach="regression_surrogate".

MSEv_uniform_comb_weights

Logical. If TRUE (default), then the function weights the coalitions uniformly when computing the MSEv criterion. If FALSE, then the function uses the Shapley kernel weights to weight the coalitions when computing the MSEv criterion. Note that the Shapley kernel weights are replaced by the sampling frequency when not all coalitions are considered.

saving_path

String. The file path where intermediate results of the iterative estimation procedure should be saved. Defaults to a temporary file in the session's temporary directory.
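
These arguments are likewise passed as a named list via output_args in explain(). A sketch (the file path is illustrative):

explain_custom_output <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  output_args = list(
    keep_samp_for_vS = TRUE,
    saving_path = file.path(tempdir(), "my_shapr_obj.rds")
  )
)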

Author(s)

Martin Jullum


Gets the implemented approaches

Description

Gets the implemented approaches

Usage

get_supported_approaches()

Value

Character vector. The names of the implemented approaches that can be passed to argument approach in explain().


Provides a data.table with the supported models

Description

Provides a data.table with the supported models

Usage

get_supported_models()
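
Both helper functions take no arguments and can be called directly to inspect what is available:

# Character vector of approaches accepted by the approach argument of explain()
get_supported_approaches()

# data.table of the model classes that explain() supports natively
get_supported_models()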

Plots of the MSEv Evaluation Criterion

Description

Make plots to visualize and compare the MSEv evaluation criterion for a list of explain() objects applied to the same data and model. The function creates bar plots and line plots with points illustrating the overall MSEv evaluation criterion, as well as the criterion computed per observation/explicand and per coalition, obtained by averaging only over the coalitions and only over the observations/explicands, respectively.

Usage

plot_MSEv_eval_crit(
  explanation_list,
  index_x_explain = NULL,
  id_coalition = NULL,
  CI_level = if (length(explanation_list[[1]]$pred_explain) < 20) NULL else 0.95,
  geom_col_width = 0.9,
  plot_type = "overall"
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model. If the entries in the list are named, then the function uses these names. Otherwise, it defaults to the approach names (with an integer suffix for duplicates) for the explanation objects in explanation_list.

index_x_explain

Integer vector. Which of the test observations to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations by setting index_x_explain = 1:5.

id_coalition

Integer vector. Which of the coalitions to plot. E.g. if you used n_coalitions = 16 in explain(), you can generate a plot for the first 5 coalitions and the 10th by setting id_coalition = c(1:5, 10).

CI_level

Positive numeric between zero and one. Default is 0.95 if the number of observations to explain is larger than 20, otherwise CI_level = NULL, which removes the confidence intervals. This is the level of the approximate confidence intervals for the overall MSEv and the coalition-wise MSEv. The confidence intervals rely on the MSEv scores being means over the observations/explicands, and on means being approximately normally distributed. Since the standard deviations are estimated, we use the quantile t from the t-distribution with N_explicands - 1 degrees of freedom corresponding to the provided level, where N_explicands is the number of observations/explicands. The intervals are MSEv +/- t * SD(MSEv) / sqrt(N_explicands). Note that the explain() function already scales the standard deviation by 1/sqrt(N_explicands); thus, the intervals are MSEv +/- t * MSEv_sd, where the values MSEv and MSEv_sd are extracted from the MSEv data.tables in the objects in the explanation_list.

geom_col_width

Numeric. Bar width. By default, set to 90% of the ggplot2::resolution() of the data.

plot_type

Character vector. The possible options are "overall" (default), "comb", and "explicand". If plot_type = "overall", then the plot (one bar plot) associated with the overall MSEv evaluation criterion for each method is created, i.e., when averaging over both the coalitions and observations/explicands. If plot_type = "comb", then the plots (one line plot and one bar plot) associated with the MSEv evaluation criterion for each coalition are created, i.e., when we only average over the observations/explicands. If plot_type = "explicand", then the plots (one line plot and one bar plot) associated with the MSEv evaluation criterion for each observations/explicands are created, i.e., when we only average over the coalitions. If plot_type is a vector of one or several of "overall", "comb", and "explicand", then the associated plots are created.

Value

Either a single ggplot2::ggplot() object of the MSEv criterion when plot_type = "overall", or a list of ggplot2::ggplot() objects based on the plot_type parameter.

Author(s)

Lars Henry Berge Olsen

Examples

## Not run: 
# Load necessary libraries
library(xgboost)
library(data.table)
library(shapr)
library(ggplot2)

# Get the data
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

# Define the features and the response
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# Split data into test and training data set
ind_x_explain <- 1:25
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nround = 20,
  verbose = FALSE
)

# Specifying the phi_0, i.e. the expected prediction without any features
phi0 <- mean(y_train)

# Independence approach
explanation_independence <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "independence",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Gaussian 1e1 approach
explanation_gaussian_1e1 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = phi0,
  n_MC_samples = 1e1
)

# Gaussian 1e2 approach
explanation_gaussian_1e2 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# ctree approach
explanation_ctree <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "ctree",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Combined approach
explanation_combined <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = c("gaussian", "independence", "ctree"),
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Create a list of explanations with names
explanation_list_named <- list(
  "Ind." = explanation_independence,
  "Gaus. 1e1" = explanation_gaussian_1e1,
  "Gaus. 1e2" = explanation_gaussian_1e2,
  "Ctree" = explanation_ctree,
  "Combined" = explanation_combined
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Create the default MSEv plot where we average over both the coalitions and observations
  # with approximate 95% confidence intervals
  plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = "overall")

  # Can also create plots of the MSEv criterion averaged only over the coalitions or observations.
  MSEv_figures <- plot_MSEv_eval_crit(explanation_list_named,
    CI_level = 0.95,
    plot_type = c("overall", "comb", "explicand")
  )
  MSEv_figures$MSEv_bar
  MSEv_figures$MSEv_coalition_bar
  MSEv_figures$MSEv_explicand_bar

  # When there are many coalitions or observations, then it can be easier to look at line plots
  MSEv_figures$MSEv_coalition_line_point
  MSEv_figures$MSEv_explicand_line_point

  # We can specify which observations or coalitions to plot
  plot_MSEv_eval_crit(explanation_list_named,
    plot_type = "explicand",
    index_x_explain = c(1, 3:4, 6),
    CI_level = 0.95
  )$MSEv_explicand_bar
  plot_MSEv_eval_crit(explanation_list_named,
    plot_type = "comb",
    id_coalition = c(3, 4, 9, 13:15),
    CI_level = 0.95
  )$MSEv_coalition_bar

  # We can alter the figures if other palette schemes or designs are wanted
  bar_text_n_decimals <- 1
  MSEv_figures$MSEv_bar +
    ggplot2::scale_x_discrete(limits = rev(levels(MSEv_figures$MSEv_bar$data$Method))) +
    ggplot2::coord_flip() +
    ggplot2::scale_fill_discrete() + # Default ggplot2 palette
    ggplot2::theme_minimal() + # This must be set before the other theme call
    ggplot2::theme(
      plot.title = ggplot2::element_text(size = 10),
      legend.position = "bottom"
    ) +
    ggplot2::guides(fill = ggplot2::guide_legend(nrow = 1, ncol = 6)) +
    ggplot2::geom_text(
      ggplot2::aes(label = sprintf(
        paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""),
        round(MSEv, bar_text_n_decimals)
      )),
      vjust = -1.1, # This value must be altered based on the plot dimension
      hjust = 1.1, # This value must be altered based on the plot dimension
      color = "black",
      position = ggplot2::position_dodge(0.9),
      size = 5
    )
}

## End(Not run)

Shapley value bar plots for several explanation objects

Description

Make plots to visualize and compare the estimated Shapley values for a list of explain() objects applied to the same data and model. For group-wise Shapley values, the feature values plotted are the mean feature values for all features in each group.

Usage

plot_SV_several_approaches(
  explanation_list,
  index_explicands = NULL,
  index_explicands_sort = FALSE,
  only_these_features = NULL,
  plot_phi0 = FALSE,
  digits = 4,
  add_zero_line = FALSE,
  axis_labels_n_dodge = NULL,
  axis_labels_rotate_angle = NULL,
  horizontal_bars = TRUE,
  facet_scales = "free",
  facet_ncol = 2,
  geom_col_width = 0.85,
  brewer_palette = NULL,
  include_group_feature_means = FALSE
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model. If the entries in the list are named, then the function uses these names. Otherwise, they default to the approach names (with integer suffix for duplicates) for the explanation objects in explanation_list.

index_explicands

Integer vector. Which of the explicands (test observations) to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations/explicands and the 10th by setting index_explicands = c(1:5, 10). The argument index_explicands_sort must be FALSE to plot the explicands in the order specified in index_explicands.

index_explicands_sort

Boolean. If FALSE (default), then shapr plots the explicands in the order specified in index_explicands. If TRUE, then shapr sorts the indices in increasing order based on their id.

only_these_features

String vector containing the names of the features to include in the bar plots.

plot_phi0

Boolean. Whether to include phi0 in the bar plots.

digits

Integer. Number of significant digits to use in the feature description.

add_zero_line

Boolean. Whether to add a black line at a feature contribution of 0.

axis_labels_n_dodge

Integer. The number of rows that should be used to render the labels. This is useful for displaying labels that would otherwise overlap.

axis_labels_rotate_angle

Numeric. The angle of the axis label, where 0 means horizontal, 45 means tilted, and 90 means vertical. Compared to setting the angle in ggplot2::theme() / ggplot2::element_text(), this also uses some heuristics to automatically pick the hjust and vjust that you probably want.

horizontal_bars

Boolean. Flip Cartesian coordinates so that horizontal becomes vertical, and vertical, horizontal. This is primarily useful for converting geoms and statistics which display y conditional on x, to x conditional on y. See ggplot2::coord_flip().

facet_scales

Should scales be free ("free", the default), fixed ("fixed"), or free in one dimension ("free_x", "free_y")? The user has to change the latter manually depending on the value of horizontal_bars.

facet_ncol

Integer. The number of columns in the facet grid. Default is facet_ncol = 2.

geom_col_width

Numeric. Bar width. By default, set to 85% of the ggplot2::resolution() of the data.

brewer_palette

String. Name of one of the color palettes from RColorBrewer::RColorBrewer(). If NULL, then the function uses the default ggplot2::ggplot() color scheme. The following palettes are available for use with these scales:

Diverging

BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral

Qualitative

Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3

Sequential

Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd

include_group_feature_means

Logical. Whether to include the average feature value in a group on the y-axis or not. If FALSE (default), then no value is shown for the groups. If TRUE, then shapr includes the mean of the features in each group.

Value

A ggplot2::ggplot() object.

Author(s)

Lars Henry Berge Olsen

Examples

## Not run: 
# Load necessary libraries
library(xgboost)
library(data.table)

# Get the data
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

# Define the features and the response
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# Split data into test and training data set
ind_x_explain <- 1:12
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

# Specifying the phi_0, i.e. the expected prediction without any features
phi0 <- mean(y_train)

# Independence approach
explanation_independence <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "independence",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Empirical approach
explanation_empirical <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Gaussian approach with 1e1 Monte Carlo samples
explanation_gaussian_1e1 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = phi0,
  n_MC_samples = 1e1
)

# Gaussian approach with 1e2 Monte Carlo samples
explanation_gaussian_1e2 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Combined approach
explanation_combined <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = c("gaussian", "ctree", "empirical"),
  phi0 = phi0,
  n_MC_samples = 1e2
)

# Create a list of explanations with names
explanation_list <- list(
  "Ind." = explanation_independence,
  "Emp." = explanation_empirical,
  "Gaus. 1e1" = explanation_gaussian_1e1,
  "Gaus. 1e2" = explanation_gaussian_1e2,
  "Combined" = explanation_combined
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # The function uses the provided names.
  plot_SV_several_approaches(explanation_list)

  # We can change the number of columns in the grid of plots and add other visual alterations
  plot_SV_several_approaches(explanation_list,
    facet_ncol = 3,
    facet_scales = "free_y",
    add_zero_line = TRUE,
    digits = 2,
    brewer_palette = "Paired",
    geom_col_width = 0.6
  ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "bottom", plot.title = ggplot2::element_text(size = 0))


  # We can specify which explicands to plot to get less chaotic plots and make the bars vertical
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    horizontal_bars = FALSE,
    axis_labels_rotate_angle = 45
  )

  # We can change the order of the features by specifying the
  # order using the `only_these_features` parameter.
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    only_these_features = c("Temp", "Solar.R", "Month", "Wind")
  )

  # We can also remove certain features if we are not interested in them
  # or want to focus on, e.g., two features. The function will give a
  # message if the user specifies invalid feature names.
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    only_these_features = c("Temp", "Solar.R"),
    plot_phi0 = TRUE
  )
}

## End(Not run)

Plot the training VLB and validation IWAE for vaeac models

Description

This function makes (ggplot2::ggplot()) figures of the training VLB and the validation IWAE for a list of explain() objects with approach = "vaeac". See setup_approach() for more information about the vaeac approach. Two figures are returned by the function. In the first figure, each object in explanation_list gets its own facet, while in the second figure, we plot the criteria in each facet for all objects.

Usage

plot_vaeac_eval_crit(
  explanation_list,
  plot_from_nth_epoch = 1,
  plot_every_nth_epoch = 1,
  criteria = c("VLB", "IWAE"),
  plot_type = c("method", "criterion"),
  facet_wrap_scales = "fixed",
  facet_wrap_ncol = NULL
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model, all with approach = "vaeac". If the entries in the list are named, then the function uses these names. Otherwise, it defaults to the approach names (with integer suffix for duplicates) for the explanation objects in explanation_list.

plot_from_nth_epoch

Integer. Only plot the results from the nth epoch onwards. The criterion values in the first epochs can be large in absolute value and make the rest of the plot difficult to interpret.

plot_every_nth_epoch

Integer. Only plot every nth epoch. Useful to illustrate the overall trend, as there can be a lot of fluctuation and oscillation in the values between consecutive epochs.

criteria

Character vector. The possible options are "VLB", "IWAE", "IWAE_running". Default is the first two.

plot_type

Character vector. The possible options are "method" and "criterion". Default is to plot both.

facet_wrap_scales

String. Should the scales be fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y")?

facet_wrap_ncol

Integer. Number of columns in the facet wrap.

Details

See Olsen et al. (2022) or the blog post for a summary of the VLB and IWAE.

Value

Either a single ggplot2::ggplot() object or a list of ggplot2::ggplot() objects based on the plot_type parameter.

Author(s)

Lars Henry Berge Olsen

Examples

## Not run: 
library(xgboost)
library(data.table)
library(shapr)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Fitting a basic xgboost model to the training data
model <- xgboost(data = as.matrix(x_train), label = y_train, nrounds = 100, verbose = FALSE)

# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)

# Train vaeac with and without paired sampling
explanation_paired <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = p0,
  n_MC_samples = 1, # As we are only interested in the training of the vaeac
  vaeac.epochs = 10, # Should be higher in applications.
  vaeac.n_vaeacs_initialize = 1,
  vaeac.width = 16,
  vaeac.depth = 2,
  vaeac.extra_parameters = list(vaeac.paired_sampling = TRUE)
)

explanation_regular <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = p0,
  n_MC_samples = 1, # As we are only interested in the training of the vaeac
  vaeac.epochs = 10, # Should be higher in applications.
  vaeac.width = 16,
  vaeac.depth = 2,
  vaeac.n_vaeacs_initialize = 1,
  vaeac.extra_parameters = list(vaeac.paired_sampling = FALSE)
)

# Collect the explanation objects in a named list
explanation_list <- list(
  "Regular sampling" = explanation_regular,
  "Paired sampling" = explanation_paired
)

# Call the function with the named list, will use the provided names
plot_vaeac_eval_crit(explanation_list = explanation_list)

# The function also works if we have only one method,
# but then one should only look at the method plot.
plot_vaeac_eval_crit(
  explanation_list = explanation_list[2],
  plot_type = "method"
)

# Can alter the plot
plot_vaeac_eval_crit(
  explanation_list = explanation_list,
  plot_from_nth_epoch = 2,
  plot_every_nth_epoch = 2,
  facet_wrap_scales = "free"
)

# If we only want the VLB
plot_vaeac_eval_crit(
  explanation_list = explanation_list,
  criteria = "VLB",
  plot_type = "criterion"
)

# If we only want the criterion version
tmp_fig_criterion <-
  plot_vaeac_eval_crit(explanation_list = explanation_list, plot_type = "criterion")

# Since tmp_fig_criterion is a ggplot2 object, we can alter it
# by, e.g., adding points or smooths with se bands
tmp_fig_criterion + ggplot2::geom_point(shape = "circle", size = 1, ggplot2::aes(col = Method))
tmp_fig_criterion$layers[[1]] <- NULL
tmp_fig_criterion + ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
  ggplot2::scale_color_brewer(palette = "Set1") +
  ggplot2::theme_minimal()

## End(Not run)

Plot Pairwise Plots for Imputed and True Data

Description

A function that creates a matrix of plots (GGally::ggpairs()) from generated imputations from the unconditioned distribution p(x) estimated by a vaeac model, and then compares the imputed values with data from the true distribution (if provided). See GGally::ggpairs() and the corresponding GGally vignette for an introduction.

Usage

plot_vaeac_imputed_ggpairs(
  explanation,
  which_vaeac_model = "best",
  x_true = NULL,
  add_title = TRUE,
  alpha = 0.5,
  upper_cont = c("cor", "points", "smooth", "smooth_loess", "density", "blank"),
  upper_cat = c("count", "cross", "ratio", "facetbar", "blank"),
  upper_mix = c("box", "box_no_facet", "dot", "dot_no_facet", "facethist",
    "facetdensity", "denstrip", "blank"),
  lower_cont = c("points", "smooth", "smooth_loess", "density", "cor", "blank"),
  lower_cat = c("facetbar", "ratio", "count", "cross", "blank"),
  lower_mix = c("facetdensity", "box", "box_no_facet", "dot", "dot_no_facet",
    "facethist", "denstrip", "blank"),
  diag_cont = c("densityDiag", "barDiag", "blankDiag"),
  diag_cat = c("barDiag", "blankDiag"),
  cor_method = c("pearson", "kendall", "spearman")
)

Arguments

explanation

Shapr list. The output list from the explain() function.

which_vaeac_model

String. Indicating which vaeac model to use when generating the samples. Possible options are always 'best', 'best_running', and 'last'. All possible options can be obtained by calling names(explanation$internal$parameters$vaeac$models).

x_true

Data.table containing the data from the distribution that the vaeac model is fitted to.

add_title

Logical. If TRUE, then a title is added to the plot based on the internal description of the vaeac model specified in which_vaeac_model.

alpha

Numeric between 0 and 1 (default is 0.5). The degree of color transparency.

upper_cont

String. Type of plot to use in upper triangle for continuous features, see GGally::ggpairs(). Possible options are: 'cor' (default), 'points', 'smooth', 'smooth_loess', 'density', and 'blank'.

upper_cat

String. Type of plot to use in upper triangle for categorical features, see GGally::ggpairs(). Possible options are: 'count' (default), 'cross', 'ratio', 'facetbar', and 'blank'.

upper_mix

String. Type of plot to use in upper triangle for mixed features, see GGally::ggpairs(). Possible options are: 'box' (default), 'box_no_facet', 'dot', 'dot_no_facet', 'facethist', 'facetdensity', 'denstrip', and 'blank'

lower_cont

String. Type of plot to use in lower triangle for continuous features, see GGally::ggpairs(). Possible options are: 'points' (default), 'smooth', 'smooth_loess', 'density', 'cor', and 'blank'.

lower_cat

String. Type of plot to use in lower triangle for categorical features, see GGally::ggpairs(). Possible options are: 'facetbar' (default), 'ratio', 'count', 'cross', and 'blank'.

lower_mix

String. Type of plot to use in lower triangle for mixed features, see GGally::ggpairs(). Possible options are: 'facetdensity' (default), 'box', 'box_no_facet', 'dot', 'dot_no_facet', 'facethist', 'denstrip', and 'blank'.

diag_cont

String. Type of plot to use on the diagonal for continuous features, see GGally::ggpairs(). Possible options are: 'densityDiag' (default), 'barDiag', and 'blankDiag'.

diag_cat

String. Type of plot to use on the diagonal for categorical features, see GGally::ggpairs(). Possible options are: 'barDiag' (default) and 'blankDiag'.

cor_method

String. Type of correlation measure, see GGally::ggpairs(). Possible options are: 'pearson' (default), 'kendall', and 'spearman'.

Value

A GGally::ggpairs() figure.

Author(s)

Lars Henry Berge Olsen

Examples

## Not run: 
library(xgboost)
library(data.table)
library(shapr)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Fitting a basic xgboost model to the training data
model <- xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 100,
  verbose = FALSE
)

explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = mean(y_train),
  n_MC_samples = 1,
  vaeac.epochs = 10,
  vaeac.n_vaeacs_initialize = 1
)

# Plot the results
figure <- plot_vaeac_imputed_ggpairs(
  explanation = explanation,
  which_vaeac_model = "best",
  x_true = x_train,
  add_title = TRUE
)
figure

# Note that this is a ggplot2 object which we can alter, e.g., we can change the colors.
figure +
  ggplot2::scale_color_manual(values = c("#E69F00", "#999999")) +
  ggplot2::scale_fill_manual(values = c("#E69F00", "#999999"))

## End(Not run)

Plot of the Shapley value explanations

Description

Plots the individual prediction explanations.

Usage

## S3 method for class 'shapr'
plot(
  x,
  plot_type = "bar",
  digits = 3,
  index_x_explain = NULL,
  top_k_features = NULL,
  col = NULL,
  bar_plot_phi0 = TRUE,
  bar_plot_order = "largest_first",
  scatter_features = NULL,
  scatter_hist = TRUE,
  include_group_feature_means = FALSE,
  beeswarm_cex = 1/length(index_x_explain)^(1/4),
  ...
)

Arguments

x

A shapr object. The output from explain().

plot_type

Character. Specifies the type of plot to produce. "bar" (the default) gives a regular horizontal bar plot of the Shapley value magnitudes. "waterfall" gives a waterfall plot indicating the changes in the prediction score due to each feature's contribution (their Shapley values). "scatter" plots the feature values on the x-axis and Shapley values on the y-axis, as well as (optionally) a background scatter_hist showing the distribution of the feature data. "beeswarm" summarizes the distribution of the Shapley values along the x-axis for all the features. Each point gives the Shapley value of a given instance, where the points are colored by the feature value of that instance.

digits

Integer. Number of significant digits to use in the feature description. Applicable for plot_type "bar" and "waterfall".

index_x_explain

Integer vector. Which of the test observations to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations by setting index_x_explain = 1:5.

top_k_features

Integer. How many features to include in the plot. E.g. if you have 15 features in your model, you can plot the 5 most important features for each explanation by setting top_k_features = 5. Applicable for plot_type "bar" and "waterfall".

col

Character vector (where length depends on plot type). The color codes (hex codes or other names understood by ggplot2::ggplot()) for positive and negative Shapley values, respectively. The default is col=NULL, plotting with the default colors respective to the plot type. For plot_type = "bar" and plot_type = "waterfall", the default is c("#00BA38","#F8766D"). For plot_type = "beeswarm", the default is c("#F8766D","yellow","#00BA38"). For plot_type = "scatter", the default is "#619CFF".

If you want to alter the colors in the plot, the length of the col vector depends on plot type. For plot_type = "bar" or plot_type = "waterfall", two colors should be provided, first for positive and then for negative Shapley values. For plot_type = "beeswarm", either two or three colors can be given. If two colors are given, then the first color determines the color that points with high feature values will have, and the second determines the color of points with low feature values. If three colors are given, then the first colors high feature values, the second colors mid-range feature values, and the third colors low feature values. For instance, col = c("red", "yellow", "blue") will make high values red, mid-range values yellow, and low values blue. For plot_type = "scatter", a single color is to be given, which determines the color of the points on the scatter plot.

bar_plot_phi0

Logical. Whether to include phi0 in the plot for plot_type = "bar".

bar_plot_order

Character. Specifies in what order to plot the features with respect to the magnitude of the Shapley values with plot_type = "bar": "largest_first" (the default) plots the features ordered from largest to smallest absolute Shapley value. "smallest_first" plots the features ordered from smallest to largest absolute Shapley value. "original" plots the features in the original order of the data table.

scatter_features

Integer or character vector. Only used for plot_type = "scatter". Specifies what features to include in (scatter) plot. Can be a numerical vector indicating feature index, or a character vector, indicating the name(s) of the feature(s) to plot.

scatter_hist

Logical. Only used for plot_type = "scatter". Whether to include a scatter_hist indicating the distribution of the data when making the scatter plot. Note that the bins are scaled so that when all the bins are stacked they fit the span of the y-axis of the plot.

include_group_feature_means

Logical. Whether to include the average feature value in a group on the y-axis or not. If FALSE (default), then no value is shown for the groups. If TRUE, then shapr includes the mean of the features in each group.

beeswarm_cex

Numeric. The cex argument of ggbeeswarm::geom_beeswarm(), controlling the spacing in the beeswarm plots.

...

Other arguments passed to underlying functions, like ggbeeswarm::geom_beeswarm() for plot_type = "beeswarm".

Details

See the examples below, or vignette("general_usage", package = "shapr"), for examples of how to use the function.

Value

A ggplot object with plots of the Shapley value explanations.

Author(s)

Martin Jullum, Vilde Ung, Lars Henry Berge Olsen

Examples

## Not run: 
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# Split data into test- and training data
data_train <- head(airquality, -50)
data_explain <- tail(airquality, 50)

x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]

# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)

# Explain predictions
p <- mean(data_train[, y_var])

# Empirical approach
x <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)

if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) {
  # The default plotting option is a bar plot of the Shapley values
  # We draw bar plots for the first 4 observations
  plot(x, index_x_explain = 1:4)

  # We can also make waterfall plots
  plot(x, plot_type = "waterfall", index_x_explain = 1:4)
  # And only showing the 2 features with largest contribution
  plot(x, plot_type = "waterfall", index_x_explain = 1:4, top_k_features = 2)

  # Or scatter plots showing the distribution of the shapley values and feature values
  plot(x, plot_type = "scatter")
  # And only for a specific feature
  plot(x, plot_type = "scatter", scatter_features = "Temp")

  # Or a beeswarm plot summarising the Shapley values and feature values for all features
  plot(x, plot_type = "beeswarm")
  plot(x, plot_type = "beeswarm", col = c("red", "black")) # we can change colors

  # Additional arguments can be passed to ggbeeswarm::geom_beeswarm() using the '...' argument.
  # For instance, sometimes the beeswarm plots overlap too much.
  # This can be fixed with the corral = "wrap" argument.
  # See ?ggbeeswarm::geom_beeswarm for more information.
  plot(x, plot_type = "beeswarm", corral = "wrap")
}

# Example of scatter and beeswarm plot with factor variables
airquality$Month_factor <- as.factor(month.abb[airquality$Month])
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month_factor")
y_var <- "Ozone"

# Split data into test- and training data
data_train <- airquality
data_explain <- tail(airquality, 50)

x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]

# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)

# Explain predictions
p <- mean(data_train[, y_var])

# Empirical approach
x <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "ctree",
  phi0 = p,
  n_MC_samples = 1e2
)

if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) {
  plot(x, plot_type = "scatter")
  plot(x, plot_type = "beeswarm")
}

## End(Not run)

Print method for shapr objects

Description

Print method for shapr objects

Usage

## S3 method for class 'shapr'
print(x, digits = 4, ...)

Arguments

x

A shapr object

digits

Scalar integer. Number of digits to display to the console.

...

Unused
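
Examples

A minimal usage sketch (assuming expl is the output of an earlier call to explain()):

## Not run: 
# Assuming `expl` is the output of an earlier explain() call,
# print the Shapley value estimates using fewer digits
print(expl, digits = 2)

# Typing the object name at the console invokes the same method with the defaults
expl

## End(Not run)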


Function to specify the extra parameters in the vaeac model

Description

In this function, we specify the default values for the extra parameters used in explain() for approach = "vaeac".

Usage

vaeac_get_extra_para_default(
  vaeac.model_description = make.names(Sys.time()),
  vaeac.folder_to_save_model = tempdir(),
  vaeac.pretrained_vaeac_model = NULL,
  vaeac.cuda = FALSE,
  vaeac.epochs_initiation_phase = 2,
  vaeac.epochs_early_stopping = NULL,
  vaeac.save_every_nth_epoch = NULL,
  vaeac.val_ratio = 0.25,
  vaeac.val_iwae_n_samples = 25,
  vaeac.batch_size = 64,
  vaeac.batch_size_sampling = NULL,
  vaeac.running_avg_n_values = 5,
  vaeac.skip_conn_layer = TRUE,
  vaeac.skip_conn_masked_enc_dec = TRUE,
  vaeac.batch_normalization = FALSE,
  vaeac.paired_sampling = TRUE,
  vaeac.masking_ratio = 0.5,
  vaeac.mask_gen_coalitions = NULL,
  vaeac.mask_gen_coalitions_prob = NULL,
  vaeac.sigma_mu = 10000,
  vaeac.sigma_sigma = 1e-04,
  vaeac.sample_random = TRUE,
  vaeac.save_data = FALSE,
  vaeac.log_exp_cont_feat = FALSE,
  vaeac.which_vaeac_model = "best",
  vaeac.save_model = TRUE
)

Arguments

vaeac.model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

vaeac.folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE. Furthermore, the model cannot be moved from its original folder if we are to use the vaeac_train_model_continue() function to continue training the model.

vaeac.pretrained_vaeac_model

List or String (default is NULL). 1) Either a list of class vaeac, i.e., the list stored in explanation$internal$parameters$vaeac where explanation is the returned list from an earlier call to the explain() function. 2) A string containing the path to where the vaeac model is stored on disk, for example, explanation$internal$parameters$vaeac$models$best.

vaeac.cuda

Logical (default is FALSE). If TRUE, then the vaeac model will be trained using cuda/GPU. If torch::cuda_is_available() is FALSE, then we fall back to using the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular datasets often does not improve efficiency. See vignette("installation", package = "torch") for help with enabling GPU support (only Linux and Windows).

vaeac.epochs_initiation_phase

Positive integer (default is 2). The number of epochs to run each of the vaeac.n_vaeacs_initialize vaeac models before continuing to train only the best performing model.

vaeac.epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for vaeac.epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then vaeac.epochs in explain() should be set to a large number. If NULL, then shapr will internally set vaeac.epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

vaeac.save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every vaeac.save_every_nth_epoch-th epoch.

vaeac.val_ratio

Numeric (default is 0.25). Scalar between 0 and 1 indicating the ratio of instances from the input data which will be used as validation data. That is, vaeac.val_ratio = 0.25 means that 75% of the provided data is used as training data, while the remaining 25% is used as validation data.

vaeac.val_iwae_n_samples

Positive integer (default is 25). The number of generated samples used to compute the IWAE criterion when validating the vaeac model on the validation data.

vaeac.batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

vaeac.batch_size_sampling

Positive integer (default is NULL). The number of samples to include in each batch when generating the Monte Carlo samples. If NULL, then the function generates the Monte Carlo samples for the provided coalitions and all explicands sent to explain() at once. The number of coalitions is determined by the n_batches used by explain(). We recommend tweaking extra_computation_args$max_batch_size and extra_computation_args$min_n_batches rather than vaeac.batch_size_sampling. Larger batch sizes are often much faster, provided sufficient memory.

vaeac.running_avg_n_values

Positive integer (default is 5). The number of previous IWAE values to include when we compute the running means of the IWAE criterion.

vaeac.skip_conn_layer

Logical (default is TRUE). If TRUE, we apply identity skip connections in each layer, see skip_connection(). That is, we add the input X to the outcome of each hidden layer, so the output becomes X + activation(WX + b).

vaeac.skip_conn_masked_enc_dec

Logical (default is TRUE). If TRUE, we apply concatenate skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on.

vaeac.batch_normalization

Logical (default is FALSE). If TRUE, we apply batch normalization after the activation function. Note that if vaeac.skip_conn_layer = TRUE, then the normalization is applied after the inclusion of the skip connection. That is, we batch normalize the whole quantity X + activation(WX + b).

vaeac.paired_sampling

Logical (default is TRUE). If TRUE, we apply paired sampling to the training batches. That is, the training observations in each batch will be duplicated, where the first instance will be masked by the coalition S while the second instance will be masked by its complement S̄. This ensures that the training of the vaeac model becomes more stable as the model has access to the full version of each training observation. However, this will increase the training time due to more complex implementation and doubling the size of each batch. See paired_sampler() for more information.

vaeac.masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as all coalitions will be trained. vaeac.masking_ratio will be overruled if vaeac.mask_gen_coalitions is specified.

vaeac.mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^(n_features), and for group Shapley, i.e., when group is specified in explain().

vaeac.mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of vaeac.mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in vaeac.mask_gen_coalitions.

vaeac.sigma_mu

Numeric (default is 1e4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

vaeac.sigma_sigma

Numeric (default is 1e-4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

vaeac.sample_random

Logical (default is TRUE). If TRUE, the function generates random Monte Carlo samples from the inferred generative distributions. If FALSE, the function uses the most likely values, i.e., the mean and the class with the highest probability for continuous and categorical features, respectively.

vaeac.save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one is to continue training the model later using vaeac_train_model_continue().

vaeac.log_exp_cont_feat

Logical (default is FALSE). Whether to log-transform all continuous features before sending the data to vaeac(). The vaeac model creates unbounded Monte Carlo sample values. Thus, if the continuous features are strictly positive (as for, e.g., the Burr distribution and the Abalone data set), it can be advantageous to log-transform the data to unbounded form before using vaeac. If TRUE, then vaeac_postprocess_data() will take the exponential of the results to get back to strictly positive values when using the vaeac model to impute missing values/generate the Monte Carlo samples.

vaeac.which_vaeac_model

String (default is "best"). The name of the vaeac model (snapshots from different epochs) to use when generating the Monte Carlo samples. The standard choices are: "best" (epoch with lowest IWAE), "best_running" (epoch with lowest running IWAE, see vaeac.running_avg_n_values), and "last" (the last epoch). Note that additional choices are available if vaeac.save_every_nth_epoch is provided. For example, if vaeac.save_every_nth_epoch = 5, then vaeac.which_vaeac_model can also take the values "epoch_5", "epoch_10", "epoch_15", and so on.

vaeac.save_model

Boolean. If TRUE (default), the vaeac model is saved either in a base::tempdir() folder or in a user-specified location given by vaeac.folder_to_save_model. If FALSE, then both the model and the paths to it are deleted from the returned object from explain().

Details

The vaeac model consists of three neural networks (a full encoder, a masked encoder, and a decoder) based on the provided vaeac.depth and vaeac.width. The encoders map the full and masked input representations to latent representations, respectively, where the dimension is given by vaeac.latent_dim. The latent representations are sent to the decoder to go back to the real feature space and provide a samplable probabilistic representation, from which the Monte Carlo samples are generated. By default, we use the vaeac model at the epoch with the lowest validation error (IWAE), but other snapshots are available by setting the vaeac.which_vaeac_model parameter. See Olsen et al. (2022) for more details.

Value

Named list of the default values for the vaeac extra parameter arguments specified in this function call. Note that both vaeac.model_description and vaeac.folder_to_save_model will change with time and the R session.

Author(s)

Lars Henry Berge Olsen
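
Examples

A minimal sketch of how these extra parameters are typically inspected and overridden (the objects model, x_explain, x_train, and y_train are hypothetical and assumed to come from a setup like the examples above; the specific values shown are illustrative):

## Not run: 
# Inspect the named list of default values for the extra vaeac parameters
str(vaeac_get_extra_para_default())

# Override selected defaults through `vaeac.extra_parameters` in explain()
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = mean(y_train),
  vaeac.epochs = 10, # Should be higher in applications
  vaeac.extra_parameters = list(
    vaeac.paired_sampling = FALSE,
    vaeac.save_every_nth_epoch = 5,
    vaeac.which_vaeac_model = "epoch_5" # Use the snapshot saved after epoch 5
  )
)

## End(Not run)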


Continue to Train the vaeac Model

Description

Function that loads a previously trained vaeac model and continues the training, either on new data or on the same dataset as it was trained on before. If we are given a new dataset, then we assume that the new dataset has the same distribution and one_hot_max_sizes as the original dataset.

Usage

vaeac_train_model_continue(
  explanation,
  epochs_new,
  lr_new = NULL,
  x_train = NULL,
  save_data = FALSE,
  verbose = NULL,
  seed = 1
)

Arguments

explanation

An explain() object where approach = "vaeac" was used.

epochs_new

Positive integer. The number of extra epochs to conduct.

lr_new

Positive numeric. If provided, overwrites the old learning rate used in the Adam optimizer.

x_train

A data.table containing the training data. Categorical data must have class names 1, 2, ..., K.

save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one is to continue training the model later using vaeac_train_model_continue().

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation which is being performed. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of the strings can be used. E.g. verbose = c("basic", "vS_details") will display basic information as well as details about the v(S) estimation process.

seed

Positive integer (default is 1). Seed for reproducibility. Specifies the seed before any randomness-based code is run.

Value

A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.

Author(s)

Lars Henry Berge Olsen
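
Examples

A minimal sketch, assuming explanation is the output of an earlier explain() call with approach = "vaeac" (the epoch count and learning rate below are illustrative values):

## Not run: 
# Continue training the vaeac model in `explanation` for 25 extra epochs,
# overwriting the learning rate of the Adam optimizer
vaeac_training_continued <- vaeac_train_model_continue(
  explanation = explanation,
  epochs_new = 25,
  lr_new = 0.001
)

## End(Not run)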
