Title: | Prediction Explanation with Dependence-Aware Shapley Values |
---|---|
Description: | Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values is the only method for such prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating the Shapley values do, however, assume feature independence. This package implements methods which accounts for any feature dependence, and thereby produces more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository. |
Authors: | Martin Jullum [cre, aut] , Lars Henry Berge Olsen [aut] , Annabelle Redelmeier [aut], Jon Lachmann [aut] , Nikolai Sellereite [aut] , Anders Løland [ctb], Jens Christian Wahl [ctb], Camilla Lingjærde [ctb], Norsk Regnesentral [cph, fnd] |
Maintainer: | Martin Jullum <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1.9000 |
Built: | 2025-01-22 10:22:59 UTC |
Source: | https://github.com/norskregnesentral/shapr |
Computes dependence-aware Shapley values for observations in x_explain
from the specified
model
by using the method specified in approach
to estimate the conditional expectation.
See Aas et al. (2021)
for a thorough introduction to dependence-aware prediction explanation with Shapley values.
explain( model, x_explain, x_train, approach, phi0, iterative = NULL, max_n_coalitions = NULL, group = NULL, n_MC_samples = 1000, seed = 1, verbose = "basic", predict_model = NULL, get_model_specs = NULL, prev_shapr_object = NULL, asymmetric = FALSE, causal_ordering = NULL, confounding = NULL, extra_computation_args = list(), iterative_args = list(), output_args = list(), ... )
explain( model, x_explain, x_train, approach, phi0, iterative = NULL, max_n_coalitions = NULL, group = NULL, n_MC_samples = 1000, seed = 1, verbose = "basic", predict_model = NULL, get_model_specs = NULL, prev_shapr_object = NULL, asymmetric = FALSE, causal_ordering = NULL, confounding = NULL, extra_computation_args = list(), iterative_args = list(), output_args = list(), ... )
model |
Model object.
Specifies the model whose predictions we want to explain.
Run |
x_explain |
Matrix or data.frame/data.table. Contains the the features, whose predictions ought to be explained. |
x_train |
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
approach |
Character vector of length |
phi0 |
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable. |
iterative |
Logical or NULL
If |
max_n_coalitions |
Integer.
The upper limit on the number of unique feature/group coalitions to use in the iterative procedure
(if |
group |
List.
If |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
seed |
Positive integer.
Specifies the seed before any randomness based code is being run.
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings
|
predict_model |
Function.
The prediction function used when |
get_model_specs |
Function.
An optional function for checking model/data consistency when
If |
prev_shapr_object |
|
asymmetric |
Logical.
Not applicable for (regular) non-causal or asymmetric explanations.
If |
causal_ordering |
List.
Not applicable for (regular) non-causal or asymmetric explanations.
|
confounding |
Logical vector.
Not applicable for (regular) non-causal or asymmetric explanations.
|
extra_computation_args |
Named list.
Specifies extra arguments related to the computation of the Shapley values.
See |
iterative_args |
Named list.
Specifies the arguments for the iterative procedure.
See |
output_args |
Named list.
Specifies certain arguments related to the output of the function.
See |
... |
Arguments passed on to
|
The shapr
package implements kernelSHAP estimation of dependence-aware Shapley values with
eight different Monte Carlo-based approaches for estimating the conditional distributions of the data.
These are all introduced in the
general usage.
(From R: vignette("general_usage", package = "shapr")
).
Moreover,
Aas et al. (2021)
gives a general introduction to dependence-aware Shapley values, and the three approaches "empirical"
,
"gaussian"
, "copula"
, and also discusses "independence"
.
Redelmeier et al. (2020) introduces the approach "ctree"
.
Olsen et al. (2022) introduces the "vaeac"
approach.
Approach "timeseries"
is discussed in
Jullum et al. (2021).
shapr
has also implemented two regression-based approaches "regression_separate"
and "regression_surrogate"
,
as described in Olsen et al. (2024).
It is also possible to combine the different approaches, see the
general usage for more information.
The package also supports the computation of causal and asymmetric Shapley values as introduced by
Heskes et al. (2020) and
Frye et al. (2020).
Asymmetric Shapley values were proposed by
Heskes et al. (2020) as a way to incorporate causal knowledge in
the real world by restricting the possible feature combinations/coalitions when computing the Shapley values to
those consistent with a (partial) causal ordering.
Causal Shapley values were proposed by
Frye et al. (2020) as a way to explain the total effect of features
on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr
.
The package allows for parallelized computation with progress updates through the tightly connected
future::future and progressr::progressr packages.
See the examples below.
For iterative estimation (iterative=TRUE
), intermediate results may also be printed to the console
(according to the verbose
argument).
Moreover, the intermediate results are written to disk.
This combined batch computing of the v(S) values, enables fast and accurate estimation of the Shapley values
in a memory friendly manner.
Object of class c("shapr", "list")
. Contains the following items:
shapley_values_est
data.table with the estimated Shapley values with explained observation in the rows and
features along the columns.
The column none
is the prediction not devoted to any of the features (given by the argument phi0
)
shapley_values_sd
data.table with the standard deviation of the Shapley values reflecting the uncertainty.
Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by
definition 0 when all coalitions is used.
Only present when extra_computation_args$compute_sd=TRUE
, which is the default when iterative = TRUE
internal
List with the different parameters, data, functions and other output used internally.
pred_explain
Numeric vector with the predictions for the explained observations
MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage for details.
timing
List containing timing information for the different parts of the computation.
init_time
and end_time
gives the time stamps for the start and end of the computation.
total_time_secs
gives the total time in seconds for the complete execution of explain()
.
main_timing_secs
gives the time in seconds for the main computations.
iter_timing_secs
gives for each iteration of the iterative estimation, the time spent on the different parts
iterative estimation routine.
Martin Jullum, Lars Henry Berge Olsen
## Not run: # Load example data data("airquality") airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test- and training data data_train <- head(airquality, -3) data_explain <- tail(airquality, 3) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # (Optionally) enable parallelization via the future package if (requireNamespace("future", quietly = TRUE)) { future::plan("multisession", workers = 2) } # (Optionally) enable progress updates within every iteration via the progressr package if (requireNamespace("progressr", quietly = TRUE)) { progressr::handlers(global = TRUE) } # Empirical approach explain1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) # Gaussian approach explain2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = p, n_MC_samples = 1e2 ) # Gaussian copula approach explain3 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "copula", phi0 = p, n_MC_samples = 1e2 ) # ctree approach explain4 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = p, n_MC_samples = 1e2 ) # Combined approach approach <- c("gaussian", "gaussian", "empirical") explain5 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p, n_MC_samples = 1e2 ) # Print the Shapley values print(explain1$shapley_values_est) # Plot the results if (requireNamespace("ggplot2", quietly = TRUE)) { plot(explain1) plot(explain1, plot_type = "waterfall") } # Group-wise explanations group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R")) explain_groups <- explain( model = model, x_explain = x_explain, x_train = x_train, group = group_list, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) print(explain_groups$shapley_values_est) # Separate and surrogate regression approaches with linear regression models. explain_separate_lm <- explain( model = model, x_explain = x_explain, x_train = x_train, phi0 = p, approach = "regression_separate", regression.model = parsnip::linear_reg() ) explain_surrogate_lm <- explain( model = model, x_explain = x_explain, x_train = x_train, phi0 = p, approach = "regression_surrogate", regression.model = parsnip::linear_reg() ) # Iterative estimation # For illustration purposes only. By default not used for such small dimensions as here # Gaussian approach explain_iterative <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = p, n_MC_samples = 1e2, iterative = TRUE, iterative_args = list(initial_n_coalitions = 10) ) ## End(Not run)
## Not run: # Load example data data("airquality") airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test- and training data data_train <- head(airquality, -3) data_explain <- tail(airquality, 3) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # (Optionally) enable parallelization via the future package if (requireNamespace("future", quietly = TRUE)) { future::plan("multisession", workers = 2) } # (Optionally) enable progress updates within every iteration via the progressr package if (requireNamespace("progressr", quietly = TRUE)) { progressr::handlers(global = TRUE) } # Empirical approach explain1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) # Gaussian approach explain2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = p, n_MC_samples = 1e2 ) # Gaussian copula approach explain3 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "copula", phi0 = p, n_MC_samples = 1e2 ) # ctree approach explain4 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = p, n_MC_samples = 1e2 ) # Combined approach approach <- c("gaussian", "gaussian", "empirical") explain5 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p, n_MC_samples = 1e2 ) # Print the Shapley values print(explain1$shapley_values_est) # Plot the results if (requireNamespace("ggplot2", quietly = TRUE)) { plot(explain1) plot(explain1, plot_type = "waterfall") } # Group-wise explanations group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R")) explain_groups <- explain( model = model, x_explain = x_explain, x_train = x_train, group = group_list, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) print(explain_groups$shapley_values_est) # Separate and surrogate regression approaches with linear regression models. explain_separate_lm <- explain( model = model, x_explain = x_explain, x_train = x_train, phi0 = p, approach = "regression_separate", regression.model = parsnip::linear_reg() ) explain_surrogate_lm <- explain( model = model, x_explain = x_explain, x_train = x_train, phi0 = p, approach = "regression_surrogate", regression.model = parsnip::linear_reg() ) # Iterative estimation # For illustration purposes only. By default not used for such small dimensions as here # Gaussian approach explain_iterative <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = p, n_MC_samples = 1e2, iterative = TRUE, iterative_args = list(initial_n_coalitions = 10) ) ## End(Not run)
Computes dependence-aware Shapley values for observations in explain_idx
from the specified
model
by using the method specified in approach
to estimate the conditional expectation.
See
Aas, et. al (2021)
for a thorough introduction to dependence-aware prediction explanation with Shapley values.
explain_forecast( model, y, xreg = NULL, train_idx = NULL, explain_idx, explain_y_lags, explain_xreg_lags = explain_y_lags, horizon, approach, phi0, max_n_coalitions = NULL, iterative = NULL, group_lags = TRUE, group = NULL, n_MC_samples = 1000, seed = 1, predict_model = NULL, get_model_specs = NULL, verbose = "basic", extra_computation_args = list(), iterative_args = list(), output_args = list(), ... )
explain_forecast( model, y, xreg = NULL, train_idx = NULL, explain_idx, explain_y_lags, explain_xreg_lags = explain_y_lags, horizon, approach, phi0, max_n_coalitions = NULL, iterative = NULL, group_lags = TRUE, group = NULL, n_MC_samples = 1000, seed = 1, predict_model = NULL, get_model_specs = NULL, verbose = "basic", extra_computation_args = list(), iterative_args = list(), output_args = list(), ... )
model |
Model object.
Specifies the model whose predictions we want to explain.
Run |
y |
Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. |
xreg |
Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows. |
train_idx |
Numeric vector.
The row indices in data and reg denoting points in time to use when estimating the conditional expectations in
the Shapley value formula.
If |
explain_idx |
Numeric vector. The row indices in data and reg denoting points in time to explain. |
explain_y_lags |
Numeric vector.
Denotes the number of lags that should be used for each variable in |
explain_xreg_lags |
Numeric vector.
If |
horizon |
Numeric.
The forecast horizon to explain. Passed to the |
approach |
Character vector of length |
phi0 |
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable. |
max_n_coalitions |
Integer.
The upper limit on the number of unique feature/group coalitions to use in the iterative procedure
(if |
iterative |
Logical or NULL
If |
group_lags |
Logical.
If |
group |
List.
If |
n_MC_samples |
Positive integer.
For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration
of every conditional expectation.
For |
seed |
Positive integer.
Specifies the seed before any randomness based code is being run.
If |
predict_model |
Function.
The prediction function used when |
get_model_specs |
Function.
An optional function for checking model/data consistency when
If |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings
|
extra_computation_args |
Named list.
Specifies extra arguments related to the computation of the Shapley values.
See |
iterative_args |
Named list.
Specifies the arguments for the iterative procedure.
See |
output_args |
Named list.
Specifies certain arguments related to the output of the function.
See |
... |
Arguments passed on to
|
This function explains a forecast of length horizon
. The argument train_idx
is analogous to x_train in explain()
, however, it just contains the time indices of where
in the data the forecast should start for each training sample. In the same way explain_idx
defines the time index (indices) which will precede a forecast to be explained.
As any autoregressive forecast model will require a set of lags to make a forecast at an
arbitrary point in time, explain_y_lags
and explain_xreg_lags
define how many lags
are required to "refit" the model at any given time index. This allows the different
approaches to work in the same way they do for time-invariant models.
See the forecasting section of the general usages for further details.
Object of class c("shapr", "list")
. Contains the following items:
shapley_values_est
data.table with the estimated Shapley values with explained observation in the rows and
features along the columns.
The column none
is the prediction not devoted to any of the features (given by the argument phi0
)
shapley_values_sd
data.table with the standard deviation of the Shapley values reflecting the uncertainty.
Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by
definition 0 when all coalitions is used.
Only present when extra_computation_args$compute_sd=TRUE
, which is the default when iterative = TRUE
internal
List with the different parameters, data, functions and other output used internally.
pred_explain
Numeric vector with the predictions for the explained observations
MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage for details.
timing
List containing timing information for the different parts of the computation.
init_time
and end_time
gives the time stamps for the start and end of the computation.
total_time_secs
gives the total time in seconds for the complete execution of explain()
.
main_timing_secs
gives the time in seconds for the main computations.
iter_timing_secs
gives for each iteration of the iterative estimation, the time spent on the different parts
iterative estimation routine.
Jon Lachmann, Martin Jullum
## Not run: # Load example data data("airquality") data <- data.table::as.data.table(airquality) # Fit an AR(2) model. model_ar_temp <- ar(data$Temp, order = 2) # Calculate the zero prediction values for a three step forecast. p0_ar <- rep(mean(data$Temp), 3) # Empirical approach, explaining forecasts starting at T = 152 and T = 153. explain_forecast( model = model_ar_temp, y = data[, "Temp"], train_idx = 2:151, explain_idx = 152:153, explain_y_lags = 2, horizon = 3, approach = "empirical", phi0 = p0_ar, group_lags = FALSE ) ## End(Not run)
## Not run: # Load example data data("airquality") data <- data.table::as.data.table(airquality) # Fit an AR(2) model. model_ar_temp <- ar(data$Temp, order = 2) # Calculate the zero prediction values for a three step forecast. p0_ar <- rep(mean(data$Temp), 3) # Empirical approach, explaining forecasts starting at T = 152 and T = 153. explain_forecast( model = model_ar_temp, y = data[, "Temp"], train_idx = 2:151, explain_idx = 152:153, explain_y_lags = 2, horizon = 3, approach = "empirical", phi0 = p0_ar, group_lags = FALSE ) ## End(Not run)
Gets the default values for the extra estimation arguments
get_extra_comp_args_default( internal, paired_shap_sampling = isFALSE(internal$parameters$asymmetric), kernelSHAP_reweighting = "on_all_cond", compute_sd = isFALSE(internal$parameters$exact), n_boot_samps = 100, max_batch_size = 10, min_n_batches = 10 )
get_extra_comp_args_default( internal, paired_shap_sampling = isFALSE(internal$parameters$asymmetric), kernelSHAP_reweighting = "on_all_cond", compute_sd = isFALSE(internal$parameters$exact), n_boot_samps = 100, max_batch_size = 10, min_n_batches = 10 )
internal |
List.
Not used directly, but passed through from |
paired_shap_sampling |
Logical.
If |
kernelSHAP_reweighting |
String.
How to reweight the sampling frequency weights in the kernelSHAP solution after sampling.
The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates.
The options are one of |
compute_sd |
Logical. Whether to estimate the standard deviations of the Shapley value estimates. This is TRUE whenever sampling based kernelSHAP is applied (either iteratively or with a fixed number of coalitions). |
n_boot_samps |
Integer. The number of bootstrapped samples (i.e. samples with replacement) from the set of all coalitions used to estimate the standard deviations of the Shapley value estimates. |
max_batch_size |
Integer. The maximum number of coalitions to estimate simultaneously within each iteration. A larger numbers requires more memory, but may have a slight computational advantage. |
min_n_batches |
Integer. The minimum number of batches to split the computation into within each iteration. Larger numbers gives more frequent progress updates. If parallelization is applied, this should be set no smaller than the number of parallel workers. |
Martin Jullum
Function to specify arguments of the iterative estimation procedure
get_iterative_args_default( internal, initial_n_coalitions = ceiling(min(200, max(5, internal$parameters$n_features, (2^internal$parameters$n_features)/10), internal$parameters$max_n_coalitions)), fixed_n_coalitions_per_iter = NULL, max_iter = 20, convergence_tol = 0.02, n_coal_next_iter_factor_vec = c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10)) )
get_iterative_args_default( internal, initial_n_coalitions = ceiling(min(200, max(5, internal$parameters$n_features, (2^internal$parameters$n_features)/10), internal$parameters$max_n_coalitions)), fixed_n_coalitions_per_iter = NULL, max_iter = 20, convergence_tol = 0.02, n_coal_next_iter_factor_vec = c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10)) )
internal |
List.
Not used directly, but passed through from |
initial_n_coalitions |
Integer. Number of coalitions to use in the first estimation iteration. |
fixed_n_coalitions_per_iter |
Integer. Number of |
max_iter |
Integer. Maximum number of estimation iterations |
convergence_tol |
Numeric. The t variable in the convergence threshold formula on page 6 in the paper Covert and Lee (2021), 'Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression' https://arxiv.org/pdf/2012.01536. Smaller values requires more coalitions before convergence is reached. |
n_coal_next_iter_factor_vec |
Numeric vector. The number of |
The functions sets default values for the iterative estimation procedure, according to the function
defaults.
If the argument iterative
of explain()
is FALSE, it sets parameters corresponding to the use of a
non-iterative estimation procedure
Martin Jullum
Gets the default values for the output arguments
get_output_args_default( keep_samp_for_vS = FALSE, MSEv_uniform_comb_weights = TRUE, saving_path = tempfile("shapr_obj_", fileext = ".rds") )
get_output_args_default( keep_samp_for_vS = FALSE, MSEv_uniform_comb_weights = TRUE, saving_path = tempfile("shapr_obj_", fileext = ".rds") )
keep_samp_for_vS |
Logical.
Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in |
MSEv_uniform_comb_weights |
Logical.
If |
saving_path |
String. The path to the directory where the results of the iterative estimation procedure should be saved. Defaults to a temporary directory. |
Martin Jullum
Gets the implemented approaches
get_supported_approaches()
get_supported_approaches()
Character vector.
The names of the implemented approaches that can be passed to argument approach
in explain()
.
Provides a data.table with the supported models
get_supported_models()
get_supported_models()
Make plots to visualize and compare the MSEv evaluation criterion for a list of
explain()
objects applied to the same data and model. The function creates
bar plots and line plots with points to illustrate the overall MSEv evaluation
criterion, but also for each observation/explicand and coalition by only averaging over
the coalitions and observations/explicands, respectively.
plot_MSEv_eval_crit( explanation_list, index_x_explain = NULL, id_coalition = NULL, CI_level = if (length(explanation_list[[1]]$pred_explain) < 20) NULL else 0.95, geom_col_width = 0.9, plot_type = "overall" )
plot_MSEv_eval_crit( explanation_list, index_x_explain = NULL, id_coalition = NULL, CI_level = if (length(explanation_list[[1]]$pred_explain) < 20) NULL else 0.95, geom_col_width = 0.9, plot_type = "overall" )
explanation_list |
A list of |
index_x_explain |
Integer vector.
Which of the test observations to plot. E.g. if you have
explained 10 observations using |
id_coalition |
Integer vector. Which of the coalitions to plot.
E.g. if you used |
CI_level |
Positive numeric between zero and one. Default is |
geom_col_width |
Numeric. Bar width. By default, set to 90% of the |
plot_type |
Character vector. The possible options are "overall" (default), "comb", and "explicand".
If |
Either a single ggplot2::ggplot()
object of the MSEv criterion when plot_type = "overall"
, or a list
of ggplot2::ggplot()
objects based on the plot_type
parameter.
Lars Henry Berge Olsen
## Not run: # Load necessary libraries library(xgboost) library(data.table) library(shapr) library(ggplot2) # Get the data data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] #' Define the features and the response x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test and training data set ind_x_explain <- 1:25 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost::xgboost( data = as.matrix(x_train), label = y_train, nround = 20, verbose = FALSE ) # Specifying the phi_0, i.e. the expected prediction without any features phi0 <- mean(y_train) # Independence approach explanation_independence <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "independence", phi0 = phi0, n_MC_samples = 1e2 ) # Gaussian 1e1 approach explanation_gaussian_1e1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e1 ) # Gaussian 1e2 approach explanation_gaussian_1e2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e2 ) # ctree approach explanation_ctree <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = phi0, n_MC_samples = 1e2 ) # Combined approach explanation_combined <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = c("gaussian", "independence", "ctree"), phi0 = phi0, n_MC_samples = 1e2 ) # Create a list of explanations with names explanation_list_named <- list( "Ind." = explanation_independence, "Gaus. 1e1" = explanation_gaussian_1e1, "Gaus. 1e2" = explanation_gaussian_1e2, "Ctree" = explanation_ctree, "Combined" = explanation_combined ) if (requireNamespace("ggplot2", quietly = TRUE)) { # Create the default MSEv plot where we average over both the coalitions and observations # with approximate 95% confidence intervals plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = "overall") # Can also create plots of the MSEv criterion averaged only over the coalitions or observations. MSEv_figures <- plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = c("overall", "comb", "explicand") ) MSEv_figures$MSEv_bar MSEv_figures$MSEv_coalition_bar MSEv_figures$MSEv_explicand_bar # When there are many coalitions or observations, then it can be easier to look at line plots MSEv_figures$MSEv_coalition_line_point MSEv_figures$MSEv_explicand_line_point # We can specify which observations or coalitions to plot plot_MSEv_eval_crit(explanation_list_named, plot_type = "explicand", index_x_explain = c(1, 3:4, 6), CI_level = 0.95 )$MSEv_explicand_bar plot_MSEv_eval_crit(explanation_list_named, plot_type = "comb", id_coalition = c(3, 4, 9, 13:15), CI_level = 0.95 )$MSEv_coalition_bar # We can alter the figures if other palette schemes or design is wanted bar_text_n_decimals <- 1 MSEv_figures$MSEv_bar + ggplot2::scale_x_discrete(limits = rev(levels(MSEv_figures$MSEv_bar$data$Method))) + ggplot2::coord_flip() + ggplot2::scale_fill_discrete() + #' Default ggplot2 palette ggplot2::theme_minimal() + #' This must be set before the other theme call ggplot2::theme( plot.title = ggplot2::element_text(size = 10), legend.position = "bottom" ) + ggplot2::guides(fill = ggplot2::guide_legend(nrow = 1, ncol = 6)) + ggplot2::geom_text( ggplot2::aes(label = sprintf( paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""), round(MSEv, bar_text_n_decimals) )), vjust = -1.1, # This value must be altered based on the plot dimension hjust = 1.1, # This value must be altered based on the plot dimension color = "black", position = ggplot2::position_dodge(0.9), size = 5 ) } ## End(Not run)
## Not run: # Load necessary libraries library(xgboost) library(data.table) library(shapr) library(ggplot2) # Get the data data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] #' Define the features and the response x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test and training data set ind_x_explain <- 1:25 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost::xgboost( data = as.matrix(x_train), label = y_train, nround = 20, verbose = FALSE ) # Specifying the phi_0, i.e. the expected prediction without any features phi0 <- mean(y_train) # Independence approach explanation_independence <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "independence", phi0 = phi0, n_MC_samples = 1e2 ) # Gaussian 1e1 approach explanation_gaussian_1e1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e1 ) # Gaussian 1e2 approach explanation_gaussian_1e2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e2 ) # ctree approach explanation_ctree <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = phi0, n_MC_samples = 1e2 ) # Combined approach explanation_combined <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = c("gaussian", "independence", "ctree"), phi0 = phi0, n_MC_samples = 1e2 ) # Create a list of explanations with names explanation_list_named <- list( "Ind." = explanation_independence, "Gaus. 1e1" = explanation_gaussian_1e1, "Gaus. 1e2" = explanation_gaussian_1e2, "Ctree" = explanation_ctree, "Combined" = explanation_combined ) if (requireNamespace("ggplot2", quietly = TRUE)) { # Create the default MSEv plot where we average over both the coalitions and observations # with approximate 95% confidence intervals plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = "overall") # Can also create plots of the MSEv criterion averaged only over the coalitions or observations. MSEv_figures <- plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = c("overall", "comb", "explicand") ) MSEv_figures$MSEv_bar MSEv_figures$MSEv_coalition_bar MSEv_figures$MSEv_explicand_bar # When there are many coalitions or observations, then it can be easier to look at line plots MSEv_figures$MSEv_coalition_line_point MSEv_figures$MSEv_explicand_line_point # We can specify which observations or coalitions to plot plot_MSEv_eval_crit(explanation_list_named, plot_type = "explicand", index_x_explain = c(1, 3:4, 6), CI_level = 0.95 )$MSEv_explicand_bar plot_MSEv_eval_crit(explanation_list_named, plot_type = "comb", id_coalition = c(3, 4, 9, 13:15), CI_level = 0.95 )$MSEv_coalition_bar # We can alter the figures if other palette schemes or design is wanted bar_text_n_decimals <- 1 MSEv_figures$MSEv_bar + ggplot2::scale_x_discrete(limits = rev(levels(MSEv_figures$MSEv_bar$data$Method))) + ggplot2::coord_flip() + ggplot2::scale_fill_discrete() + #' Default ggplot2 palette ggplot2::theme_minimal() + #' This must be set before the other theme call ggplot2::theme( plot.title = ggplot2::element_text(size = 10), legend.position = "bottom" ) + ggplot2::guides(fill = ggplot2::guide_legend(nrow = 1, ncol = 6)) + ggplot2::geom_text( ggplot2::aes(label = sprintf( paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""), round(MSEv, bar_text_n_decimals) )), vjust = -1.1, # This value must be altered based on the plot dimension hjust = 1.1, # This value must be altered based on the plot dimension color = "black", position = ggplot2::position_dodge(0.9), size = 5 ) } ## End(Not run)
Make plots to visualize and compare the estimated Shapley values for a list of
explain()
objects applied to the same data and model. For group-wise Shapley values,
the features values plotted are the mean feature values for all features in each group.
plot_SV_several_approaches( explanation_list, index_explicands = NULL, index_explicands_sort = FALSE, only_these_features = NULL, plot_phi0 = FALSE, digits = 4, add_zero_line = FALSE, axis_labels_n_dodge = NULL, axis_labels_rotate_angle = NULL, horizontal_bars = TRUE, facet_scales = "free", facet_ncol = 2, geom_col_width = 0.85, brewer_palette = NULL, include_group_feature_means = FALSE )
plot_SV_several_approaches( explanation_list, index_explicands = NULL, index_explicands_sort = FALSE, only_these_features = NULL, plot_phi0 = FALSE, digits = 4, add_zero_line = FALSE, axis_labels_n_dodge = NULL, axis_labels_rotate_angle = NULL, horizontal_bars = TRUE, facet_scales = "free", facet_ncol = 2, geom_col_width = 0.85, brewer_palette = NULL, include_group_feature_means = FALSE )
explanation_list |
A list of |
index_explicands |
Integer vector. Which of the explicands (test observations) to plot.
E.g. if you have explained 10 observations using |
index_explicands_sort |
Boolean. If |
only_these_features |
String vector. Containing the names of the features which are to be included in the bar plots. |
plot_phi0 |
Boolean. If we are to include the |
digits |
Integer.
Number of significant digits to use in the feature description.
Applicable for |
add_zero_line |
Boolean. If we are to add a black line for a feature contribution of 0. |
axis_labels_n_dodge |
Integer. The number of rows that should be used to render the labels. This is useful for displaying labels that would otherwise overlap. |
axis_labels_rotate_angle |
Numeric. The angle of the axis label, where 0 means horizontal, 45 means tilted,
and 90 means vertical. Compared to setting the angle in |
horizontal_bars |
Boolean. Flip Cartesian coordinates so that horizontal becomes vertical,
and vertical, horizontal. This is primarily useful for converting geoms and statistics which display
y conditional on x, to x conditional on y. See |
facet_scales |
Should scales be free (" |
facet_ncol |
Integer. The number of columns in the facet grid. Default is |
geom_col_width |
Numeric. Bar width. By default, set to 85% of the |
brewer_palette |
String. Name of one of the color palettes from
|
include_group_feature_means |
Logical. Whether to include the average feature value in a group on the
y-axis or not. If |
A ggplot2::ggplot()
object.
Lars Henry Berge Olsen
## Not run: # Load necessary libraries library(xgboost) library(data.table) # Get the data data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] # Define the features and the response x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test and training data set ind_x_explain <- 1:12 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost::xgboost( data = as.matrix(x_train), label = y_train, nround = 20, verbose = FALSE ) # Specifying the phi_0, i.e. the expected prediction without any features phi0 <- mean(y_train) # Independence approach explanation_independence <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "independence", phi0 = phi0, n_MC_samples = 1e2 ) # Empirical approach explanation_empirical <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = phi0, n_MC_samples = 1e2 ) # Gaussian 1e1 approach explanation_gaussian_1e1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e1 ) # Gaussian 1e2 approach explanation_gaussian_1e2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e2 ) # Combined approach explanation_combined <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = c("gaussian", "ctree", "empirical"), phi0 = phi0, n_MC_samples = 1e2 ) # Create a list of explanations with names explanation_list <- list( "Ind." = explanation_independence, "Emp." = explanation_empirical, "Gaus. 1e1" = explanation_gaussian_1e1, "Gaus. 1e2" = explanation_gaussian_1e2, "Combined" = explanation_combined ) if (requireNamespace("ggplot2", quietly = TRUE)) { # The function uses the provided names. plot_SV_several_approaches(explanation_list) # We can change the number of columns in the grid of plots and add other visual alterations plot_SV_several_approaches(explanation_list, facet_ncol = 3, facet_scales = "free_y", add_zero_line = TRUE, digits = 2, brewer_palette = "Paired", geom_col_width = 0.6 ) + ggplot2::theme_minimal() + ggplot2::theme(legend.position = "bottom", plot.title = ggplot2::element_text(size = 0)) # We can specify which explicands to plot to get less chaotic plots and make the bars vertical plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), horizontal_bars = FALSE, axis_labels_rotate_angle = 45 ) # We can change the order of the features by specifying the # order using the `only_these_features` parameter. plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), only_these_features = c("Temp", "Solar.R", "Month", "Wind") ) # We can also remove certain features if we are not interested in them # or want to focus on, e.g., two features. The function will give a # message to if the user specifies non-valid feature names. plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), only_these_features = c("Temp", "Solar.R"), plot_phi0 = TRUE ) } ## End(Not run)
## Not run: # Load necessary libraries library(xgboost) library(data.table) # Get the data data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] # Define the features and the response x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test and training data set ind_x_explain <- 1:12 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost::xgboost( data = as.matrix(x_train), label = y_train, nround = 20, verbose = FALSE ) # Specifying the phi_0, i.e. the expected prediction without any features phi0 <- mean(y_train) # Independence approach explanation_independence <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "independence", phi0 = phi0, n_MC_samples = 1e2 ) # Empirical approach explanation_empirical <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = phi0, n_MC_samples = 1e2 ) # Gaussian 1e1 approach explanation_gaussian_1e1 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e1 ) # Gaussian 1e2 approach explanation_gaussian_1e2 <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "gaussian", phi0 = phi0, n_MC_samples = 1e2 ) # Combined approach explanation_combined <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = c("gaussian", "ctree", "empirical"), phi0 = phi0, n_MC_samples = 1e2 ) # Create a list of explanations with names explanation_list <- list( "Ind." = explanation_independence, "Emp." = explanation_empirical, "Gaus. 1e1" = explanation_gaussian_1e1, "Gaus. 1e2" = explanation_gaussian_1e2, "Combined" = explanation_combined ) if (requireNamespace("ggplot2", quietly = TRUE)) { # The function uses the provided names. plot_SV_several_approaches(explanation_list) # We can change the number of columns in the grid of plots and add other visual alterations plot_SV_several_approaches(explanation_list, facet_ncol = 3, facet_scales = "free_y", add_zero_line = TRUE, digits = 2, brewer_palette = "Paired", geom_col_width = 0.6 ) + ggplot2::theme_minimal() + ggplot2::theme(legend.position = "bottom", plot.title = ggplot2::element_text(size = 0)) # We can specify which explicands to plot to get less chaotic plots and make the bars vertical plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), horizontal_bars = FALSE, axis_labels_rotate_angle = 45 ) # We can change the order of the features by specifying the # order using the `only_these_features` parameter. plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), only_these_features = c("Temp", "Solar.R", "Month", "Wind") ) # We can also remove certain features if we are not interested in them # or want to focus on, e.g., two features. The function will give a # message to if the user specifies non-valid feature names. plot_SV_several_approaches(explanation_list, index_explicands = c(1:2, 5, 10), only_these_features = c("Temp", "Solar.R"), plot_phi0 = TRUE ) } ## End(Not run)
vaeac
modelsThis function makes (ggplot2::ggplot()
) figures of the training VLB and the validation IWAE for a list
of explain()
objects with approach = "vaeac"
. See setup_approach()
for more information about the
vaeac
approach. Two figures are returned by the function. In the figure, each object in explanation_list
gets
its own facet, while in the second figure, we plot the criteria in each facet for all objects.
plot_vaeac_eval_crit( explanation_list, plot_from_nth_epoch = 1, plot_every_nth_epoch = 1, criteria = c("VLB", "IWAE"), plot_type = c("method", "criterion"), facet_wrap_scales = "fixed", facet_wrap_ncol = NULL )
plot_vaeac_eval_crit( explanation_list, plot_from_nth_epoch = 1, plot_every_nth_epoch = 1, criteria = c("VLB", "IWAE"), plot_type = c("method", "criterion"), facet_wrap_scales = "fixed", facet_wrap_ncol = NULL )
explanation_list |
A list of |
plot_from_nth_epoch |
Integer. If we are only plot the results form the nth epoch and so forth. The first epochs can be large in absolute value and make the rest of the plot difficult to interpret. |
plot_every_nth_epoch |
Integer. If we are only to plot every nth epoch. Usefully to illustrate the overall trend, as there can be a lot of fluctuation and oscillation in the values between each epoch. |
criteria |
Character vector. The possible options are "VLB", "IWAE", "IWAE_running". Default is the first two. |
plot_type |
Character vector. The possible options are "method" and "criterion". Default is to plot both. |
facet_wrap_scales |
String. Should the scales be fixed (" |
facet_wrap_ncol |
Integer. Number of columns in the facet wrap. |
See Olsen et al. (2022) or the blog post for a summary of the VLB and IWAE.
Either a single ggplot2::ggplot()
object or a list of ggplot2::ggplot()
objects based on the
plot_type
parameter.
Lars Henry Berge Olsen
## Not run: library(xgboost) library(data.table) library(shapr) data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" ind_x_explain <- 1:6 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost(data = as.matrix(x_train), label = y_train, nround = 100, verbose = FALSE) # Specifying the phi_0, i.e. the expected prediction without any features p0 <- mean(y_train) # Train vaeac with and without paired sampling explanation_paired <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p0, n_MC_samples = 1, # As we are only interested in the training of the vaeac vaeac.epochs = 10, # Should be higher in applications. vaeac.n_vaeacs_initialize = 1, vaeac.width = 16, vaeac.depth = 2, vaeac.extra_parameters = list(vaeac.paired_sampling = TRUE) ) explanation_regular <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p0, n_MC_samples = 1, # As we are only interested in the training of the vaeac vaeac.epochs = 10, # Should be higher in applications. vaeac.width = 16, vaeac.depth = 2, vaeac.n_vaeacs_initialize = 1, vaeac.extra_parameters = list(vaeac.paired_sampling = FALSE) ) # Collect the explanation objects in an named list explanation_list <- list( "Regular sampling" = explanation_regular, "Paired sampling" = explanation_paired ) # Call the function with the named list, will use the provided names plot_vaeac_eval_crit(explanation_list = explanation_list) # The function also works if we have only one method, # but then one should only look at the method plot. plot_vaeac_eval_crit( explanation_list = explanation_list[2], plot_type = "method" ) # Can alter the plot plot_vaeac_eval_crit( explanation_list = explanation_list, plot_from_nth_epoch = 2, plot_every_nth_epoch = 2, facet_wrap_scales = "free" ) # If we only want the VLB plot_vaeac_eval_crit( explanation_list = explanation_list, criteria = "VLB", plot_type = "criterion" ) # If we want only want the criterion version tmp_fig_criterion <- plot_vaeac_eval_crit(explanation_list = explanation_list, plot_type = "criterion") # Since tmp_fig_criterion is a ggplot2 object, we can alter it # by, e.g,. adding points or smooths with se bands tmp_fig_criterion + ggplot2::geom_point(shape = "circle", size = 1, ggplot2::aes(col = Method)) tmp_fig_criterion$layers[[1]] <- NULL tmp_fig_criterion + ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = TRUE) + ggplot2::scale_color_brewer(palette = "Set1") + ggplot2::theme_minimal() ## End(Not run)
## Not run: library(xgboost) library(data.table) library(shapr) data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" ind_x_explain <- 1:6 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost(data = as.matrix(x_train), label = y_train, nround = 100, verbose = FALSE) # Specifying the phi_0, i.e. the expected prediction without any features p0 <- mean(y_train) # Train vaeac with and without paired sampling explanation_paired <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p0, n_MC_samples = 1, # As we are only interested in the training of the vaeac vaeac.epochs = 10, # Should be higher in applications. vaeac.n_vaeacs_initialize = 1, vaeac.width = 16, vaeac.depth = 2, vaeac.extra_parameters = list(vaeac.paired_sampling = TRUE) ) explanation_regular <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = approach, phi0 = p0, n_MC_samples = 1, # As we are only interested in the training of the vaeac vaeac.epochs = 10, # Should be higher in applications. vaeac.width = 16, vaeac.depth = 2, vaeac.n_vaeacs_initialize = 1, vaeac.extra_parameters = list(vaeac.paired_sampling = FALSE) ) # Collect the explanation objects in an named list explanation_list <- list( "Regular sampling" = explanation_regular, "Paired sampling" = explanation_paired ) # Call the function with the named list, will use the provided names plot_vaeac_eval_crit(explanation_list = explanation_list) # The function also works if we have only one method, # but then one should only look at the method plot. plot_vaeac_eval_crit( explanation_list = explanation_list[2], plot_type = "method" ) # Can alter the plot plot_vaeac_eval_crit( explanation_list = explanation_list, plot_from_nth_epoch = 2, plot_every_nth_epoch = 2, facet_wrap_scales = "free" ) # If we only want the VLB plot_vaeac_eval_crit( explanation_list = explanation_list, criteria = "VLB", plot_type = "criterion" ) # If we want only want the criterion version tmp_fig_criterion <- plot_vaeac_eval_crit(explanation_list = explanation_list, plot_type = "criterion") # Since tmp_fig_criterion is a ggplot2 object, we can alter it # by, e.g,. adding points or smooths with se bands tmp_fig_criterion + ggplot2::geom_point(shape = "circle", size = 1, ggplot2::aes(col = Method)) tmp_fig_criterion$layers[[1]] <- NULL tmp_fig_criterion + ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = TRUE) + ggplot2::scale_color_brewer(palette = "Set1") + ggplot2::theme_minimal() ## End(Not run)
A function that creates a matrix of plots (GGally::ggpairs()
) from
generated imputations from the unconditioned distribution estimated by
a
vaeac
model, and then compares the imputed values with data from the true distribution (if provided).
See ggpairs for an
introduction to GGally::ggpairs()
, and the corresponding
vignette.
plot_vaeac_imputed_ggpairs( explanation, which_vaeac_model = "best", x_true = NULL, add_title = TRUE, alpha = 0.5, upper_cont = c("cor", "points", "smooth", "smooth_loess", "density", "blank"), upper_cat = c("count", "cross", "ratio", "facetbar", "blank"), upper_mix = c("box", "box_no_facet", "dot", "dot_no_facet", "facethist", "facetdensity", "denstrip", "blank"), lower_cont = c("points", "smooth", "smooth_loess", "density", "cor", "blank"), lower_cat = c("facetbar", "ratio", "count", "cross", "blank"), lower_mix = c("facetdensity", "box", "box_no_facet", "dot", "dot_no_facet", "facethist", "denstrip", "blank"), diag_cont = c("densityDiag", "barDiag", "blankDiag"), diag_cat = c("barDiag", "blankDiag"), cor_method = c("pearson", "kendall", "spearman") )
plot_vaeac_imputed_ggpairs( explanation, which_vaeac_model = "best", x_true = NULL, add_title = TRUE, alpha = 0.5, upper_cont = c("cor", "points", "smooth", "smooth_loess", "density", "blank"), upper_cat = c("count", "cross", "ratio", "facetbar", "blank"), upper_mix = c("box", "box_no_facet", "dot", "dot_no_facet", "facethist", "facetdensity", "denstrip", "blank"), lower_cont = c("points", "smooth", "smooth_loess", "density", "cor", "blank"), lower_cat = c("facetbar", "ratio", "count", "cross", "blank"), lower_mix = c("facetdensity", "box", "box_no_facet", "dot", "dot_no_facet", "facethist", "denstrip", "blank"), diag_cont = c("densityDiag", "barDiag", "blankDiag"), diag_cat = c("barDiag", "blankDiag"), cor_method = c("pearson", "kendall", "spearman") )
explanation |
Shapr list. The output list from the |
which_vaeac_model |
String. Indicating which |
x_true |
Data.table containing the data from the distribution that the |
add_title |
Logical. If |
alpha |
Numeric between |
upper_cont |
String. Type of plot to use in upper triangle for continuous features, see |
upper_cat |
String. Type of plot to use in upper triangle for categorical features, see |
upper_mix |
String. Type of plot to use in upper triangle for mixed features, see |
lower_cont |
String. Type of plot to use in lower triangle for continuous features, see |
lower_cat |
String. Type of plot to use in lower triangle for categorical features, see |
lower_mix |
String. Type of plot to use in lower triangle for mixed features, see |
diag_cont |
String. Type of plot to use on the diagonal for continuous features, see |
diag_cat |
String. Type of plot to use on the diagonal for categorical features, see |
cor_method |
String. Type of correlation measure, see |
A GGally::ggpairs()
figure.
Lars Henry Berge Olsen
## Not run: library(xgboost) library(data.table) library(shapr) data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" ind_x_explain <- 1:6 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost( data = as.matrix(x_train), label = y_train, nround = 100, verbose = FALSE ) explanation <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "vaeac", phi0 = mean(y_train), n_MC_samples = 1, vaeac.epochs = 10, vaeac.n_vaeacs_initialize = 1 ) # Plot the results figure <- plot_vaeac_imputed_ggpairs( explanation = explanation, which_vaeac_model = "best", x_true = x_train, add_title = TRUE ) figure # Note that this is an ggplot2 object which we can alter, e.g., we can change the colors. figure + ggplot2::scale_color_manual(values = c("#E69F00", "#999999")) + ggplot2::scale_fill_manual(values = c("#E69F00", "#999999")) ## End(Not run)
## Not run: library(xgboost) library(data.table) library(shapr) data("airquality") data <- data.table::as.data.table(airquality) data <- data[complete.cases(data), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" ind_x_explain <- 1:6 x_train <- data[-ind_x_explain, ..x_var] y_train <- data[-ind_x_explain, get(y_var)] x_explain <- data[ind_x_explain, ..x_var] # Fitting a basic xgboost model to the training data model <- xgboost( data = as.matrix(x_train), label = y_train, nround = 100, verbose = FALSE ) explanation <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "vaeac", phi0 = mean(y_train), n_MC_samples = 1, vaeac.epochs = 10, vaeac.n_vaeacs_initialize = 1 ) # Plot the results figure <- plot_vaeac_imputed_ggpairs( explanation = explanation, which_vaeac_model = "best", x_true = x_train, add_title = TRUE ) figure # Note that this is an ggplot2 object which we can alter, e.g., we can change the colors. figure + ggplot2::scale_color_manual(values = c("#E69F00", "#999999")) + ggplot2::scale_fill_manual(values = c("#E69F00", "#999999")) ## End(Not run)
Plots the individual prediction explanations.
## S3 method for class 'shapr' plot( x, plot_type = "bar", digits = 3, index_x_explain = NULL, top_k_features = NULL, col = NULL, bar_plot_phi0 = TRUE, bar_plot_order = "largest_first", scatter_features = NULL, scatter_hist = TRUE, include_group_feature_means = FALSE, beeswarm_cex = 1/length(index_x_explain)^(1/4), ... )
## S3 method for class 'shapr' plot( x, plot_type = "bar", digits = 3, index_x_explain = NULL, top_k_features = NULL, col = NULL, bar_plot_phi0 = TRUE, bar_plot_order = "largest_first", scatter_features = NULL, scatter_hist = TRUE, include_group_feature_means = FALSE, beeswarm_cex = 1/length(index_x_explain)^(1/4), ... )
x |
An |
plot_type |
Character.
Specifies the type of plot to produce.
|
digits |
Integer.
Number of significant digits to use in the feature description.
Applicable for |
index_x_explain |
Integer vector.
Which of the test observations to plot. E.g. if you have
explained 10 observations using |
top_k_features |
Integer.
How many features to include in the plot.
E.g. if you have 15 features in your model you can plot the 5 most important features,
for each explanation, by setting |
col |
Character vector (where length depends on plot type).
The color codes (hex codes or other names understood by If you want to alter the colors i the plot, the length of the |
bar_plot_phi0 |
Logical.
Whether to include |
bar_plot_order |
Character.
Specifies what order to plot the features with respect to the magnitude of the shapley values with
|
scatter_features |
Integer or character vector.
Only used for |
scatter_hist |
Logical.
Only used for |
include_group_feature_means |
Logical.
Whether to include the average feature value in a group on the y-axis or not.
If |
beeswarm_cex |
Numeric.
The cex argument of |
... |
Other arguments passed to underlying functions,
like |
See the examples below, or vignette("general_usage", package = "shapr")
for an examples of
how you should use the function.
ggplot object with plots of the Shapley value explanations
Martin Jullum, Vilde Ung, Lars Henry Berge Olsen
## Not run: data("airquality") airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test- and training data data_train <- head(airquality, -50) data_explain <- tail(airquality, 50) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # Empirical approach x <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) { # The default plotting option is a bar plot of the Shapley values # We draw bar plots for the first 4 observations plot(x, index_x_explain = 1:4) # We can also make waterfall plots plot(x, plot_type = "waterfall", index_x_explain = 1:4) # And only showing the 2 features with largest contribution plot(x, plot_type = "waterfall", index_x_explain = 1:4, top_k_features = 2) # Or scatter plots showing the distribution of the shapley values and feature values plot(x, plot_type = "scatter") # And only for a specific feature plot(x, plot_type = "scatter", scatter_features = "Temp") # Or a beeswarm plot summarising the Shapley values and feature values for all features plot(x, plot_type = "beeswarm") plot(x, plot_type = "beeswarm", col = c("red", "black")) # we can change colors # Additional arguments can be passed to ggbeeswarm::geom_beeswarm() using the '...' argument. # For instance, sometimes the beeswarm plots overlap too much. # This can be fixed with the 'corral="wrap" argument. # See ?ggbeeswarm::geom_beeswarm for more information. plot(x, plot_type = "beeswarm", corral = "wrap") } # Example of scatter and beeswarm plot with factor variables airquality$Month_factor <- as.factor(month.abb[airquality$Month]) airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month_factor") y_var <- "Ozone" # Split data into test- and training data data_train <- airquality data_explain <- tail(airquality, 50) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # Empirical approach x <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = p, n_MC_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) { plot(x, plot_type = "scatter") plot(x, plot_type = "beeswarm") } ## End(Not run)
## Not run: data("airquality") airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month") y_var <- "Ozone" # Split data into test- and training data data_train <- head(airquality, -50) data_explain <- tail(airquality, 50) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # Empirical approach x <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "empirical", phi0 = p, n_MC_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) { # The default plotting option is a bar plot of the Shapley values # We draw bar plots for the first 4 observations plot(x, index_x_explain = 1:4) # We can also make waterfall plots plot(x, plot_type = "waterfall", index_x_explain = 1:4) # And only showing the 2 features with largest contribution plot(x, plot_type = "waterfall", index_x_explain = 1:4, top_k_features = 2) # Or scatter plots showing the distribution of the shapley values and feature values plot(x, plot_type = "scatter") # And only for a specific feature plot(x, plot_type = "scatter", scatter_features = "Temp") # Or a beeswarm plot summarising the Shapley values and feature values for all features plot(x, plot_type = "beeswarm") plot(x, plot_type = "beeswarm", col = c("red", "black")) # we can change colors # Additional arguments can be passed to ggbeeswarm::geom_beeswarm() using the '...' argument. # For instance, sometimes the beeswarm plots overlap too much. # This can be fixed with the 'corral="wrap" argument. # See ?ggbeeswarm::geom_beeswarm for more information. plot(x, plot_type = "beeswarm", corral = "wrap") } # Example of scatter and beeswarm plot with factor variables airquality$Month_factor <- as.factor(month.abb[airquality$Month]) airquality <- airquality[complete.cases(airquality), ] x_var <- c("Solar.R", "Wind", "Temp", "Month_factor") y_var <- "Ozone" # Split data into test- and training data data_train <- airquality data_explain <- tail(airquality, 50) x_train <- data_train[, x_var] x_explain <- data_explain[, x_var] # Fit a linear model lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + "))) model <- lm(lm_formula, data = data_train) # Explain predictions p <- mean(data_train[, y_var]) # Empirical approach x <- explain( model = model, x_explain = x_explain, x_train = x_train, approach = "ctree", phi0 = p, n_MC_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggbeeswarm", quietly = TRUE)) { plot(x, plot_type = "scatter") plot(x, plot_type = "beeswarm") } ## End(Not run)
Print method for shapr objects
## S3 method for class 'shapr' print(x, digits = 4, ...)
## S3 method for class 'shapr' print(x, digits = 4, ...)
x |
A shapr object |
digits |
Scalar Integer. Number of digits to display to the console |
... |
Unused |
vaeac
modelIn this function, we specify the default values for the extra parameters used in explain()
for approach = "vaeac"
.
vaeac_get_extra_para_default( vaeac.model_description = make.names(Sys.time()), vaeac.folder_to_save_model = tempdir(), vaeac.pretrained_vaeac_model = NULL, vaeac.cuda = FALSE, vaeac.epochs_initiation_phase = 2, vaeac.epochs_early_stopping = NULL, vaeac.save_every_nth_epoch = NULL, vaeac.val_ratio = 0.25, vaeac.val_iwae_n_samples = 25, vaeac.batch_size = 64, vaeac.batch_size_sampling = NULL, vaeac.running_avg_n_values = 5, vaeac.skip_conn_layer = TRUE, vaeac.skip_conn_masked_enc_dec = TRUE, vaeac.batch_normalization = FALSE, vaeac.paired_sampling = TRUE, vaeac.masking_ratio = 0.5, vaeac.mask_gen_coalitions = NULL, vaeac.mask_gen_coalitions_prob = NULL, vaeac.sigma_mu = 10000, vaeac.sigma_sigma = 1e-04, vaeac.sample_random = TRUE, vaeac.save_data = FALSE, vaeac.log_exp_cont_feat = FALSE, vaeac.which_vaeac_model = "best", vaeac.save_model = TRUE )
vaeac_get_extra_para_default( vaeac.model_description = make.names(Sys.time()), vaeac.folder_to_save_model = tempdir(), vaeac.pretrained_vaeac_model = NULL, vaeac.cuda = FALSE, vaeac.epochs_initiation_phase = 2, vaeac.epochs_early_stopping = NULL, vaeac.save_every_nth_epoch = NULL, vaeac.val_ratio = 0.25, vaeac.val_iwae_n_samples = 25, vaeac.batch_size = 64, vaeac.batch_size_sampling = NULL, vaeac.running_avg_n_values = 5, vaeac.skip_conn_layer = TRUE, vaeac.skip_conn_masked_enc_dec = TRUE, vaeac.batch_normalization = FALSE, vaeac.paired_sampling = TRUE, vaeac.masking_ratio = 0.5, vaeac.mask_gen_coalitions = NULL, vaeac.mask_gen_coalitions_prob = NULL, vaeac.sigma_mu = 10000, vaeac.sigma_sigma = 1e-04, vaeac.sample_random = TRUE, vaeac.save_data = FALSE, vaeac.log_exp_cont_feat = FALSE, vaeac.which_vaeac_model = "best", vaeac.save_model = TRUE )
vaeac.model_description |
String (default is |
vaeac.folder_to_save_model |
String (default is |
vaeac.pretrained_vaeac_model |
List or String (default is |
vaeac.cuda |
Logical (default is |
vaeac.epochs_initiation_phase |
Positive integer (default is |
vaeac.epochs_early_stopping |
Positive integer (default is |
vaeac.save_every_nth_epoch |
Positive integer (default is |
vaeac.val_ratio |
Numeric (default is |
vaeac.val_iwae_n_samples |
Positive integer (default is |
vaeac.batch_size |
Positive integer (default is |
vaeac.batch_size_sampling |
Positive integer (default is |
vaeac.running_avg_n_values |
Positive integer (default is |
vaeac.skip_conn_layer |
Logical (default is |
vaeac.skip_conn_masked_enc_dec |
Logical (default is |
vaeac.batch_normalization |
Logical (default is |
vaeac.paired_sampling |
Logical (default is |
vaeac.masking_ratio |
Numeric (default is |
vaeac.mask_gen_coalitions |
Matrix (default is |
vaeac.mask_gen_coalitions_prob |
Numeric array (default is |
vaeac.sigma_mu |
Numeric (default is |
vaeac.sigma_sigma |
Numeric (default is |
vaeac.sample_random |
Logical (default is |
vaeac.save_data |
Logical (default is |
vaeac.log_exp_cont_feat |
Logical (default is |
vaeac.which_vaeac_model |
String (default is |
vaeac.save_model |
Boolean. If |
The vaeac
model consists of three neural network (a full encoder, a masked encoder, and a decoder) based
on the provided vaeac.depth
and vaeac.width
. The encoders map the full and masked input
representations to latent representations, respectively, where the dimension is given by vaeac.latent_dim
.
The latent representations are sent to the decoder to go back to the real feature space and
provide a samplable probabilistic representation, from which the Monte Carlo samples are generated.
We use the vaeac
method at the epoch with the lowest validation error (IWAE) by default, but
other possibilities are available but setting the vaeac.which_vaeac_model
parameter. See
Olsen et al. (2022) for more details.
Named list of the default values vaeac
extra parameter arguments specified in this function call.
Note that both vaeac.model_description
and vaeac.folder_to_save_model
will change with time and R session.
Lars Henry Berge Olsen
Function that loads a previously trained vaeac model and continue the training, either on new data or on the same dataset as it was trained on before. If we are given a new dataset, then we assume that new dataset has the same distribution and one_hot_max_sizes as the original dataset.
vaeac_train_model_continue( explanation, epochs_new, lr_new = NULL, x_train = NULL, save_data = FALSE, verbose = NULL, seed = 1 )
vaeac_train_model_continue( explanation, epochs_new, lr_new = NULL, x_train = NULL, save_data = FALSE, verbose = NULL, seed = 1 )
explanation |
A |
epochs_new |
Positive integer. The number of extra epochs to conduct. |
lr_new |
Positive numeric. If we are to overwrite the old learning rate in the adam optimizer. |
x_train |
A data.table containing the training data. Categorical data must have class names |
save_data |
Logical (default is |
verbose |
String vector or NULL.
Specifies the verbosity (printout detail level) through one or more of strings
|
seed |
Positive integer (default is |
A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.
Lars Henry Berge Olsen