Multivariate Student-T Regression (Cholesky Decomposition)¶


In this example, we model and predict all parameters of a trivariate ($Y_{D}=3$) Student-T distribution. As a generalization of the multivariate Normal, the multivariate Student-T is suitable for modelling heavy-tailed data, i.e., data with more mass in the tails of the distribution. The multivariate Student-T has an additional degrees-of-freedom parameter $\mathbf{\nu}_{\mathbf{x}} > 2$ that governs the tail behaviour, where for $\mathbf{\nu}_{\mathbf{x}} \rightarrow \infty$ the Student-T converges in distribution to the multivariate Normal. The conditional mean vector $\mathbf{\mu}(x) \in \mathbb{R}^{D}$ and the conditional covariance matrix $\mathbf{\nu}(x)\big(\mathbf{\nu}(x)-2\big)^{-1}\mathbf{\Sigma}(x) \in \mathbb{R}^{D \times D}$ are given as follows

$$ \mathbf{\mu}(x)=\begin{pmatrix}\mu_{1}(x) \\ \mu_{2}(x) \\ \mu_{3}(x)\end{pmatrix}, \qquad \qquad \mathbf{\Sigma}(x)= \begin{pmatrix} \sigma^{2}_{11}(x) & \rho_{1,2}(x)\sigma_{1}(x)\sigma_{2}(x) & \rho_{1,3}(x)\sigma_{1}(x)\sigma_{3}(x) \\ \rho_{2,1}(x)\sigma_{2}(x)\sigma_{1}(x) & \sigma^{2}_{22}(x) & \rho_{2,3}(x)\sigma_{2}(x)\sigma_{3}(x) \\ \rho_{3,1}(x)\sigma_{3}(x)\sigma_{1}(x) & \rho_{3,2}(x)\sigma_{3}(x)\sigma_{2}(x) & \sigma^{2}_{33}(x) \end{pmatrix} $$

where $\mathbf{\mu}(x)$ and $\mathbf{\Sigma}(x)$ are defined as for the multivariate Normal. To ensure positive definiteness of $\mathbf{\Sigma}(\cdot)$, the $D(D+1)/2$ entries of the covariance matrix must satisfy specific conditions. In the bivariate case, this can be achieved by applying exponential functions to the variances and a suitable transformation that restricts the correlation coefficient to $\rho \in [-1,1]$. In higher-dimensional settings, however, where all moments are modelled as functions of covariates, ensuring positive definiteness of the covariance matrix becomes challenging, since joint restrictions on its elements are required. A computationally more tractable approach is based on the Cholesky decomposition, which uniquely decomposes the covariance matrix as follows

$$ \mathbf{\Sigma}(x) = \mathbf{L}(x) \mathbf{L}^{\prime}(x) $$

where $\mathbf{L}(\cdot) \in \mathbb{R}^{D \times D}$ is a lower triangular matrix. For $\mathbf{\Sigma}(\cdot)$ to be positive definite, the $D$ diagonal elements $\ell_{ii}$ of $\mathbf{L}(\cdot)$ need to be strictly positive, whereas all $D(D-1)/2$ off-diagonal elements $\ell_{ij}$ can take on any value. For the trivariate case, the Cholesky factor $\mathbf{L}(\cdot)$ is given as follows

$$ \mathbf{L}(x)= \begin{pmatrix} \exp\big(\ell_{11}(x)\big) & 0 & 0 \\ \ell_{21}(x) & \exp\big(\ell_{22}(x)\big) & 0 \\ \ell_{31}(x) & \ell_{32}(x) & \exp\big(\ell_{33}(x)\big)\\ \end{pmatrix} $$

Given the usefulness of the Cholesky decomposition, instead of estimating the entries of $\mathbf{\Sigma}(\cdot)$ directly, XGBoostLSS estimates the Cholesky factors $\mathbf{L}(\cdot)$ and then uses them to construct $\mathbf{\Sigma}(\cdot)$. In contrast to the original formulation of $\mathbf{\Sigma}(\cdot)$, however, the elements of $\mathbf{L}(\cdot)$ have no direct interpretation. Since XGBoostLSS is based on a one-vs-all estimation strategy, where a separate tree is grown for each distributional parameter, estimating many parameters for a large dataset can become computationally expensive. For more details, we refer to our related paper: März, Alexander (2022), Multi-Target XGBoostLSS Regression.
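
To make the construction of $\mathbf{\Sigma}(\cdot)$ from the Cholesky factor concrete, the following minimal NumPy sketch builds $\mathbf{L}(x)$ for a single observation from three unconstrained diagonal outputs and three unconstrained off-diagonal outputs (the values are made up for illustration) and recovers a valid covariance matrix via $\mathbf{L}(x)\mathbf{L}^{\prime}(x)$. This mirrors the parameterisation above; it is not the library's internal code.

import numpy as np

# Illustrative raw outputs for one observation: three unconstrained diagonal
# entries (l_11, l_22, l_33 before exp) and three off-diagonal entries (l_21, l_31, l_32).
l_diag = np.array([0.1, -0.3, 0.2])
l_offdiag = np.array([0.5, -0.4, 0.8])

# Lower-triangular Cholesky factor: exp() keeps the diagonal strictly positive,
# the off-diagonal elements remain unrestricted.
L = np.zeros((3, 3))
L[np.diag_indices(3)] = np.exp(l_diag)
L[np.tril_indices(3, k=-1)] = l_offdiag

# Sigma = L L' is symmetric and positive definite by construction.
Sigma = L @ L.T
print(np.all(np.linalg.eigvalsh(Sigma) > 0))   # True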

Imports¶

In [17]:
from xgboostlss.model import *
from xgboostlss.distributions.MVT import *
from xgboostlss.datasets.data_loader import load_simulated_multivariate_studentT_data

from sklearn.model_selection import train_test_split
import pandas as pd
import multiprocessing
import plotnine
from plotnine import *
plotnine.options.figure_size = (18, 9)
n_cpu = multiprocessing.cpu_count()

Data¶

In [2]:
data_sim = load_simulated_multivariate_studentT_data()

# Create 60%, 20%, 20% split for train, validation and test 
train, validate, test = np.split(data_sim.sample(frac=1,random_state=123), [int(0.6*len(data_sim)), int(0.8*len(data_sim))])

# Train
x_train = train.filter(regex="x")
y_train = train.filter(regex="y").values
n_targets = y_train.shape[1]
dtrain = xgb.DMatrix(x_train, label=y_train, nthread=n_cpu)

# Validation
x_eval = validate.filter(regex="x")
y_eval = validate.filter(regex="y").values
deval = xgb.DMatrix(x_eval, label=y_eval, nthread=n_cpu)

# Test
x_test = test.filter(regex="x")
y_test = test.filter(regex="y").values
dtest = xgb.DMatrix(x_test, nthread=n_cpu)

Distribution Selection¶

In [3]:
# Specifies a multivariate Student-T distribution, using the Cholesky decomposition. See ?MVT for details.
xgblss = XGBoostLSS(
    MVT(D=n_targets,                      # Specifies the number of targets
        stabilization="None",             # Options are "None", "MAD", "L2".
        response_fn="exp",                # Function to transform the lower-triangular factor of the covariance, e.g., "exp" or "softplus".
        loss_fn="nll"                     # Loss function, i.e., nll.
       )               
)

Hyper-Parameter Optimization¶

Any XGBoost hyperparameter can be tuned, where the structure of the parameter dictionary needs to be as follows:

- Float/Int sample_type
    - {"param_name": ["sample_type", low, high, log]}
        - sample_type: str, Type of sampling, e.g., "float" or "int"
        - low: int, Lower endpoint of the range of suggested values
        - high: int, Upper endpoint of the range of suggested values
        - log: bool, Flag to sample the value from the log domain or not
    - Example: {"eta": "float", low=1e-5, high=1, log=True]}

- Categorical sample_type
    - {"param_name": ["sample_type", ["choice1", "choice2", "choice3", "..."]]}
        - sample_type: str, Type of sampling, here "categorical"
        - choice1, choice2, choice3, ...: str, Possible choices for the parameter
    - Example: {"booster": ["categorical", ["gbtree", "dart"]]}

- For parameters without tunable choice (this is needed if tree_method = "gpu_hist" and gpu_id needs to be specified)
    - {"param_name": ["none", [value]]},
        - param_name: str, Name of the parameter
        - value: int, Value of the parameter
    - Example: {"gpu_id": ["none", [0]]}

Depending on which parameters are optimized, some of them may end up unused: e.g., when {"booster": ["categorical", ["gbtree", "gblinear"]]} and {"max_depth": ["int", {"low": 1, "high": 10, "log": False}]} are specified, max_depth is ignored whenever gblinear is sampled, since gblinear has no such argument. A combined dictionary illustrating the three formats is shown below.
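
As a purely illustrative example (the values are taken from the formats above, not recommendations), a dictionary combining all three sample types could look as follows:

param_dict_example = {
    "eta":       ["float", {"low": 1e-5, "high": 1, "log": True}],    # float sample type
    "max_depth": ["int",   {"low": 1, "high": 10, "log": False}],     # int sample type
    "booster":   ["categorical", ["gbtree", "dart"]],                 # categorical sample type
    "gpu_id":    ["none", [0]],                                       # fixed, non-tunable value
}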

In [4]:
param_dict = {
    "eta":              ["float", {"low": 1e-5,   "high": 1,     "log": True}],
    "max_depth":        ["int",   {"low": 1,      "high": 10,    "log": False}],
    "gamma":            ["float", {"low": 1e-8,   "high": 40,    "log": True}],
    "subsample":        ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "colsample_bytree": ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "min_child_weight": ["float", {"low": 1e-8,   "high": 500,   "log": True}],
}

np.random.seed(123)
opt_param = xgblss.hyper_opt(param_dict,
                             dtrain,
                             num_boost_round=100,        # Number of boosting iterations.
                             nfold=5,                    # Number of cv-folds.
                             early_stopping_rounds=20,   # Number of early-stopping rounds
                             max_minutes=120,            # Time budget in minutes, i.e., stop study after the given number of minutes.
                             n_trials=20,                # The number of trials. If this argument is set to None, there is no limitation on the number of trials.
                             silence=False,              # Controls the verbosity of the trial, i.e., user can silence the outputs of the trial.
                             seed=123,                   # Seed used to generate cv-folds.
                             hp_seed=None                # Seed for random number generator used in the Bayesian hyperparameter search.
                            )
param_dict = { "eta": ["float", {"low": 1e-5, "high": 1, "log": True}], "max_depth": ["int", {"low": 1, "high": 10, "log": False}], "gamma": ["float", {"low": 1e-8, "high": 40, "log": True}], "subsample": ["float", {"low": 0.2, "high": 1.0, "log": False}], "colsample_bytree": ["float", {"low": 0.2, "high": 1.0, "log": False}], "min_child_weight": ["float", {"low": 1e-8, "high": 500, "log": True}], } np.random.seed(123) opt_param = xgblss.hyper_opt(param_dict, dtrain, num_boost_round=100, # Number of boosting iterations. nfold=5, # Number of cv-folds. early_stopping_rounds=20, # Number of early-stopping rounds max_minutes=120, # Time budget in minutes, i.e., stop study after the given number of minutes. n_trials=20, # The number of trials. If this argument is set to None, there is no limitation on the number of trials. silence=False, # Controls the verbosity of the trail, i.e., user can silence the outputs of the trail. seed=123, # Seed used to generate cv-folds. hp_seed=None # Seed for random number generator used in the Bayesian hyperparameter search. )
[I 2023-06-22 17:15:25,217] A new study created in memory with name: XGBoostLSS Hyper-Parameter Optimization
C:\Users\maerzale\.virtualenvs\XGBoostLSS-vIPRRz-M\lib\site-packages\optuna\progress_bar.py:56: ExperimentalWarning: Progress bar is experimental (supported from v1.2.0). The interface can change in the future.
  0%|          | 0/20 [00:00<?, ?it/s]
[I 2023-06-22 17:15:35,816] Trial 0 finished with value: 7603.6756836 and parameters: {'eta': 0.7392838621851777, 'max_depth': 9, 'gamma': 4.70083764792819e-05, 'subsample': 0.2871240541569393, 'colsample_bytree': 0.6493047413626221, 'min_child_weight': 3.419322554897627e-07, 'booster': 'gbtree'}. Best is trial 0 with value: 7603.6756836.
[I 2023-06-22 17:16:07,030] Trial 1 finished with value: 5254.1501954 and parameters: {'eta': 0.3966502965778568, 'max_depth': 1, 'gamma': 1.6600425967824835e-08, 'subsample': 0.9171253889781197, 'colsample_bytree': 0.41440211790986414, 'min_child_weight': 1.0532289550075074e-05, 'booster': 'gbtree'}. Best is trial 1 with value: 5254.1501954.
[I 2023-06-22 17:17:00,070] Trial 2 finished with value: 5206.443456999999 and parameters: {'eta': 0.038785583444687854, 'max_depth': 4, 'gamma': 5.8722696245215776e-05, 'subsample': 0.4248055639993571, 'colsample_bytree': 0.8294165742341124, 'min_child_weight': 36.56175225862444, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:17:12,795] Trial 3 finished with value: 5553.1722656 and parameters: {'eta': 0.4414186988825811, 'max_depth': 5, 'gamma': 0.00011791508204121354, 'subsample': 0.33427841081736337, 'colsample_bytree': 0.8837740608717048, 'min_child_weight': 0.00013267683424125326, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:18:10,361] Trial 4 finished with value: 6429.72959 and parameters: {'eta': 8.927435826998966e-05, 'max_depth': 8, 'gamma': 9.605643604067764, 'subsample': 0.5103538865347832, 'colsample_bytree': 0.20114124271777928, 'min_child_weight': 0.0827673114020242, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:19:13,768] Trial 5 finished with value: 6370.665918 and parameters: {'eta': 0.00033172698077006983, 'max_depth': 9, 'gamma': 1.7574999204441155e-07, 'subsample': 0.8132726056620359, 'colsample_bytree': 0.28427741621986674, 'min_child_weight': 0.0004399160254325443, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:20:07,778] Trial 6 finished with value: 6177.9252928000005 and parameters: {'eta': 0.0013107972089424669, 'max_depth': 6, 'gamma': 0.0003947208057611046, 'subsample': 0.3331189089834271, 'colsample_bytree': 0.3216548967841096, 'min_child_weight': 5.059549450135537, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:20:58,275] Trial 7 finished with value: 5396.906738600001 and parameters: {'eta': 0.011557773869365524, 'max_depth': 7, 'gamma': 0.013201194426625354, 'subsample': 0.26151843109371326, 'colsample_bytree': 0.7391295414427628, 'min_child_weight': 65.18099581083081, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:21:13,314] Trial 8 finished with value: 5320.568163999999 and parameters: {'eta': 0.2523431062415394, 'max_depth': 6, 'gamma': 0.001509400459862839, 'subsample': 0.48011026573662763, 'colsample_bytree': 0.4961615347535019, 'min_child_weight': 16.626213089564008, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:22:00,388] Trial 9 finished with value: 5962.424120999999 and parameters: {'eta': 0.01687563867224601, 'max_depth': 1, 'gamma': 0.023216064605275356, 'subsample': 0.4799456519972923, 'colsample_bytree': 0.2626207990483921, 'min_child_weight': 78.29344531698013, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:22:10,639] Trial 10 pruned. Trial was pruned at iteration 20.
[I 2023-06-22 17:22:56,873] Trial 11 finished with value: 5508.1416018 and parameters: {'eta': 0.0641378538262785, 'max_depth': 1, 'gamma': 2.8982557587207602e-08, 'subsample': 0.9735721168706151, 'colsample_bytree': 0.5175644975545589, 'min_child_weight': 4.356420393421785e-06, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:23:29,445] Trial 12 finished with value: 5210.7024412 and parameters: {'eta': 0.07087312539925689, 'max_depth': 3, 'gamma': 1.0489385039217787e-08, 'subsample': 0.6851449972551221, 'colsample_bytree': 0.7725247787181241, 'min_child_weight': 1.0029863609677106e-07, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:24:10,467] Trial 13 finished with value: 5258.8476562 and parameters: {'eta': 0.04661822650438787, 'max_depth': 4, 'gamma': 1.1310639679505943e-06, 'subsample': 0.6779242255755213, 'colsample_bytree': 0.7889636971447412, 'min_child_weight': 2.483426962484972e-08, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:24:21,239] Trial 14 pruned. Trial was pruned at iteration 20.
[I 2023-06-22 17:24:53,849] Trial 15 finished with value: 5212.3180661999995 and parameters: {'eta': 0.07216085294730987, 'max_depth': 3, 'gamma': 2.1559971618413647e-07, 'subsample': 0.7707515948980044, 'colsample_bytree': 0.8276465651970226, 'min_child_weight': 0.6340046259700444, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:25:04,640] Trial 16 pruned. Trial was pruned at iteration 20.
[I 2023-06-22 17:25:32,397] Trial 17 finished with value: 5216.8640626 and parameters: {'eta': 0.12160352651992629, 'max_depth': 2, 'gamma': 1.4345659547891858e-05, 'subsample': 0.4100398046376301, 'colsample_bytree': 0.6005296616849958, 'min_child_weight': 0.002798417113404921, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.
[I 2023-06-22 17:25:44,204] Trial 18 pruned. Trial was pruned at iteration 20.
[I 2023-06-22 17:26:03,688] Trial 19 finished with value: 5321.579004 and parameters: {'eta': 0.18426312552314775, 'max_depth': 4, 'gamma': 1.3047634560131253e-05, 'subsample': 0.22375332133800668, 'colsample_bytree': 0.7527841834020992, 'min_child_weight': 1.9404945782758862, 'booster': 'gbtree'}. Best is trial 2 with value: 5206.443456999999.

Hyper-Parameter Optimization successfully finished.
  Number of finished trials:  20
  Best trial:
    Value: 5206.443456999999
    Params: 
    eta: 0.038785583444687854
    max_depth: 4
    gamma: 5.8722696245215776e-05
    subsample: 0.4248055639993571
    colsample_bytree: 0.8294165742341124
    min_child_weight: 36.56175225862444
    booster: gbtree
    opt_rounds: 89

Model Training¶

In [5]:
np.random.seed(123)

opt_params = opt_param.copy()
n_rounds = opt_params["opt_rounds"]
del opt_params["opt_rounds"]

# Add evaluation set
eval_set = [(dtrain,"train"), (deval,"evaluation")]
eval_result = {}
            
# Train Model with optimized hyperparameters
xgblss.train(
    opt_params,
    dtrain,
    num_boost_round=n_rounds,
    evals=eval_set, 
    evals_result=eval_result,
    verbose_eval=False
)
In [6]:
# Note that XGBoostLSS uses NLL.sum() instead of NLL.mean() for training, so that train-nll and evaluation-nll are not comparable. Hence we manually adjust them.
n_train = y_train.shape[0]
n_eval =  y_eval.shape[0]

eval_df = pd.DataFrame.from_dict({"train": np.array(eval_result["train"]["nll"]) / n_train,
                                  "evaluation": np.array(eval_result["evaluation"]["nll"]) / n_eval
                                 })
eval_df["iter"] = eval_df.index + 1
eval_df = eval_df.melt(id_vars="iter")

(
    ggplot(eval_df,
           aes(x="iter",
               y="value",
               color="variable")) + 
    geom_line() + 
    labs(title="XGBoostLSS Train and Eval Loss",
         x="Iteration",
         y="NLL") + 
    theme_bw(base_size=15) + 
    theme(legend_position="bottom",
          legend_title = element_blank()
         )
)
[Figure: train and evaluation NLL per boosting iteration]
Out[6]:
<Figure Size: (1800 x 900)>

Prediction¶

In [7]:
# Set seed for reproducibility
torch.manual_seed(123)

# Number of samples to draw from predicted distribution
n_samples = 1000
quant_sel = [0.05, 0.95] # Quantiles to calculate from predicted distribution

# Sample from predicted distribution
pred_samples = xgblss.predict(dtest,
                              pred_type="samples",
                              n_samples=n_samples,
                              seed=123)

# Calculate quantiles from predicted distribution
pred_quantiles = xgblss.predict(dtest,
                                pred_type="quantiles",
                                n_samples=n_samples,
                                quantiles=quant_sel)

# Returns predicted distributional parameters
pred_params = xgblss.predict(dtest,
                             pred_type="parameters")
In [8]:
pred_samples.head()
Out[8]:
target y_sample0 y_sample1 y_sample2 y_sample3 y_sample4 y_sample5 y_sample6 y_sample7 y_sample8 ... y_sample990 y_sample991 y_sample992 y_sample993 y_sample994 y_sample995 y_sample996 y_sample997 y_sample998 y_sample999
0 y1 1.826371 1.655583 -6.198446 2.377529 2.347612 -0.469314 2.276113 1.981604 1.655352 ... 1.241540 0.446188 2.881686 0.456408 4.557791 0.792019 0.273735 3.003406 2.027522 3.462481
1 y1 -0.691288 0.220897 1.418300 -1.052063 -1.664923 0.839701 -0.150900 -0.240443 -4.612601 ... 2.699394 -0.257855 -0.209613 0.428605 -2.403315 -0.178805 -1.637506 0.494041 -0.516508 1.088700
2 y1 0.556013 -3.001471 0.166163 1.591058 -0.810171 2.257192 -0.412995 -0.064019 2.921766 ... 2.103228 1.577710 0.250197 1.165794 4.092175 0.162337 2.432878 0.340290 2.414249 2.542186
3 y1 -0.596619 1.470465 3.600923 -0.979395 -0.944986 0.168351 -0.866526 0.585454 0.764140 ... -0.126247 -1.365120 -0.538163 1.906027 3.879993 -0.767500 0.677585 0.178133 0.747665 -1.436007
4 y1 1.163670 -0.438058 5.018644 1.299736 1.383374 2.107938 2.264123 -2.150131 0.163765 ... 0.113781 5.510688 -3.278695 2.609784 1.670109 -0.972046 2.713919 0.600095 0.460487 1.574848

5 rows × 1001 columns

In [9]:
pred_samples.tail()
Out[9]:
target y_sample0 y_sample1 y_sample2 y_sample3 y_sample4 y_sample5 y_sample6 y_sample7 y_sample8 ... y_sample990 y_sample991 y_sample992 y_sample993 y_sample994 y_sample995 y_sample996 y_sample997 y_sample998 y_sample999
5995 y3 2.210692 1.900934 1.085012 2.282264 1.407957 1.925822 3.017605 2.116382 1.712142 ... 1.233792 1.270940 0.914976 3.310941 0.968399 2.943983 3.109435 1.458213 1.449159 -0.466676
5996 y3 2.939518 2.023760 4.522609 0.983502 0.713331 1.238452 4.170801 1.399968 2.816488 ... -0.049814 2.464761 -3.251768 4.062187 2.940999 -0.621344 4.111061 1.860313 2.257267 2.028987
5997 y3 -4.392200 2.491214 0.922925 0.271518 5.775393 -1.073733 -2.200954 2.369352 2.245128 ... 3.623213 0.552548 -2.560454 2.538609 1.267258 1.283719 -0.459677 -5.194057 0.901482 3.651929
5998 y3 3.400825 1.560178 1.084008 0.235842 4.448450 -1.072958 1.268006 2.324106 1.993286 ... 1.376254 1.916807 1.986617 -0.764349 0.537929 0.522068 -0.829565 1.133323 -1.039197 1.927012
5999 y3 2.940973 2.364419 1.320267 3.600506 3.648721 1.814938 1.846601 2.313592 2.298523 ... 1.852614 3.375288 1.422922 2.623575 0.940880 2.236249 1.944052 3.142564 4.212935 3.002483

5 rows × 1001 columns

In [10]:
pred_quantiles.head()
Out[10]:
target quant_0.05 quant_0.95
0 y1 -1.219715 3.868802
1 y1 -2.528142 3.684324
2 y1 -1.874202 3.649772
3 y1 -2.342928 3.078873
4 y1 -2.203250 4.122032
In [11]:
pred_params.head()
Out[11]:
df location_1 location_2 location_3 scale_1 scale_2 scale_3 rho_12 rho_13 rho_23
0 10.860957 1.175390 0.252095 1.018696 1.530034 0.720119 1.975240 0.745432 0.503505 0.386464
1 8.676908 0.467394 -0.177366 2.009582 2.037138 0.859228 2.141575 0.513212 0.614946 0.649461
2 2.703903 0.807936 0.540155 1.177945 2.183053 2.246431 3.664789 0.382675 0.936917 0.628111
3 2.703903 0.463964 0.214608 2.656021 2.255344 2.420194 3.348612 0.473255 0.930121 0.666063
4 3.757413 0.885413 0.148868 1.099954 2.080441 1.377735 2.460320 0.332695 0.834257 0.576086
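
Assuming that the scale_* columns are standard deviations and the rho_* columns are correlations, as the column names suggest, the following sketch rebuilds $\mathbf{\Sigma}(x)$ and the conditional covariance $\mathbf{\nu}(x)\big(\mathbf{\nu}(x)-2\big)^{-1}\mathbf{\Sigma}(x)$ for the first test observation. This is illustrative post-processing only, not part of the XGBoostLSS API.

# Rebuild Sigma(x) for the first test observation from the predicted
# standard deviations and correlations (illustrative post-processing).
row = pred_params.iloc[0]
sigma = row[["scale_1", "scale_2", "scale_3"]].values.astype(float)
rho = np.array([
    [1.0,           row["rho_12"], row["rho_13"]],
    [row["rho_12"], 1.0,           row["rho_23"]],
    [row["rho_13"], row["rho_23"], 1.0],
])
Sigma = rho * np.outer(sigma, sigma)              # Sigma_ij = rho_ij * sigma_i * sigma_j
cond_cov = row["df"] / (row["df"] - 2) * Sigma    # conditional covariance of the Student-T
print(cond_cov.round(3))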

SHAP Interpretability¶

Since XGBoostLSS estimates the covariance matrix $\mathbf{\Sigma}(x) = \mathbf{L}(x) \mathbf{L}^{\prime}(x)$ via the Cholesky factors, whose elements have no direct interpretation, SHAP-based interpretability is only sensible for the location parameters. The following parameters have been estimated:

In [12]:
list(xgblss.dist.param_dict.keys())
Out[12]:
['df',
 'location_1',
 'location_2',
 'location_3',
 'scale_tril_diag_1',
 'scale_tril_offdiag_12',
 'scale_tril_diag_2',
 'scale_tril_offdiag_13',
 'scale_tril_offdiag_23',
 'scale_tril_diag_3']
In [13]:
# Partial Dependence Plot of how x acts on location_2
xgblss.plot(x_test,
            parameter="location_2",
            feature="x",
            plot_type="Partial_Dependence")
[Figure: partial dependence plot of feature x on location_2]
In [14]:
# Feature Importance of location_2 parameter
xgblss.plot(x_test,
            parameter="location_2",
            plot_type="Feature_Importance")
[Figure: feature importance plot for the location_2 parameter]

True vs. Predicted Distributional Parameters¶

In the following figure, we compare the true parameters of the trivariate Student-T with those predicted by XGBoostLSS. The figure shows that the estimated parameters closely match the true ones.

In [15]:
pred_params["x"] = x_test["x"].values
pred_params["x2"] = x_test["x2"].values

# Data with actual values
plot_df_actual1 = pd.melt(test.drop(columns=["y1", "y2", "y3", "x2", "df", "location_3", "rho_13"], axis=0),
                          id_vars=["x"],
                          value_vars=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3"])

plot_df_actual2 = pd.melt(test.drop(columns=["y1", "y2", "y3", "location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3", "x"], axis=0),
                          id_vars=["x2"],
                          value_vars=["df", "location_3", "rho_13"])

plot_df_actual2.columns = list(plot_df_actual1.columns)
plot_df_actual = pd.concat([plot_df_actual1, plot_df_actual2])
plot_df_actual["type"] = "ACTUAL"



# Data with predicted values
plot_df_pred1 = pd.melt(pred_params.drop(columns=["x2", "df", "location_3", "rho_13"], axis=0),
                        id_vars=["x"],
                        value_vars=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3"])

plot_df_pred2 = pd.melt(pred_params.drop(columns=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3", "x"], axis=0),
                        id_vars=["x2"],
                        value_vars=["df", "location_3", "rho_13"])

plot_df_pred2.columns = list(plot_df_pred1.columns)
plot_df_predt= pd.concat([plot_df_pred1, plot_df_pred2])
plot_df_predt["type"] = "PREDICT"    


# Concatenated DF
plot_df = pd.concat([plot_df_actual, plot_df_predt])   
plot_df["variable"] = plot_df.variable.str.upper()


plot_params = (ggplot(plot_df,
                      aes(x="x",
                          y="value",
                          color="type")) +
               geom_line(size=1.1) + 
               facet_wrap("variable",
                          scales="free",
                          ncol=4) + 
               labs(title="Parameters of Trivariate-StudentT estimated with XGBoostLSS using Cholesky-Decomposition of Covariance-Matrix\n",
                    x="",
                    y="") + 
               theme_bw(base_size=15) + 
               theme(legend_position="bottom",
                     legend_title = element_blank())
              )


print(plot_params)
pred_params["x"] = x_test["x"].values pred_params["x2"] = x_test["x2"].values # Data with actual values plot_df_actual1 = pd.melt(test.drop(columns=["y1", "y2", "y3", "x2", "df", "location_3", "rho_13"], axis=0), id_vars=["x"], value_vars=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3"]) plot_df_actual2 = pd.melt(test.drop(columns=["y1", "y2", "y3", "location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3", "x"], axis=0), id_vars=["x2"], value_vars=["df", "location_3", "rho_13"]) plot_df_actual2.columns = list(plot_df_actual1.columns) plot_df_actual = pd.concat([plot_df_actual1, plot_df_actual2]) plot_df_actual["type"] = "ACTUAL" # Data with predicted values plot_df_pred1 = pd.melt(pred_params.drop(columns=["x2", "df", "location_3", "rho_13"], axis=0), id_vars=["x"], value_vars=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3"]) plot_df_pred2 = pd.melt(pred_params.drop(columns=["location_1", "location_2", "scale_1", "rho_12", "scale_2", "rho_23", "scale_3", "x"], axis=0), id_vars=["x2"], value_vars=["df", "location_3", "rho_13"]) plot_df_pred2.columns = list(plot_df_pred1.columns) plot_df_predt= pd.concat([plot_df_pred1, plot_df_pred2]) plot_df_predt["type"] = "PREDICT" # Concatenated DF plot_df = pd.concat([plot_df_actual, plot_df_predt]) plot_df["variable"] = plot_df.variable.str.upper() plot_params = (ggplot(plot_df, aes(x="x", y="value", color="type")) + geom_line(size=1.1) + facet_wrap("variable", scales="free", ncol=4) + labs(title="Parameters of Trivariate-StudentT estimated with XGBoostLSS using Cholesky-Decomposition of Covariance-Matrix\n", x="", y="") + theme_bw(base_size=15) + theme(legend_position="bottom", legend_title = element_blank()) ) print(plot_params)
[Figure: actual vs. predicted parameters of the trivariate Student-T]

Actual vs. Predicted¶

Since we predict the entire conditional distribution, we can overlay the point predictions with predicted densities, from which we can also derive quantiles of interest. We use the first response $y_{1}$ as an example.

In [16]:
y_pred = []

n_examples = 16
q_sel = [0.05, 0.95]
y_sel=0
samples_arr = pred_samples.drop(columns="target").values.reshape(n_targets,-1,n_samples)

for i in range(n_examples):    
    y_samples = pd.DataFrame(samples_arr[y_sel,i,:].reshape(-1,1), columns=["PREDICT_DENS"])
    y_samples["PREDICT_POINT"] = y_samples["PREDICT_DENS"].mean()
    y_samples["PREDICT_Q05"] = y_samples["PREDICT_DENS"].quantile(q=q_sel[0])
    y_samples["PREDICT_Q95"] = y_samples["PREDICT_DENS"].quantile(q=q_sel[1])
    y_samples["ACTUAL"] = y_test[i,y_sel]
    y_samples["obs"]= f"Obervation {i+1}"
    y_pred.append(y_samples)
    
pred_df = pd.melt(pd.concat(y_pred, axis=0), id_vars="obs")
pred_df["obs"] = pd.Categorical(pred_df["obs"], categories=[f"Obervation {i+1}" for i in range(n_examples)])
df_actual, df_pred_dens, df_pred_point, df_q05, df_q95 = [x for _, x in pred_df.groupby("variable")]

plot_pred = (
    ggplot(pred_df,
           aes(color="variable")) + 
    stat_density(df_pred_dens,
                 aes(x="value"),
                 size=1.1) + 
    geom_point(df_pred_point,
               aes(x="value",
                   y=0),
               size=1.4) + 
    geom_point(df_actual,
               aes(x="value",
                   y=0),
               size=1.4) + 
    geom_vline(df_q05, 
               aes(xintercept="value",
                   fill="variable",
                   color="variable"),
               linetype="dashed",
               size=1.1) + 
    geom_vline(df_q95, 
               aes(xintercept="value",
                   fill="variable",
                   color="variable"),
               linetype="dashed",
               size=1.1) + 
    facet_wrap("obs",
               scales="free",
               ncol=4) + 
    labs(title="Predicted vs. Actual \n",
         x = "") + 
    theme_bw(base_size=15) +
    scale_fill_brewer(type="qual", palette="Dark2") + 
    theme(legend_position="bottom",
          legend_title = element_blank()
         )
)

print(plot_pred)
[Figure: predicted density, 5%/95% quantiles, and point prediction vs. actual value of y1 for 16 test observations]
