Gaussian-Mixture Regression (California Housing Data)¶
Mixture densities, or mixture distributions, extend traditional univariate distributions by allowing the observed data to be thought of as arising from multiple underlying processes. In essence, a mixture distribution is a weighted combination of several component distributions, where the weights indicate how much each component contributes to the overall mixture. For instance, if the observed data distribution has multiple modes, a mixture of Gaussians can capture each mode with a separate Gaussian distribution.
Each component of the mixture has its own set of parameters that depend on covariates, and the mixing coefficients are also modeled as functions of covariates. This is particularly useful when a single parametric distribution cannot adequately capture the underlying data-generating process. A mixture distribution can be represented as follows:
\begin{equation} f\bigl(y_{i} | \boldsymbol{\theta}_{i}(x_{i})\bigr) = \sum_{m=1}^{M} w_{i,m}(x_{i}) \cdot f_{m}\bigl(y_{i} | \boldsymbol{\theta}_{i,m}(x_{i})\bigr) \end{equation}
where $f(\cdot)$ represents the mixture density for the $i$-th observation, $f_{m}(\cdot)$ is the $m$-th component density, each with its own set of parameters $\boldsymbol{\theta}_{i,m}(\cdot)$, and $w_{i,m}(\cdot)$ represents the weight of the $m$-th component in the mixture, subject to $\sum_{m=1}^{M} w_{i,m} = 1$. The components can either be a combination of different parametric univariate distributions, such as a Normal and a StudentT, or, as in our implementation, a combination of the same distribution type with different parameterizations, e.g., Gaussian-Mixture or StudentT-Mixture. The choice of the component distributions depends on the characteristics of the data and the underlying assumptions. Due to their high flexibility, mixture densities can take on a diverse range of shapes, making them adaptable to a wide variety of datasets.
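For intuition, the formula can be evaluated directly once the weights and parameters are fixed (i.e., not depending on covariates). The following minimal sketch uses hand-picked values that are unrelated to the model fitted below, with NumPy and SciPy:
import numpy as np
from scipy.stats import norm
# Hand-picked example: two Gaussian components whose weights sum to one.
weights = np.array([0.3, 0.7])   # w_1, w_2
locs = np.array([0.0, 4.0])      # component means
scales = np.array([1.0, 0.5])    # component standard deviations
y = np.linspace(-3, 6, num=5)    # a few evaluation points
# Mixture density: f(y) = sum_m w_m * N(y | loc_m, scale_m)
mixture_pdf = sum(w * norm.pdf(y, loc=m, scale=s) for w, m, s in zip(weights, locs, scales))
print(mixture_pdf)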
Imports¶
from xgboostlss.model import *
from xgboostlss.distributions.Gaussian import *
from xgboostlss.distributions.Mixture import *
from xgboostlss.distributions.mixture_distribution_utils import MixtureDistributionClass
from sklearn import datasets
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt
import seaborn as sns
figure_size = (10,5)
import plotnine
from plotnine import *
plotnine.options.figure_size = figure_size
Data¶
n_cpu = multiprocessing.cpu_count()
housing_data = datasets.fetch_california_housing()
X, y = housing_data["data"], housing_data["target"]
feature_names = housing_data["feature_names"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
dtrain = xgb.DMatrix(X_train, label=y_train, nthread=n_cpu)
dtest = xgb.DMatrix(X_test, nthread=n_cpu)
Distribution Selection¶
In the following, we specify a list of candidate distributions. The function dist_select returns the negative log-likelihood of each candidate for the target variable and selects the distribution with the lowest negative log-likelihood. The function also plots the density of the target variable together with the fitted density of the most suitable candidate. Note, however, that choosing the best-performing mixture distribution based solely on the training data may lead to overfitting, since mixture densities can approximate any distribution arbitrarily well. It is therefore crucial to choose the specification carefully, striking a balance between model complexity and generalization ability (a brief held-out check is sketched after the selection results below).
mix_dist_class = MixtureDistributionClass()
candidate_distributions = [
Mixture(Gaussian(response_fn="softplus"), M = 2),
Mixture(Gaussian(response_fn="softplus"), M = 3),
Mixture(Gaussian(response_fn="softplus"), M = 4),
]
dist_nll = mix_dist_class.dist_select(target=y_train, candidate_distributions=candidate_distributions, max_iter=50, plot=True, figure_size=(8, 5))
dist_nll
Fitting of candidate distributions completed: 100%|██████████| 3/3 [00:06<00:00, 2.22s/it]
| rank | nll | distribution |
|---|---|---|
| 1 | 13576.063477 | Mixture(Gaussian, tau=1.0, M=4) |
| 2 | 23268.498047 | Mixture(Gaussian, tau=1.0, M=3) |
| 3 | 23737.701172 | Mixture(Gaussian, tau=1.0, M=2) |
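To illustrate the point about generalization, a simple (if crude) check is to compare held-out log-likelihoods of unconditional mixture fits with a varying number of components. The sketch below uses scikit-learn's GaussianMixture purely as a stand-in for such an unconditional fit; it is not part of the XGBoostLSS API and leaves the conditional model trained below unaffected.
from sklearn.mixture import GaussianMixture
# Unconditional mixture fits with different numbers of components,
# scored on training and held-out data (total negative log-likelihood).
for m in [2, 3, 4]:
    gmm = GaussianMixture(n_components=m, random_state=123).fit(y_train.reshape(-1, 1))
    train_nll = -gmm.score(y_train.reshape(-1, 1)) * len(y_train)
    test_nll = -gmm.score(y_test.reshape(-1, 1)) * len(y_test)
    print(f"M={m}: train NLL={train_nll:.1f}, test NLL={test_nll:.1f}")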
# Specifies a mixture of Gaussians. See ?Mixture for an overview.
xgblss = XGBoostLSS(
Mixture(
Gaussian(response_fn="softplus"),
M = 4,
tau=1.0,
hessian_mode="individual",
)
)
Hyper-Parameter Optimization¶
Any XGBoost hyperparameter can be tuned; the parameter dictionary needs to have the following structure:
- Float/Int sample_type
    - {"param_name": ["sample_type", {"low": low, "high": high, "log": log}]}
    - sample_type: str, Type of sampling, e.g., "float" or "int"
    - low: int or float, Lower endpoint of the range of suggested values
    - high: int or float, Upper endpoint of the range of suggested values
    - log: bool, Flag to sample the value from the log domain or not
    - Example: {"eta": ["float", {"low": 1e-5, "high": 1, "log": True}]}
- Categorical sample_type
    - {"param_name": ["sample_type", ["choice1", "choice2", "choice3", "..."]]}
    - sample_type: str, Type of sampling, here "categorical"
    - choice1, choice2, choice3, ...: str, Possible choices for the parameter
    - Example: {"booster": ["categorical", ["gbtree", "dart"]]}
- For parameters without a tunable choice (needed, e.g., if tree_method = "gpu_hist" and gpu_id needs to be specified)
    - {"param_name": ["none", [value]]}
    - param_name: str, Name of the parameter
    - value: int, Value of the parameter
    - Example: {"gpu_id": ["none", [0]]}
Depending on which parameters are optimized, some of them may not be used in a given trial. For example, when {"booster": ["categorical", ["gbtree", "gblinear"]]} and {"max_depth": ["int", {"low": 1, "high": 10, "log": False}]} are specified, max_depth is ignored whenever gblinear is sampled, since gblinear has no such argument.
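As a combined illustration of the three entry types described above (placeholder ranges only, not the search space tuned below), a parameter dictionary could look like this:
# Illustrative only: one entry per sample type described above.
illustrative_param_dict = {
    "eta":       ["float", {"low": 1e-5, "high": 1, "log": True}],  # float sample type
    "max_depth": ["int", {"low": 1, "high": 10, "log": False}],     # int sample type
    "booster":   ["categorical", ["gbtree", "dart"]],               # categorical sample type
    "gpu_id":    ["none", [0]],                                     # fixed, non-tunable value
}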
param_dict = {
"eta": ["float", {"low": 1e-5, "high": 1, "log": True}],
"max_depth": ["int", {"low": 1, "high": 10, "log": False}],
"gamma": ["float", {"low": 1e-8, "high": 40, "log": True}],
"subsample": ["float", {"low": 0.2, "high": 1.0, "log": False}],
"colsample_bytree": ["float", {"low": 0.2, "high": 1.0, "log": False}],
"min_child_weight": ["float", {"low": 1e-8, "high": 500, "log": True}],
"booster": ["categorical", ["gbtree"]],
}
np.random.seed(123)
opt_param = xgblss.hyper_opt(param_dict,
dtrain,
num_boost_round=100, # Number of boosting iterations.
nfold=5, # Number of cv-folds.
early_stopping_rounds=20, # Number of early-stopping rounds
max_minutes=60, # Time budget in minutes, i.e., stop study after the given number of minutes.
n_trials=30, # The number of trials. If this argument is set to None, there is no limitation on the number of trials.
                             silence=True,                # Controls the verbosity of the trial, i.e., user can silence the outputs of the trial.
seed=123, # Seed used to generate cv-folds.
hp_seed=None # Seed for random number generator used in the Bayesian hyperparameter search.
)
Hyper-Parameter Optimization successfully finished.
  Number of finished trials: 30
  Best trial:
    Value: -339.92735600000003
    Params:
      eta: 0.035446384620699434
      max_depth: 9
      gamma: 1.001074720860003e-08
      subsample: 0.5604463986824038
      colsample_bytree: 0.9141117593708521
      min_child_weight: 2.4884257425092997
      booster: gbtree
      opt_rounds: 63
Model Training¶
np.random.seed(123)
opt_params = opt_param.copy()
n_rounds = opt_params["opt_rounds"]
del opt_params["opt_rounds"]
# Train Model with optimized hyperparameters
xgblss.train(opt_params,
dtrain,
num_boost_round=n_rounds,
)
Prediction¶
# Set seed for reproducibility
torch.manual_seed(123)
# Number of samples to draw from predicted distribution
n_samples = y_test.shape[0]
quant_sel = [0.05, 0.95] # Quantiles to calculate from predicted distribution
# Sample from predicted distribution
pred_samples = xgblss.predict(dtest,
pred_type="samples",
n_samples=n_samples,
seed=123)
# Calculate quantiles from predicted distribution
pred_quantiles = xgblss.predict(dtest,
pred_type="quantiles",
n_samples=n_samples,
quantiles=quant_sel)
# Returns predicted distributional parameters
pred_params = xgblss.predict(dtest,
pred_type="parameters")
pred_samples.head()
|   | y_sample0 | y_sample1 | y_sample2 | y_sample3 | y_sample4 | y_sample5 | y_sample6 | y_sample7 | y_sample8 | y_sample9 | ... | y_sample4118 | y_sample4119 | y_sample4120 | y_sample4121 | y_sample4122 | y_sample4123 | y_sample4124 | y_sample4125 | y_sample4126 | y_sample4127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.994438 | 2.316962 | 2.040202 | 1.773273 | 1.057907 | 1.858040 | 1.531195 | 1.343295 | 2.564236 | 1.961088 | ... | 1.937575 | 1.881372 | 1.430810 | 2.256658 | 1.582319 | 2.319254 | 1.594143 | 1.991018 | 2.195542 | 1.123431 |
| 1 | 0.873218 | 1.127620 | 0.772036 | 1.059127 | 1.497398 | 0.915671 | 1.018594 | 0.998078 | 1.105870 | 3.134003 | ... | 0.272696 | 0.638975 | 0.861581 | 1.580221 | 1.155142 | 0.814255 | 0.663595 | 0.510949 | 1.029415 | 1.272711 |
| 2 | 1.659060 | 1.461652 | 1.537653 | 1.414761 | 1.590473 | 1.401253 | 1.486623 | 1.618370 | 1.575211 | 1.548486 | ... | 1.422084 | 1.428987 | 1.720092 | 1.517076 | 1.247786 | 1.750022 | 1.630960 | 1.306203 | 1.765775 | 1.461866 |
| 3 | 1.809175 | 1.551336 | 1.842254 | 1.650605 | 1.460142 | 1.731991 | 1.706405 | 2.048648 | 1.752854 | 1.740417 | ... | 1.935548 | 2.208457 | 1.614117 | 1.806199 | 1.811315 | 1.594439 | 1.700334 | 1.788948 | 2.027447 | 1.972643 |
| 4 | 3.377752 | 4.452237 | 3.987720 | 3.395683 | 5.000009 | 3.293220 | 3.047759 | 1.706203 | 5.000010 | 4.312225 | ... | 5.000010 | 5.000007 | 4.937522 | 4.134383 | 3.266089 | 4.090020 | 2.309252 | 4.634232 | 4.711234 | 5.020878 |

5 rows × 4128 columns
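Each row of pred_samples holds the draws for one test observation, so a point prediction can be obtained by averaging across the sample columns (a small convenience sketch; the plotting section further below computes the same quantity observation by observation):
# Point prediction per test observation: row-wise mean over the sampled values.
point_pred = pred_samples.mean(axis=1)
point_pred.head()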
pred_quantiles.head()
|   | quant_0.05 | quant_0.95 |
|---|---|---|
| 0 | 1.329030 | 2.899321 |
| 1 | 0.580137 | 1.822827 |
| 2 | 1.303417 | 1.738444 |
| 3 | 1.251841 | 2.305379 |
| 4 | 2.311502 | 5.000012 |
pred_params.head()
|   | loc_1 | loc_2 | loc_3 | loc_4 | scale_1 | scale_2 | scale_3 | scale_4 | mix_prob_1 | mix_prob_2 | mix_prob_3 | mix_prob_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.00001 | 1.078767 | 1.814293 | 2.410790 | 0.000001 | 0.185484 | 0.314730 | 0.455521 | 0.002581 | 0.005433 | 0.663988 | 0.327997 |
| 1 | 5.00001 | 0.993237 | 1.124137 | 2.248213 | 0.000001 | 0.093863 | 0.369168 | 1.197223 | 0.003027 | 0.292940 | 0.657855 | 0.046178 |
| 2 | 5.00001 | 1.052439 | 1.523651 | 1.866059 | 0.000001 | 0.149111 | 0.126420 | 0.666375 | 0.000275 | 0.007836 | 0.984688 | 0.007201 |
| 3 | 5.00001 | 1.011255 | 1.764503 | 2.136899 | 0.000001 | 0.112121 | 0.286569 | 0.689104 | 0.004213 | 0.007115 | 0.938328 | 0.050344 |
| 4 | 5.00001 | 1.206031 | 2.224633 | 3.801417 | 0.000001 | 0.165997 | 0.670525 | 0.751687 | 0.198264 | 0.018378 | 0.018591 | 0.764767 |
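As a quick sanity check (a short sketch, not part of the original workflow), the predicted mixing weights of each observation should sum to one, in line with the constraint $\sum_{m=1}^{M} w_{i,m} = 1$ stated at the beginning:
# The predicted mixing probabilities should sum to (approximately) one per observation.
mix_prob_cols = [col for col in pred_params.columns if col.startswith("mix_prob")]
pred_params[mix_prob_cols].sum(axis=1).head()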
SHAP Interpretability¶
# Partial Dependence Plot
shap_df = pd.DataFrame(X_train, columns=feature_names)
xgblss.plot(shap_df,
parameter="mix_prob_1",
feature=feature_names[0],
plot_type="Partial_Dependence")
# Feature Importance
xgblss.plot(shap_df,
parameter="mix_prob_1",
plot_type="Feature_Importance")
Density Plots¶
In the following, we plot the actual density and a subsample of the predicted densities.
n_subset = 10
pred_df = pred_samples.iloc[:,0:n_subset]
pred_df.columns=[f"Sample {i+1}" for i in range(n_subset)]
actual_df = pd.DataFrame(y_test.reshape(-1,), columns = ["Actual"])
plot_df = pd.concat([pred_df, actual_df], axis=1)
linestyles = ["--" for _ in range(n_subset)] + ["solid"]
linewidths = [1 for _ in range(n_subset)] + [2.5]
plt.figure(figsize=figure_size)
for idx, col in enumerate(plot_df.columns):
sns.kdeplot(plot_df[col], linestyle=linestyles[idx], lw=linewidths[idx], label=col)
plt.legend()
plt.title("Actual vs. Predicted Densities", fontsize=20)
plt.gca().set_xlabel("y")
plt.gca().set_ylabel("f(y)")
plt.show()
Component Distributions¶
We can also plot each Gaussian component distribution.
# Extract predicted parameters
mix_params = torch.split(torch.tensor(pred_params.values[0,:]).reshape(1,-1), xgblss.dist.M, dim=1)
mix_params[1][0][0] = mix_params[1][0][0] + torch.tensor(0.1) # increase the std of the first density for plotting reasons
# Create Mixture-Distribution
torch.manual_seed(123)
mix_dist = xgblss.dist.create_mixture_distribution(mix_params)
gaus_dist = mix_dist._component_distribution
gaus_samples = pd.DataFrame(
gaus_dist.sample((y_test.shape[0],)).reshape(-1, xgblss.dist.M).numpy(),
columns = [f"Density {i+1}" for i in range(xgblss.dist.M)]
)
# Plot
plt.figure(figsize=figure_size)
sns.kdeplot(gaus_samples, lw=2.5)
plt.title("Gaussian-Mixture Component Distributions", fontsize=20)
plt.show()
Actual vs. Predicted¶
Since we predict the entire conditional distribution, we can overlay the point predictions with predicted densities, from which we can also derive quantiles of interest.
y_pred = []
n_examples = 8
q_sel = [0.05, 0.95]
y_sel=0
samples_arr = pred_samples.values.reshape(-1,n_samples)
for i in range(n_examples):
y_samples = pd.DataFrame(samples_arr[i,:].reshape(-1,1), columns=["PREDICT_DENSITY"])
y_samples["PREDICT_POINT"] = y_samples["PREDICT_DENSITY"].mean()
y_samples["PREDICT_Q05"] = y_samples["PREDICT_DENSITY"].quantile(q=q_sel[0])
y_samples["PREDICT_Q95"] = y_samples["PREDICT_DENSITY"].quantile(q=q_sel[1])
y_samples["ACTUAL"] = y_test[i]
y_samples["obs"]= f"Obervation {i+1}"
y_pred.append(y_samples)
pred_df = pd.melt(pd.concat(y_pred, axis=0), id_vars="obs")
pred_df["obs"] = pd.Categorical(pred_df["obs"], categories=[f"Obervation {i+1}" for i in range(n_examples)])
df_actual, df_pred_dens, df_pred_point, df_q05, df_q95 = [x for _, x in pred_df.groupby("variable")]
plot_pred = (
ggplot(pred_df,
aes(color="variable")) +
stat_density(df_pred_dens,
aes(x="value"),
size=1.1) +
geom_point(df_pred_point,
aes(x="value",
y=0),
size=1.4) +
geom_point(df_actual,
aes(x="value",
y=0),
size=1.4) +
geom_vline(df_q05,
aes(xintercept="value",
fill="variable",
color="variable"),
linetype="dashed",
size=1.1) +
geom_vline(df_q95,
aes(xintercept="value",
fill="variable",
color="variable"),
linetype="dashed",
size=1.1) +
facet_wrap("obs",
scales="free",
ncol=4) +
labs(title="Predicted vs. Actual \n",
x = "") +
theme_bw(base_size=15) +
theme(plot_title = element_text(hjust = 0.5)) +
scale_fill_brewer(type="qual", palette="Dark2") +
theme(legend_position="bottom",
legend_title = element_blank()
)
)
print(plot_pred)