Selection

Model selection and combination functions.

This module provides functions for stepwise regression, model combination, and forecasting accuracy measures.

greybox.selection.CALM(data: dict | DataFrame, ic: Literal['AICc', 'AIC', 'BIC', 'BICc'] = 'AICc', bruteforce: bool = False, silent: bool = True, distribution: str = 'dnorm', **kwargs) → LmCombineResult

Combine ALM models based on information criteria.

The function combines the parameters of linear regressions of the first variable on all the other variables in the provided data. It fits the candidate models with ALM and then combines them based on the selected information criterion (IC).

Parameters not present in some models are assumed to be zero, creating a shrinkage effect in the combination.
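The shrinkage effect can be illustrated with a small sketch. It assumes the conventional IC weighting scheme, where each weight is proportional to exp(-0.5 * (IC - min IC)); the source does not state the exact formula CALM uses. The helper names and the two "fitted" models are hypothetical, and the numbers are made up for illustration.

```python
import math


def ic_weights(ics):
    """Turn information-criterion values into weights that sum to one."""
    best = min(ics)
    # Subtracting the best IC keeps the exponentials numerically stable.
    raw = [math.exp(-0.5 * (ic - best)) for ic in ics]
    total = sum(raw)
    return [r / total for r in raw]


def combine_coefficients(models, names):
    """Weighted average of coefficients; absent parameters count as zero."""
    weights = ic_weights([m["ic"] for m in models])
    combined = {
        name: sum(w * m["coef"].get(name, 0.0) for w, m in zip(weights, models))
        for name in names
    }
    return combined, weights


# Hypothetical candidate models (coefficients and IC values are invented).
models = [
    {"coef": {"(Intercept)": 1.0, "x1": 2.0}, "ic": 100.0},
    {"coef": {"(Intercept)": 0.8, "x1": 1.8, "x2": 0.5}, "ic": 102.0},
]
combined, weights = combine_coefficients(models, ["(Intercept)", "x1", "x2"])
print(weights)
print(combined)
```

Because x2 is missing from the first model, its combined coefficient is pulled from 0.5 towards zero in proportion to the weight of that model.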

Parameters:

data (dict or DataFrame) – Data frame containing the dependent variable in the first column and the explanatory variables in the remaining columns.
ic ({"AICc", "AIC", "BIC", "BICc"}, default="AICc") – Information criterion to use.
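bruteforce (bool, default=False) – If True, all possible models are generated and combined. If False, the best model is first found via stepwise selection. If that model has fewer than 14 parameters, the function then runs the brute-force combination on the selected variables only; otherwise, it does stress-testing, combining the best model with variants that add or remove a single variable.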
silent (bool, default=True) – If False, then progress is printed.
distribution (str, default="dnorm") – Distribution to use for the ALM model.
**kwargs – Additional arguments passed to ALM().
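The two ways of building the pool of candidate models described for bruteforce can be sketched as follows. This is only an illustration of the idea, not the library's code: the function names are hypothetical, and the sketch works on variable names only, without fitting anything.

```python
from itertools import combinations


def all_subsets(variables):
    """Every subset of the candidate regressors, including the empty one."""
    subsets = []
    for size in range(len(variables) + 1):
        subsets.extend(list(c) for c in combinations(variables, size))
    return subsets


def neighbours(best, variables):
    """The best set plus every set that differs from it by one variable."""
    candidates = [list(best)]
    for v in variables:
        if v in best:
            candidates.append([b for b in best if b != v])  # drop v
        else:
            candidates.append(list(best) + [v])  # add v
    return candidates


variables = ["x1", "x2", "x3"]
full = all_subsets(variables)          # brute force: 2**3 candidate models
near = neighbours(["x1"], variables)   # stress-testing around a best model
print(len(full), near)
```

The brute-force pool doubles with every extra regressor, which is why restricting it to the variables chosen by stepwise selection keeps the computation manageable.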

Returns:

Dictionary containing:

coefficients: combined parameters
vcov: combined variance-covariance matrix
fitted: fitted values (on original scale)
residuals: residuals (distribution-specific)
mu: location parameter
scale: scale parameter
distribution: distribution used
log_lik: combined log-likelihood
IC: combined information criterion value
IC_type: type of IC used
df_residual: residual degrees of freedom
df: model degrees of freedom
importance: importance of each parameter
combination: matrix of model combinations with weights/ICs
coefficient_names: names of coefficients
time_elapsed: computation time in seconds

Return type:

LmCombineResult (supports dict-like access)
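The importance entry can be read as a per-variable summary of the model weights. The sketch below assumes one common definition, the sum of the weights of all models that include the variable; the source does not spell out the formula, so treat this as an assumption. The function name, models and weights are hypothetical.

```python
def importance(models, weights, names):
    """Sum the weights of the models that include each variable."""
    return {
        name: sum(w for w, m in zip(weights, models) if name in m["vars"])
        for name in names
    }


# Hypothetical candidate models and their (already normalised) weights.
models = [{"vars": ["x1"]}, {"vars": ["x1", "x2"]}, {"vars": []}]
weights = [0.5, 0.3, 0.2]
scores = importance(models, weights, ["x1", "x2"])
print(scores)
```

Under this definition a value close to one means the variable appears in nearly all models that carry weight.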

Examples

>>> from greybox.selection import CALM
>>> data = {'y': [1, 2, 3, 4, 5], 'x1': [1, 2, 3, 4, 5],
...         'x2': [2, 4, 6, 8, 10]}
>>> result = CALM(data)

class greybox.selection.LmCombineResult(**kwargs: Any)

Bases: object

Result of CALM, with print and summary support.

Supports dict-like access for backwards compatibility and ALM-compatible interface (predict, score, confint, properties).
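A result object that offers both dict-like and attribute access can be built along the following lines. ResultSketch is a hypothetical stand-in written for illustration; it is not the real LmCombineResult and makes no claim about how that class is implemented.

```python
from typing import Any


class ResultSketch:
    """Hypothetical stand-in showing dict-like plus attribute access."""

    def __init__(self, **kwargs: Any) -> None:
        self._values = dict(kwargs)

    def __getitem__(self, key: str) -> Any:
        # Dict-style access: result["IC"]
        return self._values[key]

    def keys(self) -> list[str]:
        return list(self._values)

    @property
    def coefficients(self) -> Any:
        # Attribute-style access: result.coefficients
        return self._values["coefficients"]


result = ResultSketch(coefficients=[1.0, 2.0], IC=100.0)
print(result["IC"], result.coefficients, result.keys())
```

Keeping the values in one internal dict means both access styles always agree.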

property actuals: ndarray

property aic: float

property aicc: float

property bic: float

property bicc: float

property coef: ndarray: Slope coefficients (excluding intercept).

property coefficients: ndarray: Combined coefficients.

confint(parm: int | list[int] | None = None, level: float = 0.95) → ndarray

Confidence intervals for parameters.

Parameters:

parm (int or list of int, optional) – Which parameters to include. If None, all.
level (float, default=0.95) – Confidence level.

Returns:

Shape (n_params, 2) with lower and upper bounds.

Return type:

np.ndarray
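As an illustration of the (n_params, 2) layout, the sketch below builds symmetric intervals from coefficients and standard errors. It assumes normal quantiles for simplicity; the library may well use a different reference distribution (for example Student's t with the residual degrees of freedom), which the source does not specify. The function name and inputs are hypothetical.

```python
from statistics import NormalDist


def normal_confint(coefficients, std_errors, level=0.95):
    """Return one (lower, upper) pair per parameter."""
    # Two-sided interval: put (1 - level) / 2 of the probability in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return [(b - z * se, b + z * se) for b, se in zip(coefficients, std_errors)]


# Hypothetical estimates and standard errors.
bounds = normal_confint([1.0, 2.0], [0.1, 0.5])
for lower, upper in bounds:
    print(round(lower, 3), round(upper, 3))
```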

property data: ndarray

property df_residual_: float

property distribution_: str: Distribution name (ADAM convention with trailing _).

property fitted: ndarray: Fitted values.

property formula: str

property intercept_: float

keys() → list[str]

property log_lik: float: Combined log-likelihood.

property loglik: float: Log-likelihood (ADAM-compatible name).

property loss_: str: Loss function name (ADAM convention with trailing _).

property loss_value: float: Loss function value.

property n_param: dict: Parameter count information.

property nobs: int

property nparam: float

predict(X: ndarray | DataFrame | dict | None = None, interval: str = 'none', level: float | list[float] = 0.95, side: str = 'both') → PredictionResult

Predict using the combined model.

Parameters:

X (array-like, dict, DataFrame, or None) – Design matrix (with intercept column), dict/DataFrame of new data, or None to return training fitted values.
interval (str, default="none") – “none”, “confidence”, or “prediction”.
level (float or list of float, default=0.95) – Confidence level(s).
side (str, default="both") – “both”, “upper”, or “lower”.

Return type:

PredictionResult

property residuals: ndarray: Model residuals.

score(X: ndarray, y: ndarray, metric: str = 'likelihood') → float

Calculate model score.

Parameters:

X (array-like) – Design matrix.
y (array-like) – True values.
metric (str, default="likelihood") – “likelihood”, “MSE”, “MAE”, or “R2”.

Return type:

float
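The error-based metrics named above have standard textbook definitions, sketched here on plain lists. These are the usual formulas, not code taken from the library, and the helper names are hypothetical; the "likelihood" metric is omitted because it depends on the fitted distribution.

```python
def mse(actual, fitted):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)


def mae(actual, fitted):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)


def r2(actual, fitted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(actual) / len(actual)
    sse = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


# Made-up actual values and predictions.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]
print(mse(y, y_hat), mae(y, y_hat), r2(y, y_hat))
```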

property sigma: float

summary(level: float = 0.95) → LmCombineSummary

Summary matching R’s summary.greyboxC.

Parameters:

level (float, default=0.95) – Confidence level for intervals.

Return type:

LmCombineSummary

property time_elapsed: float: Time elapsed during computation (seconds).

vcov() → ndarray: Return variance-covariance matrix.

class greybox.selection.LmCombineSummary(coefficients: ndarray, se: ndarray, importance: ndarray, lower_ci: ndarray, upper_ci: ndarray, coefficient_names: list[str], sigma: float, n_obs: int, nparam: float, df_residual: float, distribution: str, y_variable: str, ic_type: str, aic: float, aicc: float, bic: float, bicc: float)

Bases: object

Summary of CALM result, matching R’s summary.greyboxC.

class greybox.selection.ModelInfo

Bases: TypedDict

clear(): Remove all items from the dict.

coef: ndarray

copy(): Return a shallow copy of the dict.

classmethod fromkeys(iterable, value=None, /): Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /): Return the value for key if key is in the dictionary, else default.

ic: float

items(): Return a set-like object providing a view on the dict’s items.

keys(): Return a set-like object providing a view on the dict’s keys.

model: ALM

pop(k[, d]) → v

Remove the specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None

Update D from mapping/iterable E and F.

If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]

values(): Return an object providing a view on the dict’s values.

vars: list[str]

greybox.selection.stepwise(data: dict | DataFrame, ic: Literal['AICc', 'AIC', 'BIC', 'BICc'] = 'AICc', silent: bool = True, df: int | None = None, formula: str | None = None, subset: Any | None = None, method: Literal['pearson', 'kendall', 'spearman'] = 'pearson', distribution: str = 'dnorm', occurrence: Literal['none', 'plogis', 'pnorm'] = 'none', **kwargs) → ALM

Stepwise selection of regressors.

The function selects the variables that give the linear regression with the lowest information criterion. The selection is done stepwise (forward), based on partial correlations. It is intended as a simpler and faster alternative to the step() function from R's stats package.

The algorithm uses ALM to fit different models and correlation to select the next regressor in the sequence.
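A single selection step of this kind can be sketched as picking the candidate that correlates most strongly with the residuals of the current model. This is one way to approximate selection by partial correlation and is an assumption about the mechanics, not the library's code; the function names and data are hypothetical, and the refitting and IC comparison that would follow each step are left out.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def next_regressor(residuals, candidates):
    """Name of the candidate most correlated (in absolute value) with the residuals."""
    return max(candidates, key=lambda name: abs(pearson(residuals, candidates[name])))


# Made-up residuals of the current model and two remaining candidates.
residuals = [0.5, -1.0, 1.5, -2.0, 1.0]
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, -2.0, 3.0, -4.0, 2.0],
}
chosen = next_regressor(residuals, candidates)
print(chosen)
```

In a full forward search the chosen variable would be added, the model refitted, and the step kept only if the information criterion improves.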

Parameters:

data (dict or DataFrame) – Data frame containing the dependent variable in the first column and the explanatory variables in the remaining columns.
ic ({"AICc", "AIC", "BIC", "BICc"}, default="AICc") – Information criterion to use.
silent (bool, default=True) – If False, then progress is printed.
df (int, optional) – Number of degrees of freedom to add (should be used if stepwise is used on residuals).
formula (str, optional) – If provided, then the selection will be done from the listed variables in the formula after all the necessary transformations.
subset (array-like, optional) – An optional vector specifying a subset of observations to be used in the fitting process.
method ({"pearson", "kendall", "spearman"}, default="pearson") – Method of correlations calculation. The default is Pearson’s correlation, which should be applicable to a wide range of data in different scales.
distribution (str, default="dnorm") – Distribution to use for the ALM model. See ALM for details.
occurrence ({"none", "plogis", "pnorm"}, default="none") – What distribution to use for occurrence part. See ALM for details.
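For reference, the four criteria accepted by ic can be computed from a log-likelihood, the number of estimated parameters k and the number of observations n. The AIC, AICc and BIC formulas below are the standard ones; the BICc form is an assumption, as the source does not give it. The function name and inputs are hypothetical.

```python
import math


def information_criteria(log_lik, k, n):
    """Return AIC, AICc, BIC and BICc for k parameters and n observations."""
    aic = 2 * k - 2 * log_lik
    bic = k * math.log(n) - 2 * log_lik
    # The corrected versions need n > k + 1 so that the denominator is positive.
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bicc = -2 * log_lik + k * math.log(n) * n / (n - k - 1)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "BICc": bicc}


# Made-up log-likelihood, parameter count and sample size.
ics = information_criteria(log_lik=-50.0, k=3, n=20)
print(ics)
```

The corrected criteria penalise extra parameters more heavily in small samples, which is one reason a corrected criterion is a sensible default for short series.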

Returns:

The final fitted model with additional attributes:

ic_values: dict mapping step names to IC values (e.g. {"Intercept": 150.3, "x1": 140.2, "x2": 138.1}). Keys are in insertion order (Python 3.7+).
time_elapsed: float, seconds taken for calculation

Return type:

ALM
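Because the keys of ic_values follow the order in which variables were added, the dictionary can be read as a trace of the search. The sketch below works on a plain dict with the illustrative values from the description above; how the attribute is reached on a fitted model is not shown here.

```python
# Illustrative values copied from the description of ic_values.
ic_values = {"Intercept": 150.3, "x1": 140.2, "x2": 138.1}

steps = list(ic_values)  # insertion order is the order of the steps

# How much each added variable lowered the IC relative to the previous step.
improvements = {
    steps[i]: ic_values[steps[i - 1]] - ic_values[steps[i]]
    for i in range(1, len(steps))
}

# The last entry corresponds to the final model.
final_step, final_ic = list(ic_values.items())[-1]
print(improvements, final_step, final_ic)
```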

Examples

>>> from greybox.selection import stepwise
>>> data = {'y': [1, 2, 3, 4, 5], 'x1': [1, 2, 3, 4, 5],
...         'x2': [2, 4, 6, 8, 10]}
>>> model = stepwise(data)