Association
Measures of association.
This module provides functions for calculating various measures of association including partial correlations, multiple correlations, and correlation analysis.
- greybox.association.association(x: ndarray, y: ndarray | None = None, method: str = 'auto') dict[source]
Calculate measures of association.
Function returns the matrix of measures of association for different types of variables.
- Parameters:
x (np.ndarray) – DataFrame or matrix of variables.
y (np.ndarray, optional) – The numerical variable to relate to each column of x.
method (str, default="auto") – Method to use: “auto”, “pearson”, “spearman”, “kendall”, “cramer”. “auto” selects based on variable types.
- Returns:
Dictionary containing:
- value: matrix of association coefficients
- p.value: matrix of p-values
- type: matrix of types of measures used
- Return type:
dict
Examples
>>> x = np.array([[1, 2, 3], [2, 4, 6], [3, 6, 9]])
>>> result = association(x)
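As a hedged illustration of the kind of measure "auto" may select for a pair of categorical variables, the sketch below computes Cramér's V from a contingency table with plain NumPy. `cramers_v` is a hypothetical helper written for this example, not part of greybox:

```python
import numpy as np

def cramers_v(x, y):
    """Minimal sketch of Cramér's V for two categorical variables.

    Hypothetical helper for illustration; not the greybox implementation.
    """
    # Build the contingency table of observed counts
    _, ix = np.unique(x, return_inverse=True)
    _, iy = np.unique(y, return_inverse=True)
    table = np.zeros((ix.max() + 1, iy.max() + 1))
    np.add.at(table, (ix, iy), 1)
    n = table.sum()
    # Expected counts under independence of the two variables
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    # Normalise chi-square to the (0, 1) range
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))
```

For perfectly associated variables the statistic is 1, and for a balanced independent table it is 0.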
- greybox.association.determination(xreg: ndarray, bruteforce: bool = True) ndarray[source]
Coefficients of determination.
Function produces coefficients of determination for the provided data.
The function calculates coefficients of determination (R^2) between all the provided variables. The higher the coefficient for a variable is, the stronger the potential multicollinearity effect in a model containing that variable will be. Coefficients of determination are connected directly to the Variance Inflation Factor (VIF): VIF = 1 / (1 - determination). Arguably, determination is easier to interpret, because it is bounded by (0, 1). Multicollinearity can be considered serious when determination > 0.9 (which corresponds to VIF > 10).
- Parameters:
xreg (np.ndarray) – Data frame or matrix containing the exogenous variables.
bruteforce (bool, default=True) – If True, all the variables are used in each regression ("sink" regression). If the number of observations is smaller than the number of series, the function falls back to the stepwise function and selects only meaningful variables, so the reported values are based on stepwise regressions for each variable.
- Returns:
Vector of determination coefficients, one for each variable.
- Return type:
np.ndarray
Examples
>>> np.random.seed(42)
>>> x1 = np.random.normal(10, 3, 100)
>>> x2 = np.random.normal(50, 5, 100)
>>> x3 = 100 + 0.5*x1 - 0.75*x2 + np.random.normal(0, 3, 100)
>>> xreg = np.column_stack([x3, x1, x2])
>>> determination(xreg)
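Since each determination coefficient is the R^2 of one variable regressed on all the others, the "sink regression" case can be sketched as follows. This is a minimal NumPy illustration assuming more observations than variables; `determination_sketch` is a hypothetical name, not the greybox function:

```python
import numpy as np

def determination_sketch(xreg):
    """Sketch: R^2 of each column of xreg regressed on all the others.

    Hypothetical illustration of the sink-regression case only.
    """
    n, k = xreg.shape
    r2 = np.empty(k)
    for j in range(k):
        y = xreg[:, j]
        # Design matrix: an intercept plus all the remaining variables
        X = np.column_stack([np.ones(n), np.delete(xreg, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sst = (y - y.mean()) @ (y - y.mean())
        r2[j] = 1.0 - (resid @ resid) / sst
    return r2
```

A variable that is close to a linear combination of the others gets a determination near 1 (equivalently, a large VIF = 1 / (1 - determination)).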
- greybox.association.mcor(x: ndarray, y: ndarray | None = None) float[source]
Calculate multiple correlation coefficient.
Function returns the multiple correlation coefficient between a set of variables and a dependent variable.
- Parameters:
x (np.ndarray) – Matrix of independent variables.
y (np.ndarray) – The dependent variable.
- Returns:
Multiple correlation coefficient.
- Return type:
float
Examples
>>> x = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])
>>> y = np.array([3, 6, 9, 12, 15])
>>> mcor(x, y)
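The multiple correlation coefficient is the square root of the R^2 obtained when y is regressed on the columns of x, which the hedged sketch below illustrates. `mcor_sketch` is a hypothetical helper for this example, not the greybox implementation:

```python
import numpy as np

def mcor_sketch(x, y):
    """Sketch: multiple correlation as sqrt(R^2) of y regressed on x.

    Hypothetical illustration; assumes y has non-zero variance.
    """
    # Ordinary least squares with an intercept
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = (y - y.mean()) @ (y - y.mean())
    return np.sqrt(1.0 - (resid @ resid) / sst)
```

For the docstring's example, where y is an exact linear function of the first column of x, the coefficient is 1.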
- greybox.association.pcor(x: ndarray, y: ndarray | None = None, method: Literal['pearson', 'spearman', 'kendall'] = 'pearson') dict[source]
Calculate partial correlations.
Function calculates partial correlations between the provided variables. The calculation is done based on multiple linear regressions.
- Parameters:
x (np.ndarray) – DataFrame or matrix with numeric values.
y (np.ndarray, optional) – The numerical variable. If provided, calculates partial correlation between each column of x and y.
method ({"pearson", "spearman", "kendall"}, default="pearson") – Which method to use for calculation.
- Returns:
Dictionary containing:
- value: matrix of partial correlation coefficients
- p.value: p-values for the coefficients
- method: method used
- Return type:
dict
Examples
>>> x = np.array([[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15]])
>>> result = pcor(x)
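Partial correlations can equivalently be read off the inverse covariance (precision) matrix, which gives the same values as the regression-based route the docstring describes. The sketch below uses that precision-matrix shortcut; `pcor_sketch` is a hypothetical helper, not the greybox implementation:

```python
import numpy as np

def pcor_sketch(x):
    """Sketch: pairwise partial correlations via the precision matrix.

    Hypothetical illustration, equivalent to the regression route:
    rho_ij = -P_ij / sqrt(P_ii * P_jj), where P is the inverse covariance.
    """
    precision = np.linalg.pinv(np.cov(x, rowvar=False))
    d = np.sqrt(np.diag(precision))
    rho = -precision / np.outer(d, d)
    # The diagonal is 1 by convention (a variable with itself)
    np.fill_diagonal(rho, 1.0)
    return rho
```

With only two variables there is nothing to control for, so the partial correlation reduces to the plain Pearson correlation.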