Formula
- greybox.formula.formula(formula_str: str, data, return_type: str = 'both', as_dataframe: bool = True) → Any
Parse formula string and return design matrix and/or response.
- Parameters:
formula_str (str) – R-style formula string, e.g., “y ~ x1 + x2”, or “~ x1 + x2” for a formula with no response. Supports transformations such as log(x), sqrt(x), and x^2. Use I() to protect an expression from formula interpretation, e.g., I(x^2).
data (dict or DataFrame) – Data containing the variables: a dict (for example, a dict of arrays) or a pandas DataFrame.
return_type (str, optional) – What to return: “both” (default), “X”, “y”, or “terms”.
as_dataframe (bool, optional) – If True, return pandas DataFrames instead of numpy arrays. This preserves variable names in column names. Default is True.
- Returns:
“both”: tuple (y, X) where y is 1D array (or DataFrame) and X is 2D array (or DataFrame) with intercept
“X”: design matrix only (or DataFrame if as_dataframe=True)
“y”: response only (or DataFrame if as_dataframe=True)
“terms”: list of term names
- Return type:
Depends on return_type and as_dataframe
Examples
>>> from greybox.formula import formula
>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data)
>>> # One-sided formula (no response): request the design matrix only
>>> X_only = formula("~ x1 + x2", data, return_type="X")

>>> # With transformations
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> # y is log-transformed, X has x and x^2
>>> y, X = formula("log(y) ~ x + x^2", data)

>>> # Return as DataFrames with column names
>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data, as_dataframe=True)
>>> print(X.columns)  # ['(Intercept)', 'x1', 'x2']

>>> # With custom functions (defined or imported in your global scope)
>>> def my_transform(x):
...     return x * 2
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> y, X = formula("y ~ my_transform(x)", data)

>>> # With imported functions (e.g., from scipy)
>>> from scipy.special import erfc
>>> data = {'y': [1, 2, 3], 'x': [0.5, 1.0, 1.5]}
>>> y, X = formula("y ~ erfc(x)", data)

>>> # Custom function on LHS (response variable)
>>> y, X = formula("my_transform(y) ~ x", data)
- greybox.formula.expand_formula(formula_str)
Expand formula with interaction terms.
- Parameters:
formula_str (str) – Formula string, e.g., “y ~ x1 * x2”
- Returns:
Expanded formula with explicit interaction terms.
- Return type:
str
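To make the idea of expansion concrete, the following is a minimal pure-Python sketch of how a `*` term can be rewritten as main effects plus interactions. It is not the implementation of expand_formula: the helper names expand_star_sketch and expand_rhs_sketch are hypothetical, and the sketch only handles terms joined by `+` and `*`.

```python
from itertools import combinations


def expand_star_sketch(term):
    """Expand 'a * b' into main effects and interactions (hypothetical helper)."""
    factors = [f.strip() for f in term.split("*")]
    expanded = []
    # Order 1 gives the main effects, higher orders give the interactions.
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            expanded.append(":".join(combo))
    return expanded


def expand_rhs_sketch(formula_str):
    """Expand every '*' term on the right-hand side (hypothetical helper)."""
    lhs, rhs = formula_str.split("~")
    terms = []
    for term in rhs.split("+"):
        for t in expand_star_sketch(term):
            if t not in terms:  # keep each term once, in first-seen order
                terms.append(t)
    return f"{lhs.strip()} ~ {' + '.join(terms)}".strip()


expanded = expand_rhs_sketch("y ~ x1 * x2")
print(expanded)  # y ~ x1 + x2 + x1:x2
```

The same rule extends to three factors: a * b * c yields the three main effects, the three pairwise interactions, and a:b:c.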
Backshift Operator
- greybox.xreg.B(x: ndarray, k: int, gaps: Literal['auto', 'NAs', 'zero', 'naive', 'extrapolate'] = 'auto') → ndarray
Backshift operator: lag (k>0) or lead (k<0) of x.
Positive k creates a lag of order k (past values); negative k creates a lead of order abs(k) (future values); k=0 returns x unchanged. Gaps at the boundaries are filled according to the gaps strategy.
- Parameters:
x (array-like of shape (n,))
k (int) – Lag order. Positive = lag (past values), negative = lead (future).
gaps (str, default "auto") – Boundary fill strategy passed to xreg_expander.
- Return type:
np.ndarray of shape (n,)
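The shifting described above can be illustrated with a small pure-Python sketch. The helper backshift_sketch is hypothetical and works on plain lists rather than arrays. It pads the boundary gaps with NaN, which is only an assumption about what an "NAs"-style fill looks like; the other gaps strategies are not modelled here.

```python
import math


def backshift_sketch(x, k):
    """Lag (k > 0) or lead (k < 0) a sequence, padding gaps with NaN.

    Hypothetical helper for illustration; the output has the same length as x.
    """
    n = len(x)
    if k == 0:
        return list(x)
    if k > 0:
        k = min(k, n)
        # Lag: past values move forward, so the gap opens at the start.
        return [math.nan] * k + list(x[: n - k])
    k = min(-k, n)
    # Lead: future values move backward, so the gap opens at the end.
    return list(x[k:]) + [math.nan] * k


x = [10.0, 20.0, 30.0, 40.0]
lag1 = backshift_sketch(x, 1)    # [nan, 10.0, 20.0, 30.0]
lead1 = backshift_sketch(x, -1)  # [20.0, 30.0, 40.0, nan]
```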
Formula Syntax Reference
Basic Operators
- ~: Separates response from predictors
- +: Adds a term (include variable)
- -: Removes a term
- 0 or -1: Removes intercept
- 1: Adds intercept (default)
- *: Main effects and interactions (a*b = a + b + a:b)
- :: Interaction only
- I(): Protect expression from interpretation
Transformations
Supported transformations in formula terms:
- log(x) - Natural logarithm
- log10(x) - Base 10 logarithm
- log2(x) - Base 2 logarithm
- sqrt(x) - Square root
- exp(x) - Exponential
- abs(x) - Absolute value
- sin(x), cos(x), tan(x) - Trigonometric
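As a rough picture of what these transformations compute, the sketch below maps each listed name to its standard-library counterpart and applies it element-wise to a column. The TRANSFORMS table and apply_transform_sketch are hypothetical and say nothing about how the library resolves function names internally.

```python
import math

# Hypothetical lookup from the transformation names listed above
# to their Python standard-library equivalents.
TRANSFORMS = {
    "log": math.log,      # natural logarithm
    "log10": math.log10,  # base 10 logarithm
    "log2": math.log2,    # base 2 logarithm
    "sqrt": math.sqrt,
    "exp": math.exp,
    "abs": abs,
    "sin": math.sin,
    "cos": math.cos,
    "tan": math.tan,
}


def apply_transform_sketch(name, column):
    """Apply a named transformation to every value in a column."""
    func = TRANSFORMS[name]
    return [func(v) for v in column]


x = [1.0, 4.0, 16.0]
sqrt_x = apply_transform_sketch("sqrt", x)  # [1.0, 2.0, 4.0]
log2_x = apply_transform_sketch("log2", x)  # [0.0, 2.0, 4.0]
```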
Polynomial Terms
- I(x^2) - Squared term (protected)
- I(x^3) - Cubed term
- poly(x, 2) - Polynomial (if supported)
Special Variables
- trend - Linear time trend (1, 2, 3, …)
- B(x, k) - Lag (k>0) or lead (k<0) of variable x; used in ARDL models
Examples
Basic linear regression:
y, X = formula("y ~ x1 + x2", data)
Without intercept:
y, X = formula("y ~ 0 + x1 + x2", data)
With log transformation:
y, X = formula("log(y) ~ log(x1) + sqrt(x2)", data)
Polynomial regression:
y, X = formula("y ~ x + I(x^2) + I(x^3)", data)
Interactions:
y, X = formula("y ~ x1 * x2", data) # equivalent to x1 + x2 + x1:x2
Distributed-lag ARDL model:
y, X = formula("y ~ x + B(x, 1) + B(x, 2)", data)
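For intuition about the columns such a distributed-lag formula describes, here is a hedged pure-Python sketch that builds them by hand. The helper lag_sketch is hypothetical, the lagged column labels are assumptions rather than the names the library produces, and dropping rows with gaps is just one possible treatment; in the library the gaps argument of B controls how boundaries are filled.

```python
import math


def lag_sketch(x, k):
    """Shift x back by k steps (0 <= k <= len(x)), padding the start with NaN."""
    if k == 0:
        return list(x)
    return [math.nan] * k + list(x[: len(x) - k])


x = [1.0, 2.0, 3.0, 4.0, 5.0]

# One entry per right-hand-side term; lagged column labels are illustrative only.
columns = {
    "(Intercept)": [1.0] * len(x),
    "x": list(x),
    "B(x, 1)": lag_sketch(x, 1),
    "B(x, 2)": lag_sketch(x, 2),
}

# Assemble row-wise and keep only rows where every lag is available.
rows = list(zip(*columns.values()))
complete_rows = [r for r in rows if not any(math.isnan(v) for v in r)]

for row in complete_rows:
    print(row)
```

With two lags, the first two observations have no complete history, so only three of the five rows survive in this sketch.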