Formula

greybox.formula.formula(formula_str: str, data, return_type: str = 'both', as_dataframe: bool = True) → Any

Parse formula string and return design matrix and/or response.

Parameters:

formula_str (str) – Formula string in R-style, e.g., “y ~ x1 + x2”, or “~ x1 + x2” when there is no response. Supports transformations such as log(x), sqrt(x), and x^2. Use I() to protect expressions, e.g., I(x^2).
data (dict or DataFrame) – Data containing the variables referenced in the formula: a dict (including a dict of arrays) or a DataFrame.
return_type (str, optional) – What to return: “both” (default), “X”, “y”, or “terms”.
as_dataframe (bool, optional) – If True, return pandas DataFrames instead of numpy arrays. This preserves variable names in column names. Default is True.

Returns:

“both”: tuple (y, X) where y is 1D array (or DataFrame) and X is 2D array (or DataFrame) with intercept
“X”: design matrix only (or DataFrame if as_dataframe=True)
“y”: response only (or DataFrame if as_dataframe=True)
”terms”: list of term names

Return type:

Depends on return_type and as_dataframe
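
To make the layout of the “both” result concrete, here is a minimal hand-built sketch of a response vector and a design matrix with an intercept column. It does not call greybox; build_design is a hypothetical helper, and plain lists stand in for the arrays or DataFrames the function returns.

```python
# Hand-built stand-in for the (y, X) pair described above.
# build_design is a hypothetical helper, not part of greybox.
data = {"y": [1, 2, 3], "x1": [4, 5, 6], "x2": [7, 8, 9]}


def build_design(data, response, predictors, intercept=True):
    """Return (y, X, column_names) using plain Python lists."""
    n = len(data[predictors[0]])
    columns = (["(Intercept)"] if intercept else []) + list(predictors)
    X = []
    for i in range(n):
        # Each row starts with 1.0 when an intercept is requested.
        row = [1.0] if intercept else []
        row.extend(float(data[name][i]) for name in predictors)
        X.append(row)
    # A formula without a left-hand side has no response.
    y = [float(v) for v in data[response]] if response is not None else None
    return y, X, columns


y, X, columns = build_design(data, "y", ["x1", "x2"])
print(columns)  # ['(Intercept)', 'x1', 'x2']
print(X)        # [[1.0, 4.0, 7.0], [1.0, 5.0, 8.0], [1.0, 6.0, 9.0]]
```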

Examples

>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data)
>>> X_only = formula("~ x1 + x2", data, return_type="X")  # no response on the left-hand side

>>> # With transformations: y is log-transformed, X has x and x^2
>>> data_t = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> y, X = formula("log(y) ~ x + x^2", data_t)

>>> # Return as DataFrames with column names
>>> y, X = formula("y ~ x1 + x2", data, as_dataframe=True)
>>> list(X.columns)
['(Intercept)', 'x1', 'x2']

>>> # With custom functions (defined or imported in your global scope)
>>> def my_transform(x):
...     return x * 2
>>> y, X = formula("y ~ my_transform(x)", data_t)

>>> # With imported functions (e.g., from scipy)
>>> from scipy.special import erfc
>>> data_e = {'y': [1, 2, 3], 'x': [0.5, 1.0, 1.5]}
>>> y, X = formula("y ~ erfc(x)", data_e)

>>> # Custom function on LHS (response variable)
>>> y, X = formula("my_transform(y) ~ x", data_t)

greybox.formula.expand_formula(formula_str)

Expand formula with interaction terms.

Parameters:

formula_str (str) – Formula string, e.g., “y ~ x1 * x2”

Returns:

Expanded formula with explicit interaction terms.

Return type:

str
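
The kind of expansion described here can be sketched in a few lines of plain Python. This is an illustration of the a*b = a + b + a:b rule only, not the library's implementation: expand_star is a hypothetical helper, and it assumes a right-hand side made of + and * with no parentheses, no - terms, and no + inside function calls.

```python
from itertools import combinations


def expand_star(formula_str):
    """Expand 'a * b' into 'a + b + a:b' (hypothetical helper, not greybox code)."""
    lhs, rhs = formula_str.split("~", 1)
    expanded = []
    for term in (t.strip() for t in rhs.split("+")):
        factors = [f.strip() for f in term.split("*")]
        # Main effects first, then interactions of increasing order.
        for order in range(1, len(factors) + 1):
            for combo in combinations(factors, order):
                name = ":".join(combo)
                if name not in expanded:
                    expanded.append(name)
    return f"{lhs.strip()} ~ {' + '.join(expanded)}".strip()


print(expand_star("y ~ x1 * x2"))  # y ~ x1 + x2 + x1:x2
```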

Backshift Operator

greybox.xreg.B(x: ndarray, k: int, gaps: Literal['auto', 'NAs', 'zero', 'naive', 'extrapolate'] = 'auto') → ndarray

Backshift operator: lag (k>0) or lead (k<0) of x.

Positive k creates lag-k (past values); negative k creates lead-abs(k) (future values); k=0 returns x unchanged. Gaps at boundaries are filled per the gaps strategy.

Parameters:

x (array-like of shape (n,)) – Series to shift.
k (int) – Lag order. Positive = lag (past values), negative = lead (future values), 0 = no shift.
gaps (str, default "auto") – Boundary fill strategy passed to xreg_expander. One of "auto", "NAs", "zero", "naive", or "extrapolate".

Return type:

np.ndarray of shape (n,)
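
The shifting itself can be illustrated without the library. The sketch below is an assumption-laden stand-in: backshift is a hypothetical helper working on plain lists, and its fill argument merely stands in for the gaps strategies, whose exact behavior is not described above.

```python
import math


def backshift(x, k, fill=math.nan):
    """Lag (k > 0) or lead (k < 0) a sequence, padding the gap with `fill`.

    Hypothetical helper for illustration; not the greybox implementation.
    """
    x = list(x)
    n = len(x)
    if k == 0:
        return x
    if abs(k) >= n:
        # Every position falls outside the observed range.
        return [fill] * n
    if k > 0:
        # Lag: values move forward in time, the first k slots are unknown.
        return [fill] * k + x[: n - k]
    # Lead: values move backward in time, the last |k| slots are unknown.
    return x[-k:] + [fill] * (-k)


print(backshift([10, 20, 30, 40], 1, fill=0))   # [0, 10, 20, 30]
print(backshift([10, 20, 30, 40], -1, fill=0))  # [20, 30, 40, 0]
```

The output always has the same length as the input, which matches the shape (n,) stated for the return value.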

Formula Syntax Reference

Basic Operators

~ : Separates response from predictors
+ : Adds a term (includes the variable)
- : Removes a term
0 or -1 : Removes intercept
1 : Adds intercept (default)
* : Main effects and interactions (a*b = a + b + a:b)
: : Interaction only
I() : Protect expression from interpretation
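
As a rough illustration of how +, -, 0 and 1 combine, the following sketch resolves a right-hand side into an intercept flag and a term list. resolve_terms is a hypothetical helper, not part of greybox, and it deliberately ignores *, :, and any + or - that appears inside a function call such as I(x-1) or B(x, -1).

```python
def resolve_terms(rhs):
    """Return (has_intercept, terms) for a right-hand side using only + and -.

    Hypothetical helper for illustration; not the greybox parser.
    """
    # Turn "a - b" into "a +- b" so that splitting on "+" keeps the sign.
    tokens = rhs.replace("-", "+-").split("+")
    intercept, terms = True, []  # the intercept is included by default
    for raw in tokens:
        tok = raw.strip()
        if not tok:
            continue
        remove = tok.startswith("-")
        name = tok.lstrip("-").strip()
        if name == "1":
            intercept = not remove  # "+ 1" keeps it, "- 1" drops it
        elif name == "0":
            intercept = False
        elif remove:
            if name in terms:
                terms.remove(name)
        elif name not in terms:
            terms.append(name)
    return intercept, terms


print(resolve_terms("x1 + x2"))      # (True, ['x1', 'x2'])
print(resolve_terms("0 + x1 + x2"))  # (False, ['x1', 'x2'])
print(resolve_terms("x1 + x2 - 1"))  # (False, ['x1', 'x2'])
```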

Transformations

Supported transformations in formula terms:

log(x) - Natural logarithm
log10(x) - Base 10 logarithm
log2(x) - Base 2 logarithm
sqrt(x) - Square root
exp(x) - Exponential
abs(x) - Absolute value
sin(x), cos(x), tan(x) - Trigonometric
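
Each transformation listed above has a direct counterpart in Python's math module (or a builtin, for abs). The sketch below applies them element-wise to a column; the TRANSFORMS table and apply_transform are hypothetical names used for illustration, and nothing here reflects how greybox evaluates terms internally.

```python
import math

# Name-to-function table mirroring the list above (illustrative only).
TRANSFORMS = {
    "log": math.log,      # natural logarithm
    "log10": math.log10,
    "log2": math.log2,
    "sqrt": math.sqrt,
    "exp": math.exp,
    "abs": abs,
    "sin": math.sin,
    "cos": math.cos,
    "tan": math.tan,
}


def apply_transform(name, values):
    """Apply the named transformation to every value in a column."""
    func = TRANSFORMS[name]
    return [func(v) for v in values]


print(apply_transform("log2", [1, 2, 4, 8]))  # [0.0, 1.0, 2.0, 3.0]
print(apply_transform("sqrt", [4, 9]))        # [2.0, 3.0]
```

Note that log and sqrt raise ValueError in the math module for inputs outside their domain (for example, log(0) or sqrt(-1)), so such values are worth checking before a column is transformed this way.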

Polynomial Terms

I(x^2) - Squared term (protected)
I(x^3) - Cubed term
poly(x, 2) - Polynomial (if supported)

Special Variables

trend - Linear time trend (1, 2, 3, …)
B(x, k) - Lag (k>0) or lead (k<0) of variable x; used in ARDL models

Examples

Basic linear regression:

y, X = formula("y ~ x1 + x2", data)

Without intercept:

y, X = formula("y ~ 0 + x1 + x2", data)

With log transformation:

y, X = formula("log(y) ~ log(x1) + sqrt(x2)", data)

Polynomial regression:

y, X = formula("y ~ x + I(x^2) + I(x^3)", data)

Interactions:

y, X = formula("y ~ x1 * x2", data)  # equivalent to x1 + x2 + x1:x2

Distributed-lag ARDL model:

y, X = formula("y ~ x + B(x, 1) + B(x, 2)", data)
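
To show what the lagged columns in such a model line up with, here is a hand-built sketch of the design matrix for x, B(x, 1) and B(x, 2). It does not call greybox: lag is a hypothetical helper, the column labels are illustrative, and dropping the leading rows with missing lags is an assumption made here (the library may instead fill them according to the gaps strategy).

```python
def lag(x, k):
    """Lag x by k >= 0 steps, marking unavailable values as None (hypothetical helper)."""
    k = min(k, len(x))
    return [None] * k + list(x[: len(x) - k])


data = {"y": [5, 6, 7, 8, 9], "x": [1, 2, 3, 4, 5]}
n = len(data["x"])

# Column labels are illustrative, not necessarily the names greybox assigns.
columns = {
    "(Intercept)": [1] * n,
    "x": data["x"],
    "B(x, 1)": lag(data["x"], 1),
    "B(x, 2)": lag(data["x"], 2),
}

rows = list(zip(*columns.values()))
# Assumption: keep only rows where every lag is observed.
keep = [i for i, row in enumerate(rows) if None not in row]
X = [list(rows[i]) for i in keep]
y = [data["y"][i] for i in keep]

print(X)  # [[1, 3, 2, 1], [1, 4, 3, 2], [1, 5, 4, 3]]
print(y)  # [7, 8, 9]
```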