Formula
- greybox.formula.formula(formula_str, data, return_type='both', as_dataframe=True)
Parse formula string and return design matrix and/or response.
- Parameters:
formula_str (str) – Formula string in R-style, e.g., “y ~ x1 + x2” or “~ x1 + x2” (no y). Supports transformations: log(x), sqrt(x), x^2, etc. Use I() to protect expressions: I(x^2)
data (dict or DataFrame) – Data containing the variables named in the formula. Can be a pandas DataFrame or a dict mapping variable names to lists or arrays.
return_type (str, optional) – What to return: “both” (default), “X”, “y”, or “terms”.
as_dataframe (bool, optional) – If True, return pandas DataFrames instead of numpy arrays. This preserves variable names in column names. Default is True.
- Returns:
“both”: tuple (y, X), where y is a 1D array (or DataFrame) and X is a 2D array (or DataFrame) with an intercept column
“X”: design matrix only (or DataFrame if as_dataframe=True)
“y”: response only (or DataFrame if as_dataframe=True)
“terms”: list of term names
- Return type:
Depends on return_type and as_dataframe
Examples
>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data)
>>> X_no_intercept = formula("~ x1 + x2", data, return_type="X")

>>> # With transformations
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> # y is log-transformed, X has x and x^2
>>> y, X = formula("log(y) ~ x + x^2", data)

>>> # Return as DataFrames with column names
>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data, as_dataframe=True)
>>> print(list(X.columns))  # ['(Intercept)', 'x1', 'x2']

>>> # With custom functions (defined or imported in your global scope)
>>> def my_transform(x):
...     return x * 2
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> y, X = formula("y ~ my_transform(x)", data)

>>> # With imported functions (e.g., from scipy)
>>> from scipy.special import erfc
>>> data = {'y': [1, 2, 3], 'x': [0.5, 1.0, 1.5]}
>>> y, X = formula("y ~ erfc(x)", data)

>>> # Custom function on LHS (response variable)
>>> y, X = formula("my_transform(y) ~ x", data)
- greybox.formula.expand_formula(formula_str)
Expand formula with interaction terms.
- Parameters:
formula_str (str) – Formula string, e.g., “y ~ x1 * x2”
- Returns:
Expanded formula with explicit interaction terms.
- Return type:
str
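The expansion rule can be illustrated without the library. The block below is a minimal sketch, not greybox's implementation: expand_star is a hypothetical helper that assumes the formula contains a ~ and that ignores parentheses, - terms and I() protection.

```python
from itertools import combinations


def expand_star(formula_str):
    """Hypothetical helper: rewrite every a * b on the right-hand side
    as a + b + a:b. This is an illustration only, not greybox code."""
    # Everything before the last "~" is the response, the rest are predictors.
    lhs, _, rhs = formula_str.rpartition("~")
    terms = []
    for chunk in rhs.split("+"):
        factors = [f.strip() for f in chunk.split("*")]
        # Emit main effects first, then interactions of increasing order.
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                term = ":".join(combo)
                if term and term not in terms:
                    terms.append(term)
    rhs_out = " + ".join(terms)
    response = lhs.strip()
    return f"{response} ~ {rhs_out}" if response else f"~ {rhs_out}"


print(expand_star("y ~ x1 * x2"))  # y ~ x1 + x2 + x1:x2
```

The same rule extends to three factors: a * b * c yields the three main effects, the three pairwise interactions and a:b:c.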
Formula Syntax Reference
Basic Operators
~: Separates response from predictors
+: Adds a term (include variable)
-: Removes a term
0 or -1: Removes intercept
1: Adds intercept (default)
*: Main effects and interactions (a*b = a + b + a:b)
:: Interaction only
I(): Protect expression from interpretation
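To make the intercept rules above concrete, here is a rough sketch of how they could be read off a formula string. The function has_intercept is hypothetical and is not part of greybox. It assumes simple formulas without parentheses, so a - inside I() or log() would confuse it.

```python
import re


def has_intercept(formula_str):
    """Hypothetical helper: apply the 0 / -1 / 1 rules to the
    right-hand side of a simple formula (no parentheses)."""
    rhs = formula_str.split("~", 1)[-1]
    # Split into (sign, term) pairs; the first term may have no sign.
    tokens = re.findall(r"([+-]?)\s*([^+\-\s]+)", rhs)
    intercept = True  # the intercept is included by default
    for sign, term in tokens:
        if term == "0" and sign != "-":
            intercept = False          # "0 + ..." removes the intercept
        elif term == "1":
            intercept = sign != "-"    # "- 1" removes it, "+ 1" adds it
    return intercept


print(has_intercept("y ~ x1 + x2"))      # True
print(has_intercept("y ~ 0 + x1 + x2"))  # False
print(has_intercept("y ~ x1 - 1"))       # False
```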
Transformations
Supported transformations in formula terms:
log(x) - Natural logarithm
log10(x) - Base 10 logarithm
log2(x) - Base 2 logarithm
sqrt(x) - Square root
exp(x) - Exponential
abs(x) - Absolute value
sin(x), cos(x), tan(x) - Trigonometric
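Each transformation is applied element-wise to the variable before it enters the design matrix. The sketch below shows the idea using only the standard library. The TRANSFORMS table and apply_transform are illustrative names, and greybox may resolve these functions differently (for example through numpy).

```python
import math

# Illustrative lookup table for the transformation names listed above.
TRANSFORMS = {
    "log": math.log,
    "log10": math.log10,
    "log2": math.log2,
    "sqrt": math.sqrt,
    "exp": math.exp,
    "abs": abs,
    "sin": math.sin,
    "cos": math.cos,
    "tan": math.tan,
}


def apply_transform(name, values):
    """Apply the named transformation to every element of a column."""
    func = TRANSFORMS[name]
    return [func(v) for v in values]


print(apply_transform("sqrt", [1, 4, 9]))  # [1.0, 2.0, 3.0]
print(apply_transform("log2", [1, 2, 8]))  # [0.0, 1.0, 3.0]
```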
Polynomial Terms
I(x^2) - Squared term (protected)
I(x^3) - Cubed term
poly(x, 2) - Polynomial (if supported)
Special Variables
trend - Linear time trend (1, 2, 3, …)
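The values of the trend variable are simply the observation index starting at 1. The sketch below builds that column by hand. The add_trend helper is hypothetical, and this page does not say whether greybox creates the column automatically or expects it in the data.

```python
def add_trend(data):
    """Hypothetical helper: return a copy of data with a 'trend'
    column holding 1, 2, ..., n, where n is the number of observations."""
    n = len(next(iter(data.values())))
    out = dict(data)  # shallow copy, the input dict is left unchanged
    out["trend"] = list(range(1, n + 1))
    return out


data = add_trend({"y": [5, 7, 9]})
print(data["trend"])  # [1, 2, 3]
```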
Examples
Basic linear regression:
y, X = formula("y ~ x1 + x2", data)
Without intercept:
y, X = formula("y ~ 0 + x1 + x2", data)
With log transformation:
y, X = formula("log(y) ~ log(x1) + sqrt(x2)", data)
Polynomial regression:
y, X = formula("y ~ x + I(x^2) + I(x^3)", data)
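For intuition, the design matrix implied by this polynomial formula has one column per power of x. The sketch below builds it in plain Python. The poly_columns helper is a hypothetical name, and the column order (intercept first) is an assumption based on the '(Intercept)', 'x1', 'x2' example earlier on this page.

```python
def poly_columns(x, degree=3):
    """Hypothetical helper: rows of [1, x, x^2, ..., x^degree],
    i.e. an intercept followed by the polynomial terms."""
    return [[xi ** p for p in range(degree + 1)] for xi in x]


for row in poly_columns([1, 2, 3]):
    print(row)
# [1, 1, 1, 1]
# [1, 2, 4, 8]
# [1, 3, 9, 27]
```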
Interactions:
y, X = formula("y ~ x1 * x2", data) # equivalent to x1 + x2 + x1:x2
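For numeric predictors, an interaction column in R-style formulas is the element-wise product of the two variables. The sketch below assumes greybox follows that convention, which this page does not state explicitly. The interaction_design helper is hypothetical.

```python
def interaction_design(x1, x2):
    """Hypothetical helper: rows of [1, x1, x2, x1*x2], the columns
    implied by 'y ~ x1 * x2' for numeric predictors."""
    return [[1, a, b, a * b] for a, b in zip(x1, x2)]


for row in interaction_design([4, 5, 6], [7, 8, 9]):
    print(row)
# [1, 4, 7, 28]
# [1, 5, 8, 40]
# [1, 6, 9, 54]
```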