Formula

greybox.formula.formula(formula_str, data, return_type='both', as_dataframe=True)

Parse formula string and return design matrix and/or response.

Parameters:
  • formula_str (str) – R-style formula string, e.g., “y ~ x1 + x2” or “~ x1 + x2” (no response). Supports transformations such as log(x), sqrt(x), and x^2. Use I() to protect arithmetic expressions from formula interpretation, e.g., I(x^2).

  • data (dict or DataFrame) – Data containing the variables named in the formula. Can be a DataFrame or a dict mapping variable names to lists or arrays.

  • return_type (str, optional) – What to return: “both” (default), “X”, “y”, or “terms”.

  • as_dataframe (bool, optional) – If True, return pandas DataFrames instead of numpy arrays. This preserves variable names in column names. Default is True.

Returns:

  • “both”: tuple (y, X) where y is 1D array (or DataFrame) and X is 2D array (or DataFrame) with intercept

  • “X”: design matrix only (or DataFrame if as_dataframe=True)

  • “y”: response only (or DataFrame if as_dataframe=True)

  • ”terms”: list of term names

Return type:

Depends on return_type and as_dataframe

Examples

>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data)
>>> # No response on the left-hand side; X still includes the intercept by default
>>> X_only = formula("~ x1 + x2", data, return_type="X")
>>> # With transformations
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> # y is log-transformed, X has x and x^2
>>> y, X = formula("log(y) ~ x + x^2", data)
>>> # Return as DataFrames with column names
>>> data = {'y': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]}
>>> y, X = formula("y ~ x1 + x2", data, as_dataframe=True)
>>> print(list(X.columns))  # ['(Intercept)', 'x1', 'x2']
>>> # With custom functions (defined or imported in your global scope)
>>> def my_transform(x):
...     return x * 2
>>> data = {'y': [1, 2, 3], 'x': [1, 2, 3]}
>>> y, X = formula("y ~ my_transform(x)", data)
>>> # With imported functions (e.g., from scipy)
>>> from scipy.special import erfc
>>> data = {'y': [1, 2, 3], 'x': [0.5, 1.0, 1.5]}
>>> y, X = formula("y ~ erfc(x)", data)
>>> # Custom function on LHS (response variable)
>>> y, X = formula("my_transform(y) ~ x", data)

greybox.formula.expand_formula(formula_str)

Expand formula with interaction terms.

Parameters:

formula_str (str) – Formula string, e.g., “y ~ x1 * x2”

Returns:

Expanded formula with explicit interaction terms.

Return type:

str
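
The expansion described above can be illustrated with a minimal pure-Python sketch. It is not greybox's implementation: expand_star is a hypothetical helper that assumes the formula contains a ~, and it handles only + and * at the top level (no parentheses, no - terms, no function calls containing + or *).

```python
from itertools import combinations


def expand_star(formula_str):
    """Expand every `a * b * ...` term into main effects plus interactions.

    Hypothetical helper for illustration only; not part of greybox.
    """
    lhs, sep, rhs = formula_str.partition("~")
    expanded = []
    for term in (t.strip() for t in rhs.split("+")):
        factors = [f.strip() for f in term.split("*")]
        # Orders 1..k: single variables first, then pairs, triples, ...
        for order in range(1, len(factors) + 1):
            for combo in combinations(factors, order):
                name = ":".join(combo)
                if name not in expanded:  # drop duplicate terms
                    expanded.append(name)
    return f"{lhs.strip()} {sep} {' + '.join(expanded)}".strip()


print(expand_star("y ~ x1 * x2"))  # y ~ x1 + x2 + x1:x2
```

A three-way product such as a * b * c expands in the same way into all main effects, all pairwise interactions, and the three-way interaction a:b:c.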

Formula Syntax Reference

Basic Operators

  • ~ : Separates response from predictors

  • + : Adds a term (includes the variable)

  • - : Removes a term

  • 0 or -1 : Removes intercept

  • 1 : Adds intercept (default)

  • * : Main effects and interactions (a*b = a + b + a:b)

  • : : Interaction only

  • I() : Protect expression from interpretation
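
To make the operators above concrete, here is a rough sketch of how a list of already-parsed terms could map to design-matrix columns for numeric variables. design_matrix is a hypothetical helper, not a greybox function; the only detail taken from this page is the “(Intercept)” column name. An interaction a:b between numeric variables is the elementwise product of its factors.

```python
def design_matrix(terms, data, intercept=True):
    """Build design-matrix columns from a list of term names.

    Hypothetical helper: `terms` is an already-parsed list such as
    ["x1", "x2", "x1:x2"]; parsing the formula string is out of scope.
    """
    n = len(next(iter(data.values())))
    columns = {}
    if intercept:
        columns["(Intercept)"] = [1.0] * n
    for term in terms:
        # Multiply the factors of the term together, element by element.
        col = [1.0] * n
        for factor in term.split(":"):
            col = [c * v for c, v in zip(col, data[factor])]
        columns[term] = col
    return columns


data = {"y": [1, 2, 3], "x1": [4, 5, 6], "x2": [7, 8, 9]}
X = design_matrix(["x1", "x2", "x1:x2"], data)
print(list(X))     # ['(Intercept)', 'x1', 'x2', 'x1:x2']
print(X["x1:x2"])  # [28.0, 40.0, 54.0]
```

Passing intercept=False in this sketch corresponds to writing 0 or -1 in the formula.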

Transformations

Supported transformations in formula terms:

  • log(x) - Natural logarithm

  • log10(x) - Base 10 logarithm

  • log2(x) - Base 2 logarithm

  • sqrt(x) - Square root

  • exp(x) - Exponential

  • abs(x) - Absolute value

  • sin(x), cos(x), tan(x) - Trigonometric
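
As a sketch of what these names mean numerically, the table below maps each one to a function from Python's math module. The TRANSFORMS dict and apply_transform are illustrative assumptions, not greybox internals, and the library may well use vectorised equivalents instead.

```python
import math

# Assumed mapping from formula function names to scalar functions.
TRANSFORMS = {
    "log": math.log,      # natural logarithm
    "log10": math.log10,  # base 10
    "log2": math.log2,    # base 2
    "sqrt": math.sqrt,
    "exp": math.exp,
    "abs": abs,
    "sin": math.sin,
    "cos": math.cos,
    "tan": math.tan,
}


def apply_transform(name, values):
    """Apply the named transformation to every value (hypothetical helper)."""
    return [TRANSFORMS[name](v) for v in values]


print(apply_transform("log2", [1, 2, 4]))   # [0.0, 1.0, 2.0]
print(apply_transform("sqrt", [4, 9, 16]))  # [2.0, 3.0, 4.0]
```

Note that math.log and math.sqrt raise ValueError for inputs outside their domain, so data with zeros or negative values needs checking before a log or sqrt transformation in this sketch.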

Polynomial Terms

  • I(x^2) - Squared term (protected)

  • I(x^3) - Cubed term

  • poly(x, 2) - Polynomial (if supported)
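
The columns that I(x^2) and I(x^3) stand for can be sketched in plain Python. In Python source, ^ is bitwise XOR, so the power in a formula string corresponds to ** when computed by hand. raw_poly is a hypothetical helper that produces raw powers only; whether poly(x, 2) in greybox yields raw or orthogonal polynomials is not stated here (R's poly() is orthogonal by default).

```python
def raw_poly(x, degree):
    """Return raw power columns x^1 .. x^degree (hypothetical helper)."""
    return {
        ("x" if p == 1 else f"I(x^{p})"): [v ** p for v in x]
        for p in range(1, degree + 1)
    }


cols = raw_poly([1, 2, 3], 3)
print(list(cols))      # ['x', 'I(x^2)', 'I(x^3)']
print(cols["I(x^2)"])  # [1, 4, 9]
print(cols["I(x^3)"])  # [1, 8, 27]
```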

Special Variables

  • trend - Linear time trend (1, 2, 3, …)
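
A trend variable of this kind is simply the observation index starting at 1. The sketch below builds it by hand; add_trend is a hypothetical helper, and how greybox itself generates the column is not shown on this page.

```python
def add_trend(data):
    """Return a copy of `data` with a 1-based linear trend column added."""
    n = len(next(iter(data.values())))
    return {**data, "trend": list(range(1, n + 1))}


data_with_trend = add_trend({"y": [10, 12, 15, 19]})
print(data_with_trend["trend"])  # [1, 2, 3, 4]
```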

Examples

Basic linear regression:

y, X = formula("y ~ x1 + x2", data)

Without intercept:

y, X = formula("y ~ 0 + x1 + x2", data)

With log transformation:

y, X = formula("log(y) ~ log(x1) + sqrt(x2)", data)

Polynomial regression:

y, X = formula("y ~ x + I(x^2) + I(x^3)", data)

Interactions:

y, X = formula("y ~ x1 * x2", data)  # equivalent to x1 + x2 + x1:x2