patsy
API reference¶
This is a complete reference for everything you get when you import patsy.
Basic API¶
-
patsy.
dmatrix
(formula_like, data={}, eval_env=0, NA_action='drop', return_type='matrix')¶ Construct a single design matrix given a formula_like and data.
- Parameters
formula_like – An object that can be used to construct a design matrix. See below.
data – A dict-like object that can be used to look up variables referenced in formula_like.
eval_env – Either a
EvalEnvironment
which will be used to look up any variables referenced in formula_like that cannot be found in data, or else a depth represented as an integer which will be passed toEvalEnvironment.capture()
.eval_env=0
means to use the context of the function callingdmatrix()
for lookups. If calling this function from a library, you probably wanteval_env=1
, which means that variables should be resolved in your caller’s namespace.NA_action – What to do with rows that contain missing values. You can
"drop"
them,"raise"
an error, or for customization, pass anNAAction
object. SeeNAAction
for details on what values count as ‘missing’ (and how to alter this).return_type – Either
"matrix"
or"dataframe"
. See below.
The formula_like can take a variety of forms. You can use any of the following:
(The most common option) A formula string like
"x1 + x2"
(fordmatrix()
) or"y ~ x1 + x2"
(fordmatrices()
). For details see How formulas work.A
ModelDesc
, which is a Python object representation of a formula. See How formulas work and Model specification for experts and computers for details.A
DesignInfo
.An object that has a method called
__patsy_get_model_desc__()
. For details see Model specification for experts and computers.A numpy array_like (for
dmatrix()
) or a tuple (array_like, array_like) (fordmatrices()
). These will have metadata added, representation normalized, and then be returned directly. In this case data and eval_env are ignored. There is special handling for two cases:DesignMatrix
objects will have theirDesignInfo
preserved. This allows you to set up custom column names and term information even if you aren’t using the rest of the patsy machinery.pandas.DataFrame
orpandas.Series
objects will have their (row) indexes checked. If two are passed in, their indexes must be aligned. Ifreturn_type="dataframe"
, then their indexes will be preserved on the output.
Regardless of the input, the return type is always either:
A
DesignMatrix
, ifreturn_type="matrix"
(the default)A
pandas.DataFrame
, ifreturn_type="dataframe"
.
The actual contents of the design matrix is identical in both cases, and in both cases a
DesignInfo
object will be available in a.design_info
attribute on the return value. However, forreturn_type="dataframe"
, any pandas indexes on the input (either in data or directly passed through formula_like) will be preserved, which may be useful for e.g. time-series models.New in version 0.2.0: The
NA_action
argument.
-
patsy.
dmatrices
(formula_like, data={}, eval_env=0, NA_action='drop', return_type='matrix')¶ Construct two design matrices given a formula_like and data.
This function is identical to
dmatrix()
, except that it requires (and returns) two matrices instead of one. By convention, the first matrix is the “outcome” or “y” data, and the second is the “predictor” or “x” data.See
dmatrix()
for details.
-
patsy.
incr_dbuilders
(formula_like, data_iter_maker, eval_env=0, NA_action='drop')¶ Construct two design matrix builders incrementally from a large data set.
incr_dbuilders()
is toincr_dbuilder()
asdmatrices()
is todmatrix()
. Seeincr_dbuilder()
for details.
-
patsy.
incr_dbuilder
(formula_like, data_iter_maker, eval_env=0, NA_action='drop')¶ Construct a design matrix builder incrementally from a large data set.
- Parameters
formula_like – Similar to
dmatrix()
, except that explicit matrices are not allowed. Must be a formula string, aModelDesc
, aDesignInfo
, or an object with a__patsy_get_model_desc__
method.data_iter_maker – A zero-argument callable which returns an iterator over dict-like data objects. This must be a callable rather than a simple iterator because sufficiently complex formulas may require multiple passes over the data (e.g. if there are nested stateful transforms).
eval_env – Either a
EvalEnvironment
which will be used to look up any variables referenced in formula_like that cannot be found in data, or else a depth represented as an integer which will be passed toEvalEnvironment.capture()
.eval_env=0
means to use the context of the function callingincr_dbuilder()
for lookups. If calling this function from a library, you probably wanteval_env=1
, which means that variables should be resolved in your caller’s namespace.NA_action – An
NAAction
object or string, used to determine what values count as ‘missing’ for purposes of determining the levels of categorical factors.
- Returns
Tip: for data_iter_maker, write a generator like:
def iter_maker(): for data_chunk in my_data_store: yield data_chunk
and pass iter_maker (not iter_maker()).
New in version 0.2.0: The
NA_action
argument.
-
exception
patsy.
PatsyError
(message, origin=None)¶ This is the main error type raised by Patsy functions.
In addition to the usual Python exception features, you can pass a second argument to this function specifying the origin of the error; this is included in any error message, and used to help the user locate errors arising from malformed formulas. This second argument should be an
Origin
object, or else an arbitrary object with a.origin
attribute. (If it is neither of these things, then it will simply be ignored.)For ordinary display to the user with default formatting, use
str(exc)
. If you want to do something cleverer, you can use the.message
and.origin
attributes directly. (The latter may be None.)
Convenience utilities¶
-
patsy.
balanced
(factor_name=num_levels[, factor_name=num_levels, ..., repeat=1])¶ Create simple balanced factorial designs for testing.
Given some factor names and the number of desired levels for each, generates a balanced factorial design in the form of a data dictionary. For example:
In [1]: balanced(a=2, b=3) Out[1]: {'a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'], 'b': ['b1', 'b2', 'b3', 'b1', 'b2', 'b3']}
By default it produces exactly one instance of each combination of levels, but if you want multiple replicates this can be accomplished via the repeat argument:
In [2]: balanced(a=2, b=2, repeat=2) Out[2]: {'a': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'], 'b': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2']}
-
patsy.
demo_data
(*names, nlevels=2, min_rows=5)¶ Create simple categorical/numerical demo data.
Pass in a set of variable names, and this function will return a simple data set using those variable names.
Names whose first letter falls in the range “a” through “m” will be made categorical (with nlevels levels). Those that start with a “p” through “z” are numerical.
We attempt to produce a balanced design on the categorical variables, repeating as necessary to generate at least min_rows data points. Categorical variables are returned as a list of strings.
Numerical data is generated by sampling from a normal distribution. A fixed random seed is used, so that identical calls to demo_data() will produce identical results. Numerical data is returned in a numpy array.
Example:
Design metadata¶
-
class
patsy.
DesignInfo
(column_names, factor_infos=None, term_codings=None)¶ A DesignInfo object holds metadata about a design matrix.
This is the main object that Patsy uses to pass metadata about a design matrix to statistical libraries, in order to allow further downstream processing like intelligent tests, prediction on new data, etc. Usually encountered as the .design_info attribute on design matrices.
Here’s an example of the most common way to get a
DesignInfo
:In [3]: mat = dmatrix("a + x", demo_data("a", "x", nlevels=3)) In [4]: di = mat.design_info
-
column_names
¶ The names of each column, represented as a list of strings in the proper order. Guaranteed to exist.
In [5]: di.column_names Out[5]: ['Intercept', 'a[T.a2]', 'a[T.a3]', 'x']
-
column_name_indexes
¶ An
OrderedDict
mapping column names (as strings) to column indexes (as integers). Guaranteed to exist and to be sorted from low to high.In [6]: di.column_name_indexes Out[6]: OrderedDict([('Intercept', 0), ('a[T.a2]', 1), ('a[T.a3]', 2), ('x', 3)])
-
term_names
¶ The names of each term, represented as a list of strings in the proper order. Guaranteed to exist. There is a one-to-many relationship between columns and terms – each term generates one or more columns.
In [7]: di.term_names Out[7]: ['Intercept', 'a', 'x']
-
term_name_slices
¶ An
OrderedDict
mapping term names (as strings) to Pythonslice()
objects indicating which columns correspond to each term. Guaranteed to exist. The slices are guaranteed to be sorted from left to right and to cover the whole range of columns with no overlaps or gaps.In [8]: di.term_name_slices Out[8]: OrderedDict([('Intercept', slice(0, 1, None)), ('a', slice(1, 3, None)), ('x', slice(3, 4, None))])
-
terms
¶ A list of
Term
objects representing each term. May be None, for example if a user passed in a plain preassembled design matrix rather than using the Patsy machinery.In [9]: di.terms Out[9]: [Term([]), Term([EvalFactor('a')]), Term([EvalFactor('x')])] In [10]: [term.name() for term in di.terms]
-