Coding categorical data¶
Patsy allows great flexibility in how categorical data is coded,
via the function C()
. C()
marks some data as being
categorical (including data which would not automatically be treated
as categorical, such as a column of integers), while also optionally
setting the preferred coding scheme and level ordering.
Let’s get some categorical data to work with:
In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly
In [2]: data = demo_data("a", nlevels=3)
In [3]: data
Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}
As you know, simply giving Patsy a categorical variable causes it
to be coded using the default Treatment
coding
scheme. (Strings and booleans are treated as categorical by default.)
In [4]: dmatrix("a", data)
Out[4]:
DesignMatrix with shape (6, 3)
Intercept a[T.a2] a[T.a3]
1 0 0
1 1 0
1 0 1
1 0 0
1 1 0
1 0 1
Terms:
'Intercept' (column 0)
'a' (columns 1:3)
We can also alter the level ordering, which is useful for, e.g.,
Diff
coding:
In [5]: l = ["a3", "a2", "a1"]
In [6]: dmatrix("C(a, levels=l)", data)
Out[6]:
DesignMatrix with shape (6, 3)
Intercept C(a, levels=l)[T.a2] C(a, levels=l)[T.a1]
1 0 1
1 1 0
1 0 0
1 0 1
1 1 0
1 0 0
Terms:
'Intercept' (column 0)
'C(a, levels=l)' (columns 1:3)
But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:
In [7]: dmatrix("C(a, Poly)", data)
Out[7]:
DesignMatrix with shape (6, 3)
Intercept C(a, Poly).Linear C(a, Poly).Quadratic
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
Terms:
'Intercept' (column 0)
'C(a, Poly)' (columns 1:3)
There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:
In [8]: contrast = [[1, 2], [3, 4], [5, 6]]
In [9]: dmatrix("C(a, contrast)", data)
Out[9]:
DesignMatrix with shape (6, 3)
Intercept C(a, contrast)[custom0] C(a, contrast)[custom1]
1 1 2
1 3 4
1 5 6
1 1 2
1 3 4
1 5 6
Terms:
'Intercept' (column 0)
'C(a, contrast)' (columns 1:3)
In [10]: dmatrix("C(a, [[1], [2], [-4]])", data)