Predictor Coding Schemes for Regression
I always learn which coding schemes mean what when I need to for a project and then forget again by the next project. Better get it written down!
It's also good to have this lying around:
Interpretation of the coefficients in logistic regression
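log(P(y = 1|x)/P(y = 0|x)) = b0 + b1*x1 + ... + bk*xk, so each coefficient is the change in the log odds of y = 1 for a one-unit change in its predictor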
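To make the log-odds scale concrete, here is a minimal sketch in Python. The coefficient values are invented for illustration and do not come from any real model; the point is only that a coefficient adds to the log odds, and exponentiating it gives an odds ratio.

```python
import math

def logit(p):
    """Log odds of a probability p, i.e. log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Map log odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted coefficients, made up for illustration only.
intercept = -1.0  # log odds of y = 1 when x = 0
slope = 0.5       # change in log odds per one-unit increase in x

p_at_0 = inv_logit(intercept)
p_at_1 = inv_logit(intercept + slope)

# exp(slope) is the odds ratio for a one-unit increase in x.
odds_ratio = math.exp(slope)
odds_0 = p_at_0 / (1 - p_at_0)
odds_1 = p_at_1 / (1 - p_at_1)

print(p_at_0, p_at_1, odds_ratio)
```

Note that the change in probability depends on where you start, while the change in log odds (and so the odds ratio) is the same everywhere.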
Rules for Coding
My notes from Brian Dillon's class tell me that:
levels to be grouped together get same sign
levels to be contrasted get opposite
excluded levels get 0
contrasts have to sum to zero if you want to use interactions
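These rules are easy to check mechanically. Below is a small Python sketch; the helper name `columns_sum_to_zero` and the four-level example contrast are my own inventions for illustration, not something from the class notes.

```python
def columns_sum_to_zero(matrix, tol=1e-9):
    """True if every contrast column sums to (approximately) zero."""
    return all(abs(sum(col)) < tol for col in zip(*matrix))

# One hypothetical contrast over four levels A, B, C, D:
# A and B are grouped (same sign), contrasted with C (opposite sign),
# and D is excluded (0).
grouped_contrast = [[0.5], [0.5], [-1], [0]]

# Three-level matrices: rows are levels, columns are contrasts.
treatment = [[0, 0], [1, 0], [0, 1]]
sum_coding = [[-1, -1], [1, 0], [0, 1]]

print(columns_sum_to_zero(grouped_contrast))  # True
print(columns_sum_to_zero(treatment))         # False
print(columns_sum_to_zero(sum_coding))        # True
```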
Types of Coding
dummy coding or treatment coding:
picks a reference level of the factor and compares other levels to that one
intercept is the mean of the reference level
don't use with interactions (the other terms then become simple effects at the reference level, not main effects)
contrast matrix for a factor with 3 levels (rows are the levels, columns are the contrasts for levels 2 and 3 against level 1)
      2     3
1     0     0
2     1     0
3     0     1
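To see what the treatment-coded coefficients mean, here is a sketch that recovers them from a set of cell means. The means (10, 12, 17) are made up, and the little `solve` helper is just a hand-rolled stand-in for whatever your regression software does; with one factor and no other predictors, the coefficients simply have to reproduce the cell means.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(b)
    # Augmented matrix [a | b]
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

# Hypothetical cell means for levels 1, 2, 3 (made up for illustration).
cell_means = [10, 12, 17]

# Design rows: intercept column plus the treatment codes for each level.
treatment_design = [
    [1, 0, 0],  # level 1 (reference)
    [1, 1, 0],  # level 2
    [1, 0, 1],  # level 3
]

intercept, b2, b3 = solve(treatment_design, cell_means)
print(intercept, b2, b3)  # 10 2 7
```

The intercept comes out as the mean of level 1, and the other two coefficients are the differences of levels 2 and 3 from level 1.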
simple coding
compares each non-reference level to the reference level
intercept is the grand mean
can use interactions
contrast matrix:
      2     3
1   -.33  -.33
2    .66  -.33
3   -.33   .66
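One way to think about simple coding is as treatment coding with 1/k subtracted from every cell, where k is the number of levels (so the decimals are really -1/3 and 2/3). The sketch below builds the matrix that way and checks it against made-up cell means; the function names and the numbers are assumptions for illustration only.

```python
from fractions import Fraction

def treatment_coding(k):
    """k x (k-1) dummy matrix with level 1 as the reference (all zeros)."""
    return [[Fraction(int(row == col)) for col in range(1, k)] for row in range(k)]

def simple_coding(k):
    """Treatment coding shifted by -1/k so that every column sums to zero."""
    return [[cell - Fraction(1, k) for cell in row] for row in treatment_coding(k)]

codes = simple_coding(3)

# Hypothetical cell means for levels 1, 2, 3 (made up for illustration).
cell_means = [10, 12, 17]
grand_mean = Fraction(sum(cell_means), len(cell_means))

# Claimed coefficients: intercept = grand mean, slopes = differences from level 1.
intercept = grand_mean
b2 = cell_means[1] - cell_means[0]
b3 = cell_means[2] - cell_means[0]

# Check that these coefficients reproduce every cell mean.
fitted = [intercept + c2 * b2 + c3 * b3 for c2, c3 in codes]
print(fitted == cell_means)  # True
```

So the slopes are the same level-versus-reference differences as in treatment coding; only the intercept moves, from the reference mean to the grand mean (here the mean of the cell means).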
sum coding or deviation coding
compares each level to the grand mean
intercept is the grand mean
can use interactions
contrast matrix:
      2     3
1    -1    -1
2     1     0
3     0     1
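A quick check of the sum coding interpretation, again with made-up cell means and assuming the "grand mean" is the mean of the cell means (as in a balanced design). The sketch sets each coefficient to a level's deviation from the grand mean and confirms that the coding reproduces all three cell means.

```python
# Hypothetical cell means for levels 1, 2, 3 (made up for illustration).
cell_means = {"1": 10.0, "2": 12.0, "3": 17.0}

# Sum (deviation) codes from the matrix above: level 1 gets -1 in both columns.
sum_codes = {"1": (-1, -1), "2": (1, 0), "3": (0, 1)}

grand_mean = sum(cell_means.values()) / len(cell_means)

# Claimed coefficients: each one is a level's deviation from the grand mean.
intercept = grand_mean
b2 = cell_means["2"] - grand_mean
b3 = cell_means["3"] - grand_mean

# Check that intercept + codes * coefficients reproduces every cell mean.
fitted = {
    level: intercept + c2 * b2 + c3 * b3
    for level, (c2, c3) in sum_codes.items()
}
print(fitted)

# Level 1 has no coefficient of its own; its deviation is minus the sum of the others.
deviation_level_1 = -(b2 + b3)
```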
The link below has even more, but that's all I need for now!
Sources:
Brian Dillon's Quantitative Methods class in the UMass Linguistics Department
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm














