When creating dummy variables for a categorical variable, why do we need to discard one of them?

If you have 3 groups for race, you need only 2 dummy variables to represent membership in a race group.

In general, for k groups, you use only (k-1) dummy variables.

It’s helpful to think of each dummy variable as a yes/no question about group membership.

Suppose your race groups are:

1 = white

2 = black

3 = other

Dummy variable 1 answers the question: Do you identify yourself as white? 0 = no, 1 = yes.

Dummy variable 2 answers the question: Do you identify yourself as black? 0 = no, 1 = yes.

Provided that your groups are mutually exclusive and exhaustive, a person who answers no to the first two questions must be a member of group 3, other race.
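The two yes/no questions can be sketched in Python as follows. This is only a minimal illustration: the function names `encode_race` and `decode_race` are hypothetical, and the group codes 1, 2, 3 are the ones listed above.

```python
def encode_race(group):
    """Return (dummy1, dummy2) for group code 1 (white), 2 (black), or 3 (other)."""
    if group not in (1, 2, 3):
        raise ValueError("group must be 1, 2, or 3")
    dummy1 = 1 if group == 1 else 0  # Do you identify yourself as white?
    dummy2 = 1 if group == 2 else 0  # Do you identify yourself as black?
    return dummy1, dummy2


def decode_race(dummy1, dummy2):
    """Recover the group code from the two dummy variables alone."""
    if (dummy1, dummy2) == (1, 0):
        return 1
    if (dummy1, dummy2) == (0, 1):
        return 2
    if (dummy1, dummy2) == (0, 0):
        return 3  # "no" to both questions means group 3, other
    # (1, 1) cannot occur when the groups are mutually exclusive.
    raise ValueError("dummies must not both be 1")


for group in (1, 2, 3):
    print(group, encode_race(group))
# 1 (1, 0)
# 2 (0, 1)
# 3 (0, 0)
```

Because `decode_race` recovers every group from just two dummies, a third dummy would carry no additional information.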

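In fact, if you try to include a third dummy variable in this situation, a regression model that includes an intercept cannot be estimated uniquely, because the scores on the third dummy variable are perfectly predictable from the first two (dummy 3 = 1 - dummy 1 - dummy 2). This is perfect multicollinearity, often called the dummy variable trap. Depending on the software, the analysis will either fail with an error or silently drop one of the variables.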
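The perfect predictability can be checked numerically. The sketch below assumes a regression with an intercept column and uses a made-up sample of six people. The helper `matrix_rank` is a hypothetical function written for this example, not a library call. It shows that adding the third dummy adds a column without adding rank.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical sample: group codes for six people (illustrative data only).
groups = [1, 2, 3, 1, 3, 2]

# Columns: intercept, dummy 1 (white), dummy 2 (black).
two_dummies = [[1, int(g == 1), int(g == 2)] for g in groups]
# Same columns plus a third dummy for "other".
three_dummies = [row + [int(g == 3)] for row, g in zip(two_dummies, groups)]

rank_two = matrix_rank(two_dummies)      # 3 columns
rank_three = matrix_rank(three_dummies)  # 4 columns

print(rank_two, rank_three)
# 3 3
```

With two dummies the rank equals the number of columns (3), so the coefficients are identifiable. With three dummies there are 4 columns but the rank is still 3, so no unique least-squares solution exists.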