Regression Modeling in Practice, Week 4: Test a Logistic Regression Model
SAS syntax to test logit models
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
data new; set mydata.gapminder;
if urbanrate ne . and internetuserate ne . and polityscore ne .; /* setting aside missing data */
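if polityscore > -6 and polityscore < 6 then regtype=0; /* anocracy (scores -5 to 5); reference level */
if polityscore >= 6 then regtype=1; /* democracy (scores 6 to 10); the event modelled under DESCENDING */
/* countries scoring -6 or below keep regtype missing and are dropped by PROC LOGISTIC */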
/* CENTRING THE MEAN */
proc means; var urbanrate internetuserate; /* means procedure to find the mean for centring var's */
data new2; set new; /* creating a new dataset new2 */
urbanrate_ctrd=urbanrate-55.3735484; /* quan exp var 1; creating new variable by centring the mean */
intuserate_ctrd=internetuserate-32.5309464; /* quan exp var 2; creating new variable by centring the mean */
proc means; var urbanrate_ctrd intuserate_ctrd; /* means procedure to validate centring */
/* LOGISTIC REGRESSION MODELS */
proc logistic descending; model regtype=urbanrate_ctrd; /* testing assc'n with urbanization */
proc logistic descending; model regtype=intuserate_ctrd; /* testing assc'n with internet use rate */
proc logistic descending; model regtype=urbanrate_ctrd intuserate_ctrd; run; /* testing for confounding variable */
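The centring step above hard-codes the two means copied from the PROC MEANS output. As a minimal sketch of an alternative, assuming the data set new from the data step already exists, PROC SQL can compute and subtract the means in one pass, so the centred values stay correct if the sample changes. This is an illustration, not the code used to produce the results reported here:

```sas
/* Sketch: centre both variables without hard-coding the means.       */
/* Assumes the data set NEW created by the data step above.           */
proc sql;
  create table new2 as
  select *,
         urbanrate - mean(urbanrate) as urbanrate_ctrd,
         internetuserate - mean(internetuserate) as intuserate_ctrd
  from new;   /* SAS remerges the overall means onto every row */
quit;

/* Both centred means should be zero, up to rounding error. */
proc means data=new2 mean;
  var urbanrate_ctrd intuserate_ctrd;
run;
```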
SAS output, PROC MEANS: centring the explanatory variables and checking the new variables.
SAS output, logit model 1: association with quantitative explanatory variable #1, urbanization.
SAS output, logit model 2: association with quantitative explanatory variable #2, internet use rate.
SAS output, logit model 3: test for confounding, including both explanatory variables (urbanization and internet use rate).
This week, I am testing a logistic regression model with a binary categorical response variable derived from the ‘polity score’ in the Polity IV dataset, and two quantitative explanatory variables, ‘internet use rate’ (2010) and ‘urbanization’ (2008), from the World Bank data available in Gapminder.
In my data management steps, I created a new variable “regtype” by coding countries with polity scores between -5 and 5 (the anocracy regime type) as “0”, and countries with polity scores between 6 and 10 (the democracy regime type) as “1”. Countries with polity scores of -6 or below are not assigned a regtype value, so they are excluded from the models. PROC LOGISTIC with the “descending” option then tested the association between the categorical response variable (i.e., regtype) and the quantitative explanatory variables (i.e., urbanrate_ctrd & intuserate_ctrd), with the presence of democracy as the event being modelled. Note that the “descending” option tells SAS to model the probability of the higher response level, regtype=1 (the presence of democracy), rather than the default lower level, regtype=0 (the anocracy regime category, or the absence of democracy).
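The coding of regtype and the choice of modelled event can both be made explicit. The sketch below assumes the data set new2 from the syntax above; it first tabulates regtype, including missing values, and then names the event directly with the EVENT= response option, which is an alternative to “descending” and was not used for the results reported here:

```sas
/* Check how many countries fall in each category (and how many are missing). */
proc freq data=new2;
  tables regtype / missing;
run;

/* Name the modelled event explicitly instead of relying on DESCENDING. */
proc logistic data=new2;
  model regtype(event='1') = urbanrate_ctrd;
run;
```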
The key research question is: What happens to the odds of the presence of democracy when we account for urbanization and internet use rates in a sample of countries?
I am setting up three separate models to answer my question: two models that each test the association of one explanatory variable with the categorical response variable (i.e., regtype-urbanrate_ctrd, and regtype-intuserate_ctrd), and one model that tests for confounding effects by including both explanatory variables (i.e., regtype-urbanrate_ctrd & intuserate_ctrd).
My hypothesis is that both explanatory variables are significantly associated with the categorical response variable (even after testing for confounding effects), and that the odds of the presence of democracy increase as the values of the explanatory variables increase; in other words, I expect the odds ratio in each test to be greater than 1.
Instead of looking for expected values of the response (as in a multiple linear regression model, where the response variable is quantitative), I am modelling the probability of an event occurring, because the response variable is categorical and can only take on the values 0 and 1. The results are reported as odds ratios. Since my explanatory variables are both quantitative (as opposed to being categorical), the odds ratios show by what factor the odds of the presence of democracy change in the sample with every one-unit increase in the explanatory variable(s).
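As a sketch of what the odds ratio measures for a single explanatory variable (the symbols p, x, and the betas are my own notation, not SAS output; p is the probability that regtype=1 and x is the centred explanatory variable):

```latex
\[
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x,
\qquad
\mathrm{OR} = e^{\beta_1}
\]
```

An odds ratio above 1 corresponds to a positive coefficient, so the odds of democracy rise as x rises; an odds ratio of exactly 1 means no association.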
(i) For association with quantitative explanatory variable #1, i.e., urbanrate_ctrd:
The results show that the odds ratio estimate is 1.031 and the p-value is 0.0005, which is significant. There is therefore a significant positive association: with every one-unit increase in urbanization rate, the odds of the presence of democracy are multiplied by 1.031, that is, they increase by about 3.1% in the sample. The 95% confidence interval runs from 1.014 to 1.050, so we can be 95% confident that the population odds ratio lies in this range; in other words, the odds of the presence of democracy increase by somewhere between 1.4% and 5.0% with every one-unit increase in urbanization rate.
(ii) For association with quantitative explanatory variable #2, i.e., intuserate_ctrd:
The results show that the odds ratio estimate is 1.058 and the p-value is less than .0001, which is significant. So, there is a positive association between internet use rate and the presence of democracy. With every one-unit increase in internet use rate, the odds of the presence of democracy in the sample are multiplied by 1.058, an increase of about 5.8%. The 95% confidence interval of 1.034 to 1.082 means we can be 95% confident that, in the population, this increase in the odds lies somewhere between 3.4% and 8.2%.
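An odds ratio converts to a percent change in the odds by subtracting 1 and multiplying by 100. Worked through with the estimate and confidence limits reported for urbanization:

```latex
\[
\%\ \text{change in odds} = (\mathrm{OR} - 1) \times 100
\]
\[
(1.031 - 1) \times 100 = 3.1\%,
\qquad
(1.014 - 1) \times 100 = 1.4\%,
\qquad
(1.050 - 1) \times 100 = 5.0\%
\]
```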
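A one-unit step is small relative to the range of internet use rate, so the per-unit odds ratio looks modest. The sketch below, assuming the data set new2 from the syntax above, uses the UNITS statement in PROC LOGISTIC to report the odds ratio for a larger step; the 10-unit step is my own illustrative choice and was not part of the reported analysis:

```sas
/* Sketch: odds ratio for a 10-unit increase in internet use rate. */
proc logistic data=new2 descending;
  model regtype = intuserate_ctrd;
  units intuserate_ctrd = 10;   /* hypothetical step size, chosen for illustration */
run;
```

From the rounded estimate, a 10-unit increase would correspond to an odds ratio of roughly 1.058 raised to the power 10, or about 1.76. SAS works from the unrounded coefficient, so its figure may differ slightly.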
(iii) For association with both urbanrate_ctrd and intuserate_ctrd explanatory variables:
When we account for internet use rate in the model, urbanization is no longer significantly associated with the presence of democracy (the p-value is 0.3995, much higher than the alpha of 0.05). This suggests that internet use rate confounds the association between urbanization and the presence of democracy. The p-value for internet use rate is still significant, at less than .0001. We may interpret the results to say that the odds of the presence of democracy are multiplied by 1.066, an increase of about 6.6%, with every one-unit increase in internet use rate, when we also account for urbanization rate in the model. The 95% Wald Confidence Limits of 1.035 and 1.098 tell us that we can be 95% confident that the population odds ratio lies in this range, that is, that the increase in the odds is somewhere between 3.5% and 9.8%.
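Confounding of this kind requires the two explanatory variables to be related to each other. A quick check, sketched here under the assumption that new2 exists as in the syntax above, is to correlate them among the countries that enter the model. This correlation was not part of the reported analysis, so no result is claimed:

```sas
/* Sketch: correlation between the two centred explanatory variables, */
/* restricted to countries with a non-missing regtype.                */
proc corr data=new2;
  where regtype ne .;
  var urbanrate_ctrd intuserate_ctrd;
run;
```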
(iv) Summary conclusion and hypothesis testing:
The odds ratios are sample statistics and show the factor by which the odds of the presence of democracy (y, or the response variable) change for each one-unit increase in x, or the explanatory variable. There is a positive association between the presence of democracy and each explanatory variable when they are tested separately in two different models. However, the test for confounding shows that the hypothesis was only partially supported: when both explanatory variables are included in the same logit model, internet use rate, not urbanization, is significantly associated with the odds of the presence of democracy in the sample.