Regression Modelling in Practice (Week 3: Testing a Multiple Regression Model)
The aim this week is to fit and test a multiple regression model for a given response variable. Building on last week's basic linear fit for my response variable suicideper100th, this week I first tested the fit of a simple linear model based on internetuserate and then of a quadratic model based on internetuserate. I also added a second predictor, alcconsumption, to explore the association between suicide rate and alcohol consumption while controlling for internet use rate.
For testing each model, we set up the following hypotheses:
Null hypothesis (H0): there is no association between the response variable and the predictor.
Alternative hypothesis (H1): there is a significant association between the response variable and the predictor.
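Stated for a single regression coefficient, these hypotheses take the standard t-test form shown below. The symbols \beta_j (true coefficient), b_j (its estimate), n (number of observations) and p (number of predictor terms) are notation introduced here for illustration and do not appear in the original analysis.

```latex
\[
H_0:\ \beta_j = 0
\qquad \text{versus} \qquad
H_1:\ \beta_j \neq 0,
\qquad
t = \frac{b_j}{\mathrm{SE}(b_j)} \ \sim\ t_{\,n-p-1} \quad \text{under } H_0
\]
```

H0 is rejected at the 5% level of significance when the p-value attached to this t statistic is below 0.05.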
The SAS code used is given below:
libname mydata "/courses/d1406ae5ba27fe300/" access=readonly;

data new;
  set mydata.gapminder;
  label internetuserate_c = "Centered Internet Use Rate"
        alcconsumption_c = "Centered Alcohol Consumption";
run;

**********************************************************************
POLYNOMIAL REGRESSION
**********************************************************************;
* scatterplot with linear regression line suicideper100th response variable;
proc sgplot;
  reg x=internetuserate y=suicideper100th / lineattrs=(color=blue thickness=2) clm;
  yaxis label="Suicide Rate (per 100,000 persons)";
  xaxis label="Internet Use Rate";
run;

* scatterplot with linear and quadratic regression line;
proc sgplot;
  reg x=internetuserate y=suicideper100th / lineattrs=(color=blue thickness=2) degree=1 clm;
  reg x=internetuserate y=suicideper100th / lineattrs=(color=green thickness=2) degree=2 clm;
  yaxis label="Suicide Rate (per 100,000 persons)";
  xaxis label="Internet Use Rate";
run;
* centering quantitative explanatory variables;
data new2;
  set new;
  if suicideper100th ne . and internetuserate ne . and alcconsumption ne .;
  internetuserate_c = internetuserate - 33.7858281;
  alcconsumption_c = alcconsumption - 6.7778409;
run;

proc means;
  var internetuserate alcconsumption;
run;

* check coding;
proc means;
  var internetuserate_c alcconsumption_c;
run;
* linear regression model;
PROC glm;
  model suicideper100th=internetuserate_c/solution clparm;
run;

PROC glm;
  model suicideper100th=alcconsumption_c/solution clparm;
run;

* polynomial regression model;
PROC glm;
  model suicideper100th=internetuserate_c internetuserate_c*internetuserate_c/solution clparm;
run;
**********************************************************************
EVALUATING MODEL FIT
**********************************************************************;

* multiple regression adding alcohol consumption;
PROC glm;
  model suicideper100th=internetuserate_c internetuserate_c*internetuserate_c alcconsumption_c/solution clparm;
run;

* request regression diagnostic plots;
PROC glm PLOTS(unpack)=all;
  model suicideper100th=internetuserate_c internetuserate_c*internetuserate_c alcconsumption_c/solution clparm;
  output residual=res student=stdres out=results;
run;

* plot of standardized residuals for each observation;
proc gplot;
  label stdres="Standardized Residual";
  plot stdres*country/vref=0;
run;

* using proc reg to get a partial regression plot;
* calculate quadratic terms;
data partial;
  set new2;
  internetuserate2=internetuserate_c*internetuserate_c;
run;
*partial regression plot;
PROC reg plots=partial;
model suicideper100th=internetuserate internetuserate2 alcconsumption/partial;
run;
ANALYSIS
1) Centering the predictors
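The means of the predictors internetuserate and alcconsumption before centering were 33.7641864 and 6.7778409, respectively. After centering, the means of both predictors were almost 0. Note that the constant subtracted from internetuserate in the code (33.7858281) differs slightly from the mean reported here, so the centered mean of internet use rate is close to, but not exactly, 0.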
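The centering constants in the SAS code are typed in by hand. A minimal sketch of an alternative, assuming the same data set new and the same variable names, is to let SAS compute the means on the complete-case sample itself. PROC STDIZE with METHOD=MEAN subtracts the sample mean and leaves the scale unchanged. The data set names complete and new2_auto are hypothetical names introduced for illustration.

```sas
/* keep only complete cases, as in the original data step,      */
/* and copy the raw predictors into the variables to be centered */
data complete;
  set new;
  if suicideper100th ne . and internetuserate ne . and alcconsumption ne .;
  internetuserate_c = internetuserate;
  alcconsumption_c  = alcconsumption;
run;

/* METHOD=MEAN: location = sample mean, scale = 1, so each value */
/* is replaced by (value - mean) computed on this same sample    */
proc stdize data=complete out=new2_auto method=mean;
  var internetuserate_c alcconsumption_c;
run;

/* check coding: both centered means should be 0 up to rounding */
proc means data=new2_auto mean;
  var internetuserate_c alcconsumption_c;
run;
```

Because the means are taken from the same observations that enter the regression, the centered variables average to zero without any hard-coded constants.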
2) Testing the regressors
A) For the model of suicide rate on internet use rate, the estimated coefficient was 0.0162 (positive) with a p-value of 0.3433 (> 0.05), so at the 5% level of significance we do not have sufficient evidence to conclude that there is an association between internet use rate and suicide rate. Figure 1 shows that most of the observations were within the 95% prediction limits, except those with suicide rates above approximately 21.
Figure 1. Fit Plot for suicide rate based on internet use rate (centered)
B) For the model of suicide rate on alcohol consumption, the estimated coefficient was 0.497 (positive) with a p-value of <0.0001 (< 0.05), so at the 5% level of significance we can conclude that there is a significant association between alcohol consumption and suicide rate. Figure 2 shows that most of the observations were within the 95% prediction limits and were concentrated at values of centered alcohol consumption below 10.
Figure 2. Fit Plot for suicide rate based on alcohol consumption (centered)
C) For the quadratic model based on internetuserate, Figures 3 and 4 show that, for both the linear and the quadratic fit, most of the observations lay outside the 95% confidence limits. The observations were also widely scattered around the regression line, indicating that the fit may not be good. The results of the GLM procedure support this: the overall model p-value is high (0.2586), so the model is not significant, and neither are the individual coefficients, with p-values of 0.9618 and 0.1792 for internetuserate_c and internetuserate_c*internetuserate_c, respectively. Thus, a quadratic fit based on internetuserate is also not very effective in predicting suicide rates.
Figure 3. Fit Plot for simple linear model based on internetuserate
Figure 4. Fit Plot for quadratic model based on internetuserate
D) For the model based on internetuserate_c, internetuserate_c*internetuserate_c and alcconsumption_c, the overall p-value was <0.0001, which indicates that the model is significant, i.e., that the predictors taken together are significantly associated with the response. The final fitted model was
Y = 8.3536 - 0.0728*X1 + 0.0016*X1*X1 + 0.6649*X2
where X1 is the centered internet use rate (internetuserate_c) and X2 is the centered alcohol consumption (alcconsumption_c, in litres).
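As a worked example of how the fitted equation is used, take a hypothetical country whose internet use rate is 10 points above the sample mean and whose alcohol consumption is 2 litres above the sample mean. These input values are invented for illustration, and the calculation assumes that X1 and X2 are the centered variables used in the MODEL statement.

```latex
\[
\begin{aligned}
\hat{Y} &= 8.3536 - 0.0728\,(10) + 0.0016\,(10)^2 + 0.6649\,(2) \\
        &= 8.3536 - 0.7280 + 0.1600 + 1.3298 \\
        &= 9.1154
\end{aligned}
\]
```

With centered predictors, a country at the sample means (X1 = 0, X2 = 0) has a predicted suicide rate equal to the intercept, 8.3536.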
Thus, once alcohol consumption was added to the model, internet use rate showed a negative association with suicide rate, contrary to the positive (but non-significant) coefficient in the simple linear regression model. In addition, the coefficients for internetuserate_c and internetuserate_c*internetuserate_c, which were not significant earlier, became significant after alcconsumption_c was added (p-values 0.0009 and 0.0129, respectively). This suggests that alcohol consumption confounds the relationship between internet use rate and suicide rate. The effect of alcohol consumption, however, remains significant even after the other variables are added to the model, indicating a strong association with suicide rates.
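One way to probe this confounding is to check whether alcohol consumption is correlated with both internet use rate and suicide rate, since a confounder is associated with the predictor as well as with the response. A minimal sketch, assuming the data set new2 created in the centering step:

```sas
/* pairwise Pearson correlations among the response and the two predictors */
proc corr data=new2;
  var suicideper100th internetuserate_c alcconsumption_c;
run;
```

A noticeable correlation between alcconsumption_c and internetuserate_c, together with the correlation between alcconsumption_c and suicideper100th, would be consistent with the change in the internet use coefficients seen above.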
3) Validation of Assumptions
The residuals are symmetrically distributed around the middle of the plot (Figures 5 and 6), i.e., around residual = 0, and no pattern can be observed, which indicates that the residuals and the predicted values are unrelated. The plots of residuals by regressors (Figure 7), however, indicate that although internetuserate_c and alcconsumption_c seem to be uncorrelated with the residuals, a pattern is visible for internetuserate_c*internetuserate_c, pointing to systematic differences between the predicted and actual response values.
Figure 5. Plot of Residuals vs. Predicted response
Figure 6. Plot of Studentized Residuals vs. Predicted response
Figure 7. Plots of residuals by regressors
In linear regression, an outlier is an observation with a large residual. An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how much an observation has affected its fitted value from the regression model. In the plot of outlier and leverage points (Figure 8), points with studentized residuals greater than 2 in absolute value are flagged as outliers and points with leverage greater than approximately 0.045 are flagged as high-leverage points. These observations need to be examined and dealt with before drawing any firm conclusion about the underlying relationship between the predictors and the response.
Figure 8. Plot of Outliers and Leverage Points
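The points flagged in Figure 8 can also be listed by country. The sketch below assumes the data set new2 and the cutoffs read from the plot (2 for the studentized residual in absolute value, approximately 0.045 for leverage). The data set names diag and flagged and the variables lev, outlier and highlev are hypothetical names introduced for illustration.

```sas
/* refit the final model and save studentized residuals and leverage */
proc glm data=new2;
  model suicideper100th = internetuserate_c internetuserate_c*internetuserate_c
                          alcconsumption_c / solution;
  output out=diag student=stdres h=lev;
run;
quit;

/* flag observations beyond the cutoffs read from the plot */
data flagged;
  set diag;
  outlier = (abs(stdres) > 2);
  highlev = (lev > 0.045);
run;

/* list only the flagged countries */
proc print data=flagged;
  where outlier = 1 or highlev = 1;
  var country suicideper100th stdres lev outlier highlev;
run;
```

Listing the flagged countries makes it easier to decide, case by case, whether a point reflects a data problem or a genuine extreme value.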
If the residuals can be taken as coming from a normal distribution, then the assumption of normality of errors is valid. Figure 9 shows that the Q-Q plot of the residuals is not linear: it follows a curve, which indicates a deviation from normality. Figure 10 confirms this, since the kernel density curve is more peaked than the normal curve.
Figure 9. Q-Q plot of residuals
Figure 10. Plot for distribution of Residuals
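The visual impression from the Q-Q plot and the histogram can be complemented with formal tests. A minimal sketch, assuming the data set results and the residual variable res saved by the OUTPUT statement in the diagnostic step:

```sas
/* NORMAL requests tests of normality (for example Shapiro-Wilk) */
proc univariate data=results normal;
  var res;
  /* Q-Q plot against a normal with estimated mean and standard deviation */
  qqplot res / normal(mu=est sigma=est);
  /* histogram with fitted normal curve and kernel density overlay */
  histogram res / normal kernel;
run;
```

Small p-values in the tests for normality would agree with the curvature seen in the Q-Q plot.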
In the plot of standardized residuals vs. country (Figure 11), although some of the points lie far from the zero-residual line, i.e., observed = predicted, most of the points lie within 2 units of it.
Figure 11. Plot of Standardized Residuals
To check whether a variable affects the response linearly in the presence of the other variables, we use partial regression plots (also called added-variable plots). If the points in a partial plot are scattered more or less along a straight line, the effect of that variable can be taken as linear. Figure 12 shows that internetuserate and alcconsumption have a linear relationship with the response, while internetuserate2 (the square of internetuserate_c) displays a non-linear relationship.
Figure 12. Partial Regression Plots by regressors
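A partial regression plot for one predictor can also be built by hand, which shows what the plot measures: the residuals of the response after removing the other predictors, plotted against the residuals of the predictor of interest after removing the same predictors. The sketch below does this for alcohol consumption, assuming the data set partial and the centered variables from the earlier steps. The data set names ry and rxy and the variables res_y and res_x are hypothetical names introduced for illustration.

```sas
/* step 1: residuals of the response on the other predictors */
proc reg data=partial noprint;
  model suicideper100th = internetuserate_c internetuserate2;
  output out=ry r=res_y;
run;
quit;

/* step 2: residuals of alcohol consumption on the same predictors */
proc reg data=ry noprint;
  model alcconsumption_c = internetuserate_c internetuserate2;
  output out=rxy r=res_x;
run;
quit;

/* step 3: plot one set of residuals against the other */
proc sgplot data=rxy;
  reg x=res_x y=res_y;
  refline 0 / axis=y;
  xaxis label="Alcohol consumption, adjusted for internet use terms";
  yaxis label="Suicide rate, adjusted for internet use terms";
run;
```

The slope of the fitted line in this plot equals the coefficient of alcconsumption_c in the regression that includes all three terms, so a roughly straight band of points supports treating its effect as linear.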