Alcoholism and Major Lifetime Depression : W2 Data Analysis Tools
For the second week’s assignment of the Data Analysis Tools course on Coursera, we continue working with the NESARC dataset, which contains information on alcohol and drug use and disorders, related risk factors, and associated physical and mental disabilities.
We study the association between an individual's alcohol-consuming status and the presence of major depression in their lifetime. We perform a Chi-Square Test of Independence between a categorical explanatory variable (alcohol drinking status) and a categorical response variable (presence of major lifetime depression). The test is restricted to adults aged 18 to 40.
The explanatory variable has 3 groups:
1. Current Drinkers
2. Ex-Drinkers
3. Lifetime Abstainers
The response variable has 2 groups:
0. No Lifetime Depression
1. Has Lifetime Depression
The null hypothesis is that there is no association between the drinking status of an individual and the presence of Major Lifetime Depression.
Running a Chi-Square Test of Independence on the two variables, we get:
In the first table, the table of counts of the response variable by the explanatory variable, we see the number of individuals in each consumer group (1, 2, or 3) who do and do not have major lifetime depression. For example, among current drinkers, 10472 individuals do not have lifetime depression, while 2768 do.
The next table presents the same data in percentages of individuals with or without lifetime depression under each alcohol consumer group. So 79% of current drinkers do not have major lifetime depression, while 21% do.
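As a quick arithmetic check, the column percentages for current drinkers can be recomputed from the two counts quoted above. This is only a sketch: the variable names are illustrative, and only the current-drinker counts (10472 and 2768) are taken from the table of counts.

```python
# Counts for current drinkers (CONSUMER group 1), as quoted in the text.
no_depression = 10472
has_depression = 2768

# The column total is the number of current drinkers in the age 18 to 40 subset.
column_total = no_depression + has_depression

# Column percentages: each cell divided by its column total.
pct_no_depression = 100 * no_depression / column_total
pct_has_depression = 100 * has_depression / column_total

print(column_total)
print(round(pct_no_depression), round(pct_has_depression))
```

The rounded values agree with the 79% and 21% read off the percentage table.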
The graph below conveys the same information, showing only the proportion of individuals in each alcohol consumer group who have major lifetime depression: 21% of current drinkers and 20% of ex-drinkers have major lifetime depression, while only 11% of lifetime abstainers do.
The Chi-Square Value from the test is large, about 168, while the p-value is very small (<< 0.0001), which tells us that the presence of Major Lifetime Depression and the Alcohol-Consuming Status of an individual are significantly associated.
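To get a feel for how small this p-value is, note that a table with 2 rows and 3 columns has (2-1)(3-1) = 2 degrees of freedom, and for 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2). The sketch below plugs in the rounded statistic of 168 quoted above, so the result is only approximate; `scipy.stats.chi2.sf(168, 2)` should give the same figure.

```python
import math

# Rounded chi-square statistic quoted in the text (an approximation).
chi_square = 168.0

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1).
rows, cols = 2, 3
df = (rows - 1) * (cols - 1)

# For df == 2 the chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(df, p_value)
```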
The explanatory variable has 3 categories, and from the plot we can infer that lifetime abstainers had a noticeably lower rate of lifetime depression than current drinkers and ex-drinkers. To verify this quantitatively while guarding against the inflated risk of a Type I error that comes with multiple comparisons, we run post hoc pairwise tests with the Bonferroni adjustment.
Since there are three pairwise comparisons to make, we evaluate significance at the adjusted threshold of 0.017 (0.05/3).
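The adjusted threshold follows directly from the number of group pairs. A minimal sketch, assuming the usual family-wise level of 0.05; the group labels are just descriptive strings for the three CONSUMER categories:

```python
from itertools import combinations

# The three alcohol consumer groups compared in the post hoc tests.
groups = ["current drinkers", "ex-drinkers", "lifetime abstainers"]

# Every unordered pair of groups is one post hoc comparison.
pairs = list(combinations(groups, 2))

# Bonferroni adjustment: divide the family-wise alpha by the number of comparisons.
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

for first, second in pairs:
    print(f"{first} vs {second}")
print(round(adjusted_alpha, 3))
```

With more groups the number of pairs grows as k(k-1)/2, so the per-comparison threshold shrinks quickly.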
Now, running a chi-square test between just groups 1 and 2 of Alcohol-Consumer Status, we get a low Chi-Square value of 0.211 and a large p-value of 0.64, well above 0.017. We therefore fail to reject the null hypothesis that the rates of Major Lifetime Depression are the same among current drinkers and ex-drinkers.
Running a chi-square test between just groups 1 and 3 of Alcohol-Consumer Status, we get a high Chi-Square value of 165 and a p-value far below 0.017. We therefore reject the null hypothesis that the rates of Major Lifetime Depression are the same among current drinkers and lifetime abstainers.
Finally, running a chi-square test between just groups 2 and 3 of Alcohol-Consumer Status, we get a high Chi-Square value of 89 and a p-value far below 0.017. We therefore once again reject the null hypothesis that the rates of Major Lifetime Depression are the same among ex-drinkers and lifetime abstainers.
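The three decisions above can be reproduced from the reported statistics alone. Each pairwise comparison is a 2-by-2 table with 1 degree of freedom, for which the chi-square upper-tail probability equals erfc(sqrt(x/2)). The sketch below uses the rounded chi-square values quoted in the text (0.211, 165, and 89), so the p-values are approximate; the helper name `p_value_df1` and the dictionary keys are illustrative, not part of the original script.

```python
import math

def p_value_df1(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))

# Rounded chi-square values reported for each pair of CONSUMER groups.
reported = {"1 vs 2": 0.211, "1 vs 3": 165.0, "2 vs 3": 89.0}

# Bonferroni-adjusted threshold for three comparisons.
adjusted_alpha = 0.05 / len(reported)

decisions = {}
for pair, chi_square in reported.items():
    p = p_value_df1(chi_square)
    # True means the null hypothesis is rejected at the adjusted threshold.
    decisions[pair] = p < adjusted_alpha
    verdict = "reject" if decisions[pair] else "fail to reject"
    print(f"groups {pair}: p = {p:.3g}, {verdict} the null hypothesis")
```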
Thus, using the Bonferroni adjustment, we can conclude that the rate of major lifetime depression among lifetime alcohol abstainers differs significantly from the rate among both current drinkers and ex-drinkers. However, the rate of depression is not significantly different between current drinkers and ex-drinkers.
"""
@author: DKalaikadal159607
"""
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('nesarc.csv', low_memory=False)
#set the variables used in the analysis to numeric
data['MAJORDEPLIFE'] = pandas.to_numeric(data['MAJORDEPLIFE'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')
#subset data to young adults age 18 to 40
sub1=data[(data['AGE']>=18) & (data['AGE']<=40)]
#make a copy of my new subsetted data
sub2 = sub1.copy()
#contingency table of observed counts
ct1=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['CONSUMER'])
print (ct1)
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
seaborn.catplot(x="CONSUMER", y="MAJORDEPLIFE", data=sub2, kind="bar", ci=None)
plt.xlabel('Alcohol Consumer Status')
plt.ylabel('Proportion with Major Depression')
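#post hoc comparison: keep only groups 1 and 2 (other values become NaN and drop out of the crosstab)
recode2 = {1: 1, 2: 2}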
sub2['COMP1v2']= sub2['CONSUMER'].map(recode2)
#contingency table of observed counts
ct2=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['COMP1v2'])
print (ct2)
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
#post hoc comparison: keep only groups 1 and 3
recode3 = {1: 1, 3: 3}
sub2['COMP1v3']= sub2['CONSUMER'].map(recode3)
#contingency table of observed counts
ct3=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['COMP1v3'])
print (ct3)
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
#post hoc comparison: keep only groups 2 and 3
recode4 = {2: 2, 3: 3}
sub2['COMP2v3']= sub2['CONSUMER'].map(recode4)
#contingency table of observed counts
ct4=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['COMP2v3'])
print (ct4)
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)