# -*- coding: utf-8 -*-
"""
Created on Wed May 25 08:52:38 2016
@author: DH KIM
"""
# ANOVA
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
#setting variables you will be working with to numeric
data['S1Q10A'] = pd.to_numeric(data['S1Q10A'], errors='coerce')
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'], errors='coerce')
data['CHECK321'] = pd.to_numeric(data['CHECK321'], errors='coerce')
data['SEX'] = pd.to_numeric(data['SEX'], errors='coerce')
#subset data to those people who have income and have smoked in the past 12 months
sub = data[(data['S1Q10B'] != 0) & (data['CHECK321'] == 1)]
sub2 = sub[['SEX', 'S3AQ3B1', 'S3AQ3C1', 'S1Q10A']].copy()
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)   # 9 = unknown
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)  # 99 = unknown
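# recode usual smoking frequency (S3AQ3B1) into estimated days smoked per month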
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO']= sub2['S3AQ3B1'].map(recode1)
# mean value of personal income in the last 12 months
avgIncome = sub2['S1Q10A'].mean()
LvIncome = [0, sub2['S1Q10A'].quantile(0.3), sub2['S1Q10A'].quantile(0.7), sub2['S1Q10A'].max()]
# Split into 3 groups: 'low', 'medium', 'high'
splitIntoCat = pd.cut(sub2.S1Q10A, LvIncome, labels=['low', 'medium', 'high'])
sub2['INCOMECAT'] = splitIntoCat
# estimated number of cigarettes smoked per month
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
sub1 = sub2[['NUMCIGMO_EST', 'INCOMECAT', 'SEX']].dropna()
sub1['SEX'] = sub1['SEX'] - 1  # 0: Male, 1: Female
# using ols function for calculating the F-statistic and associated p value
model = smf.ols(formula='NUMCIGMO_EST ~ C(INCOMECAT)', data=sub1).fit()
print (model.summary())
print ('means for smoking amount by personal income')
print(sub1.groupby('INCOMECAT').mean())
print ('standard deviations for smoking amount by personal income')
print(sub1.groupby('INCOMECAT').std())
sns.factorplot(x="INCOMECAT", y="NUMCIGMO_EST", data=sub1, kind="bar", ci=None)
plt.xlabel('Income Level')
plt.ylabel('Smoking Amount')
# Moderator: SEX
print('means for numcigmo_est by personal income according to SEX')
print(sub1.groupby(['SEX', 'INCOMECAT']).mean())
sub_m = sub1[sub1.SEX == 0]
sub_f = sub1[sub1.SEX == 1]
print ('Association between income level and smoking amount for Male')
model_m = smf.ols(formula='NUMCIGMO_EST ~ C(INCOMECAT)', data=sub_m).fit()
print (model_m.summary())
print ('Association between income level and smoking amount for Female')
model_f = smf.ols(formula='NUMCIGMO_EST ~ C(INCOMECAT)', data=sub_f).fit()
print (model_f.summary())
print ('Means for smoking amount by personal income for Male')
print (sub_m.groupby('INCOMECAT').mean())
print ('Means for smoking amount by personal income for Female')
print (sub_f.groupby('INCOMECAT').mean())
sns.factorplot(x="INCOMECAT", y="NUMCIGMO_EST", data=sub_m, kind="bar", ci=None)
plt.title('Smoking Amount according to income level for Male')
plt.xlabel('Income Level')
plt.ylabel('Smoking Amount')
sns.factorplot(x="INCOMECAT", y="NUMCIGMO_EST", data=sub_f, kind="bar", ci=None)
plt.title('Smoking Amount according to income level for Female')
plt.xlabel('Income Level')
plt.ylabel('Smoking Amount')
5/25/2016
5/17/2016
Correlation Coefficient - Pearson coefficient
This post is a sequel to the first assignment, a study of the relationship between personal income and smoking amount. The ANOVA test there already showed no significant relationship.
In addition, I was interested in the association between personal and family income. To get more meaningful data, I refined the income variables: I computed the mean and standard deviation of personal and family income, then discarded values more than one standard deviation above the mean as np.nan (see below).
sub[sub.S1Q10A >= (sub.S1Q10A.mean() + sub.S1Q10A.std())] = np.nan
sub[sub.S1Q11A >= (sub.S1Q11A.mean() + sub.S1Q11A.std())] = np.nan
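For reference, here is a minimal sketch of how these coefficient pairs could be computed (assuming scipy.stats.pearsonr; S1Q11A is the family-income column used in the trimming snippet above, and NUMCIGMO_EST is the monthly-cigarettes estimate built in the earlier ANOVA post):
import scipy.stats
# drop rows with missing values before computing the coefficients
clean = sub[['S1Q10A', 'S1Q11A', 'NUMCIGMO_EST']].dropna()
# each call returns (Pearson r, two-sided p-value)
print(scipy.stats.pearsonr(clean['NUMCIGMO_EST'], clean['S1Q10A']))
print(scipy.stats.pearsonr(clean['S1Q10A'], clean['S1Q11A']))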
The results are,

Association between personal income and smoking amount
(0.034107681918807302, 0.0025274573221202926)
The Pearson coefficient is very close to zero, yet the p-value is below the significance level (alpha = 0.05). This confused me at first: the scatter plot shows no visible relationship, while the p-value says the association is statistically significant. The resolution is the sample size: with thousands of observations, even a negligible correlation becomes statistically significant. So I conclude there is no meaningful linear relationship, which agrees with both the scatter plot and the earlier ANOVA result.
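To see how a tiny r can still be 'significant', here is a sketch of the t-statistic behind pearsonr's p-value; the sample size n below is an assumption, chosen only to reproduce the reported p:
import numpy as np
from scipy import stats
r, n = 0.0341, 7850  # r from the output above; n is assumed, not taken from the data
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(t, p)  # even a tiny r becomes 'significant' when n is large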

Association between personal income and family income
(0.66086621192976558, 0.0)
The second result makes it crystal clear that personal and family income have a very strong positive relationship and are significantly associated. The interesting part is that the p-value is reported as exactly zero. It is not truly zero; it is simply so small that it underflows double-precision floating point, meaning there is essentially no chance of error in rejecting the null hypothesis.
One more observation: family income is at least as large as personal income for everyone, which makes sense because personal income is a component of family income, so the scatter plot looks like an upper-triangular region.
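A quick way to check that pattern (reusing the clean frame from the sketch above):
# fraction of respondents whose family income is at least their personal income
print((clean['S1Q11A'] >= clean['S1Q10A']).mean())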
Python code: Chi-square & Post hoc Test
"""
S1Q10A: Total personal income in last 12M (Quantitative)
SINTOCAT: Total personal income in last 12M (Categorical, Low: 0~30%, Medium: 31~70%, High: 71%~max)
DMAJORDEPSNI12: Major depression in last 12M (Illness-induced ruled out) (0: N, 1: Y)
@author: Daehwan, Kim
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
# split personal income into 3 groups: 'low', 'medium', 'high'
LvIncome = [0, data.S1Q10A.quantile(0.3), data.S1Q10A.quantile(0.7), data.S1Q10A.max()]
splitIntoCat = pd.cut(data.S1Q10A, LvIncome, labels=['low', 'medium', 'high'])
sub = data.copy()
sub['SINTOCAT'] = splitIntoCat
# contingency table of observed counts
ct = pd.crosstab(sub.DMAJORDEPSNI12, sub.SINTOCAT, colnames=['Income Level'], rownames=['Depression'])
print(ct)
# column percentages
pct = ct / ct.sum(axis=0)
print(pct)
sns.factorplot(x='SINTOCAT', y='DMAJORDEPSNI12', data=sub, ci=None, kind='bar')  # catplot in seaborn 0.9+
plt.xlabel('Personal income in last 12M')
plt.ylabel('Pct of people with depression')
# chi-square test of independence
chi2, p_val, df, expected = scipy.stats.chi2_contingency(ct)
print('*** chi-square value, p-value, dof')
print('chi2: %0.2f\np-value: %s\ndof: %d\n\n' % (chi2, p_val, df))
# post hoc comparison: low vs medium
comp_lvm = sub[['SINTOCAT', 'DMAJORDEPSNI12']].query('SINTOCAT == "low" or SINTOCAT == "medium"')
ct_lvm = pd.crosstab(comp_lvm.DMAJORDEPSNI12, comp_lvm.SINTOCAT).drop(columns='high')
print(ct_lvm)
pct_lvm = ct_lvm / ct_lvm.sum(axis=0)
print(pct_lvm)
chi2, p_val, df, expected = scipy.stats.chi2_contingency(ct_lvm)
print('*** chi-square value, p-value, dof')
print('chi2: %0.2f\np-value: %s\ndof: %d\n' % (chi2, p_val, df))
# post hoc comparison: low vs high
comp_lvh = sub[['SINTOCAT', 'DMAJORDEPSNI12']].query('SINTOCAT == "low" or SINTOCAT == "high"')
ct_lvh = pd.crosstab(comp_lvh.DMAJORDEPSNI12, comp_lvh.SINTOCAT).drop(columns='medium')
print(ct_lvh)
pct_lvh = ct_lvh / ct_lvh.sum(axis=0)
print(pct_lvh)
chi2, p_val, df, expected = scipy.stats.chi2_contingency(ct_lvh)
print('*** chi-square value, p-value, dof')
print('chi2: %0.2f\np-value: %s\ndof: %d\n' % (chi2, p_val, df))
# post hoc comparison: medium vs high
comp_mvh = sub[['SINTOCAT', 'DMAJORDEPSNI12']].query('SINTOCAT == "medium" or SINTOCAT == "high"')
ct_mvh = pd.crosstab(comp_mvh.DMAJORDEPSNI12, comp_mvh.SINTOCAT).drop(columns='low')
print(ct_mvh)
pct_mvh = ct_mvh / ct_mvh.sum(axis=0)
print(pct_mvh)
chi2, p_val, df, expected = scipy.stats.chi2_contingency(ct_mvh)
print('*** chi-square value, p-value, dof')
print('chi2: %0.2f\np-value: %s\ndof: %d\n' % (chi2, p_val, df))
Model Interpretation for Chi-Square and Post Hoc Tests:
I wanted to study the relationship between total personal income (explanatory) and depression (response) using NESARC.
'S1Q10A': Total personal income in last 12M (quantitative). 'DMAJORDEPSNI12': Major depression in last 12M (illness-induced ruled out).
I recoded the quantitative variable 'S1Q10A' into three levels, 'low', 'medium', and 'high', according to the amount of personal income: 'low' covers up to the 30th percentile, 'medium' up to the 70th percentile, and everything above that (the top 30% of personal income) is 'high'.
A chi-square test of independence revealed that the two variables are significantly associated: chi2(2) = 195.14, p = 4.24e-43.
Post hoc comparisons showed that every pair of income categories also differs significantly; all three p-values are far below the Bonferroni-adjusted alpha of 0.05/3 ≈ 0.017.
Here are the results.
========== chi-square test =================================
Income Level low medium high
Depression
0 9419 16032 12233
1 1051 1211 685
Income Level low medium high
Depression
0 0.899618 0.929769 0.946973
1 0.100382 0.070231 0.053027
*** chi-square value, p-value, dof
chi2: 195.14
p-value: 4.23610993039e-43
dof: 2
========== Post hoc test ===================================
low vs medium -------------------
*** chi-square value, p-value, dof
chi2: 78.60
p-value: 7.60330301415e-19
dof: 1
low vs high ---------------------
*** chi-square value, p-value, dof
chi2: 188.03
p-value: 8.54030265154e-43
dof: 1
medium vs high ------------------
*** chi-square value, p-value, dof
chi2: 36.82
p-value: 1.2984866455e-09
dof: 1
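For completeness, a minimal sketch of the Bonferroni adjustment applied to these three pairwise tests (p-values copied from the output above):
# Bonferroni: divide the family-wise alpha by the number of comparisons
alpha_adj = 0.05 / 3  # about 0.0167
pairs = {'low vs medium': 7.60e-19, 'low vs high': 8.54e-43, 'medium vs high': 1.30e-09}
for pair, p in pairs.items():
    print(pair, ':', 'reject H0' if p < alpha_adj else 'fail to reject H0')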
5/12/2016
ANOVA Test in Python code and interpretation - 2
ANOVA Test & Post hoc Test - Tukey HSD
# =============================================================
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
#setting variables you will be working with to numeric
data['S1Q10A'] = pd.to_numeric(data['S1Q10A'], errors='coerce')
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'], errors='coerce')
data['CHECK321'] = pd.to_numeric(data['CHECK321'], errors='coerce')
#subset data to those people who have income and have smoked in the past 12 months
sub = data[(data['S1Q10B'] != 0) & (data['CHECK321'] == 1)]
sub.loc[:, 'S3AQ3B1'] = sub['S3AQ3B1'].replace(9, np.nan)
sub.loc[:, 'S3AQ3C1'] = sub['S3AQ3C1'].replace(99, np.nan)
# recoding number of days smoked in the past month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub['USFREQMO'] = sub['S3AQ3B1'].map(recode1)
# mean value of personal income in the last 12 months
avgIncome = sub['S1Q10A'].mean()
LvIncome = [0, sub['S1Q10A'].quantile(0.3), sub['S1Q10A'].quantile(0.7), sub['S1Q10A'].max()]
# Split into 3 groups: 'low', 'medium', 'high'
splitIntoCat = pd.cut(sub.S1Q10A, LvIncome, labels=['low', 'medium', 'high'])
sub['INCOMECAT'] = splitIntoCat
# estimated number of cigarettes smoked per month
sub['NUMCIGMO_EST'] = sub['USFREQMO'] * sub['S3AQ3C1']
sub1 = sub[['NUMCIGMO_EST', 'INCOMECAT']].dropna()
# using ols function for calculating the F-statistic and associated p value
model = smf.ols(formula='NUMCIGMO_EST ~ C(INCOMECAT)', data=sub1)
results = model.fit()
print(results.summary())
print('means for numcigmo_est by personal income')
print(sub1.groupby('INCOMECAT').mean())
print('standard deviations for numcigmo_est by personal income')
print(sub1.groupby('INCOMECAT').std())
# ============ Post hoc test, Tukey HSD ===========================
mc1 = multi.MultiComparison(sub1['NUMCIGMO_EST'], sub1['INCOMECAT'])
res1 = mc1.tukeyhsd()
print(res1.summary())
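As a cross-check, the F-statistic and p-value quoted in the interpretation below can also be read off an ANOVA table; a sketch using statsmodels' anova_lm on the fitted model above:
import statsmodels.api as sm
print(sm.stats.anova_lm(results, typ=2))  # F and PR(>F) for C(INCOMECAT)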
ANOVA Test in Python code and interpretation - 1
I wanted to examine the association between personal income level (categorical explanatory) and smoking amount (quantitative response). I suspected there would be a relationship: from my own experience, I quit smoking at the start of 2016 because cigarette prices rose sharply, but I probably would not have quit if I had been paid more.
From NESARC ('S1Q10A'), I split personal income into three levels, 'low', 'medium', and 'high': 'low' covers up to the 30th percentile, 'medium' up to the 70th percentile, and everything above that (the top 30% of personal income) is 'high'.
The ANOVA results revealed no significant relationship between personal income and number of cigarettes smoked, F(2, 9345) = 1.130, p = 0.323. Below are the means and standard deviations of the number of cigarettes smoked by personal income level.
means for numcigmo_est by personal income
INCOMECAT    NUMCIGMO_EST
low          424.175214
medium       413.702265
high         423.927979

standard deviations for numcigmo_est by personal income
INCOMECAT    NUMCIGMO_EST
low          342.501714
medium       311.897224
high         335.876155
There is no significant difference between any pair of groups, and the post hoc test results below agree. In particular, the 'meandiff' between 'high' and 'low' is very small, and the 'reject' column is False for every pair at a significance level (alpha) of 0.05.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================
group1 group2 meandiff   lower    upper  reject
----------------------------------------------
high   low      0.2472 -20.5833  21.0778  False
high   medium -10.2257 -29.6021   9.1507  False
low    medium -10.4729 -29.5268   8.5809  False
----------------------------------------------
If someone asked me whether these results are credible, I would say no. There are likely many uncontrolled factors; one simple example is that cigarette prices differ by state.