Deep_vibes @dipayan-banerjee - Tumblr Blog

Classification Problem.

Logistic regression is another technique borrowed by machine learning from the field of statistics.

It is the go-to method for binary classification problems (problems with two class values). In this post you will discover the logistic regression algorithm for machine learning.

After reading this post you will know:

The many names and terms used when describing logistic regression (like log odds and logit).

The representation used for a logistic regression model.

Techniques used to learn the coefficients of a logistic regression model from data.

How to actually make predictions using a learned logistic regression model.

Where to go for more information if you want to dig a little deeper.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.
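As a minimal sketch of the transformation described above, the formula can be written in a few lines of Python. The function name logistic is only an illustrative choice and does not come from the code used later in this post.

```python
import math

def logistic(value):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

# Transform the integers from -5 to 5, the same range as the plot described above
for v in range(-5, 6):
    print(f"{v:>3} -> {logistic(v):.4f}")
```

An input of 0 maps to exactly 0.5, and large negative or positive inputs approach 0 and 1 without reaching them.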

CODES:

import numpy

import pandas

import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

data = pandas.read_csv(r"C:\Users\dipayan_b\OneDrive - Dell Technologies\Desktop\Presentations\nesarc.csv", low_memory=False)

h=data.head(5)

##############################################################################
# DATA MANAGEMENT
##############################################################################

# setting variables you will be working with to numeric
# data['NDSymptoms'] = pandas.to_numeric(data['NDSymptoms'], errors='coerce')
numeric_cols = [
    'IDNUM', 'TAB12MDX', 'MAJORDEPLIFE', 'SOCPDLIFE', 'S3AQ3C1', 'AGE', 'SEX',
    'S3AQ3B1', 'CHECK321', 'S3AQ8B11', 'S3AQ8B12', 'S3AQ8B13',
    'S3AQ8B7A', 'S3AQ8B7B', 'S3AQ8B7C', 'S3AQ8B7D', 'S3AQ8B7E',
    'S3AQ8B7F', 'S3AQ8B7G', 'S3AQ8B7H', 'S3AQ8B7J',
    'S6Q1', 'S6Q2', 'S6Q3', 'S6Q7', 'S6Q61', 'S6Q62', 'S6Q63', 'S6Q64',
    'S6Q65', 'S6Q66', 'S6Q67', 'S6Q68', 'S6Q69', 'S6Q610', 'S6Q611',
    'S6Q612', 'S6Q613',
]
for col in numeric_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

data['S3AQ3C1'] = data['S3AQ3C1'].replace(99, numpy.nan)

# subset for NDsymptoms regression (age 18-25, smoked in past month)
# pandas gives observations missing on all symptoms (N=3) a value of zero, but should be nan
# have to delete them
sub1 = data[(data['AGE']<=25) & (data['CHECK321']==1) & (data['S3AQ3B1']==1) &
            (data['IDNUM']!=20346) & (data['IDNUM']!=36471) & (data['IDNUM']!=28724)]

# subset data for logistic regression analyses (18-25, smoked in past month)
# note: this assignment replaces the subset above; .copy() avoids pandas'
# SettingWithCopyWarning when new columns are added to sub1 below
sub1 = data[(data['AGE']<=25) & (data['CHECK321']==1) & (data['S3AQ3B1']==1)].copy()

# Current Tolerance criteria #1 DSM-IV
def crit1(row):
    if row['S3AQ8B11']==1 or row['S3AQ8B12']==1:
        return 1
    elif row['S3AQ8B11']==2 and row['S3AQ8B12']==2:
        return 0

sub1['crit1'] = sub1.apply(lambda row: crit1(row), axis=1)
chk2 = sub1['crit1'].value_counts(sort=False, dropna=False)
print(chk2)
chk3 = sub1['S3AQ8B11'].value_counts(sort=False, dropna=False)
print(chk3)
chk4 = sub1['S3AQ8B12'].value_counts(sort=False, dropna=False)
print(chk4)
print(pandas.crosstab(sub1['S3AQ8B11'], sub1['S3AQ8B12']))
c1 = sub1['S3AQ8B7J'].value_counts(sort=False, dropna=False)
print(c1)

# Current 8 WITHDRAWAL SUB-SYMPTOMS IN DSM-IV (recode 1,2 to 0,1 for summing)
# after recoding 9s to missing
recode1 = {1: 1, 2: 0}
withdrawal_cols = ['S3AQ8B7A', 'S3AQ8B7B', 'S3AQ8B7C', 'S3AQ8B7D',
                   'S3AQ8B7E', 'S3AQ8B7F', 'S3AQ8B7G', 'S3AQ8B7H']
for col in withdrawal_cols:
    sub1[col] = sub1[col].replace(9, numpy.nan)
    sub1[col] = sub1[col].map(recode1)

# check recode
chk1c = sub1['S3AQ8B7J'].value_counts(sort=False, dropna=False)
print(chk1c)

# sum symptoms
sub1['CWITHDR_COUNT'] = numpy.nansum([sub1['S3AQ8B7A'], sub1['S3AQ8B7B'], sub1['S3AQ8B7C'],
                                      sub1['S3AQ8B7D'], sub1['S3AQ8B7E'], sub1['S3AQ8B7F'],
                                      sub1['S3AQ8B7G'], sub1['S3AQ8B7H']], axis=0)

# check to make sure sum code worked
chksum = sub1[['IDNUM', 'S3AQ8B7A', 'S3AQ8B7B', 'S3AQ8B7C', 'S3AQ8B7D', 'S3AQ8B7E',
               'S3AQ8B7F', 'S3AQ8B7G', 'S3AQ8B7H', 'CWITHDR_COUNT']]
chksum.head(n=50)

chk1d = sub1['CWITHDR_COUNT'].value_counts(sort=False, dropna=False)
print(chk1d)

# withdrawal (yes/no)
def crit2(row):
    if row['CWITHDR_COUNT']>=4 or row['S3AQ8B7J']==1:
        return 1
    elif row['CWITHDR_COUNT']<4 and row['S3AQ8B7J']!=1:
        return 0

sub1['crit2'] = sub1.apply(lambda row: crit2(row), axis=1)
print(pandas.crosstab(sub1['CWITHDR_COUNT'], sub1['crit2']))

# Current Larger amount or longer period criteria #3 DSM-IV
sub1['S3AQ8B13'] = sub1['S3AQ8B13'].replace(9, numpy.nan)
sub1['S3AQ8B13'] = sub1['S3AQ8B13'].map(recode1)

chk1d = sub1['S3AQ8B13'].value_counts(sort=False, dropna=False)
print(chk1d)

# Current Cut down criteria #4 DSM-IV
sub1['S3AQ8B6'] = pandas.to_numeric(sub1['S3AQ8B6'], errors='coerce')
sub1['S3AQ8B1'] = pandas.to_numeric(sub1['S3AQ8B1'], errors='coerce')

def crit4(row):
    if row['S3AQ8B6']==1 or row['S3AQ8B1']==1:
        return 1
    elif row['S3AQ8B6']==2 and row['S3AQ8B1']==2:
        return 0

sub1['crit4'] = sub1.apply(lambda row: crit4(row), axis=1)
chk1e = sub1['crit4'].value_counts(sort=False, dropna=False)
print(chk1e)

# Current Substance activities criteria #5 DSM-IV
sub1['S3AQ8B5'] = pandas.to_numeric(sub1['S3AQ8B5'], errors='coerce')
sub1['S3AQ8B5'] = sub1['S3AQ8B5'].replace(9, numpy.nan)
sub1['S3AQ8B5'] = sub1['S3AQ8B5'].map(recode1)

chk1f = sub1['S3AQ8B5'].value_counts(sort=False, dropna=False)
print(chk1f)

# Current Reduce activities criteria #6 DSM-IV
sub1['S3AQ8B2'] = pandas.to_numeric(sub1['S3AQ8B2'], errors='coerce')
sub1['S3AQ8B3'] = pandas.to_numeric(sub1['S3AQ8B3'], errors='coerce')

def crit6(row):
    if row['S3AQ8B2']==1 or row['S3AQ8B3']==1:
        return 1
    elif row['S3AQ8B2']==2 and row['S3AQ8B3']==2:
        return 0

sub1['crit6'] = sub1.apply(lambda row: crit6(row), axis=1)
chk1g = sub1['crit6'].value_counts(sort=False, dropna=False)
print(chk1g)

# Current use continued despite knowledge of physical or psychological problem criteria #7 DSM-IV
sub1['S3AQ8B4'] = pandas.to_numeric(sub1['S3AQ8B4'], errors='coerce')
sub1['S3AQ8B14'] = pandas.to_numeric(sub1['S3AQ8B14'], errors='coerce')

def crit7(row):
    if row['S3AQ8B4']==1 or row['S3AQ8B14']==1:
        return 1
    elif row['S3AQ8B4']==2 and row['S3AQ8B14']==2:
        return 0

sub1['crit7'] = sub1.apply(lambda row: crit7(row), axis=1)
chk1h = sub1['crit7'].value_counts(sort=False, dropna=False)
print(chk1h)

# sum all symptoms (numpy.nansum allows rows with some missing values to count all valid values)
sub1['NDSymptoms'] = numpy.nansum([sub1['crit1'], sub1['crit2'], sub1['S3AQ8B13'],
                                   sub1['crit4'], sub1['S3AQ8B5'], sub1['crit6'],
                                   sub1['crit7']], axis=0)
chk2 = sub1['NDSymptoms'].value_counts(sort=False, dropna=False)
print(chk2)

c1 = sub1["MAJORDEPLIFE"].value_counts(sort=False, dropna=False)
print(c1)
c2 = sub1["AGE"].value_counts(sort=False, dropna=False)
print(c2)

# binary nicotine dependence
def NICOTINEDEP(x):
    if x['TAB12MDX']==1:
        return 1
    else:
        return 0

sub1['NICOTINEDEP'] = sub1.apply(lambda x: NICOTINEDEP(x), axis=1)
print(pandas.crosstab(sub1['TAB12MDX'], sub1['NICOTINEDEP']))

# rename variables
sub1.rename(columns={'S3AQ3C1': 'numbercigsmoked'}, inplace=True)

c6 = sub1["numbercigsmoked"].value_counts(sort=False, dropna=False)
print(c6)

def PANIC(x1):
    if ((x1['S6Q1']==1 and x1['S6Q2']==1) or
        (x1['S6Q2']==1 and x1['S6Q3']==1) or
        (x1['S6Q3']==1 and x1['S6Q61']==1) or
        (x1['S6Q61']==1 and x1['S6Q62']==1) or
        (x1['S6Q62']==1 and x1['S6Q63']==1) or
        (x1['S6Q63']==1 and x1['S6Q64']==1) or
        (x1['S6Q64']==1 and x1['S6Q65']==1) or
        (x1['S6Q65']==1 and x1['S6Q66']==1) or
        (x1['S6Q66']==1 and x1['S6Q67']==1) or
        (x1['S6Q67']==1 and x1['S6Q68']==1) or
        (x1['S6Q68']==1 and x1['S6Q69']==1) or
        (x1['S6Q69']==1 and x1['S6Q610']==1) or
        (x1['S6Q610']==1 and x1['S6Q611']==1) or
        (x1['S6Q611']==1 and x1['S6Q612']==1) or
        (x1['S6Q612']==1 and x1['S6Q613']==1) or
        (x1['S6Q613']==1 and x1['S6Q7']==1) or
        x1['S6Q7']==1):
        return 1
    else:
        return 0

sub1['PANIC'] = sub1.apply(lambda x1: PANIC(x1), axis=1)
c7 = sub1["PANIC"].value_counts(sort=False, dropna=False)
print(c7)

# 4 category ethnicity variable
sub1['ETHRACE2A'] = pandas.to_numeric(sub1['ETHRACE2A'], errors='coerce')
recode2 = {1: 1, 2: 2, 3: 3, 4: 3, 5: 0}
sub1['ETHRACE2A'] = sub1['ETHRACE2A'].replace(9, numpy.nan)
sub1['ETHRACE'] = sub1['ETHRACE2A'].map(recode2)

c8 = sub1["ETHRACE2A"].value_counts(sort=False, dropna=False)
print(c8)

c9 = sub1["ETHRACE"].value_counts(sort=False, dropna=False)
print(c9)

##############################################################################
# END DATA MANAGEMENT
##############################################################################

##############################################################################
# CATEGORICAL VARIABLES WITH 3+ CATEGORIES
##############################################################################

# center quantitative IVs for regression analysis
sub1['numbercigsmoked_c'] = (sub1['numbercigsmoked'] - sub1['numbercigsmoked'].mean())
print(sub1['numbercigsmoked_c'].mean())
sub1['age_c'] = (sub1['AGE'] - sub1['AGE'].mean())
print(sub1['age_c'].mean())

# adding 4 category ethnicity/race. Reference group coding is called "Treatment" coding in python
# and the default reference category is the group with a value = 0 (Hispanic)
reg6 = smf.ols('NDSymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c + age_c + SEX + C(ETHRACE)', data=sub1).fit()
print(reg6.summary())

# can override the default and specify a different reference group
# non-Hispanic White as reference group
reg7 = smf.ols('NDSymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c + age_c + SEX + C(ETHRACE, Treatment(reference=1))', data=sub1).fit()
print(reg7.summary())

##############################################################################
# LOGISTIC REGRESSION
##############################################################################

# logistic regression with social phobia
lreg1 = smf.logit(formula = 'NICOTINEDEP ~ SOCPDLIFE', data = sub1).fit()
print(lreg1.summary())
# odds ratios
print("Odds Ratios")
print(numpy.exp(lreg1.params))

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with social phobia and depression
lreg2 = smf.logit(formula = 'NICOTINEDEP ~ SOCPDLIFE + MAJORDEPLIFE', data = sub1).fit()
print(lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with panic
lreg3 = smf.logit(formula = 'NICOTINEDEP ~ PANIC', data = sub1).fit()
print(lreg3.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with panic and depression
lreg4 = smf.logit(formula = 'NICOTINEDEP ~ PANIC + MAJORDEPLIFE', data = sub1).fit()
print(lreg4.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg4.params
conf = lreg4.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

Explanations:

Daily smokers with major depression have 3.7 times the odds of nicotine dependence compared with daily smokers without depression, after controlling for the presence of social phobia. Because the confidence intervals on our odds ratios overlap, we cannot say that major depression is more strongly associated with nicotine dependence than social phobia is. For the population of young adult daily smokers, we can say that the odds of nicotine dependence for those with social phobia are between 1.2 and 4.6 times the odds for those without social phobia, and the odds for those with major depression are between 2.7 and 5.0 times the odds for those without major depression.
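The odds ratios quoted above are the exponentiated logit coefficients. The short sketch below redoes that arithmetic by hand, using the coefficient and 95% confidence bounds printed in the summary for the model with social phobia and major depression. The dictionary and helper names are only illustrative.

```python
import math

# Coefficient, lower bound, upper bound on the log-odds scale,
# copied from the SOCPDLIFE + MAJORDEPLIFE summary table in this post
coef = {
    "SOCPDLIFE": (0.8393, 0.158, 1.520),
    "MAJORDEPLIFE": (1.3072, 1.009, 1.606),
}

def to_odds_ratio(beta, lower, upper):
    """Exponentiate a log-odds coefficient and its confidence bounds."""
    return tuple(round(math.exp(x), 2) for x in (beta, lower, upper))

odds_ratios = {name: to_odds_ratio(*vals) for name, vals in coef.items()}
for name, (or_, lo, hi) in odds_ratios.items():
    print(f"{name}: OR={or_:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The printed values agree with the numpy.exp(conf) table for this model, which is where the 3.7 odds ratio and the two interval ranges come from.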

Output:

Logit Regression Results

==============================================================================

Dep. Variable: NICOTINEDEP No. Observations: 1320

Model: Logit Df Residuals: 1318

Method: MLE Df Model: 1

Date: Tue, 26 May 2020 Pseudo R-squ.: 0.009574

Time: 15:54:31 Log-Likelihood: -876.98

converged: True LL-Null: -885.46

Covariance Type: nonrobust LLR p-value: 3.829e-05

==============================================================================

coef std err z P>|z| [0.025 0.975]

------------------------------------------------------------------------------

Intercept 0.3776 0.057 6.569 0.000 0.265 0.490

SOCPDLIFE 1.2318 0.335 3.674 0.000 0.575 1.889

==============================================================================

Odds Ratios

Intercept 1.46

SOCPDLIFE 3.43

dtype: float64

params = lreg1.params

conf = lreg1.conf_int()

conf['OR'] = params

conf.columns = ['Lower CI', 'Upper CI', 'OR']

print (numpy.exp(conf))

Lower CI Upper CI OR

Intercept 1.30 1.63 1.46

SOCPDLIFE 1.78 6.61 3.43

lreg2 = smf.logit(formula = 'NICOTINEDEP ~ SOCPDLIFE + MAJORDEPLIFE', data = sub1).fit()

print (lreg2.summary())

Optimization terminated successfully.

Current function value: 0.632175

Iterations 6

Logit Regression Results

==============================================================================

Dep. Variable: NICOTINEDEP No. Observations: 1320

Model: Logit Df Residuals: 1317

Method: MLE Df Model: 2

Date: Tue, 26 May 2020 Pseudo R-squ.: 0.05758

Time: 15:54:45 Log-Likelihood: -834.47

converged: True LL-Null: -885.46

Covariance Type: nonrobust LLR p-value: 7.177e-23

================================================================================

coef std err z P>|z| [0.025 0.975]

--------------------------------------------------------------------------------

Intercept 0.0939 0.065 1.444 0.149 -0.034 0.221

SOCPDLIFE 0.8393 0.347 2.416 0.016 0.158 1.520

MAJORDEPLIFE 1.3072 0.152 8.588 0.000 1.009 1.606

================================================================================

params = lreg2.params

conf = lreg2.conf_int()

conf['OR'] = params

conf.columns = ['Lower CI', 'Upper CI', 'OR']

print (numpy.exp(conf))

Lower CI Upper CI OR

Intercept 0.97 1.25 1.10

SOCPDLIFE 1.17 4.57 2.31

MAJORDEPLIFE 2.74 4.98 3.70

lreg3 = smf.logit(formula = 'NICOTINEDEP ~ PANIC', data = sub1).fit()

print (lreg3.summary())

# odd ratios with 95% confidence intervals

print ("Odds Ratios")

params = lreg3.params

conf = lreg3.conf_int()

conf['OR'] = params

conf.columns = ['Lower CI', 'Upper CI', 'OR']

print (numpy.exp(conf))

# logistic regression with panic and depression

lreg4 = smf.logit(formula = 'NICOTINEDEP ~ PANIC + MAJORDEPLIFE', data = sub1).fit()

print (lreg4.summary())

# odd ratios with 95% confidence intervals

print ("Odds Ratios")

params = lreg4.params

conf = lreg4.conf_int()

conf['OR'] = params

conf.columns = ['Lower CI', 'Upper CI', 'OR']

print (numpy.exp(conf))

Optimization terminated successfully.

Current function value: 0.662762

Iterations 5

Logit Regression Results

==============================================================================

Dep. Variable: NICOTINEDEP No. Observations: 1320

Model: Logit Df Residuals: 1318

Method: MLE Df Model: 1

Date: Tue, 26 May 2020 Pseudo R-squ.: 0.01199

Time: 15:54:59 Log-Likelihood: -874.85

converged: True LL-Null: -885.46

Covariance Type: nonrobust LLR p-value: 4.079e-06

==============================================================================

coef std err z P>|z| [0.025 0.975]

------------------------------------------------------------------------------

Intercept 0.3202 0.061 5.278 0.000 0.201 0.439

PANIC 0.7590 0.172 4.423 0.000 0.423 1.095

==============================================================================

Odds Ratios

Lower CI Upper CI OR

Intercept 1.22 1.55 1.38

PANIC 1.53 2.99 2.14

Optimization terminated successfully.

Current function value: 0.633241

Iterations 5

Logit Regression Results

==============================================================================

Dep. Variable: NICOTINEDEP No. Observations: 1320

Model: Logit Df Residuals: 1317

Method: MLE Df Model: 2

Date: Tue, 26 May 2020 Pseudo R-squ.: 0.05600

Time: 15:54:59 Log-Likelihood: -835.88

converged: True LL-Null: -885.46

Covariance Type: nonrobust LLR p-value: 2.930e-22

================================================================================

coef std err z P>|z| [0.025 0.975]

--------------------------------------------------------------------------------

Intercept 0.0826 0.066 1.243 0.214 -0.048 0.213

PANIC 0.3554 0.183 1.941 0.052 -0.003 0.714

MAJORDEPLIFE 1.2848 0.155 8.266 0.000 0.980 1.589

================================================================================

Odds Ratios

Lower CI Upper CI OR

Intercept 0.95 1.24 1.09

PANIC 1.00 2.04 1.43

MAJORDEPLIFE 2.66 4.90 3.61

#Week4 assignment


Cluster Analysis - KMeans

Cluster analysis is a type of unsupervised learning method. An unsupervised learning method is one in which we draw inferences from datasets consisting of input data without labeled responses. Generally, it is used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.

The k-means clustering algorithm is as follows:

1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$ randomly.

2. Repeat until convergence: {

For every $i$, set $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$.

For each $j$, set $\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$.

}

In the algorithm above, k (a parameter of the algorithm) is the number of clusters we want to find, and the cluster centroids $\mu_j$ represent our current guesses for the positions of the centers of the clusters. To initialize the cluster centroids (in step 1 of the algorithm above), we could choose k training examples randomly and set the cluster centroids to be equal to the values of these k examples. (Other initialization methods are also possible.) The inner loop of the algorithm repeatedly carries out two steps: (i) "assigning" each training example $x^{(i)}$ to the closest cluster centroid $\mu_j$, and (ii) moving each cluster centroid $\mu_j$ to the mean of the points assigned to it.
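The two steps above can be sketched directly in NumPy. This is a teaching sketch under stated assumptions, not the scikit-learn KMeans implementation used in the code below: the function name kmeans, the fixed iteration cap, and the synthetic two-blob data are all illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels) for data X of shape (m, n)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct training examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2(i): assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2(ii): move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Hypothetical demo data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this well separated, the two centroids settle near the blob centers. On real data the result depends on the random initialization, which is one reason library implementations run several restarts.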

CODE:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

"""
Data Management
"""
data = pd.read_csv(r"C:\Users\11042\Desktop\Data\tree_addhealth.csv")

# upper-case all DataFrame column names
data.columns = data.columns.str.upper()

# Data Management
data_clean = data.dropna()

# subset clustering variables
cluster = data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
                      'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]
cluster.describe()

# standardize clustering variables to have mean=0 and sd=1
clustervar = cluster.copy()
for col in ['ALCEVR1', 'ALCPROBS1', 'MAREVER1', 'DEP1', 'ESTEEM1', 'VIOL1',
            'DEVIANT1', 'FAMCONCT', 'SCHCONN1', 'PARACTV', 'PARPRES']:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1,10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(clus_train['index'])
# create a list of cluster assignments
labels = list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data
gpa_data = data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

OUTPUT:

After preprocessing the data and fitting the KMeans algorithm, we plot the average distance against the number of clusters to choose k using the Elbow Method.

Plotting the canonical variables upon applying PCA

Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few variables that accounted for most of the variance in the clustering variables.

A scatterplot of the first two canonical variables by cluster (Figure 2) indicated that the observations in cluster 1 (green) were densely packed with relatively low within-cluster variance, and overlapped considerably with cluster 2 (blue). Cluster 2 was a bit wider and overlapped mostly with cluster 1. Observations in cluster 3 (yellow) were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 3 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 3 clusters.

Cluster Counts:

2    1419
0    1101
1     682
Name: cluster, dtype: int64

Clustering variable means by cluster

cluster        index    ALCEVR1   MAREVER1  ...    PARACTV    PARPRES   FAMCONCT
0        3319.652134   0.946562  -0.058121  ...   0.158664   0.109642   0.236515
1        3330.536657   0.661676   1.085233  ...  -0.416451  -0.484017  -0.960210
2        3238.273432  -1.056455  -0.474480  ...   0.091971   0.156849   0.299811

[3 rows x 12 columns]

The Coefficients:

OLS Regression Results
==============================================================================
Dep. Variable:                   GPA1   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     136.0
Date:                Tue, 09 Jul 2019   Prob (F-statistic):           2.10e-57
Time:                        11:34:01   Log-Likelihood:                -3596.8
No. Observations:                3202   AIC:                             7200.
Df Residuals:                    3199   BIC:                             7218.
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           2.8337      0.022    126.312      0.000       2.790       2.878
C(cluster)[T.1]    -0.4098      0.036    -11.298      0.000      -0.481      -0.339
C(cluster)[T.2]     0.1614      0.030      5.397      0.000       0.103       0.220
==============================================================================
Omnibus:                      152.383   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.763
Skew:                          -0.280   Prob(JB):                     7.19e-21
Kurtosis:                       2.382   Cond. No.                         3.83
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for GPA by cluster
cluster      GPA1
0        2.833712
1        2.423876
2        2.995067

standard deviations for GPA by cluster
cluster      GPA1
0        0.728128
1        0.782335
2        0.738169

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1    -0.4098  -0.4949 -0.3248  True
  0      2     0.1614   0.0913  0.2315  True
  1      2     0.5712   0.4899  0.6525  True
---------------------------------------------

We determine the number of clusters to use with the elbow method.

Using 3 clusters, we fit the model to the training data. This assigns a cluster number to each data point.

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on grade point average (GPA). A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on GPA (F(2, 3199)=136.0, p<.0001). The Tukey post hoc comparisons showed significant differences between every pair of clusters on GPA. Adolescents in cluster 2 had the highest GPA (mean=3.00, sd=0.74), and cluster 1 had the lowest GPA (mean=2.42, sd=0.78).

Lasso Regression

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination.

The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regularization can result in sparse models with few coefficients; some coefficients can become zero and be eliminated from the model. Larger penalties result in coefficient values closer to zero, which is ideal for producing simpler models. On the other hand, L2 regularization (e.g. Ridge regression) doesn’t result in elimination of coefficients or sparse models. This makes the Lasso far easier to interpret than the Ridge.

Performing the Regression

Lasso solutions are quadratic programming problems, which are best solved with software (like Matlab). The goal of the algorithm is to minimize:

Σ_i (y_i − Σ_j x_ij β_j)² + λ Σ_j |β_j|

Which is the same as minimizing the sum of squares with constraint Σ_j |β_j| ≤ s. Some of the βs are shrunk to exactly zero, resulting in a regression model that’s easier to interpret.

A tuning parameter, λ, controls the strength of the L1 penalty. λ is basically the amount of shrinkage:

When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear regression.

As λ increases, more and more coefficients are set to zero and eliminated (theoretically, when λ = ∞, all coefficients are eliminated).

As λ increases, bias increases.

As λ decreases, variance increases.

If an intercept is included in the model, it is usually left unchanged.
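The effect of λ described above can be sketched in a few lines of Python. This is only an illustration under two assumptions that are not in the original post: the predictors are orthonormal, and the objective is written as half the sum of squares plus λ Σ |β_j|. In that special case each lasso coefficient is just the least-squares coefficient "soft-thresholded" by λ. The coefficient values are made up.

```python
def soft_threshold(b, lam):
    """Shrink b toward zero by lam; return 0.0 once |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


# Hypothetical least-squares coefficients for four predictors.
ols_coefs = [2.5, -1.2, 0.4, -0.1]

# Larger lambda shrinks every coefficient and sets more of them to exactly zero.
for lam in [0.0, 0.3, 1.0, 3.0]:
    shrunk = [soft_threshold(b, lam) for b in ols_coefs]
    n_zero = sum(1 for b in shrunk if b == 0.0)
    print(f"lambda={lam}: {shrunk} ({n_zero} eliminated)")
```

With λ = 0 the coefficients are unchanged, and with a large enough λ all of them are eliminated, which matches the list above.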

Code:

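import pandas as pd
import numpy as np
import matplotlib.pylab as plt
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV

# load the dataset
data = pd.read_csv("tree_addhealth.csv")

# upper-case all DataFrame column names
data.columns = data.columns.str.upper()

# data management: drop rows with missing values, then recode BIO_SEX into MALE (1 -> 1, 2 -> 0)
# .copy() avoids a SettingWithCopyWarning when the new column is added
data_clean = data.dropna().copy()
recode1 = {1:1, 2:0}
data_clean['MALE'] = data_clean['BIO_SEX'].map(recode1)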

#select predictor variables and target variable as separate data sets predvar= data_clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN', 'AGE','ALCEVR1','ALCPROBS1','MAREVER1','COCEVER1','INHEVER1','CIGAVAIL','DEP1', 'ESTEEM1','VIOL1','PASSIST','DEVIANT1','GPA1','EXPEL1','FAMCONCT','PARACTV', 'PARPRES']]

target = data_clean.SCHCONN1

# standardize predictors to have mean=0 and sd=1
# (BLACK is not in the original list of rescaled columns, so it is left as-is here)
from sklearn import preprocessing
predictors = predvar.copy()
scaled_columns = ['MALE', 'HISPANIC', 'WHITE', 'NAMERICAN', 'ASIAN', 'AGE',
                  'ALCEVR1', 'ALCPROBS1', 'MAREVER1', 'COCEVER1', 'INHEVER1',
                  'CIGAVAIL', 'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1',
                  'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']
for column in scaled_columns:
    predictors[column] = preprocessing.scale(predictors[column].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
# (mse_path_ was named cv_mse_path_ in older scikit-learn releases)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)

OUTPUT:

Model Coefficients

dict(zip(predictors.columns, model.coef_)) Out[1191]: {'MALE': -0.21508693783564523, 'HISPANIC': 0.20300474500010318, 'WHITE': 0.0, 'BLACK': -0.6936478686386717, 'NAMERICAN': -0.10784573616426066, 'ASIAN': 0.1886903069462237, 'AGE': 0.21734102065275526, 'ALCEVR1': -0.32499649609663306, 'ALCPROBS1': 0.0, 'MAREVER1': -0.1598037748887288, 'COCEVER1': -0.20000921703104171, 'INHEVER1': 0.0, 'CIGAVAIL': -0.1098387918895045, 'DEP1': -0.8541784447568564, 'ESTEEM1': 1.0974098143740458, 'VIOL1': -0.6392671702279931, 'PASSIST': 0.0, 'DEVIANT1': -0.4180824602779268, 'GPA1': 0.6655764176697664, 'EXPEL1': -0.07382899854861724, 'FAMCONCT': 0.5152729478743291, 'PARACTV': 0.2999119298291273, 'PARPRES': 0.0}

Plotting Coefficient Progression:

Errors:

training data MSE 18.14857266408148

test data MSE 17.29251742716948

training data R-square 0.3336111369269187

test data R-square 0.3100111341600078

Running A Random Forest Classification Model

A random forest classifier builds a set of decision trees, each from a randomly selected subset of the training set. It then aggregates the votes of the individual trees to decide the final class of a test object.

In layman’s terms,

Suppose the training set is given as [X1, X2, X3, X4] with corresponding labels [L1, L2, L3, L4]. A random forest may create three decision trees, each taking a subset as input, for example,

[X1, X2, X3]

[X1, X2, X4]

[X2, X3, X4]

So finally, it predicts based on the majority of votes from each of the decision trees made.
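The idea of random subsets plus majority voting can be sketched without any library. This is a toy illustration, not what scikit-learn does internally: each "tree" is only a one-split stump on a single made-up feature, the subsets are bootstrap samples drawn with replacement, and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a depth-1 tree on (x, label) pairs: best threshold, majority label per side."""
    best = None
    for threshold, _ in sample:
        left = [lab for x, lab in sample if x <= threshold]
        right = [lab for x, lab in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != (left_label if x <= threshold else right_label)
                     for x, lab in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def fit_forest(data, n_trees, seed=0):
    """Train each stump on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]


def predict(forest, x):
    """Return the label that receives the most votes across all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature training set.
data = [(1.0, "no"), (2.0, "no"), (3.0, "no"), (7.0, "yes"), (8.0, "yes"), (9.0, "yes")]
forest = fit_forest(data, n_trees=25)
print(predict(forest, 0.5), predict(forest, 9.5))
```

Individual stumps can be wrong when their bootstrap sample is unlucky; the vote across many trees is what makes the final prediction stable.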

#CODE

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

df = pd.read_csv(r"C:\Users\11042\Desktop\Data\tree_addhealth.csv")
df = df.dropna()
df.dtypes
description = df.describe()

df.columns

predictors = df[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
                 'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail',
                 'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1',
                 'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]
target = df.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=0.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

###############Random Forest#######################

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier

# build model on training data
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

""" Running a different number of trees and see the effect of that on the accuracy of the prediction """

trees = range(25)
accuracy = np.zeros(25)

# fit forests with 1 to 25 trees and record the test accuracy of each
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)

#OutPut:

# The output of the confusion matrix-

Out[476]: array([[1424, 85], [ 209, 112]], dtype=int64)

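The diagonal values, 1424 and 112, are the observations the random forest classified correctly (1424 true negatives and 112 true positives, assuming rows are the true classes and columns the predicted classes, ordered 0 then 1).

The off-diagonal values are the errors: 85 false positives and 209 false negatives.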

#The Output for accuracy score:

0.839344262295082

so the Random Forest algorithm gives 83.9% accuracy.
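As a quick check, the accuracy score can be recomputed by hand from the confusion matrix above. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and that the classes are ordered 0 then 1; sensitivity and specificity are extra quantities added here for illustration.

```python
# Confusion matrix reported above: [[TN, FP], [FN, TP]] under the assumed layout.
cm = [[1424, 85], [209, 112]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total      # share of all test cases classified correctly
sensitivity = tp / (tp + fn)      # share of actual positives that were found
specificity = tn / (tn + fp)      # share of actual negatives that were found

print(f"n={total} accuracy={accuracy:.4f} "
      f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

The accuracy matches the reported score, while the much lower sensitivity shows that most of the errors are missed positives.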

#Feature importance-

0    BIO_SEX     0.026387
1    HISPANIC    0.016461
2    WHITE       0.027772
3    BLACK       0.018605
4    NAMERICAN   0.008293
5    ASIAN       0.007381
6    age         0.057511
7    ALCEVR1     0.039975
8    ALCPROBS1   0.053567
9    marever1    0.109478
10   cocever1    0.013388
11   inhever1    0.017167
12   cigavail    0.027368
13   DEP1        0.053781
14   ESTEEM1     0.052965
15   VIOL1       0.050157
16   PASSIST     0.017651
17   DEVIANT1    0.072538
18   SCHCONN1    0.067577
19   GPA1        0.083660
20   EXPEL1      0.011814
21   FAMCONCT    0.054771
22   PARACTV     0.064572
23   PARPRES     0.047161

The Graph shows the accuracy of the predictions across the random forest estimators.

Decision Tree

Introduction:

A tree has many analogies in real life, and it turns out to have influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though it is a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning, which will be the main focus of this article.

Code:

Conclusion & Output:

The model grew a total of 322 Leaves before pruning & 27 leaves following the pruning.

Model Event Level lets us confirm that the tree is predicting the value : 1 i.e Yes for our target variable regular smoking.

Number of observation read = 6504

Number of Observations Used = 6500

A vertical reference line is drawn for the tree with the number of leaves that has the lowest cross-validated ASE, here 18.

The horizontal reference line represents the minimum average squared error (ASE) plus one standard error for this complexity parameter.

The pruning plot is followed by the final model, which has 10 split levels & 27 leaves.

The model splits on Marijuana use, Race, Deviant, Behavior, Alcohol use & Grade Point Average.

SAS also generates a confusion matrix (model based) which shows how well the final classification tree performed. The total model correctly classifies 49% of those who have regularly smoked & 43% of those who have not while defining the null values as most popular seed.

The Receiver Operating Characteristic (ROC) curve plots sensitivity against 1 − specificity; here it deviates from the straight diagonal line.

Variable importance table is used to measure the importance of the primary splitting variables.

#machine learning #week1


Multivariate Regression Analysis

We perform a multiple linear regression analysis when we have more than one explanatory variable for consideration in our model. We can write the multiple linear regression equation for a model with p explanatory variables as

Y = b0 + b1X1 + b2X2 + ... + bp Xp

where Y is the response, or dependent, variable, the Xs represent the p explanatory variables, and the bs are the regression coefficients.

The goal of the study is to predict the depth of the crater from its diameter & its number of layers. Thus, the dependent variable for the analysis is Depth, and the explanatory variables are Diameter & Number_Layer.

Scatter plot with Linear Regression Line.

Scatter plot with linear and quadratic regression line.

Linear Regression.

Code:

Result

Evaluating Model Fit.

Multivariate Analysis.

In multivariate Analysis, there was a significant increase in the R square value & 95% confidence Limits with more significant results.

Partial Regression Analysis.

Standardized Residual Using QQPLOT.

Basic Linear Regression Model

While testing a basic linear regression model, the following variables are considered for analysis. NUMBER_LAYERS - collapsed into a binary categorical explanatory variable with 2 levels.

DEPTH_RIMFLOOR_TOPOG - Quantitative Response variable

Code:

Results:

Comments:

The means summary gives the mean response for each level of NUMBER_LAYERS. The aggregate mean is 0.83. From the GLM model we see that:

Number of Observations Read : 384343

Number of Observations Used: 384343

Response variable used here is DEPTH_RIMFLOOR_TOPOG .

F- statistics value is 3011.12

P-value is .0001 (i.e. negligible, so we can reject the null hypothesis).

R- square value is 0.007774

The mean of depth - 0.075838 i.e almost equal to that of the summary table mean value.

Parameter estimates:

The Beta Sub zero value : 0.75538

The Beta Sub one Value : 1.27646

So the predicted depth for layer group 0 = 0.75538 + 1.27646 * 0 = 0.75538

Predicted depth for layer group 1 = 0.75538 + 1.27646 * 1 = 2.03
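The arithmetic behind these two predictions can be written out directly. With a binary explanatory variable coded 0/1, the intercept is the predicted depth for group 0, and the intercept plus the slope is the predicted depth for group 1. The function name is hypothetical; the two estimates are the ones reported above.

```python
BETA_0 = 0.75538   # intercept reported above
BETA_1 = 1.27646   # slope for the binary layer group reported above


def predicted_depth(layer_group):
    """Predicted depth for a layer group coded 0 or 1."""
    return BETA_0 + BETA_1 * layer_group


for group in (0, 1):
    print(f"layer group {group}: predicted depth = {predicted_depth(group):.5f}")
```

The slope is therefore the difference between the two predicted group values.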

#week_2 #basic _linear_regression

Method for Crater Detection From Martian Digital Topography.

Sample:

After looking through the codebook for the Mars Crater, I am particularly interested in Crater diameter, depth, and its distribution dependence.

Impact craters are arguably the primary exogenic planetary process contributing to the surface evolution of solid bodies in the solar system.

Craters appear across the entire surface of Mars, and they are vital to understanding its crustal properties as well as surface ages and modification events. They allow inferences into the ancient climate and hydrologic history, and they add a key data point for the understanding of impact physics.

While Crater diameter, depth, and its distribution dependence is a good starting point, one needs to determine what it is about crater diameter, depth & distribution dependence that’s interesting.

So basically, I am interested in exploring the depth & diameter of the craters on the Martian surface & their association with the number of layers a crater has.

Data Collection procedure:

The data come from the work “Method for Crater Detection From Martian Digital Topography Data Using Gradient Value/Orientation, Morphometry, Vote Analysis, Slip Tuning, and Calibration”. So basically the Mars crater data are observational data collected in that process.

The variables which could basically reflect the crater distribution on the Martian surface are –

CRATER_ID – crater ID for internal use, based upon the region of the planet (1/16ths), the “pass” under which the crater was identified, and the order in which it was identified

LATITUDE_CIRCLE_IMAGE – latitude from the derived center of a nonlinear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)

LONGITUDE_CIRCLE_IMAGE – longitude from the derived center of a nonlinear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)

DIAM_CIRCLE_IMAGE – diameter from a nonlinear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)

DEPTH_RIMFLOOR_TOPOG – average elevation of each of the manually determined N points along (or inside) the crater rim (units are km).

Depth Rim - Points are selected as relative topographic highs under the assumption they are the least eroded, so most original, points along the rim

Depth Floor – Points were chosen as the lowest elevation that did not include visible embedded craters

NUMBER_LAYERS – the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

#Regression_modelling_week_1

Testing a Potential Moderator

Testing Moderation in context of Anova

Code:

Output:

Comments:

For depth_group=1, the ANOVA shows a large F value & a significant p-value.

When we look at the means table,

craters with Number_layer = 0 have a markedly high mean diameter, averaging 43.

For depth_group=2, the ANOVA again shows a large F value & a significant p-value.

Craters with Number_layer = 0 again have a markedly high mean diameter, averaging 43, i.e. the diameter does not change for depth_group=2.

For depth_group=3, the F value is much smaller & the p-value is large.

In the means table,

craters with Number_layer = 0 have a markedly high mean diameter, averaging 73.

Testing Moderation in Context of Chi-Square:

CODE:

OUTPUT:

COMMENTS:

For depth_group=1,

the chi-square value is large & the p-value is quite small, & the col pct shows a non-linear relationship.

For depth_group=2,

the chi-square value is large with a small p-value, & the col pct shows a non-linear relationship.

For depth_group=3,

the chi-square value is large with a small p-value, & the col pct shows a linear relationship with the pct of diameter increasing.

Testing Moderation in context of Pearson Correlation

CODE:

OUTPUT:

Comments:

When we examine the correlation coefficient between Diameter of the crater & Depth of the crater, we find the following:

For low number_layer, the correlation between diameter & depth of the crater averages 0.7 and the p-value is significant, while number_layer = 5 gives a correlation coefficient of 0.8 with a non-significant p-value of 0.095.

Hence it shows a positive association between the diameter & depth of the crater.

Analysis Using Correlation Coefficient.

The correlation coefficient by definition measures the linear relationship between two quantitative variables. From the data set provided, two variables are considered for analysis: the depth of the crater and the diameter of the crater.

CODE :

OUTPUT:

The Scatter plot with the linear regression line:

Comments:

Looking at the scatter plot, we can see that the depth & diameter of the crater exhibit a positive linear relationship with

Pearson Correlation coefficient (r)= 0.58671

P value= .0001

r is positive and moderately large, so it shows a moderate linear relation.

When we square r, it tells us what proportion of the variability in one variable is described by variation in the second variable (a.k.a. R-squared or coefficient of determination).

r^2=0.34

So if we know the depth, we can explain about 34% of the variability seen in the diameter of the crater.
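For readers who want to see how r and r^2 are obtained, here is a minimal sketch of the Pearson formula in plain Python. The depth and diameter values are made up for illustration (the real analysis was run in SAS on the full data set); only the final line uses the r reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical depth (km) and diameter (km) values.
depth = [0.1, 0.3, 0.5, 0.9, 1.2]
diameter = [1.5, 4.0, 3.0, 9.0, 8.0]
r = pearson_r(depth, diameter)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")

# Squaring the r reported above gives the share of variability explained.
reported_r = 0.58671
print(f"reported r^2 = {reported_r ** 2:.2f}")
```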


Test of Chi-Square test of Independence

Model Interpretation for Chi-Square Tests:

When examining the association between number of layers (categorical response) and depth of the crater (categorical explanatory), a chi-square test of independence revealed that the two variables are significantly associated, p < .0001.

The df, or degrees of freedom, contributed by the explanatory variable is its number of levels minus 1. Here depth group has 3 levels, so it contributes 3 - 1 = 2; for the full table, df = (rows - 1) * (columns - 1).
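To make the statistic and its degrees of freedom concrete, here is a small sketch that computes both for a contingency table. The counts are hypothetical and the function is only an illustration of the standard formula, not the SAS procedure used in this post.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows are layer categories, columns are depth groups 1 to 3.
table = [[90, 8, 2],
         [40, 30, 10],
         [10, 20, 30]]
stat, df = chi_square(table)
print(f"chi-square = {stat:.2f}, df = {df}")
```

A table with 6 layer categories and 3 depth groups would have (6 - 1) * (3 - 1) = 10 degrees of freedom.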

Code:

/*Library referencing the dataset*/
libname mydata '/courses/d1406ae5ba27fe300' access=readonly;

/*using the dataset marscrater_pds*/
data new;
set mydata.marscrater_pds;
label NUMBER_LAYERS= 'Layers'
DIAM_CIRCLE_IMAGE= 'Diameter'
DEPTH_RIMFLOOR_TOPOG= 'Depth';

/*for depth of crater*/
if DEPTH_RIMFLOOR_TOPOG le 1 then depth_group=1;
else if DEPTH_RIMFLOOR_TOPOG le 2 then depth_group=2;
else depth_group=3;
run;

/*sorting the data according to the NUMBER_LAYERS*/
proc sort;
by NUMBER_LAYERS;
run;

/* Chi square test of independence*/
proc freq;
tables NUMBER_LAYERS * depth_group/chisq;
run;

/* Chi square post hoc test*/

data comparison1; set new; if NUMBER_LAYERS=0 or NUMBER_LAYERS=1; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison2; set new; if NUMBER_LAYERS=0 or NUMBER_LAYERS=2; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison3; set new; if NUMBER_LAYERS=0 or NUMBER_LAYERS=3; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison4; set new; if NUMBER_LAYERS=0 or NUMBER_LAYERS=4; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison5; set new; if NUMBER_LAYERS=0 or NUMBER_LAYERS=5; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

data comparison6; set new; if NUMBER_LAYERS=1 or NUMBER_LAYERS=0; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison7; set new; if NUMBER_LAYERS=1 or NUMBER_LAYERS=2; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison8; set new; if NUMBER_LAYERS=1 or NUMBER_LAYERS=3; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison9; set new; if NUMBER_LAYERS=1 or NUMBER_LAYERS=4; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison10; set new; if NUMBER_LAYERS=1 or NUMBER_LAYERS=5; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

data comparison11; set new; if NUMBER_LAYERS=2 or NUMBER_LAYERS=0; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison12; set new; if NUMBER_LAYERS=2 or NUMBER_LAYERS=1; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison13; set new; if NUMBER_LAYERS=2 or NUMBER_LAYERS=3; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison14; set new; if NUMBER_LAYERS=2 or NUMBER_LAYERS=4; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison15; set new; if NUMBER_LAYERS=2 or NUMBER_LAYERS=5; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

data comparison16; set new; if NUMBER_LAYERS=3 or NUMBER_LAYERS=0; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison17; set new; if NUMBER_LAYERS=3 or NUMBER_LAYERS=1; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison18; set new; if NUMBER_LAYERS=3 or NUMBER_LAYERS=2; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison19; set new; if NUMBER_LAYERS=3 or NUMBER_LAYERS=4; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison20; set new; if NUMBER_LAYERS=3 or NUMBER_LAYERS=5; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

data comparison21; set new; if NUMBER_LAYERS=4 or NUMBER_LAYERS=0; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison22; set new; if NUMBER_LAYERS=4 or NUMBER_LAYERS=1; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison23; set new; if NUMBER_LAYERS=4 or NUMBER_LAYERS=2; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison24; set new; if NUMBER_LAYERS=4 or NUMBER_LAYERS=3; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison25; set new; if NUMBER_LAYERS=4 or NUMBER_LAYERS=5; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

data comparison26; set new; if NUMBER_LAYERS=5 or NUMBER_LAYERS=0; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison27; set new; if NUMBER_LAYERS=5 or NUMBER_LAYERS=1; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison28; set new; if NUMBER_LAYERS=5 or NUMBER_LAYERS=2; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison29; set new; if NUMBER_LAYERS=5 or NUMBER_LAYERS=3; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;
data comparison30; set new; if NUMBER_LAYERS=5 or NUMBER_LAYERS=4; proc sort; by NUMBER_LAYERS; proc freq; tables NUMBER_LAYERS*depth_group/chisq; run;

Output:

& this goes on for all other categories.
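The post hoc code above runs every ordered pair of layer categories, so each pair appears twice (for example 0 vs 1 and 1 vs 0). The sketch below lists the unique pairs for the levels 0 to 5 used in that code. The Bonferroni-adjusted significance level is an addition of mine, a common way to control for multiple comparisons, and is not stated in the original post.

```python
from itertools import combinations

# Layer categories compared in the SAS post hoc code above.
levels = [0, 1, 2, 3, 4, 5]

# Each unordered pair only needs to be tested once.
pairs = list(combinations(levels, 2))

# Bonferroni adjustment (assumption): divide the overall alpha by the number of tests.
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

print(f"{len(pairs)} unique comparisons")
print(f"adjusted significance level = {adjusted_alpha:.4f}")
for a, b in pairs:
    print(f"NUMBER_LAYERS={a} vs NUMBER_LAYERS={b}")
```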

Data Interpretation Using ANOVA

SAS CODE:

To answer the question of how the depth of the crater is related to the number of layers, we perform an ANOVA with a Duncan post hoc test, because the explanatory variable has more than 2 levels or groups.

Output:

Model Interpretation for post hoc ANOVA results:

ANOVA revealed that, among Mars craters (my sample), number of layers of the crater (collapsed into 6 ordered categories, which is the categorical explanatory variable) and depth of the crater (quantitative response variable) were significantly associated, F = 18850.4, p= 0.05. Post hoc comparisons of mean depth by number-of-layers category revealed that craters with more than 2 layers (i.e. 3,4,5) had significantly greater depth than those with fewer than 3 layers (i.e. 2,1,0). All other comparisons were statistically similar.
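The F statistic reported by an ANOVA is the ratio of between-group to within-group variability. Here is a minimal sketch of that calculation in plain Python; the three groups of depth values are made up, and this is not the SAS procedure itself.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical crater depths (km) for three layer categories.
groups = [[0.1, 0.2, 0.3], [0.4, 0.6, 0.5], [0.9, 1.1, 1.0]]
f_value, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

A large F means the group means differ by much more than the spread inside each group would suggest.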

#data analysis tools week1 anova

Data Visualization & its Distribution

Code for Data Visualization using SAS:

Output & Distribution:

Diameter of the crater:

Depth of the Crater:

Number of Layers :

Distribution:

Uni-variate graph of rate of number of layers for Mars Crater:

This graph is unimodal, with its highest peak at NUMBER_LAYERS=0. It seems to be skewed to the right as there are higher frequencies in lower categories than the higher categories.

Nearly the entire weight of the graph is on 0: more than 90% of it is carried by craters which don’t have any layers.

Uni-variate graph for Depth of Mars Crater:

This graph is unimodal, with its highest peak at NUMBER_LAYER= 5. It seems to be skewed to the left as there are higher frequencies in the higher Layer ranges.

The graph shows a positive linear distribution i.e. with increase in the Number of Layers, the Depth of the crater also increases.

Uni-variate graph for Diameter of Mars Crater:

This graph is unimodal, with its highest peak at NUMBER_LAYER= 5. It seems to be skewed to the left as there are higher frequencies in the higher Layer ranges.

The graph shows a positive linear distribution i.e. with increase in the Number of Layers, the Diameter of the crater also increases.

Bi-variate Graph :

The graph above plots the Diameter of the Crater corresponding to the Depth of the Crater. We can see that the scatter graph does not show a clear relationship/trend between the two variables.

Though a slight positive relationship between the two variables is seen i.e. with the increase in Depth of the crater, its Diameter also increases slightly from a depth of 1km. Presence of outliers is also seen at Depth almost equal to 0.

#week_4_assignment

Making Data Management Decisions with SAS.

Using the Mars Crater_pds data, data management that introduces secondary variables further narrows down our research.

Code:

OUTPUT:

I collapsed the responses for NUMBER_LAYERS , DIAM_CIRCLE_IMAGE , DEPTH_RIMFLOOR_TOPOG to create three new variables- layer_group, depth_group, diameter_group.

For layer_group, the most commonly endorsed response was 1 (99.78%), meaning that most craters have between 0 and 2 layers. The endorsed response is completely zero for layers above 4.

For diameter_group, the most commonly endorsed response was 1 (99.46%), meaning that most craters have a diameter between 0 and 50 km.

For depth_group, the most commonly endorsed response was 1 (99.92%), meaning that most craters have a depth between 0 and 2 km. The endorsed response is completely zero for depths above 4 km.
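Collapsing a quantitative variable into groups and tabulating the percentages can be sketched as follows. The cut points (1 km and 2 km) mirror the depth_group logic in the chi-square SAS code elsewhere on this blog and are an assumption here, since the data management code for this post is not shown; the depth values are made up.

```python
from collections import Counter


def depth_group(depth_km):
    """Collapse a depth in km into group 1 (<= 1), 2 (<= 2) or 3 (> 2)."""
    if depth_km <= 1:
        return 1
    if depth_km <= 2:
        return 2
    return 3


# Hypothetical crater depths in km.
depths = [0.0, 0.2, 0.9, 1.0, 1.4, 2.0, 2.7]

# Frequency and percent per group, like a one-way frequency table.
counts = Counter(depth_group(d) for d in depths)
for group in sorted(counts):
    percent = 100 * counts[group] / len(depths)
    print(f"depth_group={group}: n={counts[group]} ({percent:.2f}%)")
```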

#WEEK_3_Assignment

Frequency Distribution for Marscrater_pds & Insights

SAS Code:

/*Library referencing the dataset*/

libname mydata '/courses/d1406ae5ba27fe300' access=readonly;

/*using the dataset marscrater_pds*/

data new;

set mydata.marscrater_pds;

/* Labelling the variables of the dataset*/

label NUMBER_LAYERS= 'Layers'

DIAM_CIRCLE_IMAGE= 'Diameter'

DEPTH_RIMFLOOR_TOPOG= 'Depth';

run;

/*sorting the data according to the NUMBER_LAYERS*/

proc sort;

by NUMBER_LAYERS;

run;

/*frequency table distribution of the 3 variable*/

proc freq;

tables NUMBER_LAYERS DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG;

run;

Variables Considered:

NUMBER_LAYERS

DIAM_CIRCLE_IMAGE

DEPTH_RIMFLOOR_TOPOG

Frequency Distribution Insights & Output

NUMBER_LAYERS:

The frequency is highest for craters whose NUMBER_LAYERS is 0.

The frequency drops abruptly as the NUMBER_LAYERS increases from 0-4.

#Click here for the output.

DEPTH_RIMFLOOR_TOPOG:

The DEPTH_RIMFLOOR_TOPOG is less than 0 for a few observations. It is deceptive to believe that the depth of a crater can be negative.

DEPTH_RIMFLOOR_TOPOG distribution frequency abruptly decreases as the depth increases from 0-4. It is maximum for the depth ranging from 0-1.

#Click here for the output

DIAM_CIRCLE_IMAGE:

The frequency decreases gradually as the diameter of the crater increases.

#click here for the output

#Frequency_distribution output - click the hyperlink.


Crater Diameter & Depth relative to its distribution pattern

After looking through the codebook for the Mars Crater, I am particularly interested in Crater diameter, depth, and its distribution dependence.

Impact craters are arguably the primary exogenic planetary process contributing to the surface evolution of solid bodies in the solar system.

Craters appear across the entire surface of Mars, and they are vital to understanding its crustal properties as well as surface ages and modification events. They allow inferences into the ancient climate and hydrologic history, and they add a key data point for the understanding of impact physics.

The work presents a new global database for Mars that contains 378,540 craters statistically complete for diameters D ≥ 1 km.

This detailed database includes

Location and size

Ejecta morphology and morphometry

Interior morphology and degradation state

And whether the crater is a secondary impact

This database allowed exploration of

Global crater type

Distributions

Depth

And morphologies in unprecedented detail

These were used to re-examine basic crater scaling laws for the planet. The inclusion of hundreds of thousands of small, approximately kilometer-sized impacts facilitated a detailed study of

The properties of nearby fields of secondary craters in relation to their primary crater.

The discovery of vast distant clusters of secondary craters over 5000 km from their primary crater.

Finally, significantly smaller craters were used to age-date volcanic calderas on the planet to re-construct the timeline of the last primary eruption events from 20 of the major Martian volcanoes.

While crater diameter, depth, and its distribution dependence is a good starting point, one needs to determine what it is about crater diameter, depth & distribution dependence that’s interesting. It strikes me that the diameter & the depth vary greatly with the latitude & longitude. So basically, I am interested in exploring the depth & diameter of the craters on the Martian surface & their association with the latitude & longitude of the volcanic calderas.