Quarkus - a Kubernetes based framework
Quarkus is a Java framework tailored for deployment on Kubernetes. Key technology components surrounding it are OpenJDK HotSpot and GraalVM. The goal of Quarkus is to make Java a leading platform in Kubernetes and serverless environments while offering developers a unified reactive and imperative programming model to optimally address a wider range of distributed application architectures. Quarkus also offers near-instant scale-up and high-density utilisation in container orchestration platforms such as Kubernetes. Many more application instances can be run given the same hardware resources. After its initial debut, Quarkus underwent several enhancements over the next few months, culminating in a 1.0 release within the open source community in October 2019. As a new framework, Quarkus does not need to attempt to retrofit new patterns and principles into an existing codebase. Instead, it can focus on innovation.
Java applications are described as WORA (Write Once, Run Anywhere). This means a programmer can develop Java code on one system and expect it to run on any other Java-enabled system without any adjustment. This is possible because of the Java Virtual Machine (JVM), which runs in RAM. During execution, the class loader brings the class files into RAM and the bytecode is verified for security breaches. Next, the execution engine converts the bytecode into native machine code.
Traditional Java stacks were engineered for monolithic applications with long start-up times and large memory requirements in a world where the cloud, containers, and Kubernetes did not exist. Java frameworks needed to evolve to meet the needs of this new world.
Quarkus was created to enable Java developers to create applications for a modern, cloud-native world. Quarkus is a Kubernetes-native Java framework tailored for GraalVM and HotSpot, crafted from best-of-breed Java libraries and standards. The goal is to make Java the leading platform in Kubernetes and serverless environments while offering developers a framework to address a wider range of distributed application architectures. Quarkus was built from the ground up for Kubernetes making it easy to deploy applications without having to understand all of the complexities of the platform. Quarkus allows developers to automatically generate Kubernetes resources including building and deploying container images without having to manually create YAML files. Quarkus provides a cohesive, fun to use, full-stack framework by leveraging a growing list of hundreds of best-of-breed libraries that you love and use. All wired on a standard backbone.
One of the major productivity problems facing most Java developers is the traditional Java development workflow. For most web developers this will generally be:
Write Code → Compile → Deploy → Refresh Browser → Repeat
This can be a major drain on productivity, as the compile + redeploy cycle can often take up to a minute or more. Quarkus aims to solve this problem with its Live Coding feature. When running in development mode the workflow is simply:
Write Code → Refresh Browser → Repeat
RESULTS
The above figure shows the docker stats of the two containers: app-access, running without Quarkus, and app-access-jars, running with Quarkus. We can see that the container running with Quarkus uses less CPU and memory.
We can see that the throughput with Quarkus is almost double that without Quarkus. Higher throughput is better: throughput is the number of requests that can be handled per second.
CONCLUSION
As claimed by Red Hat, the developer of Quarkus, we were able to see some difference in the response time and memory footprint of the API, though the difference was not as large as Red Hat claims.
k-means Cluster
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), which serves as a prototype of the cluster. Note that each code snippet is followed by its output.
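To make the "nearest mean" idea concrete, here is a minimal pure-Python sketch of the two steps k-means alternates between: assigning each point to its closest centroid, then moving each centroid to the mean of its points. The toy points, the starting centroids and the function names are my own assumptions for illustration; they are not part of the gapminder analysis below, which uses scikit-learn's KMeans.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def update_centroids(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters."""
    centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumption: every cluster keeps at least one member in this toy data.
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids


# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical starting guesses

for _ in range(5):  # a few rounds are enough for this toy data
    labels = assign_to_nearest(points, centroids)
    centroids = update_centroids(points, labels, k=2)

print(labels)
print(centroids)
```

The first three points end up in one cluster and the last three in the other, with each centroid sitting at the mean of its own group.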
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv("/content/gapminder.csv")
# replace blank entries with NaN in every analysis column
for col in ['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'internetuserate', 'suicideper100th', 'employrate', 'urbanrate']:
    data[col] = data[col].replace(' ', np.nan)
data_clean = data.dropna()

#variables
cluster = data_clean[['incomeperperson','alcconsumption','armedforcesrate','co2emissions','internetuserate','suicideper100th','employrate','urbanrate']]

#standardize clustering variables to have mean=0 and std=1
cluster_v = cluster.copy()
for col in cluster_v.columns:
    cluster_v[col] = preprocessing.scale(cluster_v[col].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(cluster_v, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model3.labels_)
newlist = dict(zip(cluslist, labels))
newlist
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']

newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()

clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
data_clean['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', np.nan)
data_clean['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='ignore')
lifeexpectancy_data = data_clean['lifeexpectancy']

lifeexpectancy_train, lifeexpectancy_test = train_test_split(lifeexpectancy_data, test_size=.3, random_state=123)
lifeexpectancy_train1 = pd.DataFrame(lifeexpectancy_train)
lifeexpectancy_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(lifeexpectancy_train1, merged_train, on='index')
sub1 = merged_train_all[['lifeexpectancy', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

mod = smf.ols(formula='lifeexpectancy ~ C(cluster)', data=sub1).fit()
print(mod.summary())

print('means for lifeexpectancy by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for lifeexpectancy by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['lifeexpectancy'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
SUMMARY
From the bends in the line we can say that there might be 3 clusters.
Two of the clusters are close to each other, while the 3rd is quite far away.
The 3rd cluster is not related to the response variable. This can be concluded from the OLS regression model as well as the standard deviations.
Advantages :
If the number of variables is large, K-Means is most of the time computationally faster than hierarchical clustering, provided we keep k small.
K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages :
It is difficult to predict the value of k.
It does not work well with global clusters.
Different initial partitions can result in different final clusters.
It does not work well with clusters (in the original data) of different size and different density.
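The point about initial partitions is easy to see on a toy example. The sketch below is my own minimal 1-D k-means (not scikit-learn's implementation), run twice on the same hypothetical values with two different sets of starting centers.

```python
def kmeans_1d(values, centers, rounds=20):
    """Run the assign/update loop on 1-D data from the given starting centers."""
    centers = list(centers)
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters


values = [0, 1, 10, 11, 20, 21]  # hypothetical data with three natural groups

run_a = kmeans_1d(values, centers=[0, 21])
run_b = kmeans_1d(values, centers=[0, 1])

print(run_a)
print(run_b)
```

With k=2 forced onto three natural groups, the middle group is split in the first run and attached whole to the upper group in the second, so the final clusters depend on where the centers started. In scikit-learn, fixing random_state on KMeans makes a run repeatable.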
Lasso Regression Analysis
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). Note that the output of each code segment is below the input.
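A small sketch of what shrinkage does to a single coefficient may help. In the simplest textbook setting (predictors that are standardized and uncorrelated with each other), the lasso estimate is the unpenalized estimate pulled toward zero by a fixed amount and set to exactly zero if it would cross zero. This operation is usually called soft-thresholding. The coefficient values and the penalty of 0.5 below are hypothetical, and this is not how LassoLarsCV computes its solution internally.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it would cross zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical unpenalized coefficients for three predictors.
unpenalized = {"x1": 4.8, "x2": -0.7, "x3": 0.2}

shrunk = {name: soft_threshold(b, 0.5) for name, b in unpenalized.items()}
print(shrunk)
```

The large coefficient survives slightly reduced, the small one becomes exactly zero, and that is why lasso produces sparse models.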
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing

data = pd.read_csv("/content/_7548339a20b4e1d06571333baf47b8df_gapminder.csv")
# replace blank entries with NaN in every analysis column
for col in ['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'internetuserate', 'suicideper100th', 'employrate', 'urbanrate']:
    data[col] = data[col].replace(' ', np.nan)
data_clean = data.dropna()

#variables
predv = data_clean[['incomeperperson','alcconsumption','armedforcesrate','co2emissions','internetuserate','suicideper100th','employrate','urbanrate']]
targets = data_clean.lifeexpectancy

#standardize predictors to have mean=0 and std=1
predictors = predv.copy()
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))

#split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)

#lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

#variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
output:
{'alcconsumption': 0.0, 'armedforcesrate': 0.0, 'co2emissions': 0.0, 'employrate': -0.6517839519540586, 'incomeperperson': 0.13281754478550406, 'internetuserate': 4.76324440193632, 'suicideper100th': -0.2672969316283555, 'urbanrate': 1.9559846378524477}
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
output:
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
output:
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
output:
training data MSE
32.92544239881372
test data MSE
37.28575586329495
# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
output:
training data R-square
0.6025231569413958
test data R-square
0.6777826448442148
SUMMARY
alcconsumption, armedforcesrate and co2emissions reduce to zero.
internetuserate has the highest model coefficient, followed by urbanrate.
employrate and suicideper100th follow next but are inversely related.
incomeperperson shows the smallest non-zero model coefficient.
The Regression Coefficients Progression for Lasso Paths shows the same results.
Mean squared error reduces and later stabilizes, except for one parameter whose mean squared error shoots up after the introduction of other parameters.
We can observe that the test data set has a higher R-square (0.678) than the training data set (0.603), although its MSE is slightly higher (37.29 against 32.93).
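The two metrics reported above can be computed by hand, which also shows why a data set can have a higher MSE and a higher R-square at the same time: R-square compares the errors with the spread of the target values in that particular set. The sketch below uses hypothetical numbers, not the gapminder data, and my own small helper functions rather than scikit-learn's.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Two hypothetical sets with identical errors but different spread in the target.
narrow_actual, narrow_pred = [68, 70, 72], [69, 70, 71]
wide_actual, wide_pred = [50, 70, 90], [51, 70, 89]

mse_narrow = mean_squared_error(narrow_actual, narrow_pred)
mse_wide = mean_squared_error(wide_actual, wide_pred)
r2_narrow = r_squared(narrow_actual, narrow_pred)
r2_wide = r_squared(wide_actual, wide_pred)

print(mse_narrow, mse_wide)
print(r2_narrow, r2_wide)
```

Both sets have the same MSE, but the set whose target values are more spread out gets a much higher R-square. A random 70/30 split can easily leave the test set with a different spread of life expectancy than the training set.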
Pros:
Like any regularization method, it can avoid overfitting. It can be applied even when the number of features is larger than the number of observations.
It can do feature selection.
It is fast in terms of inference and fitting.
Cons:
The model selected by lasso is not stable.
When there are highly correlated features, lasso may randomly select one of them or part of them. The result depends on the implementation. To improve on this, elastic net was introduced.
Prediction performance is usually worse than ridge regression in terms of mse.
Random Forest
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees. Following is the code and subsequent output of each part:
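The "mode of the classes or mean prediction" part can be sketched without any library. Below, three hand-written threshold rules stand in for trained trees; the thresholds, the sample row and the regression outputs are all hypothetical and only illustrate how individual tree outputs are combined.

```python
from statistics import mean, mode

# Hypothetical hand-written "trees": each one is a single threshold rule.
trees = [
    lambda row: "1" if row["internetuserate"] > 30 else "0",
    lambda row: "1" if row["urbanrate"] > 55 else "0",
    lambda row: "1" if row["incomeperperson"] > 3000 else "0",
]


def forest_classify(row):
    """Classification: return the most common vote among the trees."""
    votes = [tree(row) for tree in trees]
    return mode(votes)


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_predictions)


row = {"internetuserate": 45.0, "urbanrate": 40.0, "incomeperperson": 5200.0}
predicted_class = forest_classify(row)
predicted_value = forest_regress([71.0, 69.5, 73.0])

print(predicted_class)
print(predicted_value)
```

Two of the three rules vote "1" for this row, so the forest predicts "1" even though one tree disagrees.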
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("/content/_7548339a20b4e1d06571333baf47b8df_gapminder.csv")
data_clean = data.dropna()

data_clean.dtypes
data_clean.describe()

# replace blank entries with NaN in every analysis column
for col in ['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'internetuserate', 'suicideper100th', 'employrate', 'urbanrate']:
    data[col] = data[col].replace(' ', np.nan)
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='ignore')
# split life expectancy at the median into two classes, '0' (low) and '1' (high)
data['lifeexpectancy'] = pd.qcut(data.lifeexpectancy, q=2, labels=['0','1'])

data = data.dropna()

#variables
predictors = data[['incomeperperson','alcconsumption','armedforcesrate','co2emissions','internetuserate','suicideper100th','employrate','urbanrate']]
targets = data.lifeexpectancy

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print('pred_train.shape')
print(pred_train.shape)
print('\npred_test.shape')
print(pred_test.shape)
print('\ntar_train.shape')
print(tar_train.shape)
print('\ntar_test.shape')
print(tar_test.shape)
OUTPUT:
pred_train.shape (92, 8)
pred_test.shape (62, 8)
tar_train.shape (92,)
tar_test.shape (62,)
#Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
OUTPUT:
array([[23, 8], [ 4, 27]])
sklearn.metrics.accuracy_score(tar_test, predictions)
OUTPUT:
0.8064516129032258
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
OUTPUT:
[0.14696707 0.06439017 0.07504602 0.05746416 0.29699549 0.10973903 0.0562014 0.19319666]
""" Running a different number of trees and see the effect of that on the accuracy of the prediction """
trees = range(25)
accuracy = np.zeros(25)

for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
OUTPUT:
[<matplotlib.lines.Line2D at 0x7f3e608362e8>]
SUMMARY:
Training and test data are split 60-40. As shown in the output, there are 8 explanatory variables; 92 observations are used for training and the remaining 62 for testing.
True negatives=23
True positives=27
False negatives=4
False positives=8
Overall accuracy of forest is 0.80645 i.e. approximately 81%
Variable with highest importance score is internetuserate
Variable with lowest importance score is employrate
Only 4 trees give almost the same accuracy as the highest one.
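The accuracy figure can be checked directly from the four confusion-matrix counts listed above. The sketch below also derives precision and recall from the same counts; those two were not reported in the original output and are included here only as an illustration of what else the matrix gives you.

```python
# Counts taken from the confusion matrix reported above.
tn, fp, fn, tp = 23, 8, 4, 27

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all test countries classified correctly
precision = tp / (tp + fp)     # of the predicted "high" countries, how many really are
recall = tp / (tp + fn)        # of the truly "high" countries, how many were found

print(total)
print(round(accuracy, 5))
print(round(precision, 4), round(recall, 4))
```

The total of 62 matches the size of the test set, and the accuracy agrees with the value returned by accuracy_score.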
Advantages
It can handle very high dimensional data (many features), with no need to reduce the dimension or perform feature selection.
It can judge the importance of each feature.
It can judge the interaction between different features.
It is not easy to overfit.
Training is fast and easy to parallelize.
It is relatively simple to implement.
For unbalanced data sets, it balances the error.
If a large part of the features is missing, accuracy can still be maintained.
Disadvantages
Random forests have been shown to overfit on certain noisy classification or regression problems.
For data whose attributes take different numbers of values, attributes with more values have a greater impact on the random forest, so the attribute weights it generates on such data are not reliable.

Decision Trees
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes).
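The flowchart structure is easy to picture as nested dictionaries: internal nodes hold a test on one attribute, branches are the two outcomes, and leaves hold a class label. The tree below is written by hand with hypothetical thresholds on urbanrate and co2emissions; it is not the tree that DecisionTreeClassifier learns from the data further down.

```python
# Hypothetical hand-built tree: internal nodes test one attribute, leaves hold a label.
tree = {
    "feature": "urbanrate", "threshold": 50.0,
    "left": {"label": "0"},                      # urbanrate <= 50
    "right": {                                   # urbanrate > 50
        "feature": "co2emissions", "threshold": 1.0e9,
        "left": {"label": "0"},                  # co2emissions <= 1e9
        "right": {"label": "1"},                 # co2emissions > 1e9
    },
}


def classify(node, row):
    """Follow the tests from the root until a leaf is reached, then return its label."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


urban_high_co2 = classify(tree, {"urbanrate": 75.0, "co2emissions": 5.0e9})
rural = classify(tree, {"urbanrate": 30.0, "co2emissions": 5.0e9})

print(urban_high_co2, rural)
```

Training a real decision tree amounts to choosing these features and thresholds automatically from the training data.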
Testing multiple explanatory variables
input:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

gapminder = pd.read_csv("/_7548339a20b4e1d06571333baf47b8df_gapminder.csv")

data = gapminder.dropna()
data.dtypes
data.describe()

# replace blank entries with NaN in every analysis column
for col in ['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'internetuserate', 'suicideper100th', 'employrate', 'urbanrate']:
    data[col] = data[col].replace(' ', np.nan)
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='ignore')
# split life expectancy at the median into two classes, '0' (low) and '1' (high)
data['lifeexpectancy'] = pd.qcut(data.lifeexpectancy, q=2, labels=['0','1'])

data = data.dropna()

#variables
predictors = data[['incomeperperson','alcconsumption','armedforcesrate','co2emissions','internetuserate','suicideper100th','employrate','urbanrate']]
targets = data.lifeexpectancy

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print('pred_train.shape')
print(pred_train.shape)
print('\npred_test.shape')
print(pred_test.shape)
print('\ntar_train.shape')
print(tar_train.shape)
print('\ntar_test.shape')
print(tar_test.shape)

#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# render the fitted tree as an image
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
output:
pred_train.shape (92, 8)
pred_test.shape (62, 8)
tar_train.shape (92,)
tar_test.shape (62,)
Testing two explanatory variables
input:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

#os.chdir("/_7548339a20b4e1d06571333baf47b8df_gapminder.csv")

gapminder = pd.read_csv("/_7548339a20b4e1d06571333baf47b8df_gapminder.csv")

data = gapminder.dropna()
data.dtypes
data.describe()

data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', np.nan)
data['urbanrate'] = data['urbanrate'].replace(' ', np.nan)
data['co2emissions'] = data['co2emissions'].replace(' ', np.nan)
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='ignore')
# split life expectancy at the median into two classes, '0' (low) and '1' (high)
data['lifeexpectancy'] = pd.qcut(data.lifeexpectancy, q=2, labels=['0','1'])

data = data.dropna()

#variables
predictors = data[['co2emissions', 'urbanrate']]
targets = data.lifeexpectancy

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print('pred_train.shape')
print(pred_train.shape)
print('\npred_test.shape')
print(pred_test.shape)
print('\ntar_train.shape')
print(tar_train.shape)
print('\ntar_test.shape')
print(tar_test.shape)

#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# render the fitted tree as an image
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
output:
pred_train.shape (108, 2)
pred_test.shape (73, 2)
tar_train.shape (108,)
tar_test.shape (73,)
Summary:
Advantages:
Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
A decision tree does not require normalization of data.
A decision tree does not require scaling of data either.
Missing values in the data do not affect the process of building a decision tree to any considerable extent.
A decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.
Disadvantages:
A small change in the data can cause a large change in the structure of the decision tree, causing instability.
For a decision tree, the calculation can sometimes become far more complex than for other algorithms.
A decision tree often takes more time to train.
Decision tree training is relatively expensive, as both the complexity and the time taken are higher.
The decision tree algorithm is inadequate for applying regression and predicting continuous values.
Test a Logistic Regression Model
(i) Summarize in a few sentences what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable.
From the above graph it is clear that the other explanatory variables do not affect the relationship between my primary explanatory variable and the response variable. However, I found that the other explanatory variables did affect each other's relationship with the response variable. For example, when employment rate was brought into the picture, alcohol consumption no longer affected the response variable.
(ii) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.
According to the various analyses I ran, I can conclude that there is a relationship between my primary explanatory variable and my response variable. This relationship is not affected by any other explanatory variable.
(iii) Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable.
There was no evidence of confounding for the association between my primary explanatory variable and the response variable. However, other explanatory variables showed confounding: when employment rate was brought into the picture, alcohol consumption no longer affected the response variable.
(iv) Include your logistic regression output in your blog entry.
import pandas
import numpy
import statsmodels.formula.api as smf

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

data['incomeperperson'] = data['incomeperperson'].replace(' ', numpy.nan)
data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', numpy.nan)

data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='ignore')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='ignore')

#made categorical response variable
data['lifeexpactancy_c'] = pandas.qcut(data.lifeexpectancy, q=2, labels=['0', '1'])
data['lifeexpactancy_C'] = pandas.to_numeric(data['lifeexpactancy_c'], errors='ignore')

#logistic regression
lreg1 = smf.logit(formula = 'lifeexpactancy_C ~ incomeperperson', data = data).fit()
print(lreg1.summary())

#odds ratio
print("\n\nodds ratio: ")
print(numpy.exp(lreg1.params))

#odds ratios with 95% confidence intervals
print('\n\n')
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['LOWER CI', 'UPPER CI', 'OR']
print(numpy.exp(conf))
After adjusting for potential confounding factors (alcohol consumption, urban rate and employment rate), the odds of a country having high life expectancy are multiplied by 1.000517 for each one-unit increase in income per person (p<0.001, OR=1.000517, 95% CI=1.000327 to 1.000786).
SUMMARY:
There is a relationship between my primary explanatory and response variables.
There is no evidence of confounding between my primary explanatory variable and the response variable.
Other explanatory variables showed confounding.
When employment rate was brought into the picture, alcohol consumption no longer affected the response variable.
After adjusting for potential confounding factors, the odds of a country having high life expectancy are multiplied by 1.000517 for each one-unit increase in income per person (p<0.001, OR=1.000517, 95% CI=1.000327 to 1.000786).
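An odds ratio this close to 1 looks negligible, but it applies per single unit of income per person, and the effect compounds over larger differences. The sketch below starts from the reported odds ratio, recovers the underlying logit coefficient, and scales it up. The 1000-unit and 10000-unit increases are hypothetical examples I chose for illustration.

```python
from math import exp, log

odds_ratio_per_unit = 1.000517          # reported odds ratio for incomeperperson
coef = log(odds_ratio_per_unit)         # the logit coefficient is the log of the odds ratio


def odds_multiplier(increase):
    """Factor by which the odds change when income per person rises by `increase` units."""
    return exp(coef * increase)


per_thousand = odds_multiplier(1000)
per_ten_thousand = odds_multiplier(10000)

print(round(per_thousand, 2))
print(round(per_ten_thousand, 1))
```

A difference of 1000 units in income per person multiplies the odds of high life expectancy by roughly 1.68, so the association is far from trivial at the scale on which countries actually differ.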
Test a Multiple Regression Model
Multiple Regression
import pandas
import numpy
import scipy.stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

# replace blank entries with NaN, then convert each column to numeric
for col in ['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'employrate', 'urbanrate']:
    data[col] = data[col].replace(' ', numpy.nan)

for col in ['lifeexpectancy', 'incomeperperson', 'alcconsumption', 'employrate', 'urbanrate']:
    data[col] = pandas.to_numeric(data[col], errors='ignore')

sub1 = data[['incomeperperson', 'lifeexpectancy', 'alcconsumption', 'employrate', 'urbanrate']].dropna()

print('\nThe mean of explanatory variable is:')
print(data['incomeperperson'].mean())

print('\nThe values of incomeperperson after Centering:')
sub1['incomeperperson_m'] = data['incomeperperson'] - data['incomeperperson'].mean()
print(sub1['incomeperperson_m'].mean())

print('\nassociation between incomeperperson and life expectancy after centering:')
print(scipy.stats.pearsonr(sub1['incomeperperson_m'], sub1['lifeexpectancy']))

print('\nOLS regression model for incomeperperson and life expectancy after centering:')
model2 = smf.ols(formula='lifeexpectancy~incomeperperson_m', data=sub1).fit()
print(model2.summary())

#multiple regression
print('\nOLS regression model for multiple regression')
model3 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption', data=sub1).fit()
print(model3.summary())

#multiple regression
print('\nOLS regression model for multiple regression')
model3 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption + employrate', data=sub1).fit()
print(model3.summary())

#multiple regression
print('\nOLS regression model for multiple regression')
model3 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption + employrate + urbanrate', data=sub1).fit()
print(model3.summary())
interpretation: WE CAN SEE THAT THE EXPLANATORY VARIABLE AND THE RESPONSE VARIABLE ARE RELATED, AS THE P-VALUE IS ALMOST ZERO. THE INTERCEPT ESTIMATE IS 70.1473, WITH A CONFIDENCE INTERVAL FROM 68.906 TO 71.389.
interpretation: MULTIPLE REGRESSION ANALYSIS: ALCOHOL CONSUMPTION DOES NOT AFFECT THE RELATION BETWEEN THE PRIMARY VARIABLES, ALTHOUGH IT IS RELATED TO THE RESPONSE VARIABLE. WE CAN SEE AN IMPROVEMENT IN THE R-SQUARED VALUE.
interpretation: MULTIPLE REGRESSION ANALYSIS: EMPLOYMENT RATE DOES NOT AFFECT THE RELATION BETWEEN THE PRIMARY VARIABLES, ALTHOUGH IT IS RELATED TO THE RESPONSE VARIABLE AND IT AFFECTS THE RELATIONSHIP BETWEEN ALCOHOL CONSUMPTION AND THE RESPONSE VARIABLE. THE R-SQUARED VALUE INCREASES FURTHER.
interpretation: MULTIPLE REGRESSION ANALYSIS: URBAN RATE DOES NOT AFFECT THE RELATION BETWEEN THE PRIMARY VARIABLES, ALTHOUGH IT IS HIGHLY RELATED TO THE RESPONSE VARIABLE AND IT AFFECTS THE RELATIONSHIP BETWEEN EMPLOYMENT RATE AND THE RESPONSE VARIABLE. THE R-SQUARED VALUE INCREASES FURTHER.
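R-squared can only stay the same or rise when another predictor is added, so an increase on its own says little. The statsmodels summary also prints an adjusted R-squared, which penalises extra predictors. Below is a minimal sketch of that formula; the function name adjusted_r2 and all numbers are hypothetical and are not taken from the models above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# hypothetical values: a small R-squared gain bought with three extra predictors
adj_one = adjusted_r2(0.40, n_obs=50, n_predictors=1)
adj_four = adjusted_r2(0.41, n_obs=50, n_predictors=4)

print(round(adj_one, 4), round(adj_four, 4))
```

In this made-up case the plain R-squared rises from 0.40 to 0.41 while the adjusted value falls, which is why both are worth reading in the summary table.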
Regression diagnostic
#multiple regression
print('\nOLS regression model for multiple regression')
model3 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption', data=sub1).fit()
print(model3.summary())

#multiple regression
print('\nOLS regression model for multiple regression')
model4 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption + employrate', data=sub1).fit()
print(model4.summary())

#multiple regression
print('\nOLS regression model for multiple regression')
model5 = smf.ols(formula='lifeexpectancy~incomeperperson_m + alcconsumption + employrate + urbanrate', data=sub1).fit()
print(model5.summary())

#QQ plot for normality
fig1 = sm.qqplot(model5.resid, line='r')

#simple plot of residuals
stdres = pandas.DataFrame(model5.resid_pearson)
fig2 = plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig2)

#additional regression diagnostic plots
fig3 = plt.figure(figsize=(12,8))
fig3 = sm.graphics.plot_regress_exog(model5, 'incomeperperson_m', fig=fig3)
print(fig3)
interpretation: WE CAN SEE THAT THE LOWER AND HIGHER VALUES DON’T FOLLOW THE STRAIGHT REFERENCE LINE, SO THE RELATIONSHIP IS PERHAPS CURVILINEAR.
interpretation: BY OBSERVING THE VERTICAL AXIS WE CAN SAY THAT MOST OF THE VALUES LIE BETWEEN 1 AND -1. HENCE THE AMOUNT OF ERROR IS ACCEPTABLE. FOR THE ADDITIONAL HORIZONTAL LINE THE AMOUNT OF ERROR IS UNACCEPTABLE.
interpretation: WE CAN SAY FROM THE RESIDUAL PLOT THAT IT IS SOMEWHAT FUNNEL SHAPED. THE CCPR PLOT SUGGESTS THE SAME. THE PARTIAL REGRESSION PLOT SUGGESTS A LINEAR RELATIONSHIP; HOWEVER, THERE ARE SOME DISCRETE VALUES.
NOTE : SUMMARY IS BELOW EVERY GRAPH.
Centering of quantitative explanatory variable
Program
print('\nThe mean of explanatory variable is:')
print(data['incomeperperson'].mean())

print('\nThe values of incomeperperson after Centering:')
data['incomeperperson'] = data['incomeperperson'] - data['incomeperperson'].mean()
print(data['incomeperperson'])

print('\nassociation between incomeperperson and life expectancy after centering:')
print(scipy.stats.pearsonr(sub1['incomeperperson'], sub1['lifeexpectancy']))

print('\nOLS regression model for incomeperperson and life expectancy after centering:')
model2 = smf.ols(formula='incomeperperson ~ lifeexpectancy', data=data).fit()
print(model2.summary())
Output
Summary of Linear Regression
I chose gapminder data set for my study. My variables are:
explanatory variable: income per person
response variable: life expectancy
From the linear regression results we can observe that:
p value is very small.
F-statistics value is big.
the graph shows a linear relation.
from the graph we can also say that for the first few values y rises steeply, and after that the graph shows a linear relation.
there is some homoscedasticity
from the regression model results we can see that, y-intercept is: 653.3113
P>|t| values are zero (0.000)
Hence we can say that the variables are linearly related.
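The effect of centering can be checked on a small made-up data set. The sketch below fits a least-squares line by hand, before and after subtracting the mean of the explanatory variable. The helper fit_line and the toy values are hypothetical and are not the gapminder data.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# hypothetical toy values, for illustration only
income = [1.0, 2.0, 3.0, 4.0, 5.0]
life = [52.0, 55.0, 61.0, 64.0, 68.0]

# centering: subtract the mean of the explanatory variable
centered = [v - mean(income) for v in income]

raw_fit = fit_line(income, life)
centered_fit = fit_line(centered, life)
print('raw      (intercept, slope):', raw_fit)
print('centered (intercept, slope):', centered_fit)
```

The slope is the same in both fits. Only the intercept changes: after centering it equals the mean of the response, i.e. the predicted value at the average of the explanatory variable.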

Linear Regression program and output
Variables
explanatory variable: income per person
response variable: life expectancy
Program:
import pandas
import numpy
import scipy.stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

#blank entries become NaN, then the columns are converted to numeric
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'].replace(' ', numpy.nan), errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'].replace(' ', numpy.nan), errors='coerce')

sub1 = data[['incomeperperson', 'lifeexpectancy']].dropna()

plt.title('income per person vs lifeexpectancy')
plt.xlabel('income per person')
plt.ylabel('lifeexpectancy')
plt.scatter(sub1['incomeperperson'], sub1['lifeexpectancy'])
plt.show()

print('association between incomeperperson and life expectancy')
print(scipy.stats.pearsonr(sub1['incomeperperson'], sub1['lifeexpectancy']))

#life expectancy is the response variable, so it goes on the left of the formula
print('\nOLS regression model for incomeperperson and life expectancy')
model2 = smf.ols(formula='lifeexpectancy ~ incomeperperson', data=sub1).fit()
print(model2.summary())
Output:
Writing About Your Data
Sample
I will be using the NESARC code book as my data set for my research question. The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) is a nationally representative survey of adult Americans that collected data on alcohol use disorders and their associated disabilities in addition to collecting saliva samples for the purpose of understanding the prevalence, risk factors, health disparities, economic costs and gene-environment interactions related to alcohol use disorders and their associated disabilities.
Data Collection Procedure
NESARC collected information on alcohol use and disorders and related physical and mental disabilities, in addition to DNA obtained through saliva samples. A local focus group service was used to recruit the respondents for the field test. Westat statistical staff set quotas for respondent recruitment based upon the following demographic criteria: age, gender, race/ethnicity, and education. These quotas were representative of the sample that would be selected in the main study. The 35 respondents represented a mix of these demographic criteria.
A total of 35 AUDADIS-5 interviews and completed saliva samples were collected as part of the field test. Additionally, five Reliability Study and five Validity Study interviews were conducted. All interviews were conducted in English and took place in the Washington, DC metropolitan area over a 2-week period. This location allowed Westat to draw upon field interviewers and sample persons (SPs) in Northern Virginia, Maryland, and Washington, DC. This location also allowed Westat and NIAAA staff to easily accompany field interviewers and observe field test interviews. Interviews took place in the respondents’ homes.
Variables
S3CD6Q15C: NUMBER OF EPISODES OF COCAINE DEPENDENCE
S3CD10Q15B: AGE AT ONSET OF OTHER DRUG DEPENDENCE
PDwA: PANIC DISORDER WITH AGORAPHOBIA
PDw/oA: PANIC DISORDER WITHOUT AGORAPHOBIA
Aw/oPD: AGORAPHOBIA WITHOUT PANIC DISORDER
Measures
A wide demographic study was done and all the factors were taken into account.
Testing a Potential Moderator
a) Defining moderation, a.k.a. statistical interaction
import pandas
import numpy
import statsmodels.formula.api as smf

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)
copy = data.copy()

copy['lifeexpectancy'] = copy['lifeexpectancy'].replace(' ', numpy.nan)
copy['lifeexpectancy'] = pandas.to_numeric(copy['lifeexpectancy'], errors='coerce')
#categorical variable
copy['Glifeexpactancy'] = pandas.cut(copy.lifeexpectancy, bins=[0, 50, 75, 100], labels=['50', 'NaN', '75'], right=True, include_lowest=True)
copy['Glifeexpactancy'] = pandas.to_numeric(copy['Glifeexpactancy'], errors='coerce')

#explanatory variable
copy['employrate'] = copy['employrate'].replace(' ', numpy.nan)
copy['employrate'] = pandas.to_numeric(copy['employrate'], errors='coerce')

model1 = smf.ols(formula='employrate ~ C(Glifeexpactancy)', data=copy).fit()
print(model1.summary())
b) Testing moderation in the context of ANOVA
#moderator is suicide per 100th
copy['suicideper100th'] = copy['suicideper100th'].replace(' ', numpy.nan)
copy['suicideper100th'] = pandas.to_numeric(copy['suicideper100th'], errors='coerce')
copy['qsuicideper100th'] = pandas.qcut(copy.suicideper100th, q=2, labels=['low', 'high'])

sub2 = copy[(copy['qsuicideper100th'] == 'low')]
sub3 = copy[(copy['qsuicideper100th'] == 'high')]

print('\nassociation of employment rate and life expectancy with low suicide rate')
model2 = smf.ols(formula='employrate ~ C(Glifeexpactancy)', data=sub2).fit()
print(model2.summary())

print('\nassociation of employment rate and life expectancy with high suicide rate')
model3 = smf.ols(formula='employrate ~ C(Glifeexpactancy)', data=sub3).fit()
print(model3.summary())

#numeric_only skips text columns such as the country name
print("\n mean for sub2")
m1 = sub2.groupby('Glifeexpactancy').mean(numeric_only=True)
print(m1)

print("\n mean for sub3")
m1 = sub3.groupby('Glifeexpactancy').mean(numeric_only=True)
print(m1)
(c) Testing moderation in the context of Chi-Square
import pandas
import numpy
import scipy.stats

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)
copy = data.copy()

#response variable
data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', numpy.nan)
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
copy['Glifeexpactancy'] = pandas.cut(data.lifeexpectancy, bins=[0, 50, 75, 100], labels=['<50', '50-75', '75<'], right=True, include_lowest=True)

#explanatory variable
copy['suicideper100th'] = copy['suicideper100th'].replace(' ', numpy.nan)
copy['suicideper100th'] = pandas.to_numeric(copy['suicideper100th'], errors='coerce')
copy['qsuicideper100th'] = pandas.qcut(copy.suicideper100th, q=2, labels=['low', 'high'])

#moderator
copy['incomeperperson'] = copy['incomeperperson'].replace(' ', numpy.nan)
copy['incomeperperson'] = pandas.to_numeric(copy['incomeperperson'], errors='coerce')
copy['qincomeperperson'] = pandas.qcut(copy.incomeperperson, q=2, labels=['low', 'high'])

sub2 = copy[(copy['qincomeperperson'] == 'low')]
sub3 = copy[(copy['qincomeperperson'] == 'high')]

print('\n____for sub2____')
# contingency table of observed counts-sub2
print('\ncontingency table of observed counts')
ct1 = pandas.crosstab(sub2['Glifeexpactancy'], sub2['qsuicideper100th'])
print(ct1)
# chi-square-sub2
print('\nchi-square')
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

print('\n____for sub3____')
# contingency table of observed counts-sub3
print('\ncontingency table of observed counts')
ct1 = pandas.crosstab(sub3['Glifeexpactancy'], sub3['qsuicideper100th'])
print(ct1)
# chi-square-sub3
print('\nchi-square')
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
d) Testing moderation in the context of correlation
import pandas
import numpy
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)
copy = data.copy()

#moderator
copy['incomeperperson'] = copy['incomeperperson'].replace(' ', numpy.nan)
copy['incomeperperson'] = pandas.to_numeric(copy['incomeperperson'], errors='coerce')
copy['qincomeperperson'] = pandas.qcut(copy.incomeperperson, q=2, labels=['low', 'high'])

#variables
copy['alcconsumption'] = copy['alcconsumption'].replace(' ', numpy.nan)
copy['alcconsumption'] = pandas.to_numeric(copy['alcconsumption'], errors='coerce')
copy['suicideper100th'] = copy['suicideper100th'].replace(' ', numpy.nan)
copy['suicideper100th'] = pandas.to_numeric(copy['suicideper100th'], errors='coerce')

sub1 = copy[(copy['qincomeperperson'] == 'low')].dropna()
sub2 = copy[(copy['qincomeperperson'] == 'high')].dropna()

print('association for sub1')
print(scipy.stats.pearsonr(sub1['suicideper100th'], sub1['alcconsumption']))
plt.xlabel('suicideper100th')
plt.ylabel('alcconsumption')
plt.scatter(sub1['suicideper100th'], sub1['alcconsumption'])
plt.show()

print('\n\nassociation for sub2')
print(scipy.stats.pearsonr(sub2['suicideper100th'], sub2['alcconsumption']))
plt.xlabel('suicideper100th')
plt.ylabel('alcconsumption')
plt.scatter(sub2['suicideper100th'], sub2['alcconsumption'])
plt.show()
SUMMARY/CONCLUSION
In part (a) the value of p is more than 0.05, so there appears to be no significant relation between employment rate and life expectancy, but we will still run further analysis to be sure.
In part (b) the value of p is also very large, so there is no evidence of a relation between employment rate and life expectancy even when a moderator (suicide per 100th) is considered.
In part (c) the chi-square value is not very large and the p-value is much larger than 0.05. Hence there is no evidence of a relation between suicide rate and life expectancy, either overall or with income per person as moderator.
In part (d) we can see that there is a linear relationship between alcohol consumption and suicide rate for sub2, that is, countries with high income per person, whereas there is no relation between alcohol consumption and suicide rate for sub1, that is, countries with low income per person.
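The idea behind part (d), computing the correlation separately inside each level of the moderator, can be sketched without any libraries. The helper pearson_r and the toy values below are hypothetical and only show the pattern of one group with a strong relation and one with none.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# hypothetical values, split by the moderator (income per person: low / high)
low_income = {'suicide': [5, 8, 6, 9, 7], 'alcohol': [4, 3, 5, 5, 3]}
high_income = {'suicide': [5, 7, 9, 11, 13], 'alcohol': [3, 5, 6, 9, 10]}

r_low = pearson_r(low_income['suicide'], low_income['alcohol'])
r_high = pearson_r(high_income['suicide'], high_income['alcohol'])
print('low income  r =', round(r_low, 3))
print('high income r =', round(r_high, 3))
```

When the two coefficients differ this much between the subgroups, the grouping variable is acting as a moderator of the relationship.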
Generating a Correlation Coefficient
We are using the Pearson correlation to generate the correlation coefficient.
INPUT
import pandas
import numpy
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

#blank entries become NaN, then the columns are converted to numeric
data['employrate'] = data['employrate'].replace(' ', numpy.nan)
data['incomeperperson'] = data['incomeperperson'].replace(' ', numpy.nan)
data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', numpy.nan)

data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')

sub1 = data[['employrate', 'lifeexpectancy', 'incomeperperson']].dropna()

plt.title('employment rate vs life expectancy')
plt.xlabel('employment rate')
plt.ylabel('life expectancy')
plt.scatter(sub1['employrate'], sub1['lifeexpectancy'])
plt.show()

plt.title('income per person vs life expectancy')
plt.xlabel('income per person')
plt.ylabel('life expectancy')
plt.scatter(sub1['incomeperperson'], sub1['lifeexpectancy'])
plt.show()

print('association between employment rate and life expectancy')
print(scipy.stats.pearsonr(sub1['employrate'], sub1['lifeexpectancy']))

print('association between incomeperperson and life expectancy')
print(scipy.stats.pearsonr(sub1['incomeperperson'], sub1['lifeexpectancy']))
OUTPUT
association between employment rate and life expectancy
(-0.31400353129492137, 3.792490398182931e-05)
association between incomeperperson and life expectancy
(0.6132276230343815, 1.600979267542254e-18)
CONCLUSION/SUMMARY
1) From the graph of employment rate vs life expectancy we cannot observe a clear linear relation between employment rate and life expectancy.
2) From the graph of income per person vs life expectancy we can observe that there exists a positive linear relation between income per person and life expectancy, i.e. life expectancy increases as income per person increases.
3) Interpretation of the association between employment rate and life expectancy: the correlation coefficient is small and negative (r = -0.314), but p (about 3.8e-05) is well below 0.05, so there is a weak negative linear relation between employment rate and life expectancy. The square of the correlation coefficient shows that employment rate accounts for only about 9.9% of the variability in life expectancy.
4) Interpretation of the association between incomeperperson and life expectancy: the correlation coefficient is large (r = 0.613) and p is very small (about 1.6e-18), which tells us that a relation does exist between incomeperperson and life expectancy. The square of the correlation coefficient shows that income per person accounts for about 37.6% of the variability in life expectancy.
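The share of variability explained is the square of the correlation coefficient. A short check of that arithmetic, using the two coefficients printed in the OUTPUT section (the variable names are only illustrative):

```python
# correlation coefficients copied from the OUTPUT section
r_employ = -0.31400353129492137
r_income = 0.6132276230343815

# squaring removes the sign, so the explained share is never negative
r2_employ = r_employ ** 2
r2_income = r_income ** 2

print(f"employment rate explains {r2_employ:.1%} of the variability")
print(f"income per person explains {r2_income:.1%} of the variability")
```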
Running a Chi-Square Test of Independence
Null hypothesis: polityscore and life expectancy are independent variables
Alternate hypothesis: polityscore and life expectancy are dependent variables
INPUT for Chi Square tests.
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

#missing data and conversion to numeric (done first so that the narrowing compares numbers, not strings)
data['polityscore'] = pandas.to_numeric(data['polityscore'].replace(' ', numpy.nan), errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'].replace(' ', numpy.nan), errors='coerce')

#narrowing
s1 = data[(data['lifeexpectancy'] <= 50) | (data['lifeexpectancy'] >= 75)]
copy = s1.copy()

#recoding polityscore
recode1 = {-10:0, -9:1, -8:2, -7:3, -2:4, -1:5, 0:6, 5:7, 6:8, 7:9, 8:10, 9:11, 10:12}
copy['ups'] = copy['polityscore'].map(recode1)

#grouping life expectancy
copy['Glifeexpactancy'] = pandas.cut(copy.lifeexpectancy, bins=[0, 50, 75, 100], labels=['50', 'NaN', '75'], right=True, include_lowest=True)
copy['Glifeexpactancy'] = pandas.to_numeric(copy['Glifeexpactancy'], errors='coerce')

sub1 = copy[['ups', 'Glifeexpactancy']].dropna().copy()

# contingency table of observed counts
print('\ncontingency table of observed counts')
ct1 = pandas.crosstab(sub1['Glifeexpactancy'], sub1['ups'])
print(ct1)

# column percentages
print('\ncolumn percentages')
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print(colpct)

# chi-square
print('\nchi-square')
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# set variable types
sub1['ups'] = sub1['ups'].astype('category')
sub1['Glifeexpactancy'] = pandas.to_numeric(sub1['Glifeexpactancy'], errors='coerce')
# factorplot was renamed catplot; on seaborn older than 0.12 use ci=None instead of errorbar=None
seaborn.catplot(x="ups", y="Glifeexpactancy", data=sub1, kind="bar", errorbar=None)
plt.xlabel('polityscore')
plt.ylabel('lifeexpactancy')
OUTPUT for Chi Square tests.
contingency table of observed counts
ups              0.0  1.0  2.0  3.0  4.0  ...  8.0  9.0  10.0  11.0  12.0
Glifeexpactancy                           ...
50.0               0    1    0    0    1  ...    1    2     1     0     0
75.0               1    0    1    4    1  ...    0    0     5     4    27

[2 rows x 13 columns]

column percentages
ups              0.0  1.0  2.0  3.0  4.0  ...  8.0  9.0      10.0  11.0  12.0
Glifeexpactancy                           ...
50.0             0.0  1.0  0.0  0.0  0.5  ...  1.0  1.0  0.166667   0.0   0.0
75.0             1.0  0.0  1.0  1.0  0.5  ...  0.0  0.0  0.833333   1.0   1.0

[2 rows x 13 columns]

chi-square
chi-square value, p value, expected counts
(39.99537037037037, 7.203618274022623e-05, 12, array([[ 0.16981132, 0.16981132, 0.16981132, 0.67924528, 0.33962264, 0.16981132, 0.16981132, 0.33962264, 0.16981132, 0.33962264, 1.01886792, 0.67924528, 4.58490566],
[ 0.83018868, 0.83018868, 0.83018868, 3.32075472, 1.66037736, 0.83018868, 0.83018868, 1.66037736, 0.83018868, 1.66037736, 4.98113208, 3.32075472, 22.41509434]]))
INTERPRETATION for Chi Square tests.
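As the p value (about 7.2e-05) is smaller than 0.05, the overall test rejects the null hypothesis. Most of the expected counts are below 5, however, so the chi-square approximation should be read with caution.
The family-wise error rate for 78 comparisons (13 categories taken two at a time), calculated as 1 - (1 - 0.05)^78, is 0.9817. This means that, if every null hypothesis were true, I would have a 98.17% chance of wrongly rejecting at least one of them. However, as the assignment demands, I have conducted post hoc tests for the Chi Square test.
The adjusted p value is: p(adj) = 0.05 / no. of comparisons = 0.05/78 = 0.0006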
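The two numbers above can be reproduced in a few lines. This is a sketch of the standard formulas, assuming independent tests at alpha = 0.05; the variable names are only illustrative.

```python
from math import comb

alpha = 0.05
categories = 13                        # recoded polityscore levels 0..12
comparisons = comb(categories, 2)      # number of pairwise comparisons

# chance of at least one false rejection if every null hypothesis is true
family_wise_error = 1 - (1 - alpha) ** comparisons

# Bonferroni-adjusted threshold for each individual comparison
adjusted_alpha = alpha / comparisons

print(comparisons)
print(round(family_wise_error, 4))
print(round(adjusted_alpha, 6))
```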
INPUTS and OUTPUTS for post hoc tests
I had to run 78 comparisons; however, I have pasted only a few of them here.
1)
recode2 = {0: 0, 1: 1}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          0.0  1.0
Glifeexpactancy
50.0               0    1
75.0               1    0
COMP1v2          0.0  1.0
Glifeexpactancy
50.0             0.0  1.0
75.0             1.0  0.0
chi-square value, p value, expected counts
(0.0, 1.0, 1, array([[0.5, 0.5], [0.5, 0.5]]))
2)
recode2 = {2: 2, 4: 4}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          2.0  4.0
Glifeexpactancy
50.0               0    1
75.0               1    1
COMP1v2          2.0  4.0
Glifeexpactancy
50.0             0.0  0.5
75.0             1.0  0.5
chi-square value, p value, expected counts
(0.1875, 0.6650055421020291, 1, array([[0.33333333, 0.66666667], [0.66666667, 1.33333333]]))
3)
recode2 = {3: 3, 6: 6}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          3.0  6.0
Glifeexpactancy
50.0               0    1
75.0               4    0
COMP1v2          3.0  6.0
Glifeexpactancy
50.0             0.0  1.0
75.0             1.0  0.0
chi-square value, p value, expected counts
(0.703125, 0.4017356370553977, 1, array([[0.8, 0.2], [3.2, 0.8]]))
4)
recode2 = {3: 3, 7: 7}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          3.0  7.0
Glifeexpactancy
50.0               0    1
75.0               4    1
COMP1v2          3.0  7.0
Glifeexpactancy
50.0             0.0  0.5
75.0             1.0  0.5
chi-square value, p value, expected counts
(0.15, 0.6985353583033387, 1, array([[0.66666667, 0.33333333], [3.33333333, 1.66666667]]))
5)
recode2 = {9: 9, 12: 12}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          9.0  12.0
Glifeexpactancy
50.0               2     0
75.0               0    27
COMP1v2          9.0  12.0
Glifeexpactancy
50.0             1.0   0.0
75.0             0.0   1.0
chi-square value, p value, expected counts
(15.516889574759947, 8.177136530731187e-05, 1, array([[ 0.13793103, 1.86206897], [ 1.86206897, 25.13793103]]))
6)
recode2 = {8: 8, 11: 11}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          8.0  11.0
Glifeexpactancy
50.0               1     0
75.0               0     4
COMP1v2          8.0  11.0
Glifeexpactancy
50.0             1.0   0.0
75.0             0.0   1.0
chi-square value, p value, expected counts
(0.703125, 0.4017356370553977, 1, array([[0.2, 0.8], [0.8, 3.2]]))
7)
recode2 = {8: 8, 12: 12}
copy['COMP1v2'] = copy['ups'].map(recode2).dropna()
ct2 = pandas.crosstab(copy['Glifeexpactancy'], copy['COMP1v2'])
print(ct2)
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct2)
print(cs5)
output:
COMP1v2          8.0  12.0
Glifeexpactancy
50.0               1     0
75.0               0    27
COMP1v2          8.0  12.0
Glifeexpactancy
50.0             1.0   0.0
75.0             0.0   1.0
chi-square value, p value, expected counts
(6.491083676268863, 0.010841686760430297, 1, array([[ 0.03571429, 0.96428571], [ 0.96428571, 26.03571429]]))
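To see where these numbers come from, the statistic for a 2x2 table can be worked out by hand. The sketch below assumes the default behaviour of scipy.stats.chi2_contingency for one degree of freedom, which applies Yates' continuity correction. The helper chi2_2x2_yates is hypothetical, and the observed counts are the category 9 vs 12 table from comparison 5.

```python
from math import erfc, sqrt

def chi2_2x2_yates(table):
    """Yates-corrected chi-square, p-value and expected counts for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # expected count = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            diff -= min(0.5, diff)          # continuity correction
            chi2 += diff ** 2 / exp
    # survival function of the chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, expected

observed = [[2, 0], [0, 27]]                # rows: life expectancy 50 / 75
chi2, p_value, expected = chi2_2x2_yates(observed)
print(chi2, p_value)
print(expected)
```

With expected counts this small (well below 5 in three of the four cells), the chi-square approximation is rough, so the p-value is best treated as indicative only.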
INTERPRETATION for post hoc tests and Conclusion
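After running the tests I found that only for categories 9 and 12 was p (about 8.2e-05) smaller than both 0.05 and the adjusted threshold (0.0006), so in this case we reject the null hypothesis. For categories 8 and 12, p (0.0108) is below 0.05 but above the adjusted threshold. In all other cases shown, we fail to reject the null hypothesis.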
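The comparison against the adjusted threshold can be written out for the seven comparisons pasted above. The p-values are copied from those outputs; the dictionary layout and variable names are only illustrative.

```python
alpha = 0.05
adjusted_alpha = alpha / 78            # Bonferroni threshold for 78 comparisons

# (category A, category B) -> p-value, copied from the post hoc outputs
p_values = {
    (0, 1): 1.0,
    (2, 4): 0.6650055421020291,
    (3, 6): 0.4017356370553977,
    (3, 7): 0.6985353583033387,
    (9, 12): 8.177136530731187e-05,
    (8, 11): 0.4017356370553977,
    (8, 12): 0.010841686760430297,
}

# significant even after the adjustment
significant = [pair for pair, p in p_values.items() if p < adjusted_alpha]
# below 0.05 but not below the adjusted threshold
unadjusted_only = [pair for pair, p in p_values.items() if adjusted_alpha <= p < alpha]

print('reject after adjustment:', significant)
print('below 0.05 only:', unadjusted_only)
```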

Running an analysis of variance
Since it was difficult to analyse a single variable in two halves I have additionally grouped them in two.
From the graphical analysis it wasn’t very clear whether employment rate affects life expectancy or not. Hence I have run an analysis of variance on it.
INPUT
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('_7548339a20b4e1d06571333baf47b8df_gapminder.csv', low_memory=False)

#missing data and conversion to numeric (done first so that the narrowing compares numbers, not strings)
for col in ['polityscore', 'employrate', 'incomeperperson', 'lifeexpectancy']:
    data[col] = pandas.to_numeric(data[col].replace(' ', numpy.nan), errors='coerce')

#narrowing
s1 = data[(data['lifeexpectancy'] <= 50) | (data['lifeexpectancy'] >= 75)]
copy = s1.copy()

#recoding polityscore
recode1 = {-10:0, -9:1, -8:2, -7:3, -2:4, -1:5, 0:6, 5:7, 6:8, 7:9, 8:10, 9:11, 10:12}
copy['ups'] = copy['polityscore'].map(recode1)

#grouping life expectancy
copy['Glifeexpactancy'] = pandas.cut(copy.lifeexpectancy, bins=[0, 50, 75, 100], labels=['50', 'NaN', '75'], right=True, include_lowest=True)
copy['Glifeexpactancy'] = pandas.to_numeric(copy['Glifeexpactancy'], errors='coerce')

#ols regression
sub1 = copy[['employrate', 'Glifeexpactancy']].dropna()
mode1 = smf.ols(formula='employrate ~ C(Glifeexpactancy)', data=sub1).fit()
print(mode1.summary())

print('\nmeans for employment rate by lifeexpactancy')
m2 = sub1.groupby('Glifeexpactancy').mean()
print(m2)

print('standard deviations for employment rate by lifeexpactancy')
sd2 = sub1.groupby('Glifeexpactancy').std()
print(sd2)

print('\n')
mc1 = multi.MultiComparison(sub1['employrate'], sub1['Glifeexpactancy'])
res1 = mc1.tukeyhsd()
print(res1.summary())
OUTPUT
Interpretation of output:
While the means and standard deviations show a small difference, the value of p is 0.0688, i.e. greater than 0.05. This shows that there isn’t enough evidence to reject the null hypothesis, so we cannot conclude that employment rate and life expectancy are related.
To avoid a Type 1 error, I have also conducted a post hoc test for ANOVA (Tukey’s honest significant difference test). That test also shows that the null hypothesis can’t be rejected.
CONCLUSION
Life expectancy does not depend on employment rate.
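The F statistic that the ANOVA table reports is the ratio of between-group to within-group variance. A minimal sketch of that calculation follows; the helper one_way_f and the employment-rate values are hypothetical and are not the gapminder data, and the p-value is left to statsmodels because it needs the F distribution.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # variation of the group means around the grand mean
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # variation of the observations around their own group mean
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# hypothetical employment rates for the two life expectancy groups
groups = {'50': [55.0, 60.0, 65.0], '75': [58.0, 63.0, 68.0]}
f_stat, df_between, df_within = one_way_f(groups)
print(f_stat, df_between, df_within)
```

A small F, as in this toy case, means the group means differ little compared with the spread inside each group.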
Creating Graphs For Your Data
In the last assignment I had already ruled out the possibility that female employment rate has an effect on life expectancy, but to be sure I have done the graphical study too. To do that I first changed the variables to categorical form so that Python could recognize them.
I have already narrowed down my research to countries with a life expectancy less than 50 or more than 75 as per the course so far. Hence further analysis is also on that basis.
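One thing to watch when narrowing the data this way: if the column is still stored as text, the comparison is alphabetical rather than numeric. The sketch below uses hypothetical raw values, including a blank entry, to show the difference; to_float is an illustrative helper, not part of pandas.

```python
# hypothetical raw values as they might be read from a CSV (text, with one blank)
raw = ['48.7', ' ', '81.4', '62.1', '7.5']

# text comparison: characters are compared one by one, so ' ' sorts before '5'
by_string = [v for v in raw if v <= '50' or v >= '75']

def to_float(text):
    """Convert text to float, returning None for blank entries."""
    text = text.strip()
    return float(text) if text else None

# numeric comparison after conversion, with blanks dropped
numbers = [to_float(v) for v in raw]
by_number = [x for x in numbers if x is not None and (x <= 50 or x >= 75)]

print(by_string)
print(by_number)
```

The text filter keeps the blank entry and drops 7.5, while the numeric filter does the opposite, so converting to numbers before filtering is the safer order.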
The distribution of feer is skewed left, while the feer vs life expectancy distribution is uniform.
From the graphs we can see that the female employment rate doesn’t affect the life expectancy. So for further analysis I have excluded the female employment rate parameter.
The next parameter is polityscore:
Above are the univariate and bivariate graphs of polityscore. Here we can see that the countries with a lower life expectancy have a lower polityscore, while the countries with a higher life expectancy have a higher polityscore.
Now let's have a look at the employment rate:
From the univariate graph we can conclude that most countries have an average employment rate of around 55-62. From the bivariate graph we can see that the mode of employment rate is higher in the countries having life expectancy more than 75, while the mean of employment rate in both categories is equal.
The last parameter is income per person:
The univariate graph is right skewed. We can clearly see that the income per person of a country with a low life expectancy is far less than that of a country with a higher life expectancy.
Summary:
AS IN EARLIER ASSIGNMENTS, THE STUDY IS NARROWED DOWN TO LIFE EXPECTANCY BELOW 50 AND GREATER THAN 75. THE SUMMARY IS ACCORDINGLY:
VARIABLE 1: FEER The univariate graph is left skewed. From the bivariate graph we can see that feer does not affect life expectancy. Hence we can say that female employment does not affect the life expectancy of a country.
VARIABLE 2: POLITYSCORE From the univariate graph we can see that most countries have a very high polityscore; next I will have to find out which countries these are. From the bivariate graph we can see that the countries with a lower life expectancy have a lower polityscore. So it can be said that countries with a lower polityscore have a lower life expectancy.
VARIABLE 3: EMPLOYRATE From the univariate graph we can conclude that most countries have an average employment rate (i.e. between 55-62). However, the bivariate graph is a bit unclear. It can roughly be said that the greater the employment rate, the greater the life expectancy of the nation.
VARIABLE 4: INCOMEPERPERSON The univariate graph shows that most countries have high income per person and very few countries have low income per person. However, the bivariate graph shows that all countries with a low life expectancy have very low income per person. Hence it is safe to say that income per person plays a very important role in life expectancy: countries with a low life expectancy have very low income per person.