Predicting Polity score using Lasso regression
The github repo with all code is here:
https://github.com/brennap3/Gapminder
Transparency international data can be downloaded from here:
http://www.transparency.org/cpi2015/#downloads
it is available in my github repo as well. The main file of interested to run the analysis is in:
https://github.com/brennap3/Gapminder/blob/master/lasoo_regression_to_explain_democracy.py
In our model we are trying to use data combined from transparency international (data to do with how corrupt a country is) and Gapminder to predict polity score (how democratic a country is):
1. ‘incomeperperson’ :average income per person.
2. 'armedforcesrate’ : armed forces rate as a % of population.
3. 'femaleemployrate’ :female employee rate.
4. 'internetuserate’: internet use rate.
5. 'European’ : Is an European country.
6. 'African’ : Is an African country.
7. 'Asian’ : Is an Asian country.
8. 'Mid_East’ : Is a Mid-East country.
9. 'North_American’ : Is a north American country (includes central America and carribean, basically the CONCACAF countries).
10. 'Carribean_Central_America’ : Is carribean or central American country.
11. 'OPEC’: Is a member of OPEC.
12. 'Arab_League’: Is a member of the Arab league.
13. 'ASEAN_ARF’: Is an ASEAN regional forum.
14. 'South_American’: Is a South American country.
15. 'Is_Nato_Country’: Is a member of NATO.
16. 'Eu_Member’: Is a member of the EU.
17. 'CPI2015' from the transparency international dataset, the transparency international score for 2015.
18. 'PRS International Country Risk Guide' score. Built by ICRG.
19. 'World Economic Forum EOS (Executive Opinion Survey)' .
20. Number of years a member of NATO (if a country is not in NATO, this is 0)..
The value you are trying to predict is polity score (not political category a derived categorical value based on polity score). All Code is listed in the code section.
Besides joining the two datasets (Gapminder and Transparency international) together merging on country, the only other operations were to:
1. Give the transparency international countries consistent country names between the two datasets.
2. Standardize the predictor variables so as to make a determination of which predictor had a greater effect on the model.
3. The standardization is done using the scale unction having a mean of 0 and a standard deviation of 1.
Lasso regression is a type of contraction and selection method for linear regression, it uses L1 regularization, that is it increases a penalty equivalent to the absolute value of the magnitude of coefficients. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients.
LAR algorithm adds predictor most correlated with response variable and moves towards Least Squared Estimation until there is equally correlated with residual and adds it to model, LAR continues with this is repeated for all variables. The LARS algorithm is much akin to the forward stepwise regression, but instead of the addition variables at each iteration, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.
The advantages of the LARS algorithm are:
1. It is comparable in terms of computation resources to forward selection.
2. It creates a full piecewise linear solution path, which is extremely useful in conjunction with cross-validation in attempts to optimize the model.
3. It is easily modified to produce solutions for other estimators, like the lasso.
The disadvantage of using this method are:
1. It is sensitive to noise due to iterative refitting of residuals.
Calculating the optimal alpha (lambda value)
The optimal alpha value for the model is assessed using cross-validation in conjunction with LARS (LassoLarsCV), this is a more optimal solution than using LassoCV as it explores more useful values of alpha when compared to LassoCV.
I used 5 random folds 4 (the 4 remaining folds, the first being used as a validation set) folds are used for training and 1 (the first fold) used to test. The model which produces the lowest mean squared error as the best model to validate using the test dataset.
The model is fit using the following code:
model=sklearn.linear_model.LassoLarsCV(cv=5, precompute=False).fit(pred_train,tar_train)
Diagnostic plots of our model
Two diagnostic plots were used to look at the lasso selection path for our model these were:
1. A plot of regression coefficients progression for Lasso Path.
2. A plot of change Mean Square error at each change in penalty parameter.
This plot shows the relative importance of predictors at each step of the selection process under lasso, how the coefficients changes on addition of another predictor as well as what stage a predictor entered the selection process model. CPI 2015 (the transparency score) had the largest positive regression value, we can see entering the selection process first as it is the most important.
From the plot we can see that there is variability across each cross-validation fold but the change in MSE across all folds follows the same pattern, it decreases rapidly and then levels off.
Note the penalty parameter is referred to as alphas in scikit. Model.alphas. dashed line. This is shown as a vertical line.
The model coefficients are shown below. We can see the most significant coefficients are CPI2015 (Positive) and income per person.
dict(zip(predictors.columns, model.coef_)) ##dictionaries and lists
{'ASEAN_ARF': 1.0751280260662057,
'African': -0.80103427955213391,
'Arab_League': -0.75445326325981865,
'Asian': -0.18027408104575077,
'CPI2015': 5.1488730346502063,
'Carribean_Central_America': 1.7511806395204483,
'Eu_Member': 0.3186323959143092,
'European': 0.48294668353681142,
'Is_Nato_Country': 1.1682007573664959,
'Mid_East': -0.39695605502390469,
'OPEC': -0.86977861918196875,
'PRS International Country Risk Guide': -1.0914128237785916,
'South_American': 1.5441052599359877,
'World Economic Forum EOS': -1.6287945034420015,
'Years_In_Nato': -0.4787168743731352,
'alcconsumption': 0.61054750969745375,
'armedforcesrate': 0.32722934011118593,
'employrate': -2.4374967009233841,
'femaleemployrate': 1.2879949088390379,
'incomeperperson': 1.9394683296765849,
'internetuserate': -3.2651731839242308,
'lifeexpectancy': -0.40401776086788732}
Only one of the coefficients have shrunk to 0 (after applying lasso regression penalty), North America. As we have standardized all predictors on the same scale, so we can tell which are the most important (which are the strongest predictors of polity score).
There was significant difference in MSE (Mean Squared Error) error between training and test data indicating I may have over-fitted the model.
The R2 calculated showed significant difference between test and training set indicating our model may suffer from overfitting. The R2 value is how much of the variance in the data we can explain.
Where SST (total sum of squares) is given by:
Where SSE (sum of square errors) is
MSE of training and test data
MSE is mean squared error (MSE) measures the mean of the squares of the errors, the difference between the estimated value and the actual value, it is given by the equation above.
The MSE values were not stable across the training and test set. This would indicate that our model is still over-fitted on our training set. The MSE is higher in the test data.
Again the difference in R2 between our training and test data we can explain
An alternative to using LassoLarsCV was to use an information criterion selection method, in this case, I used (AIC) Akaike information criterion (measure of the relativistic quality criteria of a model) to select an ideal value of alpha regularization parameter. Below we see a diagnostic plot of AIC criterion against log alpha model using the AIC as the information criterion.
Regression coefficients for AIC selection criteria Lasso Model
'ASEAN_ARF': 0.4817277075830127,
'African': -0.88743192067681231,
'Arab_League': -0.77102456171737688,
'CPI2015': 3.4261676420330516,
'Carribean_Central_America': 1.3852468150993107,
'Eu_Member': 0.32737115619068258,
'Is_Nato_Country': 0.64445163494424318,
'Mid_East': -0.64918792797127278,
'OPEC': -1.0037125526070625,
'PRS International Country Risk Guide': 0.0,
'South_American': 1.1666702294227076,
'World Economic Forum EOS': -1.1639115442413683,
'alcconsumption': 0.59855758131369263,
'employrate': -2.2695726938628469,
'femaleemployrate': 1.0671515028671372,
'incomeperperson': 1.191656220279911,
'internetuserate': -2.4535120774767076,
Note when we use AIC criterion a lot more of the coefficients are regualirzed to 0.
Training and test error for AIC selection Lasso Model
Again there was significant difference in MSE (Mean Squared Error) error between training and test data indicating I may have over-fitted the model.
The R2 calculated showed significant difference between test and training set indicating our model may suffer from overfitting. The R2 value is how much of the variance in the data we can explain. The data is shown below.
Again we see the same problems with our model that we may have over-fitted our model to our training set as our model dos not fit well to our test with substantially higher MSE error and lower R2 value.
Created on Fri Jun 03 12:27:51 2016
import matplotlib.pyplot as plt
import sys; print(sys.path)
from pandas import Series, DataFrame
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
apath='C:\Users\Peter\Desktop\Gapminder'
os.chdir('C:\Users\Peter\Desktop\Gapminder')
##check the directory has changed
data = pandas.read_csv('gapminder.csv', low_memory=False)
##lets convert the data to numeric
data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['armedforcesrate'] = data['armedforcesrate'].convert_objects(convert_numeric=True)
data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['femaleemployrate'] = data['femaleemployrate'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['oilperperson'] = data['oilperperson'].convert_objects(convert_numeric=True)
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
bins = [0, 1000, 5000, 10000, 20000,50000,200000]
group_names = ['Very Low Income,0-1000', 'Low Income,1000-5000', 'Okay Income,5000-10000', 'Good Income,10000-20000','Great Income,20000-50000','50,000-200,000']
categories = pandas.cut(data['incomeperperson'], bins, labels=group_names)
data['categories'] = pandas.cut(data['incomeperperson'], bins, labels=group_names)
##now encode european countries
if row['country'] == 'Albania' :
elif row['country'] == 'Andorra' :
elif row['country'] == 'Armenia' :
elif row['country'] == 'Azerbaijan' :
elif row['country'] == 'Austria' :
elif row['country'] == 'Belarus' :
elif row['country'] == 'Belgium' :
elif row['country'] == 'Bosnia' :
elif row['country'] == 'Bulgaria' :
elif row['country'] == 'Croatia' :
elif row['country'] == 'Cyprus' :
elif row['country'] == 'Czech Republic' :
elif row['country'] == 'Denmark' :
elif row['country'] == 'Estonia' :
elif row['country'] == 'Finland' :
elif row['country'] == 'France' :
elif row['country'] == 'Georgia' :
elif row['country'] == 'Germany' :
elif row['country'] == 'Greece' :
elif row['country'] == 'Hungary' :
elif row['country'] == 'Iceland' :
elif row['country'] == 'Ireland' :
elif row['country'] == 'Italy' :
elif row['country'] == 'Kazakhstan' :
elif row['country'] == 'Kosovo' :
elif row['country'] == 'Latvia' :
elif row['country'] == 'Liechtenstein' :
elif row['country'] == 'Lithuania' :
elif row['country'] == 'Luxembourg' :
elif row['country'] == 'Macedonia' :
elif row['country'] == 'Malta' :
elif row['country'] == 'Moldova' :
elif row['country'] == 'Monaco' :
elif row['country'] == 'Montenegro' :
elif row['country'] == 'Netherlands' :
elif row['country'] == 'Norway' :
elif row['country'] == 'Poland' :
elif row['country'] == 'Portugal' :
elif row['country'] == 'Romania' :
elif row['country'] == 'Russia' :
elif row['country'] == 'San Marino' :
elif row['country'] == 'Serbia' :
elif row['country'] == 'Slovak Republic' :
elif row['country'] == 'Slovenia' :
elif row['country'] == 'Spain' :
elif row['country'] == 'Sweden' :
elif row['country'] == 'Switzerland' :
elif row['country'] == 'Turkey' :
elif row['country'] == 'Ukraine' :
elif row['country'] == 'United Kingdom' :
data['European'] = data.apply (lambda row: EUROPEAN (row),axis=1)
data['European'].value_counts(sort=False, dropna=False)
if row['country'] == 'Algeria' :
elif row['country'] == 'Angola' :
elif row['country'] == 'Benin' :
elif row['country'] == 'Botswana' :
elif row['country'] == 'Burkina Faso' :
elif row['country'] == 'Burundi' :
elif row['country'] == 'Cameroon' :
elif row['country'] == 'Cape Verde' :
elif row['country'] == 'Central African Republic' :
elif row['country'] == 'Chad' :
elif row['country'] == 'Comoros' :
elif row['country'] == 'Congo, Dem. Rep.' :
elif row['country'] == 'Congo, Rep.' :
elif row['country'] == 'Djibouti' :
elif row['country'] == 'Equatorial Guinea' :
elif row['country'] == 'Eritrea' :
elif row['country'] == 'Ethiopia' :
elif row['country'] == 'Egypt' :
elif row['country'] == 'Gabon' :
elif row['country'] == 'Gambia' :
elif row['country'] == 'Ghana' :
elif row['country'] == "Cote d'Ivoire":
elif row['country'] == "Guinea-Bissau":
elif row['country'] == "Guinea":
elif row['country'] == "Kenya":
elif row['country'] == "Lesotho":
elif row['country'] == "Liberia":
elif row['country'] == "Libya":
elif row['country'] == "Madagascar":
elif row['country'] == "Malawi":
elif row['country'] == "Mali":
elif row['country'] == "Mauritania":
elif row['country'] == "Mauritius":
elif row['country'] == "Morocco":
elif row['country'] == "Mozambique":
elif row['country'] == "Namibia":
elif row['country'] == "Niger":
elif row['country'] == "Nigeria":
elif row['country'] == "Rwanda":
elif row['country'] == 'Sao Tome and Principe':
elif row['country'] == 'Senegal':
elif row['country'] == 'Seychelles':
elif row['country'] == 'Sierra Leone':
elif row['country'] == 'Somalia':
elif row['country'] == 'South Sudan':
elif row['country'] == 'South Africa':
elif row['country'] == 'Sudan':
elif row['country'] == 'Swaziland':
elif row['country'] == 'Tanzania':
elif row['country'] == 'Togo':
elif row['country'] == 'Tunisia':
elif row['country'] == 'Uganda':
elif row['country'] == 'Zambia':
elif row['country'] == 'Zimbabwe':
elif row['country'] == 'Somaliland':
data['African'] = data.apply (lambda row: African (row),axis=1)
data['African'].value_counts(sort=False, dropna=False)
if row['country'] == 'Afganistan' :
elif row['country'] == 'Armenia' :
elif row['country'] == 'Bahrain' :
elif row['country'] == 'Bangladesh' :
elif row['country'] == 'Bhutan' :
elif row['country'] == 'Brunei' :
elif row['country'] == 'Cambodia' :
elif row['country'] == 'China' :
elif row['country'] == 'Georgia' :
elif row['country'] == 'India' :
elif row['country'] == 'Iran' :
elif row['country'] == 'Indonesia' :
elif row['country'] == 'Iraq' :
elif row['country'] == 'Israel' :
elif row['country'] == 'Japan' :
elif row['country'] == 'Jordan' :
elif row['country'] == 'Kazakhstan' :
elif row['country'] == 'Korea, Dem. Rep.' :
elif row['country'] == 'Korea, Rep.' :
elif row['country'] == 'Kuwait' :
elif row['country'] == 'Kyrgyzstan' :
elif row['country'] == 'Laos' :
elif row['country'] == 'Lebanon' :
elif row['country'] == 'Malaysia' :
elif row['country'] == 'Maldives' :
elif row['country'] == 'Mongolia' :
elif row['country'] == 'Myanmar' :
elif row['country'] == 'Nepal' :
elif row['country'] == 'Oman' :
elif row['country'] == 'Pakistan' :
elif row['country'] == 'Philippines' :
elif row['country'] == 'Qatar' :
elif row['country'] == 'Saudi Arabia' :
elif row['country'] == 'Singapore' :
elif row['country'] == 'Sri Lanka' :
elif row['country'] == 'Syria' :
elif row['country'] == 'Tajikistan' :
elif row['country'] == 'Thailand' :
elif row['country'] == 'Timor-Leste' :
elif row['country'] == 'Turkey' :
elif row['country'] == 'Turkmenistan' :
elif row['country'] == 'United Arab Emirates' :
elif row['country'] == 'Uzbekistan' :
elif row['country'] == 'Vietnam' :
elif row['country'] == 'Yemen' :
data['Asian'] = data.apply (lambda row: Asian (row),axis=1)
data['Asian'].value_counts(sort=False, dropna=False)
##data[['country','incomeperperson','polityscore_cat']][(data.European=='Europe') & (data.polityscore_cat!='NA') ]
##data[(data.Asian=='Asian')]
if row['country'] == 'Bahrain' :
elif row['country'] == 'Cyprus' :
elif row['country'] == 'Egypt' :
elif row['country'] == 'Iran' :
elif row['country'] == 'Iraq' :
elif row['country'] == 'Israel' :
elif row['country'] == 'Jordan' :
elif row['country'] == 'Kuwait' :
elif row['country'] == 'Lebanon' :
elif row['country'] == 'Oman' :
elif row['country'] == 'Qatar' :
elif row['country'] == 'Saudi Arabia' :
elif row['country'] == 'Syria' :
elif row['country'] == 'Turkey' :
elif row['country'] == 'United Arab Emirates' :
elif row['country'] == 'Yemen' :
return 'Not_In_Middle_East'
data['Mid_East'] = data.apply (lambda row: Mid_East (row),axis=1)
data['Mid_East'].value_counts(sort=False, dropna=False)
def North_American (row):
if row['country'] == 'Antigua and Barbuda' :
elif row['country'] == 'Bahamas' :
elif row['country'] == 'Barbados' :
elif row['country'] == 'Belize' :
elif row['country'] == 'Canada' :
elif row['country'] == 'Costa Rica' :
elif row['country'] == 'Cuba' :
elif row['country'] == 'Dominica' :
elif row['country'] == 'Dominican Republic' :
elif row['country'] == 'El Salvador' :
elif row['country'] == 'Grenada' :
elif row['country'] == 'Guatemala' :
elif row['country'] == 'Haiti' :
elif row['country'] == 'Honduras' :
elif row['country'] == 'Jamaica' :
elif row['country'] == 'Mexico' :
elif row['country'] == 'Nicaragua' :
elif row['country'] == 'Panama' :
elif row['country'] == 'Panama' :
elif row['country'] == 'Saint Kitts and Nevis' :
elif row['country'] == 'Saint Lucia' :
elif row['country'] == 'Saint Vincent and the Grenadines' :
elif row['country'] == 'Trinidad and Tobago' :
elif row['country'] == 'United States' :
return 'Not_In_North_America'
data['North_American'] = data.apply (lambda row: North_American (row),axis=1)
data['North_American'].value_counts(sort=False, dropna=False)
def Carribean_Central_America (row):
if row['country'] == 'Antigua and Barbuda' :
return 'Carribean_Central_American'
elif row['country'] == 'Bahamas' :
return 'Carribean_Central_American'
elif row['country'] == 'Barbados' :
return 'Carribean_Central_American'
elif row['country'] == 'Belize' :
return 'Carribean_Central_American'
elif row['country'] == 'Costa Rica' :
return 'Carribean_Central_American'
elif row['country'] == 'Cuba' :
return 'Carribean_Central_American'
elif row['country'] == 'Dominica' :
return 'Carribean_Central_American'
elif row['country'] == 'Dominican Republic' :
return 'Carribean_Central_American'
elif row['country'] == 'El Salvador' :
return 'Carribean_Central_American'
elif row['country'] == 'Grenada' :
return 'Carribean_Central_American'
elif row['country'] == 'Guatemala' :
return 'Carribean_Central_American'
elif row['country'] == 'Haiti' :
return 'Carribean_Central_American'
elif row['country'] == 'Honduras' :
return 'Carribean_Central_American'
elif row['country'] == 'Jamaica' :
return 'Carribean_Central_American'
elif row['country'] == 'Nicaragua' :
return 'Carribean_Central_American'
elif row['country'] == 'Panama' :
return 'Carribean_Central_American'
elif row['country'] == 'Saint Kitts and Nevis' :
return 'Carribean_Central_American'
elif row['country'] == 'Saint Lucia' :
return 'Carribean_Central_American'
elif row['country'] == 'Saint Vincent and the Grenadines' :
return 'Carribean_Central_American'
elif row['country'] == 'Trinidad and Tobago' :
return 'Carribean_Central_American'
return 'Not_In_Carribean_Central_American'
data['Carribean_Central_America'] = data.apply (lambda row: Carribean_Central_America (row),axis=1)
data['Carribean_Central_America'].value_counts(sort=False, dropna=False)
##Algeria, Angola, Ecuador, Iran, Iraq, Kuwait, Libya, Nigeria, Qatar, Saudi Arabia, United Arab Emirates and Venezuela
if row['country'] == 'Algeria' :
elif row['country'] == 'Angola' :
elif row['country'] == 'Ecuador' :
elif row['country'] == 'Iran' :
elif row['country'] == 'Iraq' :
elif row['country'] == 'Kuwait' :
elif row['country'] == 'Libya' :
elif row['country'] == 'Nigeria' :
elif row['country'] == 'Qatar' :
elif row['country'] == 'Saudi Arabia' :
elif row['country'] == 'United Arab Emirates' :
elif row['country'] == 'Venezuela' :
data['OPEC'] = data.apply (lambda row: OPEC (row),axis=1)
data['OPEC'].value_counts(sort=False, dropna=False)
if row['country'] == 'Algeria' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Bahrain' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Comoros' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Djibouti' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Egypt' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Iraq' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Jordan' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Kuwait' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Lebanon' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Libya' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Mauritania' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Morocoo' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Oman' :
return 'Arab_League_MEMBER'
elif row['country'] == 'West Bank and Gaza' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Qatar' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Saudi Arabia' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Somalia' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Sudan' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Syria' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Tunisia' :
return 'Arab_League_MEMBER'
elif row['country'] == 'United Arab Emirates' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Yemen' :
return 'Arab_League_MEMBER'
elif row['country'] == 'Eritrea' :
return 'Arab_League_MEMBER'
return 'Not_In_Arab_League'
data['Arab_League'] = data.apply (lambda row: Arab_League (row),axis=1)
data['Arab_League'].value_counts(sort=False, dropna=False)
##ASEAN is a regional grouping with security, economic and social aspects
if row['country'] == 'Australia' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Bangladesh' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Brunei' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Cambodia' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Canada' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'China' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'India' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Indonesia' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Japan' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Korea, Dem. Rep.' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Korea, Rep.' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Laos' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Malaysia' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Myanmar' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Mongolia' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'New Zealand' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Pakistan' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Papua New Guinea' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Phillipines' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Russian Federation' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Singapore' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Sri Lanka' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Thailand' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Timor-Leste' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'United States' :
return 'ASEAN_ARF_MEMBER'
elif row['country'] == 'Vietnam' :
return 'ASEAN_ARF_MEMBER'
return 'Not_In_ASEAN_ARF'
data['ASEAN_ARF'] = data.apply (lambda row: ASEAN_ARF (row),axis=1)
data['ASEAN_ARF'].value_counts(sort=False, dropna=False)
def South_American (row):
if row['country'] == 'Argentina' :
elif row['country'] == 'Bolivia' :
elif row['country'] == 'Brazil' :
elif row['country'] == 'Chile' :
elif row['country'] == 'Colombia' :
elif row['country'] == 'Ecuador' :
elif row['country'] == 'Guyana' :
elif row['country'] == 'Paraguay' :
elif row['country'] == 'Peru' :
elif row['country'] == 'Suriname' :
elif row['country'] == 'Uruguay' :
elif row['country'] == 'Venezuala' :
return 'Not_South_America'
data['South_American'] = data.apply (lambda row: South_American (row),axis=1)
data['South_American'].value_counts(sort=False, dropna=False)
###http://www.nato.int/cps/en/natohq/topics_52044.htm
Nato_Countries = pandas.DataFrame({ 'country' : ('Albania','Belgium','Bulgaria','Canada','Croatia','Czech Republic','Denmark','Estonia','France','Germany','Greece','Hungary','Iceland','Italy','Latvia','Lithuania','Luxembourg','Netherlands','Norway','Poland','Portugal','Romania','Slovak Republic','Slovenia','Spain','Turkey','United Kingdom','United States'),
'Year_Joined' : (2009,1949,2004,1949,2009,1999,1949,2004,1949,1955,1952,1999,1949,1949,2004,2004,1949,1949,1949,1999,1949,2004,2004,2004,1982,1952,1949,1949),
'Is_Nato_Country' : 'Nato_Member'
##Enhanced data join NATO data
data=pandas.merge(data, Nato_Countries,how='left',on='country')
##check that all column values have been added
data['Is_Nato_Country']=data['Is_Nato_Country'].fillna('Not_in_Nato')
data['Is_Nato_Country'].value_counts(sort=False, dropna=False)
##year joined needs to be renamed
data.rename(columns={'Year_Joined': 'Year_Joined_Nato'}, inplace=True)
if row['country'] == 'Austria' :
elif row['country'] == 'Belgium' :
elif row['country'] == 'Bulgaria' :
elif row['country'] == 'Croatia' :
elif row['country'] == 'Cyprus' :
elif row['country'] == 'Czech Republic' :
elif row['country'] == 'Denmark' :
elif row['country'] == 'Estonia' :
elif row['country'] == 'Finland' :
elif row['country'] == 'France' :
elif row['country'] == 'Germany' :
elif row['country'] == 'Greece' :
elif row['country'] == 'Hungary' :
elif row['country'] == 'Ireland' :
elif row['country'] == 'Italy' :
elif row['country'] == 'Latvia' :
elif row['country'] == 'Lithuania' :
elif row['country'] == 'Luxembourg' :
elif row['country'] == 'Malta' :
elif row['country'] == 'Netherlands' :
elif row['country'] == 'Poland' :
elif row['country'] == 'Portugal' :
elif row['country'] == 'Romania' :
elif row['country'] == 'Slovak Republic' :
elif row['country'] == 'Slovenia' :
elif row['country'] == 'Spain' :
elif row['country'] == 'Sweden' :
elif row['country'] == 'United Kingdom' :
data['Eu_Member'] = data.apply (lambda row: EUMEMBER (row),axis=1)
data['Eu_Member'].value_counts(sort=False, dropna=False)
print (time.strftime("%Y"))
##write unction to calculate the age of NATO countries based on the current date
current_year=time.strftime("%Y")
if row['Year_Joined_Nato'] >0 :
return (int(current_year)-int(row['Year_Joined_Nato']))
##calculate the age of NATO countries
data['Years_In_Nato'] = data.apply (lambda row: AGE_YEARS (row),axis=1)
##calculate the age of NAto countries
print("distribution of years in NATO for Nato Countries")
pp2=data['Years_In_Nato'].value_counts(sort=False, dropna=False)
##Mean number of years in NATO by european or Non EU
print("Mean years of countries in NATO for European and Non European Countries")
pp3=data[data['Years_In_Nato']>0]['Years_In_Nato'].groupby(data['Eu_Member']).mean()
##Count of countries in EU who are not in not in NATO
print("Count of countries in NATO for European and Non European Countries")
pp4=data[data['Years_In_Nato']>0]['Years_In_Nato'].groupby(data['Eu_Member']).count()
def EU_NATO (Nato_Membership,EU_Membership):
if Nato_Membership == 'Nato_Member' and EU_Membership== 'EU' :
elif Nato_Membership == 'Nato_Member' and EU_Membership == 'Not-In-EU':
elif Nato_Membership != 'Nato_Member' and EU_Membership== 'EU' :
return 'Not_In_Nato_In_EU'
return 'Not_In_Nato_Not_In_EU'
EU_NATO('Nato_Member','EU')
data['NATO_EU_MEMBERSHIP'] = data.apply (lambda row: EU_NATO(row['Is_Nato_Country'],row['Eu_Member']),axis=1)
def polityscore_cat (row):
if (row['polityscore'] >=6 and row['polityscore'] <= 10 ) :
elif (row['polityscore'] >=-5 and row['polityscore'] <= 5 ) :
elif (row['polityscore'] >=-10 and row['polityscore'] <= -6 ) :
##calculate the age of NATO countries
##data['Years_In_Nato'] = data.apply (lambda row: AGE_YEARS (row),axis=1)
data['polityscore_cat'] = data.apply (lambda row: polityscore_cat (row),axis=1)
##array(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate',
## 'breastcancerper100th', 'co2emissions', 'femaleemployrate',
## 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson',
## 'polityscore', 'relectricperperson', 'suicideper100th',
## 'employrate', 'urbanrate', 'categories', 'European', 'African',
## 'Asian', 'Mid_East', 'North_American', 'Carribean_Central_America',
## 'OPEC', 'Arab_League', 'ASEAN_ARF', 'South_American',
## 'Is_Nato_Country', 'Year_Joined_Nato', 'Eu_Member', 'Years_In_Nato',
## 'NATO_EU_MEMBERSHIP', 'polityscore_cat'], dtype=object)
##data = data.drop('Is_Nato_Country_y', 1)
##data.rename(columns={'Is_Nato_Country_x': 'Is_Nato_Country'}, inplace=True)
data['European'].value_counts(sort=False, dropna=False)
data['European'].replace("Europe",1,inplace=True)
data['European'].replace("Not_In_Europe",0,inplace=True)
data['African'].value_counts(sort=False, dropna=False)
data['African'].replace("Africa",1,inplace=True)
data['African'].replace("Not_In_Africa",0,inplace=True)
data['African'].value_counts(sort=False, dropna=False)
data['Asian'].value_counts(sort=False, dropna=False)
data['Asian'].replace("Asia",1,inplace=True)
data['Asian'].replace("Not_In_Asia",0,inplace=True)
data['Asian'].value_counts(sort=False, dropna=False)
data['Mid_East'].value_counts(sort=False, dropna=False)
data['Mid_East'].replace("Middle_East",1,inplace=True)
data['Mid_East'].replace("Not_In_Middle_East",0,inplace=True)
data['Mid_East'].value_counts(sort=False, dropna=False)
data['North_American'].value_counts(sort=False, dropna=False)
data['North_American'].replace("North_America",1,inplace=True)
data['North_American'].replace("Not_In_North_America",0,inplace=True)
data['North_American'].value_counts(sort=False, dropna=False)
data['Carribean_Central_America'].value_counts(sort=False, dropna=False)
data['Carribean_Central_America'].replace("Carribean_Central_American",1,inplace=True)
data['Carribean_Central_America'].replace("Not_In_Carribean_Central_American",0,inplace=True)
data['Carribean_Central_America'].value_counts(sort=False, dropna=False)
data['OPEC'].replace("OPEC_MEMBER",1,inplace=True)
data['OPEC'].replace("Not_In_OPEC",0,inplace=True)
data['OPEC'].value_counts(sort=False, dropna=False)
data['Arab_League'].value_counts(sort=False, dropna=False)
data['Arab_League'].replace("Not_In_Arab_League",0,inplace=True)
data['Arab_League'].replace("Arab_League_MEMBER",1,inplace=True)
data['Arab_League'].value_counts(sort=False, dropna=False)
data['ASEAN_ARF'].value_counts(sort=False, dropna=False)
data['ASEAN_ARF'].replace("Not_In_ASEAN_ARF",0,inplace=True)
data['ASEAN_ARF'].replace("ASEAN_ARF_MEMBER",1,inplace=True)
data['ASEAN_ARF'].value_counts(sort=False, dropna=False)
data['South_American'].value_counts(sort=False, dropna=False)
data['South_American'].replace("Not_South_America",0,inplace=True)
data['South_American'].replace("South_America",1,inplace=True)
data['South_American'].value_counts(sort=False, dropna=False)
data['Is_Nato_Country'].value_counts(sort=False, dropna=False)
data['Is_Nato_Country'].replace("Not_in_Nato",0,inplace=True)
data['Is_Nato_Country'].replace("Nato_Member",1,inplace=True)
data['Is_Nato_Country'].value_counts(sort=False, dropna=False)
data['Eu_Member'].value_counts(sort=False, dropna=False)
data['Eu_Member'].replace("Not_In_EU",0,inplace=True)
data['Eu_Member'].replace("EU",1,inplace=True)
data['Eu_Member'].value_counts(sort=False, dropna=False)
data['polityscore_cat'].value_counts(sort=False, dropna=False)
data['polityscore_cat'].replace("Anocracy",0,inplace=True)
data['polityscore_cat'].replace("Autocracy",0,inplace=True)
data['polityscore_cat'].replace("NA",0,inplace=True)
data['polityscore_cat'].replace("Democracy",1,inplace=True)
data['polityscore_cat'].value_counts(sort=False, dropna=False)
#Build model on training data
#from StringIO import StringIO
#from StringIO import StringIO
from IPython.display import Image
import pydot ##maygey an error here
from sklearn.externals.six import StringIO
##Ok lets see what the best features are
##note to one self HIV rates missing froma lot of countries
datatransparency = pandas.read_csv('CPI_2015_DATA.csv', low_memory=False)
##w['female'] = w['female'].map({'female': 1, 'male': 0})
datatransparency.columns.values
## datatransparency['Country']= datatransparency['Country'].map(
## {"The United States Of America":"United States",
## "C“te dïIvoire":"Cote d'Ivoire",
## "Korea (South)":"Korea, Rep.",
## "Korea (North)":"Korea, Dem. Rep.",
## "Czech Republic":"Czech Rep.",
## "Democratic Republic of the Congo":"Congo, Dem. Rep.",
## "The FYR of Macedonia": "Macedonia, FYR",
## "Hong Kong":"Hong Kong, China"
def country_consistent (row):
if row['Country'] == "The United States Of America" :
elif row['Country'] == "C“te dïIvoire" :
elif row['Country'] == "Korea (South)" :
elif row['Country'] == "Korea (North)" :
return "Korea, Dem. Rep."
elif row['Country'] == "Korea (South)" :
elif row['Country'] == "Czech Republic" :
elif row['Country'] == "Democratic Republic of the Congo" :
return "Congo, Dem. Rep."
elif row['Country'] == "The FYR of Macedonia" :
elif row['Country'] == "Hong Kong" :
return "Hong Kong, China"
datatransparency['Country'] = datatransparency.apply (lambda row: country_consistent(row),axis=1)
##calculate the age of NATO countries
##data['Years_In_Nato'] = data.apply (lambda row: AGE_YEARS (row),axis=1)
##ok after eyeballing in excel they all look ok
datafullset=data.merge(datatransparency,left_on='country',right_on='Country',how='left')
datafullset.columns.values
## 'country', 'incomeperperson', 'alcconsumption', 'armedforcesrate',
## 'breastcancerper100th', 'co2emissions', 'femaleemployrate',
## 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson',
## 'polityscore', 'relectricperperson', 'suicideper100th',
## 'employrate', 'urbanrate', 'categories', 'European', 'African',
## 'Asian', 'Mid_East', 'North_American', 'Carribean_Central_America',
## 'OPEC', 'Arab_League', 'ASEAN_ARF', 'South_American',
## 'Is_Nato_Country', 'Year_Joined_Nato', 'Eu_Member', 'Years_In_Nato',
## 'NATO_EU_MEMBERSHIP', 'polityscore_cat', 'Rank', 'CPI2015',
## 'Country', 'Region', 'wbcode', 'World Bank CPIA',
## 'World Economic Forum EOS', 'Bertelsmann Foundation TI',
## 'African Dev Bank', 'IMD World Competitiveness Yearbook',
## 'Bertelsmann Foundation SGI', 'World Justice Project ROL',
## 'PRS International Country Risk Guide',
## 'Economist Intelligence Unit', 'IHS Global Insight',
## 'PERC Asia Risk Guide', 'Freedom House NIT', 'CPI2015(2)', 'Rank2',
## 'Number of Sources', 'Std Deviation of Sources', 'Standard Error',
## 'Minimum', 'Maximum', 'Lower CI', 'Upper CI', 'Country2'
data2=datafullset[['incomeperperson', 'alcconsumption', 'armedforcesrate',
'internetuserate', 'lifeexpectancy',
'employrate', 'European', 'African',
'Asian', 'Mid_East', 'North_American', 'Carribean_Central_America',
'OPEC', 'Arab_League', 'ASEAN_ARF', 'South_American', 'Eu_Member','Is_Nato_Country','Years_In_Nato',
'CPI2015','World Economic Forum EOS','PRS International Country Risk Guide',
data_clean2 = data2.dropna() ## drop all na values cant handle nulls
from sklearn import preprocessing
## standardize the dataset
data_clean2['incomeperperson']=preprocessing.scale(data_clean2['incomeperperson'].astype('float64'))
data_clean2['alcconsumption']=preprocessing.scale(data_clean2['alcconsumption'].astype('float64'))
data_clean2['armedforcesrate']=preprocessing.scale(data_clean2['armedforcesrate'].astype('float64'))
data_clean2['femaleemployrate']=preprocessing.scale(data_clean2['femaleemployrate'].astype('float64'))
data_clean2['internetuserate']=preprocessing.scale(data_clean2['internetuserate'].astype('float64'))
data_clean2['lifeexpectancy']=preprocessing.scale(data_clean2['lifeexpectancy'].astype('float64'))
data_clean2['employrate']=preprocessing.scale(data_clean2['employrate'].astype('float64'))
data_clean2['European']=preprocessing.scale(data_clean2['European'].astype('float64'))
data_clean2['African']=preprocessing.scale(data_clean2['African'].astype('float64'))
data_clean2['Asian']=preprocessing.scale(data_clean2['Asian'].astype('float64'))
data_clean2['Mid_East']=preprocessing.scale(data_clean2['Mid_East'].astype('float64'))
data_clean2['North_American']=preprocessing.scale(data_clean2['North_American'].astype('float64'))
data_clean2['Carribean_Central_America']=preprocessing.scale(data_clean2['Carribean_Central_America'].astype('float64'))
data_clean2['OPEC']=preprocessing.scale(data_clean2['OPEC'].astype('float64'))
data_clean2['Arab_League']=preprocessing.scale(data_clean2['Arab_League'].astype('float64'))
data_clean2['ASEAN_ARF']=preprocessing.scale(data_clean2['ASEAN_ARF'].astype('float64'))
data_clean2['South_American']=preprocessing.scale(data_clean2['South_American'].astype('float64'))
data_clean2['Eu_Member']=preprocessing.scale(data_clean2['Eu_Member'].astype('float64'))
data_clean2['Is_Nato_Country']=preprocessing.scale(data_clean2['Is_Nato_Country'].astype('float64'))
data_clean2['Years_In_Nato']=preprocessing.scale(data_clean2['Years_In_Nato'].astype('float64'))
data_clean2['CPI2015']=preprocessing.scale(data_clean2['CPI2015'].astype('float64'))
data_clean2['World Economic Forum EOS']=preprocessing.scale(data_clean2['World Economic Forum EOS'].astype('float64'))
data_clean2['PRS International Country Risk Guide']=preprocessing.scale(data_clean2['PRS International Country Risk Guide'].astype('float64'))
###check the standardization worked
target = data_clean2.polityscore
# standardize predictors to have mean=0 and sd=1
predictors=data_clean2[['incomeperperson', 'alcconsumption', 'armedforcesrate',
'internetuserate', 'lifeexpectancy',
'employrate', 'European', 'African',
'Asian', 'Mid_East', 'North_American', 'Carribean_Central_America',
'OPEC', 'Arab_League', 'ASEAN_ARF', 'South_American', 'Eu_Member','Is_Nato_Country','Years_In_Nato',
'CPI2015','World Economic Forum EOS','PRS International Country Risk Guide']]
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
import sklearn.linear_model
# specify the lasso regression model
model=sklearn.linear_model.LassoLarsCV(cv=5, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
## {'ASEAN_ARF': 1.0751280260662057,
## 'African': -0.80103427955213391,
## 'Arab_League': -0.75445326325981865,
## 'Asian': -0.18027408104575077,
## 'CPI2015': 5.1488730346502063,
## 'Carribean_Central_America': 1.7511806395204483,
## 'Eu_Member': 0.3186323959143092,
## 'European': 0.48294668353681142,
## 'Is_Nato_Country': 1.1682007573664959,
## 'Mid_East': -0.39695605502390469,
## 'North_American': 0.0,
## 'OPEC': -0.86977861918196875,
## 'PRS International Country Risk Guide': -1.0914128237785916,
## 'South_American': 1.5441052599359877,
## 'World Economic Forum EOS': -1.6287945034420015,
## 'Years_In_Nato': -0.4787168743731352,
## 'alcconsumption': 0.61054750969745375,
## 'armedforcesrate': 0.32722934011118593,
## 'employrate': -2.4374967009233841,
## 'femaleemployrate': 1.2879949088390379,
## 'incomeperperson': 1.9394683296765849,
## 'internetuserate': -3.2651731839242308,
## 'lifeexpectancy': -0.40401776086788732}
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print ('test data R-square')
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
##next we try and fit using AIC criterion
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(pred_train,tar_train)
alpha_aic_ = model_aic.alpha_
def plot_ic_criterion(model, name, color):
criterion_ = model.criterion_
plt.plot(-np.log10(alphas_), criterion_, '--', color=color,
linewidth=3, label='%s criterion' % name)
plt.axvline(-np.log10(alpha_), color=color, linewidth=3,
label='alpha: %s estimate' % name)
plt.xlabel('-log(alpha)')
plot_ic_criterion(model_aic, 'AIC', 'b')
dict(zip(predictors.columns, model_aic.coef_))
## {'ASEAN_ARF': 0.4817277075830127,
## 'African': -0.88743192067681231,
## 'Arab_League': -0.77102456171737688,
## 'CPI2015': 3.4261676420330516,
## 'Carribean_Central_America': 1.3852468150993107,
## 'Eu_Member': 0.32737115619068258,
## 'Is_Nato_Country': 0.64445163494424318,
## 'Mid_East': -0.64918792797127278,
## 'North_American': 0.0,
## 'OPEC': -1.0037125526070625,
## 'PRS International Country Risk Guide': 0.0,
## 'South_American': 1.1666702294227076,
## 'World Economic Forum EOS': -1.1639115442413683,
## 'alcconsumption': 0.59855758131369263,
## 'armedforcesrate': 0.0,
## 'employrate': -2.2695726938628469,
## 'femaleemployrate': 1.0671515028671372,
## 'incomeperperson': 1.191656220279911,
## 'internetuserate': -2.4535120774767076,
## 'lifeexpectancy': 0.0}
from sklearn.metrics import mean_squared_error
train_error_aic = mean_squared_error(tar_train, model_aic.predict(pred_train))
test_error_aic = mean_squared_error(tar_test, model_aic.predict(pred_test))
print ('training data MSE')
# R-square from training and test data
rsquared_train_aic=model_aic.score(pred_train,tar_train)
rsquared_test_aic=model_aic.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train_aic)
print ('test data R-square')