Making Data Management Decisions
To obtain research results, we need to manipulate and manage our data properly. I have already limited my study to countries with a life expectancy below 50 or above 75. So far my code is:
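The code itself is not reproduced in this text. As a minimal sketch of how such a subset could be taken with pandas, the block below uses a few made-up rows in place of the real file. The names data and sub are hypothetical, the column name lifeexpectancy is taken from the output further down, and reading the limits as "below 50 or above 75" is an assumption based on the life expectancy values listed in that output.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "lifeexpectancy": ["48.7", "62.0", "79.5", " "],
})

# Convert the text column to numbers; blanks become NaN.
data["lifeexpectancy"] = pd.to_numeric(data["lifeexpectancy"], errors="coerce")

# Assumed condition: keep life expectancy below 50 or above 75.
# Note: rows with a missing life expectancy fail both comparisons and are dropped.
mask = (data["lifeexpectancy"] < 50) | (data["lifeexpectancy"] > 75)

# .copy() gives an independent frame that can be modified safely later on.
sub = data[mask].copy()
print(sub)
```

Working on a copy of the subset avoids pandas warnings about modifying a slice of the original data.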
Furthermore, I have labeled all the missing data. The code:
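The original code is not included in this text. One way this labeling could be done in pandas is sketched here, assuming the missing entries are stored as blank strings. The toy values are made up and sub is a hypothetical name for the subset.

```python
import numpy as np
import pandas as pd

# Made-up column; blank strings stand in for missing entries (illustration only).
sub = pd.DataFrame({"polityscore": ["8", " ", "10", " ", "-7", "10"]})

print("original counts for political score")
print(sub["polityscore"].value_counts(sort=False))

# Replace empty or whitespace-only strings with NaN.
sub["polityscore"] = sub["polityscore"].replace(r"^\s*$", np.nan, regex=True)

print("modified counts for political score")
# dropna=False keeps NaN in the tally, so the amount of missing data is visible.
print(sub["polityscore"].value_counts(sort=False, dropna=False))
```

Without dropna=False, value_counts silently leaves the NaN entries out of the tally.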
The output shows that the missing data has now been labeled as NaN, which makes the work easier because I now know which values are missing and how many. The output is as follows:
Since the output is very lengthy, I have pasted it as text instead of screenshots.
original counts for political score
    8               6
    9               4
    -8              1
    0               1
    7               2
                   42
    -1              1
    6               1
    -7              4
    -9              1
    5               2
    10             28
    -2              2
    -10             1
    Name: polityscore, dtype: int64

modified counts for political score
    NaN            42
    8               6
    9               4
    -8              1
    0               1
    7               2
    -1              1
    6               1
    -7              4
    -9              1
    5               2
    10             28
    -2              2
    -10             1
    Name: polityscore, dtype: int64

original counts for female employment rate
    37.29999924     1
    58.29999924     1
    41.70000076     2
    45.29999924     1
    50.40000153     1
    ..
    30.10000038     1
    63.40000153     1
    39.59999847     2
    56.70000076     1
    53.5            1
    Name: femaleemployrate, Length: 64, dtype: int64

modified counts for female employment rate
    NaN            25
    37.29999924     1
    58.29999924     1
    41.70000076     2
    45.29999924     1
    ..
    30.10000038     1
    63.40000153     1
    39.59999847     2
    56.70000076     1
    53.5            1
    Name: femaleemployrate, Length: 64, dtype: int64

original counts for employment rate
    62.40000153     1
    56.90000153     1
    44.20000076     1
    59.29999924     1
    42.5            1
    ..
    66.90000153     1
    47.09999847     1
    58.40000153     1
    57.20000076     1
    64.30000305     1
    Name: employrate, Length: 66, dtype: int64

modified counts for employment rate
    NaN            25
    62.40000153     1
    56.90000153     1
    44.20000076     1
    59.29999924     1
    ..
    66.90000153     1
    47.09999847     1
    58.40000153     1
    57.20000076     1
    64.30000305     1
    Name: employrate, Length: 66, dtype: int64

original counts for income per person
    6334.105194     1
    24496.04826     1
    27595.09135     1
    8614.120219     1
    239.5187494     1
    ..
    18982.26929     1
    1810.230533     1
    12729.4544      1
    21943.3399      1
    9106.327234     1
    Name: incomeperperson, Length: 80, dtype: int64

modified counts for income per person
    NaN            17
    6334.105194     1
    24496.04826     1
    27595.09135     1
    8614.120219     1
    ..
    18982.26929     1
    1810.230533     1
    12729.4544      1
    21943.3399      1
    9106.327234     1
    Name: incomeperperson, Length: 80, dtype: int64

original counts for life expectancy
    79.499          1
    81.907          1
    83.394          1
    48.673          1
    79.311          1
    ..
    81.097          1
    76.142          1
    81.804          1
    78.371          1
    47.794          1
    Name: lifeexpectancy, Length: 75, dtype: int64

modified counts for life expectancy
    NaN            22
    79.499          1
    81.907          1
    83.394          1
    48.673          1
    ..
    81.097          1
    76.142          1
    81.804          1
    78.371          1
    47.794          1
    Name: lifeexpectancy, Length: 75, dtype: int64
As can be seen, polityscore ranges from -10 to 10, and many values in that range are never taken by the variable. The analysis is easier if I deal with data that ranges from 1 to 10, so I have recoded polityscore.
Before recoding, I converted the variables under study in the copy of the data to numeric (float type).
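A conversion of this kind could look roughly like the sketch below. The name sub2 for the copy is hypothetical, the values are made up, and the column names are the ones that appear in the output above.

```python
import pandas as pd

# Made-up text values standing in for the copied subset (illustration only).
sub2 = pd.DataFrame({
    "polityscore": ["8", " ", "-10"],
    "femaleemployrate": ["37.3", "58.3", ""],
    "employrate": ["62.4", "56.9", "44.2"],
})

cols = ["polityscore", "femaleemployrate", "employrate"]
for col in cols:
    # errors="coerce" turns anything that is not a number into NaN;
    # astype(float) makes the result float even when no value is missing.
    sub2[col] = pd.to_numeric(sub2[col], errors="coerce").astype(float)

print(sub2.dtypes)
```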
Input and output are as follows:
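The exact recoding rule is not given in the text, so the mapping in this sketch is purely illustrative and not necessarily the one I used. It cuts the range -10 to 10 into ten equal-width bins and labels them 1 to 10. The names sub2 and polityscore_recoded are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up scores (illustration only); NaN marks a missing value.
sub2 = pd.DataFrame({"polityscore": [-10.0, -7.0, 0.0, 8.0, 10.0, np.nan]})

# Eleven edges from -10 to 10 give ten bins that are each two points wide.
edges = np.linspace(-10, 10, 11)

# Assumed mapping: bin 1 covers -10 to -8, bin 2 covers -7 and -6, and so on.
# include_lowest=True makes sure that -10 itself falls into the first bin.
sub2["polityscore_recoded"] = pd.cut(
    sub2["polityscore"],
    bins=edges,
    labels=list(range(1, 11)),
    include_lowest=True,
).astype(float)

print(sub2)
```

Missing scores stay NaN after the recode, so they are not silently assigned to a group.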
I computed the ratio of the female employment rate to the employment rate (feer) to see whether the fraction of women who work in a country affects its life expectancy. Input code and output are as follows:
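A ratio like this can be computed column-wise in pandas. The sketch below assumes the columns are already numeric and uses made-up values. The name sub2 is hypothetical, and feer is the name used for the ratio in the rest of this post.

```python
import pandas as pd

# Made-up numeric values (illustration only); None marks a missing entry.
sub2 = pd.DataFrame({
    "femaleemployrate": [37.3, 58.3, None],
    "employrate": [62.4, 56.9, 44.2],
})

# Element-wise division; a missing value in either column gives NaN.
sub2["feer"] = sub2["femaleemployrate"] / sub2["employrate"]

# Show only the first 25 entries.
print(sub2["feer"].head(25))
```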
I have printed only the first 25 entries. Since it is difficult to draw conclusions by studying individual entries, it is better to group them. Hence I have divided feer into 7 almost equal groups:
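One common way to split a variable into groups of almost equal size is pandas.qcut, which cuts at quantiles. Whether this is the function behind the grouping here is an assumption. The name feer_group and the toy values are hypothetical.

```python
import pandas as pd

# Fourteen made-up, distinct ratios (illustration only).
sub2 = pd.DataFrame({"feer": [i / 20 for i in range(1, 15)]})

# qcut places the cut points at quantiles, so each of the 7 groups
# receives roughly the same number of observations.
sub2["feer_group"] = pd.qcut(sub2["feer"], 7, labels=list(range(1, 8)))

print(sub2["feer_group"].value_counts(sort=False))
```

With 14 distinct values and 7 groups, each group ends up with 2 observations.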
Here I have also cross-checked whether feer was grouped properly:
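A check of this kind can be sketched by listing the smallest value, the largest value and the count of feer inside each group. This is one possible approach and may differ from the check I actually ran. The names and toy data are hypothetical.

```python
import pandas as pd

# Made-up ratios grouped into 7 quantile groups (illustration only).
sub2 = pd.DataFrame({"feer": [i / 20 for i in range(1, 15)]})
sub2["feer_group"] = pd.qcut(sub2["feer"], 7, labels=list(range(1, 8)))

# For a proper grouping, the ranges of consecutive groups must not overlap
# and the counts must add up to the number of non-missing feer values.
check = sub2.groupby("feer_group", observed=True)["feer"].agg(["min", "max", "count"])
print(check)
```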
Summary:
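It can be concluded that feer was grouped properly. However, the 7 groups have the same count because feer was deliberately divided into almost equal groups, so the equal counts alone do not show whether the rate of women working in a country affects its life expectancy. Answering that would require comparing life expectancy across the 7 groups.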