Data Analysis @dheedataanalysis - Tumblr Blog

Loan Default Prediction:

Building a Loan Auto-Approval and Review System with Machine Learning

Applying for a loan can be a long, stressful process — not just for customers, but also for loan officers who must carefully review applications one by one. At scale, this manual review process is both time-consuming and prone to human fatigue, which increases the risk of overlooking fraudulent or risky applications.

To solve this, I worked on a project that leverages machine learning to predict loan defaults and automatically decide whether an application can be auto-approved or should be sent for human review. The goal is simple: ease the workload on human reviewers while minimizing risks, creating a faster and more efficient loan approval process.

📝 Problem Statement

Relying solely on humans slows down the process, while relying solely on machines introduces risk. This project provides a hybrid solution:

Low-risk customers are auto-approved.

Borderline or high-risk customers are flagged for human review.

This way, the system balances automation with human oversight.

⚙️ Approach

The goal of this project is to predict whether a loan applicant will default or not. The system makes a binary classification: 0 means no default and 1 means default. These outputs map directly to decisions — a “0” leads to auto-approval, while a “1” sends the application for human review. This way, the process is faster for low-risk customers and safer for higher-risk ones.

1. Data Collection & Feature Engineering

I used loan applicant data from 2016–2017, including both numerical features (e.g., loan amount, term days, repayment delays, birthdate, longitude, latitude) and categorical features (e.g., bank account type, employment status). Features were carefully selected and combined to reflect borrower behavior, demographics, and financial patterns in order to predict the target (whether the applicant will default on loan payments or not).

I built a KMeans clustering pipeline to group customers into three risk levels (0–2) based on how likely each cluster was to default.
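As a rough illustration of how cluster labels can be turned into ordered risk levels, the sketch below ranks clusters by their observed default rate. The function name `risk_levels_from_clusters` and the toy labels are hypothetical, and the clustering step itself is assumed to have already produced one cluster id per customer.

```python
from collections import defaultdict

def risk_levels_from_clusters(cluster_labels, defaulted):
    """Map each cluster id to a risk level (0 = lowest default rate)."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for cluster, flag in zip(cluster_labels, defaulted):
        totals[cluster] += 1
        defaults[cluster] += flag
    # Default rate per cluster, then rank clusters from safest to riskiest.
    rates = {c: defaults[c] / totals[c] for c in totals}
    ranked = sorted(rates, key=rates.get)
    return {cluster: level for level, cluster in enumerate(ranked)}

# Toy data: cluster ids from a 3-cluster model and 0/1 default flags.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
defaulted = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
mapping = risk_levels_from_clusters(clusters, defaulted)
risk_feature = [mapping[c] for c in clusters]
```

The resulting `risk_feature` column is what would be appended to the other selected features.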

I trained Logistic Regression, Random Forest, and XGBoost models to identify the most important features. From these, I selected the top 20 features to reduce noise and strengthen the base models. Finally, I added the risk level as an additional feature, giving a total of 21 features for the voting system.

Final features included:

Loan history (loan growth trend, last loan amount, loan number, average past loan amount, standard deviation of past loan amounts, average past term days, average loan intervals, average past payout time)

Financial ratios (credit score (0–5), debt-to-loan ratio, total due, average total due, risk level (0–2) from clustering)

Repayment behavior (percentage of overdue payments, maximum repayment delay)

Customer demographics (age, employment status, state location, bank name, bank account type)

2. Modeling with an Ensemble of Classifiers

Instead of relying on a single algorithm, I built a voting ensemble of:

Logistic Regression

Random Forest

XGBoost

LightGBM

CatBoost

Each model was tuned individually and given a custom decision threshold to account for the imbalance in the loan default data (a 78% to 22% split between non-default and default cases). The ensemble then combines their votes to produce a final prediction.
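A minimal sketch of threshold-based voting follows. The thresholds are the ones quoted in the Results section of this post. Everything else is an assumption for illustration: that thresholds apply to the predicted default probability, that votes are combined by simple majority, and the applicant probabilities themselves.

```python
# Per-model thresholds as reported in the Results section of this post.
THRESHOLDS = {
    "logistic_regression": 0.81,
    "random_forest": 0.54,
    "xgboost": 0.33,
    "lightgbm": 0.56,
    "catboost": 0.62,
}

def ensemble_vote(default_probabilities, thresholds=THRESHOLDS):
    """Return 1 (default) if most models vote default, else 0.

    Assumption: simple majority (hard) voting; the project may weight
    or combine votes differently.
    """
    votes = [
        int(default_probabilities[name] >= threshold)
        for name, threshold in thresholds.items()
    ]
    return int(sum(votes) > len(votes) / 2)

def route(prediction):
    """Map the ensemble output to the decision described in the post."""
    return "auto-approve" if prediction == 0 else "human review"

# Hypothetical predicted default probabilities for one applicant.
applicant = {
    "logistic_regression": 0.85,
    "random_forest": 0.60,
    "xgboost": 0.30,
    "lightgbm": 0.58,
    "catboost": 0.40,
}
decision = route(ensemble_vote(applicant))
```

Here three of the five models exceed their thresholds, so the hypothetical applicant would be sent for human review.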

3. Decision Layer: Auto-Approval vs Human Review

If the model is confident and predicts not default (0), the loan is auto-approved.

If a default (1) is predicted, the application is flagged for human review.

This ensures automation doesn’t replace humans but instead augments them.

4. Deployment with Streamlit

To make the system accessible, I built a Streamlit web app that:

Allows new and returning customers to apply for a loan.

Lets admin reviewers view predictions and model confidence.

📊 Results

My objective was to minimize financial risks while releasing as many loans as possible correctly. The model shouldn’t be too strict either, so as to reduce the workload on human reviewers. Since the dataset’s target was imbalanced, I applied SMOTE and class weighting to regulate how the models penalize misclassifications. I benchmarked several machine learning models, focusing on precision, recall, f1-score, accuracy, and ROC-AUC to capture performance under class imbalance.
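To give a sense of what class weighting does with a 78% / 22% target, the sketch below computes weights with the "balanced" heuristic, n_samples / (n_classes * class_count). Whether the project used this exact heuristic is an assumption, and the toy target only mirrors the class ratio mentioned earlier.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy target with the 78% / 22% split mentioned in the post.
y = [0] * 78 + [1] * 22
weights = balanced_class_weights(y)
```

Under this heuristic, a misclassified default counts roughly 3.5 times as much as a misclassified non-default during training.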

Logistic Regression

ROC-AUC: 0.71, Accuracy: 69%

Class Performance:

Non-default (0): Precision 0.87, Recall 0.72

Default (1): Precision 0.37, Recall 0.61

Confusion Matrix:

With a threshold of 0.81, the logistic regression model is strong at predicting non-default borrowers, meaning most approved loans are indeed safe. However, it is weaker at spotting risky borrowers (defaulters), so some customers who are likely to default may still get approved.

Random Forest

ROC-AUC: 0.68, Accuracy: 62%

Class Performance:

Non-default (0): Precision 0.85, Recall 0.62

Default (1): Precision 0.31, Recall 0.62

Confusion Matrix:

With a threshold of 0.54, the random forest model shows a balanced recall across both classes, meaning it is relatively better at catching risky borrowers (defaulters) than logistic regression. However, this comes at the cost of lower precision, so while more defaulters are flagged, some safe customers may also get flagged for review.

XGBoost

ROC-AUC: 0.65, Accuracy: 62%

Class Performance:

Non-default (0): Precision 0.85, Recall 0.62

Default (1): Precision 0.30, Recall 0.59

Confusion Matrix:

With a threshold of 0.33, the XGBoost model performs similarly to Random Forest, capturing a fair share of risky borrowers (defaulters) with moderate recall.

LightGBM

ROC-AUC: 0.69, Accuracy: 62%

Class Performance:

Non-default (0): Precision 0.85, Recall 0.70

Default (1): Precision 0.35, Recall 0.57

Confusion Matrix:

With a threshold of 0.56, the LightGBM model offers a more balanced trade-off between precision and recall compared to Random Forest and XGBoost. It is fairly strong at identifying safe borrowers while capturing more than half of risky borrowers.

CatBoost

ROC-AUC: 0.70, Accuracy: 62%

Class Performance:

Non-default (0): Precision 0.86, Recall 0.70

Default (1): Precision 0.35, Recall 0.59

Confusion Matrix:

With a threshold of 0.62, the CatBoost model shows strong performance in identifying non-default borrowers, similar to LightGBM, while offering slightly better recall for risky borrowers. This means it can catch more potential defaulters without significantly sacrificing accuracy, making it a reliable choice for balancing speed and risk.

ROC-AUC Curve

All models perform better than random guessing (ROC-AUC = 0.5), but Logistic Regression, LightGBM, and CatBoost appear more confident and reliable in differentiating borrowers who will default from safe ones.
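The F1-score is not listed per model, but it can be approximated from the precision and recall values reported above for the default class. The sketch below does that; because the inputs are already rounded to two decimals, the results are only approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the default class (1), as reported above.
default_class = {
    "Logistic Regression": (0.37, 0.61),
    "Random Forest": (0.31, 0.62),
    "XGBoost": (0.30, 0.59),
    "LightGBM": (0.35, 0.57),
    "CatBoost": (0.35, 0.59),
}
f1_scores = {name: round(f1(p, r), 2) for name, (p, r) in default_class.items()}
```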

📈 Voting Ensemble (Final System)

Class Performance:

Non-default (0): Precision 0.85, Recall 0.70

Default (1): Precision 0.34, Recall 0.56

Confusion Matrix:

The Voting Ensemble combines all individual models, producing more stable predictions. Its performance does not significantly drop compared to the base models, making it effective for the loan default prediction.

The idea is to send the 310 predicted defaults (204 + 106) for human review to sift out those who are truly eligible for loan approval. If reviewers approve all 204 eligible applicants among the 310, then only 84 of the 768 approved loans (11%) actually default. This approach balances loan approval speed and human reviewer workload while minimizing financial risk.
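The arithmetic behind these figures can be checked with a short sketch. The counts 204, 106, 84, and 768 come from the paragraph above. Reading 204 as safe applicants flagged for review, 106 as correctly flagged defaulters, and 84 as missed defaulters is my interpretation, and the number of correctly auto-approved applicants is not stated in the post, so it is derived here rather than quoted.

```python
false_positives = 204   # safe applicants flagged for review
true_positives = 106    # defaulters flagged for review
false_negatives = 84    # defaulters that were auto-approved
approved_total = 768    # auto-approved + approved after review

flagged = false_positives + true_positives
# Derived, not stated in the post: safe applicants that were auto-approved.
true_negatives = approved_total - false_positives - false_negatives

default_rate_among_approved = false_negatives / approved_total
recall_default = true_positives / (true_positives + false_negatives)
precision_default = true_positives / flagged
recall_non_default = true_negatives / (true_negatives + false_positives)
```

Under this reading, the derived rates round to the ensemble's reported default precision (0.34), default recall (0.56), and non-default recall (0.70), which suggests the interpretation is consistent with the results.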

Deep Neural Network

Class Performance:

Non-default (0): Precision 0.86, Recall 0.63

Default (1): Precision 0.32, Recall 0.64

Confusion Matrix:

The DNN model is fairly good at predicting non-default borrowers, so most approved loans are safe. Its ability to detect risky borrowers is moderate, catching some defaulters but still missing a portion. Overall, the DNN did not significantly outperform the voting ensemble, meaning the simpler ensemble approach remains an effective and reliable choice for the auto-approval system.

🚀 Impact

By blending machine learning with human oversight, this system provides:

Faster loan approvals for customers.

Reduced workload for human reviewers.

Lower financial risk for lenders.

Instead of replacing humans, the model works alongside them, ensuring decisions are faster, fairer, and more accurate.

🔧 Future Improvements

Enhanced Data Collection: Gather more granular and correlated financial and behavioral data—such as income, payment frequency, employment history, and marital status—to capture richer borrower patterns.

Expanded Feature Engineering: Incorporate transaction-level features and design more sophisticated features, especially to improve the Deep Neural Network’s performance.

Model Optimization: Explore advanced architectures and hyperparameter tuning for the DNN to better capture nonlinear relationships in borrower behavior.

The code and implementation details are available on my GitHub repo:

Machine learning model for loan default prediction. It auto-approves highly credible applicants (class 0) and flags potential defaulters (cl

#machine learning #data science #feature engineering #classification models


Capstone Project Preliminary Results

Descriptive Statistical Analyses

A dataset, my_data, was created from the data dataset (loaded from environment_ng.csv) and contains the needed explanatory and response variables.

Descriptive statistical analyses were conducted on my_data variables. Since all variables are quantitative, Minimum, Mean, Maximum, Standard Deviation, Count and some other statistical values were determined.

For the response variable CO2 Emissions (kt): Min = 3406.643000, Mean = 59472.692667, Max = 131685.637000, Standard Deviation = 36511.379085, Count = 57.0000.

Scatterplots were plotted for the association between the response variable and each of the quantitative explanatory variables. The graphs are shown below.

Pearson Correlation Analyses

Pearson correlation analysis was conducted, and only 3 of the 7 explanatory variables (Primary Energy Intensity, Access to Electricity, and Greenhouse Gas Emissions) are significantly correlated with the response variable, CO2 Emissions in Nigeria (p-value <= 0.05).
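For readers unfamiliar with the statistic, here is a minimal sketch of how a Pearson correlation coefficient is computed. The numbers are made up for illustration and are not taken from the Nigeria dataset; the sketch also omits the p-value, which the analysis itself would obtain from a statistics library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (not taken from the dataset).
access_to_electricity = [40.0, 45.0, 48.0, 52.0, 55.0]
co2_emissions_kt = [60000.0, 72000.0, 80000.0, 91000.0, 99000.0]
r = pearson_r(access_to_electricity, co2_emissions_kt)
```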

LARS Lasso Regression Analysis

The Lasso Regression model dropped none of the 7 predictor variables. The strongest predictors are Access to Clean Fuels (coefficient: 11995.762382) and Primary Energy Intensity (coefficient: -18114.977427). Together these predictors can account for 85.87% of the variability in CO2 Emissions in Nigeria. The Mean Squared Error for the training set was 126129096.64378142.

#Capstone #Data Analysis #Coursera

Capstone Project Data Management

Sample

The Nigeria- Environment dataset (environment_nga.csv, source: data.humdata.org) contains data from World Bank’s portal from different sources like the International Energy Agency and the Carbon Dioxide Information Analysis Center. It contains 127 Indicators with observations for 59 years (1960 – 2018).

The Nigeria-Environment dataset was cleaned and redrawn using pandas DataFrame to suit the format that can be used for the analyses.

The sample used in this study contains observations for N=39 years (1980 – 2018). Some variables lacked valid observations for certain years; these NaN values were replaced with the mean of the variable's observations because the dataset is very small and the study was intended to be longitudinal.

Measures

Carbon Dioxide emissions is the quantitative response variable. The variable is the measure of annual CO2 emissions in Nigeria, measured in kilotons (kt). It portrays the carbon footprint of the country in a year.

Predictors used are all quantitative. They are:

Access to clean fuels and technologies for cooking (% of population)

Energy intensity level of primary energy (MJ/$2011 PPP GDP)

Access to electricity (% of population)

Renewable energy consumption (% of total final energy consumption)

People using at least basic drinking water services (% of population)

Total natural resources rents (% of GDP)

Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)

Analyses

The distributions of all the variables were evaluated by calculating the mean, standard deviation, minimum, and maximum, because they are all quantitative.

A scatterplot for each variable was also graphed. Pearson Correlation analysis and Analysis of Variance (ANOVA) were used to test the bivariate associations between the response variable and each of the quantitative predictors.

All of the predictors were evaluated by Multiple Regression with the response variable to check for confounders.

LASSO Regression with the Least Angle Regression (LAR) algorithm was used to identify the variables that best predicted the response variable, CO2 emissions. The dataset was not split into training and test sets because it is very small (N=39). All predictor variables were standardized to have mean=0 and standard deviation=1 before the analysis. 10-fold cross-validation was performed. The variation in the cross-validation Mean Squared Error was graphed to identify the best subset of predictor variables. The model's prediction accuracy was measured by the Mean Squared Error and R-Square.
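The standardization step can be sketched as follows. The values are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption about how the scaling was done.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    return [(v - mean) / sd for v in values]

# Made-up predictor values for illustration only.
renewable_energy_share = [88.0, 86.5, 85.0, 87.2, 83.9]
z = standardize(renewable_energy_share)
```

Standardizing matters for LASSO because the penalty acts on coefficient size, so predictors measured on larger scales would otherwise be penalized differently.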

#Capstone #Data Analysis #Coursera

Capstone Project Introduction

Title

The Association between Carbon Dioxide (CO2) emissions and Energy related Factors in Nigeria.

Statement of Research Question

The purpose of this study is to identify the best predictors of Carbon Dioxide (CO2) emissions from Energy related factors such as Energy Consumption, Energy Generation, Mineral Rents, and so on.

Motivation for Research Question

CO2 emissions have been one of the major contributors to the climate change experienced globally, and technology can account for a large share of these emissions. As an Electrical and Electronic Engineering student interested in Energy Generation and Consumption, having a better understanding of the factors associated with CO2 emissions will allow me to identify which industry practices increase or decrease the emissions.

Implications of the Research Question

This study will help to identify the Energy practices that increase the CO2 emissions of Nigeria. This will lead to substituting them with alternatives that give less emissions. It is a study that can propose the major contributors to the climate change happening globally.

Dataset Information

The Nigeria – Environment dataset (source: data.humdata.org) contains data from World bank’s portal from the International Energy Agency and the Carbon Dioxide Information Analysis Center. It has observations for 127 Indicators (Energy intensity level of primary energy, Access to electricity, CO2 emissions from gaseous fuel consumption, Mineral rents, etc.) for Years 1960 - 2018.

#Capstone #Data Analysis #Coursera

K-Means Cluster Analysis for the GapMinder Dataset

Cluster Analysis is an Unsupervised Learning method that classifies similar datapoints/observations together. It can be used for Data Reduction by allowing the classification of many variables into a single variable with many categories.

K-Means Cluster analysis creates a multi-dimensional space where the number of dimensions is equal to the number of input variables. The distance between observations and centroids can be calculated with different measures, but Euclidean distance is the most common.
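The assignment step of K-Means can be sketched in a few lines: each observation is assigned to the centroid with the smallest Euclidean distance. The centroids and observations below are hypothetical, and the sketch leaves out the centroid update step that a full K-Means run would repeat until convergence.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` by Euclidean distance."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical standardized observations with two input variables each.
centroids = [(-1.0, -1.0), (0.0, 0.5), (1.5, 1.0)]
observations = [(-0.8, -1.2), (0.1, 0.4), (1.4, 1.3)]
assignments = [nearest_centroid(p, centroids) for p in observations]
```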

Summary of Results

A K-Means cluster analysis was conducted to identify subgroups of countries based on the similarity of their values on 5 variables that could have an impact on ‘relectricperperson’ (response variable). The clustering variables are all quantitative: ‘incomeperperson’, ‘urbanrate’, ‘employrate’, ‘internetuserate’, and ‘co2emissions’. All the clustering variables were standardized to have a mean = 0 and standard deviation = 1.

The data was not split because the number of observations is small (N = 127). A range of k = 1 to 9 clusters was used in the analysis. An elbow curve was plotted to help in choosing the value of k to interpret. The elbow curve suggested that a 3-cluster solution might be interpreted.

Canonical discriminant analysis was used to reduce the 5 clustering variables to 3 canonical variables that accounted for most of the variance in the clustering variables. The canonical scatterplot of the first two canonical variables indicated that there are 3 distinct clusters, with the blue and purple clusters not highly correlated, unlike the yellow cluster.

The means on the clustering variables showed that the countries in cluster 2 had moderate levels on the clustering variables. Cluster 1 has the highest levels of incomeperperson, urbanrate, internetuserate, and co2emissions. Countries in cluster 0 have the highest employrate, but have fairly low levels on all other variables compared to countries in cluster 2.

To validate the clusters, an ANOVA was conducted to test for significant differences between the clusters on relectricperperson. A Tukey HSD test was used for post hoc comparisons between clusters. Results from the F-stat, Prob (F-statistic) and Tukey test, showed significant differences between clusters on ‘relectricperperson’. The countries in cluster 1 have the highest value of relectricperperson (mean = 3113.3330, sd = 2221.0108), while those in cluster 0 have the lowest relectricperperson (mean = 174.0750, sd = 1030.6666).

#K-Means Clustering #Machine Learning #Data Analysis #Coursera


Lasso Regression for the GapMinder Dataset

LASSO is a Supervised Machine Learning method. It stands for Least Absolute Shrinkage and Selection Operator. LASSO uses a shrinkage and selection method, which causes some of the regression coefficients of the explanatory variables to equal zero (0) and keeps the ones most strongly associated with the response variable.

LASSO provides greater prediction accuracy when there is a linear relationship between the explanatory and response variables, with a small number of observations and a high number of predictors. The tuning parameter is called lambda (or alpha in sklearn), and it controls the shrinkage of the model. A higher lambda means more coefficients are set to zero. When lambda is zero, the model is an OLS regression. The Least Angle Regression (LAR) algorithm is used.
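To make the role of lambda concrete, the sketch below applies the soft-thresholding rule, which gives the lasso solution in the special case of standardized, uncorrelated (orthonormal) predictors, up to how lambda is scaled. That special case is an assumption made for illustration; with correlated predictors the LAR algorithm traces the solution path instead. The OLS coefficients are made up.

```python
def soft_threshold(ols_coefficient, lam):
    """Shrink a coefficient toward zero by lam; set it to zero if |b| <= lam."""
    magnitude = max(abs(ols_coefficient) - lam, 0.0)
    return magnitude if ols_coefficient >= 0 else -magnitude

# Made-up OLS coefficients for five standardized predictors.
ols = [2.5, -1.2, 0.4, 0.05, -0.3]

for lam in (0.0, 0.5, 1.5):
    lasso = [soft_threshold(b, lam) for b in ols]
    n_zero = sum(1 for b in lasso if b == 0.0)
    print(f"lambda={lam}: {n_zero} coefficients set to zero")
```

With lambda = 0 the coefficients equal the OLS values, and as lambda grows more of them are set exactly to zero, which is the selection behaviour described above.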

Given the small number of explanatory variables, OLS regression would be ideal, but LASSO regression is used for the purpose of this study.

From the Bias-Variance Tradeoff graph below, it is seen that the more predictors added to the model, the more prediction error decreases for the training set. Consequently, bias is lower.

Summary of Results

Lasso regression analysis was run to identify a subset of variables from 5 quantitative predictor variables that best predicted a quantitative response variable, ‘relectricperperson’. The quantitative predictor variables are ‘incomeperperson’, ‘urbanrate’, ‘employrate’, ‘internetuserate’, and ‘co2emissions’. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

Data were randomly split into a training set that included 70% of the observations (N=88) and a test set that included 30% of the observations (N=39). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
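The 88 / 39 split sizes can be reproduced with a small sketch. The helper name and seed are hypothetical, and rounding the test size up is an assumption chosen because it matches the counts reported above for 127 observations.

```python
import math
import random

def train_test_split_indices(n_samples, test_fraction, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    n_test = math.ceil(n_samples * test_fraction)  # round the test size up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(127, 0.30)
print(len(train_idx), len(test_idx))  # 88 39
```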

Of the 5 predictor variables, 4 were retained in the selected model. ‘incomeperperson’ and ‘internetuserate’ were most strongly associated with ‘relectricperperson’. No predictor was negatively associated with ‘relectricperperson’. These 4 variables accounted for 50% of the variance in the ‘relectricperperson’ response variable.

#Lasso Regression #Machine Learning #Data Analysis #Coursera

Random Forest Analysis for the GapMinder Dataset

Decision Trees are easy to visualize and interpret, but they are not very reproducible on future data. This makes them less reliable as prediction models and more useful for exploratory data analysis.

Random Forests are derived from Decision Trees but proceed by growing many trees to improve model reproducibility. A random sample of observations is selected through a process called bagging. Each tree is grown on a different randomly selected sample of bagged data, and the remaining unbagged observations are used for testing the trees.

Random Forest Classifier deals with a categorical response variable/target y, while Random Forest Regressor is used for quantitative targets y. The same goes for Decision Trees. The accuracy of Classifiers is measured by confusion matrices and accuracy scores, while for Regressors, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are used.
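The three regression error measures named above can be computed directly from paired actual and predicted values, as in this small sketch. The numbers are made up for illustration.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return mae, mse, math.sqrt(mse)

# Made-up targets and predictions for illustration only.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
mae, mse, rmse = regression_errors(actual, predicted)
```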

Summary of Results

Random Forest analysis was performed to examine the importance of ‘incomeperperson’ and ‘urbanrate’ as possible contributors to a random forest evaluating the response variable ‘relectricperperson_cat’.

Accuracy of the model built is 82.69% with 6 ‘false-negatives’ and 3 ‘false-positives’. ‘incomeperperson’ has almost double the importance of ‘urbanrate’ in the trees grown. The accuracy test suggested that model accuracy is higher when growing more than 2 decision trees.

#Random Forests #Machine Learning #Data Analysis #Coursera

Decision Tree Analysis for the GapMinder Dataset

Machine Learning (ML) encompasses a lot of statistical methods. It is used to describe associations, find patterns, and make predictions from a dataset.

Definition of terms

1. Supervised Learning: ML task that maps input/predictor X to output/target y.

2. Unsupervised Learning: ML task function with no target set y.

3. Training set: Dataset used to construct algorithm so that it can predict from data.

4. Test set: Dataset used to measure accuracy of the algorithm built.

5. Overfitting: modelling error as a result of closely fitting the ML function to a limited set of data.

6. Underfitting: modelling error as a result of over-simplicity of the ML algorithm.

7. Accuracy: measure of extent to which a model correctly classifies observations into categories and it is assessed by the Test Error Rate.

8. Variance: change in parameter estimates across different datasets.

9. Confusion Matrix: an estimate of the prediction accuracy of a model.

10. Leaf nodes: final subgroups of a decision tree split.

Decision Tree Analysis

Decision Tree analysis was carried out to test nonlinear relationships among a range of explanatory variables and ‘relectricperperson_cat’, a binary, categorical response variable. In the analysis, all possible cut-points (quantitative) are tested.

The explanatory variables included in this analysis as possible contributors to a classification tree model evaluating ‘relectricperperson_cat’ (response variable) are ‘incomeperperson’ and ‘urbanrate’. Note that the analysis can take any number of explanatory variables, both categorical and quantitative. The gini criterion is used to grow the tree, and an explanatory variable can appear in the tree more than once, just as seen below. Cross-validation methods guard against overfitting.

Small changes in the data lead to a different split, and rerunning the analysis on the same dataset slightly changes the values because samples are randomly selected.
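As a sketch of what the gini criterion measures, the code below computes the Gini impurity of a node and the size-weighted impurity after a candidate split. The class counts are hypothetical and are not taken from the tree in this analysis.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of observations per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def split_impurity(left_counts, right_counts):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left * gini_impurity(left_counts)
            + n_right * gini_impurity(right_counts)) / total

# Hypothetical counts of (class 0, class 1) on each side of a cut-point.
parent = gini_impurity([40, 38])
after_split = split_impurity([30, 7], [10, 31])
```

A tree-growing algorithm evaluates candidate cut-points in this way and keeps the one with the lowest weighted impurity.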

Summary of Results

1. ‘incomeperperson’ was the first variable to separate the sample into two subgroups. Countries with ‘incomeperperson’ <= 4839.878 (range = 103.776 - 52301.587, mean = 8784.532, S.D = 11420.776) are less likely to have residential electricity per capita ‘relectricperperson_cat’ as compared to countries with ‘incomeperperson’ > 4839.878 (37 vs. 41, N=78).

Other subdivisions are made at both separations, and the leaf nodes are highlighted to make the tree easier to follow.

2. The model Accuracy = 0.788; 21.2% misclassification.

#Decision Tree #Machine Learning #Data Analysis #Coursera

Logistic Regression for the GapMinder Dataset

What if the response variable is a binary categorical variable, i.e. a categorical variable with only two categories? Logistic Regression is used instead of Linear Regression.

In linear regression, any quantitative value can be obtained, but for logistic regression, the output should be a value between zero and one. So, we do not want true expected values; we want probabilities.

Odds Ratio OR

OR compares the odds of an event occurring in one group with the odds of it occurring in another group. OR can range from 0 to ∞ [0 ≤ OR < ∞], and is centred around 1.

OR = 1: Model is non-significant

OR > 1: As explanatory variable increases, response variable becomes more likely

OR < 1: As explanatory variable increases, response variable becomes less likely

OR for our sample is 1.020 but can vary according to sample. The confidence interval indicates that there is a 95% certainty that the true population’s odds ratio falls between 1.007 and 1.033 for urbanrate.

As it is in Multiple Linear Regression, more explanatory variables can be added to the Logistic Regression model to identify the multiple predictors of our binary categorical response variable.

Summary of Results

1. Number of observations N=189.

2. From the Logistic Regression model:

'incomeperperson', beta=-8.251e-06, p=0.426 (> 0.05), 95% C.I= -2.86e-05 - 1.21e-05. This shows 'incomeperperson' is not related to 'relectricperperson'.

'urbanrate', beta=0.0199, p=0.002 (< 0.05), 95% C.I= 0.007 - 0.033. This shows 'urbanrate' is significantly and positively associated with 'relectricperperson'.

3. From the Logistic Regression, this linear equation is gotten:

Because we want probabilities not true values, we go ahead to find the OR.

4. 'urbanrate' (OR = 1.020, 95% C.I = 1.007 - 1.033).

5. The confidence limits give the range within which the OR of the true population falls with 95% certainty.
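The odds ratio is obtained by exponentiating the logistic regression coefficient, and the same applies to the confidence limits. The short sketch below uses the ‘urbanrate’ values reported above; since those limits are already rounded to three decimals, the upper bound it prints may differ slightly in the last digit from the 1.033 quoted in the results.

```python
import math

# Values reported above for 'urbanrate'.
beta = 0.0199
ci_lower, ci_upper = 0.007, 0.033

odds_ratio = math.exp(beta)
or_lower, or_upper = math.exp(ci_lower), math.exp(ci_upper)

print(f"OR = {odds_ratio:.3f}, 95% CI = {or_lower:.3f} to {or_upper:.3f}")
```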

#Logistic Regression #Data Analysis #Coursera

Multiple Linear Regression for Multiple-Category Explanatory Variable

If the explanatory variables are quantitative or categorical with two categories, the steps to be taken are enumerated in the Multiple Regression for the GapMinder Dataset* post. When there is a categorical variable with more than two categories, additional steps are required.

Just as centring is important for quantitative explanatory variables, recoding is important for categorical variables. Recoding means setting one category to zero and making it the reference for the other categories.
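A minimal sketch of this recoding is shown below: one 0/1 indicator is created per non-reference category, so rows in the reference category are zero on every indicator. The category labels and the helper name `dummy_code` are hypothetical.

```python
def dummy_code(values, reference):
    """One 0/1 column per non-reference category; reference rows are all 0."""
    categories = sorted(set(values) - {reference})
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Hypothetical three-level version of a categorical urban rate variable.
urbanrate_cat = ["low", "medium", "high", "low"]
coded = dummy_code(urbanrate_cat, reference="low")
```

Each regression coefficient on an indicator is then interpreted relative to the reference category.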

Summary of Results

1. From the results, R-squared = 0.494, meaning 49.4% of the variability in ‘relectricperperson’ is accounted for.

2. The ‘urbanrate_cat’ categories are not significantly different from each other, as all the p-values > 0.05.

3. In all these, ‘incomeperperson_c’ and ‘incomeperperson_c**2’ result values are constant:

‘incomeperperson_c’: beta=0.1457, p=0.000 (< 0.001), 95% C.I=0.106 – 0.185;

‘incomeperperson_c**2’: beta=-2.977e-06, p=0.000 (< 0.001), 95% C.I=-4.56e-06 – -1.4e-06;

4. ‘incomeperperson_c’ is significantly and positively related to ‘relectricperperson’, ‘incomeperperson_c**2’ is significantly and negatively related, while ‘urbanrate_cat’ is not associated with ‘relectricperperson’.

#Multiple Regression #Data Analysis #Coursera


Multiple Regression for the GapMinder Dataset

In this post, using more than one explanatory variable for linear regression is examined. The effect of both ‘incomeperperson’ and ‘urbanrate’ on ‘relectricperperson’ is studied. The second-order polynomial (quadratic) regression lines are drawn and used in the analysis.

Due to variability in the sample chosen, a 95% Confidence Interval (CI) [about 2 standard errors on either side of the estimate] is used. This means we can conclude, with 95% confidence, that the true population parameter falls between the lower and upper confidence limits that are estimated from our sample parameter.

Summary of Results

1. Centering the explanatory variables shifts the observations so that they are distributed around zero and the mean is zero.

2. Linear Regression results

‘urbanrate_c’ - ‘relectricperperson’: beta1=35.2598, p=0.000, 95% C.I=23.413 - 47.107.

‘urbanrate_c’ - ‘urbanrate_c**2’ - ‘relectricperperson’: beta1=37.9121, beta2=0.4121, p = [urbanrate_c: 0.000, urbanrate_c**2: 0.101], 95% C.I=[urbanrate_c: 25.723 - 50.101, urbanrate_c**2: -0.081 - 0.905]. ‘urbanrate_c**2’ is not associated with ‘relectricperperson’.

‘incomeperperson_c’ - ‘relectricperperson’: beta1=0.0904, p=0.000, 95% C.I=0.072 - 0.109.

‘incomeperperson_c’ - ‘incomeperperson_c**2’ - ‘relectricperperson’: beta1=0.1448, beta2=-2.993e-06, p = [incomeperperson_c: 0.000, incomeperperson_c**2: 0.000], 95% C.I = [incomeperperson_c: 0.113 - 0.176, incomeperperson_c**2: -4.44e-06 - -1.55e-06]. ‘incomeperperson_c**2’ is negatively associated with ‘relectricperperson’.

‘incomeperperson_c’ - ‘incomeperperson_c**2’ - ‘urbanrate_c’ - ‘relectricperperson’: beta1=0.1469, beta2=-3.045e-06, beta3=-1.0462, p= [incomeperperson_c: 0.000, incomeperperson_c**2: 0.000, urbanrate_c: 0.874]. ‘urbanrate_c’ is not associated with ‘relectricperperson’.

3. ‘incomeperperson_c’ is positively and significantly related to ‘relectricperperson’, ‘incomeperperson_c**2′ is negatively and significantly related to ‘relectricperperson’, while ‘urbanrate_c’ is not related to ‘relectricperperson’.

4. There is a case of confounding, as ‘urbanrate_c’, which was once significant, becomes insignificant when ‘incomeperperson_c’ is added to the analysis. ‘incomeperperson’ accounts for about 49.2% of the variability in ‘relectricperperson’.

5. There are high-leverage datapoints and outliers in our observations, but no datapoint is both high-leverage and an outlier.

6. The equation for regression changed from (i) to (ii) as ‘urbanrate_c’ is not significant to the analysis.

E in the equations represents the error terms.
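The centering and the quadratic term used in these models can be sketched as follows. The income values are made up for illustration; only the variable naming follows the post.

```python
import statistics

def center(values):
    """Subtract the mean so the centered variable has mean zero."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]

# Made-up income values for illustration only.
incomeperperson = [500.0, 1500.0, 4000.0, 12000.0, 30000.0]
incomeperperson_c = center(incomeperperson)
# Quadratic term built from the centered variable.
incomeperperson_c_sq = [v ** 2 for v in incomeperperson_c]
```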

#Multiple Regression #Data Analysis #Coursera

Basic Linear Regression Model for the GapMinder Dataset

A Multivariate Model helps us to determine the portion of the association between the explanatory and response variables that can be accounted for by other variables. Multivariate Models can be classified into two types, namely Multiple Regression [Quantitative Response Variable] and Logistic Regression [Binary Response Variable].

A basic Linear Regression Model is based on the equation of a straight line, Y = mX + b [Y = Response variable on y-axis, X = Explanatory Variable on x-axis, m = Slope of graph, b = Y-intercept].

From the Linear Regression equation, y can be predicted [‘y-hat’ is the predicted y value]. It is anticipated that the observed value will differ from the predicted value unless the value falls on the regression line. The causal model is imposed, not tested, so error terms are included in the model. Outliers also increase the prediction errors.

Summary of Results

The Ordinary Least Squares Regression Model (OLS) was carried out for the association between our explanatory and response variables. A positive value for slope m shows a positive association between the variables.

For ‘incomeperperson’ as explanatory variable and ‘relectricperperson’ as response variable: b = 354.1982, m = 0.0984, p-value < 0.0001. The m and p-value indicate a positive and significant association between ‘relectricperperson’ and ‘incomeperperson’.

relectricperperson = 0.0984*incomeperperson + 354.1982

For ‘urbanrate’ as explanatory variable and ‘relectricperperson’ as response variable: b = -1151.6589, m = 37.7894, p-value < 0.0001. The m and p-value show a positive and significant relationship between ‘urbanrate’ and ‘relectricperperson’.

relectricperperson = 37.7894*urbanrate - 1151.6589
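
To make the fitted lines above concrete, here is a minimal sketch in plain Python of how a slope m and intercept b can be estimated by ordinary least squares and then used for prediction. The helper names `fit_line` and `predict`, the toy data, and the example inputs (an income of 1000 and an urban rate of 50) are assumptions made for illustration only; just the two pairs of coefficients come from the results reported above.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = m*x + b for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(x, m, b):
    """Return the predicted response (y-hat) for a given x."""
    return m * x + b


# Toy data lying exactly on y = 2x + 1 (hypothetical, not GapMinder values)
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
toy_m, toy_b = fit_line(toy_x, toy_y)

# Predictions from the coefficients reported in this post;
# the inputs 1000.0 and 50.0 are made-up example values
income_pred = predict(1000.0, m=0.0984, b=354.1982)
urban_pred = predict(50.0, m=37.7894, b=-1151.6589)

print(toy_m, toy_b)
print(income_pred, urban_pred)
```

Real observations will not fall exactly on the line, which is why the error term is part of the model.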

#Linear Regression #Data Analysis #Coursera

Data Description for the GapMinder Dataset

The GapMinder Dataset contains Observational data collected through Data Reporting.

Sample

The dataset is from GapMinder (www.gapminder.org), a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. The unique identifier variable is Country. The dataset contains data from all 192 countries that are members of the United Nations (aggregating data for Serbia and Montenegro), with data from 24 other countries, summing up to 215 countries (N = 215). The indicators are variables such as Income per Person, Urban Rate, Residential Electricity per Capita, Cumulative CO2 Emissions, Life Expectancy at Birth, and similar country-level data. The data gathered for each variable are for a year between 2002 and 2011.

The variables analyzed in the current study are ‘incomeperperson’, ‘urbanrate’ and ‘relectricperperson’. Each has 215 observations, and all are quantitative observational data.

Procedure

Data were collected from different sources, amongst which are the World Bank, World Health Organization (WHO), United Nations Statistics Division, International Energy Agency, and International Labour Organization. The observational data for each country was collected from these sources and more.

Measures

The measure of Income per Person (‘incomeperperson’) was drawn from the World Bank World Development Indicators. It measures the estimated Gross Domestic Product per capita in constant 2000 US$, with the cost of living of countries taken into account, for year 2010. Where necessary in the current study, it is binned into four categories using a pandas quartile split.

Urban Population (% of total) (‘urbanrate’) was sourced from the World Bank and was calculated using the estimates of the World Bank population and urban ratios from the UN Urbanization Prospects of year 2009. Where needed in the current analysis, the variable is binned into five categories using the pandas cut function.

Residential Electricity Consumption per Person (‘relectricperperson’) was collected from the International Energy Agency. It gives the estimated residential electricity consumption per person in kilowatt-hours (kWh) for these countries in year 2008.
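
The two binning steps described above could look roughly like the sketch below. The toy values are made up, and the category labels are borrowed from the group names used in the Chi-square post; the exact arguments in the original notebook may differ.

```python
import pandas as pd

# Hypothetical toy values standing in for the GapMinder columns
df = pd.DataFrame({
    "incomeperperson": [200.0, 500.0, 1200.0, 2500.0, 5000.0, 9000.0, 20000.0, 40000.0],
    "urbanrate": [12.0, 25.0, 33.0, 47.0, 58.0, 66.0, 81.0, 95.0],
})

# Quartile split: four bins holding (roughly) equal numbers of observations
income_labels = ["low_income", "low-mid_income", "high-mid_income", "high_income"]
df["income_cat"] = pd.qcut(df["incomeperperson"], 4, labels=income_labels)

# Equal-width split: five bins spanning the observed range of urbanrate
df["urbanrate_cat"] = pd.cut(df["urbanrate"], bins=5)

print(df["income_cat"].value_counts())
print(df["urbanrate_cat"].value_counts().sort_index())
```

`qcut` balances the number of observations per bin, while `cut` balances the width of each bin, so the two variables end up categorized on different principles.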

#Data Description #Data Analysis #Coursera

Moderation between Variables for the GapMinder Dataset

Two questions are under consideration in this post.

1. Does ‘urbanrate’ moderate the relationship between ‘incomeperperson’ and ‘relectricperperson’?

2. Does ‘incomeperperson’ moderate the relationship between ‘urbanrate’ and ‘relectricperperson’?

The result from these questions gives insight into the relationship between our explanatory and response variables using a moderation variable.

Summary of Result

From the Pearson Correlation test results, we can see

‘incomeperperson’ cannot be used to moderate the relationship between ‘urbanrate’ and ‘relectricperperson’ [p-value > 0.05 for all categories of the moderator variable]

‘urbanrate’ moderates the relationship between ‘incomeperperson’ and ‘relectricperperson’ for

‘(10.31, 28.32]’: r is 0.9799, p-value is 2.4841e-08 (<<0.05)

‘(28.32, 46.24]’: r is 0.5497, p-value is 0.0098 (<0.05)

‘(46.24, 64.16]’: r is 0.8713, p-value is 9.8213e-12 (<<0.05)

‘(64.16, 82.08]’: r is 0.5357, p-value is 0.0003 (<0.05)

‘urbanrate’ does not moderate the relationship between ‘incomeperperson’ and ‘relectricperperson’ for category ‘(82.08, 100.0]’: r is 0.3869, but the p-value is 0.0919 (>0.05)

Outline of Code

I outlined the code for ‘urbanrate’ as the moderator variable; the same steps were taken for ‘incomeperperson’ as the moderator variable.

Cell 1: Import the modules and the dataset needed, and create a dataset with the variables under study.

Cell 2: Drop the NaN values and run the Pearson correlation test between the explanatory and the response variables.

For the pearson correlation test values, check Pearson Correlation for the GapMinder Dataset**

Pearson Correlation test between ‘incomeperperson’ and ‘relectricperperson’

Pearson Correlation test between ‘urbanrate’ and ‘relectricperperson’

From the results, ‘incomeperperson’ and ‘relectricperperson’ have a strong positive relation, while ‘urbanrate’ and ‘relectricperperson’ have a moderate positive relationship.

Cell 3: Categorize the moderator variable

Moderator variable: ‘urbanrate’

For more information on quantitative data categorization, check Data Management for the GapMinder Dataset**

Cell 4: Create subdatasets, each containing only one category of the moderator variable.

Cell 5: Run the Pearson correlation test between the explanatory variable and the response variable on each subdataset.

Pearson Correlation between ‘incomeperperson’ and ‘relectricperperson’ with ‘urbanrate’ as the moderator variable.

Pearson Correlation between ‘urbanrate’ and ‘relectricperperson’ with ‘incomeperperson’ as the moderator variable.

Cells 6, 7, 8: Graph of the response variable against the explanatory variable using the moderated subdatasets.

From the graphs, we can better visualize the relationship between ‘incomeperperson’ and ‘relectricperperson’ using ‘urbanrate’ as the moderator variable.
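
The cells outlined above can be sketched roughly as follows. This is an illustration with made-up numbers, not the original notebook: the toy data, the two-bin split and the `r_by_category` name are assumptions. It uses pandas' `Series.corr` to obtain Pearson's r only; a function such as `scipy.stats.pearsonr` would additionally return the p-value.

```python
import pandas as pd

# Cell 1: build a dataset with the variables under study (hypothetical toy values)
data = pd.DataFrame({
    "incomeperperson": [100.0, 200.0, 300.0, 400.0, 500.0, 1000.0, 2000.0, 3000.0, 4000.0],
    "urbanrate": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
    "relectricperperson": [10.0, 20.0, 30.0, 40.0, None, 400.0, 300.0, 200.0, 100.0],
})

# Cell 2: drop rows with missing values
data = data.dropna()

# Cell 3: categorize the moderator variable (two bins here to keep the toy example small)
data["urbanrate_cat"] = pd.cut(data["urbanrate"], bins=2)

# Cells 4 and 5: one subdataset per category, then Pearson's r within each
r_by_category = {}
for category, subset in data.groupby("urbanrate_cat", observed=True):
    r = subset["incomeperperson"].corr(subset["relectricperperson"])
    r_by_category[str(category)] = r

for category, r in r_by_category.items():
    print(category, round(r, 4))
```

In this toy data the association is positive in the low-urbanrate group and negative in the high-urbanrate group, which is the kind of difference across categories that a moderation check looks for.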

Reference:

Pearson Correlation for the GapMinder Dataset, https://dheedataanalysis.tumblr.com/post/62656685305634816/data-management-for-the-gapminder-dataset

Data Management for the GapMinder Dataset, https://dheedataanalysis.tumblr.com/post/62656685305634816/data-management-for-the-gapminder-dataset

#Causation #Data Analysis tools #Data Analysis #Coursera

Pearson Correlation for the GapMinder Dataset

Pearson correlation is used to determine only the linear association between two quantitative variables. The Pearson Correlation Coefficient r takes values in the range -1 to 1 [-1 ≤ r ≤ 1], with -1 denoting a perfect negative linear relationship, 0 denoting no linear relationship and 1 representing a perfect positive linear relationship.

The p-value calculated is significant if it is ≤ 0.05. Squaring the r value (r-square) tells us how much of the variability in the response variable can be accounted for by the explanatory variable. ‘relectricperperson’ is the response variable being considered, with ‘incomeperperson’ and ‘urbanrate’ as explanatory variables respectively.

Summary of Results

The Scatterplot between ‘incomeperperson’ and ‘relectricperperson’ shows a weak positive relationship with the observations clustered at ‘incomeperperson’ .*

The Scatterplot between ‘urbanrate’ and ‘relectricperperson’ shows a stronger positive relationship than the first scatterplot.*

The Pearson Correlation test gave these results

Between ‘incomeperperson’ and ‘relectricperperson’; r: 0.654, p-value: 5.647e-17 (<<0.05), r-square: 0.428. This means that there is a moderate positive relationship between these variables. About 42.8% of the variability in ‘relectricperperson’ can be accounted for by the observations in ‘incomeperperson’.

Between ‘urbanrate’ and ‘relectricperperson’; r: 0.487, p-value: 5.717e-09 (<<0.05), r-square: 0.237. It can be concluded that these variables have a moderate positive relationship. 23.7% of the variability in ‘relectricperperson’ can be accounted for by the observations in ‘urbanrate’.
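
As a small worked check of how r and r-square relate, here is a minimal sketch in plain Python. The `pearson_r` helper and the toy lists are my own assumptions for illustration; only the two r values at the end are taken from the results above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Toy data (hypothetical, not GapMinder values)
toy_r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
toy_r_square = toy_r ** 2

# r-square for the two correlations reported in this post
r_square_income = round(0.654 ** 2, 3)
r_square_urban = round(0.487 ** 2, 3)

print(toy_r, toy_r_square)
print(r_square_income, r_square_urban)
```

Rounded to three decimals, the squares come out at 0.428 and 0.237, matching the r-square values reported above.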


Chi-square Test of Independence for the GapMinder Dataset

Note: Chi-square Test of Independence is used for categorical explanatory and response variables. It measures how far the data are from the null hypothesis Ho.

The variables I am working with are quantitative, so Chi-square analysis is not strictly applicable. For the purpose of the project, however, I will use Chi-square to examine the relationship between the categorized variables ‘income_cat’ (explanatory variable) and ‘urbanrate_cat’ (response variable).

Hypothesis

Ho: There is no relationship between the two categorical variables

Ha: There is a relationship between the two categorical variables

Summary

Chi-square value: 129.539, p-value: 7.638e-22 (<<0.05), ‘urbanrate_cat’ and ‘income_cat’ are significantly associated.

For the Bonferroni Adjustment, the four income categories give six pairwise comparisons, so the adjusted threshold is 0.05/6 ≈ 0.008. If the p-value for a pair of groups is ≤ 0.008, we can reject the null hypothesis Ho for that pair.

‘low_income’ and ‘low-mid_income’ groups, Chi-square value: 27.852, p-value: 1.336e-05(<<0.008).

‘low_income’ and ‘high-mid_income’ groups, Chi-square value: 46.834, p-value: 1.651e-09(<<0.008).

‘low_income’ and ‘high_income’ groups, Chi-square value: 69.486, p-value: 2.914e-14(<<0.008).

‘low-mid_income’ and ‘high-mid_income’ groups, Chi-square value: 13.917, p-value: 0.008(=0.008).

‘low-mid_income’ and ‘high_income’ groups, Chi-square value: 39.644, p-value: 5.128e-08(<<0.008).

‘high-mid_income’ and ‘high_income’ groups, Chi-square value: 19.896, p-value: 0.001(<0.008).

Since all the p-values from the Bonferroni-adjusted comparisons are ≤ 0.008, it is safe to reject the null hypothesis Ho for all the pairs of groups. It is concluded that ‘urbanrate_cat’ and ‘income_cat’ are related.
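
The threshold used above can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic: the variable names are my own, and the p-values are the rounded figures listed in this post rather than freshly computed test results.

```python
from itertools import combinations

income_groups = ["low_income", "low-mid_income", "high-mid_income", "high_income"]

# Every unordered pair of groups is one post hoc comparison
pairs = list(combinations(income_groups, 2))

alpha = 0.05
adjusted_alpha = alpha / len(pairs)  # Bonferroni: divide alpha by the number of comparisons

# p-values as reported (rounded) in this post, in the same order as `pairs`
reported_p = dict(zip(pairs, [1.336e-05, 1.651e-09, 2.914e-14, 0.008, 5.128e-08, 0.001]))

for pair, p in reported_p.items():
    decision = "reject Ho" if p <= adjusted_alpha else "fail to reject Ho"
    print(pair, p, decision)

print(len(pairs), round(adjusted_alpha, 3))
```

With four groups there are six pairs, and 0.05 divided by six rounds to the 0.008 used in the comparisons above.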

#Chi-square #Data analysis #Coursera

Analysis of Variance ANOVA for the GapMinder Dataset

ANOVA is a statistical tool used for testing hypotheses between a categorical explanatory variable and a quantitative response variable. The variables under study are all quantitative, so the explanatory variables are categorized as published in the ‘Data Management for the GapMinder Dataset’ post. The ANOVA F test and Tukey’s Honestly Significant Difference test are applied in this post.

Summary

1. µ denotes the population mean, Ho denotes the null hypothesis, and Ha denotes the alternative hypothesis. When our p-value is less than or equal to 0.05, we can reject Ho and accept Ha.

2. For testing the hypothesis between the ‘income_cat’ variable and the ‘relectricperperson’ variable.

Ho: µ1 = µ2 = µ3 = µ4; Ha: not all the µ are equal

The F-statistic value is 40.45 and the p-value is 2.23e-18. We can reject the null hypothesis Ho, since our p-value is far less than 0.05.

3. For testing the hypothesis between ‘urbanrate_cat’ variable and the ‘relectricperperson’ variable.

Ho: µ1 = µ2 = µ3 = µ4 = µ5; Ha: not all the µ are equal

The F-statistic value is 8.277 and the p-value is 6.09e-06. We can reject the null hypothesis Ho, since the p-value is less than 0.05.

For more information on the results, check the Outline of Code section.
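
For readers who want to see where an F-statistic comes from, here is a minimal sketch of the one-way ANOVA calculation in plain Python. The `anova_f` helper and the three toy groups are assumptions for illustration and are unrelated to the GapMinder categories; the sketch computes only F, not the p-value.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy groups
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = anova_f(toy_groups)
print(f_stat)
```

A large F means the group means differ by much more than the variation inside the groups would suggest, which is what leads to a small p-value.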

#ANOVA #Data Analysis #Coursera
