Exploring Socioeconomic Indicators Worldwide: A Data Visualization Approach
Here, the Gapminder dataset is used to explore patterns in three major variables: income per person, life expectancy, and urban population rate. Through univariate and bivariate analyses, fundamental trends and relationships among these indicators are made intelligible both for researchers and general audiences.
Data Preparation and Cleaning
The analysis begins with importing and preparing the data. Three core variables, incomeperperson (GDP per capita), lifeexpectancy, and urbanrate, were selected and converted to numeric format. To ensure validity, only countries with complete data for all three metrics were included.
# Import libraries and read the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
data = pd.read_csv('gapminder.csv', low_memory=False)
# Convert selected columns to numeric; entries that cannot be parsed become NaN
for col in ['incomeperperson', 'lifeexpectancy', 'urbanrate']:
    data[col] = pd.to_numeric(data[col], errors='coerce')
# Subset and drop missing values (.copy() so new columns can be added later without warnings)
subset = data[['country', 'incomeperperson', 'lifeexpectancy', 'urbanrate']].dropna().copy()
print(f"Working with {len(subset)} countries after removing missing values")
print("Basic statistics:")
print(subset[['incomeperperson', 'lifeexpectancy', 'urbanrate']].describe())
Working with 176 countries after removing missing values
Basic statistics:
       incomeperperson  lifeexpectancy   urbanrate
count       176.000000      176.000000  176.000000
mean       7327.444414       69.654733   55.566364
std       10567.304022        9.729521   23.225708
min         103.775857       47.794000   10.400000
25%         702.366463       63.041500   36.685000
50%        2385.184105       73.126500   56.970000
75%        8497.779228       76.569500   73.465000
max       52301.587179       83.394000  100.000000
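To make the cleaning step concrete, here is a minimal sketch that applies the same pd.to_numeric(..., errors='coerce') and dropna() pattern to a small made-up table. The country labels and values are hypothetical, chosen only to show how blank or non-numeric entries become NaN and are then dropped.

```python
import pandas as pd

# Hypothetical toy data: a blank and a text marker stand in for missing values
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'incomeperperson': ['1200.5', ' ', '850', '4300'],
    'lifeexpectancy': ['71.2', '64.0', 'n/a', '79.9'],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
for col in ['incomeperperson', 'lifeexpectancy']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

# dropna() keeps only the rows in which every column has a value
toy_clean = toy.dropna()
print(toy_clean)
```

Only countries A and D survive, which mirrors how the full dataset is reduced to the countries with complete records.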
To visualize the economic landscape, a histogram was plotted for income per person across countries.
plt.figure(figsize=(12, 6))
plt.hist(subset['incomeperperson'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
plt.xlabel('Income Per Person (USD)', fontsize=12)
plt.ylabel('Number of Countries', fontsize=12)
plt.title('Distribution of Income Per Person Across Countries', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
This distribution is strongly right-skewed: most countries have relatively low GDP per capita, while a small minority reach much higher levels. The mean (about $7,300) is roughly three times the median (about $2,400). The shape of the histogram underscores the breadth of global inequality: the majority cluster below $10,000, while a handful exceed $30,000. Histograms suit this task because they show both the central tendency and the spread of a continuous variable.
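The skew can also be quantified rather than judged by eye. The sketch below first compares the mean and median reported in the summary statistics, then uses a hypothetical log-normal sample (a stand-in, not the Gapminder values) to show how a log10 transform reduces sample skewness. The sample parameters are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Summary values copied from the descriptive statistics reported earlier
mean_income = 7327.44
median_income = 2385.18
print(f"mean / median = {mean_income / median_income:.2f}")  # well above 1 suggests right skew

# Hypothetical stand-in sample (log-normal), NOT the Gapminder values
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=8.0, sigma=1.2, size=500))

skew_raw = sample.skew()            # sample skewness on the original scale
skew_log = np.log10(sample).skew()  # sample skewness after a log10 transform
print(f"skewness raw: {skew_raw:.2f}, log10: {skew_log:.2f}")
```

On the real data, the same two calls could be applied to subset['incomeperperson']; plotting the log-transformed values is one option if the low-income end of the histogram looks too compressed.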
A boxplot summarizes life expectancy, highlighting the range and potential outliers.
plt.figure(figsize=(12, 6))
plt.boxplot(subset['lifeexpectancy'], vert=False, widths=0.5, patch_artist=True,
            boxprops=dict(facecolor='lightcoral', alpha=0.7),
            medianprops=dict(color='darkred', linewidth=2),
            whiskerprops=dict(linewidth=1.5),
            capprops=dict(linewidth=1.5))
plt.xlabel('Life Expectancy (years)', fontsize=12)
plt.title('Distribution of Life Expectancy Across Countries', fontsize=14, fontweight='bold')
plt.yticks([])
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
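Life expectancy is moderately left-skewed rather than symmetric: the median (about 73 years) sits above the mean (about 70), and the middle half of countries falls between 63 and 77 years. The lowest values (below 50 years) reflect underlying differences in health, resources, and other country-specific challenges, although by the default 1.5 × IQR rule they still fall within the lower whisker rather than being flagged as outliers. Boxplots clearly delineate the median, quartiles, and any outliers, giving a compact view of the distribution.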
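By default, Matplotlib's boxplot extends each whisker to the most extreme point within 1.5 × IQR of the quartiles and marks anything beyond that as an outlier. A quick check of where those fences fall, using the quartiles and extremes from the summary statistics, can be sketched as:

```python
# Quartiles and extremes copied from the descriptive statistics table
q1, q3 = 63.0415, 76.5695
minimum, maximum = 47.794, 83.394

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # points below this would be drawn as outliers
upper_fence = q3 + 1.5 * iqr   # points above this would be drawn as outliers

print(f"IQR = {iqr:.3f}, fences = [{lower_fence:.2f}, {upper_fence:.2f}]")
print("min flagged as outlier:", minimum < lower_fence)
print("max flagged as outlier:", maximum > upper_fence)
```

With these values the lower fence sits near 42.7 years, so even the lowest observed life expectancy (47.8 years) lies inside the whisker.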
Urbanization was explored with another histogram.
plt.figure(figsize=(12, 6))
plt.hist(subset['urbanrate'], bins=25, color='forestgreen', edgecolor='black', alpha=0.7)
plt.xlabel('Urban Rate (% population urban)', fontsize=12)
plt.ylabel('Number of Countries', fontsize=12)
plt.title('Distribution of Urbanization Rate Across Countries', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Urban rate displays a fairly uniform spread, with a slight concentration around 50-60%. Countries range from highly rural to nearly entirely urbanized, indicating diverse development pathways. This visualization captures the variability in global urbanization and the absence of a single dominant pattern.
Income vs. Life Expectancy
A scatter plot elucidates the relationship between GDP per capita and life expectancy.
plt.figure(figsize=(12, 8))
plt.scatter(subset['incomeperperson'], subset['lifeexpectancy'],
            alpha=0.6, s=80, c='purple', edgecolors='black', linewidth=0.5)
plt.xlabel('Income Per Person (USD)', fontsize=12)
plt.ylabel('Life Expectancy (years)', fontsize=12)
plt.title('Relationship Between Income Per Person and Life Expectancy', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate correlation
corr = subset[['incomeperperson', 'lifeexpectancy']].corr().iloc[0, 1]
print(f"Correlation coefficient: {corr:.3f}")
The analysis reveals a moderately strong positive correlation (Pearson r = 0.60): as income rises, life expectancy increases, though the effect tapers at high income levels. Because Pearson's r measures linear association, it likely understates the strength of this curved relationship. The non-linear (roughly logarithmic) pattern suggests that basic economic resources have large impacts on health, but marginal gains diminish at greater wealth. Scatter plots are valuable here because they display both the strength and the shape of the association.
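When a relationship looks logarithmic, one common follow-up is to compare the Pearson correlation on the raw scale with the correlation after log-transforming income, and with a rank-based (Spearman) correlation. The sketch below does this on hypothetical data generated so that life expectancy depends on the log of income; the generating parameters are assumptions and the numbers it prints say nothing about the Gapminder results.

```python
import numpy as np
import pandas as pd

# Hypothetical data: life expectancy rises with the LOG of income, plus noise
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=8.0, sigma=1.2, size=200))
life = 40 + 4.5 * np.log(income) + rng.normal(0, 3, size=200)

r_raw = income.corr(life)                 # Pearson on raw income
r_log = np.log10(income).corr(life)       # Pearson on log10(income)
r_rank = income.rank().corr(life.rank())  # Spearman, computed as Pearson on ranks

print(f"Pearson (raw income):   {r_raw:.3f}")
print(f"Pearson (log10 income): {r_log:.3f}")
print(f"Spearman (ranks):       {r_rank:.3f}")
```

On the real data, the same comparison would use subset['incomeperperson'] and subset['lifeexpectancy']; a noticeably higher coefficient on the log scale would support the logarithmic reading of the scatter plot.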
Life Expectancy Across Income Quartiles
Countries were divided into quartiles by income, then life expectancy compared across groups using boxplots.
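subset['income_quartile'] = pd.qcut(subset['incomeperperson'], 4,
                                    labels=['Q1: Lowest', 'Q2: Low-Mid', 'Q3: Mid-High', 'Q4: Highest'])
# DataFrame.boxplot with by= creates its own figure, so the size is passed directly
# instead of calling plt.figure() first (which would leave an empty extra figure)
subset.boxplot(column='lifeexpectancy', by='income_quartile', patch_artist=True, grid=False, figsize=(12, 8))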
plt.suptitle('')
plt.xlabel('Income Quartile', fontsize=12, fontweight='bold')
plt.ylabel('Life Expectancy (years)', fontsize=12)
plt.title('Life Expectancy Distribution Across Income Quartiles', fontsize=14, fontweight='bold', pad=20)
plt.xticks(rotation=15)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Lower income quartiles show the greatest variability and the lowest median life expectancy. As income rises, the median and lower bounds increase, while the spread narrows markedly. The highest quartile features consistently high life expectancy (75-83 years), with minimal variation. This comparison highlights how closely economic disparity tracks health outcomes: wealthier nations have both higher and more consistent life expectancies.
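The quartile comparison can also be summarized numerically. The sketch below shows how pd.qcut assigns equal-sized groups and how a groupby can report the median and interquartile range per group. It runs on hypothetical generated data, and the iqr helper and column names are illustrative choices, so the printed values are not the article's results.

```python
import numpy as np
import pandas as pd

# Hypothetical data, NOT the Gapminder values
rng = np.random.default_rng(1)
df = pd.DataFrame({'income': rng.lognormal(mean=8.0, sigma=1.2, size=200)})
df['life'] = 40 + 4.5 * np.log(df['income']) + rng.normal(0, 3, size=200)

# qcut splits on quantiles, so each bin receives the same number of rows here
df['quartile'] = pd.qcut(df['income'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

def iqr(s):
    """Interquartile range: a simple measure of spread."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per quartile: group size, median life expectancy, and spread
summary = df.groupby('quartile', observed=True)['life'].agg(['count', 'median', iqr])
print(summary)
```

Applied to subset with the income_quartile column, a table like this would put numbers on the rising medians and narrowing spread that the boxplots show.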
Visualizing key socioeconomic indicators reveals pronounced patterns in global development:
Income per person is heavily concentrated at low values, with high inequality.
Life expectancy distributions reflect substantial differences rooted in health and wealth.
Urbanization is diverse, without a dominant global trend.
Country-level wealth correlates strongly with life expectancy, though increases plateau in richer countries.
Income quartile stratification shows that wealthier countries have both higher and more consistent life expectancy.
The approach combines best practices in data cleaning, visualization, and interpretation for clear, reproducible results. Code and commentary are provided for transparency and potential further exploration.