rm(list = ls()) # this line cleans your Global Environment.
setwd("/Users/lucas/Documents/UNU-CDO/courses/ml4p/ml4p-website-v2") # set your working directory
# do not forget to install neuralnet and scales, which are packages we haven't used before
library(tidyverse) # our favourite data wrangling package
library(neuralnet) # a package specific for neural networks
library(scales) # to control the appearance of axis and legend labels
library(skimr) # dataset summary
# import data
load("data/SHARE_DATA.rda")
## Notice that we're using the load() function because the dataset is in .rda format, the standard R data format
## Put it into an object with an easy name, then remove the original
z <- easySHARE_rel8_0_0
rm(easySHARE_rel8_0_0)
An Introduction to Neural Networks
What are neural networks?
In this session, Prof. Dr. Robin Cowan will give us an intuitive introduction to neural networks. He has also prepared an exercise using data from SHARE, the Survey of Health, Ageing and Retirement in Europe. Please watch his video-lesson to get the intuition behind neural network algorithms; you can then follow a policy-relevant application: predicting income-vulnerable older people in Europe.
Why would being able to predict what will make an older person struggle financially be policy-relevant?
This is a discussion point that you can explore. But you might want to investigate what the average old-age pension is in some European countries, and what the average cost of living is. After working for more than half of your life, I’m sure you’d like to live comfortably…
Practical Example
As always, start by opening the libraries that you’ll need to reproduce the script below.
Unfortunately, we are unable to share the dataset ourselves. However, if you wish to replicate this exercise at home (and use one of the many target variables that Robin has proposed to see how our model fares for those), you can request access to the dataset by creating an account with SHARE. You’ll need to specify this is for learning purposes, but you won’t be denied it.
You can explore the dataset now (and refer to the SHARE website if you have any questions about the variables).
names(z)
[1] "mergeid" "hhid" "coupleid"
[4] "wave" "wavepart" "int_version"
[7] "int_year" "int_month" "country"
[10] "country_mod" "language" "female"
[13] "dn002_mod" "dn003_mod" "dn004_mod"
[16] "age" "birth_country" "citizenship"
[19] "iv009_mod" "q34_re" "isced1997_r"
[22] "eduyears_mod" "mar_stat" "hhsize"
[25] "partnerinhh" "int_partner" "age_partner"
[28] "gender_partner" "mother_alive" "father_alive"
[31] "siblings_alive" "ch001_" "ch021_mod"
[34] "ch007_hh" "ch007_km" "sp002_mod"
[37] "sp003_1_mod" "sp003_2_mod" "sp003_3_mod"
[40] "sp008_" "sp009_1_mod" "sp009_2_mod"
[43] "sp009_3_mod" "books_age10" "maths_age10"
[46] "language_age10" "vaccinated" "childhood_health"
[49] "sphus" "chronic_mod" "casp"
[52] "euro1" "euro2" "euro3"
[55] "euro4" "euro5" "euro6"
[58] "euro7" "euro8" "euro9"
[61] "euro10" "euro11" "euro12"
[64] "eurod" "bfi10_extra_mod" "bfi10_agree_mod"
[67] "bfi10_consc_mod" "bfi10_neuro_mod" "bfi10_open_mod"
[70] "hc002_mod" "hc012_" "hc029_"
[73] "maxgrip" "adlwa" "adla"
[76] "iadla" "iadlza" "mobilityind"
[79] "lgmuscle" "grossmotor" "finemotor"
[82] "recall_1" "recall_2" "orienti"
[85] "numeracy_1" "numeracy_2" "bmi"
[88] "bmi2" "smoking" "ever_smoked"
[91] "br010_mod" "br015_" "ep005_"
[94] "ep009_mod" "ep011_mod" "ep013_mod"
[97] "ep026_mod" "ep036_mod" "co007_"
[100] "thinc_m" "income_pct_w1" "income_pct_w2"
[103] "income_pct_w4" "income_pct_w5" "income_pct_w6"
[106] "income_pct_w7" "income_pct_w8"
## and how big it is
dim(z)
[1] 412110 107
# ==== we can also use our trusted skimr package ==== #
# skim(z)
# =================================================== #
# Remember to take out the hashtag to run the command!
1. Data Preparation
Now we are going to clean up some things in the data to make it useful.
- Select a subset of the countries: Spain, France, Italy, Germany, the Netherlands, and Poland. These are identified in the data with numbers:
Spain 724; France 250; Italy 380; Germany 276; NL 528; Poland 616
countries <- c(724, 250, 380, 276, 528, 616)
In the dataset, negative numbers indicate some kind of missing data, so we will replace them with NA (R-speak for missing values).
We then select years since 2013 (let’s focus on the most recent cohorts)
Restrict our data to observations that have certain qualities: we want people who are retired (ep005_ == 1).
z1 <- z %>%
  filter(country_mod %in% countries) %>% # this line subsets the z dataset to only the countries we're interested in (defined in the line above)
  mutate(across(everything(), function(x){replace(x, which(x < 0), NA)})) %>% # this line replaces all values across the entire dataframe that are less than 0 with NA (missing)
  filter(int_year >= 2013) %>% # now we're subsetting the dataset to the years 2013 and after
  filter(ep005_ == 1) # and finally, keeping only retired people
At this point you should have decreased the number of observations by 366431 (new obs. = 45679). z1 now contains a cleaner version of the dataset (feel free to delete z).
PS. The symbols %>% are called pipe operators. They come from the magrittr package and are re-exported by dplyr, which is loaded as part of the tidyverse. They let you chain a sequence of operations on an object, just as above.
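If pipes are new to you, here is a minimal illustration with a toy vector (not part of the SHARE analysis): both lines compute the same thing, but the piped version reads left to right, one step at a time.
x <- c(1, 2, 3, 4)
mean(x^2)      # nested / "inside-out" style
x^2 %>% mean() # piped style: take x^2, then pass it to mean()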
Now let’s create some variables for our model
## Create the variable migrant
## change the nature of married to a dummy variable
## change the nature of our vaccination variable to zero or 1
z1 <- z1 %>%
  mutate(migrant = ifelse(birth_country == country, 0, 1)) %>%
  mutate(married = ifelse((mar_stat == 1 | mar_stat == 2), 1, 0)) %>%
  mutate(vaccinated = ifelse(vaccinated == 1, 1, 0))
At this point we should have 109 variables (because we created two new variables and rewrote one).
Select the variables we want in the analysis
To access the full survey with variable definitions, here’s a link to the PDF in English.
## get rid of extreme income values (people with very high incomes are not part of our population of interest: regular folks who need to save for retirement)
## and make our dependent variable (co007, which is whether the household struggles to make ends meet) a factor
z1 <- z1 %>%
  dplyr::select(female, age, married, hhsize, vaccinated, books_age10, maths_age10, language_age10, childhood_health, migrant, eduyears_mod, co007_, thinc_m, eurod, country_mod, iv009_mod) %>%
  filter(thinc_m < 100000) %>% # people earning above 100,000 are excluded
  mutate(co007_ = as.factor(co007_))
What is our target variable?
In the English Questionnaire of the SHARE dataset, the variable asks:
Thinking of your household’s total monthly income, would you say that your household is able to make ends meet…
- With great difficulty
- With some difficulty
- Fairly easily
- Easily
Let’s work with this variable to turn this into a classification problem.
## aggregate the income struggle variable into 2 categories and add it to our data
z1$co007_mod <- z1$co007_ # here we're just creating a duplicate of the co007_ variable under a different name
# it's usually a good idea to manipulate a duplicated variable in case you make a mistake and need to call on the original/untransformed data again
z1$co007_mod[z1$co007_ %in% c(1,2)] <- 1 # if the values in z1$co007_ are 1 or 2, recode them as 1 and store this in our new z1$co007_mod variable
z1$co007_mod[z1$co007_ %in% c(3,4)] <- 2 # if the values in z1$co007_ are 3 or 4, recode them as 2 and store this in our new z1$co007_mod variable
## redefine the factor so that it has only 2 levels
z1$co007_mod <- as.factor(as.integer(z1$co007_mod))
Now we have a variable that indicates whether a household struggles (1) or doesn’t struggle (2) to make ends meet.
A different dependent variable could simply be income. To make that sensible, we create income bands (or ‘bins’): the variable thinc_m records annual household income.
# we're creating quartiles (to which income quartile do you belong, given your annual income? the lowest? the highest?)
z1$inc_bin <- cut(z1$thinc_m, quantile(z1$thinc_m, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE))
We won’t work with the inc_bin (classification) variable, but it’s there if you wish to challenge yourself to create a neural network model for it.
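If you do take up that challenge, one small convenience: cut() also accepts a labels argument, so you can give the quartile bins readable names right away instead of relabelling them later (as we do below with levels()). A sketch of that variant:
# a sketch: the same quartile bins, but with readable labels from the start
z1$inc_bin <- cut(z1$thinc_m,
                  quantile(z1$thinc_m, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
                  labels = c("first", "second", "third", "fourth"))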
Cleaning missing values (recall ML needs a full dataset)
## get rid of any observation that contains NA
sum(is.na(z1))
[1] 140821
#You can get a glimpse of which variables have the most missing values with the skim() function
skim(z1)
Name | z1 |
Number of rows | 37286 |
Number of columns | 18 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 15 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
co007_ | 623 | 0.98 | FALSE | 4 | 4: 12700, 3: 12382, 2: 8843, 1: 2738 |
co007_mod | 623 | 0.98 | FALSE | 2 | 2: 25082, 1: 11581 |
inc_bin | 318 | 0.99 | FALSE | 4 | (2.: 9322, (1.: 9321, (3.: 9321, (0,: 9004 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
female | 0 | 1.00 | 0.47 | 0.50 | 0.0 | 0.00 | 0.00 | 1.00 | 1.0 | ▇▁▁▁▇ |
age | 2 | 1.00 | 73.32 | 7.86 | 42.2 | 67.20 | 72.30 | 78.80 | 102.3 | ▁▃▇▃▁ |
married | 208 | 0.99 | 0.71 | 0.45 | 0.0 | 0.00 | 1.00 | 1.00 | 1.0 | ▃▁▁▁▇ |
hhsize | 0 | 1.00 | 2.02 | 0.88 | 1.0 | 2.00 | 2.00 | 2.00 | 10.0 | ▇▁▁▁▁ |
vaccinated | 29307 | 0.21 | 0.93 | 0.25 | 0.0 | 1.00 | 1.00 | 1.00 | 1.0 | ▁▁▁▁▇ |
books_age10 | 24733 | 0.34 | 1.85 | 1.11 | 1.0 | 1.00 | 1.00 | 3.00 | 5.0 | ▇▃▂▁▁ |
maths_age10 | 25127 | 0.33 | 2.80 | 0.85 | 1.0 | 2.00 | 3.00 | 3.00 | 5.0 | ▁▃▇▂▁ |
language_age10 | 25161 | 0.33 | 2.81 | 0.81 | 1.0 | 2.00 | 3.00 | 3.00 | 5.0 | ▁▃▇▂▁ |
childhood_health | 29173 | 0.22 | 2.33 | 1.03 | 1.0 | 1.00 | 2.00 | 3.00 | 6.0 | ▇▅▁▁▁ |
migrant | 250 | 0.99 | 1.00 | 0.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.0 | ▁▁▇▁▁ |
eduyears_mod | 2542 | 0.93 | 10.21 | 4.44 | 0.0 | 7.00 | 10.00 | 13.00 | 25.0 | ▃▇▇▂▁ |
thinc_m | 0 | 1.00 | 25379.33 | 16076.56 | 0.0 | 14357.39 | 21731.12 | 32783.51 | 99829.1 | ▇▇▂▁▁ |
eurod | 1485 | 0.96 | 2.57 | 2.31 | 0.0 | 1.00 | 2.00 | 4.00 | 12.0 | ▇▃▂▁▁ |
country_mod | 0 | 1.00 | 434.20 | 184.53 | 250.0 | 276.00 | 380.00 | 616.00 | 724.0 | ▇▃▂▂▃ |
iv009_mod | 1269 | 0.97 | 3.68 | 1.34 | 1.0 | 3.00 | 4.00 | 5.00 | 5.0 | ▂▂▃▆▇ |
# we'll use the drop_na() function, which will delete any row if it has at least one missing value (be careful when doing this in your own data cleaning)
z2 <- drop_na(z1)
dim(z2)
[1] 6881 18
# z2 is a small subset of the original dataset which contains i) no missing values, ii) only relevant variables for our model on retirement, and iii) a more manageable dataset size
Rescaling data
## age, years of education and income (thinc_m) have a big range, so let's rescale them to between minus 1 and plus 1
## scaling allows us to compare data that aren't measured in the same way
z4 <- z2 %>%
  mutate(ScaledAge = rescale(age, to = c(-1, 1))) %>%
  mutate(EduYears = rescale(eduyears_mod, to = c(-1, 1))) %>%
  mutate(income = rescale(thinc_m, to = c(-1, 1)))
# z4 is now the working dataset, with 3 more (scaled) variables
## check what we have
summary(z4$ScaledAge)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00000 -0.16230 0.01222 0.04492 0.22862 1.00000
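Under the hood, rescale() is just a linear (min-max) transformation. As a quick sanity check, the manual version below (a sketch, using the z2 and z4 objects from above) should reproduce ScaledAge exactly:
# min-max scaling to [-1, 1] by hand; should match scales::rescale(age, to = c(-1, 1))
manual_scaled_age <- -1 + 2 * (z2$age - min(z2$age)) / (max(z2$age) - min(z2$age))
all.equal(manual_scaled_age, z4$ScaledAge)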
## check what variables we now have in the data
names(z4)
[1] "female" "age" "married" "hhsize"
[5] "vaccinated" "books_age10" "maths_age10" "language_age10"
[9] "childhood_health" "migrant" "eduyears_mod" "co007_"
[13] "thinc_m" "eurod" "country_mod" "iv009_mod"
[17] "co007_mod" "inc_bin" "ScaledAge" "EduYears"
[21] "income"
## let's look at the data just to see if there is anything observable at the start
## plot the first 100 observations
## we will use a pairs plot
plot(head(z4,100))
# you can use the zoom function of the image if you're replicating this script locally (that way you can read the variable names).
## look more closely at migrant and vaccination (the others seem to have a good spread)
table(z4$migrant)
1
6881
## there is only one value so no point in including it
## look at vaccination
table(z4$vaccinated)
0 1
367 6514
Select a subset of the data, including only relevant variables
z5 <- z4 %>%
  dplyr::select(female, married, hhsize, books_age10, maths_age10, language_age10, EduYears, eurod, country_mod, iv009_mod, inc_bin, co007_, co007_mod)
Notice that we create a new dataset every time we subset, instead of overwriting the old one. This is probably a good idea in case we need to take a step back.
To be able to work with the neuralnet package, it’s best to have dummy variables instead of one variable with various categories. So, let’s start that process:
## now change country to dummy variables
country <- as.factor(z5$country_mod)
cmat <- model.matrix(~ 0 + country)
The model.matrix() function takes a formula and a data frame (or similar structure) and returns a matrix with rows corresponding to cases and columns to predictor variables.
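To see what that means in practice, here is a tiny standalone example (toy data, not SHARE): a three-level factor becomes three 0/1 columns, and the ~ 0 + part removes the intercept so that every level gets its own dummy.
# toy example: one dummy column per factor level
f <- factor(c("ES", "FR", "ES", "IT"))
model.matrix(~ 0 + f)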
## add to z5
z5 <- cbind(z5, cmat)
head(z5)
female married hhsize books_age10 maths_age10 language_age10 EduYears
104800 1 1 2 1 3 3 -0.12
104803 0 1 2 1 2 3 -0.12
104808 1 1 2 1 3 3 0.12
104812 0 1 2 1 3 3 -0.04
104816 0 1 2 1 3 3 0.20
104852 0 0 1 3 4 3 -0.12
eurod country_mod iv009_mod inc_bin co007_ co007_mod
104800 5 276 5 (1.44e+04,2.17e+04] 2 1
104803 5 276 5 (1.44e+04,2.17e+04] 2 1
104808 3 276 5 (3.28e+04,9.98e+04] 4 2
104812 0 276 5 (3.28e+04,9.98e+04] 4 2
104816 4 276 4 (3.28e+04,9.98e+04] 4 2
104852 3 276 4 (1.44e+04,2.17e+04] 2 1
country250 country276 country380 country528 country724
104800 0 1 0 0 0
104803 0 1 0 0 0
104808 0 1 0 0 0
104812 0 1 0 0 0
104816 0 1 0 0 0
104852 0 1 0 0 0
class(z5)
[1] "data.frame"
To finalise the data preparation, let’s do some variable cleaning:
## fix level names of inc_bin
levels(z5$inc_bin) <- c("first","second","third","fourth")
names(z5)
[1] "female" "married" "hhsize" "books_age10"
[5] "maths_age10" "language_age10" "EduYears" "eurod"
[9] "country_mod" "iv009_mod" "inc_bin" "co007_"
[13] "co007_mod" "country250" "country276" "country380"
[17] "country528" "country724"
2. Model Preparation
This time around, we’re not using caret functions to split our data or to define our target and predictors. We’ll do this “manually”. The first thing we want to do is create an object of the form [target ~ x1 + x2 + … + xk]. This is how R reads formulas: the target variable sits on the left-hand side of the tilde (~) and the predictors on the right-hand side, separated by + signs.
Prepare model
## now make the formula we want to estimate
myform0 <- paste(names(z5)[c(1:8, 10, 14:18)], collapse = " + ")
# the object myform0 contains all the predictor variables for our neural network model
myform <- paste("co007_mod", c(myform0), sep = " ~ ")

all.equal(myform, myform0) # returns one mismatch
[1] "1 string mismatch"
# myform adds co007_mod (our target!) in front of the ~, so it reads: co007_mod ~ all the predictors stored in myform0
## look at the formula to make sure we got what we wanted
print(myform)
[1] "co007_mod ~ female + married + hhsize + books_age10 + maths_age10 + language_age10 + EduYears + eurod + iv009_mod + country250 + country276 + country380 + country528 + country724"
Data Split: train and test
## set the random seed so we can duplicate things if we want to
set.seed(4)
# we're doing this manually, instead of using our trusted caret package
trainRows <- sample(1:nrow(z5), 0.8 * nrow(z5)) # 80% of the data goes to training
testRows <- (1:nrow(z5))[-trainRows]

## now we have training data: trainz5; and testing data: testz5
trainz5 <- z5[trainRows, ]
testz5 <- z5[testRows, ]
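Because we split at random rather than stratifying by the target, it’s worth a quick sanity check (optional, and not required for anything below) that the two classes appear in similar proportions in the training and test sets:
# quick check: the share of strugglers should be roughly similar in both sets
prop.table(table(trainz5$co007_mod))
prop.table(table(testz5$co007_mod))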
Train our neural network model!
set.seed(4)
model <- neuralnet(
  myform,               ## use the formula we defined above
  data = trainz5,       ## tell it what data to use
  hidden = c(6),        ## define the number and size of hidden layers: here we have one layer with 6 nodes in it
  linear.output = F,    ## F because this is a classification problem (our target is a factor); T returns a linear regression output. This also means that the (default) activation function is the sigmoid!
  stepmax = 1000000,    ## the maximum number of iterations used to train it (1 million, but it converges before that mark)
  lifesign = "full",    ## get some output while it works
  algorithm = "rprop+", ## a gradient descent algorithm: "resilient propagation"
  learningrate.limit = NULL,
  learningrate.factor = list(minus = 0.5, plus = 1.2),
  threshold = 0.01
)
hidden: 6 thresh: 0.01 rep: 1/1 steps: 1000 min thresh: 1.22795192815468
2000 min thresh: 0.408629618104168
3000 min thresh: 0.255375732991587
4000 min thresh: 0.255375732991587
5000 min thresh: 0.206075288342767
6000 min thresh: 0.181738092119449
7000 min thresh: 0.142443872190076
8000 min thresh: 0.108056969567729
9000 min thresh: 0.0965760103945601
10000 min thresh: 0.0924441234865121
11000 min thresh: 0.0924441234865121
12000 min thresh: 0.0924441234865121
13000 min thresh: 0.0903834987550307
14000 min thresh: 0.0903834987550307
15000 min thresh: 0.0903834987550307
16000 min thresh: 0.0903834987550307
17000 min thresh: 0.0903834987550307
18000 min thresh: 0.0903834987550307
19000 min thresh: 0.0903834987550307
20000 min thresh: 0.0903834987550307
21000 min thresh: 0.0903834987550307
22000 min thresh: 0.0903834987550307
23000 min thresh: 0.0903834987550307
24000 min thresh: 0.0903834987550307
25000 min thresh: 0.0903834987550307
26000 min thresh: 0.0903834987550307
27000 min thresh: 0.0903834987550307
28000 min thresh: 0.0903834987550307
29000 min thresh: 0.0903834987550307
30000 min thresh: 0.0903834987550307
31000 min thresh: 0.0903834987550307
32000 min thresh: 0.0903834987550307
33000 min thresh: 0.0903834987550307
34000 min thresh: 0.0903834987550307
35000 min thresh: 0.0903834987550307
36000 min thresh: 0.0903834987550307
37000 min thresh: 0.0903834987550307
38000 min thresh: 0.0903834987550307
39000 min thresh: 0.080031973683847
40000 min thresh: 0.080031973683847
41000 min thresh: 0.080031973683847
42000 min thresh: 0.0781654073237371
43000 min thresh: 0.077217534035844
44000 min thresh: 0.077217534035844
45000 min thresh: 0.0663306621049722
46000 min thresh: 0.0663306621049722
47000 min thresh: 0.0663306621049722
48000 min thresh: 0.0663306621049722
49000 min thresh: 0.0640628426880909
50000 min thresh: 0.0549832161036877
51000 min thresh: 0.0549832161036877
52000 min thresh: 0.0549832161036877
53000 min thresh: 0.0535065197695844
54000 min thresh: 0.0451843906210169
55000 min thresh: 0.0407706834064874
56000 min thresh: 0.0407706834064874
57000 min thresh: 0.0407706834064874
58000 min thresh: 0.0407706834064874
59000 min thresh: 0.0407706834064874
60000 min thresh: 0.0398750949182902
61000 min thresh: 0.0347740947416519
62000 min thresh: 0.0347740947416519
63000 min thresh: 0.0347740947416519
64000 min thresh: 0.0333526116102184
65000 min thresh: 0.0333526116102184
66000 min thresh: 0.0333526116102184
67000 min thresh: 0.0333526116102184
68000 min thresh: 0.0333526116102184
69000 min thresh: 0.0316000239700686
70000 min thresh: 0.0306565221261171
71000 min thresh: 0.0274365035081705
72000 min thresh: 0.0274365035081705
73000 min thresh: 0.026496818727535
74000 min thresh: 0.026496818727535
75000 min thresh: 0.026496818727535
76000 min thresh: 0.026496818727535
77000 min thresh: 0.026496818727535
78000 min thresh: 0.026496818727535
79000 min thresh: 0.026496818727535
80000 min thresh: 0.026496818727535
81000 min thresh: 0.0259845739797748
82000 min thresh: 0.0259845739797748
83000 min thresh: 0.0259845739797748
84000 min thresh: 0.0240579990735271
85000 min thresh: 0.0226252873882434
86000 min thresh: 0.0226252873882434
87000 min thresh: 0.0221455597930601
88000 min thresh: 0.0221455597930601
89000 min thresh: 0.0221455597930601
90000 min thresh: 0.0221455597930601
91000 min thresh: 0.0203603362622581
92000 min thresh: 0.0200746219013495
93000 min thresh: 0.0200746219013495
94000 min thresh: 0.0200746219013495
95000 min thresh: 0.0200746219013495
96000 min thresh: 0.0200746219013495
97000 min thresh: 0.0163554171793395
98000 min thresh: 0.0163554171793395
99000 min thresh: 0.0163554171793395
1e+05 min thresh: 0.0151167048702273
101000 min thresh: 0.0151167048702273
102000 min thresh: 0.0151167048702273
103000 min thresh: 0.0151167048702273
104000 min thresh: 0.0151167048702273
105000 min thresh: 0.0151167048702273
106000 min thresh: 0.0151167048702273
107000 min thresh: 0.0151167048702273
108000 min thresh: 0.0151167048702273
109000 min thresh: 0.0142885225751462
110000 min thresh: 0.0142885225751462
111000 min thresh: 0.0142885225751462
112000 min thresh: 0.0142885225751462
113000 min thresh: 0.0142885225751462
114000 min thresh: 0.0142885225751462
115000 min thresh: 0.0142455799962317
116000 min thresh: 0.0142455799962317
117000 min thresh: 0.0142455799962317
118000 min thresh: 0.012107497406225
119000 min thresh: 0.012107497406225
120000 min thresh: 0.012107497406225
121000 min thresh: 0.012107497406225
122000 min thresh: 0.012107497406225
123000 min thresh: 0.0108934985445289
124000 min thresh: 0.0108934985445289
125000 min thresh: 0.0108934985445289
126000 min thresh: 0.0108934985445289
127000 min thresh: 0.0108934985445289
128000 min thresh: 0.0108934985445289
129000 min thresh: 0.0108934985445289
130000 min thresh: 0.0108934985445289
131000 min thresh: 0.0108934985445289
132000 min thresh: 0.0108934985445289
133000 min thresh: 0.0108934985445289
134000 min thresh: 0.0108934985445289
135000 min thresh: 0.0108934985445289
136000 min thresh: 0.0108934985445289
137000 min thresh: 0.0101758648461206
138000 min thresh: 0.0101758648461206
139000 min thresh: 0.0101758648461206
140000 min thresh: 0.0101758648461206
141000 min thresh: 0.0101758648461206
142000 min thresh: 0.0101758648461206
143000 min thresh: 0.0101758648461206
143866 error: 944.06645 time: 5.7 mins
# if the output ends with a line like "error: ...", don't worry: it isn't an error message. It reports the final error value and the step at which training converged (well before the 1000000 maximum)
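Before plotting, you can also look at a numerical summary of the fit: the fitted object stores the final error, the number of steps and every estimated weight in its result.matrix (the exact numbers will depend on your run).
# a quick peek at the fitted object: error, reached threshold, steps, then one row per weight
head(model$result.matrix)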
Now, let’s plot our neural network:
plot(model)
# notice that this is a vanilla network (no deep learning for us!).
Now it is time to test our model’s predictive abilities.
# use our fitted neural network to predict the income-struggle status of the people in our test data
set.seed(4)
pred <- predict(model, testz5)
As before, we will not use the caret package’s confusionMatrix() function; we’ll do all of it manually. We’ll have to manipulate the variables a little to visualise the confusion matrix in a helpful way:
# store the levels of our target variable to use as labels
myLabels <- levels(testz5$co007_mod)
pointPred <- max.col(pred) # find, for each observation, the column index of the maximum predicted value
pointPred <- myLabels[pointPred] # map those indices back to the factor labels

# Now, we'll store the actual/observed values in an object as well:
actual <- testz5$co007_mod
actual <- as.factor(as.integer(actual))

# Create the confusion matrix from the observed and predicted values
t1 <- table(actual, pointPred)
# voilà! manual confusion matrix!
print(t1)
pointPred
actual 1 2
1 134 259
2 114 870
Now that we have a confusion matrix, we can analyse the performance of our neural network model.
# How many older people struggle to make ends meet?
prop.table(table(testz5$co007_mod))
1 2
0.2854031 0.7145969
# about 28%...
# How accurate is our model?
sum(diag(t1))/sum(t1)
[1] 0.7291213
# this returns the proportion of correct predictions!
# about 73% of income statuses correctly predicted
# so, our neural network model does relatively well at predicting whether older people / pensioners struggle to make ends meet in the selected European countries
If we recall from previous sessions, it is hard to predict accurately the cases we have fewer of (struggling older people)… look again at the confusion matrix: what do we see? The share of correct predictions for non-strugglers (cell [2,2]) is higher than for the strugglers (cell [1,1]).
Let’s remember what information we can obtain from a confusion matrix:
True Positives (TP): 134 (actual = 1, predicted = 1)
False Negatives (FN): 259 (actual = 1, predicted = 2)
False Positives (FP): 114 (actual = 2, predicted = 1)
True Negatives (TN): 870 (actual = 2, predicted = 2)
Based on these confusion matrix values (and the formulas provided in session 2: logistic classification), we can get our neural network model’s performance metrics:
Accuracy: 72.9%. (we got this above!). It is the proportion of true results regardless of the case.
Recall (Sensitivity): 34.09% (we might want to know this, since we’re trying to identify vulnerable elderly people). It is the proportion of correctly identified vulnerable cases. The formula (if you want to check yourself) is TP/(TP+FN) = 134/(134+259) = 0.340
Alternatively…
# select the needed values from the confusion matrix t1
(t1[1,1])/(t1[1,1]+t1[1,2])
[1] 0.3409669
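Using the same table, you can also compute precision (how many of the people we flagged as strugglers really are strugglers) and specificity (how well we identify the non-strugglers); a sketch using the standard formulas:
# precision: TP / (TP + FP) = 134 / (134 + 114)
t1[1,1] / (t1[1,1] + t1[2,1])
# specificity: TN / (TN + FP) = 870 / (870 + 114)
t1[2,2] / (t1[2,2] + t1[2,1])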
#==== Python version: 3.10.12 ====#
# Opening libraries
# scikit-learn
import sklearn as sk # our trusted Machine Learning library
from sklearn.model_selection import train_test_split # split the dataset into train and test
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay # returns performance evaluation metrics
from sklearn.neural_network import MLPClassifier
# non-ML libraries
import numpy as np # a library for numeric analysis
import random # for random state
import csv # a library to read and write csv files
import pandas as pd # a library to help us easily navigate and manipulate dataframes
import seaborn as sns # a data visualisation library
import matplotlib.pyplot as plt # a data visualisation library
from scipy.stats import randint # generate random integer
from graphviz import Digraph
# Uploading data
SHARE = pd.read_csv('/Users/lucas/Documents/UNU-CDO/courses/ml4p/ml4p-website-v2/data/SHARE_subset.csv')
1. Overview of the data
Let’s take a quick look at what the dataset looks like.
# let's start with the general dimensions and variable names
print(SHARE.shape)
(6881, 19)
# and variable names (comma separated)
print(', '.join(SHARE.columns))
Unnamed: 0, female, married, hhsize, books_age10, maths_age10, language_age10, EduYears, eurod, country_mod, iv009_mod, inc_bin, co007_, co007_mod, country250, country276, country380, country528, country724
# finally, missing values (best be sure!)
SHARE.isnull().sum() # this sums all missing values per column/variable
Unnamed: 0 0
female 0
married 0
hhsize 0
books_age10 0
maths_age10 0
language_age10 0
EduYears 0
eurod 0
country_mod 0
iv009_mod 0
inc_bin 0
co007_ 0
co007_mod 0
country250 0
country276 0
country380 0
country528 0
country724 0
dtype: int64
We have \(19\) variables, \(6,881\) observations and no missing values in the dataset. Let’s tackle the variables now: Unnamed: 0 is an ID tag, nothing to worry about. Then, we’ve got some traditional variables in socioeconomic analysis: female (or gender), married (or marital status), hhsize (household size), and a few indicators about the person’s educational past: books, maths and language at age 10. These questions ask about a person’s reading, maths and language performance when they were 10 years old (arguably a good proxy for how much education mattered in the household, and thus a predictor of future employment… but take this statement with a grain of very salty salt). EduYears directly asks how many years a person studied, formally. The next variables are quite interesting: eurod is a depression scale, country_mod indicates where in Europe the person lives (Spain 724; France 250; Italy 380; Germany 276; NL 528; Poland 616); similarly, the country* variables are dummies generated from country_mod. And finally, the target variable(s): inc_bin was created by Robin, and it indicates whether a person is in the first, second, third, or fourth income quartile. We will work with co007_ and co007_mod (which was generated from the former). Both variables refer to income struggles:
In the English Questionnaire of the SHARE dataset, the co007_ variable asks:
Thinking of your household’s total monthly income, would you say that your household is able to make ends meet…
- With great difficulty
- With some difficulty
- Fairly easily
- Easily
Which Robin then recoded into a binary variable that takes the value 1 if the person struggles to make ends meet and 2 otherwise (co007_mod). By doing so, he has turned the task into a simple classification problem for our neural network.
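If you are rebuilding this subset yourself instead of loading Robin’s file, the same recode is a one-liner in pandas; a sketch, assuming a column co007_ holding the original 1-4 answer codes:
# a sketch: collapse the 4 answer categories into the binary co007_mod (1 = struggles, 2 = doesn't)
SHARE['co007_mod'] = SHARE['co007_'].map({1: 1, 2: 1, 3: 2, 4: 2})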
2. Data Split and Fit
# let's quickly look at the location of the relevant variables
SHARE.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6881 entries, 0 to 6880
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 6881 non-null int64
1 female 6881 non-null int64
2 married 6881 non-null int64
3 hhsize 6881 non-null int64
4 books_age10 6881 non-null int64
5 maths_age10 6881 non-null int64
6 language_age10 6881 non-null int64
7 EduYears 6881 non-null float64
8 eurod 6881 non-null int64
9 country_mod 6881 non-null int64
10 iv009_mod 6881 non-null int64
11 inc_bin 6881 non-null object
12 co007_ 6881 non-null int64
13 co007_mod 6881 non-null int64
14 country250 6881 non-null int64
15 country276 6881 non-null int64
16 country380 6881 non-null int64
17 country528 6881 non-null int64
18 country724 6881 non-null int64
dtypes: float64(1), int64(17), object(1)
memory usage: 1021.5+ KB
# X = female + married + hhsize + books_age10 + maths_age10 + language_age10 + EduYears + eurod + iv009_mod + country250 + country276 + country380 + country528 + country724
# Y = co007_mod
X = SHARE.iloc[:, list(range(1, 9)) + [10] + list(range(14, 19))] # since our range is discontinuous, we use a combination of list() and range() to pick which columns of the larger dataset go into X. Note that range() has a start (1 here, i.e. the second position) and an end (9 here, i.e. the tenth position); the start is inclusive but the end is exclusive, so range(1, 9) captures columns 1 through 8. That's also why range(14, 19) works even though the last column index is 18 ;)
y = SHARE.iloc[:, 13] # y is a vector containing our target variable, which is in position 13 of the dataframe
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345) # random_state is for reproducibility purposes
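Since strugglers are the minority class, you could also ask train_test_split to preserve the class proportions via its stratify argument. We keep the unstratified split above so the numbers below stay the same, but the alternative would look like this:
# optional alternative (not used below): a stratified split keeps the share of strugglers equal in train and test
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345, stratify=y)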
Now, let’s fit a Neural Network model:
# Initialise the MLPClassifier (or Multilayer Perceptron Classifier).
# Despite the name indicating this type of neural network is used for deep learning, it can also be used for simpler classification "vanilla network" tasks
nn_mlp = MLPClassifier(hidden_layer_sizes=(6,), max_iter=100000, activation='logistic', solver='sgd', random_state=12345) # the solver we're using here is stochastic gradient descent. This is not what we used in the R practical (rprop+ has no equivalent in scikit-learn), so if you peek over there, you might find some differences :)
# Train the model
nn_mlp.fit(X_train, y_train)
MLPClassifier(activation='logistic', hidden_layer_sizes=(6,), max_iter=100000, random_state=12345, solver='sgd')
# Predictions on the test dataset
# Something I haven't mentioned before: the predictions are deterministic, given the random_state set when the model was initialised
pred = nn_mlp.predict(X_test)
# Evaluation of overall the model
Accuracy = accuracy_score(y_test, pred)
print(f"Overall Accuracy of our model: {Accuracy}, not bad!")
Overall Accuracy of our model: 0.7080610021786492, not bad!
It seems like we’re relatively good at predicting wether an older European citizen struggles economically. However, let’s explore our model a little more. We’ll start by taking a look at the proportions of strugglers and non-strugglers (our target variable):
# recall that 1 is struggle and 2 is no struggle
SHARE['co007_mod'].value_counts(normalize=True) # when you set normalize=True, you get proportions; False gives you the counts
co007_mod
2 0.702514
1 0.297486
Name: proportion, dtype: float64
So, about 30% struggle and 70% do not. If we recall from previous sessions, it is hard to predict accurately the cases we have fewer of… let’s plot a confusion matrix and evaluate the recall (sensitivity) of our model to see how good we are at predicting strugglers.
# visualising a confusion matrix
cm = confusion_matrix(y_test, pred)
print("Confusion Matrix:", cm)
Confusion Matrix: [[ 42 381]
[ 21 933]]
ConfusionMatrixDisplay(confusion_matrix=cm).plot() # create the confusion matrix plot
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x30fe9faa0>
plt.show() # display the confusion matrix plot created above
So, the labels have switched here: class \(2\) is now shown as \(1\) and class \(1\) as \(0\). Let’s take a trip down memory lane and remember the values that a confusion matrix provides:
True Positives (TP): 42 (actual = 0, predicted = 0)
False Negatives (FN): 381 (actual = 0, predicted = 1)
False Positives (FP): 21 (actual = 1, predicted = 0)
True Negatives (TN): 933 (actual = 1, predicted = 1)
Based on these confusion matrix values (and the formulas provided in session 2: logistic classification), we can get our neural network model’s performance metrics.
Accuracy: 70.8% (we got this above!). It is the proportion of true results regardless of the case.
Recall (Sensitivity): 9.93% (we might want to know this, since we’re trying to identify vulnerable older people). It is the proportion of correctly identified vulnerable cases. The formula (if you want to check yourself) is TP/(TP+FN) = 42/(42+381) = 0.099.
Alternatively:
print("Recall:", recall_score(y_test, pred))
Recall: 0.09929078014184398
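scikit-learn can also print precision, recall and F1 for both classes in one go, which makes the imbalance problem very visible; and if the low recall bothers you, a common next step (not pursued here) is to reweight or resample the minority class before refitting.
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))  # per-class precision, recall and F1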
Conclusion
How did we do? Neural networks are all the rage these days; they are arguably the most famous family of machine learning algorithms. But don’t be fooled: they are still subject to the same data challenges as the rest of the algorithms we have explored so far.
Readings
Optional Readings
- Chatsiou and Mikhaylov (2020). Deep Learning for Political Science. Arxiv preprint.