In the world of artificial intelligence, understanding the fundamentals of neural networks is essential. One of the simplest yet most powerful architectures is the Artificial Neural Network (ANN) with a single hidden layer, often referred to as a “plain vanilla” network. This blog post aims to provide a comprehensive guide to building such a network from scratch, focusing on key concepts like activation functions, weights, and the backpropagation algorithm, without hiding anything inside library black boxes.
Artificial Neural Networks (ANNs) are the simplest deep learning architectures that mimic the human brain’s neural network. The simplest form of ANN is a network with a single hidden layer, also known as a “plain vanilla” network. This network consists of an input layer, a hidden layer, and an output layer. The hidden layer transforms the input data into a new set of features, which are then used to predict the response variable.
In the image above, the predictors are fed into the hidden layer, which transforms them into a new set of features, which are in turn used to predict the response variable.
rm(list=ls())
suppressMessages(library(tidyverse))
theme_set(theme_minimal())
Let’s create a synthetic dataset to demonstrate the construction of a plain vanilla network. We will generate a dataset with 60 observations, a predictor x and a response y, using the following steps:
Predictor x as a Uniform distributed variable ranging between [-2, 2].
Response variable y as a function of the predictor.
# A tibble: 6 × 2
y x
<dbl> <dbl>
1 0.859 -0.769
2 0.591 -0.969
3 0.152 0.209
4 -0.829 -1.77
5 0.733 -0.126
6 0.645 -0.0649
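The data-generating code is not shown in this excerpt; here is a minimal sketch that produces a dataset of this shape. The seed, the response function sin(x), and the noise level are assumptions for illustration, not necessarily the post’s actual choices:

```r
set.seed(123)                      # hypothetical seed for reproducibility
n <- 60                            # 60 observations, as stated above
x <- runif(n, min = -2, max = 2)   # predictor: Uniform on [-2, 2]
# Hypothetical response: a smooth non-linear function of x plus noise
y <- sin(x) + rnorm(n, mean = 0, sd = 0.2)
data <- data.frame(y = y, x = x)   # the post stores this as a tibble
head(data)
```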
Let’s visualize our synthetic data:
data %>%
ggplot(aes(x, y)) +
geom_line() +
geom_point() +
geom_hline(yintercept=0,
linetype="dashed", color="grey")
Now that we have the data, we attempt to replicate the distribution of this data with a single-hidden-layer artificial neural network. The model will have the following parameters:
Model formula:

$$ f(x) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(x), \qquad h_k(x) = g(w_{k0} + w_{k1} x) $$

where $h_k$ is the hidden layer feature, $K$ the number of activations, $\beta_k$ the coefficients, and $w_{kj}$ the weights.
The hidden layer computes $K$ activations.
Initialize the number of hidden neurons, which is the number of activations in the hidden layer:
hidden_neurons = 5
The number of hidden neurons is a hyperparameter that needs to be tuned. The more neurons, the more complex the model, but it can also lead to overfitting. You can think of a neuron as a connection between the input and output layers. The more neurons, the more connections, and the more complex the model.
We have set 5 hidden neurons in this example. This means that the hidden layer will compute 5 different linear combinations of the input x. Each linear combination is then squashed through an activation function to transform it.
The activation function is the function that takes the input and produces an output, the activation. It is a non-linear transformation applied to the input layer while the network learns during training. It decides whether a neuron should be activated or not by calculating the weighted sum and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
The sigmoid is the function used in logistic regression to convert a linear function into probabilities between zero and one.
To be more explicit, the activation function takes the input signal and generates an output signal while taking into account a threshold, meaning that the neuron is activated only if the signal is above that threshold.
We will compare a few different activation functions, and we will use the sigmoid function as the activation function in this example.
There are several types of activation functions, each with its own characteristics, but all have in common that they introduce non-linearity into the first level of output produced.
Some of the most common types of activation functions are:
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
ReLU (Rectified Linear Unit) function: $g(z) = \max(0, z)$
SoftPlus function: $g(z) = \log(1 + e^{z})$. This is a smooth approximation to the ReLU function. First introduced in 2001, SoftPlus is an alternative to traditional functions because it is differentiable and its derivative is easy to demonstrate (see source: https://sefiks.com/2017/08/11/softplus-as-a-neural-networks-activation-function/).
Let’s compare the activation functions:
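The helper functions sigmoid(), relu(), and softplus() used in the chunks below are not defined in this excerpt; their standard definitions, which the plotting code appears to assume, are:

```r
# Sigmoid: squashes any real input into (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

# ReLU: zero for negative inputs, identity for positive inputs
relu <- function(x) pmax(0, x)

# SoftPlus: smooth, differentiable approximation to ReLU
softplus <- function(x) log(1 + exp(x))
```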
data %>%
mutate(z=sigmoid(x)) %>%
ggplot() +
geom_line(aes(x, z)) +
ylim(0, 1)
data %>%
ggplot() +
geom_line(aes(x, relu(x) * 1/2.4))
data %>%
ggplot() +
geom_line(aes(x, softplus(x)))
And now look at how the Sigmoid differs from the ReLU function:
Our model is a linear model in the derived features:
As you might have noticed in the formula above, the model is a linear combination of the input and the weights, which are adjusted during the training process. The activation function is applied to the linear combination of the input and weights to transform the output.
Let’s have a look at the weights and how they are initialized.
The weights are the parameters of the model that are adjusted during the training process. They can be considered as the coefficients of the hidden layer model.
They are initialized randomly, and the model is trained to adjust these weights during the training process. The weights are adjusted using the backpropagation algorithm, which computes the gradient of the loss function with respect to the weights. Then, the weights are updated using the gradient descent algorithm. We will see how this is done in the next section.
The weights are initialized randomly to break the symmetry and prevent the model from getting stuck in a local minimum. In this case we use a normal distribution with a mean of 0 and a standard deviation of 1 to initialize the weights.
Randomly initializing the weights as i.i.d. N(0, 1):
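The initialization code is not shown here; a sketch consistent with the matrix dimensions assumed by the backpropagation code later in the post (w1 holds a bias and an input weight per hidden neuron, w2 holds the output bias plus one coefficient per neuron) could be:

```r
set.seed(42)            # hypothetical seed
hidden_neurons <- 5
# w1: hidden_neurons x 2, i.i.d. N(0, 1); column 1 = bias, column 2 = input weight
w1 <- matrix(rnorm(hidden_neurons * 2), nrow = hidden_neurons, ncol = 2)
# w2: 1 x (hidden_neurons + 1); first entry = output bias
w2 <- matrix(rnorm(hidden_neurons + 1), nrow = 1)
```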
The constant (bias) term shifts the inflection point of the activation and, combined with the non-linear squashing, turns the linear combination into a non-linear feature. The weights are adjusted during the training process to minimize the error between the predicted and actual values.
The model derives five new features by computing five different linear combinations of x, and then squashes each through an activation function to transform it.
data %>%
ggplot(aes(x, y)) +
geom_point(shape=21,
stroke=0.5,
fill="grey",
color="grey20") +
geom_line(linewidth=0.2) +
geom_smooth(method = "lm",
color="steelblue",
se=F) +
geom_line(aes(x, sigmoid(y)),
linetype="dashed",
color="steelblue")
The term feedforward describes the process of moving the input data through the network to obtain the predicted output. The feedforward pass is the first step in the training process of the neural network.
Here is a function that computes the output of the model given the inputs: data, weights, and number of activations. It computes the output by multiplying the input data by the weights and applying the activation function to the result. This is a matrix multiplication (%*%), a core operation in neural networks.
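The feedForward() function itself is not reproduced in this excerpt; a reconstruction consistent with how it is called later (feedForward(x, w1, w2, activation)) and with the matrix algebra in backPropagation() would be:

```r
# For each observation x[i]:
#   a1 = w1 %*% (1, x[i])'            -> hidden pre-activations
#   a2 = w2 %*% (1, activation(a1))'  -> network output
feedForward <- function(x, w1, w2, activation) {
  preds <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    a2 <- w2 %*% matrix(rbind(1, activation(a1)), ncol = 1)
    preds[i] <- a2
  }
  preds
}
```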
Now that we have the feedforward function, we need to compute the derivative of the activation function: the backpropagation algorithm repeatedly multiplies by the derivative of the activation function, following the chain rule.
Here is a recap of the derivative definition, which is applied every time the output of an activation function is met in the network; this is how a new minimum is searched for. It will become clearer towards the end of the post.
So, it is fundamental to define the derivative of the activation function needed for computing the gradient. For this example, we will use the derivative of the sigmoid function, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
derivativeActivation <- function(x) {
  g <- sigmoid(x) * (1 - sigmoid(x))
  return(g)
}
The function for computing the model error is the sum of squared errors (SSE) between the predicted and actual values.
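The error function is described but not shown; a minimal sketch, assuming the post’s feedForward(x, w1, w2, activation) helper returns the vector of predictions, is:

```r
# Sum of squared errors (SSE) between observed y and model predictions
modelError <- function(x, y, w1, w2, activation) {
  preds <- feedForward(x, w1, w2, activation)
  sum((y - preds)^2)
}
```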
Now it is time to compute the gradients.
What are the gradients?
The gradients are the derivatives of the cost function with respect to the weights. The backpropagation algorithm computes the gradient of the loss function with respect to the weights, and the gradients are then used to update the weights with the gradient descent algorithm.
backPropagation <- function(x, y, w1, w2,
                            activation, derivativeActivation) {
  # Predicted values
  preds <- feedForward(x, w1, w2, activation)
  # Derivative of the cost function (first term)
  derivCost <- -2 * (y - preds)
  # Gradients for the weights
  dW1 <- matrix(0, ncol = 2, nrow = nrow(w1))
  dW2 <- matrix(rep(0, length(x) * (dim(w2)[2])), nrow = length(x))
  # Computing the gradient for W2
  for (i in 1:length(x)) {
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    da2dW2 <- matrix(rbind(1, activation(a1)), nrow = 1)
    dW2[i, ] <- derivCost[i] * da2dW2
  }
  # Computing the gradient for W1
  for (i in 1:length(x)) {
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    da2da1 <- derivativeActivation(a1) * matrix(w2[, -1], ncol = 1)
    da2dW1 <- da2da1 %*% matrix(rbind(1, x[i]), nrow = 1)
    dW1 <- dW1 + derivCost[i] * da2dW1
  }
  # Storing gradients for w1, w2 in a list
  gradient <- list(dW1, colSums(dW2))
  return(gradient)
}
Defining our Stochastic Gradient Descent algorithm, which will adjust our weight matrices.
SGD <- function(x, y, w1, w2, activation, derivative, learnRate, epochs) {
  SSEvec <- rep(NA, epochs + 1)  # stores the SSE before training and after each epoch
  SSEvec[1] <- modelError(x, y, w1, w2, activation)
  for (j in 1:epochs) {
    for (i in 1:length(x)) {
      gradient <- backPropagation(x[i], y[i], w1, w2, activation, derivative)
      # Adjusting model parameters for a given number of epochs
      w1 <- w1 - learnRate * gradient[[1]]
      w2 <- w2 - learnRate * gradient[[2]]
    }
    # Storing the SSE value after each epoch
    SSEvec[j + 1] <- modelError(x, y, w1, w2, activation)
  }
  # B holds the fitted model parameters
  B <- list(w1, w2)
  result <- list(B, SSEvec)
  return(result)
}
Running the SGD function to obtain our optimized model and parameters:
model <- SGD(x, y(x), w1, w2,
activation = sigmoid,
derivative = derivativeActivation,
learnRate = 0.01,
epochs = 200)
Obtaining our adjusted SSEs for each epoch:
SSE <- model[[2]]
Plotting the SSE from each epoch vs. the number of epochs:
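The plotting code is omitted in this excerpt; a base-R sketch (with an illustrative, made-up SSE trajectory standing in for the vector returned by SGD) could be:

```r
# Illustrative SSE values; in the post this would be SSE <- model[[2]]
SSE <- 30 * exp(-0.02 * (0:200)) + 2
plot(seq_along(SSE) - 1, SSE, type = "l",
     xlab = "Epoch", ylab = "SSE",
     main = "Training error by epoch")
```

A monotonically decreasing curve indicates that gradient descent is reducing the training error.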
Extracting our new parameters
from our model.
new_w1 <- model[[1]][[1]]
new_w2 <- model[[1]][[2]]
Comparing our old weight matrices against the new ones.
Obtaining our new predictions using our optimized parameters.
y_pred <- feedForward(x, new_w1, new_w2, sigmoid)
Plotting training data against our model predictions
data %>%
mutate(y_pred=y_pred) %>%
pivot_longer(cols = c(y, y_pred)) %>%
ggplot(aes(x, value, group=name, color=name)) +
geom_point(shape=21, stroke=0.5) +
geom_line() +
scale_color_discrete(type = c("steelblue", "red")) +
labs(title= "Target Response vs. Predictions",
x="Observations",
y="Responses")
Credits: tristanoprofetto
GitHub repository: https://github.com/tristanoprofetto/neural-networks/blob/main/ANN/Regressor/feedforward.R
This is the package that covers all the datasets used in the book Health Metrics and the Spread of Infectious Diseases: Machine Learning Application and Spatial Modelling Analysis with R.
Let’s scrape the R-Ladies chapters’ events from Meetup.com. We can use the {meetupr} package.
urlname <- c("rladies-paris","rladies-rome")
events <- purrr::map(urlname,get_events)
dat <- dplyr::bind_rows(events)
Load necessary libraries
urlname <- "rladies-rome"
events <- get_events(urlname)
dplyr::arrange(events, desc(time))%>%
head()
To do it for all chapters on meetup, we need the list of the chapters from the rladies github archive.
data <- jsonlite::fromJSON('https://raw.githubusercontent.com/rladies/meetup_archive/main/data/events.json')
dat <- dplyr::bind_rows(events)
# saveRDS(dat,"dat.rds")
dat3 <- dat2%>%
tidytext::unnest_tokens(word, title,drop = F)%>%
select(chapter,title,going,word)%>%
anti_join(get_stopwords())%>%
filter(!str_length(word)<=3)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta)) +
geom_col() +
scale_x_reordered() +
facet_wrap(vars(topic), scales = "free_x")
assignments <- augment(chapters_lda, data = chapters_dtm)
assignments%>%
filter(!term=="ladies")
# how words in titles changed overtime
inaug_freq <- dat3 %>%
inner_join(dat2,by=c("chapter","title","going"))%>%#View
count(time, word) %>%
complete(time, word, fill = list(n = 0)) %>%
group_by(time) %>%
mutate(time_total = sum(n),
percent = n / time_total) %>%
ungroup()
inaug_freq
models %>%
slice_max(abs(estimate), n = 6) %>%
inner_join(inaug_freq) %>%
ggplot(aes(time, percent)) +
geom_point() +
geom_smooth() +
facet_wrap(vars(word)) +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "Frequency of word in speech")
Linear regression is a statistical technique used to represent the linear relationship between a response y and a predictor x.
Below we examine a mismatch in output when plotting the prediction results of a linear model made with the lm() and predict() functions versus the output produced by the geom_smooth() layer in a ggplot() visualization.
The first dataset used is HistData::CholeraDeaths1849 from the HistData package. We select just the deaths due to cholera within the 12 months of 1849, and visualize the trend in time of the number of deaths with the addition of a further layer made with geom_smooth().
?geom_smooth()
?HistData::CholeraDeaths1849
library(tidyverse)
library(HistData)
cholera <- HistData::CholeraDeaths1849 %>%
filter(cause_of_death=="Cholera")%>%
select(date,deaths)
cholera %>% head
# A tibble: 6 × 2
date deaths
<date> <dbl>
1 1849-01-01 13
2 1849-01-02 19
3 1849-01-03 28
4 1849-01-04 24
5 1849-01-05 23
6 1849-01-06 39
summary(cholera)
date deaths
Min. :1849-01-01 Min. : 0
1st Qu.:1849-04-02 1st Qu.: 8
Median :1849-07-02 Median : 23
Mean :1849-07-02 Mean : 146
3rd Qu.:1849-10-01 3rd Qu.: 192
Max. :1849-12-31 Max. :1121
ggplot(cholera,aes(x=date,y=deaths))+
geom_point()+
geom_smooth(method = "lm")+
labs(title="Deaths due to Cholera in London (1849)",
x="Date",y="Cholera death")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
The purpose of making a model is to identify the underlying pattern of a series of observations. This means that the model needs to be able to interpolate the given observations in order to represent the overall pattern. In the visualization above, geom_smooth() with method="lm" helps us visualize the direction a linear pattern would take on this data: whether it is growing or not.
Clearly the points shape a bell-like distribution of deaths in time, so this is not a case of a linear relationship between date and cholera deaths. Still, we would like to dig into the predictions produced by applying a linear model to this data, and then compare them with the geom_smooth(method="lm") line.
Let’s apply a linear model to this data and make some rough predictions.
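The fitting call itself is omitted here, but the Call: line in the summary output below identifies it exactly; restating the data preparation from above so the chunk runs standalone:

```r
library(HistData)
library(dplyr)
cholera <- HistData::CholeraDeaths1849 %>%
  filter(cause_of_death == "Cholera") %>%
  select(date, deaths)
# Dates are stored internally as day counts, so the slope is deaths per day
mod <- lm(deaths ~ date, data = cholera)
summary(mod)
```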
Call:
lm(formula = deaths ~ date, data = cholera)
Residuals:
Min 1Q Median 3Q Max
-265.88 -104.28 -59.87 11.19 930.92
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.954e+04 4.878e+03 6.056 3.49e-09 ***
date 6.678e-01 1.108e-01 6.026 4.13e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 223.1 on 363 degrees of freedom
Multiple R-squared: 0.09094, Adjusted R-squared: 0.08843
F-statistic: 36.31 on 1 and 363 DF, p-value: 4.132e-09
The application of a linear model to this data produced an estimate of the intercept (about 2.954e+04) and the slope (about 0.6678).
The intercept is the starting point of the linear model line on the y axis, while the slope is the inclination of the line; it can be positive or negative, indicating a growing or decreasing tendency in the relationship between the response and the predictor.
Let’s draw this line.
ggplot(cholera, aes(x=date, y=deaths)) +
geom_point() +
geom_abline(slope=0.6678, intercept=29540,
col="pink")+
theme_minimal()
Now calculate the prediction and the Root Mean Squared Error (RMSE) to evaluate how the model worked.
predictions <- predict(mod, newdata = NULL)
rmse <- sqrt(mean((predictions - cholera$deaths)^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 222.4774
ggplot(cholera, aes(x=date)) +
geom_point(aes(y=deaths)) +
geom_smooth(aes(y=deaths),method = "lm",linewidth=2)+
geom_line(y=predictions, col="red")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Now let’s use a different dataset. This data set comes from a paper by Brigham et al. (2003) that analyses some tables from Farr’s report of the Registrar-General on mortality due to cholera in England in the years 1848-1849, during which there was a large epidemic throughout the country. In this case we do not have the time variable but the numbers of deaths are considered by 38 districts in London.
?HistData::Cholera
cholera2 <- HistData::Cholera %>%
rownames_to_column(var="id")%>%
select(id,district,cholera_deaths,popn)
cholera2 %>% head
id district cholera_deaths popn
1 1 Newington 907 63074
2 2 Rotherhithe 352 17208
3 3 Bermondsey 836 50900
4 4 St George Southwark 734 45500
5 5 St Olave 349 19278
6 6 St Saviour 539 35227
The predictor in this case is a character: we are considering the relationship between the death rate and the districts, so we are looking to see whether the death rate differs among the 38 districts. The order in this case is arbitrary, and this influences the pattern. geom_smooth() is not drawing a line because it doesn’t know how the x-axis should be ordered: there is no specified order to follow.
ggplot(cholera2, aes(x=id,y=cholera_deaths)) +
geom_point()+
geom_smooth(method = "lm")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
While if we set an order with as.integer(id), the line can be drawn, but it hasn’t got much meaning: the trend is not going up or down because we have districts on the x-axis.
ggplot(cholera2, aes(x=as.integer(id),y=cholera_deaths)) +
geom_point()+
geom_smooth(method="lm")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
If we instead consider the population in the middle of 1849, a numeric vector, these values are by district: each popn value corresponds to the population level in one of the 38 districts. Let’s see what happens if we plot popn versus cholera_deaths.
ggplot(cholera2, aes(x=popn,y=cholera_deaths)) +
geom_point()+
geom_smooth(method="lm")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Now the values on the x-axis are numeric and it is meaningful to order them from lowest to highest, but this is not a trend. Each point is one district’s population value with some deaths due to cholera. The geom_smooth line is telling us that where the population level is higher, the level of deaths due to cholera is higher than in locations with a lower population level.
So we can evaluate the growing relationship between population level and number of deaths due to cholera.
Let’s make a linear model and predict the future, roughly.
mod2 <- lm(cholera_deaths ~ popn , data = cholera2)
mod2
Call:
lm(formula = cholera_deaths ~ popn, data = cholera2)
Coefficients:
(Intercept) popn
1.073e+02 4.357e-03
Let’s draw this line.
ggplot(cholera2, aes(x=popn, y=cholera_deaths)) +
geom_point() +
geom_abline(slope=0.004357, intercept=107.3,
col="pink")+
theme_minimal()
predictions2 <- predict(mod2, newdata = NULL)
rmse <- sqrt(mean((predictions2 - cholera2$cholera_deaths)^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 281.7808
plot <- ggplot(cholera2, aes(x=popn)) +
geom_point(aes(y=cholera_deaths)) +
geom_smooth(aes(y=cholera_deaths),method = "lm",linewidth=2)+
geom_line(y=predictions2, col="red")+
labs(title="Cholera Deaths explained by\nLondon Districts Population (1849)",
x="1849 London Population by 38 Districts",
y="Cholera Deaths",
caption="William Farr's Data on Cholera in London, 1849")+
theme_minimal()
plot
`geom_smooth()` using formula = 'y ~ x'
Think about that!
Imagine that we absolutely want to replicate the geom_smooth(method="lm") line: we would need to consider the steps that the function takes in order to plot a prediction line that doesn’t match ours this time. The first thing to consider is that we haven’t used new data, but just produced the predictions from our dataset; yet this is exactly the same as before.
There is a function, ggplot_build(), that lets us dig into the data manipulation ggplot uses to make the geom_smooth line.
?ggplot_build()
# source: https://stackoverflow.com/questions/42673665/geom-smooth-gives-different-fit-than-nls-alone
dat = ggplot_build(plot)$data[[2]]
`geom_smooth()` using formula = 'y ~ x'
This time we use newdata = dat instead of NULL.
predictions3 <- predict(mod2, newdata = dat)
rmse <- sqrt(mean((predictions3 - cholera2$cholera_deaths)^2))
Warning in predictions3 - cholera2$cholera_deaths: longer object length is not
a multiple of shorter object length
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 411.1835
ggplot(cholera2, aes(x=popn)) +
geom_point(aes(y=cholera_deaths)) +
geom_smooth(aes(y=cholera_deaths),method = "lm",linewidth=2)+
geom_line(data=dat,y=predictions3, col="red")+
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
We eventually matched the geom_smooth line, but why did our predictions on the original data result in that squiggly line?
My contributions to the #DuboisChallenge2024.
Download the georgia-1880-county-shapefile.zip
file from: https://github.com/ajstarks/dubois-data-portraits/tree/master/challenge/2024/challenge01
georgia_shp <- sf::read_sf("data/georgia-1880-county-shapefile")
# georgia_shp%>%head
dat_sf <- georgia_shp%>%
janitor::clean_names()%>%
separate(data1870,into=c("up70","down70"))%>%
separate(data1880_p,into=c("up80","down80"))%>%
mutate(# pop 1870
up70=ifelse(up70=="",0,up70),
down70=ifelse(is.na(down70),0,down70),
up70=as.numeric(up70),
down70=as.numeric(down70),
# pop 1880
up80=ifelse(up80=="",0,up80),
down80=ifelse(is.na(down80),0,down80),
up80=as.numeric(up80),
down80=as.numeric(down80))%>%
rowwise()%>%
mutate(pop70=mean(c(up70,down70)),
pop80=mean(c(up80,down80)))%>%
arrange(pop70,pop80)
data <- dat_sf%>%select(county=icpsrnam,
pop70,pop80)%>%
mutate(id=case_when(pop70 == 0 ~ 7,
pop70 == 1000 ~ 6,
pop70 == 2500 ~ 5,
pop70 == 5000 ~ 4,
pop70 == 10000 ~ 3,
pop70 == 15000 ~ 2,
pop70 == 20000 ~ 1),
pop70=case_when(pop70 == 0 ~ "UNDER 1,000",
pop70 == 1000 ~ "1000 TO 2,500",
pop70 == 2500 ~ "2,500 TO 5,000",
pop70 == 5000 ~ "5,000 TO 10,000",
pop70 == 10000 ~ "10,000 TO 15,000",
pop70 == 15000 ~ "15,000 TO 20,000",
pop70 == 20000 ~ "20,000 TO 30,000"))%>%
# pop80
mutate(id=case_when(pop80 == 0 ~ 7,
pop80 == 1000 ~ 6,
pop80 == 2500 ~ 5,
pop80 == 5000 ~ 4,
pop80 == 10000 ~ 3,
pop80 == 15000 ~ 2,
pop80 == 20000 ~ 1),
pop80=case_when(pop80 == 0 ~ "UNDER 1,000",
pop80 == 1000 ~ "1000 TO 2,500",
pop80 == 2500 ~ "2,500 TO 5,000",
pop80 == 5000 ~ "5,000 TO 10,000",
pop80 == 10000 ~ "10,000 TO 15,000",
pop80 == 15000 ~ "15,000 TO 20,000",
pop80 == 20000 ~ "20,000 TO 30,000"))
data%>%count(id,pop80)
Fonts:
Colors:
Background:
"#e7d6c5"
Text:
c("#483c32","#bbaa98")
legend_colors <- c("#372c59","#7a5039","#c29e84","#d63352",
"#e79d96","#edb456","#4b5c4f")
Bounding box: xmin: 939223.1 ymin: -701249.8 xmax: 1425004 ymax: -200888.5
pop70_map <- data%>%
ggplot()+
geom_sf(aes(fill=pop70),
show.legend = F,
color="#483c32",alpha=0.9,
linewidth=0.1)+
scale_fill_manual(values=c("UNDER 1,000"="#4b5c4f",
"1000 TO 2,500"="#edb456",
"2,500 TO 5,000"="#e79d96",
"5,000 TO 10,000"="#d63352",
"10,000 TO 15,000"="#c29e84",
"15,000 TO 20,000"="#7a5039",
"20,000 TO 30,000"="#372c59"),na.value = "#e0cebb")+
annotate("text", x = -84.45, y = 35.1,
label = "1870",
size = 3.5,color="#483c32",
fontface = "bold",
family = "Public Sans" ) +
coord_sf(crs=4326,clip = "off")+
ggthemes::theme_map()+
theme(plot.background = element_rect(color="#e7d6c5",fill="#e7d6c5"),
panel.background = element_rect(color="#e7d6c5",fill="#e7d6c5"))
pop70_map
pop80_map <- data%>%
ggplot()+
geom_sf(aes(fill=pop80),
show.legend = F,
color="#483c32",alpha=0.9,
linewidth=0.1)+
scale_fill_manual(values=c("UNDER 1,000"="#4b5c4f",
"1000 TO 2,500"="#edb456",
"2,500 TO 5,000"="#e79d96",
"5,000 TO 10,000"="#d63352",
"10,000 TO 15,000"="#c29e84",
"15,000 TO 20,000"="#7a5039",
"20,000 TO 30,000"="#372c59"),na.value = "#e0cebb")+
annotate("text", x = -84.45, y = 35.1,
label = "1880",
size = 3.5,color="#483c32",
fontface = "bold",
family = "Public Sans" ) +
coord_sf(crs=4326,clip = "off")+
ggthemes::theme_map()+
theme(plot.background = element_rect(color="#e7d6c5",fill="#e7d6c5"),
panel.background = element_rect(color="#e7d6c5",fill="#e7d6c5"))
pop80_map
source: https://ggplot2-book.org/arranging-plots
pop70_map+ ggplot() + ggplot()+ pop80_map + plot_layout(ncol = 2,nrow = 2)
legend1_plot <- legend1%>%
ggplot(aes(x,y))+
geom_point(aes(fill=label),
shape=21,stroke=0.1,
size=8.5,
show.legend = F)+
scale_fill_manual(values=c("5,000 TO 10,000"="#d63352",
"2,500 TO 5,000"="#e79d96",
"1000 TO 2,500"="#edb456",
"UNDER 1,000"="#4b5c4f"))+
geom_text(aes(label=label),
family="Public Sans",
size=3.5,color="#7a5039",
nudge_x = 0,hjust=-0.2)+
coord_cartesian(xlim=c(-0.2,1),ylim =c(-0,5) )+
ggthemes::theme_map()+
theme(plot.background = element_rect(color="#e7d6c5",fill="#e7d6c5"),
panel.background = element_rect(color="#e7d6c5",fill="#e7d6c5"))
legend1_plot
legend2_plot <- legend2%>%
ggplot(aes(x,y))+
geom_point(aes(fill=label),
shape=21,stroke=0.1,
size=8.5,
show.legend = F)+
scale_fill_manual(values=c("10,000 TO 15,000"="#c29e84",
"15,000 TO 20,000"="#7a5039",
"BETWEEN 20,000 AND 30,000"="#372c59"))+
geom_text(aes(label=label),
size=3.5,color="#7a5039",
family="Public Sans",
nudge_x = 0.5,hjust=0)+
coord_cartesian(xlim=c(-0.2,7),ylim =c(-1,4) )+
ggthemes::theme_map()+
theme(plot.background = element_rect(color="#e7d6c5",fill="#e7d6c5"),
panel.background = element_rect(color="#e7d6c5",fill="#e7d6c5"))
legend2_plot
pop70_map+ legend2_plot + legend1_plot+ pop80_map + plot_layout(ncol = 2,nrow = 2)+plot_annotation(
title = "NEGRO POPULATION OF GEORGIA BY COUNTIES.",
caption="#DuboisChallenge24| Week1 | by Federica Gazzelloni",
theme = theme_void(base_family = "Public Sans"))&
theme(text=element_text(color="#483c32",face="bold"),
plot.title = element_text(hjust=0.5),
plot.caption = element_text(size=9),
plot.background = element_rect(color="#e7d6c5",fill="#e7d6c5"),
panel.background = element_rect(color="#e7d6c5",fill="#e7d6c5"))
ggsave("challenge01.png",bg="#e7d6c5",height = 8.8)
Simpson's Paradox
is a statistical phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined. This paradox highlights the importance of considering confounding variables and understanding the causal relationship between variables.
Let’s consider a hypothetical work environment where the number of women (W) is greater than the number of men (M). However, when looking at the distribution of managerial positions (P), it seems that more men occupy higher-level positions compared to women.
Now, suppose there’s a characteristic Z, representing gender, and you suspect it might influence the choice of assigning a managerial position (P) because a specific time dedicated to a critical task (T) is primarily marketed toward men (M).
To illustrate this paradox, we’ll create synthetic data in R.
set.seed(123)
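The generating code is not included in this excerpt; one possible construction that matches the summary below in spirit (the group sizes, the count column, and the exact managerial probabilities are assumptions) is:

```r
n_men <- 800; n_women <- 1200          # more women than men, as described
men <- data.frame(gender = "Men",
                  count = sample(200:800, n_men, replace = TRUE),
                  manager = rbinom(n_men, 1, 0.63))      # assumed probability
women <- data.frame(gender = "Women",
                    count = sample(200:800, n_women, replace = TRUE),
                    manager = rbinom(n_women, 1, 0.50))  # assumed probability
data <- rbind(men, women)
summary(data)
```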
summary(data)
gender count manager
Length:2000 Min. :200 Min. :0.0000
Class :character 1st Qu.:352 1st Qu.:0.0000
Mode :character Median :500 Median :1.0000
Mean :500 Mean :0.5615
3rd Qu.:648 3rd Qu.:1.0000
Max. :800 Max. :1.0000
In this example, we have created a dataset with a larger number of women, but the chance of obtaining a managerial position for men is influenced by a confounding variable. Now, let’s examine the paradox.
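The code that builds proportion_table is not shown; it is simply the share of managers within each gender group. A base-R sketch wrapped as a helper (the post itself presumably uses dplyr on its data object):

```r
# Mean of the 0/1 manager flag within each gender = proportion of managers
proportion_by_gender <- function(df) {
  aggregate(manager ~ gender, data = df, FUN = mean)
}
# Usage (hypothetical): proportion_table <- proportion_by_gender(data)
```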
proportion_table
# A tibble: 2 × 2
gender proportion
<chr> <dbl>
1 Men 0.626
2 Women 0.497
library(ggplot2)
proportion_table%>%
ggplot(aes(gender,proportion,fill=gender))+
geom_col(color="white",show.legend = F)+
scale_fill_viridis_d()+
labs(title = "Proportion of Managers by Gender",
subtitle = "Example of the Simpson's Paradox",
x="",
caption = "Data: Synthetic | Graphics: Federica Gazzelloni") +
coord_equal()+
ggthemes::theme_pander()+
theme(plot.caption = element_text(hjust = 0.5))
In this scenario, when examining the proportion of managerial positions within each gender group, it might appear that men have a higher chance. However, when we consider the entire dataset, we may find the opposite due to the confounding variable.
The key takeaway is that understanding causation is crucial, and Simpson’s Paradox emphasizes the need to consider confounding factors when interpreting data.
The HIV infections data is from aidsinfo.unaids.org. The data contains global values from 2010 to 2022 for HIV estimate, HIV incidence-mortality ratio, HIV prevalence, and HIV deaths.
National data for 2010 and 2020 are included in this dataset.
The following estimates the magnitude of change in HIV infections over the 10-year time frame from 2010 to 2020 for all countries with available data.
Load necessary libraries.
library(tidyverse)
library(tidymodels)
tidymodels_prefer()
library(spatialsample)
library(sf)
library(tmap)
data("World")
This first dataset contains the average values, obtained by averaging the lower and upper bounds of the 2010 and 2020 HIV infections for 173 countries.
The percent change relative to the sum of changes for all countries is given by:

$$ \text{cc}_i = \frac{\Delta_i}{\sum_{j} \Delta_j}, \qquad \Delta_i = \text{avg}_{2020,i} - \text{avg}_{2010,i} $$

This gives the percent change for each country relative to the sum of all changes, i.e. the percentage contribution of each country’s change to the total change.
prevalence_rt <- read.csv("data/Epidemic transition metrics_Incidence_prevalence ratio.csv")
aids_prevalence_rt<-prevalence_rt%>%
filter(!Country=="Global")%>%
select(!contains("Footnote"))%>%
select(contains(c("Country","2010","2020")))%>%
mutate(avg_2010=(as.numeric(X2010_upper)+as.numeric(X2010_lower))/2,
avg_2020=(as.numeric(X2020_upper)+as.numeric(X2020_lower))/2,
aids_change=(avg_2020-avg_2010),
country_change=round(aids_change/avg_2010,5),
aids_prev_cc=round(aids_change/sum(aids_change,na.rm = T),5))
inc_mort_ratio <- read_csv("data/Epidemic transition metrics_Incidence_mortality ratio.csv")
aids_inc_mort_ratio <- inc_mort_ratio%>%
janitor::clean_names()%>%
filter(!country=="Global")%>%
select(!contains("Footnote"))%>%
select(contains(c("Country","2010","2020")))%>%
mutate(avg_2010=(as.numeric(x2010_upper)+as.numeric(x2010_lower))/2,
avg_2020=(as.numeric(x2020_upper)+as.numeric(x2020_lower))/2,
aids_change=(avg_2020-avg_2010),
country_change=round(aids_change/avg_2010,5),
aids_imr_cc=round(aids_change/sum(aids_change,na.rm = T),5))
deaths <- read_csv("data/Epidemic transition metrics_Trend of AIDS-related deaths.csv")
aids_deaths<- deaths%>%
filter(!Country=="Global")%>%
filter(!Country=="India")%>%
select(!contains("Footnote"))%>%
select(contains(c("Country","2010","2020")))%>%
janitor::clean_names()%>%
mutate(x2010_upper=as.numeric(str_extract(x2010_upper,"([0-9]+)")),
x2010_lower=as.numeric(str_extract(x2010_lower,"([0-9]+)")),
x2020_upper=as.numeric(str_extract(x2020_upper,"([0-9]+)")),
x2020_lower=as.numeric(str_extract(x2020_lower,"([0-9]+)")),
avg_2010=(x2010_upper+x2010_lower)/2,
avg_2020=(x2020_upper+x2020_lower)/2,
aids_change=(avg_2020-avg_2010),
country_change=round(aids_change/avg_2010,5),
aids_d_cc=round(aids_change/sum(aids_change,na.rm = T),5))
dat%>%
pivot_longer(cols = contains("cc"))%>%
mutate(value=scale(value),
country=as.factor(country))%>%
drop_na()%>%
ggplot(aes(x=fct_reorder(country,value),y=value,
group=name,fill=name),color="grey24")+
geom_col(position = "stack")+
scale_y_log10(expand=c(0,0),label=scales::comma_format())+
labs(title="HIV Distributions (2010-2020)",
x="Country",
fill="",
caption = "Graphic: @fgazzelloni")+
theme(text=element_text(size=14),
axis.text.x = element_text(angle = 90,size=4,hjust=1),
panel.grid = element_blank(),
panel.background = element_rect(color = "grey24",fill="grey24"))
labels=c("aids_cc"="HIV Country Contribution",
"aids_d_cc"="HIV Deaths Country Contribution",
"aids_imr_cc"="HIV incidence mortality rate Country Contribution",
"aids_prev_cc"="HIV Prevalence Country Contribution")
ggplot()+
geom_sf(data=World,color="grey25",fill="grey75")+
geom_sf(data=aids_map,
mapping=aes(geometry=geometry,fill=value),
color="red")+
coord_sf(crs="ESRI:54030",clip = "off")+
facet_wrap(~name,labeller = labeller(name=labels))+
scale_fill_viridis_c()+
labs(caption="Map: @fgazzelloni")
set.seed(11132023)
split <- initial_split(data,prop = 0.8)
train<- training(split)
test <- testing(split)
folds <- spatial_clustering_cv(train, v = 5)
autoplot(folds)+
labs(title="HIV Spatial Clustering Cross Validation",
caption="DataSource: aidsinfo.unaids.org | Map: @fgazzelloni")+
ggthemes::theme_map(base_size = 14)+
theme(plot.title = element_text(hjust=0.5),
plot.caption = element_text(hjust = 0.5))
source: https://spatialsample.tidymodels.org/articles/spatialsample.html
# `splits` will be the `rsplit` object
compute_preds <- function(splits) {
# fit the model to the analysis set
mod <- lm(aids_cc ~ aids_prev_cc + aids_imr_cc + aids_d_cc,
data = analysis(splits)
)
# identify the assessment set
holdout <- assessment(splits)
# return the assessment set, with true and predicted values
tibble::tibble(
geometry = holdout$geometry,
aids_cc = log10(holdout$aids_cc),
.pred = predict(mod, holdout)
)
}
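A possible way to apply this helper across the folds, sketched here under the assumption that {purrr} and {yardstick} are loaded and that `resamples` is the combined rset built below:

```r
# Map compute_preds() over each rsplit and score an RMSE per fold
cv_res <- resamples %>%
  mutate(.preds = purrr::map(splits, compute_preds))

cv_rmse <- cv_res %>%
  mutate(.rmse = purrr::map_dbl(
    .preds,
    ~ yardstick::rmse_vec(truth = .x$aids_cc, estimate = .x$.pred)
  ))
```

Comparing the per-fold `.rmse` between the cluster and block resampling types is what makes the two strategies comparable.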
cluster_folds <- spatial_clustering_cv(data, v = 15)
block_folds <- spatial_block_cv(data, v = 15)
cluster_folds$type <- "cluster"
block_folds$type <- "block"
resamples <-
dplyr::bind_rows(
cluster_folds,
block_folds
)
For the #30DayMapChallenge 2023, Day 6 - Asia, let's explore population estimates for regions and major cities, from different sources.
We will also look at how to get started with ggmap to find the geocodes for the major cities in Asia.
Let's scrape the table of the major cities in Asia, along with their population levels, from Wikipedia.org.
Load the first set of libraries:
html.population <- read_html('https://en.wikipedia.org/wiki/List_of_Asian_cities_by_population_within_city_limits')
df.asia_cities <- html.population %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill = TRUE)
df.asia_cities %>% names()
Select only the columns of interest and clean the data.
df.asia_cities <- df.asia_cities[c(1,2,4)]
asia_cities <- df.asia_cities %>%
mutate(Population = str_replace_all(Population, "\\[.*\\]","") %>% parse_number(),
City_full= str_c(df.asia_cities$City, df.asia_cities$Nation, sep = ', ')) %>%
select(City, Nation, City_full, Population)%>%
filter(!str_detect(Nation,"Russia|Turkey"),
!is.na(Population))
asia_cities %>% head()
To find the geocodes of the Asian cities, we use the geocode()
function from the {ggmap} package.
Once you are all set, try:
?ggmap::geocode
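A minimal sketch of what the geocoding step might look like; this assumes a registered Google Maps API key, and uses `mutate_geocode()`, which appends `lon`/`lat` columns for each location string:

```r
library(ggmap)
# register_google(key = "YOUR_API_KEY")  # required for the Google geocoding backend

# Append lon/lat columns for each "City, Nation" string
data.geo <- asia_cities %>%
  mutate_geocode(City_full)
```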
data.geo%>%head
Let’s have a look at the map of Asia with {ggmap}.
For this challenge we will be using another package for the polygons of Asia, the {rworldmap} package.
install.packages("rworldmap")
worldmap <- rworldmap::getMap(resolution = "high")
dim(worldmap)
Have a look at the regions and choose Asia.
Since it is a spatial polygons data frame and we'd like to use the geom_sf()
function from the ggplot2 package, we transform it into a simple feature object with the st_as_sf()
function from the sf package.
To map the continent with population estimates by state, we can set the option fill = POP_EST.
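The transformation itself might look like this (a sketch; the continent column in the {rworldmap} data is assumed here to be named `REGION`):

```r
library(sf)

# Convert the SpatialPolygonsDataFrame to sf and keep only Asia
asia_sf <- worldmap %>%
  st_as_sf() %>%
  filter(REGION == "Asia")
```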
asia_sf %>%
ggplot()+
geom_sf(aes(fill=POP_EST))+
scale_fill_continuous()
It is also interesting to look at a different classification of the population classes, which we do with the classIntervals()
function from the classInt package, classifying the population estimates by quantile.
Let's have a look at the population quantiles first. We can see the minimum and maximum levels, and the values of the three quartiles, 25%, 50% (the median), and 75%, which tell us which population estimates fall in each quantile class.
The median population estimate for Asia is around 18 million, with the largest country reaching about 1.3 billion people.
quantile(asia_sf$POP_EST, na.rm=TRUE)
asia_sf%>%
ggplot(aes(POP_EST))+
geom_histogram(aes(fill=SOVEREIGNT),bins = 20)+
geom_vline(aes(xintercept = mean(POP_EST)),color="lightblue")+
geom_vline(aes(xintercept = median(POP_EST)),color="midnightblue")+
geom_text(aes(x=9000000,y=9,label="median"),size=2)+
geom_text(aes(x=50000000,y=9,label="mean"),size=2)+
scale_x_log10(labels=scales::comma_format(scale = 1/1000),n.breaks =8)+
scale_fill_viridis_d()+
labs(x="Population Estimation (Thousands)",
title="Asia Population Distribution",
caption="DataSource: {rworldmap} | Graphic: @fgazzelloni")+
ggthemes::theme_clean()+
theme(legend.text = element_text(size=5),
legend.key.size = unit(5,units = "pt"))
We use the classInt package to find custom intervals of the population, and store them in a new object called brks.
brks <- classIntervals(asia_sf$POP_EST,
n=10,
style="quantile")
brks
Set the color scheme:
brks <- brks$brks
colors <- RColorBrewer::brewer.pal(length(brks), "Spectral")
Finalize the dataset to use for the map with the population estimation interval cuts.
region_pop <- asia_sf%>%
select(POP_EST)%>%
mutate(breaks=case_when(POP_EST > 0 & POP_EST < 625493.5 ~ "[0,625493.5)",
POP_EST >= 625493.5 & POP_EST < 2691158 ~ "[625493.5,2691158)",
POP_EST >= 2691158 & POP_EST < 4728016 ~ "[2691158,4728016)",
POP_EST >= 4728016 & POP_EST < 6834942 ~ "[4728016,6834942)",
POP_EST >= 6834942 & POP_EST < 17788961 ~ "[6834942,17788961)",
POP_EST >= 17788961 & POP_EST < 23822783 ~ "[17788961,23822783)",
POP_EST >= 23822783 & POP_EST < 28625005 ~ "[23822783,28625005)",
POP_EST >= 28625005 & POP_EST < 65905410 ~ "[28625005,65905410)",
POP_EST >= 65905410 & POP_EST < 141564781 ~ "[65905410,141564781)",
POP_EST >= 141564781 & POP_EST <= 1338612968 ~ "[141564781,1338612968]"))
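As a more compact alternative to spelling out every interval in `case_when()`, the same bins can be derived directly from the `brks` vector with `cut()` (a sketch, not the code used for the final map):

```r
# Equivalent binning using the computed class-interval breaks
region_pop <- asia_sf %>%
  select(POP_EST) %>%
  mutate(breaks = cut(POP_EST,
                      breaks = brks,
                      include.lowest = TRUE,
                      dig.lab = 10))
```

The `dig.lab` argument keeps the interval labels in plain digits rather than scientific notation.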
Add some information about Asia's population in a text box with the geom_textbox()
function from the ggtext package.
region_pop %>%
ggplot()+
geom_sf(aes(fill=breaks))+
scale_fill_manual(breaks=c("[0,625493.5)","[625493.5,2691158)",
"[2691158,4728016)","[4728016,6834942)",
"[6834942,17788961)","[17788961,23822783)",
"[23822783,28625005)","[28625005,65905410)",
"[65905410,141564781)","[141564781,1338612968]"),
values=rev(colors))+
geom_point(data=asia_cities_full,
mapping=aes(lon,lat,size=Population),
shape=21,stroke=0.5,
alpha=0.7,
color="grey90",
inherit.aes = F)+
scale_size_continuous(labels=scales::comma_format())+
geom_text(data=asia_cities_full,
mapping=aes(lon,lat,label=City),fontface="bold",
check_overlap = T,
size=2.1,color="white")+
ggtext::geom_textbox(data=text,
mapping=aes(x=60,y=-6,label=text),
size=1.8,width = 0.4,fill="grey90",
family = "Gill Sans",
inherit.aes = F)+
geom_curve(x=50,xend=67,y=0,yend=20,
linewidth=0.2,curvature = -0.5,
arrow = arrow(angle=30,
length = unit(0.1, "inches"),
ends = "last", type = "open"),
color="white")+
geom_curve(x=86,xend=140,y=-5,yend=33,
linewidth=0.2,
arrow = arrow(angle=30,
length = unit(0.1, "inches"),
ends = "last", type = "open"),
color="white")+
labs(fill="Regions Population",
size="Cities Population",
title="Asia - Population Level",
caption="#30DayMapChallenge 2023 Day6 - ASIA\nDataSource: Wikipedia & ggmap | Map @fgazzelloni")+
ggthemes::theme_map()+
theme(text=element_text(color="white", family = "Gill Sans"),
plot.title = element_text(face="bold",size=14),
plot.caption = element_text(hjust = 0),
plot.background = element_rect(fill="#4A4A4A",color="#4A4A4A"),
panel.background = element_rect(fill="#4A4A4A",color="#4A4A4A"),
legend.background = element_blank(),
legend.key = element_rect(color="#4A4A4A",fill="#4A4A4A"),
legend.position = "right",
legend.text = element_text(size=5.5),
legend.key.size = unit(5.5,units = "pt"))
ggsave("day6_asia.png",
width = 7,height = 4,
bg="#4A4A4A")
August 18, 2023 “A case of locally acquired #malaria has been confirmed in Maryland, Washington, D.C., area. Nine cases have been reported this summer in Florida and Texas, the first in the US in 20 years, according to the US Centers for Disease Control and Prevention. #epidemics” ^{1}
Locally acquired malaria cases are a concern: the bulk of the disease burden is now concentrated in parts of the African continent, so cases of domestic origin elsewhere are treated as potential epidemics.
Let's dig into some data about malaria.
The first source of data is the WHO Malaria Map, a collection of information about the location of cases, the vector species, their invasive status, and other variables of interest such as the time frame of each study, the sampling methods, and so on.
Load necessary libraries
malaria_who <- read_excel("data/who_data.xlsx",
sheet = "Data")
malaria_who %>% head()
dim(malaria_who)
malaria_who%>%glimpse
The historical collection of mosquito data shows records collected from 1985 and updated through 2022.
Let's omit missing values for the year_start
variable and create a new variable, midyear,
which is the middle year between the year_start
and year_end
of each study.
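The step described above is not shown in the original chunk; a possible implementation, with the column names assumed from the text:

```r
# Drop records with a missing start year and compute the study mid-year
malaria_who_mid_year <- malaria_who %>%
  filter(!is.na(year_start)) %>%
  mutate(midyear = (year_start + year_end) / 2)
```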
Over time, the number of mosquitoes recorded has varied with ups and downs.
malaria_who_mid_year %>%
group_by(midyear)%>%# count(midyear)
mutate(avg_n_mosquito=mean(mosquito_number))%>%
ungroup()%>%
count(country_name,midyear,avg_n_mosquito,invasive_status)%>%
ggplot(aes(x=midyear,y=log10(avg_n_mosquito),
group=invasive_status,color=invasive_status))+
geom_line(linewidth=2)+
scale_color_viridis_d(labels = c("Invasive", "Native"),
guide = guide_legend(reverse=TRUE,
override.aes = list(size = 10)))+
coord_cartesian(clip = 'off') +
labs(title="Time series Malaria 1985 - 2022",
caption="Graphics: FG",color="Status")+
ggthemes::theme_fivethirtyeight()
Mosquitoes started being categorized as an invasive species
after 2016.
malaria_who_mid_year%>%
filter(midyear>= 2010)%>%
ggplot(aes(x=factor(midyear),y=log10(mosquito_number),
group=midyear,fill=invasive_status))+
geom_violin()+
scale_fill_viridis_d(labels = c("Invasive", "Native"),
guide = guide_legend(reverse=TRUE,
override.aes = list(size = 10)))+
ggthemes::theme_fivethirtyeight()+
theme(legend.position = "top")+
labs(title="When mosquitoes became tagged as invasive",
caption="Graphics: FG",fill="Status")
The locations of cases revealed by consistent sentinel surveillance
procedures identify south-east Africa and Asia as the areas most at risk of malaria spreading to the rest of the world.
Here is a map of the invasive vector species
in this area:
world <- map_data("world")%>%filter(!region=="Antarctica")
malaria_who %>%
ggplot(mapping=aes(x=longitude,y=latitude))+
geom_polygon(data=world,
mapping=aes(x=long,y=lat,group=group),
linewidth=0.2,
color="grey80",
fill="white")+
geom_point(aes(fill=invasive_status),
color="grey80",
shape=21,
stroke=0.2,
size=0.7,alpha=0.5)+
coord_sf(xlim = c(-50, 110), expand = TRUE) +
scale_fill_viridis_d()+
labs(title="Invasive vector species from 1985 to 2022",fill="Status")+
ggthemes::theme_map()+
theme(plot.background = element_rect(color="steelblue",fill="steelblue"))
Let's zoom in on the center of the mass of points. By taking the mean of the latitude and longitude ranges to identify the central point within the mass of points where mosquitoes were located, and setting a zoom level, a closer look at the locations is possible, with a new range assigned to the map boundaries.^{2}
zoom_to <- c(lon_avg,lat_avg)
zoom_level <- 1.5
lon_span <- 360 / 2^zoom_level
lat_span <- 180 / 2^zoom_level
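The `lon_bounds` and `lat_bounds` objects used in the map are not defined in the chunk above; a sketch of how they, and the center point, could be computed:

```r
# Center of the mass of points (means of the coordinate ranges)
lon_avg <- mean(range(malaria_who$longitude, na.rm = TRUE))
lat_avg <- mean(range(malaria_who$latitude, na.rm = TRUE))

# Map boundaries: half a span on each side of the zoom center
lon_bounds <- c(zoom_to[1] - lon_span / 2, zoom_to[1] + lon_span / 2)
lat_bounds <- c(zoom_to[2] - lat_span / 2, zoom_to[2] + lat_span / 2)
```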
ggplot()+
geom_polygon(data=world,
mapping=aes(x=long,y=lat,group=group),
linewidth=0.2,
color="grey80",
fill="white")+
geom_point(data= malaria_who,
mapping=aes(x=longitude,y=latitude,
fill=invasive_status),
color="black",
shape=21,
stroke=0.2,
size=3,alpha=0.5)+
geom_sf_text(data = st_sfc(st_point(zoom_to), crs = 4326),
label = '.') +
scale_fill_viridis_d()+
coord_sf(xlim = lon_bounds, ylim = lat_bounds,expand = TRUE) +
labs(title="A closer look at invasive vector species",
subtitle="from 1985 to 2022",
caption="DataSource: WHO Malaria Data | Graphics: FG",
fill="Status")+
ggthemes::theme_map()+
theme(text=element_text(family="Roboto Condensed"),
plot.title = element_text(size = 18),
plot.background = element_rect(color="steelblue",fill="steelblue"),
legend.position = c(0,0.001))
Looking at a different data source, the World Bank data
provides reports on the incidence of malaria
between 2000 and 2010. ^{3}
malaria_wb <- read_csv("data/worldbank_data.csv")
malaria_wb_long <- malaria_wb%>%
janitor::clean_names()%>%
select(-series_name,-series_code) %>%
pivot_longer(cols = c(3:14),names_to="year")%>%
mutate(value=trimws(value),
value=as.numeric(value),
value=round(value,3))%>%
filter(!is.na(value))%>%
mutate(year=gsub("_yr[0-9]+$","",year),
year=gsub("^x","",year),
year=as.integer(year))%>%
arrange(year)
malaria_wb_long%>%glimpse
Article: https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(23)00063-0/fulltext
GitHub repository: https://github.com/hannaehrlich/maldrugres_SSA
url <- "https://raw.githubusercontent.com/hannaehrlich/maldrugres_SSA/main/Survey_MolecMarker_Data.csv"
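The `maldrugres` object used below could be read directly from the raw GitHub file; the cleaned `maldrugres_new` variant used later is assumed here to be a `janitor::clean_names()` version (a sketch):

```r
maldrugres <- readr::read_csv(url)

# Lower-case, snake_case column names for the later plots
maldrugres_new <- maldrugres %>%
  janitor::clean_names()
```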
world <- map_data("world")%>%filter(!region=="Antarctica")
africa <- world%>%filter(long >= -50,long< 60)
ggplot(data= maldrugres) +
geom_polygon(data= africa,
mapping=aes(x=long,y=lat,group=group),
linewidth=0.1,
color="grey70",fill="grey90")+
geom_point(mapping=aes(x=Lon,y=Lat,
color=Drug,
fill=Present),
shape=21,
stroke=0.2,
size=3)+
scale_color_manual(values = c("steelblue","darkred"))+
scale_fill_gradient(low=NA,high = "darkred")+
coord_sf(xlim = c(-20,50),ylim = c(-40,60),clip = "off")+
labs(title="antiMalarial drug resistance",
caption="DataSource: GitHub hannaehrlich/maldrugres_SSA | Graphics: FG")+
ggthemes::theme_map()+
theme(text = element_text(family="Roboto Condensed"),
plot.title = element_text(size=30,hjust = 0.5,family="Roboto Condensed"),
plot.title.position = "plot",
legend.position = c(-0.6,0.1))
maldrugres_new%>%
count(drug)%>%
ggplot(aes(x=drug,y=n,fill=drug))+
geom_col()+
labs(title="Drug class imbalance",
caption="Graphics: FG")+
scale_fill_viridis_d()+
ggthemes::theme_fivethirtyeight()+
theme(legend.position = "none")
maldrugres_new%>%
group_by(country,drug)%>%
reframe(avg_drug=mean(present))%>%
ggplot(aes(x=avg_drug,y=fct_reorder(country,avg_drug)))+
geom_col(aes(fill=drug))+
scale_fill_viridis_d()+
labs(title="AntiMalarial Drug Resistance Present",
caption="Graphics: FG",
x="Average value by Country",y="")+
ggthemes::theme_fivethirtyeight()+
theme(axis.text.x = element_text(angle=0,hjust = 1))
maldrugres_new %>%
ggplot(aes(present))+
geom_density()+
labs(title="Density distribution of antimalarial drug resistance",
caption = "Graphics: FG")+
ggthemes::theme_fivethirtyeight()
maldrugres_new%>%
group_by(country,drug)%>%
reframe(avg_drug=mean(present))%>%
ggplot(aes(x=avg_drug,y=fct_reorder(country,avg_drug)))+
geom_boxplot()+
labs(title="AntiMalaria Drug Resistance Present",
caption="Graphics: FG",
x="Average value by Country",y="")+
theme(axis.text.x = element_text(angle=0,hjust = 1))+
ggthemes::theme_fivethirtyeight()
set.seed(123)
split <- initial_split(maldrugres_new)
training <- training(split)
testing <- testing(split)
cv_folds <- vfold_cv(training,v = 10)
rec_pca <- recipe(present ~., training) %>%
step_dummy(all_nominal_predictors(),keep_original_cols = F)%>%
step_corr(all_numeric_predictors())%>%
step_normalize(all_predictors())%>%
step_pca(all_predictors())
rec_pca_df <- rec_pca %>%
prep()%>%
juice()%>%
cbind(drug=training$drug,country=training$country)
rec_pca_df%>%head
rec_pca_df %>%
ggplot(aes(x=PC1,PC2,group=drug,color=drug))+
geom_point()+
geom_smooth(se=F)+
scale_color_viridis_d()+
labs(title="Principal Components Analysis",
caption = "Graphics: FG")+
ggthemes::theme_fivethirtyeight()+
theme(axis.title = element_text())
rec_pca_df %>%
ggplot(aes(x=PC1,y=fct_reorder(country,PC1),group=country))+
geom_boxplot()+
labs(title="Principal Components Analysis - boxplot",
caption = "Graphics: FG",y="")+
ggthemes::theme_fivethirtyeight()+
theme(axis.title = element_text(),
plot.title = element_text(hjust = 1))
rec_pca_df %>%
ggplot(aes(x=PC1,y=present))+
geom_point()+
scale_y_log10()+
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"))+
labs(title="Principal Components Analysis - scatterplot",
caption = "Graphics: FG",y="Present")+
ggthemes::theme_fivethirtyeight()+
theme(axis.title = element_text(),
plot.title = element_text(hjust = 0))
rec_ica %>%
ggplot(aes(x=IC1,IC2,group=drug,color=drug))+
geom_point()+
geom_smooth(se=F)+
scale_colour_viridis_d()+
labs(title="Independent Components Analysis",
caption = "Graphics: FG")+
ggthemes::theme_fivethirtyeight()+
theme(axis.title = element_text(),
plot.title = element_text(hjust = 0))
worldbank_data source: https://databank.worldbank.org/reports.aspx?source=2&series=SH.MLR.INCD.P3&country=#↩︎
In this post I'll go through some differences between Bayesian statistical packages in R. Bayesian statistics involves probabilities: the probability of an event occurring is built into the modeling procedure. It is mainly used for making inferences, and can also be used to investigate the root cause of a phenomenon, under the term causal inference.
In more detail, in Bayesian statistics the response variable is modeled against (causal) predictors by applying suitable prior distributions and the likelihood function, to finally produce a posterior distribution that should be as close as possible to the real future outcome of the response variable's distribution.
The prior distribution is the starting point; it is the probability distribution to which the future outcome is linked, such as the probability of having a girl given the probability of having had a boy.
The probability of having had a boy is the prior, while the conditional probability of having a girl is the posterior.
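As a toy numeric illustration of Bayes' rule (invented numbers, purely to show the mechanics, not data from this post):

```r
# Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B)
prior      <- 0.5    # P(A): prior probability of the outcome
likelihood <- 0.51   # P(B | A)
evidence   <- 0.51   # P(B): marginal probability of the conditioning event
posterior  <- likelihood * prior / evidence
posterior
# 0.5
```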
Briefly, here is a comparison between different R packages that use Bayesian inference to calculate the posterior probability distribution of the model.
The Stan model engine, used for model fitting and prediction in conjunction with Markov chain Monte Carlo simulation, powers the following packages:
All of these packages adapt and adjust different modeling options, aiming for the best combination of efficiency and effectiveness: removing unnecessary steps, automating repetitive tasks, and using the most suitable software tools.
This is the original source code that I have updated: https://www.jstatsoft.org/article/view/v080i01
A wide range of distributions and link functions are supported, allowing users to fit - among others - linear, robust linear, binomial, Poisson, survival, ordinal, zero-inflated, hurdle, and even non-linear models all in a multilevel context. (The Brms package)
Loading required packages
library(tidyverse)
library("brms")
library("rstanarm")
library("rethinking")
library("MCMCglmm")
Helper function to better compute the effective sample size
eff_size <- function(x) {
if (is(x, "brmsfit")) {
samples <- as.data.frame(x$fit)
} else if (is(x, "stanreg")) {
samples <- as.data.frame(x$stanfit)
} else if (is(x, "ulam")) {
samples <- as.data.frame(x@stanfit)
} else if (is(x, "stanfit")) {
samples <- as.data.frame(x)
} else if (is(x, "MCMCglmm")) {
samples <- cbind(x$Sol, x$VCV)
} else {
stop("invalid input")
}
# call an internal function of rstan
floor(apply(samples, MARGIN = 2, FUN = rstan:::ess_rfun))
}
# only used for Stan packages
iter <- 6000
warmup <- 1000
chains <- 1
adapt_delta <- 0.8
# only used for MCMCglmm
nitt <- 35000
burnin <- 10000
thin <- 5
# leads to 5000 posterior samples
prior_dye_brms <- c(set_prior("normal(0, 2000)", class = "Intercept"),
set_prior("cauchy(0, 50)", class = "sd"),
set_prior("cauchy(0, 50)", class = "sigma"))
dye_brms <- brm(Yield ~ 1 + (1 | Batch),
data = lme4::Dyestuff,
prior = prior_dye_brms,
chains = 0)
time_dye_brms <- system.time(capture.output(
dye_brms <- update(dye_brms,
iter = iter,
warmup = warmup,
chains = chains,
control = list(adapt_delta = adapt_delta))
))
# summary(dye_brms)
eff_dye_brms <- min(eff_size(dye_brms)) / time_dye_brms[[1]]
time_dye_rstanarm <- system.time(capture.output(
dye_rstanarm <- stan_glmer(Yield ~ 1 + (1 | Batch), data = lme4::Dyestuff,
prior_intercept = normal(0, 2000),
iter = iter, warmup = warmup, chains = chains,
adapt_delta = adapt_delta)
))
# summary(dye_rstanarm)
eff_dye_rstanarm <- min(eff_size(dye_rstanarm)) / time_dye_rstanarm[[1]]
d <- lme4::Dyestuff
dat <- list(
Yield = d$Yield,
Batch = d$Batch
)
dye_flist <- alist(
Yield ~ dnorm(eta, sigma),
eta <- a + a_Batch[Batch],
a ~ dnorm(0,2000),
a_Batch[Batch] ~ dnorm(0, sd_Batch),
sigma ~ dcauchy(0, 50),
sd_Batch ~ dcauchy(0, 50))
dye_rethinking <- ulam(dye_flist,
data = dat,
chains=1,
cores = 4,
sample = TRUE)
time_dye_rethinking <- system.time(capture.output(
dye_rethinking <- update(dye_rethinking,
iter = iter,
warmup = warmup,
chains = chains,
control = list(adapt_delta = adapt_delta))
))
# summary(dye_rethinking)
eff_dye_rethinking <- min(eff_size(dye_rethinking)) / time_dye_rethinking[[1]]
time_dye_MCMCglmm <- system.time(capture.output(
dye_MCMCglmm <- MCMCglmm(Yield ~ 1,
random = ~ Batch, data = lme4::Dyestuff,
thin = thin, nitt = nitt, burnin = burnin)
))
# summary(dye_MCMCglmm)
eff_dye_MCMCglmm <- min(eff_size(dye_MCMCglmm)) / time_dye_MCMCglmm[[1]]
| brms | rstanarm | rethinking | MCMCglmm |
|---|---|---|---|
| 559.55398 | 202.97177 | 3660.71429 | 34.11514 |
These days, measuring performance is relevant for many different topics. Health and fast-changing environments, including climate change, require ready tools for identifying possible future outcomes. In health, simple metrics are used to classify the state of health of a population, so that it can be compared with others near and far.
Here is a spec of my latest project, where I am collecting everything I have learned since the start of the COVID-19 pandemic into a summary of the techniques used for measuring the health status of a population during an extreme event. Many tools are available and ready to use, and I had difficulty choosing one over another. But why choose if you can combine them?
Three metrics are used for classification purposes in public health: DALYs, YLLs, and YLDs. These are, respectively, Disability-Adjusted Life Years, Years of Life Lost, and Years Lived with Disability.
Before going into the calculation detail, the definition of good health and well being is required.
The WHO constitution states:
“Health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity.”
An important implication of this definition is that mental health is more than just the absence of mental disorders or disabilities.
Let’s load the {infectious} package, still at its very early stages; a development version can be installed from GitHub:
devtools::install_github("Fgazzelloni/infectious")
It contains some interesting datasets:
The Global life tables: Glifetables. A dataset provided by the World Health Organization (WHO). Global Health Observatory data repository
infectious::Glifetables %>% head
And the Germany lung cancer: Germany_lungc. A dataset provided by the Institute for Health Metrics and Evaluation (IHME). GBD Results
infectious::Germany_lungc %>% head
Germany_lungc %>%
ggplot(aes(age_group, val, fill = sex)) +
geom_col() +
facet_wrap( ~ sex) +
scale_x_discrete(breaks = c("35-39", "65-69", "85+")) +
ggthemes::scale_fill_fivethirtyeight() +
ggthemes::theme_fivethirtyeight() +
labs(title = "Germany lung cancer -2019",
caption = "Vis: fgazzelloni|DataSource: IHME")
The combination of these two pieces of information, life expectancy and the expected number of lung cancer cases in proportion to the German population, combined by age class and split by sex, yields the YLLs, the number of years of life lost.
In this case, for Germany, data is available from the age class 10-14; if we wanted to improve this analysis, it would require some missing value imputation through feature engineering.
yll <- Germany_lungc %>%
full_join(
Glifetables %>%
filter(year == 2019,
indicator == "ex - expectation of life at age x") %>%
rename(life_expectancy = value),
by = c("age_group", "sex")
) %>%
select(-upper,-lower,-year,-indicator) %>%
group_by(age_group) %>%
mutate(yll = val * life_expectancy) %>%
filter(!is.na(yll))
yll %>%
head()
yll %>%
ggplot(aes(age_group, yll, fill = sex)) +
geom_col() +
facet_wrap( ~ sex) +
scale_x_discrete(breaks = c("35-39", "65-69", "85+")) +
ggthemes::scale_fill_fivethirtyeight() +
ggthemes::theme_fivethirtyeight() +
labs(title = "YLL - Germany lung cancer -2019",
caption = "Vis: fgazzelloni|DataSource: IHME & WHO")
To build the YLDs, the number of years lived with a disability due to a disease or injury, we need more data: the prevalence and the disability weights.
source:
Then, the sum of the YLLs and the YLDs gives the overall DALY value, the key metric used to state the health of a population. It is used to make comparisons between populations of different countries, to provide a comprehensive assessment of the impact of disease and injury on a population, and to help prioritize public health interventions and evaluate the effectiveness of public health programs.
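Putting the pieces together, the DALY calculation reduces to a simple sum. A sketch, using a hypothetical `yld` table (the disability-weight inputs are not loaded in this post):

```r
# DALY = YLL + YLD, joined by age class and sex
daly <- yll %>%
  left_join(yld, by = c("age_group", "sex")) %>%
  mutate(daly = yll + yld)
```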
Let’s now have a look at how infectious diseases can affect the DALYs. The COVID-19 pandemic has had a significant impact on DALYs metrics worldwide.
Still results are not fully available, but several of the risk factors and non-communicable diseases (NCDs) highlighted by the GBD study, including obesity, diabetes, and cardiovascular disease, are associated with increased risk of serious illness and death from COVID-19, and so, as a consequence linked with an increase of the overall level of DALYs. See The Lancet: Latest global disease estimates reveal perfect storm of rising chronic diseases and public health failures fuelling COVID-19 pandemic
COVID-19 is expected to clearly emerge as the leading cause of global DALYs in 2020.
More information on the level of findings are in this interesting article: https://doi.org/10.1016/S0140-6736(20)30925-9 on the Lancet by the GBD collaborator team.
Looking at the Global impact of some infectious diseases, such as: Ebola.
The impact of Ebola on DALYs in 2019 can be assessed by comparing the number of DALYs due to Ebola in 2019 to the DALYs caused by other diseases or conditions during the same period.
According to the Global Health Data Exchange (GHDx), the estimated global DALY rate for Ebola virus disease in 2019 was 0.0005, which is relatively low compared to other leading causes of DALYs, such as cardiovascular diseases, lower respiratory infections, and neonatal disorders. See IHME Ebola — Level 3 cause
Import data on the global burden of disease (GBD) for a given year; here I have already downloaded the csv file and saved it as RData.
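The loading step itself is not shown; assuming the file was saved like the other datasets in this post (the file name is hypothetical), it would look like:

```r
# Hypothetical file name: the GBD DALYs extract saved as RData
load("data/df_dalys_2019.RData")
```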
Let's have a look at the DALY rates, consider the average value by 5-year ranges, and then calculate the total DALYs for all ages in 2019.
avg_dalys_2019 <- mean(df_dalys_2019$val)
avg_dalys_2019
total_dalys_2019 <- sum(df_dalys_2019$val)
total_dalys_2019
Import data on infectious diseases, and select Global, Ebola, both sex.
load("data/infectious_diseases.RData")
ebola_global_2019 <- infectious_diseases %>%
filter(location_name == "Global",
sex_name == "Both",
cause_name == "Ebola") %>%
select(!contains("_id"))
# Calculate total COVID-19 DALYs for 2019
total_ebola_global_2019 <-
sum(ebola_global_2019$val)
# Calculate the percentage change in DALYs due to COVID-19
percent_change_dalys <- round((total_ebola_global_2019 / total_dalys_2019) *100,4)
# Print the percentage change in DALYs due to COVID-19
cat("Total impact of Ebola virus Globally on DALYs rates in 2019:", percent_change_dalys, "%")
However, the impact of Ebola on DALYs is more significant in certain African regions, where it affects the whole population. For example, the 2014-2016 Ebola outbreak in West Africa caused an estimated 11,000 deaths and 261,000 DALYs lost.
Overall, while the global impact of Ebola on DALYs in 2019 was relatively low, it is still an important health concern in areas where outbreaks occur, and efforts to prevent and control the disease are crucial to reducing its impact on affected populations.
This post is all about retail sales with ggstream; the dataset comes from #TidyTuesday 2022, week 50: Monthly State Retail Sales.
The picture below is the result of the ggstream visualization.
Load libraries
Set the theme
theme_set(theme_minimal(base_family = "Roboto Condensed",
base_size = 12))
theme_update(
plot.title = element_text(
size = 20,
face = "bold",
hjust = .5,
margin = margin(10, 0, 30, 0)
),
plot.caption = element_text(
size = 9,
color = "grey40",
hjust = .5,
margin = margin(20, 0, 5, 0)
),
axis.text.y = element_blank(),
axis.title = element_blank(),
plot.background = element_rect(fill = "grey88", color = NA),
panel.background = element_rect(fill = NA, color = NA),
panel.grid = element_blank(),
panel.spacing.y = unit(0, "lines"),
strip.text.y = element_text(angle = 0),
legend.position = "bottom",
legend.text = element_text(size = 9, color = "grey40"),
legend.box.margin = margin(t = 30),
legend.background = element_rect(
color = "grey40",
linewidth = .3,
fill = "grey95"
),
legend.key.height = unit(.25, "lines"),
legend.key.width = unit(2.5, "lines"),
plot.margin = margin(rep(20, 4))
)
And the color palette
pal <- c("#FFB400",
"#C20008",
"#13AFEF",
"#8E038E")
Load the data
tuesdata <- tidytuesdayR::tt_load(2022, week = 50)
coverage_codes <- tuesdata$coverage_codes
state_retail <- tuesdata$state_retail
Add the states’ names
Join all sets
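The two join steps are not shown in the original chunk; a possible sketch using the built-in `state.name`/`state.abb` vectors and the `coverage_codes` lookup (the join keys are assumptions based on the column names used later):

```r
# Attach full state names and the coverage descriptions
states <- tibble(state_abbr = state.abb, state_name = state.name)

my_df <- state_retail %>%
  left_join(states, by = "state_abbr") %>%
  left_join(coverage_codes, by = "coverage_code")
```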
Data wrangling
my_df1 <- my_df %>%
select(-naics) %>%
mutate(
coverage = case_when(
coverage == "non-imputed coverage is greater than or equal to 10% and less than 25% of the state/NAICS total" ~
"greater than or equal 10% and less than 25% of the state/NAICS total",
coverage == "non-imputed coverage is greater than or equal to 25% and less than 50% of the state/NAICS total" ~
"greater than or equal to 25% and less than 50% of the state/NAICS total",
coverage == "non-imputed coverage is greater than or equal to 50% of the state/NAICS total." ~
"greater than or equal to 50% of the state/NAICS total",
coverage == "non-imputed coverage is less than 10% of the state/NAICS total." ~
"less than 10% of the state/NAICS total",
TRUE ~ coverage
),
month = as.character(month),
year = zoo::as.yearmon(paste0(year, "-", month)),
change_yoy = ifelse(change_yoy == "S", 0, change_yoy),
change_yoy_se = ifelse(change_yoy_se == "S", 0, change_yoy_se),
change_yoy = as.numeric(change_yoy),
change_yoy_se = as.numeric(change_yoy_se),
coverage = as.factor(coverage),
coverage = paste(coverage_code, "-", coverage)
) %>%
filter(state_abbr %in% c("USA", "PA", "MD", "MT")) %>%
filter(!coverage_code == "S") %>%
group_by(state_name, coverage, year) %>%
summarise_if(is.numeric, sum, na.rm = TRUE) %>%
mutate(change_yoy = scale(change_yoy, center = FALSE)) %>%
ungroup() %>%
mutate(year = as.POSIXct(year),
year = as.Date(year))
Make the plot
my_df1 %>%
ggplot(aes(
x = year,
y = change_yoy,
color = coverage,
fill = coverage
)) +
geom_stream(
geom = "contour",
color = "white",
linewidth = 1.25,
bw = .45 # Controls smoothness
) +
geom_stream(geom = "polygon",
bw = .45,
linewidth = 0.2) +
facet_grid(state_name ~ .,
scales = "free_y",
space = "free") +
scale_y_continuous(trans = scales::modulus_trans(0.1, 1)) +
scale_x_date(date_breaks = "6 months",
date_labels = "%b-%Y",
expand = c(0, 0)) +
  scale_color_manual(values = pal,
                     guide = "none") +
scale_fill_manual(values = pal,
name = NULL) +
labs(title = "Total Year-Over-Year percent change\nin monthly retail sales value",
subtitle = "North American Industry Classification System (NAICS) top YoY states",
caption = "DataSource: #TidyTuesday 2022 Week50 | Monthly State Retail Sales | DataViz: Fgazzelloni") +
theme(legend.direction = "vertical")
ggsave("w50_retail_sales.png")
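A note on the y-axis transform used in the plot: `scales::modulus_trans(0.1, 1)` is a signed power-type transform that compresses large absolute values while preserving sign, so both strong positive and strong negative YoY swings stay readable on one axis. A quick illustration:

```r
library(scales)

tr <- modulus_trans(0.1, 1)

# Zero maps to zero, the sign of the input is preserved,
# and large values are compressed far more than small ones
tr$transform(c(-100, -1, 0, 1, 100))
```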
I made a package! Looking for data to use in one of my data visualizations, I stumbled upon this dataset about frogs in Oregon (US) and realized it was very interesting for building both classification and regression models. So, I wrapped the data into a package, and even though it is still a work in progress, I have started using it to practice machine learning algorithms.
More info about it will follow. For now, this project is all about how to use the mlr3 package for analyzing and predicting data. mlr3 is a machine learning ecosystem: it provides a unified interface for using different machine learning models in R.
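Before turning to the frog data, the core mlr3 pattern is worth sketching: you wrap data in a Task, pick a Learner, then train and predict through the same interface regardless of the underlying model. A minimal example using one of mlr3's built-in tasks (not the oregonfrogs data yet):

```r
library(mlr3)

task <- tsk("iris")                      # a built-in example task
learner <- lrn("classif.rpart")          # a decision-tree learner
learner$train(task, row_ids = 1:120)     # fit on the first 120 rows
pred <- learner$predict(task, row_ids = 121:150)
pred$score(msr("classif.acc"))           # accuracy on the held-out rows
```

Swapping in a different model (for example `classif.ranger` from mlr3learners) only changes the `lrn()` call; the rest of the workflow stays identical.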
Let’s load the libraries, and start using mlr3 with oregonfrogs data!
library(tidyverse)
library(mlr3)
# remotes::install_github("mlr-org/mlr3spatiotempcv")
library(mlr3spatiotempcv)
library(mlr3learners) # needed to initiate a new learner (such as classif.ranger)
# remotes::install_github("mlr-org/mlr3extralearners")
library(mlr3extralearners)
library(ranger)
# install.packages("apcluster")
# remotes::install_github("mlr-org/mlr3proba")
library(mlr3viz)
oregonfrogs is still in its development version, so you need to install it from GitHub:
# remotes::install_github("fgazzelloni/oregonfrogs")
library(oregonfrogs)
I changed the name of the dataset to oregonfrogs, so it is now oregonfrogs::oregonfrogs. I also added some functions, made some modifications, and left the raw data (oregonfrogs_raw) available.
Here, I take some selected variables which I’ll use in the model.
It’s a pr