rm(list=ls())
suppressMessages(library(tidyverse))
theme_set(theme_minimal())
Overview
In the world of artificial intelligence, understanding the fundamentals of neural networks is essential. One of the simplest yet most powerful architectures is the Artificial Neural Network (ANN) with a single hidden layer, often referred to as a “plain vanilla” network. This blog post aims to provide a comprehensive guide to building such a network from scratch, focusing on key concepts like activation functions, weights, and the backpropagation algorithm, without hiding anything inside library black boxes.
Example: constructing a plain vanilla network architecture
Artificial Neural Networks (ANNs) are the simplest implementation of deep learning architectures that mimic the human brain’s network of neurons. The simplest form of ANN is the single-hidden-layer network, also known as a “plain vanilla” network. This network consists of an input layer, a hidden layer, and an output layer. The hidden layer transforms the input data into a new set of features, which are then used to predict the response variable.

In the image above, we have some predictors entering the input layer, a hidden layer that transforms them into activations, and an output layer that combines the activations into the prediction.
Build synthetic data
Let’s create a synthetic dataset to demonstrate the construction of a plain vanilla network. We will generate a dataset with 60 observations and two variables, a predictor $x$ and a response $y$, using the following steps:
- Predictor: a uniformly distributed variable ranging between $[-2, 2]$
- Response: a variable defined as a function of the predictor (a sketch of this step follows the list)
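The code that builds the data is not reproduced here, so below is a minimal sketch of how such a dataset could be generated; the response function y and the seed are assumed placeholders, so treat the sketch as an illustration rather than the exact recipe behind the values printed below.

# Minimal sketch of the data-generating step; the response function
# and the seed are assumptions, not the original recipe
set.seed(123)                     # assumed seed
n <- 60                           # 60 observations
x <- runif(n, min = -2, max = 2)  # predictor: Uniform(-2, 2)
y <- function(x) sin(2 * x) / 2   # assumed smooth non-linear response
data <- tibble(y = y(x), x = x)
head(data)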
# A tibble: 6 × 2
y x
<dbl> <dbl>
1 0.859 -0.769
2 0.591 -0.969
3 0.152 0.209
4 -0.829 -1.77
5 0.733 -0.126
6 0.645 -0.0649
EDA - Exploratory Data Analysis
Let’s visualize our synthetic data:
data %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0,
             linetype = "dashed", color = "grey")
Parameter estimation
Now that we have the data, we attempt to replicate it with an artificial neural network model with a single hidden layer. The model will have the following parameters:
Model Formula:

$$f(x) = \beta_0 + \sum_{k=1}^{K} \beta_k \, g\big(w_{k0} + w_{k1}\, x\big)$$

Where $g(\cdot)$ is the activation function, $w_{k0}$ and $w_{k1}$ are the hidden-layer weights (including a bias term), $\beta_0, \dots, \beta_K$ are the output-layer weights, and $K$ is the number of hidden neurons. The hidden layer computes a number of activations, $a_k = g(w_{k0} + w_{k1}\, x)$.
Number of activations
Initialize the number of hidden neurons
hidden_neurons <- 5
The number of hidden neurons is a hyperparameter that needs to be tuned: more neurons make the model more flexible, but too many can lead to overfitting. You can think of each hidden neuron as a unit connecting the input layer to the output layer; the more neurons, the more connections (weights), and the more complex the model.
We have chosen 5 hidden neurons in this example. This means that the hidden layer will compute 5 different linear combinations of the input $x$. The function that takes each linear combination and transforms it into an output signal is the activation function.
Activation Function
The activation function is a non-linear transformation of the hidden layer’s linear combinations, much like the sigmoid used in logistic regression to convert a linear function into probabilities between zero and one.

To be more explicit, the activation function takes the input signal and generates an output signal while accounting for a threshold, meaning that the neuron is only activated if the signal is above a certain level.

We list several activation functions below to compare their behaviour, and we will use the sigmoid function as the activation in this example.
Types of Activation Functions
There are several types of activation functions, each with its own characteristics, but they all have in common that they introduce non-linearity into the output of the hidden layer.
Some of the most common types of activation functions are:
- Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- ReLU (Rectified Linear Unit) function: $\mathrm{ReLU}(x) = \max(0, x)$
- SoftPlus function: $\mathrm{softplus}(x) = \log(1 + e^{x})$. This is a smooth approximation to the ReLU function. First introduced in 2001, SoftPlus is an alternative to traditional functions because it is differentiable and its derivative is easy to compute (see source: https://sefiks.com/2017/08/11/softplus-as-a-neural-networks-activation-function/).
- Other types are polynomials/splines and the hyperbolic tangent, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.
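The plots below call sigmoid(), relu(), and softplus(), whose definitions are not reproduced here; as a minimal sketch, they can be defined directly from the standard formulas above.

# Minimal sketch of the activation functions, following the standard
# formulas above; the names match the calls used in the plots below
sigmoid  <- function(x) 1 / (1 + exp(-x))  # maps any input to (0, 1)
relu     <- function(x) pmax(0, x)         # rectified linear unit
softplus <- function(x) log(1 + exp(x))    # smooth approximation of ReLU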
Let’s compare the activation functions:
data %>%
  mutate(z = sigmoid(x)) %>%
  ggplot() +
  geom_line(aes(x, z)) +
  ylim(0, 1)

data %>%
  ggplot() +
  geom_line(aes(x, relu(x) * 1/2.4))

data %>%
  ggplot() +
  geom_line(aes(x, softplus(x)))
And now look at how the Sigmoid differs from the ReLU function:
Our model is therefore a linear model in the derived features: as you might have noticed in the formula above, the output is a linear combination of the activations, and each activation is itself a non-linear transformation of a linear combination of the input $x$.
Let’s have a look at the weights and how they are initialized.
Weight Initialization
The weights are the parameters of the model that are adjusted during the training process. They can be considered as the coefficients of the hidden layer model.
They are initialized
randomly, and the model is trained to adjust these weights during the training process. The weights are adjusted using the backpropagation algorithm
, which computes the gradient of the loss function
with respect to the weights. Then, the weights are updated using the gradient descent algorithm
. We will see how this is done in the next section.
The weights are initialized randomly to break the symmetry and prevent the model from getting stuck in a local minimum. In this case we use a normal distribution with a mean of 0 and a standard deviation of 1 to initialize the weights.
Randomly initializing the weights as i.i.d. draws from a $N(0, 1)$ distribution, including the constant (bias) term in each weight matrix. The model derives five new features by computing five different linear combinations of the input $x$, as sketched below.
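As a minimal sketch of this initialization (the dimensions are inferred from how w1 and w2 are used in the feedforward and backpropagation functions later, and the seed is an assumption):

# Minimal sketch of the random weight initialization
set.seed(42)  # assumed seed, for reproducibility
# w1: one row per hidden neuron, columns = (bias, input weight)
w1 <- matrix(rnorm(hidden_neurons * 2, mean = 0, sd = 1),
             nrow = hidden_neurons, ncol = 2)
# w2: a single output row, columns = (bias, one weight per activation)
w2 <- matrix(rnorm(hidden_neurons + 1, mean = 0, sd = 1),
             nrow = 1, ncol = hidden_neurons + 1)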
data %>%
  ggplot(aes(x, y)) +
  geom_point(shape = 21,
             stroke = 0.5,
             fill = "grey",
             color = "grey20") +
  geom_line(linewidth = 0.2) +
  geom_smooth(method = "lm",
              color = "steelblue",
              se = FALSE) +
  geom_line(aes(x, sigmoid(y)),
            linetype = "dashed",
            color = "steelblue")
FeedForward
The term feedforward describes the process of moving the input data through the network to obtain the predicted output. The feedforward pass is the first step in the training process of the neural network.
Here is a function that computes the output of the model given the input data, the weight matrices, and the activation function. It computes the output by multiplying the input data by the weights and applying the activation function to the result. It relies on matrix multiplication (%*%), a core linear-algebra operation used throughout machine learning.
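The function itself is not reproduced here, so below is a minimal sketch of feedForward(), reconstructed to be consistent with how it is called in the rest of the post; treat it as an assumption rather than the original code.

# Minimal sketch of feedForward(), consistent with later calls
feedForward <- function(x, w1, w2, activation) {
  preds <- rep(NA, length(x))
  for (i in 1:length(x)) {
    # Hidden layer: linear combination of (1, x[i]) for each neuron
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    # Output layer: linear combination of (1, activations)
    preds[i] <- w2 %*% matrix(rbind(1, activation(a1)), ncol = 1)
  }
  return(preds)
}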
Derivative of the Activation Function
Now that we have the feedforward function, we need to compute the derivative of the activation function: the backpropagation algorithm multiplies by this derivative when applying the chain rule to propagate the error back through the network.

Here is a recap of the definition of the derivative, which is applied every time the output of an activation function is encountered in the network, and which is how a new minimum of the cost is searched for (this will become clearer by the end of the post):

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

So, it is fundamental to define the derivative of the activation function needed for computing the gradient. For this example, we will use the derivative of the sigmoid function, $\sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)$.
derivativeActivation <- function(x) {
  g <- sigmoid(x) * (1 - sigmoid(x))
  return(g)
}
Model Error
The function for computing the model error returns the sum of squared errors (SSE) between the predicted and actual values.
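Below is a minimal sketch of modelError(), assumed from the description above and from how it is called inside the SGD function later.

# Minimal sketch of modelError(): sum of squared errors between the
# predictions and the observed responses (signature assumed from usage)
modelError <- function(x, y, w1, w2, activation) {
  preds <- feedForward(x, w1, w2, activation)
  sum((y - preds)^2)
}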
Back-Propagation
So, this is the time for computing the gradients.
What are the gradients? They are the derivatives of the cost function with respect to the weights. The backpropagation algorithm computes the gradient of the loss function with respect to the weights, and the gradients are then used to update the weights with the gradient descent algorithm.
backPropagation <- function(x, y, w1, w2,
                            activation, derivativeActivation) {
  # Predicted values
  preds <- feedForward(x, w1, w2, activation)
  # Derivative of the cost function (first term)
  derivCost <- -2 * (y - preds)
  # Gradients for the weights
  dW1 <- matrix(0, ncol = 2, nrow = nrow(w1))
  dW2 <- matrix(rep(0, length(x) * (dim(w2)[2])), nrow = length(x))
  # Computing the gradient for W2
  for (i in 1:length(x)) {
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    da2dW2 <- matrix(rbind(1, activation(a1)), nrow = 1)
    dW2[i, ] <- derivCost[i] * da2dW2
  }
  # Computing the gradient for W1
  for (i in 1:length(x)) {
    a1 <- w1 %*% matrix(rbind(1, x[i]), ncol = 1)
    da2da1 <- derivativeActivation(a1) * matrix(w2[, -1], ncol = 1)
    da2dW1 <- da2da1 %*% matrix(rbind(1, x[i]), nrow = 1)
    dW1 <- dW1 + derivCost[i] * da2dW1
  }
  # Storing gradients for w1, w2 in a list
  gradient <- list(dW1, colSums(dW2))
  return(gradient)
}
Stochastic Gradient Descent
Defining our Stochastic Gradient Descent algorithm, which will adjust our weight matrices.
SGD <- function(x, y, w1, w2, activation, derivative, learnRate, epochs) {
  # Empty vector to store SSE values: initial error plus one value per epoch
  SSEvec <- rep(NA, epochs + 1)
  SSEvec[1] <- modelError(x, y, w1, w2, activation)
  for (j in 1:epochs) {
    for (i in 1:length(x)) {
      gradient <- backPropagation(x[i], y[i], w1, w2, activation, derivative)
      # Adjusting the model parameters one observation at a time
      w1 <- w1 - learnRate * gradient[[1]]
      w2 <- w2 - learnRate * gradient[[2]]
    }
    # Storing the SSE value after each epoch
    SSEvec[j + 1] <- modelError(x, y, w1, w2, activation)
  }
  # List holding the fitted model parameters (w1, w2)
  B <- list(w1, w2)
  result <- list(B, SSEvec)
  return(result)
}
Modeling
Running the SGD function to obtain our optimized model and parameters:
model <- SGD(x, y(x), w1, w2,
             activation = sigmoid,
             derivative = derivativeActivation,
             learnRate = 0.01,
             epochs = 200)
Obtaining our adjusted SSE’s for each epoch:
SSE <- model[[2]]
Model Visualization
Plotting the SSE from each epoch against the number of epochs:
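As a minimal sketch of this plot, using the SSE vector extracted above (the original plotting code is not reproduced here):

# Minimal sketch of the training-error plot: SSE after each epoch
tibble(epoch = seq_along(SSE) - 1, SSE = SSE) %>%
  ggplot(aes(epoch, SSE)) +
  geom_line(color = "steelblue") +
  labs(title = "SSE by epoch",
       x = "Epoch", y = "Sum of Squared Errors")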
Parameter optimization
Extracting our new parameters
from our model.
new_w1 <- model[[1]][[1]]
new_w2 <- model[[1]][[2]]
Comparing our old weight matrices against the new ones.
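A minimal sketch of this comparison, assuming w1 and w2 still hold the initial random values in the session while new_w1 and new_w2 hold the trained ones:

# Minimal sketch of the comparison between initial and trained weights
round(cbind(initial = as.vector(w1), trained = as.vector(new_w1)), 3)
round(cbind(initial = as.vector(w2), trained = as.vector(new_w2)), 3)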
New Predictions
Obtaining our new predictions using our optimized parameters.
y_pred <- feedForward(x, new_w1, new_w2, sigmoid)
Plotting training data against our model predictions
data %>%
  mutate(y_pred = y_pred) %>%
  pivot_longer(cols = c(y, y_pred)) %>%
  ggplot(aes(x, value, group = name, color = name)) +
  geom_point(shape = 21, stroke = 0.5) +
  geom_line() +
  scale_color_discrete(type = c("steelblue", "red")) +
  labs(title = "Target Response vs. Predictions",
       x = "Observations",
       y = "Responses")
Resources
- The code used for this example is customized from the tristanoprofetto GitHub repository: https://github.com/tristanoprofetto/neural-networks/blob/main/ANN/Regressor/feedforward.R
- StatQuest: Neural Networks Pt. 1: Inside the Black Box: https://www.youtube.com/watch?v=CqOfi41LfDw
- Other resources: https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464
Contacts and Information: fede.gazzelloni@gmail.com