Erasmus students exchange

Published

March 15, 2022

Overview

This week 10 of #TidyTuesday 2022 theme is #Erasmus students exchange in the European countries.

The data set is from Erasmus student mobility, Data.Europa.eu and Wimdu.co to discover the most popular Erasmus destinations.

The idea is to make a network of sending and receiving countries, let’s have a look at the data.

if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, ggbump, cowplot, wesanderson)
erasmus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-08/erasmus.csv')

The set is made of information about students, such as the age, the nationality, the lenght of stay, the gender, academic year, and others. I selected some of them, to extract the information I needed to make the network.

df <- erasmus%>%
  select(sending_country_code,receiving_country_code,
         participant_nationality,participants,
         participant_age,
         academic_year,mobility_duration,
         participant_gender)

kableExtra::kable(head(df)) 

Looking at the participant_age we see that we have some misleading data:

df %>% pull(participant_age) %>% summary()

For this reason the best way is to filter students between 17 and 28 years old. Also, mobility_duration is quite surprising:

df %>% pull(mobility_duration) %>% summary()

The median value of the students’ stay is ONE day, while the mean is just a little above TWO days. Very few students stay more than 10 days, but someone reaches a max of 273 days (39 weeks).

df %>% count(mobility_duration) %>% mutate(perc = round(n/sum(n)*100,2)) %>% head()
df %>% count(mobility_duration) %>% mutate(perc = round(n/sum(n)*100,3)) %>% tail()

Student participants are almost all solo participants as the median shows to be ONE student per observation, TWO students on average, with a max value of 279. So that, to have a picture of the phenomenon select the average value of the student participants as representative.

df %>% pull(participants) %>% summary(participants)

Finally, gender, Females are slightly more than males, just a little above 50%.

tbl <-df %>% pull(participant_gender) %>% table()
cbind(n=tbl,pct=round(prop.table(tbl)*100,2))

This is our new dataset on which we will build our network.

df <- df %>%
  group_by(academic_year) %>%
  filter(between(x = participant_age,17,28),
         mobility_duration>3) %>%
  summarise(m_participants=mean(participants),
            sending_country_code,receiving_country_code,
            participant_gender,.groups="drop") %>%
  ungroup() %>%
  select(-m_participants) %>%
  distinct()
kableExtra::kable(head(df))%>%
  kableExtra::kable_styling(latex_options = "scale_down")

At this point I’d like to have the full country’s name, and use {ISOcodes} package. I do that because I’d like to make a spatial visualization as well. The package contains the values for the countries’ abbreviations coded as “Alpha_2”. I needed to adjust UK and Greece. To verify this you might need to use the count() function and the str_detect() a couple of times before identifying all the values that needs an adjustment.

library(ISOcodes)
isocodes<-ISOcodes::ISO_3166_1
isocodes2 <- isocodes%>%
  mutate(Alpha_2=case_when(Alpha_2=="GB"~"UK",
                           Alpha_2=="GR"~"EL",
                           TRUE ~ Alpha_2))

Some more manipulations for selecting just the countries in the dataset.

sending_country_code <- df %>% count(sending_country_code) %>% select(-n) %>% unlist()
receiving_country_code <- df %>% count(receiving_country_code) %>% select(-n) %>% unlist()

sending <- isocodes2 %>% filter(Alpha_2 %in% sending_country_code)
receiving <- isocodes2 %>% filter(Alpha_2 %in% receiving_country_code)

sending_unlst <- sending %>% count(Name) %>% select(-n) %>% unlist()
receiving_unlst <- receiving %>% count(Name) %>% select(-n) %>% unlist()

{ggplot2} package provides spatial data, as you might notice, students of the Erasmus programs come from all over the World. Still, some countries’ name adjustments are needed.

world <- map_data("world")%>%
  filter(!region=="Antarctica")%>%
  mutate(region=case_when(region=="UK"~ "United Kingdom",
                          region=="Czech Republic"~"Czechia",
                          region=="Moldova"~"Moldova, Republic of",
                          region=="Palestine"~"Palestine, State of",
                          region=="Russia" ~ "Russian Federation",
                          TRUE ~ region))
sending_geo <- world %>% filter(region %in% sending_unlst)

sending_geo_full <- sending %>%
  select(Alpha_2,Name) %>%
  left_join(sending_geo,by = c("Name"="region"))


receiving_geo <- world %>% filter(region %in% receiving_unlst)

receiving_geo_full <- receiving %>%
  select(Alpha_2,Name) %>%
  left_join(receiving_geo,by = c("Name"="region"))

Then finally, make the centroids. We will use this in the visualization further below.

sending_geo_centroids <- sending_geo_full %>%
  group_by(Name) %>%
  mutate(avg_long = mean(range(long)),avg_lat = mean(range(lat))) %>%
  count(Alpha_2,Name,avg_long,avg_lat)

receiving_geo_centroids <- receiving_geo_full %>%
  group_by(Name) %>%
  mutate(avg_long = mean(range(long)),avg_lat = mean(range(lat))) %>%
  count(Alpha_2,Name,avg_long,avg_lat)

df2 is our new data set, made of countries’ names and spatials:

df2 <- df %>%
  left_join(sending_geo_centroids,by=c("sending_country_code"="Alpha_2")) %>%
  left_join(receiving_geo_centroids,by=c("receiving_country_code"="Alpha_2")) %>%
  rename(sending_country_name=Name.x,receiving_country_name=Name.y,
         avg_long_s=avg_long.x,avg_lat_s=avg_lat.x,
         avg_long_r=avg_long.y,avg_lat_r=avg_lat.y)%>%
  select(-n.x,-n.y)
df2 %>% head

order_sending and order_receiving will be very useful for a third visualization in which all the sending countries will be matched with students’ destinations. We will see this further below.

order_sending <- df2%>%
  count(sending_country_name,sort=TRUE)%>%
  mutate(index_sending=seq(1,length(sending_country_name),1),
         index_sending=rev(index_sending))%>%
  drop_na()

order_receiving <- df2%>%
  count(receiving_country_name,sort=TRUE)%>%
  mutate(index_receiving=seq(1,length(receiving_country_name),1),
         index_receiving=rev(index_receiving))%>%
  drop_na()

And here is the data set that we will use to make the rank network:

erasmus_network <- df2 %>%
  select(academic_year,
         sending_country_name,receiving_country_name) %>%
  distinct() %>%
  left_join(order_sending,by="sending_country_name") %>%
  left_join(order_receiving,by="receiving_country_name") %>%
  mutate(group = glue::glue("{sending_country_name}-{receiving_country_name}")) %>%
  distinct() %>%
  drop_na() %>%
  arrange(index_sending) %>%
  mutate(year_id=case_when(academic_year=="2014-2015"~1,
                           academic_year=="2015-2016"~2,
                           academic_year=="2016-2017"~3,
                           academic_year=="2017-2018"~4,
                           academic_year=="2018-2019"~5,
                           academic_year=="2019-2020"~6)) %>%
  relocate(year_id) %>%
  arrange(year_id)

erasmus_network %>% head

Load the packages for setting a nice font.

library(showtext)
library(sysfonts)
library(extrafont)

showtext::showtext_auto()
showtext::showtext_opts(dpi=320)

font_add_google(name="Noto Sans",family="notosans")
erasmus_network%>%
  arrange(index_sending)%>%
  ggplot()+
  
  geom_text(aes(x = -2, y = index_sending+5, label = sending_country_name), 
            vjust=0, 
            hjust="left", color = "darkred", size = 3) +
  
  ggbump::geom_sigmoid(aes(x = -2, xend = 16.1, 
                           y = index_sending+5, yend =index_receiving+18, 
                           group=factor(group),color=receiving_country_name), 
                       alpha = .6, smooth = 10, size = 0.1,show.legend = F) +
  
  geom_text(aes(x = 16, y = index_receiving+17.5, label = receiving_country_name), 
            vjust=-1.5, hjust="right", color = "darkred", size = 3) +
  coord_cartesian()+
  theme_void()

Or a simplified version:

network

Let’s select Top 5 sending countries:

erasmus_network2 <- erasmus_network %>%
  filter(sending_country_name%in%c("Italy",
                                   "Germany",
                                   "United Kingdom",
                                   "Romania","Spain")) %>%
  mutate(sending_country_name=case_when(sending_country_name=="United Kingdom"~"UK",
                                        TRUE~sending_country_name))%>%
  count(year_id,academic_year,sending_country_name) %>%
  group_by(academic_year)%>%
  mutate(rank=rank(x=n))%>%
  ungroup()

erasmus_network2 %>% head()
library(ggthemes)

ggplot(erasmus_network2,
       mapping=aes(academic_year,rank,
                   group=factor(sending_country_name),
                   color=factor(sending_country_name)))+ 
  geom_point(size = 7) +
  geom_text(data = erasmus_network2 %>% filter(year_id == min(year_id)),
            aes(x = year_id - .1, 
                label = sending_country_name), size = 4, hjust = 1) +
  geom_text(data = erasmus_network2 %>% 
              filter(year_id == max(year_id)),
            aes(x = year_id + .1, label = sending_country_name), 
            size = 4, hjust = 0,check_overlap = T) +
  geom_bump(size = 2, smooth = 8) +
  labs(y = "RANK",
       x = "Academic Year",
       title="Erasmus Top 5 student exchange countries",
       subtitle="Ranks of the highest sending frequency",
       caption="DataSource: Erasmus student mobility | Data.Europa.eu | Wimdu.co\nDataViz: Federica Gazzelloni | #TidyTuesday Week 10 Erasmus") +
  scale_y_reverse() +
  scale_color_manual(values = wesanderson::wes_palette(5, name = "Royal2"))+
  cowplot::theme_minimal_grid(font_size = 14, line_size = 0) +
  theme(legend.position = "none",
        panel.grid.major = element_blank(),
        plot.title = element_text(color="#ffc7ba"),
        plot.subtitle = element_text(color="#ffc7ba"),
        plot.caption = element_text(color="#ffc7ba",size=8),
        axis.text = element_text(color="#ffc7ba"),
        axis.title = element_text(color="#ffc7ba"),
        plot.background = element_rect(color="black",fill="black"),
        panel.background = element_rect(color="black",fill="black"))

Here is the final part of this post, I set the spatials for making a map visualizaton of the sending to receiving countries.

sending_geo_full2<-sending_geo_full%>%mutate(direction="sending")
receiving_geo_full2<-receiving_geo_full%>%mutate(direction="receiving")

geo_full <-rbind(sending_geo_full2,receiving_geo_full2)
centr_s<- sending_geo_centroids%>%mutate(direction="Sending")
centr_r<- receiving_geo_centroids%>%mutate(direction="Receiving")
centroids<-rbind(centr_s,centr_r)%>%
  mutate(direction=as.factor(direction))
  
levels(centroids$direction)<-c("Sending","Receiving")
geo_full2<- geo_full%>%
  mutate(direction=as.factor(direction))

levels(geo_full2$direction)<-c("Receiving","Sending")

geo_full2$direction<-relevel(geo_full2$direction,ref="Sending")
ggplot(geo_full2)+
  geom_polygon(data = world,
               aes(x=long,y=lat,group=group),fill="grey78",color="grey5")+
  
  geom_polygon(aes(x=long,y=lat,group=group,fill=direction),alpha=0.3)+
  
  geom_point(data=centroids,
             aes(x=avg_long, y=avg_lat,color=direction,shape=direction))+
 
  coord_map("ortho", orientation = c(33.366449, 24.022840, 0))+
  facet_wrap(vars(direction))+
  scale_x_continuous("Latitude", expand=c(0,0)) +
  scale_y_continuous("Longitude", expand=c(0,0)) +
  theme_void()+
  theme(legend.position = "none")
Back to top