About

This tutorial is an overview of trafficaccidents data package. The package contains information about traffic accidents in Slovenia examined by the police from 2005 to 2016. Original files11 The original data is published on http://www.policija.si/ were converted to R data frames without changes, except for:

Additional accident geo positions in WGS 84
Additional calendar table (with local holidays)
Translation to English

The tutorial was created using R language (R Core Team ⊕2017R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.) and several R extensions. See the list of all used packages at the end of the document.

Note that this is not a thorough analysis of traffic accidents. The only goal is to bring the data and visualization tools to a larger group of users.

Data Structure

Information about accidents is stored in three data frames:

event contains accident events with date, time, position and other accident attributes.
party contains attributes of the parties involved.
calendar contains all dates from 2005 to 2016.

Data structure

Accident Locations

The majority of accidents (93%) are geocoded. Use pos_x and pos_y columns for geographic location.22 Columns pos_x and pos_y are geocoded in D48/GK coordinate reference system. Use columns lon and lat for WGS84 For example, we can plot all accidents locations as points, coloured by road category (road_type):

library(dplyr)
library(ggplot2)
library(viridis)
library(trafficaccidents)

dat_roads <- 
  event %>% 
  filter(!is.na(road_type), !is.na(pos_x)) %>% 
  arrange(desc(road_type))

ggplot(data = dat_roads, aes(pos_y, pos_x, color = road_type) ) +
  geom_point(alpha = 0.03, size = 0.2) +
  coord_fixed() +
  scale_color_viridis(discrete = TRUE, end = 0.7, option = "A") +
  guides(colour = guide_legend(
    override.aes = list(alpha = 1, size = 2), title = "Road category")) +
  theme_void() +
  theme(legend.position = c(0.95,0.3))

Traffic accidents locations by road category

Time

Working days have different time distribution than weekends and holidays.

days_count <-
  calendar %>% 
  group_by(year, working_day) %>% 
  summarise(days = n())

accidents_per_hour <- 
  event %>% 
  left_join(calendar, by = "date") %>% 
  group_by(year, working_day, hour) %>% 
  count %>% 
  left_join(days_count, by = c("year", "working_day")) %>% 
  mutate(p = n / days)

ggplot(accidents_per_hour, aes(x = hour, y = p)) +
    geom_point(alpha = 0.2, size = 0.2, position = position_jitter(0.1)) +
    geom_smooth(method = "loess", span = 1/3, color = "gray") +
    scale_x_continuous(breaks = 1:7*3, name = "Hour", minor_breaks = NULL) +
    scale_y_continuous(limits = c(0, NA), breaks = 0:6 * 2, 
                       name = "Average number of accidents") +
    coord_cartesian(y=c(0, 9)) +
    facet_wrap(~ working_day, ncol = 1, scales = "fixed") + 
    theme_minimal()

Commuting Pattern

With “joy plot” from ggjoy package (Wilke ⊕2017Wilke, Claus O. 2017. Ggjoy: Joyplots in ’Ggplot2’. https://CRAN.R-project.org/package=ggjoy.) we can discover the differences of time distributions between road categories. The peaks at rush hour are more visible on roads than on city streets.

Time distribution of accidents on working day

library(ggjoy)

dat_rt_wd <- 
  event %>% 
  left_join(calendar, by = "date") %>% 
  filter(
    working_day == "Working day", 
    !is.na(road_type)
  )

ggplot(dat_rt_wd) +
  geom_joy(aes(x = hour, y = road_type, fill = road_type), alpha = 0.5 ) +
  scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL,
                     limits = c(0,23), name = "Hour") +
  scale_fill_viridis(discrete = TRUE, end = 0.7, option = "A")+
  scale_y_discrete(limits = rev(levels(dat_rt_wd$road_type)), name = NULL)+
  theme_minimal() +
  theme(legend.position = "none")

Time and Place

With animation it is possible to see accident locations by hour. The following animation was created with gganimate package (Robinson ⊕2016Robinson, David. 2016. Gganimate: Create Easy Animations with Ggplot2. http://github.com/dgrtwo/gganimate.).

event_wd <-
  event %>%
  filter(!is.na(pos_x)) %>%
  mutate(alpha = 0.2) %>%
  left_join(calendar, by = "date") %>%
  filter(working_day == "Working day")

shift_hour <- function(x, n, a) {
  mutate(x, alpha = a, hour = (hour + n) %% 24)
}

dat_wd_shift <-
  bind_rows(
    event_wd, shift_hour(event_wd, 1, 0.1), shift_hour(event_wd, 2, 0.07)
  ) %>%
  arrange(hour, desc(road_type))

clock <- list(x = 378938 + 30000, y = 191590 - 10000, width = 8500)

p1 <-
  ggplot(data = dat_wd_shift, aes(frame = hour)) +
  geom_point(aes(pos_y, pos_x, alpha = alpha, color = road_type), size = 0.4) +
  scale_alpha_identity() +
  scale_color_viridis(discrete = TRUE, end = 0.7, option = "A", guide = "none",
                      limits = levels(dat_wd_shift$road_type)) +
  coord_fixed() +
  geom_text(aes(label = sprintf("%02d", hour)), color = "gray",
            x = clock$x, y = clock$y, size = 9) +
  geom_segment(size = 2, aes(
    x = clock$x + clock$width * sin(hour/6 * pi),
    y = clock$y + clock$width * cos(hour/6 * pi),
    xend = clock$x + clock$width*1.4 * sin(hour/6 * pi),
    yend = clock$y + clock$width*1.4 * cos(hour/6 * pi),
    color = levels(dat_wd_shift$road_type)[as.integer(hour >= 12)*8+1]
  )) +
  theme_void()


library(gganimate)
library(animation)

magickPath <- shortPathName("magick.exe")
ani.options(convert=magickPath)
#ani.options(ani.width = 1024, ani.height = 640+55)
ani.options(ani.width = 1024*1.3, ani.height = (640+55)*1.3)
ani.options(interval = 0.1)
gganimate(p1, "docs/animation/accidents_animation.gif", title_frame = FALSE )

Severity

All accidents are marked with injury severity (column injury). The plot below compares time distributions of accidents for different levels of severities.

Distribution of accidents over time

dat <- 
  event %>% 
  left_join(calendar, by = "date")
  


ggplot(dat) +
  geom_joy(aes(x = hour, y = injury, alpha = injury), fill = "darkred") +
  scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL, 
                     limits = c(0,23), name = "Hour") +
  scale_y_discrete(limits = rev(levels(dat$injury)), name = NULL)+
  scale_alpha_discrete(range = c(0.06, 0.7), guide = "none") +
  theme_minimal() +
  facet_wrap(~working_day, ncol = 1)

Cause

Police classifies each accident by main cause. The plot below represent the number of accidents as a circle size and injury severity as color opacity for different accident causes:

dat_type <- 
  event %>% 
  group_by(cause, injury) %>% 
  count 

ggplot(dat_type, aes(x = 1, y=1, size = sqrt(n) / 7, alpha = injury)) +
  scale_size_identity() +
  geom_point(color = "darkred") +
  scale_y_discrete(limits = rev(levels(dat_type$cause))) +
  coord_fixed(0.7) +
  scale_alpha_discrete(range = c(0.06, 0.7)) +
  guides(alpha = guide_legend(override.aes = list(size = 6))) +
  theme_void(base_size = 12) +
  #theme(strip.text = element_text(size = 10)) +
  labs(x = NULL, y = NULL, alpha = "Severity") +
  facet_wrap(~cause, ncol = 3)

Causes and Accident Types

The accident_type column defines what happened (e.g. side collision or vehicle rollover). The plot below explores its relation to accident cause.

dat_cause_type <- 
  event %>% 
  group_by(cause, event_type, injury) %>% 
  count

ggplot(dat_cause_type, 
       aes(x = event_type, y=cause, size = sqrt(n) / 7, alpha = injury)) +
  geom_point(color = "darkred") +
  scale_size_identity() +
  scale_alpha_discrete(range = c(0.06, 0.7)) +
  guides(alpha = guide_legend(override.aes = list(size = 10))) +
  scale_x_discrete(position = "top") +
  scale_y_discrete(limits = rev(levels(dat_cause_type$cause))) +
  coord_fixed() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle=90,hjust = 0, vjust = 0.3, size = 12),
    axis.text.y = element_text(size = 12),
    legend.position = "bottom"
  ) +
  labs(x = "Accident type", y = "Cause", alpha = "Severity")

Parties

There can be several parties involved in same accident. The data includes also some cases where there are several parties at fault for the same accident. Here the events_party data frame is created with single event and some attributes from one party at fault.

party_at_fault <- 
  party %>% 
  filter(accident_role == "At fault") 

other_party <- 
  party %>% 
  filter(accident_role != "At fault") 

events_party <- 
  party_at_fault %>% 
  group_by(src_file, event_id) %>%
  summarise(
    age = first(age),
    gender = first(gender),
    alco =  first(alcotest)  > 0, 
    experience = first(experience_y) + first(experience_m)/12,
    traffic_role = first(traffic_role)
  )

events_party %>% 
  group_by(traffic_role) %>% 
  summarise(accidents = n()) %>% 
  arrange(desc(accidents)) %>% 
  head(8) %>% 
  kable

traffic_role	accidents
Car driver	192340
Truck driver	24174
Cyclist	7862
Motorcycle driver	5290
Motorized bicycle driver	3344
Bus driver	1704
Pedestrian	1637
Tractor driver	1492

Car Driver’s Age

Most frequent parties at fault are car drivers. Is it true, that there are more accidents involving young drivers, driving at night?

hour_intervals <- setNames(
  c(0, 6, 22), 
  c("Night (22h - 06h)", "Day (6h-21h)", "Night (22h - 06h)")
)

events_party_car_drivers <-
  events_party %>% 
  filter(traffic_role %in% c("Car driver")) %>% 
  filter(age >= 18) %>% 
  inner_join(event, by = c("src_file", "event_id")) %>% 
  inner_join(calendar, by = "date") %>% 
  mutate(
    hour_interval = findInterval(hour, hour_intervals),
    hour_interval = names(hour_intervals)[hour_interval],
    age = round(age) 
  )

events_party_car_drivers %>% 
  filter(date >= as.Date("2011-01-01")) %>% 
  group_by(age, hour_interval, src_file) %>% 
  count %>% 
  
  ggplot(aes(x = age, y = n)) +
  #geom_point(alpha = 0.7) +
  geom_point(alpha = 0.5, mapping = aes(color = src_file)) +
  geom_smooth(alpha = 0.2, method = "loess") +
  facet_wrap(~hour_interval, scales = "free_y") +
  labs(x = "Age", y = "Number of accidents", color = NULL) +
  scale_x_continuous(breaks = 1:10*10, limits = c(18, 85)) +
  scale_color_brewer(palette = "Blues") +
  theme_bw()

Number of accidents per year by car driver’s age

Alcohol Involved Accidents by Age

Alcohol involved accidents, number of accidents by age at day/night:

events_party_car_drivers %>% 
  filter(date >= as.Date("2011-01-01"), alco) %>% 
  group_by(age, hour_interval, src_file) %>% 
  count %>% 
  
  ggplot(aes(x = age, y = n)) +
  #geom_point(alpha = 0.7) +
  geom_point(alpha = 0.3, mapping = aes(color = src_file)) +
  geom_smooth(method = "loess", color = "orange", span = 0.6) +
  facet_wrap(~hour_interval) +
  labs(x = "Age", y = "Number of accidents", color = NULL) +
  scale_x_continuous(breaks = 1:10*10, limits = c(18, 80)) +
  scale_color_brewer(palette = "Oranges") +
  theme_bw()+
  theme(legend.position = "none")

Number of alcohol induced accidents per year, by car driver age

R Packages

knitr 1.16

Xie Y (2017). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.16, http://yihui.name/knitr/.

rmarkdown 1.5

Allaire J, Cheng J, Xie Y, McPherson J, Chang W, Allen J, Wickham H, Atkins A, Hyndman R and Arslan R (2017). rmarkdown: Dynamic Documents for R. R package version 1.5, https://CRAN.R-project.org/package=rmarkdown.

tidyr 0.6.2

Wickham H (2017). tidyr: Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions. R package version 0.6.2, https://CRAN.R-project.org/package=tidyr.

dplyr 0.5.0

Wickham H and Francois R (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0, https://CRAN.R-project.org/package=dplyr.

ggjoy 0.2.0

Wilke C (2017). ggjoy: Joyplots in ‘ggplot2’. R package version 0.2.0, https://CRAN.R-project.org/package=ggjoy.

viridis 0.4.0

Garnier S (2017). viridis: Default Color Maps from ‘matplotlib’. R package version 0.4.0, https://CRAN.R-project.org/package=viridis.

viridisLite 0.2.0

Garnier S (2017). viridisLite: Default Color Maps from ‘matplotlib’ (Lite Version). R package version 0.2.0, https://CRAN.R-project.org/package=viridisLite.

ggplot2 2.2.1

Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-0-387-98140-6, http://ggplot2.org.

datamodelr 0.2.1.9001

Bergant D (2017). datamodelr: Define and Plot Data Model Diagrams. R package version 0.2.1.9001, https://github.com/bergant/datamodelr.

trafficaccidents 0.1.0

Bergant D (2018). trafficaccidents: Traffic Accidents in Slovenia. R package version 0.1.0, http://github.com/bergant/trafficaccidents.

tufte 0.2

Xie Y and Allaire J (2016). tufte: Tufte’s Styles for R Markdown Documents. R package version 0.2, https://CRAN.R-project.org/package=tufte.

Traffic Accidents in Slovenia

Darko Bergant

August 2017