This tutorial is an overview of trafficaccidents
data package. The package contains information about traffic accidents in Slovenia examined by the police from 2005 to 2016. Original files1 The original data is published on http://www.policija.si/ were converted to R data frames without changes, except for:
The tutorial was created using R language (R Core Team 2017R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.) and several R extensions. See the list of all used packages at the end of the document.
Note that this is not a thorough analysis of traffic accidents. The only goal is to bring the data and visualization tools to a larger group of users.
Information about accidents is stored in three data frames:
event
contains accident events with date, time, position and other accident attributes.party
contains attributes of the parties involved.calendar
contains all dates from 2005 to 2016.The majority of accidents (93%) are geocoded. Use pos_x
and pos_y
columns for geographic location.2 Columns pos_x
and pos_y
are geocoded in D48/GK coordinate reference system. Use columns lon and lat for WGS84 For example, we can plot all accidents locations as points, coloured by road category (road_type
):
library(dplyr)
library(ggplot2)
library(viridis)
library(trafficaccidents)
dat_roads <-
event %>%
filter(!is.na(road_type), !is.na(pos_x)) %>%
arrange(desc(road_type))
ggplot(data = dat_roads, aes(pos_y, pos_x, color = road_type) ) +
geom_point(alpha = 0.03, size = 0.2) +
coord_fixed() +
scale_color_viridis(discrete = TRUE, end = 0.7, option = "A") +
guides(colour = guide_legend(
override.aes = list(alpha = 1, size = 2), title = "Road category")) +
theme_void() +
theme(legend.position = c(0.95,0.3))
Working days have different time distribution than weekends and holidays.
days_count <-
calendar %>%
group_by(year, working_day) %>%
summarise(days = n())
accidents_per_hour <-
event %>%
left_join(calendar, by = "date") %>%
group_by(year, working_day, hour) %>%
count %>%
left_join(days_count, by = c("year", "working_day")) %>%
mutate(p = n / days)
ggplot(accidents_per_hour, aes(x = hour, y = p)) +
geom_point(alpha = 0.2, size = 0.2, position = position_jitter(0.1)) +
geom_smooth(method = "loess", span = 1/3, color = "gray") +
scale_x_continuous(breaks = 1:7*3, name = "Hour", minor_breaks = NULL) +
scale_y_continuous(limits = c(0, NA), breaks = 0:6 * 2,
name = "Average number of accidents") +
coord_cartesian(y=c(0, 9)) +
facet_wrap(~ working_day, ncol = 1, scales = "fixed") +
theme_minimal()
With “joy plot” from ggjoy
package (Wilke 2017Wilke, Claus O. 2017. Ggjoy: Joyplots in ’Ggplot2’. https://CRAN.R-project.org/package=ggjoy.) we can discover the differences of time distributions between road categories. The peaks at rush hour are more visible on roads than on city streets.
Time distribution of accidents on working day
library(ggjoy)
dat_rt_wd <-
event %>%
left_join(calendar, by = "date") %>%
filter(
working_day == "Working day",
!is.na(road_type)
)
ggplot(dat_rt_wd) +
geom_joy(aes(x = hour, y = road_type, fill = road_type), alpha = 0.5 ) +
scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL,
limits = c(0,23), name = "Hour") +
scale_fill_viridis(discrete = TRUE, end = 0.7, option = "A")+
scale_y_discrete(limits = rev(levels(dat_rt_wd$road_type)), name = NULL)+
theme_minimal() +
theme(legend.position = "none")
With animation it is possible to see accident locations by hour. The following animation was created with gganimate package (Robinson 2016Robinson, David. 2016. Gganimate: Create Easy Animations with Ggplot2. http://github.com/dgrtwo/gganimate.).
event_wd <-
event %>%
filter(!is.na(pos_x)) %>%
mutate(alpha = 0.2) %>%
left_join(calendar, by = "date") %>%
filter(working_day == "Working day")
shift_hour <- function(x, n, a) {
mutate(x, alpha = a, hour = (hour + n) %% 24)
}
dat_wd_shift <-
bind_rows(
event_wd, shift_hour(event_wd, 1, 0.1), shift_hour(event_wd, 2, 0.07)
) %>%
arrange(hour, desc(road_type))
clock <- list(x = 378938 + 30000, y = 191590 - 10000, width = 8500)
p1 <-
ggplot(data = dat_wd_shift, aes(frame = hour)) +
geom_point(aes(pos_y, pos_x, alpha = alpha, color = road_type), size = 0.4) +
scale_alpha_identity() +
scale_color_viridis(discrete = TRUE, end = 0.7, option = "A", guide = "none",
limits = levels(dat_wd_shift$road_type)) +
coord_fixed() +
geom_text(aes(label = sprintf("%02d", hour)), color = "gray",
x = clock$x, y = clock$y, size = 9) +
geom_segment(size = 2, aes(
x = clock$x + clock$width * sin(hour/6 * pi),
y = clock$y + clock$width * cos(hour/6 * pi),
xend = clock$x + clock$width*1.4 * sin(hour/6 * pi),
yend = clock$y + clock$width*1.4 * cos(hour/6 * pi),
color = levels(dat_wd_shift$road_type)[as.integer(hour >= 12)*8+1]
)) +
theme_void()
library(gganimate)
library(animation)
magickPath <- shortPathName("magick.exe")
ani.options(convert=magickPath)
#ani.options(ani.width = 1024, ani.height = 640+55)
ani.options(ani.width = 1024*1.3, ani.height = (640+55)*1.3)
ani.options(interval = 0.1)
gganimate(p1, "docs/animation/accidents_animation.gif", title_frame = FALSE )
All accidents are marked with injury severity (column injury
). The plot below compares time distributions of accidents for different levels of severities.
Distribution of accidents over time
dat <-
event %>%
left_join(calendar, by = "date")
ggplot(dat) +
geom_joy(aes(x = hour, y = injury, alpha = injury), fill = "darkred") +
scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL,
limits = c(0,23), name = "Hour") +
scale_y_discrete(limits = rev(levels(dat$injury)), name = NULL)+
scale_alpha_discrete(range = c(0.06, 0.7), guide = "none") +
theme_minimal() +
facet_wrap(~working_day, ncol = 1)
Police classifies each accident by main cause. The plot below represent the number of accidents as a circle size and injury severity as color opacity for different accident causes:
dat_type <-
event %>%
group_by(cause, injury) %>%
count
ggplot(dat_type, aes(x = 1, y=1, size = sqrt(n) / 7, alpha = injury)) +
scale_size_identity() +
geom_point(color = "darkred") +
scale_y_discrete(limits = rev(levels(dat_type$cause))) +
coord_fixed(0.7) +
scale_alpha_discrete(range = c(0.06, 0.7)) +
guides(alpha = guide_legend(override.aes = list(size = 6))) +
theme_void(base_size = 12) +
#theme(strip.text = element_text(size = 10)) +
labs(x = NULL, y = NULL, alpha = "Severity") +
facet_wrap(~cause, ncol = 3)
The accident_type
column defines what happened (e.g. side collision or vehicle rollover). The plot below explores its relation to accident cause.
dat_cause_type <-
event %>%
group_by(cause, event_type, injury) %>%
count
ggplot(dat_cause_type,
aes(x = event_type, y=cause, size = sqrt(n) / 7, alpha = injury)) +
geom_point(color = "darkred") +
scale_size_identity() +
scale_alpha_discrete(range = c(0.06, 0.7)) +
guides(alpha = guide_legend(override.aes = list(size = 10))) +
scale_x_discrete(position = "top") +
scale_y_discrete(limits = rev(levels(dat_cause_type$cause))) +
coord_fixed() +
theme_minimal() +
theme(
axis.text.x = element_text(angle=90,hjust = 0, vjust = 0.3, size = 12),
axis.text.y = element_text(size = 12),
legend.position = "bottom"
) +
labs(x = "Accident type", y = "Cause", alpha = "Severity")
There can be several parties involved in same accident. The data includes also some cases where there are several parties at fault for the same accident. Here the events_party
data frame is created with single event and some attributes from one party at fault.
party_at_fault <-
party %>%
filter(accident_role == "At fault")
other_party <-
party %>%
filter(accident_role != "At fault")
events_party <-
party_at_fault %>%
group_by(src_file, event_id) %>%
summarise(
age = first(age),
gender = first(gender),
alco = first(alcotest) > 0,
experience = first(experience_y) + first(experience_m)/12,
traffic_role = first(traffic_role)
)
events_party %>%
group_by(traffic_role) %>%
summarise(accidents = n()) %>%
arrange(desc(accidents)) %>%
head(8) %>%
kable
traffic_role | accidents |
---|---|
Car driver | 192340 |
Truck driver | 24174 |
Cyclist | 7862 |
Motorcycle driver | 5290 |
Motorized bicycle driver | 3344 |
Bus driver | 1704 |
Pedestrian | 1637 |
Tractor driver | 1492 |
Most frequent parties at fault are car drivers. Is it true, that there are more accidents involving young drivers, driving at night?
hour_intervals <- setNames(
c(0, 6, 22),
c("Night (22h - 06h)", "Day (6h-21h)", "Night (22h - 06h)")
)
events_party_car_drivers <-
events_party %>%
filter(traffic_role %in% c("Car driver")) %>%
filter(age >= 18) %>%
inner_join(event, by = c("src_file", "event_id")) %>%
inner_join(calendar, by = "date") %>%
mutate(
hour_interval = findInterval(hour, hour_intervals),
hour_interval = names(hour_intervals)[hour_interval],
age = round(age)
)
events_party_car_drivers %>%
filter(date >= as.Date("2011-01-01")) %>%
group_by(age, hour_interval, src_file) %>%
count %>%
ggplot(aes(x = age, y = n)) +
#geom_point(alpha = 0.7) +
geom_point(alpha = 0.5, mapping = aes(color = src_file)) +
geom_smooth(alpha = 0.2, method = "loess") +
facet_wrap(~hour_interval, scales = "free_y") +
labs(x = "Age", y = "Number of accidents", color = NULL) +
scale_x_continuous(breaks = 1:10*10, limits = c(18, 85)) +
scale_color_brewer(palette = "Blues") +
theme_bw()
Alcohol involved accidents, number of accidents by age at day/night:
events_party_car_drivers %>%
filter(date >= as.Date("2011-01-01"), alco) %>%
group_by(age, hour_interval, src_file) %>%
count %>%
ggplot(aes(x = age, y = n)) +
#geom_point(alpha = 0.7) +
geom_point(alpha = 0.3, mapping = aes(color = src_file)) +
geom_smooth(method = "loess", color = "orange", span = 0.6) +
facet_wrap(~hour_interval) +
labs(x = "Age", y = "Number of accidents", color = NULL) +
scale_x_continuous(breaks = 1:10*10, limits = c(18, 80)) +
scale_color_brewer(palette = "Oranges") +
theme_bw()+
theme(legend.position = "none")
knitr 1.16
Xie Y (2017). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.16, http://yihui.name/knitr/.
rmarkdown 1.5
Allaire J, Cheng J, Xie Y, McPherson J, Chang W, Allen J, Wickham H, Atkins A, Hyndman R and Arslan R (2017). rmarkdown: Dynamic Documents for R. R package version 1.5, https://CRAN.R-project.org/package=rmarkdown.
tidyr 0.6.2
Wickham H (2017). tidyr: Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions. R package version 0.6.2, https://CRAN.R-project.org/package=tidyr.
dplyr 0.5.0
Wickham H and Francois R (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0, https://CRAN.R-project.org/package=dplyr.
ggjoy 0.2.0
Wilke C (2017). ggjoy: Joyplots in ‘ggplot2’. R package version 0.2.0, https://CRAN.R-project.org/package=ggjoy.
viridis 0.4.0
Garnier S (2017). viridis: Default Color Maps from ‘matplotlib’. R package version 0.4.0, https://CRAN.R-project.org/package=viridis.
viridisLite 0.2.0
Garnier S (2017). viridisLite: Default Color Maps from ‘matplotlib’ (Lite Version). R package version 0.2.0, https://CRAN.R-project.org/package=viridisLite.
ggplot2 2.2.1
Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-0-387-98140-6, http://ggplot2.org.
datamodelr 0.2.1.9001
Bergant D (2017). datamodelr: Define and Plot Data Model Diagrams. R package version 0.2.1.9001, https://github.com/bergant/datamodelr.
trafficaccidents 0.1.0
Bergant D (2018). trafficaccidents: Traffic Accidents in Slovenia. R package version 0.1.0, http://github.com/bergant/trafficaccidents.
tufte 0.2
Xie Y and Allaire J (2016). tufte: Tufte’s Styles for R Markdown Documents. R package version 0.2, https://CRAN.R-project.org/package=tufte.