
Part 1: Working with time series data

Stéphane Guillou

Technology Trainer, The University of Queensland Library

Prerequisites

This R workshop assumes basic knowledge of R including:

  • Installing and loading packages
  • How to read in data with read.csv(), readr::read_csv() and similar
  • Creating objects in R
  • How to transform data frames and tibbles with dplyr

We are happy to answer any and all questions, though!

What are we going to learn?

In this first part of the workshop, we will learn how to:

  • Read data from multiple sheets into an object
  • Clean the data and extract information
  • Explore the data visually (static and interactive)
  • Perform rolling operations

Create a project

To work cleanly, we need to:

  • Create a new project in RStudio
  • Create a new script to write our R code
  • Create a data directory to store our data in
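
The data directory can also be created from the R console rather than through the file browser; a minimal example, run from the project directory:

# create a "data" folder inside the project directory
dir.create("data", showWarnings = FALSE) # showWarnings = FALSE: no warning if it already exists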

Load packages

For this workshop, we will use many tools from the Tidyverse: a collection of packages for data import, cleaning, transformation, visualisation and export.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

About the data

Our dataset describes atmospheric samples of a compound, collected daily over seven consecutive days, for different months of the year. Some years and months have fewer samples due to technical issues.

Download the data

Let’s download our dataset from the web:

download.file("https://github.com/seaCatKim/PACE-LIB-R-timeseries/raw/main/content/post/2020-12-01-part1-working-timeseries-data/data/analytes_data.xlsx",
              destfile = "data/analytes_data.xlsx",
              mode = 'wb')

The mode = 'wb' argument downloads the file in binary mode, which is necessary for download.file() to work properly on Windows.

Read in the data

We have an XLSX workbook that contains several sheets. The first one only documents what the data is about, whereas the other two contain the data we are interested in.

The package readxl is useful for importing data stored in XLS and XLSX files. For example, to have a look at a single sheet of data, we can do the following:

# load the package
library(readxl)
# only import the second sheet
analytes <- read_excel("data/analytes_data.xlsx",
                       sheet = 2)

We could also point to the correct sheet by using the sheet name instead of its index. For that, the excel_sheets() function is useful to find the names:

# excel_sheets() shows the sheet names
excel_sheets("data/analytes_data.xlsx")
## [1] "infromation data " "Site_759"          "Site_1335"
analytes <- read_excel("data/analytes_data.xlsx", sheet = "Site_759")

Let’s have a look at the first few rows of data:

head(analytes)
## # A tibble: 6 x 4
##   `Site code` Analyte    `Real date`         `mg/day`
##         <dbl> <chr>      <dttm>                 <dbl>
## 1         759 Compound x 1991-11-29 00:00:00    0.334
## 2         759 Compound x 1991-11-30 00:00:00    0.231
## 3         759 Compound x 1991-12-01 00:00:00    0.216
## 4         759 Compound x 1991-12-02 00:00:00    0.219
## 5         759 Compound x 1991-12-03 00:00:00    0.203
## 6         759 Compound x 1991-12-04 00:00:00    0.206

Bind several workbook sheets

Even though this workbook only has two sheets of data, we might want to automate the reading and binding of all data sheets to avoid repeating code. This comes in very handy if you have a workbook with a dozen sheets of data, or if your data is split between several files.

The Tidyverse’s purrr package allows “mapping” a function (or a more complex command) to several elements. Here, we will map the reading of the sheet to each element in a vector of sheet names.

Using the map_dfr() function makes sure we have a single data frame as an output.

# only keep sheet names that contain actual data
sheets <- excel_sheets("data/analytes_data.xlsx")[2:3]
# map the reading to each sheet
analytes <- map_dfr(sheets,
                    ~ read_excel("data/analytes_data.xlsx", sheet = .x))

We could map a function by simply providing the name of the function. However, because we are doing something slightly more elaborate here (pointing to one single file, and using an extra argument to point to the sheet itself), we need to use the ~ syntax, and point to the element being processed with the .x placeholder.

For more information on the different options the map family offers, see ?map.
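
For example, if the data was instead split between several CSV files with the same columns, providing the function name alone would be enough. A hypothetical sketch (these file paths are made up):

# hypothetical CSV files, one per site, with identical columns
csv_files <- c("data/site_759.csv", "data/site_1335.csv")
# read_csv() receives each path as its first argument; the results are row-bound
combined <- map_dfr(csv_files, read_csv)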

Data cleaning

There are a few issues with the dataset. First of all, there are variations in how the compound is named. We can replace the value in the first column with a simpler, consistent one:

# all same compound
analytes$Analyte <- "x"

Our column names are not the most reusable names for R. Better names do not contain spaces or special characters like /. dplyr’s rename() function is very handy for that:

library(dplyr)
analytes <- rename(analytes, Site = 1, Date = 3, mg_per_day = 4)

Finally, the Site column is stored as numeric data. If we plot it as it is, R will consider it to be a continuous variable, when it really should be discrete. Let’s fix that with dplyr’s mutate() function:

analytes <- mutate(analytes, Site = as.character(Site))

We could convert it to a factor instead, but the Tidyverse packages tend to be happy with categorical data stored as the character type.
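
For reference, the factor version would look very similar (not assigned here, as we keep the character type):

mutate(analytes, Site = as.factor(Site)) # returns a modified copy; analytes itself is unchanged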

Export a clean dataset

We now have a clean dataset in a single table, which we can save a copy of, especially to share with others, or if we want to split our code into several scripts that can work independently.

write.csv(analytes, "data/analytes_data_clean.csv",
          row.names = FALSE)

write.csv() will by default include a column of row names in the exported file (the row numbers, if no row names have been assigned). That is usually not something we want, so we can turn it off with row.names = FALSE.

Visualisation with ggplot2

At this stage, we can start exploring visually. For a lot of R users, the go-to package for data visualisation is ggplot2, which is also part of the Tidyverse.

For a ggplot2 visualisation, remember that we usually need these three essential elements:

  • the dataset
  • the mapping of aesthetic elements to variables in the dataset
  • the geometry used to represent the data

Let’s try a first timeline visualisation with a line plot:

library(ggplot2)
ggplot(analytes,             # data
       aes(x = Date,         # mapping of aesthetics
           y = mg_per_day,
           colour = Site)) + # (separate by site)
  geom_line()                # geometry

A simple line plot is not great here because of the periodicity: there were bursts of sampling, several days in a row, and then nothing for a while. This results in a fine, daily resolution for short periods of time, and straight lines joining these periods.

We might want to smooth that line, hoping to get a better idea of the trend, while keeping the original data as points in the background:

ggplot(analytes, aes(x = Date, y = mg_per_day, colour = Site)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The trend lines only give a very general trend. What if we make it follow the points more closely?

ggplot(analytes, aes(x = Date, y = mg_per_day, colour = Site)) +
  geom_point(size = 0.3) + # smaller points
  geom_smooth(span = 0.05) # follow the data more closely
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

With the method used, we end up with increased uncertainty (the shaded area around the curves). It also creates artificial "dips" to fit the data, for example close to the beginning of 2000 for site 1335 (where it even reaches negative values).

Summarise the data

In this case, because we have sampling points for what looks like groups of successive days, we can try to summarise them.

Operations on date-time data can be done more comfortably with extra packages. The Tidyverse comes with the lubridate package, which has been around for a while and is very powerful. Another, more recent package called “clock” can do most of what lubridate can, and more, but it is still being heavily developed, so we stick to lubridate here.

Let’s start by extracting all the date components that could be useful:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
analytes <- analytes %>% 
   mutate(year = year(Date),
          month = month(Date),
          day = day(Date),
          week = week(Date),
          weekday = weekdays(Date))

How many sampling days per month are there?

analytes %>% 
   group_by(Site, year, month) %>% 
   count() %>% 
   head(12)
## # A tibble: 12 x 4
##    Site   year month     n
##    <chr> <dbl> <dbl> <int>
##  1 1335   1991    11     3
##  2 1335   1991    12     3
##  3 1335   1992     1     2
##  4 1335   1992     2     5
##  5 1335   1992     3     5
##  6 1335   1992     4     2
##  7 1335   1992     5     4
##  8 1335   1992     6     3
##  9 1335   1992     7     2
## 10 1335   1992     8     4
## 11 1335   1992    10     7
## 12 1335   1992    12     6

The number of samples per month is irregular, and some months have no data.

Furthermore, the week numbers don’t align with the sampling weeks, and some sampling weeks span two months:

analytes %>%  select(year, month, day, week) %>% head(10)
## # A tibble: 10 x 4
##     year month   day  week
##    <dbl> <dbl> <int> <dbl>
##  1  1991    11    29    48
##  2  1991    11    30    48
##  3  1991    12     1    48
##  4  1991    12     2    48
##  5  1991    12     3    49
##  6  1991    12     4    49
##  7  1991    12     5    49
##  8  1992     1    31     5
##  9  1992     2     1     5
## 10  1992     2     2     5

In any case, the fact that week numbers are reset at the beginning of the year wouldn’t help.

One way to group the sampling days together is to detect which ones are spaced by one day, and which ones by a lot more. The lag() and lead() functions from dplyr are very useful to compare values in a single column:

analytes <- analytes %>%
   arrange(Site, Date) %>% # make sure it is in chronological order
   group_by(Site) %>% # deal with sites separately
   mutate(days_lapsed = as.integer(Date - lag(Date))) %>%  # compare date to previous date
   ungroup() # leftover grouping might have unintended consequences later on
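
As a quick illustration of what these two functions do on a simple vector (values made up):

dplyr::lag(c(5, 10, 20))  # previous value: NA  5 10
dplyr::lead(c(5, 10, 20)) # next value:     10 20 NA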

Grouping by site is important: without it, the first row of the second site would be compared to the last row of the first site, giving an erroneous value. Because we grouped, that comparison returns an NA instead.

How consistent are the sampling periods? Let’s investigate:

analytes %>% 
   count(days_lapsed) %>% 
   head()
## # A tibble: 6 x 2
##   days_lapsed     n
##         <int> <int>
## 1           1   608
## 2           2     4
## 3           3     3
## 4          39     1
## 5          42     1
## 6          43     5

It looks like some sampling days might have been missed, so we can define a sampling period as “a period in which sampling times are not spaced by more than 3 days.”

To create a grouping index, we can first assign a value of TRUE to the first row of each time period, and then use the cumulative sum function on that column (as it converts TRUEs to 1s and FALSEs to 0s):

analytes <- analytes %>% 
   group_by(Site) %>%
   mutate(sampling_period = row_number() == 1 | days_lapsed > 3,
          sampling_period = cumsum(sampling_period)) %>%
   ungroup()
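
To see why the cumulative sum works as a grouping index, here is a quick illustration on a small logical vector:

# each TRUE (the start of a new period) increments the index by 1
cumsum(c(TRUE, FALSE, FALSE, TRUE, FALSE)) # returns 1 1 1 2 2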

We can now use these new group indices to summarise by time period:

analytes_summary <- analytes %>% 
   group_by(Analyte, Site, sampling_period) %>% # we are keeping Analyte
   summarise(Date = round_date(mean(Date), unit = "day"),
             mg_per_day = mean(mg_per_day)) %>% 
   ungroup()
## `summarise()` has grouped output by 'Analyte', 'Site'. You can override using the `.groups` argument.

We chose to average and round the date for each sampling period, but we could choose a different approach depending on what we want to do, for example keeping the actual time interval: interval(start = min(Date), end = max(Date))
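
A sketch of what that alternative could look like (not evaluated here, and assuming summarise() handles lubridate interval columns):

analytes %>% 
   group_by(Analyte, Site, sampling_period) %>% 
   summarise(period = interval(start = min(Date), end = max(Date)), # keep the full sampling interval
             mg_per_day = mean(mg_per_day)) %>% 
   ungroup()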

Let’s try our line plot again with the summarised data:

ggplot(analytes_summary,
       aes(x = Date,
           y = mg_per_day,
           colour = Site)) +
  geom_line()  

This is a lot cleaner than what we had originally!

Export summarised data

We have previously exported a CSV, which is a great, simple format that can be opened pretty much anywhere. However, if you want to save an R object to reopen it exactly as it was, you can use an R-specific format like RData.

save(analytes_summary, file = "data/summary.RData")

The file can then be imported again with the load() function. You won’t need to convert the columns to the correct data type again.
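
For example, in a later session or in a separate script:

# re-create the analytes_summary object exactly as it was saved
load("data/summary.RData")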

Interactive visualisation

Exploring timelines might be more comfortable with an interactive visualisation. Plotly is a helpful library available for various programming languages, and the plotly package makes it easy to use it in R.

Once a visualisation is created in R, it is trivial to convert it to a Plotly visualisation with a single function: ggplotly().

# save as an object
p <- ggplot(analytes_summary,
       aes(x = Date,
           y = mg_per_day,
           colour = Site)) +
  geom_line()
# turn it into a plotly visualisation
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(p)

To focus on a section, draw a rectangle (and double-click to reset to the full view).

With several time series plotted, it is useful to change the hover setting to “compare data on hover” with this button:

[Image: the “Compare data on hover” button in a Plotly visualisation]

It is however possible to set a similar hover mode as a default:

ggplotly(p) %>% 
   layout(hovermode = "x unified")

Rolling operations

With time series affected by seasonality, or with a lot of variation creating visual noise, it is sometimes useful to represent long-term trends with a rolling average.

The package RcppRoll is useful for this kind of windowed operation:

library(RcppRoll)
analytes_summary %>% 
   group_by(Site) %>% 
   mutate(rolling_mean = roll_mean(mg_per_day,
                                   n = 6,          # the size of the window
                                   fill = NA)) %>% # to have same length
   ggplot(aes(x = Date,
           y = rolling_mean,
           colour = Site)) +
  geom_line()
## Warning: Removed 10 row(s) containing missing values (geom_path).

This method might not be the best for this data, but it proves very useful in other cases, for example for COVID-19 daily infection rates (with a 7-day window).
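
As a sketch of that use case, with a hypothetical data frame daily_data containing a daily count column named cases:

# 7-day rolling mean, right-aligned so each value summarises the previous 7 days
daily_data %>% 
   mutate(cases_7day = roll_mean(cases, n = 7, fill = NA, align = "right"))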

Next: part 2

In Part 2, you will learn about analysing time series data.
