
Introduction

This section teaches some techniques for analyzing time series data of interest, such as abundances. It assumes that you have seen the time series lecture, where basic time series analyses were introduced as simple extensions of linear regression.

Packages and tools

For this practical you will need to have the following packages installed for data manipulation:

require(dplyr)
require(tidyr)

Loading these packages prints messages because they contain functions with the same names as functions in the main ("base") R packages; the dplyr/tidyr versions mask the base ones.
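
For example, dplyr masks the base functions filter() and lag() from the stats package; if you ever need a masked function, you can call it with its namespace prefix:

# call a masked base function explicitly via its namespace
stats::filter(1:10, rep(1/3, 3))  # moving average from stats, not dplyr's filter()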

Vector Abundances (example analysis)

In the face of global climate change, vector-borne diseases have the capacity to shift ranges and drastically alter emergence times, with profound public health implications. It is therefore essential that we monitor vectors and attempt to forecast how a changing climate could affect them.

We will apply linear models to vector abundance data available in VecDyn.

Now let's read in some data. There are a couple of example datasets for you to explore, as well as a wealth of data you can download from the VecDyn database. They are all in the same format, so the cleaning procedures and the analysis steps for linear modeling will be the same.

# This is a dataset from Walton County, Florida for the species Culex erraticus

Complete_data <- read.csv("../data/Culex_erraticus_walton.csv")

## Or you could use either of the following datasets for Culex nigripalpus

# Complete_data <- read.csv("../data/Culex_nigripalpus_walton.csv")
# Complete_data <- read.csv("../data/Culex_nigripalpus_manatee.csv")

Great, now we have some data. It's important to have a look at its layout, format, etc. I usually start with three functions. First, names(), which tells you the names of the columns:

names(Complete_data)
##  [1] "X"                        "title"                   
##  [3] "dataset_citation"         "publication_citation"    
##  [5] "description"              "url"                     
##  [7] "contact_name"             "contact_affiliation"     
##  [9] "email"                    "orcid"                   
## [11] "dataset_license"          "project_identifier"      
## [13] "publication_status"       "taxon"                   
## [15] "location_description"     "study_collection_area"   
## [17] "geo_datum"                "gps_obfuscation_info"    
## [19] "species_id_method"        "study_design"            
## [21] "sampling_strategy"        "sampling_method"         
## [23] "sampling_protocol"        "measurement_unit"        
## [25] "value_transform"          "sample_start_date"       
## [27] "sample_start_time"        "sample_end_date"         
## [29] "sample_end_time"          "sample_value"            
## [31] "sample_sex"               "sample_stage"            
## [33] "sample_location"          "sample_collection_area"  
## [35] "sample_lat_dd"            "sample_long_dd"          
## [37] "sample_environment"       "additional_location_info"
## [39] "additional_sample_info"   "sample_name"

We can already see that there's a lot in this dataset that we probably won't be using. Next, head() allows us to look at the first few rows of data:

head(Complete_data)
##         X                                                            title
## 1 1622290 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 2 1622343 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 3 1622380 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 4 1622410 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 5 1622440 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 6 1622479 Mosquito surveillance in South Walton County, Florida, USA. 2015
##                                                     dataset_citation
## 1 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 2 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 3 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 4 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 5 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 6 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
##   publication_citation
## 1                   NA
## 2                   NA
## 3                   NA
## 4                   NA
## 5                   NA
## 6                   NA
##                                                                                                                                description
## 1 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 2 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 3 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 4 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 5 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 6 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
##   url                        contact_name contact_affiliation email orcid
## 1  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
## 2  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
## 3  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
## 4  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
## 5  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
## 6  NA Peter J Brabant;Peter J Brabant III                  NA    NA    NA
##   dataset_license project_identifier publication_status           taxon
## 1           CC BY         VBP0000263        unpublished Culex erraticus
## 2           CC BY         VBP0000263        unpublished Culex erraticus
## 3           CC BY         VBP0000263        unpublished Culex erraticus
## 4           CC BY         VBP0000263        unpublished Culex erraticus
## 5           CC BY         VBP0000263        unpublished Culex erraticus
## 6           CC BY         VBP0000263        unpublished Culex erraticus
##   location_description study_collection_area geo_datum gps_obfuscation_info
## 1        Walton County                    NA        NA                   NA
## 2        Walton County                    NA        NA                   NA
## 3        Walton County                    NA        NA                   NA
## 4        Walton County                    NA        NA                   NA
## 5        Walton County                    NA        NA                   NA
## 6        Walton County                    NA        NA                   NA
##   species_id_method study_design sampling_strategy sampling_method
## 1                NA           NA                NA           light
## 2                NA           NA                NA           light
## 3                NA           NA                NA           light
## 4                NA           NA                NA           light
## 5                NA           NA                NA           light
## 6                NA           NA                NA           light
##       sampling_protocol measurement_unit value_transform sample_start_date
## 1 new jersey trap catch               NA              NA                NA
## 2 new jersey trap catch               NA              NA                NA
## 3 new jersey trap catch               NA              NA                NA
## 4 new jersey trap catch               NA              NA                NA
## 5 new jersey trap catch               NA              NA                NA
## 6 new jersey trap catch               NA              NA                NA
##   sample_start_time sample_end_date sample_end_time sample_value sample_sex
## 1                NA      2015-01-05              NA            0     female
## 2                NA      2015-01-05              NA            0     female
## 3                NA      2015-01-05              NA            0     female
## 4                NA      2015-01-05              NA            0     female
## 5                NA      2015-01-05              NA            0     female
## 6                NA      2015-01-22              NA            0     female
##   sample_stage sample_location sample_collection_area sample_lat_dd
## 1        adult   Walton County                     NA       30.3813
## 2        adult   Walton County                     NA       30.3892
## 3        adult   Walton County                     NA       30.3469
## 4        adult   Walton County                     NA       30.3717
## 5        adult   Walton County                     NA       30.3650
## 6        adult   Walton County                     NA       30.3813
##   sample_long_dd sample_environment additional_location_info
## 1       -86.3593                 NA                       NA
## 2       -86.0890                 NA                       NA
## 3       -86.0694                 NA                       NA
## 4       -86.2655                 NA                       NA
## 5       -86.2261                 NA                       NA
## 6       -86.3593                 NA                       NA
##   additional_sample_info sample_name
## 1                     NA          NA
## 2                     NA          NA
## 3                     NA          NA
## 4                     NA          NA
## 5                     NA          NA
## 6                     NA          NA

Finally, summary(), which tells us a little more about the type of data in each column, as R sees it:

summary(Complete_data)
##        X              title           dataset_citation   publication_citation
##  Min.   :1622290   Length:6096        Length:6096        Mode:logical        
##  1st Qu.:1666636   Class :character   Class :character   NA's:6096           
##  Median :1709015   Mode  :character   Mode  :character                       
##  Mean   :1711077                                                             
##  3rd Qu.:1756655                                                             
##  Max.   :1805144                                                             
##  description          url          contact_name       contact_affiliation
##  Length:6096        Mode:logical   Length:6096        Mode:logical       
##  Class :character   NA's:6096      Class :character   NA's:6096          
##  Mode  :character                  Mode  :character                      
##                                                                          
##                                                                          
##                                                                          
##   email          orcid         dataset_license    project_identifier
##  Mode:logical   Mode:logical   Length:6096        Length:6096       
##  NA's:6096      NA's:6096      Class :character   Class :character  
##                                Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##  publication_status    taxon           location_description
##  Length:6096        Length:6096        Length:6096         
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##  study_collection_area geo_datum      gps_obfuscation_info species_id_method
##  Mode:logical          Mode:logical   Mode:logical         Mode:logical     
##  NA's:6096             NA's:6096      NA's:6096            NA's:6096        
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  study_design   sampling_strategy sampling_method    sampling_protocol 
##  Mode:logical   Mode:logical      Length:6096        Length:6096       
##  NA's:6096      NA's:6096         Class :character   Class :character  
##                                   Mode  :character   Mode  :character  
##                                                                        
##                                                                        
##                                                                        
##  measurement_unit value_transform sample_start_date sample_start_time
##  Mode:logical     Mode:logical    Mode:logical      Mode:logical     
##  NA's:6096        NA's:6096       NA's:6096         NA's:6096        
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  sample_end_date    sample_end_time  sample_value      sample_sex       
##  Length:6096        Mode:logical    Min.   : 0.0000   Length:6096       
##  Class :character   NA's:6096       1st Qu.: 0.0000   Class :character  
##  Mode  :character                   Median : 0.0000   Mode  :character  
##                                     Mean   : 0.8735                     
##                                     3rd Qu.: 0.0000                     
##                                     Max.   :47.0000                     
##  sample_stage       sample_location    sample_collection_area sample_lat_dd  
##  Length:6096        Length:6096        Mode:logical           Min.   :30.27  
##  Class :character   Class :character   NA's:6096              1st Qu.:30.35  
##  Mode  :character   Mode  :character                          Median :30.37  
##                                                               Mean   :30.36  
##                                                               3rd Qu.:30.39  
##                                                               Max.   :30.41  
##  sample_long_dd   sample_environment additional_location_info
##  Min.   :-86.40   Mode:logical       Mode:logical            
##  1st Qu.:-86.27   NA's:6096          NA's:6096               
##  Median :-86.21                                              
##  Mean   :-86.19                                              
##  3rd Qu.:-86.10                                              
##  Max.   :-86.00                                              
##  additional_sample_info sample_name   
##  Mode:logical           Mode:logical  
##  NA's:6096              NA's:6096     
##                                       
##                                       
##                                       
## 

Notice here that most of these columns are of type "character", which indicates strings. In certain cases you may want to use some of these (e.g. sex or stage) as factors, in which case you have to change the format before you can use them in your analysis.
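
For example, a sketch of converting the sex and stage columns to factors before modeling (we won't need this for the analysis below):

# convert character columns to factors if you plan to use them as predictors
Complete_data$sample_sex <- as.factor(Complete_data$sample_sex)
Complete_data$sample_stage <- as.factor(Complete_data$sample_stage)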

There's a lot of information in this dataset, most of which we won't use to analyze the time series. For simplicity, let's clean the data and store it in another data frame, which we can then match with climate data from the locations where the samples were taken.

# select only the columns of interest 
Main_data <- select(Complete_data, c("sample_end_date", "sample_value", "sample_lat_dd", "sample_long_dd"))

# Make sure the sample end date is in date format
Main_data$sample_end_date <- as.Date(Main_data$sample_end_date, format = "%Y-%m-%d")

# Order by date 
Main_data <- Main_data[order(Main_data$sample_end_date, decreasing=FALSE),]

# We can now create columns for Month/Year and Month 
Main_data$Month_Yr <- format(as.Date(Main_data$sample_end_date), "%Y-%m")
Main_data$Month <- format(as.Date(Main_data$sample_end_date), "%m")

Before we go any further, let’s take a look at the data so far

summary(Main_data)
##  sample_end_date       sample_value     sample_lat_dd   sample_long_dd  
##  Min.   :2015-01-05   Min.   : 0.0000   Min.   :30.27   Min.   :-86.40  
##  1st Qu.:2015-10-12   1st Qu.: 0.0000   1st Qu.:30.35   1st Qu.:-86.27  
##  Median :2016-07-05   Median : 0.0000   Median :30.37   Median :-86.21  
##  Mean   :2016-07-09   Mean   : 0.8735   Mean   :30.36   Mean   :-86.19  
##  3rd Qu.:2017-04-18   3rd Qu.: 0.0000   3rd Qu.:30.39   3rd Qu.:-86.10  
##  Max.   :2017-12-18   Max.   :47.0000   Max.   :30.41   Max.   :-86.00  
##    Month_Yr            Month          
##  Length:6096        Length:6096       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

and plot the samples

plot(Main_data$sample_value, type="l")
points(Main_data$sample_value, col=2)

Notice that we have a LOT of zeros, both between seasons and because data aren't collected every day. Further, we have a lot of variance. This dataset, as is, will be hard to model without some sort of transformation. We also can't use basic regression approaches if there are gaps in the data. So the best bet here will be to aggregate, and then possibly transform (but let's get our climate data first).
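
For instance, to quantify how sparse the series is, we can check the proportion of zero counts directly:

# proportion of raw samples that are zero counts
mean(Main_data$sample_value == 0)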

# Pre-allocate the rows for climate data by filling with NAs
Main_data$Max.Temp <- NA
Main_data$Precipitation <- NA

Import and Map Climate Data

Next, we can import the climate data to be matched up to the abundance data we have stored as Main_data. For this, you will need to ensure you have the climate data file vectorbase_locations_dates_climate.csv (look in the data directory on the github repo) saved in your data directory.

These climate data were downloaded in advance from NOAA databases, on a 2.5x2.5 degree lat/long grid, at a daily resolution.

# Read in the climate data csv 

Climate_data <- read.csv("../data/vectorbase_locations_dates_climate.csv")
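
Before matching, it's worth confirming that the columns the loop below relies on (Collection.date.range, Initial_latitude, Initial_longitude, Max.Temp, and Precipitation) are present:

# check the column names used for matching
names(Climate_data)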

We can now populate these columns by matching, for each row, the date and the closest coordinates available in our climate data.

# For each row in Main_data
for (row in 1:nrow(Main_data)){

  # extract the date associated with the row 
  date <- as.character(Main_data[row, "sample_end_date"])
  
  # subset the climate data to only those with the same date
  data <- subset(Climate_data, Climate_data$Collection.date.range == date)
  
  if (nrow(data)>0){
  
    # define the lat and long desired
    lat <- as.numeric(Main_data[row, "sample_lat_dd"])
    long <- as.numeric(Main_data[row, "sample_long_dd"])
    
    # find the absolute differences between desired lat/longs to the climate datasets
    x <- abs(data$Initial_latitude - lat)
    y <- abs(data$Initial_longitude - long)
    
    # find the index of the single closest grid cell (smallest overall
    # difference in lat/long); which.min returns one index even with ties
    z <- which.min(x + y)
    
    # extract the max temp and precipitation and place into the main data frame
    Main_data[row, "Max.Temp"] <- data[z, "Max.Temp"]
    Main_data[row, "Precipitation"] <- data[z, "Precipitation"]
    
  } else {
  
    # if there aren't any data to extract for a given date, input NAs
    Main_data[row, "Max.Temp"] <- NA
    Main_data[row, "Precipitation"] <- NA
  } 
}
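
This row-by-row loop is simple but can be slow on large datasets. As an optional aside, here is a sketch of a faster version of the same matching, which splits the climate data by date once up front instead of subsetting inside the loop (it assumes the same column names as above and no missing sample dates, and should give the same result):

# split the climate data into a list of data frames, one per date
climate_by_date <- split(Climate_data, Climate_data$Collection.date.range)

# for one date/location, return the climate values of the nearest grid cell
match_climate <- function(date, lat, long) {
  data <- climate_by_date[[as.character(date)]]
  if (is.null(data)) return(c(NA, NA))
  z <- which.min(abs(data$Initial_latitude - lat) +
                   abs(data$Initial_longitude - long))
  c(data$Max.Temp[z], data$Precipitation[z])
}

# apply to every row at once; mapply returns a 2 x nrow matrix
out <- mapply(match_climate, Main_data$sample_end_date,
              Main_data$sample_lat_dd, Main_data$sample_long_dd)
Main_data$Max.Temp <- out[1, ]
Main_data$Precipitation <- out[2, ]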

Now let’s check whether this has worked correctly, and assess whether there are any NA’s.

summary(Main_data$Max.Temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.656  23.902  27.844  26.895  31.091  35.782
summary(Main_data$Precipitation)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.00000   0.02836   5.73650   4.60330 121.88240

When considering time series data, the temporal resolution can have profound implications. Ideally, we want no gaps in the series, so working with abundance data at the daily scale is impossible here without some sort of interpolation to fill the gaps. Instead, let's aggregate the data to the monthly scale by averaging the daily data. This gives us the average number of mosquitoes observed per day in each month.

Aggregated_data <- aggregate(cbind(sample_value, Max.Temp, Precipitation) ~ Month_Yr, data = Main_data, mean)
print(Aggregated_data)
##    Month_Yr sample_value Max.Temp Precipitation
## 1   2015-01  0.000000000 17.74602   3.303991888
## 2   2015-02  0.018181818 17.87269  16.544265802
## 3   2015-03  0.468085106 23.81767   2.405651215
## 4   2015-04  1.619047619 26.03559   8.974406168
## 5   2015-05  0.821428571 30.01602   0.567960943
## 6   2015-06  3.005952381 31.12094   4.841342729
## 7   2015-07  2.380952381 32.81130   3.849010353
## 8   2015-08  1.826347305 32.56245   5.562845324
## 9   2015-09  0.648809524 30.55155  10.409724627
## 10  2015-10  0.988023952 27.22605   0.337750269
## 11  2015-11  0.737804878 24.86768  18.306749680
## 12  2015-12  0.142857143 22.46588   5.621475377
## 13  2016-01  0.000000000 16.02406   3.550622029
## 14  2016-02  0.020202020 19.42057  11.254680803
## 15  2016-03  0.015151515 23.13610   4.785664728
## 16  2016-04  0.026143791 24.98082   4.580424519
## 17  2016-05  0.025252525 28.72884   0.053057634
## 18  2016-06  0.833333333 30.96990   6.155417473
## 19  2016-07  1.261363636 33.30509   4.496368193
## 20  2016-08  1.685279188 32.09633  11.338749182
## 21  2016-09  2.617142857 31.60575   2.868288451
## 22  2016-10  1.212121212 29.14275   0.000000000
## 23  2016-11  1.539772727 24.48482   0.005462681
## 24  2016-12  0.771573604 20.46054  11.615521725
## 25  2017-01  0.045454545 18.35473   0.000000000
## 26  2017-02  0.036363636 23.65584   3.150710053
## 27  2017-03  0.194285714 22.53573   1.430094952
## 28  2017-04  0.436548223 26.15299   0.499381616
## 29  2017-05  1.202020202 28.00173   6.580562663
## 30  2017-06  0.834196891 29.48951  13.333939858
## 31  2017-07  1.765363128 32.25135   7.493927035
## 32  2017-08  0.744791667 31.86476   6.082113434
## 33  2017-09  0.722222222 30.60566   4.631037395
## 34  2017-10  0.142131980 27.73453  11.567112214
## 35  2017-11  0.289772727 23.23140   1.195760473
## 36  2017-12  0.009174312 18.93603   4.018254442

As I mentioned above, we may want/need to transform our data. So first let’s plot the samples over months:

plot(Aggregated_data$sample_value, type = "l", main="Average Abundance", xlab ="Time (months)", ylab = "Average Count")

We still see some of the earlier behavior: we bottom out with lots of zero counts, with small variance when counts are small and large variance when counts are large. If we'd summed instead of averaged, we would have raw counts and could use a Poisson or similar model to allow for this difference in variances. Working with means, the standard options are to take the natural log (if no zeros are present) or the square root (which tolerates zeros). One can also add a small value to all of the counts and then log. In this case I'm going to work with the square root.

# create a sqrt column - we could also use a log(x+small) transform 
Aggregated_data$sqrt_sample <- sqrt(Aggregated_data$sample_value)
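
If you preferred the log route mentioned above, a sketch with an arbitrary small offset (here 0.5; the choice of offset is up to you and worth varying) would be:

# alternative: log transform with a small offset to handle the zeros
Aggregated_data$log_sample <- log(Aggregated_data$sample_value + 0.5)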

Now that we have a clean dataframe to work with, let’s plot our abundance data.

plot(Aggregated_data$sqrt_sample, type = "l", main="Average Monthly Abundance", xlab ="Time (months)", ylab = "Average SqRt Count")

Fitting the linear models

It's good practice to always begin by creating a dataframe containing your response and all covariates that you want to consider. We include sine and cosine waves with a 12-month period to capture the pattern of seasonality apparent in these data. I also use weather data from the same month as the abundance data, but because of the time it takes mosquitoes to mature, it may be reasonable to use the previous month's weather instead.

t <- 2:nrow(Aggregated_data)
TS_df <- data.frame(Specimens=Aggregated_data$sqrt_sample[t],
                        SpecimensPast=Aggregated_data$sqrt_sample[t-1],
                        t=t,
                        Month=Aggregated_data$Month[t],
                        Max.Temp=Aggregated_data$Max.Temp[t],
                        Precipitation=Aggregated_data$Precipitation[t],
                        sin12=sin(2*pi*t/12), cos12=cos(2*pi*t/12))
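
If you want to try the previous month's weather, as suggested above, lagged covariates can be added to the same data frame (a sketch; the _Past column names are my own):

# optional: previous month's weather, to allow for development time
TS_df$Max.Temp.Past <- Aggregated_data$Max.Temp[t-1]
TS_df$Precipitation.Past <- Aggregated_data$Precipitation[t-1]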

Here we will fit and plot a number of linear regression models for our data, progressively adding components to see which best predicts the given data.

Are the data autocorrelated? That is, do the abundances correlate with themselves with a lag time? We can check using the acf() function in R:

acf(TS_df$Specimens)

The acf() function automatically includes a 0 time lag, which always has a value of 1 (every measure is perfectly correlated with itself!) and so acts as a reference point. We can see that a time lag of 1 (month) has the highest correlation after that, so it should be interesting to incorporate this into a linear model.

TS_lm <- lm(Specimens ~ SpecimensPast, data = TS_df)

Let’s look at the summary:

summary(TS_lm)
## 
## Call:
## lm(formula = Specimens ~ SpecimensPast, data = TS_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6338 -0.2173 -0.0678  0.2463  0.8675 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.2410     0.1130   2.132   0.0405 *  
## SpecimensPast   0.6899     0.1240   5.562 3.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3598 on 33 degrees of freedom
## Multiple R-squared:  0.4839, Adjusted R-squared:  0.4682 
## F-statistic: 30.94 on 1 and 33 DF,  p-value: 3.507e-06

The AR coefficient is significant and \(<1\), which indicates that the series is "mean reverting" (as we learned in the lecture portion).
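
For intuition: an AR(1) model \(y_t = a + b\,y_{t-1} + \epsilon_t\) with \(|b| < 1\) has a long-run mean of \(a/(1-b)\); plugging in the estimates above gives \(0.2410/(1 - 0.6899) \approx 0.78\) on the square-root scale, the level the series keeps returning to. Let's plot the fit with the data: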

plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm$fitted, col="pink", lwd=2)

This is actually pretty good (although AR fits often are, especially if there's any mean reversion). What if we drop the AR term and just try sin/cos?

# fit a model with only the seasonal sine/cosine terms as predictors
TS_lm_sin_cos <- lm(Specimens ~ sin12 + cos12, data = TS_df)

# plot the sample 
par(mfrow=c(1,1))
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")

# add a line to the plot for this particular model 
lines(t, TS_lm_sin_cos$fitted, col="green", lwd=2)

It looks like the sine and cosine waves can predict the peaks relatively well!

From here there are lots of possible paths forward. For now we will create models that incorporate the climatic factors individually. First we visually examine the time series of temperature and precipitation:

par(mfrow=c(2,1), mar=c(2, 4, 1.1, 1.1))
plot(t, TS_df$Max.Temp, type="l", xlab = "Time (months)", ylab = "Ave Max Temperature")
plot(t, TS_df$Precipitation, type="l", xlab = "Time (months)", ylab = "Ave Precipitation per day")

Clearly temperature looks more promising as a correlate, but let’s fit linear models with each of these variables anyway:

TS_lm_temp <- lm(Specimens ~ Max.Temp, data = TS_df)
summary(TS_lm_temp)
## 
## Call:
## lm(formula = Specimens ~ Max.Temp, data = TS_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78204 -0.17950  0.01095  0.14751  0.61934 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.22123    0.30870  -3.956 0.000381 ***
## Max.Temp     0.07526    0.01147   6.562 1.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3299 on 33 degrees of freedom
## Multiple R-squared:  0.5661, Adjusted R-squared:  0.553 
## F-statistic: 43.06 on 1 and 33 DF,  p-value: 1.864e-07
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_temp$fitted, col="red", lwd=2)

And precipitation:

TS_lm_precip <- lm(Specimens ~ Precipitation, data = TS_df)
summary(TS_lm_precip)
## 
## Call:
## lm(formula = Specimens ~ Precipitation, data = TS_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77073 -0.47601  0.08573  0.34152  0.96282 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7701328  0.1313966   5.861 1.45e-06 ***
## Precipitation 0.0001687  0.0177542   0.010    0.992    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5008 on 33 degrees of freedom
## Multiple R-squared:  2.736e-06,  Adjusted R-squared:  -0.0303 
## F-statistic: 9.028e-05 on 1 and 33 DF,  p-value: 0.9925
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_precip$fitted, col="blue", lwd=2)

Unlike temperature, precipitation by itself is clearly not a very good predictor, as we suspected. Finally, let's combine multiple factors into one model.

TS_lm_all <- lm(Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12, data = TS_df)
summary(TS_lm_all)
## 
## Call:
## lm(formula = Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12, 
##     data = TS_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50261 -0.20328 -0.00893  0.18471  0.62930 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.85405    1.00975  -0.846  0.40436   
## SpecimensPast  0.46646    0.15741   2.963  0.00591 **
## Max.Temp       0.04783    0.03881   1.232  0.22735   
## sin12          0.05026    0.18361   0.274  0.78618   
## cos12         -0.04708    0.21479  -0.219  0.82798   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2977 on 30 degrees of freedom
## Multiple R-squared:  0.6787, Adjusted R-squared:  0.6359 
## F-statistic: 15.85 on 4 and 30 DF,  p-value: 4.479e-07
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_all$fitted, col="purple", lwd=2)

This looks pretty good, but only the AR term appears to be significant. Let's try dropping the sin and cos terms:

TS_lm_sub <- lm(Specimens ~ SpecimensPast + Max.Temp, data = TS_df)
summary(TS_lm_sub)
## 
## Call:
## lm(formula = Specimens ~ SpecimensPast + Max.Temp, data = TS_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49304 -0.20062  0.03508  0.15616  0.66323 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.92723    0.28736  -3.227 0.002886 ** 
## SpecimensPast  0.39265    0.12197   3.219 0.002945 ** 
## Max.Temp       0.05276    0.01230   4.289 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2912 on 32 degrees of freedom
## Multiple R-squared:  0.6723, Adjusted R-squared:  0.6518 
## F-statistic: 32.82 on 2 and 32 DF,  p-value: 1.772e-08
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_sub$fitted, col="purple", lwd=2)

Visually, all of the models except the precipitation-only one appear to predict our data relatively well. Ideally we would also examine residual plots (residuals over time, and the ACF of the residuals) to see if we've managed to pull out all or most of the signal. For example, here's the ACF of the residuals for the "all but precipitation" model:

acf(TS_lm_all$residuals)

No lingering autocorrelation!
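
If you want a formal check to accompany the ACF plot, the Ljung-Box test (available in base R as Box.test) tests for autocorrelation in the residuals up to a chosen lag; a large p-value is consistent with no leftover signal:

# formal test for residual autocorrelation up to lag 12
Box.test(TS_lm_all$residuals, lag = 12, type = "Ljung-Box")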

Model Comparisons

So which model is the "best"? Which should we use? We want to find a well-fitting model that is also parsimonious. We could compare our models by looking at their \(R^2\) values (remember that \(R^2\) shows us the proportion of the variance in our data that can be explained by our model). However, \(R^2\) always increases as the number of parameters increases. It can be tempting to use adjusted \(R^2\), but we don't have good theory for what constitutes a meaningful difference in \(R^2\). We can instead use a partial F-test between any two nested models to see if the difference in \(R^2\) between them is significant. The anova function in R allows us to do this easily; for example, comparing the temperature model and the "all" model:

anova(TS_lm_temp, TS_lm_all)
## Analysis of Variance Table
## 
## Model 1: Specimens ~ Max.Temp
## Model 2: Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     33 3.5912                              
## 2     30 2.6590  3   0.93218 3.5057 0.02723 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that adding in the extra components gives a significant increase in \(R^2\) compared to temperature alone. However, the partial F-test only allows us to compare two models at a time, and only models that are nested.

Instead, a popular option is to use some type of information criterion (IC). The two most popular are the AIC (Akaike IC) and the BIC (Bayesian IC). Both allow you to compare multiple, arbitrary models, as long as the response data (here the square root of average counts) are the same between all models considered, and all models use the same data (i.e., you can't have added extra rows, but you can add columns). The BIC has the nice addition that you can calculate approximate model probabilities, so we'll use that here.

BIC is a metric that measures model fit (based on the likelihood) and discounts that fit according to how many parameters and data points you have. Generally, the model with the lowest BIC is preferred. A difference of greater than 5 is evidence against the higher-scoring model, and a difference of greater than 10 is strong evidence (note that these rules of thumb can depend on exactly how you calculate BIC, because of constant factors out front). It is thus often easier to interpret the approximate model probabilities.
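
For reference, the standard form is \(\mathrm{BIC} = -2\log L + p\log n\), where \(L\) is the maximized likelihood, \(p\) is the number of estimated parameters, and \(n\) is the number of observations. The approximate model probabilities computed below are \(\exp(-\Delta_i/2)/\sum_j \exp(-\Delta_j/2)\), where \(\Delta_i\) is model \(i\)'s BIC minus the minimum BIC across models.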

Let’s compare ALL of the models:

# define the length of the series 
n<-length(TS_df$Specimens)-1

# extract the BIC scores 
bics<-c(TSlm=extractAIC(TS_lm, k=log(n))[2],
        TSlmsc=extractAIC(TS_lm_sin_cos, k=log(n))[2],
        TSlmt=extractAIC(TS_lm_temp, k=log(n))[2],
        TSlmp=extractAIC(TS_lm_precip, k=log(n))[2],
        TSlmall=extractAIC(TS_lm_all, k=log(n))[2],
        TSlmsub=extractAIC(TS_lm_sub, k=log(n))[2])

## calculates the first part of the prob calculation
ebics<-exp(-0.5*(bics-min(bics)))

## final part of the probability calculation
probs<-ebics/sum(ebics)


rbind(round(bics-min(bics), 5), round(probs, 5))
##          TSlm   TSlmsc   TSlmt    TSlmp TSlmall TSlmsub
## [1,] 12.36738 10.83723 6.29270 35.51738 6.35325  0.0000
## [2,]  0.00189  0.00406 0.03941  0.00000 0.03824  0.9164

For this example using Culex erraticus from Walton County, we can see positive evidence for the sub-model incorporating maximum temperature and past abundance counts (TS_lm_sub), whose BIC is 6.35 lower than that of the "all" model. The probability that this is the best of the candidate models is >90%. In other words, our best model indicates that temperature and previous abundance are most predictive of the in-sample data.

Airline passenger data (Guided Practice)

Now let's perform a time series analysis of some data on airline passenger numbers. You will mimic the analysis from above for these data, but you will need to fill in the code/analysis gaps. While not about vectors, this is a classic dataset with many features that you can explore and practice with.

The data you will analyze consist of monthly numbers of international airline passengers from 1949 to 1961. These data can be found in the airline.csv file.

airline <- read.csv("../data/airline.csv")

First plot the data:

plot(airline$Passengers, xlab="year", ylab="monthly passengers", type="l", col=3, lwd=2, xaxt="n")
axis(1, at=(0:12)*12, labels=1949:1961)

Notice that this involves a somewhat different approach toward adding axis labels than usual.

Q: What does xaxt="n" mean? What does the axis function do? (Check the help doc: ?axis)

Next, we use the acf function to plot the auto-correlation function of the passengers data:

## put the acf function here

Q: From the two plots above, what things do you notice about the data? What transforms might you need to take of the data? What kinds of covariates might you need to add in?

Re-plot the data from above using a log transform of the response (Passengers):

## plot code here

Now it's time to build a data frame to hold the data. This is a good habit to get into when you are building models for data that involve transforms and possibly multiple lags, etc.

First we make a time variable:

t <- 2:nrow(airline)

Now, into the data frame, add the following covariates:

  1. logY: log of the number of passengers
  2. logYpast: this is your auto-regressive term, the log of the passengers from the previous month
  3. t: month number
  4. sin12: sine term with a period of 12 months
  5. cos12: cosine term with a period of 12 months
YX <- data.frame(logY=log(airline$Passengers[t]), 
                 logYpast=log(airline$Passengers[t-1]), t=t,
                 sin12=sin(2*pi*t/12), cos12=cos(2*pi*t/12))

Fit a linear model with logY as the response and the other 4 covariates as predictors, and look at the summary of the fit.

## your fitted model and the summary go here

Q: Are all of the predictors significant? What is the \(R^2\) of your regression?

Next, we want to plot the data along with the prediction (i.e., the fit). I've plotted the data on a log scale (drawn with a dotted line). Use the lines function to overlay the FITTED values from your regression (e.g., if your regression model was called "reg", plot reg$fitted vs t) as a solid line in another color. This solid line is your prediction. Update the legend to reflect your additional line.

plot(log(airline$Passengers), xlab="year",
     ylab="log monthly passengers", type="l", col=4, lty=2,
     xaxt="n", lwd=2)
axis(1, at=(0:12)*12, labels=1949:1961)

## add in the line here

legend("topleft", legend=c("data", "fitted"), lty=c(2,1), col=c(4,4)) ## update the legend

The differences between the solid and dotted lines at each month are your residuals across time. As always, we also want to look at our residuals explicitly to see if we're doing a good job of explaining things. For time series we primarily look at the residuals across time, and at the ACF of the residuals. So make those two plots here.

par(mfrow=c(1,2))

## residuals plotted across time

## acf plot of the residuals

Q: How do these look? What do you notice about the residuals, esp the ACF?

It turns out that there is a month effect that we’re missing. Here is one way to look at it (note we have to index by t so that everything lines up properly):

## this command assumes that your fitted model is called mod1. You'll need to change it to your object

## boxplot(mod1$resid ~ airline$Month[t], xlab="month",
##        ylab="residuals", col=7)

Residuals in months with lots of school holidays (March, summer, December) are consistently high. Let’s create a dummy variable called “holidays” that tells whether a particular passenger record is for a month that has lots of holidays.

YX$holidays <- airline$Month[t] %in% c(3,6,7,8,12)

Fit a new lm that adds this holiday variable on to what you had before, and then re-examine the residuals, including by month.

## new fitted model and summary here

## plot of data + model fit here
par(mfrow=c(1,2))
## residuals plotted across time

## acf plot of the residuals

## boxplot of residuals by month

Model Comparison

Now you have 2 nested models. Because they are nested, we can compare them in two ways. First I do a partial F-test. The idea behind the F-test is that it looks at the difference in the \(R^2\) value between the two models and determines whether or not this difference is large enough to warrant the additional covariates. If we use the anova function in R and provide both fitted model objects as arguments, it will automatically perform a partial F-test. Note: you will need to replace mod1 and mod2 with the names of your model objects.

## partial F test: anova(mod1, mod2)

Q: Based on these what model would you choose and why?

We can also compare the models via BIC (the Bayesian Information Criterion) and the approximate relative model probabilities based on the BICs. Note that we can use BIC/model probabilities to compare between models even if the models are not nested, as long as the response being modeled is the same in each case. Note: you will need to replace mod1 and mod2 with the names of your model objects.

n<-length(YX$logY)-1
##bics<-c(mod1=extractAIC(mod1, k=log(n))[2],
##        mod2=extractAIC(mod2, k=log(n))[2])

##ebics<-exp(-0.5*(bics-min(bics)))

##probs<-ebics/sum(ebics)


##rbind(round(bics, 5), round(probs, 5))

Q: Which model is the best via BIC? Does this agree with what the partial F-test told you? What is the \(R^2\) for your best model? Based on this model selection, the \(R^2\), and what you saw in the residuals for this model, do you feel satisfied that your model is capturing the patterns in the data? Would you want to go back and fit anything else?

Aedes aegypti abundance (Independent Practice)

Use Aedes aegypti abundances from light traps at various locations in Manatee County, Florida to perform a time series analysis. This dataset is available in the workshop git repository (in the file vecdyn_manatee_county_a.aegypti.csv).

Write it as an independent, self-sufficient R script that produces all the plots in a reproducible workflow when sourced.