This section focuses on techniques for analyzing time series data of interest, such as abundances. It assumes that you have seen the time series lecture, where basic time series analyses were introduced as extensions of linear regression.
For this practical you will need to have the following packages installed for data manipulation:
require(dplyr)
require(tidyr)
You will get warnings because these packages contain functions with the same names as functions in the base R packages.
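If either package is not yet installed, a minimal sketch along these lines (assuming a standard CRAN setup) will install whatever is missing before you load it:
# Install dplyr/tidyr from CRAN if they are not already available
for (pkg in c("dplyr", "tidyr")){
    if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}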
In the face of global climate change, vector-borne diseases have the capacity to shift ranges and drastically alter emergence times, which can have profound public health implications. It is therefore essential that we monitor vector populations and attempt to forecast how a changing climate could affect them.
We will apply linear models to vector abundance data available in VecDyn.
Now let’s read in some data. There are a couple of example datasets for you to explore, as well as a wealth of data you can download from the VecDyn database. They are all in the same format, so the cleaning procedures and the analytic steps for linear modeling will be the same.
# This is a dataset from Walton County, Florida for the species Culex erraticus
Complete_data <- read.csv("../data/Culex_erraticus_walton.csv")
## Or you could use either of the following datasets for Culex nigripalpus
# Complete_data <- read.csv("../data/Culex_nigripalpus_walton.csv")
# Complete_data <- read.csv("../data/Culex_nigripalpus_manatee.csv")
Great, now we have some data. It’s important to have a look at its layout and format first. I usually use three functions to start with. First, names(), which tells you the names of the columns:
names(Complete_data)
## [1] "X" "title"
## [3] "dataset_citation" "publication_citation"
## [5] "description" "url"
## [7] "contact_name" "contact_affiliation"
## [9] "email" "orcid"
## [11] "dataset_license" "project_identifier"
## [13] "publication_status" "taxon"
## [15] "location_description" "study_collection_area"
## [17] "geo_datum" "gps_obfuscation_info"
## [19] "species_id_method" "study_design"
## [21] "sampling_strategy" "sampling_method"
## [23] "sampling_protocol" "measurement_unit"
## [25] "value_transform" "sample_start_date"
## [27] "sample_start_time" "sample_end_date"
## [29] "sample_end_time" "sample_value"
## [31] "sample_sex" "sample_stage"
## [33] "sample_location" "sample_collection_area"
## [35] "sample_lat_dd" "sample_long_dd"
## [37] "sample_environment" "additional_location_info"
## [39] "additional_sample_info" "sample_name"
We can already see that there is a lot in this dataset that we probably won’t be using. Next, head() allows us to look at the first few rows of data:
head(Complete_data)
## X title
## 1 1622290 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 2 1622343 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 3 1622380 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 4 1622410 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 5 1622440 Mosquito surveillance in South Walton County, Florida, USA. 2015
## 6 1622479 Mosquito surveillance in South Walton County, Florida, USA. 2015
## dataset_citation
## 1 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 2 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 3 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 4 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 5 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## 6 DOI:10.5281/zenodo.1220173,https://southwaltonmosquitocontrol.org/
## publication_citation
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## description
## 1 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 2 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 3 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 4 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 5 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## 6 Mosquito surveillance from the South Walton County Mosquito Control District Vector Surveillance program to survey mosquito populations.
## url contact_name contact_affiliation email orcid
## 1 NA Peter J Brabant;Peter J Brabant III NA NA NA
## 2 NA Peter J Brabant;Peter J Brabant III NA NA NA
## 3 NA Peter J Brabant;Peter J Brabant III NA NA NA
## 4 NA Peter J Brabant;Peter J Brabant III NA NA NA
## 5 NA Peter J Brabant;Peter J Brabant III NA NA NA
## 6 NA Peter J Brabant;Peter J Brabant III NA NA NA
## dataset_license project_identifier publication_status taxon
## 1 CC BY VBP0000263 unpublished Culex erraticus
## 2 CC BY VBP0000263 unpublished Culex erraticus
## 3 CC BY VBP0000263 unpublished Culex erraticus
## 4 CC BY VBP0000263 unpublished Culex erraticus
## 5 CC BY VBP0000263 unpublished Culex erraticus
## 6 CC BY VBP0000263 unpublished Culex erraticus
## location_description study_collection_area geo_datum gps_obfuscation_info
## 1 Walton County NA NA NA
## 2 Walton County NA NA NA
## 3 Walton County NA NA NA
## 4 Walton County NA NA NA
## 5 Walton County NA NA NA
## 6 Walton County NA NA NA
## species_id_method study_design sampling_strategy sampling_method
## 1 NA NA NA light
## 2 NA NA NA light
## 3 NA NA NA light
## 4 NA NA NA light
## 5 NA NA NA light
## 6 NA NA NA light
## sampling_protocol measurement_unit value_transform sample_start_date
## 1 new jersey trap catch NA NA NA
## 2 new jersey trap catch NA NA NA
## 3 new jersey trap catch NA NA NA
## 4 new jersey trap catch NA NA NA
## 5 new jersey trap catch NA NA NA
## 6 new jersey trap catch NA NA NA
## sample_start_time sample_end_date sample_end_time sample_value sample_sex
## 1 NA 2015-01-05 NA 0 female
## 2 NA 2015-01-05 NA 0 female
## 3 NA 2015-01-05 NA 0 female
## 4 NA 2015-01-05 NA 0 female
## 5 NA 2015-01-05 NA 0 female
## 6 NA 2015-01-22 NA 0 female
## sample_stage sample_location sample_collection_area sample_lat_dd
## 1 adult Walton County NA 30.3813
## 2 adult Walton County NA 30.3892
## 3 adult Walton County NA 30.3469
## 4 adult Walton County NA 30.3717
## 5 adult Walton County NA 30.3650
## 6 adult Walton County NA 30.3813
## sample_long_dd sample_environment additional_location_info
## 1 -86.3593 NA NA
## 2 -86.0890 NA NA
## 3 -86.0694 NA NA
## 4 -86.2655 NA NA
## 5 -86.2261 NA NA
## 6 -86.3593 NA NA
## additional_sample_info sample_name
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
Finally, summary(), which tells us a little bit more about the type of data in each column, as R sees it:
summary(Complete_data)
## X title dataset_citation publication_citation
## Min. :1622290 Length:6096 Length:6096 Mode:logical
## 1st Qu.:1666636 Class :character Class :character NA's:6096
## Median :1709015 Mode :character Mode :character
## Mean :1711077
## 3rd Qu.:1756655
## Max. :1805144
## description url contact_name contact_affiliation
## Length:6096 Mode:logical Length:6096 Mode:logical
## Class :character NA's:6096 Class :character NA's:6096
## Mode :character Mode :character
##
##
##
## email orcid dataset_license project_identifier
## Mode:logical Mode:logical Length:6096 Length:6096
## NA's:6096 NA's:6096 Class :character Class :character
## Mode :character Mode :character
##
##
##
## publication_status taxon location_description
## Length:6096 Length:6096 Length:6096
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## study_collection_area geo_datum gps_obfuscation_info species_id_method
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:6096 NA's:6096 NA's:6096 NA's:6096
##
##
##
##
## study_design sampling_strategy sampling_method sampling_protocol
## Mode:logical Mode:logical Length:6096 Length:6096
## NA's:6096 NA's:6096 Class :character Class :character
## Mode :character Mode :character
##
##
##
## measurement_unit value_transform sample_start_date sample_start_time
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:6096 NA's:6096 NA's:6096 NA's:6096
##
##
##
##
## sample_end_date sample_end_time sample_value sample_sex
## Length:6096 Mode:logical Min. : 0.0000 Length:6096
## Class :character NA's:6096 1st Qu.: 0.0000 Class :character
## Mode :character Median : 0.0000 Mode :character
## Mean : 0.8735
## 3rd Qu.: 0.0000
## Max. :47.0000
## sample_stage sample_location sample_collection_area sample_lat_dd
## Length:6096 Length:6096 Mode:logical Min. :30.27
## Class :character Class :character NA's:6096 1st Qu.:30.35
## Mode :character Mode :character Median :30.37
## Mean :30.36
## 3rd Qu.:30.39
## Max. :30.41
## sample_long_dd sample_environment additional_location_info
## Min. :-86.40 Mode:logical Mode:logical
## 1st Qu.:-86.27 NA's:6096 NA's:6096
## Median :-86.21
## Mean :-86.19
## 3rd Qu.:-86.10
## Max. :-86.00
## additional_sample_info sample_name
## Mode:logical Mode:logical
## NA's:6096 NA's:6096
##
##
##
##
Notice here that most of these columns are of type “character”, which indicates strings. In certain cases you may end up wanting to use some of these (e.g. sex or stage) as factors, in which case you have to change the format before you can use them in your analysis.
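For example, if you later wanted to treat sex or life stage as factors, a quick conversion might look like this (using the sample_sex and sample_stage columns we saw above):
# Convert character columns to factors if you plan to use them as covariates
Complete_data$sample_sex <- as.factor(Complete_data$sample_sex)
Complete_data$sample_stage <- as.factor(Complete_data$sample_stage)
levels(Complete_data$sample_sex) # check which levels R has picked up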
There’s a lot of information in this dataset, most of which we won’t be using to analyze the time series data. For simplicity, let’s clean the data and store it in another data frame which we can use to match with climate data from the location where the data were taken.
# select only the columns of interest
Main_data <- select(Complete_data, c("sample_end_date", "sample_value", "sample_lat_dd", "sample_long_dd"))
# Make sure the sample end date is in date format
Main_data$sample_end_date <- as.Date(Main_data$sample_end_date, format = "%Y-%m-%d")
# Order by date
Main_data <- Main_data[order(Main_data$sample_end_date, decreasing=FALSE),]
# We can now create columns for Month/Year and Month
Main_data$Month_Yr <- format(as.Date(Main_data$sample_end_date), "%Y-%m")
Main_data$Month <- format(as.Date(Main_data$sample_end_date), "%m")
Before we go any further, let’s take a look at the data so far
summary(Main_data)
## sample_end_date sample_value sample_lat_dd sample_long_dd
## Min. :2015-01-05 Min. : 0.0000 Min. :30.27 Min. :-86.40
## 1st Qu.:2015-10-12 1st Qu.: 0.0000 1st Qu.:30.35 1st Qu.:-86.27
## Median :2016-07-05 Median : 0.0000 Median :30.37 Median :-86.21
## Mean :2016-07-09 Mean : 0.8735 Mean :30.36 Mean :-86.19
## 3rd Qu.:2017-04-18 3rd Qu.: 0.0000 3rd Qu.:30.39 3rd Qu.:-86.10
## Max. :2017-12-18 Max. :47.0000 Max. :30.41 Max. :-86.00
## Month_Yr Month
## Length:6096 Length:6096
## Class :character Class :character
## Mode :character Mode :character
##
##
##
and plot the samples
plot(Main_data$sample_value, type="l")
points(Main_data$sample_value, col=2)
Notice that we have a LOT of zeros, both between seasons and because data aren’t collected every day. Further, we have a lot of variance. This dataset, as is, will be hard to model without some sort of transformation. We also can’t use basic regression approaches if there are gaps in the data. So the best bet here will be to aggregate and then possibly transform (but let’s get our climate data first).
# Pre-allocate columns for the climate data by filling with NAs
Main_data$Max.Temp <- NA
Main_data$Precipitation <- NA
Next, we can import the climate data to be matched up to the abundance data we have stored as Main_data. For this, you will need to ensure you have the climate data file vectorbase_locations_dates_climate.csv (look in the data directory on the GitHub repo) saved in your data directory.
This climate data has been downloaded in advance from NOAA databases, on a 2.5 x 2.5 degree lat/long grid, at a daily time step.
# Read in the climate data csv
Climate_data <- read.csv("../data/vectorbase_locations_dates_climate.csv")
We can now populate these columns by matching up the date for each row, and the closest co-ordinates we have in our climate data.
# For each row in Main_data
for (row in 1:nrow(Main_data)){
    # extract the date associated with the row
    date <- as.character(Main_data[row, "sample_end_date"])
    # subset the climate data to only those records with the same date
    data <- subset(Climate_data, Climate_data$Collection.date.range == date)
    if (nrow(data) > 0){
        # define the desired lat and long
        lat <- as.numeric(Main_data[row, "sample_lat_dd"])
        long <- as.numeric(Main_data[row, "sample_long_dd"])
        # find the absolute differences between the desired lat/long and those in the climate data
        x <- abs(data$Initial_latitude - lat)
        y <- abs(data$Initial_longitude - long)
        # find the index with the minimum overall difference in lat/long
        z <- which(x + y == min(x + y))
        # draw out the max temp and precipitation and place them into the main data frame
        Main_data[row, "Max.Temp"] <- data[z, "Max.Temp"]
        Main_data[row, "Precipitation"] <- data[z, "Precipitation"]
    } else {
        # If there aren't any data to extract for a given date, input NAs
        Main_data[row, "Max.Temp"] <- NA
        Main_data[row, "Precipitation"] <- NA
    }
}
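Row-wise loops like this can be slow for large datasets. As a design note, the same matching can be done in a vectorized way with dplyr; the sketch below (assuming dplyr >= 1.0 for slice_min(), the same column names as in the loop, and using illustrative names climate_matched, row_id, and date_chr) joins on date and then keeps the closest climate grid point per sample:
# Vectorized alternative (sketch): join climate records by date, then keep the
# grid point with the smallest lat/long distance for each sample row
climate_matched <- Main_data %>%
    select(-Max.Temp, -Precipitation) %>%   # drop the pre-allocated NA columns
    mutate(row_id = row_number(),
           date_chr = as.character(sample_end_date)) %>%
    left_join(Climate_data, by = c("date_chr" = "Collection.date.range")) %>%
    group_by(row_id) %>%
    slice_min(abs(Initial_latitude - sample_lat_dd) +
                  abs(Initial_longitude - sample_long_dd),
              n = 1, with_ties = FALSE) %>%
    ungroup()
Rows with no matching date keep NA climate values, just as in the loop; we will carry on with the loop-filled Main_data below.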
Now let’s check whether this has worked correctly, and assess whether there are any NAs.
summary(Main_data$Max.Temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.656 23.902 27.844 26.895 31.091 35.782
summary(Main_data$Precipitation)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.02836 5.73650 4.60330 121.88240
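As an extra, optional check, we can count missing values in the matched climate columns directly:
# Count any remaining missing values in the matched climate columns
colSums(is.na(Main_data[, c("Max.Temp", "Precipitation")]))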
When considering time series data, the temporal resolution can have profound implications. Ideally, we want no gaps in our data, so working with these abundance data at the daily scale is impossible without some sort of interpolation to fill the gaps. Instead, let’s aggregate the data to the monthly scale by averaging the daily data. This will give us the average number of mosquitoes observed per day in each month.
Aggregated_data <- aggregate(cbind(sample_value, Max.Temp, Precipitation) ~ Month_Yr, data = Main_data, mean)
print(Aggregated_data)
## Month_Yr sample_value Max.Temp Precipitation
## 1 2015-01 0.000000000 17.74602 3.303991888
## 2 2015-02 0.018181818 17.87269 16.544265802
## 3 2015-03 0.468085106 23.81767 2.405651215
## 4 2015-04 1.619047619 26.03559 8.974406168
## 5 2015-05 0.821428571 30.01602 0.567960943
## 6 2015-06 3.005952381 31.12094 4.841342729
## 7 2015-07 2.380952381 32.81130 3.849010353
## 8 2015-08 1.826347305 32.56245 5.562845324
## 9 2015-09 0.648809524 30.55155 10.409724627
## 10 2015-10 0.988023952 27.22605 0.337750269
## 11 2015-11 0.737804878 24.86768 18.306749680
## 12 2015-12 0.142857143 22.46588 5.621475377
## 13 2016-01 0.000000000 16.02406 3.550622029
## 14 2016-02 0.020202020 19.42057 11.254680803
## 15 2016-03 0.015151515 23.13610 4.785664728
## 16 2016-04 0.026143791 24.98082 4.580424519
## 17 2016-05 0.025252525 28.72884 0.053057634
## 18 2016-06 0.833333333 30.96990 6.155417473
## 19 2016-07 1.261363636 33.30509 4.496368193
## 20 2016-08 1.685279188 32.09633 11.338749182
## 21 2016-09 2.617142857 31.60575 2.868288451
## 22 2016-10 1.212121212 29.14275 0.000000000
## 23 2016-11 1.539772727 24.48482 0.005462681
## 24 2016-12 0.771573604 20.46054 11.615521725
## 25 2017-01 0.045454545 18.35473 0.000000000
## 26 2017-02 0.036363636 23.65584 3.150710053
## 27 2017-03 0.194285714 22.53573 1.430094952
## 28 2017-04 0.436548223 26.15299 0.499381616
## 29 2017-05 1.202020202 28.00173 6.580562663
## 30 2017-06 0.834196891 29.48951 13.333939858
## 31 2017-07 1.765363128 32.25135 7.493927035
## 32 2017-08 0.744791667 31.86476 6.082113434
## 33 2017-09 0.722222222 30.60566 4.631037395
## 34 2017-10 0.142131980 27.73453 11.567112214
## 35 2017-11 0.289772727 23.23140 1.195760473
## 36 2017-12 0.009174312 18.93603 4.018254442
As I mentioned above, we may want/need to transform our data. So first let’s plot the samples over months:
plot(Aggregated_data$sample_value, type = "l", main="Average Abundance", xlab ="Time (months)", ylab = "Average Count")
We still see some of the behavior from earlier: we bottom out with lots of zero counts, and the variance is small when counts are small and large when counts are large. If we had summed instead of averaged, we would have raw counts and could use a Poisson or similar model to accommodate this difference in variances. Working with means, the standard approaches are to take the natural log (if no zeros are present) or the square root (if there are zeros). One can also add a small value to all of the counts and then log. In this case I’m going to work with the square root.
# create a sqrt column - we could also use a log(x+small) transform
Aggregated_data$sqrt_sample <- sqrt(Aggregated_data$sample_value)
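For comparison, the log-plus-a-small-value alternative mentioned above might look like this (the offset of 0.5 is an arbitrary illustrative choice; we will stick with the square root below):
# Alternative: log transform after adding a small constant to handle the zeros
# (0.5 here is an arbitrary illustrative choice)
Aggregated_data$log_sample <- log(Aggregated_data$sample_value + 0.5)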
Now that we have a clean dataframe to work with, let’s plot our abundance data.
plot(Aggregated_data$sqrt_sample, type = "l", main="Average Monthly Abundance", xlab ="Time (months)", ylab = "Average SqRt Count")
It’s good practice to always begin by creating a data frame containing your response and all of the covariates that you want to consider. We include sine and cosine waves with a 12-month period to capture the pattern of seasonality apparent in these data. I also use weather data from the same month as each observation; given the time mosquitoes take to mature, however, it may be reasonable to use the previous month’s weather instead.
t <- 2:nrow(Aggregated_data)
TS_df <- data.frame(Specimens=Aggregated_data$sqrt_sample[t],
SpecimensPast=Aggregated_data$sqrt_sample[t-1],
t=t,
Month=Aggregated_data$Month_Yr[t],
Max.Temp=Aggregated_data$Max.Temp[t],
Precipitation=Aggregated_data$Precipitation[t],
sin12=sin(2*pi*t/12), cos12=cos(2*pi*t/12))
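If you want to follow up on the previous-month idea, lagged climate covariates can be added to the same data frame in exactly the same way as SpecimensPast (a sketch; we will not use these columns below):
# Optional: previous-month climate covariates, built the same way as SpecimensPast
TS_df$Max.TempPast <- Aggregated_data$Max.Temp[t-1]
TS_df$PrecipitationPast <- Aggregated_data$Precipitation[t-1]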
Here we will fit and plot a number of linear regression models for our data, progressively adding components to see which best predicts the data.
Are the data autocorrelated? That is, do the abundances correlate with themselves at some time lag? We can check using the acf() function in R:
acf(TS_df$Specimens)
The acf() function automatically includes a 0 time lag, which will always have a value of 1 (every measure is perfectly correlated with itself!), thus acting as a reference point. We can see that a time lag of 1 (month) has the highest correlation, so it should be interesting to incorporate this into a linear model.
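Before we do, a quick sanity check: the lag-1 value reported by acf() should be close to the plain correlation between the series and its one-month lag, which we already have as a column:
# Quick check: lag-1 autocorrelation vs. the plain correlation with the lagged column
cor(TS_df$Specimens, TS_df$SpecimensPast)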
TS_lm <- lm(Specimens ~ SpecimensPast, data = TS_df)
Let’s look at the summary:
summary(TS_lm)
##
## Call:
## lm(formula = Specimens ~ SpecimensPast, data = TS_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6338 -0.2173 -0.0678 0.2463 0.8675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2410 0.1130 2.132 0.0405 *
## SpecimensPast 0.6899 0.1240 5.562 3.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3598 on 33 degrees of freedom
## Multiple R-squared: 0.4839, Adjusted R-squared: 0.4682
## F-statistic: 30.94 on 1 and 33 DF, p-value: 3.507e-06
The AR coefficient is significant and \(<1\), which indicates that the series seems to be “mean reverting” (as we learned in the lecture portion). Let’s plot the fit with the data:
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm$fitted, col="pink", lwd=2)
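As an aside, because the AR coefficient is below 1, the fitted intercept and slope imply a finite long-run (stationary) mean of intercept/(1 - slope) on the square-root scale; a quick calculation, which should land roughly near the overall mean of the transformed series:
# Implied long-run (stationary) mean of the fitted AR(1), on the square-root scale
b <- coef(TS_lm)
b["(Intercept)"] / (1 - b["SpecimensPast"])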
The AR fit is actually pretty good (although AR fits often are, especially if there is any mean reversion). What if we don’t use the AR term and just try the sin/cos terms?
# fit a model with only the seasonal sine and cosine terms
TS_lm_sin_cos <- lm(Specimens ~ sin12 + cos12, data = TS_df)
# plot the sample
par(mfrow=c(1,1))
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
# add a line to the plot for this particular model
lines(t, TS_lm_sin_cos$fitted, col="green", lwd=2)
It looks like the sine and cosine waves can predict the peaks relatively well!
From here there are lots of possible paths forward. For now we will create models to incorporate climatic factors individually. First we visually examine the times series of temperature and precipitation:
par(mfrow=c(2,1), mar=c(2, 4, 1.1, 1.1))
plot(t, TS_df$Max.Temp, type="l", xlab = "Time (months)", ylab = "Ave Max Temperature")
plot(t, TS_df$Precipitation, type="l", xlab = "Time (months)", ylab = "Ave Precipitation per day")
Clearly temperature looks more promising as a correlate, but let’s fit linear models with each of these variables anyway:
TS_lm_temp <- lm(Specimens ~ Max.Temp, data = TS_df)
summary(TS_lm_temp)
##
## Call:
## lm(formula = Specimens ~ Max.Temp, data = TS_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78204 -0.17950 0.01095 0.14751 0.61934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.22123 0.30870 -3.956 0.000381 ***
## Max.Temp 0.07526 0.01147 6.562 1.86e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3299 on 33 degrees of freedom
## Multiple R-squared: 0.5661, Adjusted R-squared: 0.553
## F-statistic: 43.06 on 1 and 33 DF, p-value: 1.864e-07
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_temp$fitted, col="red", lwd=2)
And precipitation:
TS_lm_precip <- lm(Specimens ~ Precipitation, data = TS_df)
summary(TS_lm_precip)
##
## Call:
## lm(formula = Specimens ~ Precipitation, data = TS_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77073 -0.47601 0.08573 0.34152 0.96282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7701328 0.1313966 5.861 1.45e-06 ***
## Precipitation 0.0001687 0.0177542 0.010 0.992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5008 on 33 degrees of freedom
## Multiple R-squared: 2.736e-06, Adjusted R-squared: -0.0303
## F-statistic: 9.028e-05 on 1 and 33 DF, p-value: 0.9925
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_precip$fitted, col="blue", lwd=2)
Unlike temperature, precipitation by itself is clearly not a very good predictor, as we suspected. Finally, let’s combine multiple factors into one model.
TS_lm_all <- lm(Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12, data = TS_df)
summary(TS_lm_all)
##
## Call:
## lm(formula = Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12,
## data = TS_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50261 -0.20328 -0.00893 0.18471 0.62930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.85405 1.00975 -0.846 0.40436
## SpecimensPast 0.46646 0.15741 2.963 0.00591 **
## Max.Temp 0.04783 0.03881 1.232 0.22735
## sin12 0.05026 0.18361 0.274 0.78618
## cos12 -0.04708 0.21479 -0.219 0.82798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2977 on 30 degrees of freedom
## Multiple R-squared: 0.6787, Adjusted R-squared: 0.6359
## F-statistic: 15.85 on 4 and 30 DF, p-value: 4.479e-07
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_all$fitted, col="purple", lwd=2)
This looks pretty good, but only the AR term appears to be significant. Let’s try dropping the sin and cos terms:
TS_lm_sub <- lm(Specimens ~ SpecimensPast + Max.Temp, data = TS_df)
summary(TS_lm_sub)
##
## Call:
## lm(formula = Specimens ~ SpecimensPast + Max.Temp, data = TS_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49304 -0.20062 0.03508 0.15616 0.66323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.92723 0.28736 -3.227 0.002886 **
## SpecimensPast 0.39265 0.12197 3.219 0.002945 **
## Max.Temp 0.05276 0.01230 4.289 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2912 on 32 degrees of freedom
## Multiple R-squared: 0.6723, Adjusted R-squared: 0.6518
## F-statistic: 32.82 on 2 and 32 DF, p-value: 1.772e-08
plot(t, TS_df$Specimens, type="l", xlab = "Time (months)", ylab = "Average SqRt Count")
lines(t, TS_lm_sub$fitted, col="purple", lwd=2)
All of the models appear, visually, to predict our data relatively well, except the precipitation-only one. Ideally we would also examine residual plots (residuals over time, and the ACF of the residuals) to see if we’ve managed to pull out all or most of the signal. For example, here’s the ACF of the residuals for the model with everything except precipitation (TS_lm_all):
acf(TS_lm_all$residuals)
No lingering autocorrelation!
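We can also plot the residuals of the same model over time; a patternless band around zero is what we hope to see:
# Residuals of the "all" model across time
plot(t, TS_lm_all$residuals, type = "b", xlab = "Time (months)", ylab = "Residuals")
abline(h = 0, lty = 2)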
So which model is the “best”? Which should we use? We want a well-fitting model that is also parsimonious. We could compare our models by looking at the \(R^2\) values (remember that the \(R^2\) value shows us the proportion of the variance in our data that can be explained by the model). However, \(R^2\) always increases as the number of parameters increases. It can be tempting to use adjusted \(R^2\), but we don’t have good theory as to what constitutes a meaningful difference in \(R^2\). We can instead use a partial F-test between any two (nested) models to see if the difference in \(R^2\) between them is significant; the anova() function in R allows us to do this easily. For example, compare the temperature model and the “all” model:
anova(TS_lm_temp, TS_lm_all)
## Analysis of Variance Table
##
## Model 1: Specimens ~ Max.Temp
## Model 2: Specimens ~ SpecimensPast + Max.Temp + sin12 + cos12
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 3.5912
## 2 30 2.6590 3 0.93218 3.5057 0.02723 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tells us that adding the extra components gives a significant increase in \(R^2\) compared to temperature alone. However, the partial F-test only allows us to compare two models at a time, and only models that are nested.
Instead, a popular option is to use some type of information criterion (IC). The two most popular are AIC (Akaike IC) and BIC (Bayesian IC). Both allow you to compare multiple arbitrary models as long as the response data (here sqrt(average counts)) are the same between all models considered, and all models use the same data (i.e., you can’t have added extra rows, but you can add columns). BIC has the nice addition that you can calculate approximate model probabilities, so we’ll use that here.
BIC is a metric that looks at model fit (based on the likelihood) and penalizes that fit according to how many parameters and data points you have. Generally, the model with the lowest BIC is preferred. A difference of greater than 5 shows evidence against the higher-scoring model, while a difference of greater than 10 shows strong evidence (note that these rules of thumb can depend on exactly how you calculate BIC, because of constant factors out front). It is therefore often easier to interpret the approximate model probabilities.
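Concretely, if \(\Delta_i = \mathrm{BIC}_i - \min_j \mathrm{BIC}_j\), the approximate probability that model \(i\) is the best of the set considered is \(p_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}\) (a standard BIC-weight approximation), which is exactly what the code below computes.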
Let’s compare ALL of the models:
# define the sample size (series length minus one) used in the BIC penalty
n <- length(TS_df$Specimens)-1
# extract the BIC scores
bics <- c(TSlm    = extractAIC(TS_lm, k=log(n))[2],
          TSlmsc  = extractAIC(TS_lm_sin_cos, k=log(n))[2],
          TSlmt   = extractAIC(TS_lm_temp, k=log(n))[2],
          TSlmp   = extractAIC(TS_lm_precip, k=log(n))[2],
          TSlmall = extractAIC(TS_lm_all, k=log(n))[2],
          TSlmsub = extractAIC(TS_lm_sub, k=log(n))[2])
## calculates the first part of the prob calculation
ebics<-exp(-0.5*(bics-min(bics)))
## final part of the probability calculation
probs<-ebics/sum(ebics)
rbind(round(bics-min(bics), 5), round(probs, 5))
## TSlm TSlmsc TSlmt TSlmp TSlmall TSlmsub
## [1,] 12.36738 10.83723 6.29270 35.51738 6.35325 0.0000
## [2,] 0.00189 0.00406 0.03941 0.00000 0.03824 0.9164
For this example using Culex erraticus from Walton County, we can see that there is positive evidence for the model incorporating maximum temperature and past abundance counts (TS_lm_sub), whose BIC score is 6.35 lower than that of the “all” model. The approximate probability that this is the best of the candidate models is >90%. In other words, our model comparison indicates that temperature and previous abundance are the most predictive of the in-sample data.
Now let’s perform a time series analysis of some data on airline passenger numbers. You will mimic the analysis from above for these data, but you will need to fill in the code/analysis gaps. While not about vectors, this is a classic dataset with many features that you can explore and practice with.
The data you will analyze consist of monthly numbers of international airline passengers from 1949 to 1961. These data can be found in the airline.csv file.
airline <- read.csv("../data/airline.csv")
First plot the data:
plot(airline$Passengers, xlab="year", ylab="monthly passengers", type="l", col=3, lwd=2, xaxt="n")
axis(1, at=(0:12)*12, labels=1949:1961)
Notice that this involves a somewhat different approach toward adding axis labels than usual.
Q: What does xaxt="n" mean? What does the axis() function do? (Check the help doc: ?axis)
Next, we use the acf() function to plot the auto-correlation function of the passengers data:
## put the acf function here
Q: From the two plots above, what things do you notice about the data? What transforms might you need to take of the data? What kinds of covariates might you need to add in?
Re-plot the data from above using a log transform of the response (Passengers):
## plot code here
Now it’s time to build a data frame to hold the data. This is a good habit to get into when you are building models for data that include transforms and possibly multiple lags, etc.
First we make a time variable:
t <- 2:nrow(airline)
Now create a data frame containing the following covariates:
YX <- data.frame(logY=log(airline$Passengers[t]),
logYpast=log(airline$Passengers[t-1]), t=t,
sin12=sin(2*pi*t/12), cos12=cos(2*pi*t/12))
Fit a linear model with logY as the response and the other 4 components as predictors, and look at the summary of the fit.
## your fitted model and the summary go here
Q: Are all of the predictors significant? What is the \(R^2\) of your regression?
Next, we want to plot the data along with the prediction (aka the fit). I’ve plotted the data on a log scale (drawn with a dotted line). Use the “lines” function to overlay the FITTED values from your regression (e.g., if your regression model was called “reg” you want to plot reg$fitted vs t) as a solid line in another color. This solid line is your prediction. Update the legend to reflect your additional line.
plot(log(airline$Passengers), xlab="year",
ylab="log monthly passengers", type="l", col=4, lty=2,
xaxt="n", lwd=2)
axis(1, at=(0:12)*12, labels=1949:1961)
## add in the line here
legend("topleft", legend=c("data", "fitted"), lty=c(2,1), col=c(4,4)) ## update the legend
The differences between the solid and dotted lines at each month are your residuals across time. As always, we want to look at the residuals explicitly to see if we’re doing a good job of explaining things. For time series we primarily look at the residuals across time and at the ACF of the residuals. So make those two plots here.
par(mfrow=c(1,2))
## residuals plotted across time
## acf plot of the residuals
Q: How do these look? What do you notice about the residuals, especially the ACF?
It turns out that there is a month effect that we’re missing. Here is one way to look at it (note we have to index by t so that everything lines up properly):
## this command assumes that your fitted model is called mod1. You'll need to change it to your object
## boxplot(mod1$resid ~ airline$Month[t], xlab="month",
## ylab="residuals", col=7)
Residuals in months with lots of school holidays (March, summer, December) are consistently high. Let’s create a dummy variable called “holidays” that tells whether a particular passenger record is for a month that has lots of holidays.
YX$holidays <- airline$Month[t] %in% c(3,6,7,8,12)
Fit a new lm that adds this holiday variable on to what you had before, and then re-examine the residuals, including by month.
## new fitted model and summary here
## plot of data + model fit here
par(mfrow=c(1,2))
## residuals plotted across time
## acf plot of the residuals
## boxplot of residuals by month
Now you have 2 nested models. Because they are nested, we can compare them in two ways. First, a partial F-test. The idea behind the F-test is that it looks at the difference in the \(R^2\) value between the two models and determines whether or not this difference is large enough to warrant the additional covariates. If we use the anova() function in R and provide both fitted model objects as arguments, it will automatically perform a partial F-test. Note: you will need to replace mod1 and mod2 with the names of your model objects.
## partial F test: anova(mod1, mod2)
Q: Based on these what model would you choose and why?
We can also compare the models via BIC (the Bayesian Information Criterion) and the approximate relative model probabilities based on the BICs. Note that we can use BIC/model probabilities to compare between models even if the models are not nested, as long as the response being modeled is the same in each case. Note: you will need to replace mod1 and mod2 with the names of your model objects.
n<-length(YX$logY)-1
##bics<-c(mod1=extractAIC(mod1, k=log(n))[2],
## mod2=extractAIC(mod2, k=log(n))[2])
##ebics<-exp(-0.5*(bics-min(bics)))
##probs<-ebics/sum(ebics)
##rbind(round(bics, 5), round(probs, 5))
Q: Which model is the best via BIC? Does this agree with what the partial F-test told you? What is the \(R^2\) for your best model? Based on this model selection, the \(R^2\), and what you saw in the residuals for this model, do you feel satisfied that your model is capturing the patterns in the data? Would you want to go back and fit anything else?
Use Aedes aegypti abundances from light traps at various locations in Manatee County, Florida to perform a time series analysis. This dataset is available on the workshop git repository (file called vecdyn_manatee_county_a.aegypti.csv).
Write your analysis as an independent, self-sufficient R script that produces all of the plots in a reproducible workflow when sourced.