Covariate data #

Covariates are variables that are expected to change with the response variable: they covary in some way with the observations we seek to model. The definition of covariates varies widely online and in the literature. For our purposes, we use the term covariate to describe any variable (whether continuous or discrete) that might influence the mean of the response variable we are interested in. In many cases their effects are of direct interest in the analysis (weather or terrain, for instance). In other cases, a covariate might be a “nuisance variable”: a fact or feature that is of no particular interest in itself but might nonetheless be necessary to build a proper model and develop robust inference. Examples of nuisance variables include sudden changes in protocol or observers.

Covariates can be static or dynamic. A variable like the elevation of a site varies from site to site, but not in time (at least on the time scales we are interested in). Variables involving weather, such as precipitation, are less clear-cut. A long-run summary (e.g., 30-year average rainfall), like our static terrain variable, might vary only in space. However, an annualized metric (e.g., total rainfall in each year) will vary in both time and space.
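The static/dynamic distinction can be checked directly from the data: a covariate is dynamic if its value changes across years within at least one site. The sketch below is a minimal illustration, not part of the guide's own tooling. The site names, values, and the helper `varies_in_time` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (site, year, elevation_m, annual_precip_mm).
# All values are made up for illustration only.
records = [
    ("site_a", 2009, 950.0, 310.2),
    ("site_a", 2010, 950.0, 355.8),
    ("site_b", 2009, 1012.0, 298.4),
    ("site_b", 2010, 1012.0, 340.1),
]


def varies_in_time(records, column):
    """Return True if the value at index `column` changes within any site."""
    values_by_site = defaultdict(set)
    for row in records:
        # row[0] is the site name; collect the distinct values seen at that site.
        values_by_site[row[0]].add(row[column])
    return any(len(values) > 1 for values in values_by_site.values())


elevation_is_dynamic = varies_in_time(records, 2)  # differs by site only
precip_is_dynamic = varies_in_time(records, 3)     # differs by site and year
```

Here elevation is identical across years at each site, so it is treated as static, while annual precipitation changes from year to year and is treated as dynamic.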

The granularity of information can also vary. Sometimes the information we have is rather coarse, so many sites share the same value. Gridded climate data are a good example of relatively coarse data (e.g., 1 km grid cells): adjacent sites falling into the same grid cell will have the same value.
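To make the effect of coarse granularity concrete, the following sketch assigns sites to 1 km grid cells and looks up one covariate value per cell. It assumes projected coordinates in metres and a simple square grid aligned to the coordinate origin. The coordinates, cell values, and the helper `grid_cell` are hypothetical.

```python
import math

CELL_SIZE_M = 1000  # 1 km grid cells, matching the example in the text


def grid_cell(easting_m, northing_m, cell_size_m=CELL_SIZE_M):
    """Return the (column, row) index of the grid cell containing a point."""
    return (
        math.floor(easting_m / cell_size_m),
        math.floor(northing_m / cell_size_m),
    )


# Hypothetical site coordinates (easting, northing) in metres.
sites = {
    "site_a": (500120.0, 5040310.0),
    "site_b": (500870.0, 5040940.0),  # falls in the same cell as site_a
    "site_c": (501050.0, 5040400.0),  # falls in the next cell to the east
}

# Hypothetical gridded covariate: a single value per cell.
cell_values = {(500, 5040): 12.4, (501, 5040): 12.9}

# Each site inherits the value of the cell it falls into.
site_values = {name: cell_values[grid_cell(x, y)] for name, (x, y) in sites.items()}
```

Because `site_a` and `site_b` fall into the same cell, they receive the same covariate value even though they are distinct sites.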

Format #

As with the response data, covariates are typically stored as flat files.
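As a rough illustration of working with such a file, the sketch below reads a small delimited table using Python's standard `csv` module. The comma-separated layout is an assumption, since the guide does not specify a delimiter. The column names and rows are borrowed from the LIBI example in this guide.

```python
import csv
import io

# Stand-in for a flat file on disk; with a real file, use open(path, newline="").
# The comma-separated layout is an assumption.
flat_file = io.StringIO(
    "MDCATY,SiteName,Year,Botanist,deficit.pregr\n"
    "Gulley1,Grid_100,2009,DS,418.7766661\n"
    "Gulley1,Grid_100,2010,DS,434.9612421\n"
    "Gulley1,Grid_100,2011,JA,408.9919742\n"
)

rows = []
for record in csv.DictReader(flat_file):
    # The csv module yields strings, so convert numeric columns explicitly.
    record["Year"] = int(record["Year"])
    record["deficit.pregr"] = float(record["deficit.pregr"])
    rows.append(record)
```

Categorical columns such as `Botanist` stay as strings, while `Year` and `deficit.pregr` become numbers that can be used in calculations.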

Example #

The data below contain site (SiteName), stratum (MDCATY), time (Year), and covariate information (Botanist and deficit.pregr) for sites from Little Bighorn Battlefield National Monument (LIBI), in Montana. Here, we see the first and last six rows of the data.

Row	MDCATY	SiteName	Year	Botanist	deficit.pregr
1	Gulley1	Grid_100	2009	DS	418.7766661
2	Gulley1	Grid_100	2010	DS	434.9612421
3	Gulley1	Grid_100	2011	JA	408.9919742
4	Gulley1	Grid_100	2012	JA	450.2639661
5	Gulley1	Grid_100	2013	JA	532.0683071
6	Gulley1	Grid_100	2014	JA	437.0008004
…	…	…	…	…	…
3240	Upland	LIBI_050	2014	JA	381.9736062
3241	Upland	LIBI_050	2015	JA	328.6534898
3242	Upland	LIBI_050	2016	JA	431.6948948
3243	Upland	LIBI_050	2017	JA	402.8102409
3244	Upland	LIBI_050	2018	JA	393.2275301
3245	Upland	LIBI_050	2019	JA	247.7891026

It’s worth pointing out a few things. First, we see two different types of variables: a categorical variable containing the initials of the observing botanist (Botanist), and a continuous variable for pre-growing-season water deficit (deficit.pregr). Second, the spatial granularity of the covariates is limited to the site, but both variables appear to vary in time. Finally, we see more rows than we might expect based on the example data seen in the response data section. The reason for this is twofold:

1. The covariate data at a site include values for all years over the duration of the study, whether that site was sampled or not. In general, sites are visited on a rotating basis, meaning they’re not sampled every year.
2. There are sites (e.g., Grid_100) that were never sampled. We include this information because it’s needed to make inference at the park scale, and because we’re interested in making predictions of our focal response variables at every site on the landscape, whether it was visited by field crews or not.
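The relationship between the two tables can be sketched as a comparison of site-year keys: the covariate table covers every site in every year, while the response table covers only the visits that took place. The example below is a toy illustration under that assumption. The visit schedule and the site name `LIBI_049` are hypothetical.

```python
from itertools import product

# Hypothetical miniature study: three sites and three years.
# Grid_100 and LIBI_050 appear in the example data; LIBI_049 is made up.
sites = ["Grid_100", "LIBI_049", "LIBI_050"]
years = [2017, 2018, 2019]

# Covariate table: one row per site-year, whether sampled or not.
covariate_keys = set(product(sites, years))

# Hypothetical response table: only the site-years visited by field crews.
# Grid_100 is absent because it was never sampled.
response_keys = {("LIBI_049", 2017), ("LIBI_049", 2019), ("LIBI_050", 2018)}

# Sanity check: every response row has a matching covariate row.
assert response_keys <= covariate_keys

# Site-years with covariates but no observed response.
unsampled = sorted(covariate_keys - response_keys)

# Sites that have covariates but were never visited at all.
sampled_sites = {site for site, _ in response_keys}
never_sampled_sites = sorted(set(sites) - sampled_sites)
```

In this toy case the covariate table has nine rows and the response table three, so six site-years carry covariates alone. Those are the site-years where predictions would be needed.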

From data to model #

Other sections of this guide show how to include covariates in models using the XX and YY blocks of the analysis config files.