---
title: "Extended moult models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Extended moult models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: moultmcmc.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

To install `moultmcmc` please follow the instructions in the [README file](https://github.com/pboesu/moultmcmc#installation).

```{r setup, warning=FALSE, message=FALSE}
library(moult)
library(moultmcmc)
library(dplyr)
library(ggplot2)
library(rstan)
```

# Lumped models

The gradual transition from new to old plumage between successive moults can make the assignment of non-moulting birds to pre-moult and post-moult categories ambiguous.
This classification problem can be sidestepped by distinguishing only two categories of
birds: moulting and non-moulting birds, which is implemented in the so called "lumped" moult models in `moultmcmc`.

The following example demonstrates the use of the lumped type 2 moult model.
It is based on a dataset originally supplied with the `moult` package [@erni2013moult] on Southern Masked Weavers (*Ploceus velatus*), all caught in the Western Cape province of South Africa [@oschadleus2005patterns]. 
For convenience the data set is provided with processed moult scores in `moultmcmc` as `weavers_processed`.

```{r lumped-demo, fig.width = 7, fig.caption='A plot of the Masked Weaver dataset clearly shows that individuals scored as pre-moult (pfmg = 0) and post-moult (pfmg = 1) are present in the sample throughout the year, even though the active moult period appears to be reasonably well constrained within the year.'}
data(weavers_processed)
weavers_processed <- filter(weavers_processed, Year %in% 1993:2003)
ggplot(weavers_processed, aes(x=day, y =pfmg)) +
  geom_point() + theme_classic() +
  ylab('PFMG') + xlab('Days since 01 August')
```

We then fit both standard and lumped type 2 moult models to this dataset. The lumped model can be activated by setting the option `lump_non_moult = TRUE`.
```{r lumped-models}
mc2 <- moultmcmc('pfmg',
                 'day',
                 data=weavers_processed,
                 chains = 2, cores = 2)
mc2l <- moultmcmc('pfmg',
                  'day',
                  data=weavers_processed,
                  lump_non_moult = TRUE,
                  chains = 2, cores = 2)
```
We also fit a type 3 model to these data, following the recommendation in @erni2013moult. This model type only uses active moult records, and therefore is unaffected by misclassification in the non-moulting birds.
```{r lumped-models-t3}
mc3 <- moultmcmc('pfmg',
                 'day',
                 data=weavers_processed,
                 type = 3,
                 chains = 2, cores=2)
```

```{r lumped-model-plots, fig.width = 7, fig.height = 8}
cowplot::plot_grid(
   moult_plot(mc2) + theme(legend.position='none') +
     xlim(-150,575) + ggtitle('Type 2 model'),
   moult_plot(mc2l) + theme(legend.position='none') +
     xlim(-150,575) + ggtitle('Lumped type 2 model (T2L)'),
   moult_plot(mc3) + theme(legend.position='bottom') +
     xlim(-150,575) + ggtitle('Type 3 model'),
   nrow=3
)
```

The moult plots make it clear that the standard type 2 model estimates an unrealistically large standard deviation of the start date, implying that the moult period for the population exceeds a calendar year. 
In contrast both the lumped type 2 model and the type 3 model give plausible estimates, in that both sets of estimates are consistent with published moult parameters for this species [@oschadleus2005patterns]. However, the lumped type 2 model offers much higher nominal precision, as apparent from the model comparison plot.

```{r lumped-model-compare, fig.width=7}
compare_plot(mc3,mc2,mc2l) + theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))
```

# Repeated measures moult models 
When the assumption of even sampling among and within moult categories is violated, the standard moult model can converge at biologically non-sensible results. 

This problem can be greatly reduced when information about the pace of moult from within-season recaptures of individuals is explicitly incorporated into to the parameter estimation using the model.

The repeated measures model is currently implemented for data types 2, 3 and 5 in the `moultmcmc` function. The repeated measures model is selected by supplying the optional argument `id_column` which designates *within season* recaptures, typically formed of a combination of ring number and season.

We demonstrate this here using simulated data which is biased towards records of birds in the early stages of moult, based on a citizen science dataset of moult scores from Eurasian Siskins (*Spinus spinus*)  (@insley2022moult).


```{r siskin-hist, fig.width = 7, fig.cap='Frequency distribution of moult records with respect to moult progress for actively moulting Eurasian Siskins [@insley2022moult] sampled in a sub-urban garden in Scotland (left) and seasonal progression of observed moult scores from the same dataset (right). Lines connect repeated observations of the same individuals.', echo= FALSE, message = FALSE, fig.height = 3}
data(siskins)
siskins$hexid = factor(format(as.hexmode(as.numeric(siskins$id)), width = 3))
cowplot::plot_grid(
ggplot(filter(siskins, pfmg > 0), aes(x=pfmg)) +
  geom_histogram(breaks = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1)) +
  theme_classic() + xlab("Moult index") + ylab("Number of records"),
ggplot(siskins, aes(x=yday,y=pfmg,group=id)) +
  geom_point(alpha = 0.3) + geom_path(alpha = 0.5)+
  theme_classic() + xlab('Days since Jan 01') + ylab('PFMG'))
```

The standard models are fitted by not specifying the `id_column` input argument (which implies `id_column = NULL`):

```{r, echo = FALSE}
uz3 <- moultmcmc('pfmg',
                  'yday',
                  start_formula = ~1,
                  duration_formula = ~1,
                  log_lik = FALSE,
                  data=siskins,
                  type = 3,
                  iter = 2000,
                  chain = 2,
                  cores = 2)
uz5 <- moultmcmc('pfmg',
                  'yday',
                  start_formula = ~1,
                  duration_formula = ~1,
                  log_lik = FALSE,
                  data=siskins,
                  type = 5,
                  iter = 2000,
                  chain = 2,
                  cores = 2)

```

and the repeated measures models by passing the information about bird identities to the  `moultmcmc()` function:

```{r sisikin-recap-models, eval = TRUE, cache = FALSE, warning=FALSE}
#fit the equivalent recapture models
uz3_recap <- moultmcmc(moult_column = 'pfmg',
                       date_column = 'yday',
                       id_column = 'hexid',
                       start_formula = ~1,
                       duration_formula = ~1,
                       type = 3,
                       log_lik = FALSE,
                       data=siskins,
                       iter = 1000,
                       chain = 2,
                       cores = 2)
max(rowSums(get_elapsed_time(uz3_recap$stanfit)))
#
uz5_recap <- moultmcmc('pfmg',
                       'yday',
                       'hexid',
                       start_formula = ~1,
                       duration_formula = ~1,
                       type = 5,
                       log_lik = FALSE,
                       data=siskins,
                       iter = 1000,
                       chain = 2,
                       cores = 2)
max(rowSums(get_elapsed_time(uz5_recap$stanfit)))

```
When comparing the results visually it is apparent that the moult period for the population predicted by the standard Type 3 models fails to encapsulate a substantial proportion of the observed data, whereas this is not the case for the repeated measures model.
The bias in Type 5 model estimates is less severe.

```{r siskin-model-plots, fig.width=7, eval = TRUE, echo = FALSE, fig.cap='Standard type 3 and 5 moult models fitted to the Siskin dataset without considering within-season recaptures converge at biased parameter estimates (red lines and polygons). When recapture information is included in the model, the resulting fits (green lines and polygons) better encapsulate the observed data.'}
m_summ <- bind_rows(moult_plot(uz3, plot = F) %>% mutate(model = 'ignore recaptures'),
                    moult_plot(uz3_recap, plot=F) %>% mutate(model = 'recaptures'))
#hacky way to reorder data to plot polygon
filter(m_summ, line_type == '95 % quantile') %>%
  tidyr::pivot_longer(start_date:end_date, values_to = 'x', names_to = 'se') %>%
  group_by(model,se) %>%
  mutate(y = ifelse(se == 'start_date', 0, 1), lci = ifelse(x == min(x), 1,0)) %>%
  arrange(model,x) -> m_summ_poly
m3_comp <-
  ggplot(filter(siskins, pfmg > 0), aes(x = yday, y = pfmg)) + 
  geom_point(alpha = 0.4) +
  geom_polygon(data = m_summ_poly[c(1, 2, 4, 3,
                                    5, 6, 8, 7), ], aes(x = x, y = y, fill = model), alpha = 0.3) +
  geom_segment(
    data = filter(m_summ, line_type == 'Population mean'),
    aes(
      x = start_date,
      y = 0,
      xend = end_date,
      yend = 1,
      col = model
    ),
    lwd = 1
  )  + theme_classic() + ggtitle('Type 3 model') + xlim(170,340)

m_summ <- bind_rows(moult_plot(uz5, plot=F) %>% mutate(model = 'ignore recaptures'),
                    moult_plot(uz5_recap, plot=F) %>% mutate(model = 'repeated measures model'))
#hacky way to reorder data to plot polygon
filter(m_summ, line_type == '95 % quantile') %>% tidyr::pivot_longer(start_date:end_date, values_to = 'x', names_to = 'se') %>% group_by(model,se) %>% mutate(y = ifelse(se == 'start_date', 0, 1), lci = ifelse(x == min(x), 1,0)) %>% arrange(model,x) -> m_summ_poly

m5_comp <- ggplot(filter(siskins), aes(x = yday, y = pfmg)) + 
  geom_polygon(data = m_summ_poly[c(1,2,4,3,
                                    5,6,8,7),],
               aes(x = x, y = y, fill = model), alpha = 0.3 ) + 
  geom_point(alpha = 0.4) +
  geom_segment(data = filter(m_summ, line_type == 'Population mean'), aes(x = start_date, y = 0, xend = end_date, yend = 1, col = model), lwd = 1)  + theme_classic() + ggtitle('Type 5 model')  + theme(legend.position = 'bottom') + xlim(170,340)

m5_legend <- cowplot::get_legend(m5_comp)
cowplot::plot_grid(cowplot::plot_grid(m3_comp+theme(legend.position='none')+ylab('PFMG')+xlab('Days since Jan 01'),
                   m5_comp+theme(legend.position='none')+ylab('PFMG')+xlab('Days since Jan 01')),
                                 m5_legend, nrow = 2, rel_heights =  c(1,.1))

```

```{r fig.width=7, fig.height=5, eval = TRUE}
compare_plot(uz3,uz3_recap,uz5,uz5_recap) + theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1), legend.position = 'bottom')

```

As these are simulated data we can also assess how well the recapture model estimates individual start dates, by extracting the random intercept estimates with the `ranef()` method and comparing them to the simulated data.


```{r ind-intercepts,eval = TRUE, fig.width = 7, fig.caption = 'There is good agreement between the simulated individual start dates and their estimates in both the type 3r and type 5R models.'}
ints <- ranef(uz3_recap) %>% 
  as_tibble(rownames = 'hexid') %>%
  left_join(siskins) %>%
  group_by(id) %>%
  summarise(est_start = unique(mean),
            lci_start = unique(`2.5%`),
            uci_start = unique(`97.5%`),
            true_start = unique(start_date),
            true_dur = unique(duration),
            n_total = n(),
            n_active = sum(pfmg != 0))

t3r_intercepts <- ggplot(ints,
                         aes(x = true_start - 196.83,
                             y = est_start,
                             ymin = lci_start,
                             ymax = uci_start,
                             fill = true_dur - 77.8)) +
  geom_pointrange(pch = 21) +
  scale_fill_gradient2() +
  theme_classic() +
  stat_smooth(method = 'lm') +
  ggtitle('T3R') +
  geom_abline(intercept = 0, slope = 1, lwd = 1, lty =2) +
  ylab('estimated start date') +
  xlab('true start date')

confint(lm(est_start ~ I(true_start - 196.83), data = ints))

ints <- ranef(uz5_recap) %>%
  as_tibble(rownames = 'hexid') %>%
  left_join(siskins) %>%
  group_by(id) %>%
  summarise(est_start = unique(mean),
            lci_start = unique(`2.5%`),
            uci_start = unique(`97.5%`),
            true_start = unique(start_date),
            true_dur = unique(duration),
            n_total = n(),
            n_active = sum(pfmg != 0))

t5r_intercepts <- ggplot(ints,
                         aes(x = true_start - 196.83,
                             y = est_start,
                             ymin = lci_start,
                             ymax = uci_start,
                             fill = true_dur - 77.8)) +
  geom_pointrange(pch = 21) +
  scale_fill_gradient2() +
  theme_classic() +
  stat_smooth(method = 'lm') +
  ggtitle('T5R') +
  geom_abline(intercept = 0, slope = 1, lwd = 1, lty =2) +
  ylab('estimated start date') +
  xlab('true start date')

confint(lm(est_start ~ I(true_start - 196.83), data = ints))

cowplot::plot_grid(t3r_intercepts + theme(legend.position = 'none'),
                   t5r_intercepts + theme(legend.position = 'none'))
```

We find that agreement between simulations and estimates is good, even though the model does not explicitly account for heterogeneity in individual moult durations.
For both models a regression of estimated versus true start dates has an intercept indistinguishable form 0 and a slope indistinguishable from 1.


# Mixed moult records

`moultmcmc` also allows to fit a mixture of categorical and continuous moult records, by combining the relevant likelihoods from the type 1 and type 2 models, as originally suggested in @underhill1988model.
This, so-called **Type 12** model can be employed to boost sample sizes and/or allow better balancing of moult and non-moult records [@bonnevie2010balancing] when multiple types of records are available.

We can simulate such a dataset by splitting the `sanderlings` dataset into two, and degrading the moult scores in one subsample to moult categories

```{r mixed-data, fig.width=7}
sanderlings <- moult::sanderlings

sanderlings$MCat[sanderlings$MIndex == 0] <- 1
sanderlings$MCat[sanderlings$MIndex == 1] <- 3
sanderlings$MCat[is.na(sanderlings$MCat)] <- 2
#introduce a couple of NAs
set.seed(1234)
sanderlings$keep_score <- sample(c(0,1), size = nrow(sanderlings), replace = TRUE)
sanderlings_scores <- subset(sanderlings, keep_score==1)
sanderlings_cat <- subset(sanderlings, keep_score==0)
sanderlings_cat$MCat[sanderlings_cat$MIndex == 0] <- 1
sanderlings_cat$MCat[sanderlings_cat$MIndex == 1] <- 3
sanderlings_cat$MCat[sanderlings_cat$MIndex != 1 & sanderlings_cat$MIndex != 0] <- 2
sanderlings_cat$MIndex <- NULL

cowplot::plot_grid(
  ggplot(sanderlings_scores, aes(x = Day, y = MIndex)) + geom_point() + theme_classic(),
  ggplot(sanderlings_cat, aes(x = Day, y = MCat)) + geom_point() + theme_classic()
)
```

The two datasets can then be pieced together again, and the function `consolidate_moult_records` is used to consolidate and recode moult score and moult category data.
```{r mixed-data-assembly}
sanderlings_combined <- bind_rows(sanderlings_scores, sanderlings_cat)
sanderlings_combined$MIndexCat <- consolidate_moult_records(sanderlings_combined$MIndex, sanderlings_combined$MCat)
```

We can then fit Type 1 and Type 2 models to each of the split datasets, as well as the Type 12 model for the recombined dataset, and for reference a Type 2 model to the original data.

```{r mixed-data-models, message=FALSE, warning=FALSE, results='hide'}
fit_scores <- moultmcmc(moult_column = "MIndex",
                 date_column = "Day",
                 data = sanderlings_scores,
                 type = 2,
                 log_lik = FALSE,
                 chains = 1)

fit_cat <- moultmcmc(moult_column = "MCat",
                 date_column = "Day",
                 data = sanderlings_cat,
                 type = 1,
                 log_lik = FALSE,
                 chains = 1)

fit_combined = moultmcmc(moult_column = "MIndexCat",
                 date_column = "Day",
                 data = sanderlings_combined,
                 type = 12,
                 log_lik = FALSE,
                 chains = 1)

fit_original = moultmcmc(moult_column = "MIndex",
                 date_column = "Day",
                 data = sanderlings,
                 type = 2,
                 log_lik = FALSE,
                 chains = 1)
```

```{r mixed-data-model-comparison, fig.width = 7}
compare_plot(fit_scores, fit_cat, fit_combined, fit_original) + theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))
```
The estimates from the combined model (T12) are closer to the estimates from the original dataset, than estimates from either partial dataset.

# References