In this exercise we’ll use TransPhylo to reconstruct transmission trees, first for a simulated outbreak and then for your TB and/or measles data.
1. Introduction: Set up
- Load up the packages we will need for this tutorial and set the seed (so that if you repeat the analysis you’ll get the same output).
2. Introduction: Simulating Outbreaks in TransPhylo
First, we’ll use TransPhylo’s outbreak simulator to simulate a small outbreak. In the simulation we will know exactly who infected whom - providing a ‘ground truth’ to compare our transmission reconstruction against.
- Define values for the 5 input parameters required to simulate outbreaks in Transphylo. You can choose the same values as below or pick your own. The parameters are:
Neg (
neg): the effective within-host population size multiplied by the generation timeoff.r: the basic reproduction number (used as the mean of the negative binomial offspring distribution)w.meanandw.std: the mean and standard deviation of the gamma distributed generation time. This will also be used for the sampling time distribution unless we definews.meanandws.std. Alternately, we can define these distributions through aw.shapeandw.scale.pi: the density of sampling.
neg=100/365 # about a third of a year
off.r=5 # R0=5 secondary cases on average,
w.mean=1 # 1 year
w.std=0.3
pi=0.25 # ~1/4 cases are sampled
# These parameter values are appropriate for a disease like TB in which people are infectious for a long time. For something like influenza or SARS-CoV-2 you would want a much shorter generation time and Neg, on a scale of days. - Simulate and plot an outbreak with your chosen parameters that begins in 2015 and stops being observed in 2018.
simu <- simulateOutbreak(neg=neg,pi=pi,off.r=off.r,w.mean=w.mean,
w.std=w.std,dateStartOutbreak=2015,dateT=2018)
simu## Colored phylogenetic tree
## Number of sampled individuals=8
## Total number of hosts=15
If you’ve chosen your own parameter values or used a different seed, you may want to keep simulating or tweak your parameters until you get an outbreak with say 5-20 sampled cases (enough to work with, but not so many that it will be hard to interpret the results or take long to run the analysis).
The main result of the outbreak simulation is a coloured-in phylogeny as above. This shows the evolution of the pathogen through time, as in a regular timed tree, and paints on colours to show transmission from host-to-host.
Each host is represented by a unique colour - for example host 6 is a shade of green. We can trace back that host 6 was infected by an unsampled host (purple) in late 2016 (infections are shown with a red star). We know that 6’s infector was unsampled because there is no tip associated with the purple colour. Moving back in time, the unsampled case was infected by host 1 (red) in mid 2016. Host 1 was infected in mid 2015 by another unsampled host (pink) who is the root of this outbreak. As well as host 6, four further cases were infected by host 1: host 3 plus 3 unsampled cases.
- Extract and plot the simulated transmission tree, separately from the phylogeny. There are 2 available ways to plot this:
The first plot shows just the times of transmission events between hosts (black = sampled, white = unsampled), without any phylogeny branching events.
The second plot includes more detail. Each host (sampled and unsampled) gets a row. Horizontal lines display when that host was infectious (with lowered opacity showing uncertainty around the start and end of their infectious period). Arrows between hosts represent transmission and red dots represent a host being sampled.
- Extract and plot the simulated phylogeny, separately from the transmission tree. This should look just like the initial figure, but without the colours representing transmission or any indication of unsampled hosts.
- Convert your simulated phylogeny into
ape’sphyloformat.
## [1] "phylo"
The phylogeny should look exactly the same in phylo
format. Shortly, we’re going to feed this simulated phylogeny (without
any of the transmission information) into TransPhylo.
- Save your phylogenetic tree as a Newick tree file for safekeeping.
An important note: TransPhylo measures time relative to the date of the last sample, not in absolute calendar time. When a phylogeny is saved, absolute time information is lost and only relative distances between samples are retained. We therefore need to record the date of the last sample, so that tip and branching dates can be back-calculated when needed
- Allocate a variable to the date of the last sample.
## [1] 2017.993
3. Introduction: Reconstructing the Simulated Outbreak
Now we’re ready to use TransPhylo to infer who infected whom in the simulated outbreak.
We’ll start with our phylogeny p, which you’ll recall we
removed all transmission information from.
Unlike Outbreaker, which inferred transmission directly from genomes and sampling dates, TransPhylo requires an input phylogeny. So usually a first step will be using your favourite tree building software to build a phylogeny from your genomic data. In this tutorial we skip the tree building step because we already have one for our simulated outbreak.
A recent-ish development to TransPhylo is that you can now input a set of likely phylogenies rather than having to pick just one. This lets you build phylogenetic uncertainty into the analysis. We won’t be covering that in this tutorial, but feel free to ask us about it if you are interested.
As mentioned in the previous section, this phylogeny p
is dated relatively, not absolutely. We need to tell TransPhylo when the
final sample was taken, and it will back-calculate all other events.
- Re-attach the date of the last sample to your phylogeny
Although this seems a little circular, because you already had the
object ptree in section 1 above, remember that in a regular
(unsimulated) analysis you would be beginning your analysis right
here.
TransPhylo needs 4 things for transmission inference: a timed phylogeny (check!), an assumed generation time distribution, the date at which observation of the outbreak stopped, and the number of iterations to run the MCMC algorithm for.
In the simulation, we used a generation time with mean 1 year and standard deviation 0.3 years. We’ll pass those true parameters to TransPhylo here, so that it has the best chance of reconstructing the outbreak correctly. In a real-data scenario you won’t know the true generation time of course, so you should use something sensible for your pathogen under study (perhaps, look up some litrature estimates).
Similarly for the time outbreak observation ended: in our simulated
outbreak we used dateT=2018 so we can input that here. For
real data, you should use whatever date you stopped observing your
outbreak of interest. If you’re doing a real-time investigation, that
date might be today. Inputting a dateT allows TransPhylo to
account for the fact that individuals who became infected just before T
have a lower probability of being sampled. If you’re absolutely
confident that your outbreak was over upon the last sampled case in your
data set, you can set this date to Inf (infinity).
- Define the remaining parameters needed by TransPhylo.
Now we are ready to run our transmission inference.
- Run TransPhylo for your simulated outbreak. Try using just 1000 MCMC iterations for now, so you don’t have to wait long.
4. Introduction: Checking MCMC Convergence and Mixing
Just like in the outbreaker analysis, we need to check how well our MCMC performed.
- Generate trace plots of your MCMC output
Again like outbreaker, we need these trace plots to look like stable
fuzzy caterpillars in order for our results to be trustworthy. For our
chosen input parameters, they are not. This implies that we should rerun
inferTTree with a higher number of
mcmcIterations. This is expected, as we only used 1000.
Another approach for assessing MCMC mixing uses the CODA
package to calculate the ‘effective sample size’ (ESS) of our MCMC
outputs. In short, the effective sample size tells us how many
independent samples of each parameter were obtained. This will be less
than the actual number of samples, because successive steps in the MCMC
are actually autocorrelated.
- Calculate the ESS for each parameter in TransPhylo
## pi neg off.r off.p
## 11.944439 2.719254 9.260364 0.000000
A general guideline is that the ESS of each parameter should be at
least 100. Clearly we have not achieved that here!
Note: at default settings, TransPhylo does not estimate
the parameter off.p, which is why it is has ESS 0. So we
can ignore that one for now.
- Re-run TransPhylo with more iterations until you achieve good MCMC mixing (or until you get bored and decide to move on anyway 🙂)
## Result from TransPhylo analysis
## pi=4.51e-01 [1.45e-01;8.55e-01]
## neg=1.34e+00 [7.46e-02;4.08e+00]
## off.r=3.54e+00 [1.23e+00;6.53e+00]
## off.p=5.00e-01 [5.00e-01;5.00e-01]
## pi neg off.r off.p
## 56.14124 35.62324 24.88185 0.00000
Looks like 20,000 is better but still not enough. In a real analysis we would probably want to run at least 100,000, but since this is just an example let’s move on.
5. Introduction: Visualizing the Results
- From your MCMC samples in
res, plot the most representative (medoid) coloured-in phylogeny and corresponding transmission tree. Depending on the number of MCMC iterations you used to generateres, this might take a while to run. If you find it is taking too long, feel free to return to usingmcmcIterations=1000or similar.
# Extract and plot the transmission tree
ttree=extractTTree(med)
plot(ttree,type='detailed',w.shape = (w.mean/w.std)^2, w.scale = (w.std^2)/w.mean)As in outbreaker, it is possible to define a burn-in period in
TransPhylo that removes the unstable period at the beginning of the MCMC
chain. This is controlled with the burnin input to
medTTree. For example, defining burnin=0.4
will discard the first 40% of samples as burn-in.
- Discard the first 20% of samples as a burn-in, and re-plot the medoid tree
Since we simulated this outbreak, we know the true transmission tree.
- Compare the simulated and reconstructed transmission trees by plotting them side-by-side. What are the similarities and differences? Did we do a good job reconstructing the outbreak?
par(mfrow=c(1,2),oma = c(0, 0, 2, 0))
plot(simu)
plot(med)
mtext("Comparison of true tree (left) and medoid tree (right)", outer = TRUE, cex = 1.5)See the ‘Extension tasks’ section at the end of the worksheet for more options to compare these trees.
In a real outbreak we wouldn’t have the true tree to compare against of course, but we could at this point dive into the details of who infected whom, to see if any of our transmission pairs are especially meaningful. E.g., with sufficient data, did any pairs live, work or socialize together? Did the most ‘infectious’ hosts have anything in common? Were they young, old, very social, in high-contact employment? When in their infectious period did most hosts infect others? …
In addition to the transmission tree, we are able to make a variety of plots exploring other aspects of the analysis.
- Plot a matrix of the probability of direct transmission between all pairs of individuals in the outbreak
From this plot we see high certainty in host 1’s status as a major infection source. The highest probability pair is host 1 → host 6, observed in about 70% of trees. Sometimes (about 15% of the time), we incorrectly identify that host 3 infected host 7. We usually correctly identify that cases 2 and 8 were not infected by any other sampled host, rather by an unsampled case.
- Plot a similar matrix showing how many intermediates there were inferred to be in the transmission chain between each pair of individuals
The diagonal of this matrix will always be 0, since hosts cannot infect themselves. We can intepret this plot as e.g. “host 1 and host 8 are predicted to be separated by 3 intermediates”. This is consistent with the true tree: hosts 1 and 8 are indeed separated by 3 intermediates (all unsampled, shown in pink, dark pink, and blue).
- Make plots of (a) incident sampled and unsampled cases over time, (b) the distribution of estimated generation times, and (c) the distribution of estimated sampling times.
Most unsampled cases are seen during early 2016, when the phylogeny was most rapidly branching.
The generation and sampling distributions are fairly smooth and gamma distribution-like. Challenge: can you adapt this plot to check how well your estimates match the distribution you simulated from? What do you find?
- Plot the distribution of inferred infection time and the offspring distribution, for hosts 1 and 3.
The model thinks that host 1 was infected during 2015 and host 3 during 2017 or just prior. Consistent with our other plots, host 1 is identified as a major infector, with 4 or 5 offpsring. Host 3 is identified as mostly infecting no one else.
You could also make these plots for any other hosts if you choose.
- Lastly, save your TransPhylo output to file.
6. Your own analysis: TB and Measles
Set up and run a TransPhylo analysis for either the TB or measles dataset.
We suggest continuing to work with whichever dataset you chose in Lab 1. You’ll build up a set of reconstructed trees across methods that you can compare. Remember to save your completed transmission tree to file for later!
TransPhylo requires a time-calibrated phylogeny as
input i.e. a phylogeny where the ‘x-axis’ is time. We have provided a
timed phylogeny for both datasets, in the respective data file you
downloaded for Lab 1 (roetzer2013_datedtree.nex and
OWAmeasles2024_datedtree.nex).Read nexus (.nex) files into R using the
command mytree <- read.nexus(file = "filenamehere.nex")
from the ape package.
Full details of how the Roetzer phylogeny was built using BEAST software are available in the following paper, introducing TransPhylo: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850352/. Thank you to Renny Doig at SFU for building the measles phylogeny for this course.
Hints, tips and common issues:
Dates: Recall that TransPhylo requires the date that the last sample was collected,
dateLastSample, to build theptreephylogeny object. You can extract that from the txt file of sampling dates.You will also need the date that observation ceased,
dateT. We advise settingdateTto several weeks after the last sample in your dataset. In reality, it was probably more like days. However, if we set it as such, TransPhylo will often place many unsampled cases close to the tips in our tree. If you are sure the outbreak was over you can setdateT = Inf- but that is not the case here.As in outbreaker, getting the correct time scale can be one of the trickiest parts. Here we recommend years for the TB analysis and days for the measles analysis. TransPhylo requires
dateTanddateLastSampleentered as decimal numbers. E.g. January 2nd 2001 = 2001.0027 (2001 and 1/365th) in years. You can create this from a date object using the decimal_date() function in the lubridate package e.g.decimal_date(as.Date("2001-01-02"))= 2001.003.For the TB data, do exactly as above using
decimal_date(enter-here-the-last-date-in-the-dataset)For the measles data, instead of converting years to days, you could note that all we really care about is the length of time the outbreak occurred over, not the actual dates. Looking at the sample dates, we can see that the earliest sample was collected 43 days before the final one. Therefore, if we set
dateT = 42, we effectively label the first sample date as ‘day 0’.
Distributions: You can use the same generation time/sampling time distributions as Lab 1. These are:
TB: gen. time ~ gamma(mean = 4.3, sd = 3.8) years, and samp. time ~ (mean = 2.75, sd = 2.6) years
Measles: gen. time ~ gamma(mean = 11 days, sd = 1.5) days, and samp. time ~ gamma(mean = 11 days, sd = 1.5) days
General: When we input phylogenies to TransPhylo, it’s good practice to ensure they have no negative branch lengths or multifurcations (a node that splits into 3+ branches, forming a non-binary tree). Before creating your
ptreefrom aphyloobject namedphy, say, you can do this with:
phy <- multi2di(phy) # remove multifurcations
phy$edge.length <- pmax(phy$edge.length,1/365) # make sure all branch lengths are at least 1 day long (if our time scale is years)- MCMC: You might need many more iterations for the real datasets than the 1000 we started with in the tutorial. Inspect the MCMC trace plots and calculate the ESS as in the tutorial to test for convergence. More iterations → better results, but slower to run! Of course, for the purposes of this exercise you don’t want to be waiting hours for code to run, so feel free to continue on to the results analysis even if you don’t get good MCMC mixing.
Extension tasks and additional reading
A. In this tutorial we visually compared our inferred transmission tree to the true simulated tree:
par(mfrow=c(1,2),oma = c(0, 0, 2, 0))
plot(simu)
plot(med)
mtext("Comparison of true tree (left) and medoid tree (right)", outer = TRUE, cex = 1.5)We can also compare trees with various numeric measures. Note that the main differences in our trees come from the location of the red stars, who infected whom, and the number/location of unsampled individuals. Can you come up with some ways to measure the difference between transmission trees?
Here are some suggestions you could explore and code up:
What is the difference in the number of unsampled intermediates between true tree and inferred tree?
How well did we infer the location of the unsampled intermediates? E.g. are they all clustered in one area or spread evenly throughout the tree.
How well did we infer the infection times of the sampled individuals?
For what proportion of cases did we infer the right infector?
For what proportion of cases did their top 3 infectors include the true infector?
To answer these questions, look at the help pages for the TransPhylo
functions getInfectionTimeDist(),
getGenerationTimeDist(),
getSamplingTimeDist(), and getIncidentCases(),
which help to extract information from your MCMC output.
B. Load up the transmission networks you inferred with outbreaker2 for the TB or measles dataset. How different are they to those you inferred here in TransPhylo? Can you think of reasons for any differences?
This is a topic we’ll return to in Lab 4.
C. Try visualising your inferred transmission tree(s) using VisNetwork - an R network visualisation package. Some sample code is available in the transphylo_extras.R script on the module webpage.
D. If you’re interested to learn about simultaneous inference of multiple transmission trees from a set of phylogenetic trees in TransPhylo, you can try out this tutorial.