--- title: 'Discrete trait analysis' author: "Kenneth B. Hoehn" date: '`r Sys.Date()`' output: pdf_document: dev: pdf fig_height: 4 fig_width: 7.5 highlight: pygments toc: yes html_document: fig_height: 4 fig_width: 7.5 highlight: pygments theme: readable toc: yes md_document: fig_height: 4 fig_width: 7.5 preserve_yaml: no toc: yes geometry: margin=1in fontsize: 11pt vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Discrete Trait Analysis} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} --- Dowser implements recently developed phylogenetic tests to characterize B cell migration, differentiation, and isotype switching. These are "discrete" traits because their values are not continuous. ## Discrete trait statistics There are two main statistics implemented by Dowser to characterize the distribution of trait values at the tips of trees. These all derive from using a maximum parsimony algorithm to reconstruct trait values at the internal nodes of each tree. Then "switches" between discrete characters are quantified along the tree topologies. 1. PS: Parsimony score, the total number of switches among all trait values. 2. SP: Switch proportion, the proportion of switches from one trait to another. The significance of each of these statistics is estimated using a permutation test, which randomizes trait values at the tree's tips. [Full published details on these methods are available here](https://doi.org/10.1371/journal.pcbi.1009885) ## Caveats and interpreting results Before using these tests, it's important to understand how the test works and what potential biases can affect the results. [These tests are detailed in full here](https://doi.org/10.1371/journal.pcbi.1009885).These tests summarize the distribution of trait values along trees. They don't directly test for cellular migration and differentiation. As such, it's important to understand potential caveats and explore alternative explanations. **What does a significant SP test mean?** A significant SP test from tissue A to tissue B indicates that there is a greater proportion of switches from A to B along the trees than expected by random tip/trait relationships. This can be due to B cells preferentially migrating from tissue A to B, but there are other possibilities. As explored in the original paper, a significant SP test from tissue A to B could be due to: 1. Biased migration from A to B. 2. Biased sampling, especially under-sampling tissue A. 3. Significantly higher mutation rate over time in the cells in B than in A, placing it further down the tree. 4. Extremely low rate of switching events (if no downsampling used, see below). Usually option #1 is most biologically interesting, but it is important to consider the other options as well. ## Setting up IgPhyML All of the switch count statistics on this page require IgPhyML. IgPhyML needs to be compiled and the location of the executable needs to be passed to the `findSwitches` function. To set up IgPhyML, please [visit the docs page](https://igphyml.readthedocs.io). This is true even if IgPhyML is not used to build the trees. If you're not using a Linux computer, your best option is likely to use the [Immcantation Docker container](https://igphyml.readthedocs.io/en/latest/install.html#docker-image-recommended). ## Set up data structures and trees This step proceeds as in tree building, but it is important to specify the column of the discrete trait you want to analyze in the `formatClones` step. In this example we are using simulated data from nose and lung biospies. However, this could be any discrete trait value such as cell types. Filtering out clones that contain only a single trait value type is not strictly necessary but can dramatically improve computing time. ```{r, eval=FALSE, warning=FALSE, message=FALSE} library(dowser) # load example AIRR tsv data data(ExampleAirr) # trait value of interest trait="biopsy" # Process example data using default settings clones = formatClones(ExampleAirr, traits=trait, num_fields="duplicate_count", minseq=3) # Column shows which biopsy the B cell was obtained from print(table(ExampleAirr[[trait]])) #Lung Nose # 145 244 # Calculate number of tissues sampled in tree tissue_types = unlist(lapply(clones$data, function(x) length(unique(x@data[[trait]])))) # Filter to multi-type trees clones = clones[tissue_types > 1,] # Build trees using maximum likelihood (can use alternative builds if desired) trees = getTrees(clones, build="pml") ``` ## Visualize maximum parsimony trait reconstruction Sometimes it can be useful to visualize the predicted state of each internal node in the tree using maximum parsimony. In Dowser, this can be accomplished by modifying the `getTrees` call to specify the `trait` value of interest, and the location of the `igphyml` executable. The internal node states can be plotted using plotTrees with `nodes=TRUE` and `tips` specifying the trait value. Edges in the tree are colored by the predicted state of their descendant (lower) node. ```{r, eval=FALSE, warning=FALSE, message=FALSE} # the location of the igphyml executable # this is location in Docker image, will likely be different if you've set it up yourself # note this is the location of the compiled executable file, not just the source folder igphyml_location = "/usr/local/share/igphyml/src/igphyml" # build trees as before, but use IgPhyML to reconstruct the states of internal # nodes using maximum parsimony trees = getTrees(clones, build="pml", trait=trait, igphyml=igphyml_location) # show internal node (edge) predictions based on maximum parsimony plotTrees(trees, tips=trait, nodes=TRUE, palette="Set1")[[1]] ``` ```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE} library(dowser) # Load data instead of running phylip data(BiopsyTrees) plotTrees(BiopsyTrees, tips="biopsy", nodes=TRUE, palette="Set1")[[1]] ``` ## Discrete trait analysis with fixed trees Once we've set up the tree objects, we can calculate the switches along these trees using a maximum parsimony algorithm implemented in IgPhyML. We only need to perform this computationally intensive step once. All discrete trait tests use the resulting object. Note the IgPhyML location must be properly configured for your setup. Note also that your results may differ slightly from the ones shown below, due to the stochasticity of this test. ```{r, eval=FALSE, warning=FALSE, message=FALSE} # the location of the igphyml executable # this is location in Docker image, will likely be different if you've set it up yourself # note this is the location of the compiled executable file, not just the source folder igphyml_location = "/usr/local/share/igphyml/src/igphyml" # calculate switches along trees compared to 100 random permutations # this may take a while, and can be parallelized using nproc switches = findSwitches(trees, permutations=100, trait=trait, igphyml=igphyml_location, fixtrees=TRUE) # Perform PS test on switches ps = testPS(switches$switches) print(ps$means) # A tibble: 1 x 6 # RECON PERMUTE PLT DELTA STAT REPS # #1 8 8.6 0.4 -0.6 PS 100 sp = testSP(switches$switches, alternative="greater") print(sp$means) # A tibble: 2 x 8 # Groups: FROM [2] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # #1 Lung Nose 0.131 0.323 1 -0.192 SP 100 #2 Nose Lung 0.869 0.677 0 0.192 SP 100 ``` **PS test** The object `ps` contains the results of the PS test. The `means` dataframe contains the test summary statistics, while the `reps` dataframe contains information from each repetition. `RECON` shows the number of switches across all trees, `PERMUTE` shows the mean number of switches across all permutations, and `DELTA` shows the mean difference between these two values. `PLT` is the P value that there are at least as many switches in the observed trees as in the permuted data. If `PLT < 0.05`, this indicates that there are signifcantly fewer switches among biopsies than expected by random tree/trait association. This indicates that sequences within the sample biopsies are more clustered together within the trees than expected. In this case, `PLT` is 0.4, so we can't draw that conclusion. **SP test** The object `sp` contains the results of the SP test. As with the PS test, the `means` dataframe contains summary statistics. The columns are largely the same as in the PS test, but the SP test is performed in a particular direction. In this case, we have an SP test from Lung to Nose and then from Nose to Lung. We can see that the SP value from Nose to Lung in our trees (`RECON`) is much higher than the mean SP statistic in our permuted trees (`PERMUTE`). `DELTA` is the mean difference between `RECON` and `PERMUTE` values, so a positive `DELTA` indicates greater SP values in observed trees than permuted trees. The p value in the `PGT` column (0) shows that the SP statistic from Nose to Lung is significant. This indicates that there are a greater proportion of switches from the Nose to the Lungs within these trees than expected from random tree/trait association. This is consistent with biased movement from Nose to Lungs in this dataset (but see [Caveats and interpreting results](#caveats-and-interpreting-results) section). ## Accounting for uncertainty in tree topology The previous analysis used fixed tree topologies, which will speed up calculations but does not account for uncertainty in tree topology. To account for this, be sure `fixtrees=FALSE` (the default option). For this option, the trees will be re-built for each permutation in the manner specified (same parameters as getTrees). ```{r, eval=FALSE, warning=FALSE, message=FALSE} # calculate switches along bootstrap distribution of trees # build using the 'pml' maximum likelihood option # in a real analysis it's important to use at least 100 permutations switches = findSwitches(trees, permutations=10, trait=trait, igphyml=igphyml_location, fixtrees=FALSE, build="pml") sp = testSP(switches$switches, alternative="greater") print(sp$means) # A tibble: 2 x 8 # Groups: FROM [2] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # #1 Lung Nose 0.168 0.358 1 -0.190 SP 10 #2 Nose Lung 0.832 0.642 0.1 0.190 SP 10 ``` ## Within and between lineage permutations In some cases it may be preferable to permute trait values among trees rather than within them. This will detect association between traits within a tree as well as directional relationships. In general it is harder to interpret. To perform this test, set `permuteAll=TRUE`. ```{r, eval=FALSE, warning=FALSE, message=FALSE} sp = testSP(switches$switches, alternative="greater", permuteAll=TRUE) print(sp$means) # A tibble: 2 x 8 # Groups: FROM [2] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # #1 Lung Nose 0.168 0.241 0.6 -0.0736 SP 10 #2 Nose Lung 0.832 0.759 0.4 0.0736 SP 10 ``` ## Controlling false positive rate through downsampling The SP test has been shown to have a high false positive rate if switching events are rare along very large trees (see paper). To reduce this effect, by default a downsampling algorithm in `findSwitches` will downsample all trees to have a maximum tip-to-switch ratio of 20. This ratio can be toggled by altering the `tip_switch` parameter. This feature can also be turned off by setting `downsample=FALSE`, but this is not recommended. ```{r, eval=FALSE, warning=FALSE, message=FALSE} # Downsample each tree to a tip-to-switch ratio of 10 instead of 20 # this will reduce the false positive rate but also (likely) power switches = findSwitches(trees, permutations=100, trait=trait, igphyml=igphyml_location, fixtrees=TRUE, tip_switch=10) # didn't have much effect for this dataset sp = testSP(switches$switches, alternative="greater") print(sp$means) # A tibble: 2 x 8 # Groups: FROM [2] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # #1 Lung Nose 0.168 0.358 1 -0.190 SP 10 #2 Nose Lung 0.832 0.642 0.1 0.190 SP 10 ``` ## Incorporating switching constraints Some processes, like Ig isotype switching, can only happen in a particular direction. Using regular parsimony to reconstruct these processes will likely end up predicting impossible switches. Instead, it's possible to constrain the maximum parsimony reconstruction to only happen in a particular direction. There are a couple of steps to do this. 1. Create a vector with the states in the order they occur in. 2. Create a model file with `makeModelFile`. 3. Repeat maximum parsimony analyses with IgPhyML specifying the model file. `makeModelFile` produces a physical file - you can open it with a text editor to inspect it. All runs of `findSwitches` create a model file, it's just normally not seen by the user and by default contains no constraints. ```{r, eval=FALSE, warning=FALSE, message=FALSE} # the location of the igphyml executable # this is location in Docker image, will likely be different if you've set it up yourself # note this is the location of the compiled executable file, not just the source folder igphyml_location = "/usr/local/share/igphyml/src/igphyml" # constant region column name trait = "c_call" # Vector of human isotypes in the proper order. Isotype switching # can only go from left to right (e.g. IGHM to IGHA1). One exception # is IGHD, which can switch to IGHM. isotypes = c("IGHM","IGHD","IGHG3","IGHG1","IGHA1","IGHG2", "IGHG4","IGHE","IGHA2") # Process example data using default settings with "c_call" as a trait value clones = formatClones(ExampleAirr, traits=trait, minseq=3) # Column shows which constant region associated with a BCR print(table(ExampleAirr[[trait]])) # IGHA1 IGHA2 IGHD IGHG1 IGHG2 IGHG3 IGHG4 IGHM # 55 56 11 58 64 60 63 22 # Calculate number of istoypes sampled in each tree isotype_counts = unlist(lapply(clones$data, function(x) length(unique(x@data[[trait]])))) # make model file with irreversibility constraints # Will prohibit switches from right to left in the "states" vector # IGHD to IGHM switching listed as an exception, since this can occur makeModelFile(file="isotype_model.txt", states=isotypes, constraints="irrev", exceptions=c("IGHD,IGHM")) # Build trees and predict states at internal nodes using maximum parsimony trees = getTrees(clones[isotype_counts > 1,], trait=trait, igphyml=igphyml_location, modelfile="isotype_model.txt") # show internal node (edge) predictions based on maximum parsimony plotTrees(trees, tips=trait, nodes=TRUE, palette="Paired", ambig="grey")[[1]] ``` ```{r, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE} data(IsotypeTrees) plotTrees(IsotypeTrees, tips="c_call", nodes=TRUE, palette="Paired", ambig="grey")[[1]] ``` Performing the SP test is the same as before, just specify the model file created earlier using the `modelfile` option. Note that no switches occur in directions that are forbidden by our model. If you try this without specifying the model file, you'll likely get many biologically impossible switches! ```{r, eval=FALSE, warning=FALSE, message=FALSE} # Downsample each tree to a tip-to-switch ratio of 10 instead of 20 # this will reduce the false positive rate but also (likely) power switches = findSwitches(trees, permutations=100, trait=trait, igphyml=igphyml_location, fixtrees=TRUE, tip_switch=10, modelfile="isotype_model.txt") # didn't have much effect for this dataset sp = testSP(switches$switches, alternative="greater") print(sp$means,n=42) # A tibble: 42 × 8 # Groups: FROM [7] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # # 1 IGHA1 IGHA2 0.0972 0.0994 0.56 -0.00219 SP 100 # 2 IGHA1 IGHD 0 0 1 0 SP 100 # 3 IGHA1 IGHG1 0 0 1 0 SP 100 # 4 IGHA1 IGHG2 0.00815 0.00833 0.58 -0.000184 SP 100 # 5 IGHA1 IGHG3 0 0 1 0 SP 100 # 6 IGHA1 IGHG4 0.00352 0.00409 0.59 -0.000575 SP 100 # 7 IGHA2 IGHA1 0 0 1 0 SP 100 # 8 IGHA2 IGHD 0 0 1 0 SP 100 # 9 IGHA2 IGHG1 0 0 1 0 SP 100 #10 IGHA2 IGHG2 0 0 1 0 SP 100 #11 IGHA2 IGHG3 0 0 1 0 SP 100 #12 IGHA2 IGHG4 0 0 1 0 SP 100 #13 IGHD IGHA1 0 0 1 0 SP 100 #14 IGHD IGHA2 0 0 1 0 SP 100 #15 IGHD IGHG1 0.0144 0.0127 0.42 0.00173 SP 100 #16 IGHD IGHG2 0.00926 0.00981 0.54 -0.000545 SP 100 #17 IGHD IGHG3 0.0157 0.0113 0.3 0.00440 SP 100 #18 IGHD IGHG4 0.00203 0.00271 0.58 -0.000681 SP 100 #19 IGHG1 IGHA1 0.00632 0.00757 0.59 -0.00126 SP 100 #20 IGHG1 IGHA2 0.00326 0.00419 0.51 -0.000932 SP 100 #21 IGHG1 IGHD 0 0 1 0 SP 100 #22 IGHG1 IGHG2 0.0417 0.0545 0.86 -0.0127 SP 100 #23 IGHG1 IGHG3 0 0 1 0 SP 100 #24 IGHG1 IGHG4 0.0451 0.0493 0.66 -0.00427 SP 100 #25 IGHG2 IGHA1 0 0 1 0 SP 100 #26 IGHG2 IGHA2 0.00542 0.00511 0.45 0.000316 SP 100 #27 IGHG2 IGHD 0 0 1 0 SP 100 #28 IGHG2 IGHG1 0 0 1 0 SP 100 #29 IGHG2 IGHG3 0 0 1 0 SP 100 #30 IGHG2 IGHG4 0.0715 0.0581 0.08 0.0133 SP 100 #31 IGHG3 IGHA1 0.0716 0.0600 0.16 0.0116 SP 100 #32 IGHG3 IGHA2 0.0508 0.0492 0.37 0.00154 SP 100 #33 IGHG3 IGHD 0 0 1 0 SP 100 #34 IGHG3 IGHG1 0.159 0.181 0.96 -0.0225 SP 100 #35 IGHG3 IGHG2 0.228 0.203 0.05 0.0250 SP 100 #36 IGHG3 IGHG4 0.161 0.175 0.8 -0.0137 SP 100 #37 IGHG4 IGHA1 0 0 1 0 SP 100 #38 IGHG4 IGHA2 0.00596 0.00437 0.27 0.00159 SP 100 #39 IGHG4 IGHD 0 0 1 0 SP 100 #40 IGHG4 IGHG1 0 0 1 0 SP 100 #41 IGHG4 IGHG2 0 0 1 0 SP 100 #42 IGHG4 IGHG3 0 0 1 0 SP 100 ``` If you're interested only in switches to a particular tissue or isotype, you can specify this using the `to` option in the `testSP` function. Here, we only look at the proportion of switches that go to IGHA2: ```{r, eval=FALSE, warning=FALSE, message=FALSE} sp = testSP(switches$switches, alternative="greater", to="IGHA2") print(sp$means) # A tibble: 8 × 8 # Groups: FROM [8] # FROM TO RECON PERMUTE PGT DELTA STAT REPS # #1 IGHA1 IGHA2 0.620 0.634 0.59 -0.0144 SP 100 #2 IGHD IGHA2 0 0 1 0 SP 100 #3 IGHE IGHA2 0 0 1 0 SP 100 #4 IGHG1 IGHA2 0.0116 0.0292 0.85 -0.0176 SP 100 #5 IGHG2 IGHA2 0.0371 0.0252 0.32 0.0119 SP 100 #6 IGHG3 IGHA2 0.297 0.283 0.45 0.0144 SP 100 #7 IGHG4 IGHA2 0.0346 0.0290 0.36 0.00563 SP 100 #8 IGHM IGHA2 0 0 1 0 SP 100 ```