3. Investigating and Creating Copy-Number Signatures • utanos

Pre-amble

This vignette demonstrates how to use copy-number (CN) abnormalities data, which could represent the imprints of distinct mutational processes, to either build new CN-Signatures or determine which existing signatures are present in a dataset. This workflow is based on one described in the methods section of a paper published in 2018 (Macintyre et al. 2018).

Most of the published literature thus far has focused on CN-Signatures from absolute copy-number data. That is to say data where the sample ploidy is known. However, CN-Signatures can be extracted from both relative and absolute CN data, and so both cases will be covered here.

First, this vignette will cover finding signature exposures in a new dataset using existing signatures (section III). Second, the creation of new signatures will be covered (section IV). In either case, to begin, a set of samples with segmented copy-number profiles are needed. These profiles should be either formatted as a list of named dataframes or exist within a QDNAseq object.

I - Input data

In earlier vignettes it sufficed to use just a few samples to demonstrate the usage of utanos functions. However, in order to generate new cn-signatures, considerably more data is needed. Therefore, to demonstrate both signature exposure calling and the creation of new signatures, the Britroc HGSOC (High-Grade Serous Ovarian Carcinoma) cohort will be used (Macintyre et al. 2018). The segmented copy-number data can be found here in the paper’s bitbucket repository.

So step one, download said data/clone the repo, and read in the absolute copy-numbers:

> library(utanos)
> library(dplyr)
> acn.obj <- readRDS("~/repos/cnsignatures/manuscript_Rmarkdown/data/britroc_absolute_copynumber.rds")
> acn.obj
QDNAseqCopyNumbers (storageMode: lockedEnvironment)
assayData: 103199 features, 280 samples 
  element names: copynumber, segmented 
protocolData: none
phenoData
  sampleNames: IM_100 IM_101 ... JBLAB-4282 (280 total)
  varLabels: name total.reads ... mapTP53cn (10 total)
  varMetadata: labelDescription
featureData
  featureNames: 1:1-30000 1:30001-60000 ... Y:59370001-59373566 (103199 total)
  fvarLabels: chromosome start ... use (9 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

For this vignette we will use just the higher-quality samples (2 and 3 star in the context of that paper). A list is available in utanosmodellingdata.

britroc_samples_subset <- data.table::fread("~/repos/utanosmodellingdata/sample_sets/brit117_samples.csv")
acn.obj <- acn.obj[,britroc_samples_subset$x]

II - Extract CN-features

Step 2: To model copy-number aberrations in the genome, find trends and commonalities (what we will call signatures henceforth), specific features within the dataset are pulled out and modeled separately. We can extract copy number features using either the ExtractCopyNumberFeatures() or ExtractRelativeCopyNumberFeatures() functions.

II.1 - Relative Copy-Numbers

Section coming soon!

II.2 - Relative Copy-Numbers

Standard features:

Absolute copy-numbers at each genomic bin (copynumbers)
The magnitude of each copy number change (changepoints)
The genomic length (in bp) of each segment (segsizes)
The number of breakpoints per chromosome arm (brchrmarm)
The number of breakpoints per 10 megabases (bp10MB)
The genomic length (in bp) of oscillating copy number regions (osCNs)

Optional features:

Minimum number of chromosomes (a count) needed to account for 50% of CN changes in a sample (nc50)
Distance in base pairs of each breakpoint to the centromere (cdist)

Both of the two CN-Signature sets included in utanosmodellingdata (from p53abn endometrial and HGSOC) use the standard features. For a few hundred samples, this function should run within a few minutes.

cn_features <- ExtractCopyNumberFeatures(acn.obj, 
                                         genome = "hg19", 
                                         extra_features = FALSE)

III - Determine CN-signature exposures

The HGSOC CN-Signatures included in the utanosmodellingdata repository were created using 91 samples (the highest quality ones). To find exposures for the remaining samples in the group, subset the original object by those samples not used for signature creation, and calculate exposures.

The CallSignatureExposures() function in utanos runs several computational steps in the appropriate order for convenience. A user can run each of these separately should they want. For details see the docs for ExtractCopyNumberFeatures()/ExtractRelativeCopyNumberFeatures(), GenerateSampleByComponentMatrix(), and QuantifySignatures().

hgsoc_sigs <- readRDS(file = "~/repos/utanosmodellingdata/signatures/30kb_ovarian/component_by_signature_britroc_aCNs.rds")
acn.obj.subset <- acn.obj[,!colnames(acn.obj) %in% colnames(hgsoc_sigs@consensus)]
sigexposures <- CallSignatureExposures(copy_numbers_input = acn.obj.subset, 
                                       component_models = "~/repos/utanosmodellingdata/component_models/30kb_ovarian/component_models_britroc_aCNs.rds", 
                                       signatures = "~/repos/utanosmodellingdata/signatures/30kb_ovarian/component_by_signature_britroc_aCNs.rds",
                                       refgenome = "hg19", 
                                       relativeCN_data = FALSE)

The exposures can be visualized with the SignatureExposuresPlot() function like so:

ggp1 <- SignatureExposuresPlot(sigexposures)
ggp1$plot

Alongside the plot, the SignatureExposuresPlot() function returns the sample ordering in the plot. Given that these signatures are from the Macintyre et al. 2018 publication, let’s re-order to match the signature order from the paper for easy comparison:

reord_britroc <- as.integer(c(2,6,5,4,7,3,1))
ggp1 <- SignatureExposuresPlot(sigexposures[reord_britroc,])
ggp1$plot

Interpreting this plot, it is clear the sample subset has higher exposures for signatures 1 and 4. Referencing (Macintyre et al. 2018), these signatures correspond to poor overall survival + breakage-fusion bridge events and whole-genome duplication + PI3K/AKT signalling inactivation respectively.

IV - Create New CN-Signatures

Creating copy-number signatures anew involves a few more steps than directly determining exposures from existing signatures. This process can also be iterative where given some metrics, earlier steps are re-run with different parameters/inputs. This vignette will be using the Britroc cohort samples to demonstrate signature creation but will not be seeking to re-create the signatures found in that 2018 paper exactly. For an exact replication, the user to will need to set all parameters exactly as seen in the paper repository and run a few steps without the utanos wrappers.
To reiterate, the modelling and signatures seen below will not match (Macintyre et al. 2018). This is expected.

IV.1 - Modelling

First, the CN features discussed in section II are modeled using mixtures of gaussians or poissons with the help of the flexmix R-package (Grün and Leisch 2008). Utanos provides a wrapper called FitMixtureModels(). Then, given these mixture models for each CN feature, all the samples used for modelling are compared to each component of each model and a sum-of-posterior probabilities matrix is calculated. The GenerateSampleByComponentMatrix() utanos function calculates these posteriors. In more detail: For each copy-number event for a feature in each sample, compute the posterior probability of belonging to a component for said feature. For each sample, sum these posterior event vectors to a sum-of-posterior probabilities vector. Combine the sum-of-posterior vectors into a patient-by-component sum-of-posterior probabilities matrix.

Finally, given the patient-by-component sum-of-posterior probabilities matrix, matrix decomposition is run many times to decide on an optimal rank for the two output matrices (sample-by-signature matrix and a signature-by-component matrix). The optimal rank is the optimal number of signatures. The NMF R-package is leveraged for matrix decomposition using the ChooseNumberSignatures() function (Gaujoux and Seoighe 2010).

Succinctly, the modelling stage involves…

Fitting mixture components.
Calculating a patient-by-component sum-of-posterior probabilities matrix.
Searching for the optimal number of signatures (interval of 3-12 is used by default), running NMF many times (ex. 1000x) with different random seeds for each signature number.

Note on 2. The returned sample_by_component matrix may have the components in a different order than the components in the component_models object. The components are not in increasing order in the component_models object.

Note on 3. Estimating the optimal number of signatures is quite computationally heavy. This function takes the longest to run of any in the utanos package. It might make sense to run this function and then get a snack!

# Model using just the highest quality samples
acn.obj.high <- acn.obj[,colnames(acn.obj) %in% colnames(hgsoc_sigs@consensus)]
cnfeatures_high <- ExtractCopyNumberFeatures(acn.obj.high, 
                                             genome = "hg19", 
                                             extra_features = FALSE)

component_models <- FitMixtureModels(cnfeatures_high)
sample_by_component <- GenerateSampleByComponentMatrix(CN_features, component_models)

# The next function contains a method that MUST have NMF loaded, this is a known issue with the NMF package
library(NMF)
rank_estimate_outputs <- ChooseNumberSignatures(sample_by_component)
NMF::plot(rank_estimate_outputs$estrank, rank_estimate_outputs$randomized_estrank,
          what = c("cophenetic", "dispersion", "sparseness", "silhouette"),
          xname = "Observed", yname = "Randomised", main = "")

This dot and line plot contains several metrics from the NMF runs over possible ranks that aid in the selection of an optimal number of signatures.

Brief heuristics for what to look for in the values of each metric:

Metric	Description	Interpretation
cophenetic	after the trend begins increasing choose the max value before it decreases again.	higher => better
dispersion	measures the reproducibility	higher => better
silhouette	measures consistency between clusters	higher => better
sparseness	choose the maximum sparsity which can be achieved without that for the corresponding randomly permuted matrix exceeding it	higher => better

For detailed discussions about each of these metrics consult the NMF package documentation or other sources online.

The utanos package defaults for the ChooseNumberSignatures() function include running NMF 100 times for each rank. This was to keep the default run-time manageable. Package users however are encouraged to experiment with more iterations and other run parameter values. The (Macintyre et al. 2018) paper for instance used 1000 iterations in estimating the rank (this and a few other parameters are the reason this plot differs from the one in the paper bitbucket repository).

IV.2 - Evaluation

At this point it is a good idea to stop and examine the modelling done thus far.
Some points to consider:

If the metrics provided above for the input matrices do not appear to be outperforming the randomized matrices, you may wish to set more restrictive demands on what samples can be used as input data. The ability to detect copy number signatures depends heavily on whether each pattern is sufficiently represented in the data, BUT, this signal can be obfuscated with lower quality samples that introduce noise. Finding a balance of quantity vs. quality is important. The largest number of high-quality samples possible used for signature creation (but no more) is ideal.
By default utanos suggests using 6 CN features for modelling, however it may be desirable to use more/others. There are two more that can easily be used by setting the extra_features parameter of ExtractCopyNumberFeatures() to TRUE.
Examine the sample-by-component matrix as a heatmap to see if there are major contributions from many different samples to each component. If the sample-by-component matrix does not indicate higher values for the posteriors for at least several components per sample it could be the case that the modelling is not appropriately capturing the heterogeneity in the data.
Examine the consensus matrices (the average connectivity matrix across NMF runs) in heatmap form alongside the silhouette track (or others) to see how samples cluster together. Do clear clusters emerge?

Visualize the sample-by-component matrix:

NMF::aheatmap(sample_by_component, fontsize = 7, 
              Rowv=FALSE, Colv=FALSE, cexRow = 0, legend = FALSE,
              breaks=c(seq(0,199,2),500), 
              main="Component x Sample matrix")

And the consensus matrices across 4 ranks (5,6,7, and 8):

NMF::consensusmap(rank_estimate$fit[c(3:6)], main='Consensus matrix - Cluster stability', treeheight = 0, labCol=NA, labRow=NA, tracks = c("silhouette:"))

Note that the map corresponding to ranks 6 and 7 look pretty good, the same samples tend to cluster together across runs. Samples tend to be either dark red or dark blue rather than a lighter shade somewhere in between. For more details on connectivity/consensus matrices see the docs for NMF::connectivity().

If content with the modelling and a clear factorization rank (number of signatures) emerges, move on to the next section and create a signature set.

IV.3 - Create CN-Signatures

The GenerateSignatures() function is again running NMF many times (but not over multiple ranks) and therefore takes the second-longest to run of any function in the utanos package. It might make sense to start this function and then go get a snack!

signatures <- GenerateSignatures(sample_by_component, 7)

# Usually worth stopping and saving the objects just created since re-running can take a while. 
# saveRDS(signatures, file = paste0("path_to_signatures/signatures_object.rds"))
# saveRDS(component_models, file = paste0("path_to_signatures/component_models_object.rds"))

The GenerateSignatures() function is very simple, wrapping just a single NMF call. It provides some default parameter values (ex. sets seed = 77777, and the nmf algorithm used to “brunet”) but should the user which to test many different options use the NMF::nmf() function directly.

As a final step to creating new signatures, visualize them using the NMF package’s various heatmap options:

signature-by-component matrix as a heatmap (what we call a ‘signature set’ or ‘signatures’)
sample-by-sample consensus matrix as a heatmap (same as above, just for one rank)
sample-by-signature matrix as a heatmap (contribution of each sample to each signature)

basismap(signatures, Rowv = NA, main = "Signature x Component matrix")
consensusmap(signatures, main='Consensus matrix - Cluster stability', 
             treeheight = 0, labRow = NA, labCol = NA)
coefmap(signatures, Colv="consensus", tracks=c("basis:"), 
        main="Sample x Signature matrix", treeheight = 0)

Section Note

For the modelling section utanos provides a number of wrapping functions for flexmix and NMF for convenience. However, it’s not unrealistic that the user may want more flexibility than these wrappers allow. In that case it is quite straightforward to grab the utanos source code from github out of these functions and run each step individually with more control.

V - Compare signature sets

Utanos includes 3 methods by which to compare between different signature sets.

Using the wasserstein distance (as a heatmap) to compare mixture components for two CN-features.
Compare mixture components for two CN-features using shaded line plots
Alluvial flow or Sankey plots to visulalize the difference in how samples will cluster given their exposures to two signature sets. By default the samples are assigned the signature for which they have the maximum exposure.

For this section, comparisons will be made between the signatures just created above, the signatures from the 2018 MacIntyre paper (HGSOC), and the 2024 Jamieson paper (p53abn EC).

cm_ex = example component models made in this vignette
sigs_ex = example signatures made in this vignette
cm_hgsoc = Component models from the 2018 Macintyre paper
sigs_hgsoc = Signatures from the 2018 Macintyre paper
cm_p53ec = Components models created using samples from the 2024 Jamieson paper
sigs_p53ec = Signatures from the 2024 Jamieson paper

cm_ex <- component_models
sigs_ex <- signatures
cm_hgsoc <- readRDS("~/repos/utanosmodellingdata/component_models/30kb_ovarian/component_models_britroc_aCNs.rds")
sigs_hgsoc <- readRDS("~/repos/utanosmodellingdata/signatures/30kb_ovarian/component_by_signature_britroc_aCNs.rds")
cm_p53ec <- readRDS("~/repos/utanosmodellingdata/component_models/30kb_endometrial/component_models_vancouver_aCNs.rds")
sigs_p53ec <- readRDS("~/repos/utanosmodellingdata/signatures/30kb_endometrial/component_by_signature_britmodelsvansigs5_aCNs.rds")

V.1 - Earth-movers distance

One-to-one sequential comparisons of each component composing two different signatures is difficult. There is no guarantee that a signature set run with one set of parameters or with one input dataset will have the same number of components as with another another set of parameters or input dataset. Therefore, comparisons that are intrinsically many-to-many are a better option.

Utanos provides a function, WassDistancePlot(), that calculates the 2-D Wasserstein (earth-movers distance) to transform each component in a signature for a given feature to each other component for the same feature in another signature set. A higher distance means two components are less similar and a lower distance means the opposite. The WassDistancePlot() function then orders the components in increasing order and colours high-similarity components lightly and less similar components a darker hue.

library(patchwork) # For easy plotting of multiple figures next to one another
cm_ex <- component_models
sigs_ex <- signatures
ggp2 <- WassDistancePlot(cm_a = cm_ex, 
                         cm_b = cm_hgsoc, 
                         component = 'osCN')
ggp3 <- WassDistancePlot(cm_a = cm_ex, 
                         cm_b = cm_hgsoc, 
                         component = 'copynumber')

ggp2$plot + ggp3$plot

The earth-movers distance heatmap shows that the two lowest components from the example signature’s oscillating CN-feature are both highly similar to the first component in the published signatures. The CN-feature ‘copynumber’ heatmap shows a typicaly trend of the lower components being much more similar to one another and then the distributions flatten and extend as they rise in value. Note that components A - c.6 and B - c.8 are less similar than we might expect. This is likely due to there being just 6 components in in the A set (example set), and so the value is pulled down.

V.2 - Shaded line plots of mixture components

Individual mixture components for a signature can be visualized using shaded line plots. In the case of the “segment size” copy-number feature this will be a series of Gaussians. The MixtureModelPlots() function from utanos can return plots for any or all of the components that compose each signature. Additionally, the specific components that are most prevalent for the signature indicated are shaded.

Some examples.

# Reorder the HGSOC signatures the same way as is done in the paper (keep things easy to compare)
sigs_hgsoc_df <- basis(sigs_hgsoc)
reord_britroc <- as.integer(c(2,6,5,4,7,3,1))
sigs_hgsoc_df <- sigs_hgsoc_df[,reord_britroc]

mixture_plots <- MixtureModelPlots(sigs_hgsoc_df, 
                                   cm_hgsoc, 
                                   sig_of_interest = 7)
mixture_plots$copynumber

As briefly mentioned in section V.2, the CN-feature ‘copynumber’ shows a typical trend of lower components being much more similar to one another and then the distributions flatten and extend as they rise in value reflecting an increase in variance. This is easy to see in the shaded line plots.

V.3 - Alluvial flow plots of samples across signatures

To make an Alluvial-flow/Sankey plot 2 comparisons are needed.

Signature exposures for each sample to each set of signatures:
Britroc HGSOC samples -> Britroc HGSOC Sigs
Britroc HGSOC samples -> Vancouver p53abn Endometrial Sigs

This is result in two matrices of signature exposures per sample.

The utanos function SEAlluvialPlot() takes two matrices of signature exposures and creates an Alluvial plot using the easyalluvial R-package. Of course, the same samples must be in both matrices. If not already installed, install the easyalluvial package.

library(easyalluvial)
# Pull exposures for the 91 highest quality samples out of the signatures object itself (the coefficients matrix)
expo_mat_hq <- NMF::scoef(sigs_hgsoc)
expo_mat_hq <- expo_mat_hq[reord_britroc,]
expo_mat_hq <- NormaliseMatrix(expo_mat_hq)
# Exposures were calculated for the 117-91 samples back in Section 3. 
expo_mat_mq <- sigexposures[reord_britroc,]
# Combine
expo_hgsoc <- cbind(expo_mat_hq, expo_mat_mq)

# Find exposures of HGSOC samples to p53abn EC signatures
expo_ec <- CallSignatureExposures(acn.obj,
                                  component_models = cm_hgsoc, 
                                  signatures = sigs_p53ec, 
                                  refgenome = 'hg19')

# Pass both matrices of signature exposures to the SEAlluvialPlot utanos function
alluvial_output <- SEAlluvialPlot(expo_hgsoc, expo_ec)
alluvial_output$plot

Notably, the alluvial plot highlights the likely similarity of signature 1 in HGSOC set and signature 2 in the p53abn EC set. Many samples with a maximum exposure to one, also have a max. exposure to the other. The same goes for A4 and B3.

Each of the past three subsections in combination with tests of significance can help the user in comparing and contrasting copy-number signature sets.

References

Gaujoux, Renaud, and Cathal Seoighe. 2010. “A Flexible r Package for Nonnegative Matrix Factorization.” BMC Bioinformatics 11 (1): 367. https://doi.org/10.1186/1471-2105-11-367.

Grün, Bettina, and Friedrich Leisch. 2008. “FlexMix Version 2: Finite Mixtures with Concomitant Variables and Varying and Constant Parameters.” Journal of Statistical Software 28 (4): 1–35. https://doi.org/10.18637/jss.v028.i04.

Macintyre, Geoff, Teodora E Goranova, Dilrini De Silva, Darren Ennis, Anna M Piskorz, Matthew Eldridge, Daoud Sie, et al. 2018. “Copy Number Signatures and Mutational Processes in Ovarian Carcinoma.” Nature Genetics 50 (9): 1262–70.