2. Investigating Copy-Number Diversity • utanos

I - Introduction

Broadly, this vignette seeks to demonstrate how a user could answer the following three questions using Utanos:

In my dataset, where do copy-number (CN) gains and losses co-occur?
To what extent do these aberrations co-occur across individual samples?
How can this be done at the relative-CN vs absolute-CN scales?

Utanos provides two options by which to find answers to these questions.
1 - using plots/diagrams to visualize the differences, or 2 - by manually examining informative, well-formatted tables.

Exclusions: This vignette does not cover data filtration or setting masks in the QDNAseq object in any detail. The vignette titled “Sample Filtering and Quality Evaluation” covers the filtration and QC steps.

As an example dataset we will again use the samples found in the utanosmodellingdata repository found here. If not already done, clone that repo to somewhere convenient such as a common ‘repos’ folder on your machine and read in the example data. It is human endometrial carcinoma sWGS data aligned to hg19. The associated publication (Jamieson, 2024) was published in CCR and can be found here.

> library(utanos)
> library(QDNAseq)
> library(magrittr)
> rcn.obj <- readRDS("~/repos/utanosmodellingdata/sample_copynumber_data/sample_rcn_data.rds")

> rcn.obj
QDNAseqCopyNumbers (storageMode: lockedEnvironment)
assayData: 103199 features, 10 samples 
  element names: copynumber, segmented 
protocolData: none
phenoData
  sampleNames: CC-CHM-1341 CC-CHM-1347 ... CC-HAM-0385 (10 total)
  varLabels: name total.reads ... loess.family (6 total)
  varMetadata: labelDescription
featureData
  featureNames: 1:1-30000 1:30001-60000 ... Y:59370001-59373566 (103199 total)
  fvarLabels: chromosome start ... comCNV.mask (11 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

II - Call gains and losses on relative copy-number data

The callBins function from the QDNAseq package can be run on segmented copy number data to determine chromosomal aberrations. This function adds gain/loss probability slots to the input S4 object. Under the hood it makes use of the CGHcall R package. Please refer to those packages for further algorithmic details.

Here is an example after first filtering as outlined in the previous vignette.

> rcn.obj <- FilterCNs(cnobj = rcn.obj, genome = "hg19", 
                       maskcomCNVs = TRUE, removebins = TRUE)
> rcn.obj <- callBins(rcn.obj)

Notes:

Depending on dataset size the callBins() function can take a while

III - Plotting

III.1 - CN-Summary Plot

The CGHbase R-package provides an excellent plot showing the gain and loss probability across multiple samples. Utanos contains a modified version of this plotting function and it can be generated like so…

> SummaryCNPlot(rcn.obj, maskaberr = 0.1, maskprob = 0.2)

Notes:

The ‘maskaberr’ parameter - Default Value: 0.1. This parameter filters out all gains and losses across samples where the CN-change (the aberration itself) lies between this value and zero.
The ‘maskprob’ parameter - Default Value: 0.2. This parameter filters out all the gain/loss probabilities below this value. The callBins function assigns a probability of gain/loss to each bin of each sample. This is especially useful when analyzing noisy samples.

These parameters are quite similar but both are available because it allows users to mask low probability gain/loss regions without taking into account how large the gain/loss was.

III.2 - Relative-CN Diversity Plot

Alternatively, the relative copy-number changes can be visualized sample-by-sample as a heatmap with red signifying a relative gain and blue a loss. For just a few samples, row-wise clustering doesn’t have too much of an effect on the plot, but for larger cohorts setting the cluster parameter to TRUE can really help to highlight trends in the data. This is a simple hierarchical clustering.

> rcn.plot <- RCNDiversityPlot(rcn.obj, Xchr = TRUE, cluster = TRUE)
> rcn.plot

IV - Table Output

An alternative way to understand the data displayed in the plot for section III.1 is in tabular format. This allows for custom filtering, juxtaposing actual numbers, and creating dataset-specific plots. The MakeSummaryTable() function can be used to generate tables of the copy-number gains and losses.
If desired, the user can also generate a second table containing just the peaks by setting find_peaks = TRUE. The threshold parameters outlined in the below code section will apply to both the generated summary CN aberrations table and the peaks table. Peaks in the CN probability data are identified using the gsignal::findpeaks function. Peaks are simply defined as local maxima. This function itself is a port-over of its namesake function in the Octave package ‘signal’.
Alongside the copy-number gains and losses reported in these table is a column for the genes contained in the region. This is a convenience for checking whether or not some gene-of-interest is contained within a copy-number aberration. Just the human and mouse reference genomes are supported at this point (Ex. ‘mm10’).
The MakeSummaryTable() function returns a list with either 1 or 2 tables (type: tibble).

These tables can be easily saved as csvs, tsvs, or in your favourite tabular format for perusal in a different user-friendly application such as excel.

> library(gsignal)

# Probability of gain or loss threshold - i.e. declare the minimum mean probability of loss or gain across samples
# Expl. For region Chr1:2850001-2865000, and probabilities of loss for 10 samples of:
# 0.814 0 0 0 0 0.125 0.975 0 0 0
# The mean probability of loss would be 0.1914
> prob_loss <- 0.2
> prob_gain <- 0.2

# Log space mean relative copy number thresholds
# To keep things consistent with the plot from the last section, lets use 0.1
> low_threshold <- -0.1
> high_threshold <- 0.1

# Presence threshold - i.e. In what minimum proportion of samples does the gain or loss need to be present?
> proportion_threshold <- 0.2

# Run command
> output <- MakeSummaryTable(rcn.obj, 
                             low_threshold, high_threshold, 
                             prob_loss, prob_gain, 
                             proportion_threshold,
                             find_peaks = TRUE)
> output$summary_table %>% head(5)
# A tibble: 5 × 16
  chromosome    start      end mean_copy_number gain_probability loss_probability gain_count
  <chr>         <int>    <int>            <dbl>            <dbl>            <dbl>      <dbl>
1 1           8910001  9060000            0.157            0.244           0.0049          5
2 1           9600001 10230000            0.139            0.252           0.0038          6
3 1          10620001 10740000            0.119            0.222           0.0062          5
4 1          15420001 15630000            0.148            0.292           0.106           3
5 1          15960001 16080000            0.155            0.302           0.106           3
# ℹ 9 more variables: loss_count <dbl>, gain_proportion <dbl>, loss_proportion <dbl>,
#   gain_samples <chr>, loss_samples <chr>, bin_count <int>, sum_of_bin_lengths <dbl>,
#   coordinates <chr>, features_in_region <chr>

V - Scaling to the Absolute Copy-Number Space

V.1 - Performing the scaling

The Relative CNs space can be useful for identifying trends of relative gains and losses across samples in an experiment. However, absolute copy-numbers (aCNs) can be even more useful as they indicate ploidy and place real numbers on genomic features such as high-amplification events. Utanos makes use of the rascal package to scale from the relative to absolute space.
Here, we will briefly detail the options and workflow, but for a more thorough understanding we recommend the user read through the rascal vignette.

There are three major options for scaling depending on the data available:

mad/rmsd: Without any other data/metadata, just the copy-numbers, this is the best option. Different ploidy/cellularity combinations are assumed, the relative CNs are scaled to absolute, and then the difference of each value to the nearest whole-number is calculated. The difference is quantified as either the mean absolute deviation (mad) or root mean squared error (rmsq). Finally, a grid search is performed to determine the local minima and these are reported as the solution(s). If several ‘solutions’ are nearly equivalent then rascal returns several, otherwise a single cellularity-ploidy-mad/rmsq distance combo is returned. Many more options for tuning are available in the package and will not be discussed here.
vafs: In the case of having variant allele frequencies (vafs) for each sample, these can be used to identify the nearest ploidy-cellularity combination (as described in 1). There are two options under this umbrella. One will assume the maximum discovered VAF for the sample is an appropriate representation for the tumour fraction. Below is a equation describing the relationship used to select the optimal solution from the grid search. For more details see the vignette in the rascal github repo.

$tumour fraction_i \ = \ \frac{a_i c}{a_i c + 2(1-c)}$

$c = cellularity \text{, } a_i = \text{copy-number at locus i}$

The other will use the vafs of only certain genes the user identifies. For example if the user expects ‘PTEN’ vafs to be suggestive of the tumour fraction they can choose to scale using just those vafs.

Known ploidies: In the case of knowing the sample ploidies via some other method (ex. FISH), this option will simply scale the relative copy-numbers for each sample to absolute given those ploidies.

The utanos function documentation contains specifics on expected formatting for each of these options.

Utanos splits aCN scaling into two parts. First, the likeliest solutions (ploidy/cellularity/distance) are calculated using rascal, and then second the values are scaled and a new CN-object is made.

V.1.1 - Performing the scaling (part 1)

To find the optimal ploidy and cellularity, use the FindRascalSolutions() function.
Only the first argument is mandatory, the rest are optional.

Arguments:

cnobj - The relative copy-number object we’ve been using throughout all these vignettes.
min_ploidy & max_ploidy - Determines the range of ploidies to search from.
ploidy_step - The stepwise increment of ploidies along the range.
min_cellularity & max_cellularity - Determines the range of cellularities to search from.
cellularity_step - The stepwise increment of cellularities along the range.
distance_function - The distance function to use, “MAD” (mean absolute difference) or “RMSD” (root mean square difference).
distance_filter_scale_factor - The distance threshold above which solutions will be discarded as a multiple of the solution with the smallest distance.
max_proportion_zero - The maximum proportion of fitted absolute copy number values in the zero copy number state.
in_proportion_close_to_whole_number - The minimum proportion of fitted absolute copy number values sufficiently close to a whole number.
max_distance_from_whole_number - The maximum distance from a whole number that a fitted absolute copy number can be to be considered sufficiently close.
solution_proximity_threshold - How close two solutions can be before one will be filtered; reduces the number of best fit solutions where there are many minima in close proximity.
keep_all - TRUE to return all solutions but with additional best_fit column to indicate which are the local minima that are acceptable solutions (may be useful to avoid computing the distance grid twice).

> solutions <- FindRascalSolutions(cnobj)

V.1.2 - Performing the scaling (part 2)

To scale the samples use the CalculateACNs() function.
The first three arguments are mandatory, the rest are optional depending on context.
The first 4 arguments would benefit from some elaboration, so those will be described, and then we provide some examples for using this function.

Arguments:

cnobj - The relative copy-number object we’ve been using throughout all these vignettes.
acnmethod - The method for scaling. Expects a string. It can be one of:
- “maxvaf”: Chooses the solution that aligns with the maximum detected VAF. We assume that to be suggestive of the cellularity according to the formula above
- A character vector of the genes from which to pull VAFs, where genes are assumed to be in order of decreasing precedence, ex. c(‘TP53’, ‘KRAS’, ‘PTEN’),
- “mad”: Chooses the solution with the lowest MAD or RMSD score; in the case of a tie goes with the lower ploidy solution.
- A dataframe of known ploidies. Scales each sample to aCN space using these plodies. Expects a column with sample_id - matching those in the rcn.obj, and ploidy - a numeric value.
rascal_sols - The third argument expects a TSV file or data.frame of the calculated rascal solutions. See part 1 for details on creation.

Example solutions table:

> solutions <- FindRascalSolutions(cnobj)
> head(solutions)
##      sample_id ploidy cellularity distance
## 1: CC-CHM-1341   3.34        0.85    0.166
## 2: CC-CHM-1341   1.34        0.46    0.166
## 3: CC-CHM-1341   3.33        0.90    0.166
## 4: CC-CHM-1341   2.34        0.60    0.166
## 5: CC-CHM-1341   2.35        0.52    0.174
## 6: CC-CHM-1341   4.34        1.00    0.176

variants - The fourth argument for CalculateACNs, is an optional argument. In the event of having variants data this parameter expects a table or file path to a comma-separated values (CSV) file containing long-format data.
Requirements for the variants table when using VAFs from targeted panel sequencing or some other technology:
- The variants must must be in a datatable/dataframe
- The required columns are: sample_id, chromosome, start, end, gene_name, ref, alt, vaf (with these exact names, but in any order)
- Each row of said table must correspond to a unique variant
- Each variant must have an associated variant allele frequency
- Each row must also be associated with a specific sample, sample_id must match

Example variants data table:

> variants_path <- "~/some_path/variants.csv"
> variants <- data.table::fread(file = variants_path, sep = ',')
> variants <- variants %>% dplyr::rename(gene_name = gene_symbol)
> head(variants[, 1:10])
##      sample_id chromosome     start       end ref alt gene_name
## 1: CC-CHM-1341          3  30691871  30691872  GA   G    TGFBR2
## 2: CC-CHM-1341          3 142274739 142274740  AT   A       ATR
## 3: CC-CHM-1341          7 151945071 151945071   G  GT     KMT2C
## 4: CC-CHM-1341         10  90771767  90771767   G   A       FAS
## 5: CC-CHM-1341         16    343635    343635   C   T     AXIN1
## 6: CC-CHM-1341         17   7577556   7577556   C   T      TP53
##                       consequence estimated_impact  vafs
## 1:             frameshift_variant             HIGH 0.201
## 2:             frameshift_variant             HIGH 0.196
## 3: stop_gained,frameshift_variant             HIGH 0.212
## 4:               missense_variant         MODERATE 0.301
## 5:               missense_variant         MODERATE 0.226
## 6:               missense_variant         MODERATE 0.720

Note: this example table has more information than is necessary. There are extraneous columns.

Examples:

A. Scaling using the solution with the best (lowest) distance:

> outputs <- CalculateACNs(rcn.obj, 
                           acn.method = 'mad', 
                           rascal_sols = solutions)

B. Scaling using the ‘maxvaf’ option and providing some variant allele frequencies

> outputs <- CalculateACNs(rcn.obj, 
                           acn.method = 'maxvaf', 
                           rascal_sols = solutions, 
                           variants = variants, 
                           return_sols = TRUE, 
                           return_S4 = TRUE)
Warning message:
In ReplaceQDNAseqAssaySlots(cnobj, output[["acns"]], output[["acns_segs"]]) :
  Not all samples had ACN solutions.
    These samples are excluded from the returned acn object.
    The probability of loss/gain values in the returned acn object were calculated using the relative CNs object.
> names(outputs)
[1] "rascal_solutions" "acn.obj"
> outputs$rascal_solutions %>% head(5)
    sample_id ploidy cellularity tumour_fraction   vaf
1 CC-CHM-1341   1.34        0.46           0.365  0.72
2 CC-CHM-1347   3.46        0.32           0.592 0.874
3 CC-CHM-1355      2        0.83           0.828 0.395
4 CC-CHM-1361      2        0.72           0.719 0.313
5 CC-HAM-0369   1.94           1               1 0.872

C. Scaling by prioritizing variants from certain genes.

> outputs <- CalculateACNs(rcn.obj, 
                           acnmethod = c("TP53", "KRAS", "PTEN"), 
                           rascal_sols = solutions, 
                           variants = variants, 
                           return_sols = TRUE, 
                           return_S4 = TRUE)
Warning message:
In ReplaceQDNAseqAssaySlots(cnobj, output[["acns"]], output[["acns_segs"]]) :
  Not all samples had ACN solutions.
    These samples are excluded from the returned acn object.
    The probability of loss/gain values in the returned acn object were calculated using the relative CNs object.

Note: The warning message for C provides a useful message to users in case an ACN solution for some samples was not found by rascal.

V.2 - Plotting

V.2.1 - Copy-Number Segment Plots

These scaled copy-numbers, absolute copy-numbers (aCNs), can be visualized using the same plotting function mentioned in vignette 1. Let’s briefly compare the aCNs generated using the maxvaf solution compared to the ‘mad’ solution.

> maxvaf_outputs <- CalculateACNs(rcn.obj, 
                                  acn.method = 'maxvaf', 
                                  rascal_sols = solutions, 
                                  variants = variants, 
                                  return_S4 = TRUE)
> mad_outputs <- CalculateACNs(rcn.obj, 
                               acnmethod = 'mad', 
                               rascal_sols = solutions, 
                               return_S4 = TRUE)
> CNSegmentsPlot(maxvaf_outputs$acn.obj, 
                 sample = "CC-CHM-1341", 
                 min_copy_number = 0, 
                 max_copy_number = 12.5, 
                 copy_number_breaks = c(1:12))

> CNSegmentsPlot(mad_outputs$acn.obj, 
                 sample = "CC-CHM-1341", 
                 min_copy_number = 0, 
                 max_copy_number = 12.5, 
                 copy_number_breaks = c(1:12))

The scaling using the ‘mad’ set the sample ploidy to around 4 whereas the ‘maxvaf’ option instead preferred a ploidy of 2.

V.2.2 - Copy-Number Diversity Plots

Finally, similarly to section III.2, samples can be visualized all together as a heatmap. Using this strategy, sample-to-sample similarities and differences are easily differentiable. For datasets with larger numbers of samples, it is recommended to order the samples in some informative manner. This is done by default in the ACNDiversityPlot function. The ordering is found by performing hierarchical clustering on the passed in copy-number segments.

By design, the ACNDiversityPlot function expects data in long-format. This is to encourage users to think about the bin-size they would like to plot. For example, when plotting a genome 3 billion base pairs long bins 30kb across will barely be visible to the eye. A bin this size on a 7-inch plot would make up roughly 7e-05 of the width. So even if there is a one-bin amplification, it will not be visible on the heatmap. However, it will matter a lot the the plotting software. Drawing rectangles for this many minuscule bins potentially across many samples will be take a very long time!
TL;DR there is no real point in plotting tiny bins for this plot-type.
Instead, re-shape the data to a comfortable bin-size for plotting to improve speed. We have found 1e06 to 2e06 to be reasonably quick for up to several hundred samples.
If the user is interested in finding tiny amplification events, filtering and examining the tabular outputs would be a much better option (ex. outputs$acn_segment_tables).

> outputs <- CalculateACNs(rcn.obj, 
                           acnmethod = 'mad', 
                           rascal_sols = solutions)
> long_segs <- SegmentsToCopyNumber(outputs$acn_segment_tables, 
                                    1000000, 
                                    genome = 'hg19', 
                                    Xincluded = TRUE)
> colnames(long_segs) <- c("chromosome", "start", "end", "state", "sample_id")
> ACNDiversityPlot(long_segments = long_segs)