Run shallowHRD on a dataframe or file containing segmented relative copy-number data.

Utanos' version of the shallowHRD algorithm for detecting homologous recombination deficiency (HRD) on tumor samples from the number of large genomic alterations (LGAs). Original code found here (https://github.com/aeeckhou/shallowHRD), and original paper found here (https://academic.oup.com/bioinformatics/article/36/12/3888/5823300). Essentially, the algorithm can be divided into 5 sequential steps. We perform these steps two times (I call them 'passes'), so the first pass is sort of a 'discovery' stage, with still good estimates, but the second pass improves on the results from the first pass.

The overall idea is not so complicated: we have some raw segment data and want to find the number of LGAs of size >= 10Mb. To do so, the 5 steps are (excluding the cleaning functions):

Gather/merge segments by ratio_median: segments that have the same ratio_median might as well be considered as the same segment: imagine plotting the segments –segment 1– –segment 2–, since they have the same ratio_median, then they're on the same 'level' in the y-axis, so they can also be: —- a single segment—-. We merge segments in the same ratio_median AND in the same chromosome arm.

This is what GatherSegmentsByRatioMedian does. The other functions in the section do other minor things: adding chromosome arm column, removing spurious/unstable regions from the chromosomes, and other cleaning stuff. More description on their activities can be found in their respective documentation.

NOTE: in CleanBamRatiosFrame you have the option to log-transform the ratio_median's. If some ratio_medians in your file are negative, then this will produce NANs, in which case there's no need to log-transform.
Finding the threshold: we use KDE to find the minima of the density of all the ratio_median differences (between all pairs of segments). The idea here is that by finding the local minima of these differences, then we are finding a 'tolerance' point that determines whether we merge two segments or not. For example, imagine a plot again, and we have two segments (different ratio_medians, so they're not on the same level in the y-axis) :

—- segment 1 —- | diff <= threshold | ——————> —- segment 1 + segment 2 —- | —- segment 2 —-

Then the threshold indicates whether their distance is meaningful enough such that we should consider them a single segment. So, if their ratio_median difference is <= threshold, then we consider them the same segment because we aren't completely sure that they are far apart enough. If their ratio_median difference is > threshold, then we don't merge and we treat each segment as separate segments. All of this is illustrated above. More details on the procedure of finding the threshold find in the FindThreshold + helpers documentation.
Levels: while we are still not entirely sure the reasoning behind the section, we think that this section assigns segments a sort of 'confidence' in our merging. The idea here is that the largest segment likely has the highest confidence in its ratio_median as it was easier to sequence. Then, we use it as our reference to determine how confident we are with the rest of the segments. We use the threshold to see how close the other segments are to this largest segment, and if they are close (i.e. ratio_median difference <= threshold), then we assign it the same level as the largest segment. AssignLevels takes care of this. Finally, GatherSegmentsByLevels then merges segments in the same chromosome arm and with the same level.
Small segments. For the previous section, we were only working with 'large' segments, meaning their size is >= 3Mb. Now we have to figure out where the small segments fit in into our current segment data. InsertSmallSegments takes care of this, and it essentially iterates over the small segments and the large segments, and checks if we encounter any of the 6 cases it looks for. The cases depend on the positioning of the small segment with respect to the large segment, and they are all listed/visualized in the helper FinalizeSmallSegments documentation. Wherever appropriate, we merge small and large segments too (i.e. if their ratio_median diff <= threshold). By the end of the section, we have re-inserted small segments into our main segment data frame.
LGAs: now that we have re-inserted the small segments, we can finally determine the number of LGAs. CallLGA does this, and it is basically iterating over the finalized segments (marked as Graph 5) and checking which of the segments meet the criteria for an LGA (these requirements are listed in the aforementioned function's documentation.) The function returns the number of LGAs for different LGA sizes, the paper mostly cares about LGAs of size 10Mb, but you'll get results for sizes 3 to 11. You can also call GetLGAOfSize, which returns the segments that were marked as an LGA for the given LGA size, in case you want to visualize them.

To determine HRD, check the number of LGAs for 10Mb, and if the number of LGAs >= 20, then HRD was detected.

We repeat this procedure from the second step for the second pass. Note how the FindThreshold function doesn't receive the bam_ratio segments this time. Instead, we pass the finalized segments found in the first pass (Graph 5) - this is what we meant by 'improving' our results, as we are going to try to find a better threshold for this second round of function calling.

To finalize, while some of these choices seem arbitrary (e.g. why 3Mb as 'large' segments, and not 5MB? Why is >= 20 LGAs equal to HRD?), perhaps the following paper by the same author can give some intuition on their choices. Please refer to https://pubmed.ncbi.nlm.nih.gov/22933060/. Calculates the number of large genomic alterations for various sizes.

Usage

RunShallowHRD(
  raw_ratios_file,
  log_transform = TRUE,
  include_chr_X = FALSE,
  num_simulations = 1e+05,
  shrd_save_path = FALSE,
  sample = NULL,
  seed = 1337,
  plot = FALSE
)