"Inverse weighting" adjustments to mutation rates, and other weird stuff

  • "Inverse weighting" adjustments to mutation rates, and other weird stuff

    I've just spent some time reviewing some of the Y STR calculators out there. Some conclusions and some questions. Informed commentary welcome.

    I see that most of them have a very crude method for determining the modal value in ambiguous cases where, strictly speaking, the marker has no unique mathematical mode. Generally, they seem to assume that the lowest value is ancestral. I have heard that there is evidence that observed father-son mutations do tend to be upward rather than downward, but I sincerely doubt this trend is strong enough to justify such a radical assumption during the construction of modal haplotypes. You can maybe argue that the particular case I observed is not representative, but I found that this default produced absolute nonsense when computing genetic distances among individuals who are apparently equidistant from the MRCA according to high-resolution SNP tests. Instead, for my particular case, I performed a manual calculation using the median value.
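    To make that concrete, here is a bare-bones sketch, with made-up repeat counts, of the difference between the "lowest value" default and the median fallback I used. This is only an illustration of the idea, not any calculator's actual code.

    from collections import Counter
    from statistics import median

    def pick_ancestral(alleles, method="median"):
        counts = Counter(alleles)
        top = max(counts.values())
        tied = sorted(a for a, c in counts.items() if c == top)
        if len(tied) == 1:
            return tied[0]              # a unique mode exists, no ambiguity
        if method == "lowest":          # the default I'm objecting to
            return min(alleles)
        if method == "median":          # what I fell back on
            return median(alleles)
        raise ValueError(method)

    # e.g. five donors report these repeat counts at one marker; 14 and 15 tie for the mode
    print(pick_ancestral([13, 14, 14, 15, 15], "lowest"))   # -> 13
    print(pick_ancestral([13, 14, 14, 15, 15], "median"))   # -> 14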

    This solved one problem and uncovered another, although that one was easily resolved. Because I was using the infinite alleles method, there were some otherwise inexplicable patterns in the GDs between donors, even though each donor now appeared to be equidistant from the MRCA. I just switched to the step-wise method, and (almost) all the anomalies appeared to be corrected.
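    For anyone unfamiliar with the two counting conventions, this is the whole difference in miniature (made-up marker values, single-copy markers only, not any particular calculator's implementation):

    def gd_infinite_alleles(hap1, hap2):
        # any difference at a marker counts as a single mutation event
        return sum(1 for a, b in zip(hap1, hap2) if a != b)

    def gd_stepwise(hap1, hap2):
        # every one-repeat step counts, so a two-step difference counts as 2
        return sum(abs(a - b) for a, b in zip(hap1, hap2))

    donor_a = [13, 24, 14, 11]
    donor_b = [13, 24, 16, 11]                    # two repeats off at the third marker
    print(gd_infinite_alleles(donor_a, donor_b))  # -> 1
    print(gd_stepwise(donor_a, donor_b))          # -> 2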

    There was still the matter of one locus which a single donor reported at a two-step difference from the modal, which led me to revisit what I think I know about multi-step mutations. From what I understand (maybe not accurately, or at least not up to date), multi-step mutations are more likely on multi-copy markers, but still account for somewhat less than 10% of all mutations overall. This, at least, is my inference from the exercise I describe here. But if that's true, why does the calculator default to the infinite alleles method? And it's not just the McGee calculator that does this.

    Then there is the matter of the assumed per-locus mutation rates. McGee's utility was created, I believe, before much research was published on this, so it's not surprising that its rates are a little arbitrary and, I think, way too high in light of more recent data (e.g., Heinila 2012). That is understandable.
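    Just to show how directly the assumed average rate drives the answer, here is a crude back-of-the-envelope estimator. It's my own sketch with invented numbers, and it ignores convergence, multi-step events, and per-locus detail, but it makes the sensitivity obvious.

    def rough_tmrca_generations(gd, n_markers, avg_rate_per_locus):
        # expected mutations separating two men ~ 2 * generations * n_markers * rate,
        # so invert that for a crude point estimate
        return gd / (2 * n_markers * avg_rate_per_locus)

    gd, n = 5, 67
    print(rough_tmrca_generations(gd, n, 0.004))   # about 9.3 generations
    print(rough_tmrca_generations(gd, n, 0.002))   # halve the assumed rate: about 18.7 generations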

    A little harder to work through are the various adjustments that other calculators appear to make. I believe that most of the popular calculators do this, including FTDNA's "TiP", but I have never heard the adjustments described more specifically than as "inverse weighting". I can't make anything more of this than that it relates in some way to the wide disparity in mutation rates among individual loci.

    I get it that these disparities will have a significant impact on the probability of hidden "convergent" mutations, but it seems to me that a straightforward tweak involving a Lorenz analysis of the AVERAGE rates is all that is required. In fact, the figures I arrive at under such a method typically come within 5% of the post-adjustment rates used by some of these calculators.

    What is currently inexplicable to me is why anyone felt it was necessary or even a good idea to adjust the rates in such a way that results in different age estimates for comparisons with the same genetic distance, the specific loci in question being the only differing variable. That just seems unrealistically cute to me. There is an irreducible randomness in this process, and becoming this specific just seems like splitting hairs. It's probably not a huge deal, since, as I said, my independent calculations of an adjusted rate typically differ by less than 5%, but conceptually it is difficult for me to understand.
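    Here is the kind of toy model I have in mind when I say this behavior bothers me. It is my own sketch with invented rates, not how TiP or any other calculator actually works, but it shows how a per-locus treatment can return two different age estimates for the same GD of 1, depending only on which locus carries the mismatch.

    import math

    rates = [0.010, 0.0005] + [0.002] * 35   # one fast locus, one slow locus, 35 "average" loci

    def posterior_mean_tmrca(mismatched_index, t_max=300):
        # flat prior over generations; P(mismatch at locus i by generation t) ~ 1 - exp(-2*mu_i*t)
        num = den = 0.0
        for t in range(1, t_max + 1):
            like = 1.0
            for i, mu in enumerate(rates):
                p = 1.0 - math.exp(-2.0 * mu * t)
                like *= p if i == mismatched_index else (1.0 - p)
            num += t * like
            den += like
        return num / den

    print(posterior_mean_tmrca(0))   # GD=1 carried by the fast locus
    print(posterior_mean_tmrca(1))   # GD=1 carried by the slow locus: a different number for the same GD

    In this toy the two estimates differ only modestly, which is consistent with my point that the refinement buys very little.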
    Last edited by benowicz; 6 January 2021, 11:55 AM.

  • #2
    Originally posted by benowicz View Post
    . . . I get it that these disparities will have a significant impact on the probability of hidden "convergent" mutations, but it seems to me that a straightforward tweak involving a Lorenz analysis of the AVERAGE rates is all that is required. In fact, the figures I arrive at under such a method typically come within 5% of the post-adjustment rates used by some of these calculators.

    What is currently inexplicable to me is why anyone felt it was necessary or even a good idea to adjust the rates in such a way that results in different age estimates for comparisons with the same genetic distance, the specific loci in question being the only differing variable. . . .
    One could say that these fast-mutating loci have a much larger probability of being the site of convergent mutations, but that would miss the entire point: Convergent mutations are the ones you CAN'T SEE--calculating the TMRCA with special reference to them would be pointless or maybe even impossible because the convergent mutation would be invisible. That's the special advantage of using a population-averaging method.

    I feel like this is one of those issues where the true answer is either simple but not widely appreciated, or so impossibly opaque that a PhD in advanced mathematics would be required just to understand the explanation. But to me it feels like those calculations kind of outsmart themselves. The validity of statistics is a property of mass populations, and this tendency to hyper-segregate into smaller and smaller subsets just seems a self-defeating violation of that principle.
    Last edited by benowicz; 6 January 2021, 12:43 PM.



    • #3
      I still want to hear a cogent argument for that aspect of the calculators' rate adjustments whereby comparisons involving identical GD at different loci return different TMRCA estimates. Just because it sounds like nonsense to me, not because I'm unable to deal with the trivial differences vs. my locus-averaging method.

      But an interesting thought has occurred to me: Since we're comfortable estimating the likelihood of convergent mutations for any number of visible mutations, we should be able to calculate the likelihood of identifying those convergent mutations by adding additional haplotypes to the comparison. Not that just any haplotype will do (the theory requires that all donors be equidistant from one another), but where such a comparison is possible, it could reduce any adjustment for convergence down to zero.
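      The kind of cross-check I have in mind would look something like this. It's a rough sketch with hypothetical repeat values, and it only flags markers where three or more allele states force at least two separate mutation events; true convergence to an identical value would still need the SNP branch structure to untangle.

      from collections import Counter

      # hypothetical panel: donor -> repeat counts, all donors assumed equidistant from the MRCA
      donors = {
          "A": {"DYS391": 10, "DYS439": 12, "DYS456": 15},
          "B": {"DYS391": 11, "DYS439": 12, "DYS456": 16},
          "C": {"DYS391": 10, "DYS439": 13, "DYS456": 17},
          "D": {"DYS391": 10, "DYS439": 12, "DYS456": 16},
      }

      def flag_multiple_events(donors):
          flagged = {}
          markers = next(iter(donors.values())).keys()
          for m in markers:
              values = Counter(hap[m] for hap in donors.values())
              # three or more distinct repeat values at a single marker cannot be
              # explained by one mutation event, so at least two events occurred there
              if len(values) >= 3:
                  flagged[m] = dict(values)
          return flagged

      print(flag_multiple_events(donors))   # -> {'DYS456': {15: 1, 16: 2, 17: 1}}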

      The upshot being that, under the right circumstances, it might be possible to extend the range of meaningful STR predictions well beyond current FTDNA recommendations, which I think is about 500 years.
      Last edited by benowicz; 9 January 2021, 04:11 PM.
