Reasonable in theory, but not in practice


  • Reasonable in theory, but not in practice

    After reperforming a TMRCA calculation for an historical benchmark clade under my new, slightly-tweaked algorithm, I had an insight about Discover's (probable) methodology that I think very likely explains their very weird, shifting estimates for R-FGC23343 and subclades. My guess is that their fundamental orientation is opposite to mine: they focus on the distance from a fixed, remote ancestor, whereas I focus on the distance between the (only relatively) fixed dates associated with the living donors and a given clade.

    Orientation.png

    In theory, both approaches should return the same results--but only if the mutation/mutation observation and reporting processes functioned like a perfect genetic clock, instead of the highly variable stochastic processes that we know they actually are. For example, with respect to the MRCA between BY32575 and FGC23343, the (currently reported) typical donor variant counts are ~62.25 and ~45.16, respectively.

    That's a discrepancy of about 17 variants, or about 27% of the longer (and in my opinion, more likely correct) typical count for BY32575. This scale is shocking and raises some urgent questions about the precise manner in which this discrepancy arose. To date I've explored the possibility that some SNPs currently reported within the anomalously long block FGC28370 may properly belong to a different, ancestral block, but at least with regard to the data reported for my cousin's kit, FT37222, it doesn't seem likely. At the moment I have a lot of questions but no answers.
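
    As a quick check of the arithmetic above, here is a minimal sketch. The variable names are just labels chosen for illustration, and the two counts are the approximate typical donor variant counts quoted in this post.

```python
# Typical donor variant counts quoted above (approximate, as currently reported).
count_by32575 = 62.25
count_fgc23343 = 45.16

# Absolute gap between the two branches, measured from their shared MRCA.
discrepancy = count_by32575 - count_fgc23343

# The same gap expressed as a share of the longer (BY32575) count.
share_of_longer = discrepancy / count_by32575

print(f"discrepancy: {discrepancy:.2f} variants")
print(f"share of longer count: {share_of_longer:.1%}")
```

    The gap comes out at roughly 17 variants, a little over a quarter of the longer count.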

    However, I can make some important, objectively factual statements that constitute strong support that my dates are likely to be more accurate than Discover's.

    First, all relaxed clock methods displace the adjustments for anomalous data further away from your point of origin. So methods that orient themselves from a fixed common ancestor, as I believe that Discover may be using, are going to attribute anomalies to dates closer to the living donor. There are a number of problems with this:

    -The identity of the MRCA may be fixed, but the whole premise of the TMRCA exercise is that we don't know when he was born. And even though donor birth years are variable, they are known and vary only within a very tight time frame. So MRCA-oriented approaches are clearly working from an unknown variable towards a known variable, which is exactly the opposite of the way math works in all variable estimation algorithms.

    -The recognized, standard way to calculate the joint probability of several independent events (what I loosely call conditional probability here) is to multiply the probabilities associated with each individual event.

    For the purposes of illustration, let's say that the probability of observing an anomalous variant count over a fixed period of time is 1/4. Attributing this anomaly to a single ancestral clade, as my orientation from the living donor will tend to do, the probability will be 1/4. (1/4)^1=1/4.

    But attributing it separately to 3 distinct descendant clades, as the MRCA-oriented approach (probably) used by Discover does, the probability becomes 1/64. (1/4)^3=1/64.

    Would you rather play a lottery with a 25% chance of winning or only 1.56% chance of winning?

    By all means, whenever there is specific information available to confirm the specific range of dates within which an anomaly has occurred, overwrite the default, statistically calculated estimate. But if you don't know--and that is the whole premise of TMRCA estimation--why would you pick the objectively least probable estimate?
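
    The multiplication in the illustration above can be sketched as follows. The 1/4 figure is the illustrative value from this post, not a measured rate; the function name is hypothetical, and the calculation assumes the events are independent of one another.

```python
from fractions import Fraction


def joint_probability(p_single, n_events):
    """Probability that n independent events, each with probability p_single, all occur."""
    return p_single ** n_events


p_anomaly = Fraction(1, 4)  # illustrative value only, taken from the example above

one_clade = joint_probability(p_anomaly, 1)     # anomaly attributed once, upstream
three_clades = joint_probability(p_anomaly, 3)  # anomaly attributed to 3 descendant clades

print(one_clade, float(one_clade))
print(three_clades, float(three_clades))
```

    Using exact fractions keeps the 1/4 versus 1/64 comparison free of rounding.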

    Just for reference, here is a link to the weird block tree data that started this all. I don't know for a fact that this is how Discover is calculating these dates, but I do think it is a framework that at least makes their weird, constantly shifting estimates understandable, if not reasonable in practice.

    Like a car wreck, I just couldn't keep myself from looking, even though I know better. I've written what seems like twelve Bible-sized tomes about the death-spiral of quality in Discover's TMRCA estimates, but this one picture says it all. Way to steal my thunder.

  • #2
    I wish the allowed edit period were longer. I'm still developing these ideas. Often I realize that there is a better, more accurate expression available. For "count", read prioritize.




    Orientation v2.png



    The mechanic through which I prioritize these recent variants is my new (and still developing) Lineage Independence statistic, by which I weight data for averaging at the level of the clades immediately below the clade being aged. At the very top level they are aggregated by geometric mean, as per conditional probability.
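
    A rough sketch of the two aggregation steps just described: a weighted arithmetic mean across the clades immediately below the one being aged, and a geometric mean at the top level. The weights below are made-up placeholders standing in for Lineage Independence values, whose actual calculation is not shown in this post, and the function names and sample counts are hypothetical.

```python
import math


def weighted_mean(values, weights):
    """Arithmetic mean of values, each scaled by its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def geometric_mean(values):
    """n-th root of the product, computed via logs; all values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical variant counts for three child clades, with placeholder weights.
subclade_counts = [40.0, 44.0, 60.0]
independence_weights = [3.0, 2.0, 1.0]

print(weighted_mean(subclade_counts, independence_weights))
print(geometric_mean([45.16, 62.25]))
```

    The geometric mean is the natural average for quantities that combine by multiplication, which is presumably why it pairs with the probability argument above.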

    I think most of the difficulties in TMRCA estimation have to do with idiosyncratic features of a clade's topography, the way some clades seem to be way longer or way shorter than their neighbors. Obviously, this is a reflection of the randomness inherent in the process of mutation, etc. These companies and technologies have


    I don't currently have an opinion as to whether Discover is creating a discrete, identifiable element to prioritize older variants, but by default, averaging at each hierarchical level by the plain, un-weighted variant count will have the same effect, as each additional donor sharing an ancestral variant creates another instance.
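
    To illustrate the point about plain, un-weighted averaging, here is a toy comparison with entirely hypothetical numbers. It is not a reconstruction of Discover's method, only a sketch of how a heavily sampled lineage dominates a per-donor average.

```python
# Hypothetical clade with two child lineages: one heavily sampled, one with a single donor.
donor_counts_by_lineage = {
    "lineage_A": [40.0] * 10,  # ten donors sharing the same short branch
    "lineage_B": [60.0],       # one donor on a longer branch
}

# Plain average over every donor: each extra donor in lineage_A is another instance.
all_donors = [c for counts in donor_counts_by_lineage.values() for c in counts]
per_donor_mean = sum(all_donors) / len(all_donors)

# Average the lineages first, so each lineage counts once regardless of sampling.
lineage_means = [sum(c) / len(c) for c in donor_counts_by_lineage.values()]
per_lineage_mean = sum(lineage_means) / len(lineage_means)

print(round(per_donor_mean, 2), per_lineage_mean)
```

    With these made-up inputs the per-donor average sits close to the well-sampled lineage, while the per-lineage average lands midway between the two.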



    • #3
      What I've written is true--attributing anomalies solely to more recent generations will create large distortions in the probability distribution within the phylogenetic tree. But to be honest, I didn't choose the starting point for my cascading TMRCA estimates based on some quantified target statistic. I'm trying to identify such a statistic, but the fact is that I chose my starting point based on the subjectively perceived nodal density--the founders of very prolific lineages, specifically to optimize sampling, old enough to present a large number of Bernoulli trials.

      Of necessity, because there essentially are no reliable data before the Medieval period, all my benchmark data was chosen from that date range. However, when I think about it more deeply, I guess there must be many significant lineages that pre-date this period. They're absolutely of no use as benchmarks because the date range surrounding the founders' birth years is just too large.

      R-ZZ40_1 seems to be one of those significant pre-Medieval lineages. The typical block length between it and R-BY32575 is pretty small. But apart from this TMRCA estimate process, we have absolutely no reliable basis to attribute any specific age to it. We could try correlating geographical distribution and anthropological data to the historical record, but the margins of error are in the thousands of years. Useless for benchmarking.

      It is shocking how consistently and how closely my optimized algorithm produces superior estimates for these benchmark Medieval lineages, always using the same base mutation rate. There has to be some valid objective principle underlying my methodology that can be reduced to a quantified statistic.

      If I had access to some global statistics like number of descendant blocks for each block within the tree, maybe I could reliably derive such a statistic. I could identify patterns. But that's not going to happen.

      I can point to the consistency of some STR data with my results, as well. There are aggregation techniques available to optimize the reliability of simple one-to-one comparisons, but due to the potential for convergence, the further we go back in time, the less weight that seems to carry. It's probably not the basis for a fully scalable paradigm.

      FT372222 TMRCA Analysis - SNP & STR.png

      FG23343 dated phylogeny as of 6 December 2022.png



      • #4
        Originally posted by benowicz
        What I've written is true--attributing anomalies solely to more recent generations will create large distortions in the probability distribution within the phylogenetic tree. But to be honest, I didn't choose the starting point for my cascading TMRCA estimates based on some quantified target statistic. I'm trying to identify such a statistic, but the fact is that I chose my starting point based on the subjectively perceived nodal density--the founders of very prolific lineages, specifically to optimize sampling, old enough to present a large number of Bernoulli trials. . . .
        Actually, now that I think about it, I may have such a statistic at hand: a derivative of the Lineage Independence statistic that I'll maybe call the 'Nodal Density'. It's basically a measure of the number of effectively independent descendant lineages currently recorded for a given clade.


        Originally posted by benowicz
        . . . However, I can make some important, objectively factual statements that constitute strong support that my dates are likely to be more accurate than Discover's. . . . The recognized, standard way for properly calculating conditional probabilities is to multiply the probabilities associated with each individual event.

        For the purposes of illustration, let's say that the probability of observing an anomalous variant count over a fixed period of time is 1/4. Attributing this anomaly to a single ancestral clade, as my orientation from the living donor will tend to do, the probability will be 1/4. (1/4)^1=1/4.

        But attributing it separately to 3 distinct descendant clades, as the MRCA-oriented approach (probably) used by Discover does, the probability becomes 1/64. (1/4)^3=1/64.

        Would you rather play a lottery with a 25% chance of winning or only 1.56% chance of winning?

        By all means, whenever there is specific information available to confirm the specific range of dates within which an anomaly has occurred, overwrite the default, statistically calculated estimate. But if you don't know--and that is the whole premise of TMRCA estimation--why would you pick the objectively least probable estimate? . . .
        Raising the calculated un-likelihood associated with an anomaly to the power of a given clade's 'Nodal Density', I get an objective measure of the total improbability implied by various scenarios.

        Volumetric Probability Analysis.png

        It is mind-blowingly more likely that the resolution for the anomalously low variant counts for R-FGC23343 subclades with respect to R-ZZ40_1 lies upstream rather than downstream. It's still very weird and requires a better explanation than we've been given to date. But until that explanation arrives, it's clear that the MRCA for R-FGC23343 is much closer to 703 C.E., as I calculate, than the 454 B.C.E. or whatever Discover is saying at the moment (constantly shifting).
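
        The scaling described above, raising the improbability of a single anomaly to the power of a clade's Nodal Density, can be sketched like this. The function names are hypothetical, the 1/4 probability is the illustrative figure from earlier in the thread, and the densities in the loop are example values rather than calculated statistics.

```python
import math


def total_improbability(p_anomaly, nodal_density):
    """Joint probability if the anomaly must recur independently in each lineage."""
    return p_anomaly ** nodal_density


def log10_improbability(p_anomaly, nodal_density):
    """Same quantity on a log10 scale, easier to compare across scenarios."""
    return nodal_density * math.log10(p_anomaly)


p = 0.25  # illustrative probability of one anomaly, not a measured value
for density in (1, 3, 12):  # example densities only
    print(density, total_improbability(p, density), round(log10_improbability(p, density), 2))
```

        Working on the log scale avoids very small numbers underflowing when the density gets large.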

        Both the Nodal Density and the 'Equilibrium Variant Count' are byproducts of a TMRCA estimate cascade where the mutation rate is set to the default (i.e., once every ~60.85 years at BY700 resolution).
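
        For a sense of scale, the default rate can be applied as a simple strict-clock conversion. This is a deliberate simplification, not the normalized cascade itself; the function name is hypothetical, and the two counts are the typical donor variant counts quoted in the first post.

```python
YEARS_PER_VARIANT = 60.85  # default rate quoted above, at BY700 resolution


def variants_to_years(variant_count, years_per_variant=YEARS_PER_VARIANT):
    """Strict-clock conversion: elapsed years implied by a variant count."""
    return variant_count * years_per_variant


# Typical donor variant counts quoted in the first post of this thread.
for label, count in (("BY32575", 62.25), ("FGC23343", 45.16)):
    print(label, round(variants_to_years(count)))
```

        Under this simplification, a gap of about 17 variants corresponds to roughly a thousand years, which is why the discrepancy matters so much for the dates.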

        Order of operations.png

        Just as a reminder, here is where I introduce this new (I think new) idea of Lineage Independence.

        https://forums.familytreedna.com/for...ool#post333536
        Last edited by benowicz; 9 December 2022, 10:37 AM.



        • #5
          Well, I guess I can stop chasing my tail. There actually are no 17 missing variants in R-FGC23343. My dates are almost certainly correct.


          Not that different.png


          It is remarkable how so many of the blocks separating R-BY32575 from R-ZZ40_1 are validated by branching lineages. Especially in comparison to the long bottleneck in R-FGC23343. But there is considerable room for doubt in some of them. Most likely those fellows branching from R-BY32580 are (somewhat) more closely related to R-BY32575 than is otherwise apparent.


          I should never have doubted myself.png

          In a large enough sample, the simple mathematical average values within a stochastic system will approximate the true population average. But taken in isolation, the published data for specific clades must represent only a very small portion of the total descendants worldwide. You can't assume that the mathematical average variant count in the Block Tree will always correspond to the actual TMRCA in a very simple way.
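
          The sample-size point can be made concrete with a small sketch. It assumes variant counts behave like independent Poisson draws with a common mean, which is an idealization (lineages in the Block Tree share ancestry and so are not fully independent); the function name is hypothetical.

```python
import math


def standard_error_of_mean(expected_count, n_lineages):
    """Standard error of the average of n independent Poisson counts with the same mean."""
    return math.sqrt(expected_count / n_lineages)


expected = 62.25  # typical count quoted earlier in the thread
for n in (1, 4, 12, 100):
    print(n, round(standard_error_of_mean(expected, n), 2))
```

          The uncertainty shrinks only with the square root of the number of independent lineages, so a clade with a handful of donors can sit several variants away from the population average by chance alone.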

          Hence the whole point of normalizing variant counts for the discrete probability distribution in my algorithm and aggregating the results by weighted average Lineage Independence.

          I still don't know exactly what the Discover database is doing, but I do know that the input figures I'm using in my algorithm must be more consistent with the macro-level statistics than whatever they're using to support their calculations for pre-historic clades like R-ZZ40_1. After seeing these statistics, and the correspondence with the STR data noted previously, what reasonable doubt can be left?

          https://forums.familytreedna.com/for...578#post333578



          • #6
            Originally posted by benowicz
            Well, I guess I can stop chasing my tail. There actually are no 17 missing variants in R-FGC23343. My dates are almost certainly correct. . . .
            For further consideration, by process of elimination:

            1. The dates Discover proposes for R-FGC23343 imply about 17 "missing" variants since R-ZZ40_1, ALL of which would have been positioned after the MRCA of FGC23343. This mind-blowingly improbable scenario would have had to occur on about 12 separate occasions (i.e., the number of effectively independent lineages under FGC23343).

            REALLY? How many people are out there who have won the lottery 12 times in a row?

            2. For the sake of discussion, without regard to Discover's dating for specific clades, if we consider that perhaps these 17 missing variants were spread evenly under FGC23351, we would implicitly be saying that the actual coverage for those kits was about 27% lower than that of the kits under S21184 for the same product.

            REALLY? For that to be true, FTDNA's QC would have to have not only released data for kits that are outrageously out of product spec, but also limited that to kits under FGC23351, and knowingly reported false coverage statistics. I have the info for my cousin's kit, and the stats for several other donors have been stated in public projects. So I am confident in saying, "No, I don't believe it is reasonable to believe in such a conspiracy".

            3. Also for the sake of discussion, if we still could sustain the belief that there actually were 17 missing variants under FGC23351, given #1 and #2, they must be UPSTREAM of FGC23343. That would invalidate Discover's dates for FGC23343 and downstream clades, but it could actually be the case. In theory, if such an event occurred close enough to the split from R-ZZ40_1, it might have to happen only once, reducing the volumetric impact to a relatively more credible proportion.

            But why would anyone insist on such a scenario when the normalized variant counts reported by my algorithm are so consistent with the average number reported by the Block Tree under R-ZZ40_1 for both S21184 and FGC23351? Clearly some individual blocks upstream of BY32575 are anomalously long. Possibly FGC23351, upstream of FGC23343, could be short one or two variants--but I am extremely doubtful that it could be a full 17. On the other hand, almost certainly some blocks upstream of BY32575 should be normalized downward.

            Maybe one day I'll undertake such a normalized analysis of the clades between ZZ40_1 and BY32575. But that branch of the tree seems to have many more donors than FGC23351. I'm using an Excel workbook that has the virtue of transparency, but is inefficient from a memory point of view. I suspect that entering that volume of data would crash the workbook.
            Last edited by benowicz; 10 December 2022, 09:45 AM.



            • #7
              Originally posted by benowicz
              . . . But why would anyone insist on such a scenario when the normalized variant counts reported by my algorithm are so consistent with the average number reported by the Block Tree under R-ZZ40_1 for both S21184 and FGC23351? Clearly some individual blocks upstream of BY32575 are anomalously long. Possibly FGC23351, upstream of FGC23343, could be short one or two variants--but I am extremely doubtful that it could be a full 17. On the other hand, almost certainly some blocks upstream of BY32575 should be normalized downward. . .
              Another point in favor of the variant count between R-ZZ40_1 and R-BY32575 being somewhat over-stated instead of R-FGC23343 being under-stated: Mechanically speaking, "mass mutation" events seem to be more common than slow mutations or "mass deletion" events.

              It's easy to understand how a single gene conversion event might result in several new "point differences" which are not easily distinguished from true SNPs. But it stretches credulity to imagine such an event only coincidentally reverting several distinct specific positions to their ancestral values all at once. How could that even possibly happen?

              You'd have to believe that a DNA segment of considerable length coincidentally matched the ancestral values for a segment from a different position on the same chromosome. And with regard to Discover's dating of FGC23343 and subclades, on top of that, you'd have to imagine that chronologically, the overwritten SNPs all coincidentally related to the period AFTER FGC23343's MRCA. Impossible. There is no correlation between chronological sequence and physical position. Certainly not consistently over a time period corresponding to 17 variants.

              What other mechanical process could we even possibly imagine to account for such an under-count in FGC23343? Like a guy with XYY syndrome experienced true recombination that re-established ancestral values in the single copy that he passed on to his descendants? Apart from inherent health issues that make that unlikely, we'd have to imagine that this XYY syndrome condition was passed down several generations to account for the difference in variant count between the copies, and we know that this syndrome is NOT inherited.

              I'm not particularly well versed in the mechanics of DNA, so I am willing to be proven wrong. But I think the only credible mechanic to account for an anomalous undercount in a specific, contiguous timeframe would be simple slow mutation, and that is much rarer than the sudden appearance of multiple variants at once.



              • #8
                This thread was a great opportunity for me to work out some complicated thoughts.

                In retrospect, clarity would have been a lot better served by framing this as an introductory overview of fundamental principles surrounding TMRCA estimation. But I'm working on this problem almost in isolation, so the closest thing I have to a foil to work out the finer details is speculation on what might be driving the difference between my results and Discover's. However, the black box nature of Discover's proprietary information means that process is going to be slow, painful and incredibly digressive.

                In other words, I'm probably prepared for that discussion only now.

                I am maybe sometimes a bit harsh with Discover's more obvious failures. It is a complicated problem, so there are bound to be a certain number of dead ends and an unfortunate amount of backtracking. But on the other hand, they're a fully funded team of professionals. You'd expect them to be able to do better than this. How am I, working almost alone, able to consistently produce superior quality estimates when the resources available to me are so limited in comparison?

                I mean, does anybody really think that makes sense? That having more information and more resources should result in lower quality?

                For what it's worth, here is a first stab at those principles of TMRCA estimation:

                -Estimates can only be meaningful if they're supported by meaningful effort.

                -At a bare minimum, those efforts should include a probabilistic normalization of variant count data. Probability is not linearly distributed, so simple mathematical averaging is not going to cut it.

                -The most useful estimate of MRCA is the one at the center of the probability distribution. From this we can infer two important points:

                1.) The ultimate goal should be to optimize the aggregate probability of the phylogenetic tree as a whole. Therefore, relaxed-clock methods should focus on the densest rather than the oldest nodes, which are not necessarily identical.

                2.) Within that constraint, remote matches actually provide more information on MRCA than close matches. Not all observations can be weighted equally.

                -If you're going to claim that your estimates are validated by cross-referencing with STR data, you should actually do that. This isn't such a complete black box that the public won't eventually catch on.

                -Proposed resolutions for ambiguous data should be scrutinized for relative plausibility with respect to the genetic mechanics that may give rise to apparent SNPs. This doesn't seem to be happening right now.
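
                As one possible reading of the "probabilistic normalization" principle in the list above, here is a sketch that scores an observed variant count against an expected count under a Poisson model. The Poisson assumption and the function names are illustrative choices, not a description of any algorithm discussed in this thread; the two counts are the typical figures quoted in the first post, with the observed count rounded down to a whole number.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_cdf(k, mean):
    """P(X <= k): sum of the pmf from 0 through k inclusive."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


expected = 62.25  # typical count for BY32575 quoted in the first post
observed = 45     # typical count for FGC23343, rounded down to an integer

print(poisson_cdf(observed, expected))
```

                The tail probability falls off much faster than the raw difference in counts would suggest, which is the sense in which probability is not linearly distributed and a plain average of counts can mislead.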


                I've also identified a couple of areas for future study that might produce meaningful results. For instance, these exercises in aggregating STR data make me wonder whether better convergence modeling is possible. This could significantly clarify and extend the utility of STRs.
                Last edited by benowicz; 10 December 2022, 11:47 PM.

