
How does this paper meet the minimum standard for reperformance?


  • #31
    Originally posted by benowicz View Post
    . . . That seems to be what the internet is all about. People just talking to themselves.
    No, that's not quite it. That's a cynically reductive hot take. It's really more like asking "Wie geht es?" or "How are you?": a ritual of recognition, not an actual request for information to be seriously engaged with. The intention of lazy internet culture isn't to deceive or waste people's time; it's just saying "Hello". It only becomes a problem when somebody doesn't understand the ritual and gets upset when their expectations aren't met.

    Nobody ever tells you that upfront, although they probably should. The internet would probably be a very different place if the true nature of the ritual were plainly and widely understood.


    • #32
      Whoa! What a difference correct aggregation of data makes!

      Aggregated R-S781 from Iain McDonald study of 2021.png

      While it seems a near 100% certainty that the founder of R-S781 was Sir John Stewart of Bonkyll (d. 1298), his precise birth year is still a matter of debate. The current consensus seems to be about 1245, presumably bracketed by significant events in the direct contemporary documentary record.

      I'm not going to enter that debate. A bit above my paygrade, even if somehow I had direct access to all the relevant archives. Plus, that kind of incredible precision was well beyond my expectation for estimates with the kind of DNA data I'm working with. I think we should be very happy with an MoE of a decade or so.

      The point is that this rate/algorithm is capable of producing some shockingly accurate results. The current Beta of Discover estimates the MRCA at 1468 C.E. That is not completely comparable with this data, which is only a small selection of the total available to the Discover team from FTDNA's block tree, but using this data, I arrive at 1266 C.E. Way, way, WAY better than McDonald's estimates from the original paper (2021).
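      The arithmetic behind a point estimate like this is straightforward once an aggregate variant count is in hand: multiply the mean variants per lineage by an assumed years-per-SNP rate and subtract from the present. A minimal sketch (the ~83 years per SNP, the 9.1-variant mean, and the reference year are illustrative placeholders, not the actual figures behind the estimate above):

```python
def tmrca_year(mean_variants, years_per_snp=83.0, present=2022):
    """Point estimate of a TMRCA: 'present' minus elapsed mutation time.

    All parameters are illustrative assumptions, not the rate or counts
    actually used in the estimates discussed in this thread.
    """
    return present - mean_variants * years_per_snp

# A mean of ~9.1 variants at ~83 yr/SNP lands in the mid-13th century.
print(round(tmrca_year(9.1)))
```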

      Databases of commercial DNA-testing companies now contain more customers with sequenced DNA than any completed academic study, leading to growing interest from academic and forensic entities. An important result for both these entities and the test takers themselves is how closely two individuals are related in time, as calculated through one or more molecular clocks. For Y-DNA, existing interpretations of these clocks are insufficiently accurate to usefully measure relatedness in historic times. In this article, I update the methods used to calculate coalescence ages (times to most-recent common ancestor, or TMRCAs) using a new, probabilistic statistical model that includes Y-SNP, Y-STR and ancillary historical data, and provide examples of its use.

      Phylogeny has been updated at Genetic Homeland and FTDNA's block tree since McDonald's paper, to place R-A5025 and R-A5021 under a new block, R-FTT48.

      Variant counts were aggregated using the arithmetic mean rather than the median or geometric mean, as I had proposed earlier. Given large, diverse samples, those alternatives usually produce a somewhat smaller number, but the individual nodes here were, as you would expect, pretty small.
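      As an illustration of how those three aggregation choices diverge, here is a quick comparison on invented per-lineage counts (placeholder data, not the R-S781 figures):

```python
import statistics

# Per-lineage variant counts for a hypothetical node (invented data).
counts = [7, 8, 9, 9, 10, 12]

mean = statistics.mean(counts)             # arithmetic average, ~9.17
med = statistics.median(counts)            # 9
gmean = statistics.geometric_mean(counts)  # ~9.04, never above the mean

print(mean, med, gmean)
```

Because a smaller aggregate variant count translates directly into a younger TMRCA, even these small differences can shift an estimate by decades.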

      McDonald's original paper seemed to me to advocate for disaggregating this data, considering all donor samples to be equally indicative of the true TMRCA, despite the wild swings in representation and effective mutation rates of individual clades (e.g., R-BY11989.1 being about 42% of the total, and far slower than the other clades). I'm just guessing based on a handful of clades with equally weird disparities in representation and effective mutation (e.g., R-FGC23343 and subs), but I think Discover is also using this kind of disaggregated approach, resulting in sub-optimal TMRCA estimates.
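      To see how a single over-represented, slow-mutating clade drags a disaggregated (per-donor) average, here is a toy comparison. The clade sizes and per-donor means are invented, loosely echoing the "42% of donors, slower than its neighbors" situation described above:

```python
# Hypothetical per-clade data: (donor_count, mean_variants_per_donor).
# One clade holds ~42% of donors and mutates noticeably slower
# (numbers invented for illustration only).
clades = {
    "slow_heavy": (42, 6.0),   # over-represented, slow
    "a": (15, 9.0),
    "b": (14, 9.5),
    "c": (15, 10.0),
    "d": (14, 9.5),
}

total_donors = sum(n for n, _ in clades.values())

# Disaggregated: every donor counts equally, so the big slow clade dominates.
per_donor = sum(n * m for n, m in clades.values()) / total_donors

# Aggregated: every clade counts equally, regardless of donor count.
per_clade = sum(m for _, m in clades.values()) / len(clades)

print(per_donor, per_clade)
```

In this toy case the donor-weighted mean comes out well below the clade-weighted one, which at tens of years per variant is a material difference in the resulting TMRCA.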


      • #33
        Originally posted by benowicz View Post
        Whoa! What a difference correct aggregation of data makes! . . .
        Reperformed this calculation using only info reported for terminal clades within the FTDNA block tree as of today. FTDNA's phylogeny changed a little bit, but my approach to resolution--inferring it from relative variant count length and FTDNA's standard product specifications rather than observing it directly--remains the same.

        R-S781 6 December 2022.png

        The "big" change in my methodology is in aggregation, ensuring that lineage data is weighted for their relative independence with regard to the top-level ancestral clade, instead of just raw donor count, which I definitely used in the past, and I suspect Discover is still using.
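        One way to picture that weighting: average each top-level branch first, then average the branch means, so a branch's contribution reflects its independence from the ancestral clade rather than how many customers happen to have tested in it. This is my sketch of the general idea, not the poster's actual algorithm, and the branch data are invented:

```python
# Invented per-lineage variant counts, grouped by top-level branch.
# One branch is heavily sampled; the others are thin.
branches = {
    "branch_A": [6, 7, 6, 5, 6, 7, 6, 6],  # heavily sampled
    "branch_B": [9, 10],
    "branch_C": [11],
}

def raw_donor_mean(branches):
    """Every donor weighted equally (the 'disaggregated' approach)."""
    all_counts = [c for cs in branches.values() for c in cs]
    return sum(all_counts) / len(all_counts)

def independence_weighted_mean(branches):
    """Average within each branch, then across branches."""
    return sum(sum(cs) / len(cs) for cs in branches.values()) / len(branches)

print(raw_donor_mean(branches), independence_weighted_mean(branches))
```

With the invented data, the heavily sampled branch pulls the raw donor mean down, while the branch-weighted mean lets the thinly sampled but independent lineages speak with equal voice.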

        I think most of the difficulties in TMRCA estimation have to do with idiosyncratic features of a clade's topology: the way some clades seem to be far longer or far shorter than their neighbors. Obviously, this is a reflection of the randomness inherent in the process of mutation, etc. These companies and technologies have

        I attribute the superior accuracy and consistency of my estimates mainly to normalization of the variant count data for its discrete probability distribution, but this new aggregation method is a real improvement too, probably most relevant to clades with really anomalous topology (like FGC23343). It's very encouraging that it still returns very good estimates for benchmark clades like S781: if an algorithm doesn't produce consistent quality, estimates that closely approximate known data look like either dumb luck or cheating.
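        Since variant counts are discrete event counts, a Poisson model is the natural way to attach a margin of error to them. A hedged sketch of what that normalization could look like, using a normal approximation to the Poisson (this is my guess at the general technique, not the poster's actual method; the rate and counts are placeholders):

```python
import math

def tmrca_interval(mean_variants, n_lineages, years_per_snp=83.0,
                   present=2022, z=1.96):
    """Approximate 95% CI for a TMRCA estimate, treating the total
    variant count across lineages as Poisson-distributed.

    All parameters are illustrative assumptions.
    """
    total = mean_variants * n_lineages       # total observed variants
    se_mean = math.sqrt(total) / n_lineages  # Poisson SE of the mean count
    lo = present - (mean_variants + z * se_mean) * years_per_snp
    hi = present - (mean_variants - z * se_mean) * years_per_snp
    return lo, hi

# With ~9.1 mean variants over 10 lineages, the interval spans a few
# centuries around the mid-13th-century point estimate.
print(tmrca_interval(9.1, 10))
```

Note how the interval tightens as independent lineages are added: the standard error of the mean shrinks with the number of lineages, which is exactly why correct weighting of independent branches matters for the MoE and not just the point estimate.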

        Now that I have a more meaningful way to assess the contribution of individual donors, I think my estimates of MoE, confidence intervals and Alpha are much more reliable. This is becoming a real science, instead of just lazily throwing out dates that essentially mean nothing.