How does this paper meet the minimum standard for reperformance?


    Here's a recent paper (originally published 4 April 2021) by Iain McDonald re: TMRCA calculation from Y chromosome STR and SNP data. I'm mostly interested in the SNP analysis, but I have to say the paper seems pretty poorly written to me.

    In theory, Figure 4 should be constructed in an appropriate format, giving you all the information you need, in conjunction with the assumed mutation rate of one mutation per 83 years at the 15 mbp resolution (page 3) and a generation span of 35 years (page 12), to reperform the calculation whose results are summarized on page 19. Excel has built-in functions for the Poisson distribution, etc., so this isn't rocket science. Just 26 individual data points to be aggregated into 6 brother clades.
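The reperformance described above is mechanically simple. Here is a rough sketch in Python (rather than Excel); the mutation counts are hypothetical placeholders, not the 26 data points from Figure 4, and the rate and generation-span constants are the paper's stated assumptions:

```python
import math

# Assumptions stated in the paper: one SNP mutation per 83 years at
# 15 Mbp coverage, 35-year generation span. The counts below are
# HYPOTHETICAL placeholders -- substitute the Figure 4 data.
YEARS_PER_MUTATION = 83.0   # at 15 Mbp resolution
GENERATION_SPAN = 35.0      # years (McDonald's assumption)

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def tmrca_likelihood(tmrca_years, mutation_counts):
    """Joint Poisson likelihood of the observed per-clade mutation
    counts, given a candidate TMRCA in years."""
    lam = tmrca_years / YEARS_PER_MUTATION  # expected mutations per line
    p = 1.0
    for k in mutation_counts:
        p *= poisson_pmf(k, lam)
    return p

# Hypothetical counts for six brother clades (NOT the paper's data):
counts = [8, 9, 7, 10, 8, 9]

# The maximum-likelihood TMRCA under the Poisson model is just the mean
# count times the years-per-mutation rate:
mle = sum(counts) / len(counts) * YEARS_PER_MUTATION
print(round(mle, 1))   # mean of 8.5 mutations -> 705.5 years
```

Comparing this unweighted estimate against the page 19 figure is exactly the kind of check any reader should be able to do from the published tables.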

    But the reperformed results come nowhere near the TMRCA estimate reported by McDonald on page 19. Nowhere. Off by literally hundreds of years.

    Doesn't anybody proofread these things before they get published? Granted, this is still better than the copy of the Adamov 2015 paper I was given, which was completely opaque as to the sampled data. But it's still ultimately useless if the reported data are internally inconsistent. The Xue 2009 paper had a tiny sample size, but at least all the numbers checked against one another.


  • #2
    Okay, page 6 contains a discussion of weighting subclades by the number of donors. That gives the presentation internal coherence, but it distorts the data.

    [Attached image: S781 4 Sept 2021.png]

    You're no longer measuring the population's inter-generation mutation rate. Instead, you're trying to measure the compound effect of self-selected sampling error, the relative reproductive success of each subclade, and the true population inter-generation mutation rate. But you have no idea when the downstream clades diverged from one another; if you assumed divergence dates, you'd be committing yourself to circular logic with respect to the very assumption you're trying to test, the mutation rate. So there's no objective basis to properly measure the first two variables for weighting (i.e., the relative # of transmission events for each subclade).

    These variables are unmeasurable and completely idiosyncratic, and it would be improper to extrapolate them to the general population via this weighting factor. For this kind of exercise to have any statistical validity, your sampling units have to be independent at the top-level calculation, and that independence is destroyed by unnecessarily introducing this weighting factor. You should have just used the average number of mutations adjusted towards a common base resolution.
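The unweighted alternative suggested above is easy to illustrate. A short Python sketch, with made-up counts and coverages (not the paper's data): scale each clade's raw count to a common base resolution, then take a plain mean so each brother clade counts once, independently.

```python
# Sketch of the unweighted approach: scale each clade's observed
# mutation count to a common base resolution (15 Mbp here), then take
# an unweighted mean across brother clades. All numbers HYPOTHETICAL.
COMMON_MBP = 15.0

clades = [
    # (observed mutations, tested coverage in Mbp)
    (7, 10.0),   # e.g. a lower-coverage BY500-era test
    (9, 15.0),   # e.g. a BY700-era test
    (8, 15.0),
]

# Adjust each count up to the common resolution, then average with
# equal weight per clade -- no donor-count weighting anywhere.
adjusted = [k * COMMON_MBP / mbp for k, mbp in clades]
unweighted_mean = sum(adjusted) / len(adjusted)
print(round(unweighted_mean, 2))
```

The point of the design is that no unknowable population parameter (donor counts, transmission events) enters the calculation; only measured coverage does.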

    Plus, a 35 year generation span? That's almost certainly too high. In the modern era the consensus value is much closer to 31.5 years, and here we're extrapolating well beyond the modern era into Medieval times.

    I'm just going off memory here, but I believe even Adamov made that point about the selection of an appropriate generation span in their 2015 paper. They must have used something around 32 years, so although both Adamov and McDonald used a rate of one mutation per 83 years at the 15 mbp resolution, the implication of McDonald's paper is that he believes the per-generation mutation rate is faster than Adamov's.

    Comparing an unweighted Poisson distribution using the reported donor data, I get about 71% under McDonald's assumptions about generation span and mutation rate. Using my own model's assumptions of one mutation per 57 1/3 years at 15 mbp and a 31.5 year generation span, I get a figure of about 35%, somewhat better. That assumes an estimated birth year of 1245 for Sir John of Bonkyll, which McDonald himself arrived at from a review of the documentary evidence.

    My full model is based on the binomial rather than the Poisson distribution, which you could argue about from a theoretical point of view, but as a practical matter it's clear that the assumptions I used are at least partially validated by this Poisson exercise.
    Last edited by benowicz; 4 September 2021, 11:04 PM.


    • #3
      There are distinctions to be made in the reliability of TMRCA calculations based on sample size, but I don't think that violating data integrity in your top level sampling units is the way to do it. Consider presenting some supplementary data on the achieved level of Alpha risk. That's the kind of thing that users will find ready benchmarks for in other, more routine value estimation exercises.


      • #4
        As far as dealing with heterogeneous data is concerned, I think this is where my method leveraging the binomial distribution has a leg up. Given our ignorance of the true TMRCA, population mutation rate and sampling error for each clade, the only theoretically consistent approach is to assume that the observed array of mutations across all clades directly beneath the global MRCA sits at the 50% point of the probability curve. Don't try to correct data against a standard that you know you can't support. Just accept the data for what they are, with complete indifference.

        You know there is a sample size correction for the binomial distribution. You could waste a lot of time trying to calculate this directly for each unique array that you're looking at, or you could just run a quick automated goal seek routine in Excel. For the kind of remote relationships that you need a TMRCA calc for anyway, that adjustment is usually approximated by dividing the estimated TMRCA by the geometric mean of observed mutations, divided again by the estimated TMRCA. So that provides you a pilot estimate to improve efficiency.

        You apply that adjustment to your top-level sample data, and repeat until the geometric mean of the adjusted mutation-rate estimates for the sampled items approximates 50%. What you're essentially doing is centering the data around the middle of the binomial probability curve. At this stage, it doesn't matter how far off your pilot estimate of TMRCA is from the eventual most-likely estimate, because the adjustments are expressed as a ratio to the TMRCA. You're just synchronizing the observed distribution of mutations with the distribution curve of the population-wide mutation rate itself. If you have the correct adjustment selected, the geometric mean of the adjusted mutation-rate estimates will approximate 50%, regardless of the estimated TMRCA.

        The last step is to iteratively calculate the geometric mean of the probabilities associated with these adjusted mutation-rate estimates for a variety of TMRCA estimates, until it approximates 50%. The probabilities are interpolated from an assumed population-average mutation rate and a standard deviation. The confidence interval in Adamov 2015 implied a very specific curve shape, based on a standard deviation of about 1.0204*10^-9 years at the 15 mbp resolution.
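The goal-seek described over the last few paragraphs could be sketched roughly as follows. To be clear, everything here is my own illustrative reconstruction, not the actual model: the counts are invented, the statistic is simplified to a geometric mean of per-clade binomial CDF values, and trials are taken as transmission events at an assumed rate and generation span.

```python
import math

# ILLUSTRATIVE assumptions, not the actual calculation template:
YEARS_PER_MUTATION = 57.333   # proposed rate at 15 Mbp resolution
GENERATION_SPAN = 31.5        # years per transmission event

def binom_cdf(k, n, p):
    """P(K <= k) for Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

def geomean_cdf(tmrca_years, counts):
    """Geometric mean of per-clade binomial CDF values at a candidate
    TMRCA. Trials = transmission events; p = per-generation rate."""
    n = max(1, round(tmrca_years / GENERATION_SPAN))
    p = GENERATION_SPAN / YEARS_PER_MUTATION
    cdfs = [binom_cdf(k, n, p) for k in counts]
    return math.exp(sum(math.log(c) for c in cdfs) / len(cdfs))

def solve_tmrca(counts, lo=100.0, hi=3000.0, tol=0.5):
    """Bisect for the TMRCA where the statistic crosses 50%. The CDF
    falls as TMRCA grows (more expected mutations), so geomean_cdf is
    (step-wise) decreasing in tmrca_years."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if geomean_cdf(mid, counts) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

counts = [11, 13, 12, 14, 12, 13]   # hypothetical per-clade counts
print(round(solve_tmrca(counts)))
```

Excel's Goal Seek does the same search; the bisection above just makes the stopping rule (geometric mean of the probabilities approximating 50%) explicit.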

        The assumed population average mutation rate itself is the center of the whole controversy. Of course I ran trials using the Xue and Adamov rates. But a global comparison of results using another rate (i.e., 1 mutation per 57 1/3 years at the 15 mbp resolution and 31.5 year generation span) returned a geometric mean that more consistently approximated 50%, when applied to my data sets.

        I rarely had data as complete as what McDonald presented for S781 as regards the donors' actual achieved resolution, but I was able to make reasonably objective determinations of the product generation from published project data, and applied an assumed coverage of 10 mbp for BY500 and 15 mbp for BY700. If there is a serious deficiency in my methodology, this could be it. However, my ultimate goal was to produce a relatively user-friendly interface for the calculation template, and I highly doubt many users will be willing to go to the lengths that McDonald did in preparing Figure 4 of his paper. In any event, my methodology has yet to produce a less accurate estimate for any data set with a known MRCA that I've examined to date.


        • #5
          Well, this is depressing.

          Just checked on the Big Tree to verify the reported # of mutations per clade in McDonald's paper. It checks--that seems to be where he got the data for S781.

          The problem is that my personal experience with the Big Tree is that its data can be wildly inaccurate. Within FGC23343 alone, by reference to the FTDNA block tree, I've seen the Big Tree put SNPs in the wrong block (FGC62822, now corrected), botch a call (the terminal SNP my cousin shares w/ 2 other guys, FT372222, is rejected by the Big Tree for one of them, i.e., kit # 612257), and show an average of 21 unshared variants for kit #s 8Q7KV and N2640 where FTDNA shows only 13. If it were merely a matter of failure to update the Big Tree for upgrades, FTDNA should be the one reporting the larger number.

          How are we ever going to make any progress on this issue if we don't even have reliable mutation counts to begin with? FTDNA may not be perfect; it would certainly be a huge improvement if they reported resolution data within the block tree. But they do have a quality control process and customer service reps to field customer complaints. I just don't know what to think about relying on Big Tree data for this kind of work.


          • #6
            This is a mess. Not McDonald's presentation of his data--it was my fault for originally missing the note about his weighting of downstream subclade data. It recalculates quite straightforwardly if you can accept his weighting of the data.

            But you CAN'T accept that weighting. It's a violation of the fundamental principle of data independence. You can't correct data for an unknown population parameter like the true # of men belonging to each downstream subclade or how many transmission events they represent.

            But it's more than that. It's the want of really reliable data on coverage and the # of mutations. Maybe Big Tree really did a better job with S781 than they did with FGC23343, but I have no way of knowing that. It's a complete fluke that I caught FGC23343 on the ground floor, and that enough members had published project data to allow me to independently develop decent estimates of resolution. I got on the train way too late with S781, which is so much bigger anyway, to feel confident that I could identify all the reporting errors on Big Tree.

            At least McDonald gave enough data to identify errors in his methodology and make meaningful comparisons with the population mutation rates proposed by others. He assumes a 35 year generation span, which is out of line with the consensus value closer to the 31.5 years that I've been using. But on that 31.5 year basis, at the 15 mbp resolution level, McDonald's mutation rate works out to once in 74.7 years, Adamov's to once in about 83 years, and I recommend about once in 57.3 years.

            Beyond the assumptions, the structure of the McDonald (and I'm assuming Adamov) TMRCA algorithm is based on the Poisson distribution, whereas I'm using the binomial distribution. Maybe you could say that the Poisson distribution is theoretically superior in the sense that it assumes an infinite number of trials. Technically, the sampled unit is each base pair tested, not each year in the TMRCA range, so the numbers are very large and there is an argument that my model doesn't directly account for the possibility of observing multiple mutations arising from a single transmission event. But that possibility is extremely small--1/83*1/(83*14,999,999/15,000,000) or 0.01452%, to use McDonald's rate estimate. My model does capture this indirectly since my assumed mutation rate is really an effective rate derived through field observation and not a purely theoretical construction.
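The back-of-envelope figure quoted above checks out. A two-line verification of the chance of a second mutation arising in the same transmission event, under McDonald's rate:

```python
# Verifying the double-mutation-per-transmission figure quoted above,
# using McDonald's rate of one mutation per 83 years at 15 Mbp.
rate_first = 1 / 83.0
rate_second = 1 / (83.0 * 14_999_999 / 15_000_000)  # one site already used
p_double = rate_first * rate_second
print(f"{p_double:.5%}")   # prints 0.01452%
```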

            So, in my opinion, there's not really much headway being made here towards a really solid estimate of Y chromosome SNP mutation rates. Maybe we're getting a clearer idea of how a good study should be structured--McDonald doesn't really do such a bad job of communicating to a reperformance standard. It's just that I couldn't believe that he would try to include an adjustment for a totally unknown population parameter. That's just wrong. The real problem is probably the lack of a large body of reliable coverage and mutation data at a high level of precision.
            Last edited by benowicz; 5 September 2021, 10:23 AM.


            • #7
              Originally posted by benowicz:
              There are distinctions to be made in the reliability of TMRCA calculations based on sample size, but I don't think that violating data integrity in your top level sampling units is the way to do it. Consider presenting some supplementary data on the achieved level of Alpha risk. That's the kind of thing that users will find ready benchmarks for in other, more routine value estimation exercises.
              Apropos of the pointlessness of trying to weight for the relative number of donors represented in downstream clades, try computing the Alpha risk implied by the very small number of donors at every node up through the clade immediately below the global MRCA. There's no plausible scenario where you can ever get below 30%, which represents a ridiculous level of unreliability. You are essentially achieving no clarity at all regarding sampling error.


              • #8
                Keep at it! It's important to explore assumptions, methods, and hidden sources of bias. Out of this comes, eventually, an appreciation of just how far our still very limited data can be pushed.


                • #9
                  Thanks, John.

                  There are so many moving parts that it's easy to be impressed by a totally meaningless coincidence. That's why it's so important to have clarity about process. I think you're right--regardless of anything else, having an explicit discussion about reperformance furthers that goal.

                  Speaking of coincidence, it just so happens that around the time McDonald's paper was published in June--I only stumbled on it a couple of days ago--I performed a TMRCA calc for S781 using my own method. I dug up those notes, compared them to the detailed data that McDonald apparently derived from the Big Tree, and checked for discrepancies in individual data points. FTDNA and the Big Tree do have some jarring differences in the # of SNPs reported for individual donors, but in the aggregate, the average # I calculated across all clades agreed pretty closely with the #'s in McDonald's data--within a fraction of a mutation.

                  So although an ideal methodology would use granular resolution data like McDonald's--provided, of course, that those numbers are reliable--rather than the crude estimates I'd been using, the difference didn't produce a significant discrepancy in this particular case.

                  I didn't bother to recalculate a weighted average using my data, as I consider it a pretty fundamental violation of data integrity to weight the way McDonald did. As I mentioned, the reductions in sampling risk McDonald assumes disappear when you calculate the implicit Alpha risk figures at each node along the network. Given the significance of some of the individual count discrepancies between FTDNA and Big Tree, all weighting would accomplish is to exaggerate any distortions.

                  Anyhow, I also noticed one important point that I haven't raised yet. The data I included in my recalculation a la McDonald's method did NOT include the anchor SNP of the S781 block itself, which is clearly a methodological error. The goal is the TMRCA spanning from the average birth year of the living donors back to their MRCA, who by definition was a carrier of S781.

                  So did my apparently successful recalculation reveal another error in McDonald's methodology? Or is there yet another meaningless coincidence preventing us from correctly assessing McDonald's results?

                  Maybe future publishing protocols should include a protected copy of the actual calculations in Excel instead of just a broad narrative description. To highlight the problems created by failing to do so: again just by coincidence, adding an adjustment for S781 itself to McDonald's weighted figures, assuming the more accepted 31.5 year generation span, results in a near-perfect 49.4% Poisson distribution statistic.

                  You get the impression that we could be nearing a better estimate of the true average mutation rate, but there are so many opaque variables that you're left feeling less rather than more confident.


                  • #10
                    Oh, and one other point. We shouldn't be hanging our hats on apparently successful TMRCA recalculations for individual SNP blocks. My notes from June included a number of other case studies whose accuracy I considered globally by comparing the geometric mean of the confidence levels resulting at the true MRCA birth year to the 50% target. Globally, I came up with 46.7%, although individually S781 came in at 69.4%. Chasing individual successes could actually take us further away from a correct impression of the overall picture.
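The global check described above amounts to pooling the confidence level reached at each case study's known MRCA birth year and comparing their geometric mean to the 50% target. A minimal sketch; the first value echoes the 69.4% S781 figure mentioned above, but the remaining per-case levels are hypothetical stand-ins, not the actual June case-study results:

```python
import math

# Per-case confidence levels at each known MRCA birth year. First entry
# echoes the 69.4% S781 figure; the rest are HYPOTHETICAL stand-ins.
case_confidences = [0.694, 0.41, 0.52, 0.33, 0.48]

# Geometric mean across cases, to be compared with the 50% target.
geo_mean = math.exp(sum(math.log(c) for c in case_confidences)
                    / len(case_confidences))
print(f"{geo_mean:.1%}")
```

The rationale for the geometric (rather than arithmetic) mean is that the per-case levels are probabilities, so one badly miscalibrated case drags the pooled figure down multiplicatively instead of being averaged away.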

                    I don't think I saw anything like a global evaluation in the McDonald paper, although benchmarking individual cases vs. STR results--which I also did--is a good step forward.