  • How does this paper meet the minimum standard for reperformance?

    Here's a recent paper (originally published 4 April 2021) by Iain McDonald re: TMRCA calculation from Y chromosome STR and SNP data. I'm mostly interested in the SNP analysis, but I have to say the paper seems pretty poorly written to me.

    In theory, Figure 4 should give you all the information you need, in conjunction with the assumed mutation rate of one mutation per 83 years at the 15 mbp resolution (page 3) and the generation span of 35 years (page 12), to reperform the calculation whose results are summarized on page 19. Excel has standard functions for the Poisson distribution, etc., so this isn't rocket science. Just 26 individual data points to be aggregated into 6 brother clades.
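    For what it's worth, here's roughly what that kind of reperformance looks like outside Excel. This is only a sketch: the per-clade mutation counts below are placeholders rather than the actual Figure 4 values, and the only inputs taken from the paper are the 1-mutation-per-83-years rate assumption.

```python
# Rough sketch of the reperformance described above -- NOT McDonald's actual
# calculation.  The per-clade average mutation counts are placeholders; only
# the rate assumption (1 SNP per 83 years at 15 mbp coverage) comes from the paper.
from scipy.stats import poisson

YEARS_PER_MUTATION = 83.0                      # paper's assumed rate at 15 mbp

# Hypothetical average mutation counts for the 6 brother clades (placeholders).
clade_avg_mutations = [10.2, 11.0, 9.5, 10.8, 9.9, 10.4]

def mean_poisson_cdf(tmrca_years):
    """Average cumulative Poisson probability across clades for a trial TMRCA."""
    expected = tmrca_years / YEARS_PER_MUTATION
    cdfs = [poisson.cdf(round(c), expected) for c in clade_avg_mutations]
    return sum(cdfs) / len(cdfs)

# Scan trial TMRCAs (years back from the donors' average birth year) and keep
# the one whose average Poisson CDF sits closest to the 50% point.
best = min(range(400, 1500, 5), key=lambda t: abs(mean_poisson_cdf(t) - 0.5))
print("TMRCA closest to the 50% point:", best, "years")
```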

    But the reperformed results come nowhere near the TMRCA estimate reported by McDonald on page 19. Nowhere. Off by literally hundreds of years.

    Doesn't anybody proofread these things before they get published? Granted, this is still better than the copy of the Adamov 2015 paper I was given, which was completely opaque as to the sampled data. But it's still ultimately useless unless the reported data are internally consistent. The Xue 2009 paper had a tiny sample size, but at least all the numbers checked against one another.

    https://science.thewire.in/the-scien...them%20to%20be.


  • #2
    Okay, page 6 contains a discussion of weighting subclades for the number of donors. That gives the presentation internal coherence, but it distorts the data.


    S781 4 Sept 2021.png


    You're no longer measuring the population's inter-generation mutation rate. Instead, you're trying to measure the compound effect of self-selected sampling error, the relative reproductive success of each subclade, and the true population inter-generation mutation rate. But you have no idea when the downstream clades diverged from one another--and if you assumed you did, you'd be committing yourself to circular logic with respect to the very assumption you're trying to test, the mutation rate. So there's no objective basis for properly measuring the first two variables used in the weighting (i.e., the relative # of transmission events for each subclade). These variables are unmeasurable and completely idiosyncratic, and it would be improper to extrapolate them to the general population via this weighting factor. For this kind of exercise to have any statistical validity, your sampling units have to be independent at the top-level calculation, and that independence is destroyed by unnecessarily introducing this weighting factor. You should have just used the average number of mutations, adjusted towards a common base resolution.

    Plus, a 35-year generation span? That's almost certainly too high. In the modern era the consensus value is much closer to 31.5 years, and given that we're extrapolating well beyond that into medieval times, 35 years is certainly too high.

    I'm just going off memory here, but I believe even Adamov made that point about selecting an appropriate generation span in their 2015 paper. They must have used something around 32 years, so although one mutation per 83 years at the 15 mbp resolution is what both Adamov and McDonald used here, the implication of McDonald's paper is that he believes the per-generation mutation rate at 15 mbp is faster than Adamov's.

    Comparing an unweighted Poisson distribution using the reported donor data, I get about 71% for McDonald's assumptions about generation span and mutation rate. Using my own model's assumptions of one mutation per 57 1/3 years at 15 mbp and a 31.5-year generation span, I get a figure of about 35%, somewhat better (i.e., closer to the 50% target). Both figures assume an estimated birth year for Sir John of Bonkyll of 1245, which McDonald himself arrived at from a review of the documentary evidence.

    My full model is based on the binomial rather than the Poisson distribution, which you could argue about from a theoretical point of view, but as a practical matter the assumptions I used are at least partially validated by this Poisson exercise.
    Last edited by benowicz; 4 September 2021, 11:04 PM.



    • #3
      There are distinctions to be made in the reliability of TMRCA calculations based on sample size, but I don't think that violating data integrity in your top level sampling units is the way to do it. Consider presenting some supplementary data on the achieved level of Alpha risk. That's the kind of thing that users will find ready benchmarks for in other, more routine value estimation exercises.



      • #4
        As far as dealing with heterogeneous data is concerned, I think this is where my method leveraging the binomial distribution has a leg up. Given our ignorance of the true TMRCA, the population mutation rate and the sampling error for each clade, the only theoretically consistent approach is to assume that the observed array of mutations across all clades directly beneath the global MRCA sits at the 50% point of the probability curve. Don't try to correct the data against a standard that you know you can't support. Just accept the data for what they are, with complete indifference.

        You know there is a sample size correction for the binomial distribution. You could waste a lot of time trying to calculate it directly for each unique array you're looking at, or you could just run a quick automated Goal Seek routine in Excel. For the kind of remote relationships you need a TMRCA calc for anyway, that adjustment is usually approximated by dividing the estimated TMRCA by the geometric mean of observed mutations and then dividing again by the estimated TMRCA--in other words, the inverse of the geometric mean. That provides a pilot estimate to improve efficiency.

        You apply that adjustment to your top-level sample data and repeat until the geometric mean of the adjusted mutation rate estimates for the sampled items approximates 50%. What you're essentially doing is centering the data around the middle of the binomial probability curve. At this stage, it doesn't matter how far off your pilot estimate of the TMRCA is from the eventual, most likely estimate, because the adjustments are expressed as a ratio to the TMRCA. You're just synchronizing the observed distribution of mutations with the distribution curve of the population-wide mutation rate itself. If you have the correct adjustment selected, the geometric mean of the adjusted mutation rate estimates will approximate 50%, regardless of the estimated TMRCA.

        The last step is to iteratively calculate the geometric mean of the probabilities associated with these adjusted mutation rate estimates over a range of TMRCA estimates, until it approximates 50%. These probabilities are interpolated from an assumed population average mutation rate and a standard deviation. The confidence interval in Adamov 2015 implied a very specific curve shape, based on a standard deviation of about 1.0204*10^-9 years at the 15 mbp resolution.
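To make the "iterate until the geometric mean hits 50%" step concrete, here's a loose sketch with Excel's Goal Seek swapped for a numerical root-finder. The per-clade counts are made up, and the binomial parameterization (one Bernoulli trial per elapsed year, success probability 1/57.33) is just my reading of the description above, not the actual template; the centering and standard-deviation steps aren't reproduced.

```python
# Loose illustration of the "goal seek to 50%" idea, with Excel's Goal Seek
# replaced by scipy's brentq root-finder.  The per-clade counts are
# placeholders, and the binomial parameterization (one trial per elapsed
# year, p = 1/57.33) is an assumption for this sketch only.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom

P_PER_YEAR = 1.0 / 57.33                  # assumed rate: 1 mutation per 57 1/3 years at 15 mbp
clade_counts = [11, 13, 10, 12, 9, 11]    # hypothetical average mutations per brother clade

def geo_mean_confidence(tmrca_years):
    """Geometric mean of the per-clade binomial CDF values at a trial TMRCA."""
    n_trials = int(round(tmrca_years))
    probs = [binom.cdf(c, n_trials, P_PER_YEAR) for c in clade_counts]
    return float(np.exp(np.log(probs).mean()))

# "Goal seek": the TMRCA at which the geometric-mean confidence equals 50%.
tmrca = brentq(lambda t: geo_mean_confidence(t) - 0.5, 200, 3000)
print(f"TMRCA where the geometric-mean confidence crosses 50%: {tmrca:.0f} years")
```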

        The assumed population average mutation rate itself is the center of the whole controversy. Of course I ran trials using the Xue and Adamov rates. But a global comparison of results using another rate (i.e., 1 mutation per 57 1/3 years at the 15 mbp resolution and 31.5 year generation span) returned a geometric mean that more consistently approximated 50%, when applied to my data sets.

        I rarely had data as complete as what McDonald presented for S781 regarding the donors' actual achieved resolution, but I was able to make reasonably objective determinations of the product generation from published project data, and applied an assumed coverage of 10 mbp for BY500 and 15 mbp for BY700. If there is a serious deficiency in my methodology, this could be it. However, my ultimate goal was to produce a relatively user-friendly interface for the calculation template, and I highly doubt many users will be willing to go to the lengths that McDonald did in preparing Figure 4 of his paper. In any event, my methodology has yet to produce a less accurate estimate for any data set with a known MRCA that I've examined to date.



        • #5
          Well, this is depressing.

          Just checked on the Big Tree to verify the reported # of mutations per clade in McDonald's paper. It checks--that seems to be where he got the data for S781.



          The problem is that my personal experience with the Big Tree is that its data can be wildly inaccurate. Within FGC23343 alone, by reference to the FTDNA block tree, I've seen the Big Tree put SNPs in the wrong block (FGC62822, now corrected), botch a call (the terminal SNP my cousin shares with 2 other guys, FT372222, is rejected by the Big Tree for one of them, i.e., kit # 612257), and report an average of 21 unshared variants for kit #s 8Q7KV and N2640 where FTDNA shows only 13. If it were merely a matter of the Big Tree not being updated for upgrades, FTDNA should be the one reporting the larger number.



          How are we ever going to make any progress on this issue if we don't even have reliable mutation counts to begin with? FTDNA may not be perfect; it would certainly be a huge improvement if they reported resolution data within the block tree. But they do have a quality control process and customer service reps to field customer complaints. I just don't know what to think about relying on Big Tree data for this kind of work.



          • #6
            This is a mess. Not McDonald's presentation of his data--it was my own fault that I originally missed the note about his weighting of downstream subclade data. It recalculates quite straightforwardly if you can accept his weighting of the data.

            But you CAN'T accept that weighting. It's a violation of the fundamental principle of data independence. You can't correct data for an unknown population parameter like the true # of men belonging to each downstream subclade or how many transmission events they represent.

            But it's more than that. It's the want of really reliable data on coverage and the # of mutations. Maybe Big Tree really did a better job with S781 than they did with FGC23343, but I have no way of knowing that. It's a complete fluke that I caught FGC23343 on the ground floor, and that enough members had published project data to allow me to independently develop decent estimates of resolution. I got on the train way too late with S781, which is so much bigger anyway, to feel confident that I could identify all the reporting errors on Big Tree.

            At least McDonald gave enough data to identify errors in his methodology and make meaningful comparisons with the population mutation rates proposed by others. He assumes a 35-year generation span, which is out of line with consensus values closer to the 31.5 years that I've been using. But on that 31.5-year basis, at the 15 mbp resolution level McDonald's mutation rate works out to once in about 74.7 years, Adamov's to once in about 83 years, and I recommend about once in 57.3 years.

            Beyond the assumptions, the structure of the McDonald (and I'm assuming Adamov) TMRCA algorithm is based on the Poisson distribution, whereas I'm using the binomial distribution. Maybe you could say that the Poisson distribution is theoretically superior in the sense that it assumes an infinite number of trials. Technically, the sampled unit is each base pair tested, not each year in the TMRCA range, so the numbers are very large and there is an argument that my model doesn't directly account for the possibility of observing multiple mutations arising from a single transmission event. But that possibility is extremely small--1/83*1/(83*14,999,999/15,000,000) or 0.01452%, to use McDonald's rate estimate. My model does capture this indirectly since my assumed mutation rate is really an effective rate derived through field observation and not a purely theoretical construction.

            So, in my opinion, there's not really much headway being made here towards a really solid estimate of Y chromosome SNP mutation rates. Maybe we're getting a clearer idea of how a good study should be structured--McDonald doesn't really do such a bad job of communicating to a reperformance standard. It's just that I couldn't believe he would try to include an adjustment for a totally unknown population parameter. That's just wrong. The real problem is probably the lack of a large body of reliable coverage and mutation data at a high level of precision.
            Last edited by benowicz; 5 September 2021, 10:23 AM.



            • #7
              Originally posted by benowicz View Post
              There are distinctions to be made in the reliability of TMRCA calculations based on sample size, but I don't think that violating data integrity in your top level sampling units is the way to do it. Consider presenting some supplementary data on the achieved level of Alpha risk. That's the kind of thing that users will find ready benchmarks for in other, more routine value estimation exercises.
              Apropos of the pointlessness of trying to weight for the relative number of donors represented in downstream clades, try computing the Alpha risk implied by the very small number of donors at every node up through the clade immediately below the global MRCA. There's no plausible scenario where you can ever get below 30%, which represents a ridiculous level of unreliability. You are essentially achieving no clarity at all regarding sampling error.
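For anyone who wants to try that, here is one possible reading of "Alpha risk implied by the number of donors"--not necessarily the exact formula intended above: solve the usual margin-of-error relation for alpha, given the number of donors at a node. The mean count of 10 and the +/-10% margin are illustrative assumptions only.

```python
# One possible reading of "Alpha risk implied by the number of donors" -- not
# necessarily the formula used in the posts above.  For a node with n donors,
# a mean mutation count, and Poisson-like spread (sd ~ sqrt(mean)), solve the
# usual margin-of-error relation  E = z_{alpha/2} * sd / sqrt(n)  for alpha.
from math import sqrt
from scipy.stats import norm

def implied_alpha(n_donors, mean_count=10.0, rel_margin=0.10):
    """Two-sided alpha achievable for a +/- rel_margin error on the mean count."""
    sd = sqrt(mean_count)                      # Poisson-style variance assumption
    z = rel_margin * mean_count * sqrt(n_donors) / sd
    return 2.0 * (1.0 - norm.cdf(z))

for n in (1, 2, 3, 5, 10):
    print(f"{n:2d} donors -> implied alpha ~ {implied_alpha(n):.0%}")
```

Under these assumptions, even 10 donors at a node leaves the implied alpha above 30%, which is the point being made above about the tiny donor counts at each node.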



              • #8
                Keep at it! It's important to explore assumptions, methods, and hidden sources of bias. Out of this comes, eventually, an appreciation of just how far our still very limited data can be pushed.



                • #9
                  Thanks, John.

                  There are so many moving parts that it's easy to be impressed by a totally meaningless coincidence. That's why it's so important to have clarity about process. I think you're right--regardless of anything else, having an explicit discussion about reperformance furthers that goal.

                  Speaking of coincidence, it just so happens that around the time McDonald's paper was published in June--I only stumbled on it a couple of days ago--I performed a TMRCA calc for S781 using my own method. I dug up those notes, compared them to the detailed data that McDonald apparently derived from the Big Tree, and checked for discrepancies in individual data points. FTDNA and the Big Tree do have some jarring differences in the # of SNPs reported for individual donors, but in the aggregate, the average # I calculated across all clades agreed pretty closely with the #s in McDonald's data--within a fraction of a mutation.

                  So although an ideal methodology would use granular resolution data like McDonald's--provided, of course, that those numbers are reliable--rather than the crude estimates I'd been using, the cruder estimates didn't result in a significant discrepancy in this particular case.

                  I didn't bother to recalculate a weighted average using my data, as I consider it a pretty fundamental violation of data integrity to weight the way McDonald did. As I mentioned, the reductions in sampling risk McDonald assumes disappear when you calculate the implicit Alpha risk figures at each node along the network. Given the significance of some of the individual count discrepancies between FTDNA and Big Tree, all weighting would accomplish is to exaggerate any distortions.

                  Anyhow, I also noticed one important point that I haven't raised yet. The data I included in my recalculation a la McDonald's method did NOT include the anchor SNP of the S781 block itself, which is clearly a methodological error. The goal is the TMRCA between the average birth year of the living donors and their MRCA, who by definition was a carrier of S781.

                  So did my apparently successful recalculation reveal another error in McDonald's methodology? Or is there yet another meaningless coincidence preventing us from correctly assessing McDonald's results?

                  Maybe future publishing protocols should include a protected copy of the actual calculations in Excel instead of just a broad narrative description. To highlight the problems created by failing to do so: again just by coincidence, adding an adjustment for S781 itself to McDonald's weighted figures, and assuming the more widely accepted 31.5-year generation span, results in a near-perfect 49.4% Poisson distribution statistic.

                  You get the impression that we could be nearing a better estimate of the true average mutation rate, but there are so many opaque variables that you're left feeling less rather than more confident.



                  • #10
                    Oh, and one other point. We shouldn't be hanging our hats on apparently successful TMRCA recalculations for individual SNP blocks. My notes from June included a number of other case studies whose accuracy I considered globally by comparing the geometric mean of the confidence levels resulting at the true MRCA birth year to the 50% target. Globally, I came up with 46.7%, although individually S781 came in at 69.4%. Chasing individual successes could actually take us further away from a correct impression of the overall picture.

                    I don't think I saw anything like a global evaluation in the McDonald paper, although benchmarking individual cases vs. STR results--which I also did--is a good step forward.



                    • #11
                      Originally posted by benowicz View Post
                      Oh, and one other point. We shouldn't be hanging our hats on apparently successful TMRCA recalculations for individual SNP blocks. My notes from June included a number of other case studies whose accuracy I considered globally by comparing the geometric mean of the confidence levels resulting at the true MRCA birth year to the 50% target. Globally, I came up with 46.7%, although individually S781 came in at 69.4%. Chasing individual successes could actually take us further away from a correct impression of the overall picture.

                      I don't think I saw anything like a global evaluation in the McDonald paper, although benchmarking individual cases vs. STR results--which I also did--is a good step forward.
                      I've started to update my global tracking sheet for more recent data from the FTDNA block tree. Based on these figures, I think the rate I've been working with is much better than any I've seen proposed elsewhere, including by McDonald in this paper. The statistic I'm monitoring is the geometric mean of the binomial confidence figures (i.e., 52.3% for my selected rate--pretty close to the target of 50%).


                      I've done a few other case studies, but I chose only these 5 clades for presentation here because I think it will be easier to justify the variant counts I used: they're either very close to what McDonald used in his analysis of S781, or I gave a very specific account of my methodology in another forum post (i.e., BY32575).

                      https://forums.familytreedna.com/for...438#post331438

                      https://forums.familytreedna.com/for...575#post331781

                      My main issue with McDonald's method for evaluating rates is that it's not properly statistical. First of all, he forgets that the expected value is partially dependent on sample size. It only approaches the population average at a really large number of trials (i.e., TMRCA).

                      Secondly, he forgets that valid extrapolation can be done only from a set of data that are independent of one another, and he deliberately skews his data by weighting the variant counts reported for each subclade by the number of donors represented--but by definition members of the same subclade are not independent of one another. His rationale appears to be that additional representation makes the variant counts reported for that subclade more reliable, but even so the number of donors at each node in the tree is very small, making any decrease in sampling risk negligible at best. Technically, he could identify the proper amount by which to adjust the subclade weighting by calculating the inferred Alpha risk given the number of donors represented at each node in the tree, but in my experience these improvements will individually be tiny--in the neighborhood of 10% or so, not the 25,380% by which he's adjusted R-FGC74572, for example. Besides which, these adjustments would likely offset one another at the level of the immediate subclades of the MRCA we're calculating.

                      Plus, I don't believe McDonald's calculation properly considers that we need to add one to the aggregate variant counts for each subclade. After all, we are attempting to calculate the probabilities associated with the overall TMRCA, not the individual TMRCAs for the individual subclades.

                      I guess my inclusion of R-ZS3700 in this analysis might be considered a violation of a certain interpretation of the data independence principle, too, since it is a grandson clade of R-BY482. But if you remove it, it only improves the performance of my rate estimate, both relative to the 50% target and the aggregate confidence achieved by McDonald. So I don't suppose it's the worst sin in the world.

                      https://dna-explained.com/2019/01/24...-y-block-tree/

                      I also suppose that I could calculate the inferred Alpha risk globally, to get an idea of what sort of sample size I'd need to get a really robust estimate of the population average mutation rate. Maybe I'll do that in the future. Based on my experience with the variability in the calculation of Alpha for individual clades, I'm thinking it'll be a super high number--like around 1,000, which I don't think is supportable with the volume of reliable data currently available.
                      Last edited by benowicz; 17 December 2021, 08:03 PM.



                      • #12
                        My approach is fundamentally different. McDonald, like most algorithms I've seen to date, seems to view the probability associated with an observed array of variants under a common ancestor as the average of the probabilities associated with the variant counts of individual donors. But we know that the compound probability function is multiplicative, not additive. I say that the sampled unit is the subclade, not the donor. It seems pretty clear to me that the sampled unit can't be the donor--donors are not independent of one another at the subclade level. The figure you want to base your calculation on is the geometric mean of the average variant counts associated with each subclade. Plus one, of course, to account for the common ancestor.

                        My approach has another benefit in that it leverages the hint about the TMRCA provided by the distribution of average variant counts associated with each subclade. All of the immediate subclades are more or less equally distant in time (not in variant counts) from the MRCA, and we know that the binomial distribution formula contains an adjustment for sample size (i.e., TMRCA), which therefore must be the same for all subclades. The size of this adjustment is inversely proportional to the TMRCA, and you get a useful pilot estimate of the adjustment by looking at the inverse of the geometric mean of the observed average variant counts for all the subclades. Doing a simple Goal Seek in Excel then allows you to identify the most likely value of this adjustment, under the constraint that the compound of the associated binomial probabilities equals 50%.
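As a tiny illustration of those pilot figures (with made-up subclade averages, not real data):

```python
# Tiny illustration of the pilot figures described above, using made-up
# average variant counts for the immediate subclades of an MRCA.
import numpy as np

subclade_avg_variants = [9.7, 11.2, 10.4, 12.0]        # hypothetical averages

geo_mean = float(np.exp(np.log(subclade_avg_variants).mean()))
base_figure = geo_mean + 1          # "plus one" for the common ancestor
pilot_adjustment = 1.0 / geo_mean   # inverse of the geometric mean as the pilot adjustment

print(f"geometric mean + 1 = {base_figure:.2f}, pilot adjustment = {pilot_adjustment:.4f}")
```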

                        Traditional averaging approaches miss this hint and treat the search for the population average mutation rate as a completely random trial-and-error exercise, which needless to say is pretty inefficient.

                        The last step in my process for deriving the TMRCA estimate is to find the unique TMRCA value which results in a distribution of effective mutation rates (i.e., rates observed in the sample, not necessarily representative of the entire population) that compounds to 50%. Of course this assumes a normal distribution of effective mutation rates in the population, and the Xue 2009 paper provides this, which I believe was itself derived from an earlier theoretical work. I have tried alternate assumptions about the standard deviation, but at least in the specific pilot studies I conducted, the assumed deviation had to be pretty extreme to make a significant difference to my final result. In any event, as the chart from the last post shows, the binomial and normal distributions tracked pretty closely in the relevant range, so I think it's "on the right track". I don't think this standard deviation has been the subject of much controversy to date.

                        So I'm really aging the specific aggregate distribution of variants under the common ancestor, whereas McDonald's fundamental approach, which seems to be shared by just about everyone else, analyzes the donors in isolation, assuming that simple averaging will reflect the population in the aggregate. But again, we know that the compound probability function is multiplicative, not additive.

                        This link has a chart that illustrates the application of my method to a specific example, R-BY32575.

                        https://forums.familytreedna.com/for...575#post331781



                        • #13
                          Don't know why, but the chart from the post of 17 Dec 2021 isn't showing up whenever I access this thread any more. Here's an updated version, including summaries of the calcs for two more clades, total of 7.

                          Rate comparisons 19 Dec 2021.png

                          The calc for R-S781 uses the direct resolution data from Big Tree (as does McDonald's paper). The resulting figures (geometric means, not mathematical averages) are still pretty close to McDonald's figures. In fact, they're somewhat conservative, in that the geometric mean is always a little bit smaller than the mathematical average, which McDonald uses, meaning that they would favor McDonald's higher assumed mutation rate.
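For reference, the fact that the geometric mean never exceeds the arithmetic average is just the AM-GM inequality:

```latex
\sqrt[n]{x_1 x_2 \cdots x_n} \;\le\; \frac{x_1 + x_2 + \cdots + x_n}{n},
\qquad x_1, \dots, x_n > 0,
```

with equality only when all the values are equal.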

                          R-BY482 and R-ZS3700 are both featured on Roberta Estes' blog from 2019, but the resolution figures here are updated using a special method I use to infer resolution indirectly from the current FTDNA block tree. In fact, I use this indirect method for R-BY32575, R-A7566 and R-FGC10116, too. Obviously, direct observation would be optimal, but there is a very limited set of complete data publicly available on known MRCAs. I had expressed some reservations about this indirect method before, but after performing a pilot comparison with another clade for which I have direct resolution data (but don't know the MRCA), I feel much better about its reliability. Not a settled deal, but something that is definitely worth looking into.



                          I've already addressed the fact that McDonald's rate was designed to riff off the mathematical average rather than the geometric mean of the subclade variant counts. McDonald's algorithm also does not add one to these base figures, which I've discussed my objections to elsewhere. But even if you give him the benefit of the doubt on that point, it doesn't improve the situation by much--the geometric mean of the binomial confidence still only reaches 33% for McDonald's rate, and 62.2% for my own. I'm firmly convinced that you need to add one to the base figures, to reflect the fact that you're calculating the TMRCA for the overall common ancestor and not for each subclade individually, but in case you were curious, there you are.



                          • #14
                            Also calculated the inferred minimum sample size required to reduce the Alpha risk to the standard 5%--862. Pretty out of control if we were to rely solely on directly observed resolutions, and still kind of daunting even if you can accept the indirect method. The sample size here is 20 (i.e., the sampled unit is the immediate subclade, or "branch" as I've been calling it here), and my estimate of the Alpha risk is about 22%. As you can see, the incremental reduction beyond the first few sampled items is pretty low.



                            • #15
                              Originally posted by benowicz View Post
                              Also calculated the inferred minimum sample size required to reduce the Alpha risk to the standard 5%--862. Pretty out of control if we were to rely solely on directly observed resolutions, and still kind of daunting even if you can accept the indirect method. The sample size here is 20 (i.e., the sampled unit is the immediate subclade, or "branch" as I've been calling it here), and my estimate of the Alpha risk is about 22%. As you can see, the incremental reduction beyond the first few sampled items is pretty low.
                              It just occurs to me that the margin of error might draw a clearer picture of the situation. The margin of error is the range within which the true population average mutation rate is likely to lie, at least according to this pilot study. Based on this data--open to challenge itself--I figure it's between one mutation every 49.563 years and one every 70.1064 years, assuming a generation span of 31.5 years and a resolution of 15 million base pairs.

                              To reduce this range to 10 years (i.e., +/- 5 years from the calculated mean), I figure you'd need a sample size of around 100. That sounds really high, but it might be achievable: in my schema the sampled units are the branches under a given common SNP, and I got 20 of those from just 7 father-clades. Those efficiencies probably aren't going to be easy to repeat on a reliable basis, since few families are large enough to give rise to a large number of immediate subclades tested directly beneath their MRCA, but it still seems within reach.
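A quick back-of-envelope check of that "around 100" figure, using only the numbers quoted above and the fact that at a fixed confidence level the half-width of the interval shrinks like 1/sqrt(n):

```python
# Back-of-envelope check of the "around 100" figure above: at a fixed
# confidence level the half-width of the interval scales as 1/sqrt(n),
# so the required n follows from the current half-width and sample size.
# The current figures are taken from the post; the rest is arithmetic.
current_n = 20                               # branches sampled so far
current_half_width = (70.1064 - 49.563) / 2  # years per mutation, ~10.3
target_half_width = 5.0                      # +/- 5 years

required_n = current_n * (current_half_width / target_half_width) ** 2
print(round(required_n))                     # ~84, i.e. on the order of 100
```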

                              I mean, you're still going to have the problem of whether the selected father-clades can be called "truly random" given the social bias inherent in these long pedigrees, but it'd still be easily the best, most systematic study done to date.
                              Last edited by benowicz; 20 December 2021, 05:47 PM.

