How does this paper meet the minimum standard for reperformance?


  • #16
    Originally posted by benowicz
    . . . R-BY482 and R-ZS3700 are both featured on Roberta Estes' blog from 2019, but the resolution figures here are updated for a special method I use to indirectly infer from the current FTDNA block tree. In fact, I use this indirect method for R-BY32575, R-A7566 and R-FGC10116, too. Obviously, direct observation would be optimal, but there is a very limited set of complete data publicly available on known MRCAs. I had expressed some reservation about this indirect method before, but after performing a pilot comparison with another clade for whom I have direct resolution data (but don't know the MRCA), I feel much better about its reliability. Not a settled deal, but something that is definitely worth looking into. . . .
    While I was expanding my survey to additional clades, I hit kind of a wall with my method for indirectly inferring resolution--two new clades had starkly contrasting trend profiles. But after a little experimentation, I think I may have come upon a method that is far superior, with a much firmer grounding in traditional statistical analysis. It involves a data transformation using the inverse of a particular version of Pearson's coefficient of skewness. It's extremely simple and produced strikingly consistent results with my expanded survey, so I'm very excited. However, it does raise some important questions about the order of operations in aggregating data among the various reported subclades, and I think it's going to take me a considerable amount of time to iron out to my satisfaction. At the moment, it suggests that the true population average mutation rate could be slightly slower than once every 60 years, but I'm not sure where it will land after these aggregation issues are resolved, and I don't know when that will be.
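
For readers unfamiliar with the statistic, here is a minimal Python sketch of one common version of Pearson's coefficient of skewness (the second, or median, form). The post does not say which version was used or exactly what "inverse" means, so the choice of formula, the reading of "inverse" as a reciprocal, and the sample variant counts are all assumptions made for illustration.

```python
import statistics


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return 3 * (mean - median) / stdev


# Hypothetical private-variant counts for the kits in one clade (not real data).
variant_counts = [3, 4, 4, 5, 5, 6, 12]

skew = pearson_median_skewness(variant_counts)
# Assumption: "inverse" is read here as the reciprocal of the coefficient.
inverse_skew = 1 / skew if skew != 0 else float("inf")

print(f"skewness = {skew:.3f}, reciprocal = {inverse_skew:.3f}")
```

The sign of the coefficient shows which way the data lean: positive when a long right tail pulls the mean above the median, negative in the opposite case.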

    • #17
      Originally posted by benowicz
      While I was expanding my survey to additional clades, I hit kind of a wall with my method for indirectly inferring resolution--two new clades had starkly contrasting trend profiles. But after a little experimentation, I think I may have come upon a method that is far superior . . .
      Think I may have to give up on this.

      That new method does a very good job at identifying the directionality of skewed data (i.e., over- vs. under-statement), but I'm not sure whether there is a very reliable way to correct it. In the case of the FTDNA block tree database, I'm dealing with two distinct test platforms (I think), whose resolutions have a ratio of 1:1.5. But what if there were 3 or more different platforms with different resolutions? I'm not an expert in the history of the block tree or FTDNA's product history, so for all I know, there may be exceptions to my expectation of only two platforms--transfers of 3rd party data? Even if there are only two platforms today, will there still be only two platforms X years from now?

      I feel like this new method does a much better job of approximating the true typical resolution for some clades than others. That would probably wash out over a really big study of many clades, but I don't think there is that much reliable data publicly available. So while I'm "pretty" sure that my estimate of the true population mutation rate of around 60 years is "reasonable", that's not really a statistically reliable measure.

      It would be great if the block tree published the average resolution data for reported clades, the same way they report the average number of private variants. That would be a real service to the scientific community. I don't know if the objection is programming or computational costs or what, but the public benefit would be enormous.

      • #18
        Came up with a different, more direct approach. Just survey the mutation rates implied by the range of variant counts reported for a number of established clades--assuming a 50% binomial confidence level, of course. There may be a muddle of different resolutions reflected in the data, but as of today, anyhow, the fastest possible rate would have to represent the Big Y 700. It's not guaranteed that any of the subclades reported in the FTDNA block tree are composed solely of Big Y 700 kits, but we know that the true population-wide rate for Big Y 700 can't be any slower than the aggregate data we capture. Sample size needs to be much larger, but this represents a much better methodology than anything I've seen to date.

        Result: The true population-wide rate could be even lower than my pilot estimates (e.g., ~57 years per mutation).
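
One way to read "the mutation rate implied by a variant count at a 50% binomial confidence level" can be sketched in Python as follows. This is an interpretation, not the calculation behind the charts: it assumes each year is an independent trial with a fixed mutation probability p, and it searches for the p at which the chance of seeing k or fewer mutations over the clade's age is exactly 50%. The clade age and the variant counts are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def years_per_mutation(k, n_years, confidence=0.5):
    """Bisect for the per-year probability p where P(X <= k) equals `confidence`."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF falls as p rises, so raise the lower bound while it is too high.
        if binom_cdf(k, n_years, mid) > confidence:
            lo = mid
        else:
            hi = mid
    return 1 / ((lo + hi) / 2)


# Hypothetical clade: founder died 780 years ago, subclades report 9 to 15 variants.
CLADE_AGE_YEARS = 780
fastest = years_per_mutation(15, CLADE_AGE_YEARS)  # most variants, fastest rate
slowest = years_per_mutation(9, CLADE_AGE_YEARS)   # fewest variants, slowest rate

print(f"implied range: {fastest:.1f} to {slowest:.1f} years per mutation")
```

Under this reading, the highest variant count in a clade gives the fastest implied rate, which is the figure the post associates with the Big Y 700.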

        Estimate of population BY700 rate - chart I.png

        Estimate of population BY700 rate - chart II.png

        • #19
          Just another note addressing the adequacy of studies using a simple mathematical rather than a truly statistical approach: The relationship between confidence and changes in mutation rate is not linear, but exponential. That is, a 1% increase in the mutation rate will NOT result in a 1% decrease in the number of expected mutations at the 50% confidence level. Simple arithmetic is not going to solve this.
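
The non-linearity is easy to see numerically. The Python sketch below assumes a simple per-year binomial model and hypothetical figures (a 780-year-old clade, 12 observed variants, a base rate of one mutation per 60 years); it is an illustration of the general point, not a reconstruction of the poster's spreadsheet.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


N_YEARS = 780     # hypothetical clade age in years
K_VARIANTS = 12   # hypothetical observed variant count
BASE_P = 1 / 60   # assumed base rate: one mutation per 60 years

base = binom_cdf(K_VARIANTS, N_YEARS, BASE_P)


def relative_drop(scale):
    """Fractional fall in P(X <= k) when the rate is multiplied by `scale`."""
    return (base - binom_cdf(K_VARIANTS, N_YEARS, BASE_P * scale)) / base


for scale in (1.01, 1.10, 1.20):
    print(f"rate x{scale:.2f}: confidence falls by {relative_drop(scale):.1%}")
```

In this model a 1% change in the rate moves the confidence level by more than 1%, and a 20% change does not move it by twenty times the 1% effect, so proportional rules of thumb break down.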

          • #20
            Two thoughts just occurred to me:

            1. I forgot to add one to the variant counts when performing the base calculations--without doing this, we'd be measuring the age of the immediate subclades rather than the clades themselves--a fundamental data mismatch. Oops!

            2. Of course, the maximum values for the variant counts simply represent the lower end of a range of estimates for the Big Y 700 mutation rate. So that on its own won't quite do. However, by the same token, the minimum values for the variant counts must represent the upper end of a range of estimates for the Big Y 500 mutation rates--assuming this database is composed exclusively of BY700 and BY500. The relationship between the typical mutation rates for these two platforms is linear, so we should be able to straightforwardly extrapolate to the expected upper bound for the rates under the BY700 platform, and calculate the expected value as the geometric mean.
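
The two points above can be sketched in Python as follows. The bound values are hypothetical, expressed in years per mutation, and the 1.5 scaling factor is taken from the 1:1.5 resolution ratio mentioned earlier in the thread. That the Big Y 700 is the platform finding 1.5 times as many variants, so that a Big Y 500 figure is divided by 1.5 to rescale it, is an assumption of this sketch.

```python
from math import sqrt

# Assumed resolution ratio between the two platforms (Big Y 500 : Big Y 700 = 1 : 1.5).
BY500_TO_BY700 = 1.5


def clade_variant_count(reported_count):
    """Point 1: add one so the count dates the clade itself, not its subclades."""
    return reported_count + 1


def expected_by700_rate(by700_lower, by500_upper, ratio=BY500_TO_BY700):
    """Point 2: rescale the Big Y 500 upper bound, then take the geometric mean.

    Both bounds are in years per mutation.
    """
    by700_upper = by500_upper / ratio  # linear rescaling between platforms
    return sqrt(by700_lower * by700_upper)


# Hypothetical bounds, not taken from the charts.
lower_bound_by700 = 50.0   # from the maximum variant counts
upper_bound_by500 = 120.0  # from the minimum variant counts

estimate = expected_by700_rate(lower_bound_by700, upper_bound_by500)
print(f"adjusted count for 12 reported variants: {clade_variant_count(12)}")
print(f"expected Big Y 700 rate: {estimate:.1f} years per mutation")
```

The geometric mean is used rather than the arithmetic mean because the two bounds are ratios (years per mutation), so the midpoint is taken on a multiplicative scale.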

            Estimate of population BY700 rate - chart I +1.png

            Estimate of population BY700 rate - chart II +1.png

            • #21
              No, this is not quite the solution. Sixty-point-eight (60.8) is just an upper limit on the rate. Binomial distributions are asymmetrical with a long right tail. The true population average mutation rate must be somewhat faster. But how fast? I would need to have an idea of the precise place on the curve represented by the upper or lower bounds.
              Last edited by benowicz; 26 December 2021, 12:31 PM.

              • #22
                Originally posted by benowicz
                No, this is not quite the solution. Sixty-point-eight (60.8) is just an upper limit on the rate. Binomial distributions are asymmetrical with a long right tail. The true population average mutation rate must be somewhat faster. But how fast? I would need to have an idea of the precise place on the curve represented by the upper or lower bounds.
                Ha ha! No, I'm wrong again. The relationship between the resolution levels is linear; it's the changes in the number of expected mutations at the 50% confidence level that are exponential. I'm obviously just trying too hard to find a way to infer resolution, but that's just a corner you can't cut. The provisional estimate of a population-wide mutation rate of 60.8 most likely is correct, pending a study with a larger sample size. That's something.

                • #23
                  Maybe spoke too soon about not being able to infer resolution from the block tree. Just tried another aggregation strategy for large datasets w/ known MRCA, averaging the confidence %'s at the 50th percentile. It seems to work pretty well. Here's the calculation for R-BY32575, descended from Riocard Mór Burke d. 1243, ancestor of the Burkes of Ireland. As the bottom chart shows, for the data currently available to me, it does a pretty good job, closely approaching the target aggregate confidence level of 50%.

                  Resolution inference - R-BY32575.png

                  Resolution inference summary.png
                  Last edited by benowicz; 27 December 2021, 03:23 AM.

                  • #24
                    Okay, trying to apply this algorithm to other clades revealed a problem that I had anticipated. Not all clades have distributions as tidy, or as nearly evenly split between platforms, as these first four. Some of them have some truly freaky data spikes (see the chart for FGC23343 below). So I updated it to include a customizable field to set parameters for inclusion of individual data points into the analysis based on their subclade-specific confidence levels. By selecting all data points where the confidence level was between 25% and 75% (i.e., the middle 50% of the confidence curve) I ended up including roughly 90% of all data points and improved the overall predictive performance--the geometric mean of the individual confidence levels for the first four sampled clades is now even closer to the target of 50%.
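
The inclusion filter described above can be sketched in a few lines of Python. The confidence values are hypothetical, and the function name and its `low`/`high` parameters are illustrative stand-ins for the "customizable field" in the spreadsheet, which is not shown in the thread.

```python
from statistics import geometric_mean


def aggregate_confidence(confidences, low=0.25, high=0.75):
    """Keep points inside [low, high]; return their geometric mean and the share kept."""
    kept = [c for c in confidences if low <= c <= high]
    if not kept:
        raise ValueError("no data points fall inside the inclusion window")
    return geometric_mean(kept), len(kept) / len(confidences)


# Hypothetical subclade-specific confidence levels, including one extreme spike.
subclade_confidences = [0.31, 0.44, 0.47, 0.50, 0.52, 0.55, 0.58, 0.63, 0.70, 0.97]

aggregate, share_kept = aggregate_confidence(subclade_confidences)
print(f"kept {share_kept:.0%} of points, aggregate confidence {aggregate:.1%}")
```

Widening or narrowing the window trades sample size against sensitivity to outliers: a narrow window discards the spikes but leaves fewer points to average.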

                    Resolution inference - R-FGC23343 v4.png

                    Resolution inference summary v4.png

                    • #25
                      Just made my first attempt to test these rates and the algorithm against a different clade outside these first four, another one with a well-attested Medieval founder--Áed in Macáem Tóinlesc O'Neill, d. 1177. I'm very pleased with these results. More info linked below.

                      https://forums.familytreedna.com/for...877#post331877

                      Resolution inference summary v5.png
