How does this paper meet the minimum standard for reperformance?
Originally posted by benowicz: While I was expanding my survey to additional clades, I hit kind of a wall with my method for indirectly inferring resolution--two new clades had starkly contrasting trend profiles. But after a little experimentation, I think I may have come upon a method that is far superior . . .
That new method does a very good job of identifying the directionality of a skewed date estimate (i.e., over- vs. under-statement), but I'm not sure whether there is a very reliable way to correct it. In the case of the FTDNA block tree database, I'm dealing with two distinct test platforms (I think), whose resolutions have a ratio of 1:1.5. But what if there were 3 or more different platforms with different resolutions? I'm not an expert in the history of the block tree or FTDNA's product history, so for all I know, there may be exceptions to my expectation of only two platforms--transfers of 3rd party data? Even if there are only two platforms today, will there still be only two platforms X years from now?
I feel like this new method does a much better job of approximating the true typical resolution for some clades than for others. That would probably wash out over a really big study of many clades, but I don't think there is that much reliable data publicly available. So while I'm "pretty" sure that my estimate of the true population mutation rate of around 60 years per mutation is "reasonable", that's not really a statistically reliable measure.
It would be great if the block tree published the average resolution data for reported clades, the same way they report the average number of private variants. That would be a real service to the scientific community. I don't know if the objection is programming or computational costs or what, but the public benefit would be enormous.
Comment
-
Came up with a different, more direct approach. Just survey the mutation rates implied by the range of variant counts reported for a number of established clades--assuming a 50% binomial confidence level, of course. There may be a muddle of different resolutions reflected in the data, but as of today, anyhow, the fastest possible rate would have to represent the Big Y 700. It's not guaranteed that any of the subclades reported in the FTDNA block tree are composed solely of Big Y 700 kits, but we know that the true population-wide rate for Big Y 700 can't be any slower than the aggregate data we capture. The sample size needs to be much larger, but this represents a much better methodology than anything I've seen to date.
Result: The true population-wide rate could be even lower than my pilot estimates (e.g., ~57 years per mutation).
Estimate of population BY700 rate - chart I.png
Estimate of population BY700 rate - chart II.png
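For anyone who wants to try reperforming the base calculation, here is a minimal sketch of one possible reading of "the rate implied by a variant count at a 50% binomial confidence level". It treats each year as one binomial trial and searches for the per-year mutation probability at which the chance of seeing that many variants or fewer is exactly 50%. The function names, the one-trial-per-year model, and the example inputs (13 variants over 800 years) are all assumptions made for illustration, not values taken from the charts.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def implied_years_per_mutation(variants: int, years: int,
                               confidence: float = 0.5) -> float:
    """Years per mutation at which P(X <= variants) equals `confidence`.

    Assumes one binomial trial per year (an illustrative model choice).
    """
    lo, hi = 1e-6, 1.0 - 1e-6  # search range for per-year mutation probability
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF falls as p rises, so a CDF above target means p is too small.
        if binom_cdf(variants, years, mid) > confidence:
            lo = mid
        else:
            hi = mid
    return 1.0 / ((lo + hi) / 2)


# Hypothetical clade: 13 variants accumulated over 800 years.
example_rate = implied_years_per_mutation(13, 800)
print(f"Implied rate: {example_rate:.1f} years per mutation")
```

Running this over the minimum and maximum variant counts of each surveyed clade would give the range of implied rates described above, under those same assumptions.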
Comment
-
Just another note addressing the adequacy of studies using a simple mathematical rather than a truly statistical approach: The relationship between confidence and changes in mutation rate is not linear, but exponential. That is, a 1% increase in the mutation rate will NOT result in a 1% decrease in the number of expected mutations at the 50% confidence level. Simple arithmetic is not going to solve this.
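A quick way to see the non-linearity is to hold the variant count and time span fixed and tabulate the confidence level across evenly spaced rates. This is only a sketch: the one-trial-per-year binomial model and the inputs (13 variants, 800 years) are assumptions for illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def confidence_at_rate(variants: int, years: int,
                       years_per_mutation: float) -> float:
    """Confidence level (binomial CDF) for a given rate, one trial per year."""
    return binom_cdf(variants, years, 1.0 / years_per_mutation)


YEARS, VARIANTS = 800, 13          # hypothetical clade
rates = [50, 55, 60, 65, 70]       # evenly spaced rates, years per mutation
levels = [confidence_at_rate(VARIANTS, YEARS, r) for r in rates]
# Change in confidence for each equal 5-year step in the rate.
steps = [b - a for a, b in zip(levels, levels[1:])]

for r, c in zip(rates, levels):
    print(f"{r} years/mutation -> confidence {c:.3f}")
print("step sizes:", [f"{s:.3f}" for s in steps])
```

Equal steps in the rate produce unequal steps in confidence, which is why a simple proportional adjustment will not land on the 50% level.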
Comment
-
Two thoughts just occurred to me:
1. I forgot to add one to the variant counts when performing the base calculations--without doing this, we'd be measuring the age of the immediate subclades rather than the clades themselves--a fundamental data mismatch. Oops!
2. Of course, the maximum values for the variant counts simply represent the lower end of a range of estimates for the Big Y 700 mutation rate. So that on its own won't quite do. However, by the same token, the minimum values for the variant counts must represent the upper end of a range of estimates for the Big Y 500 mutation rates--assuming this database is composed exclusively of BY700 and BY500. The relationship between the typical mutation rates for these two platforms is linear, so we should be able to straightforwardly extrapolate to the expected upper bound for the rates under the BY700 platform, and calculate the expected value as the geometric mean.
Estimate of population BY700 rate - chart I +1.png
Estimate of population BY700 rate - chart II +1.png
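The two corrections above can be sketched as follows. This is a rough illustration only: it assumes the 1:1.5 resolution ratio means a Big Y 500 rate in years per mutation is 1.5 times the matching Big Y 700 rate, and every number in the example is hypothetical rather than read off the charts.

```python
from math import sqrt

# Assumed BY500:BY700 resolution ratio of 1:1.5, i.e. BY700 finds
# mutations 1.5 times as often as BY500 (an interpretation, not a given).
RESOLUTION_RATIO = 1.5


def adjusted_count(reported_variants: int) -> int:
    """Point 1: add one so the count dates the clade, not its subclades."""
    return reported_variants + 1


def expected_by700_rate(lower_by700: float, upper_by500: float,
                        ratio: float = RESOLUTION_RATIO) -> float:
    """Point 2: scale the BY500 upper bound to BY700 terms, then take the
    geometric mean of the lower and upper bounds (years per mutation)."""
    upper_by700 = upper_by500 / ratio
    return sqrt(lower_by700 * upper_by700)


# Hypothetical bounds: 50 years/mutation from the maximum counts (BY700),
# 108 years/mutation from the minimum counts (BY500).
example = expected_by700_rate(50.0, 108.0)
print(f"Expected BY700 rate: {example:.1f} years per mutation")
```

The bounds themselves would come from the base calculation, rerun with `adjusted_count` applied to each reported variant count.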
Comment
-
No, this is not quite the solution. Sixty-point-eight (60.8) is just an upper limit on the rate. Binomial distributions are asymmetrical with a long right tail. The true population average mutation rate must be somewhat faster. But how fast? I would need to have an idea of the precise place on the curve represented by the upper or lower bounds.
Last edited by benowicz; 26 December 2021, 12:31 PM.
Comment
-
Maybe spoke too soon about not being able to infer resolution from the block tree. Just tried another aggregation strategy for large datasets w/ known MRCA, averaging the confidence %'s at the 50th percentile. It seems to work pretty well. Here's the calculation for R-BY32575, descended from Riocard Mór Burke d. 1243, ancestor of the Burkes of Ireland. As the bottom chart shows, for the data currently available to me, it does a pretty good job, closely approaching the target aggregate confidence level of 50%.
Resolution inference - R-BY32575.png
Resolution inference summary.png
Last edited by benowicz; 27 December 2021, 03:23 AM.
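A loose sketch of how the aggregation described above might be reperformed: compute a confidence level for each subclade's variant count given the known time since the MRCA, average those levels for each candidate rate, and keep the candidate whose average lands closest to 50%. The one-trial-per-year binomial model, the function names, the variant counts, and the two candidate rates (chosen only to reflect a 1:1.5 ratio) are all assumptions for illustration; the actual spreadsheet logic may differ.

```python
from math import comb
from statistics import mean


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def mean_confidence(variant_counts, years_since_mrca, years_per_mutation):
    """Average per-subclade confidence level at a candidate rate."""
    p = 1.0 / years_per_mutation
    return mean(binom_cdf(k, years_since_mrca, p) for k in variant_counts)


def best_rate(variant_counts, years_since_mrca, candidates, target=0.5):
    """Candidate rate whose average confidence is closest to `target`."""
    return min(
        candidates,
        key=lambda r: abs(
            mean_confidence(variant_counts, years_since_mrca, r) - target
        ),
    )


# Hypothetical subclade variant counts under an MRCA who died 780 years ago.
counts = [11, 12, 13, 14, 15]
candidates = [60, 90]  # hypothetical rates in a 1:1.5 ratio, years/mutation
chosen = best_rate(counts, 780, candidates)
print(f"Best-fitting candidate: {chosen} years per mutation")
```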
Comment
-
Okay, trying to apply this algorithm to other clades revealed a problem that I had anticipated. Not all clades have distributions as tidy, or as nearly evenly split between platforms, as these first four. Some of them have some truly freaky data spikes (see the chart for FGC23343 below). So I updated it to include a customizable field that sets the parameters for including individual data points in the analysis, based on their subclade-specific confidence levels. By selecting all data points where the confidence level was between 25% and 75% (i.e., the middle 50% of the confidence curve), I ended up including roughly 90% of all data points and improved the overall predictive performance--the geometric mean of the individual confidence levels for the first four sampled clades is now even closer to the target of 50%.
Resolution inference - R-FGC23343 v4.png
Resolution inference summary v4.png
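The inclusion filter described above can be sketched in a few lines. The confidence values below are made up for illustration, and the function and parameter names are hypothetical; only the 25% and 75% cut-offs and the use of a geometric mean come from the post.

```python
from statistics import geometric_mean


def filter_by_confidence(levels, low=0.25, high=0.75):
    """Keep data points whose confidence level falls within [low, high]."""
    return [c for c in levels if low <= c <= high]


# Hypothetical subclade-specific confidence levels, including two spikes.
confidences = [0.02, 0.31, 0.44, 0.48, 0.52, 0.55, 0.61, 0.70, 0.97]

kept = filter_by_confidence(confidences)
share_kept = len(kept) / len(confidences)
aggregate = geometric_mean(kept)

print(f"Kept {len(kept)} of {len(confidences)} points ({share_kept:.0%})")
print(f"Geometric mean confidence: {aggregate:.3f}")
```

Making `low` and `high` parameters mirrors the customizable field: widening or narrowing the band shows how sensitive the aggregate is to the outliers.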
Comment
-
Just made my first attempt to test these rates and the algorithm against a different clade outside these first four, another one with a well-attested Medieval founder--Áed in Macáem Tóinlesc O'Neill, d. 1177. I'm very pleased with these results. More info linked below.
https://forums.familytreedna.com/for...877#post331877
Resolution inference summary v5.png
Comment