The science behind Big Y 700 mutation rates

  • The science behind Big Y 700 mutation rates

    I've been looking for a reliable estimate of average mutation rates for Big Y 700. To date I've seen crude estimates--with no support--varying between once in every 81 years to once in every 160 years. This was kind of troubling because I noted an enormous variety of mutations since the MRCA for the FGC23343+ group (i.e., from 13 to 34). I really wanted some semi-technical explanation.

    I think I've finally found a good one, but I'd appreciate some input from knowledgeable users to see how far off my understanding is. Anyhow, here's the situation as I currently understand it: The base mutation rate is about 4.3*10^(-8) per base pair per generation, or 1 per 23,255,814 base pairs tested.

    This estimate can be confidently converted into TMRCA only if you know the # of base pairs that have been reliably tested for a specific donor, and this is where it gets a bit tricky. Standard product specifications can only give us a good estimate of the absolute # of base pairs tested--about 23.6 million for Big Y 700.

    But that's NOT the same thing as the number of base pairs reliably tested because of flukey things outside of the company's control, like the quality of the sample, etc., which could prevent some regions from receiving an adequate number of high quality scans to be considered reliable. The sample for the kit I co-admin is nearly 13 years old, so I wouldn't be surprised if sample quality alone accounted for our less-than-expected number of SNPs--although all in all, we actually fell within the average range for the FGC23343+ people, so I'm not disappointed. But I would like to better understand what the other variables could be.

    Anyhow, assuming 100% fully expected coverage for the tested areas, I think the conversion to an estimate of average years per mutation should look like this:

    23,255,814 / 15,000,000 * 33= 1 mutation every 51.16 years

    The 23 million number is the base odds discussed above, and the 33 years is an estimate of the number of years between generations. The 15 million number is an estimate of the population average number of base pairs receiving reliable scans per product specs. I want to see some kind of support for THAT number. To date all I've seen is a flat statement from the DNAeXplained blog, without any technical specifics.

    I'm not questioning the veracity of the number. It hits the nail right on the head for two sets of donors under FGC23343. I just want to know how accurate this understanding is. I know there is a lot of discussion about whether particular mutations can be considered true SNPs based on whether they are located in or outside of the so-called "comBED regions", but I'm assuming that is not a relevant variable in this particular case because the FTDNA white paper specifies the region tested, and I'm assuming that FTDNA would not test it unless it fully met the "comBED region" criteria. I'm more interested in why there should be such a wide variety of reported SNPs since the MRCA among Big Y 700 donors.

    And as a post-script, I think I may have inadvertently found a good workaround for age estimates within my wider FGC23343 group, using an average of 85 years per mutation and adjusting each branch for the intra-clade average number of mutations to the average for the entire group. I arrived at that number subjectively by taking a survey of recommendations from internet posts, without any technical support or reasoning, but it seems to work pretty well with the calculated overall 51.16 year average for the "most active" subclades. It also jibes pretty well with STR results for one numerous subclade with divergent branches with MRCAs born in the 1600s and the 1400s. For that reason I'm pretty happy with it, although I doubt 85 years could be very useful as a population-wide generalization unless typical coverage is actually significantly lower than the figure implied by that 15 million base pair number. Hard to know without seeing where that number came from.
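
    The arithmetic in this post can be sketched in a few lines of Python. This is only an illustration: the rate, the 15 million covered base pairs, and the 33-year generation are the working assumptions quoted in the post, and the function names are made up for the example. The second function turns the question around and asks how much reliable coverage would be needed for an 85-year average to fall out of the same rate.

```python
# Sketch of the years-per-mutation arithmetic from the post.
# All three inputs are the thread's working assumptions, not verified values.
RATE_PER_BP_PER_GEN = 4.3e-8   # about 1 per 23,255,814 base pairs
COVERED_BP = 15_000_000        # assumed reliably covered base pairs
YEARS_PER_GENERATION = 33      # assumed generation length


def years_per_mutation(rate, covered_bp, years_per_gen):
    """Average years between SNPs on a single lineage."""
    mutations_per_generation = rate * covered_bp
    return years_per_gen / mutations_per_generation


def implied_coverage(target_years, rate, years_per_gen):
    """Covered base pairs that would yield one SNP every target_years."""
    return years_per_gen / (rate * target_years)


baseline = years_per_mutation(RATE_PER_BP_PER_GEN, COVERED_BP, YEARS_PER_GENERATION)
coverage_for_85 = implied_coverage(85, RATE_PER_BP_PER_GEN, YEARS_PER_GENERATION)

print(f"{baseline:.2f} years per mutation at 15,000,000 bp")
print(f"{coverage_for_85:,.0f} bp would give one mutation per 85 years")
```

    Under these assumptions, an 85-year average would need reliable coverage of roughly 9 million base pairs rather than 15 million, which is one way to put a number on the coverage question raised above.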

  • #2
    As a post- post-script, I'll add that I checked to see whether the observed discrepancy in the # of SNPs since the MRCA could be explained by normal deviation in the underlying mutation rate, and I don't think so. That paper gave a 95% two-sided confidence interval between 2.3*10^(-8) and 6.3*10^(-8). To arrive at my subjectively determined (but still pretty good fit within the observed FGC23343+) 85 year average between SNPs, I'd have to be at roughly 2.6*10^(-8), which leaves less than 5% of a normally distributed curve below this point. That's like odds of 21 to 1 against the sole explanation being in the underlying mutation rate for any individual donor.

    I mean, maybe the mutation rate is part of it, but certainly not all of it. Certainly not if the actually observed FGC23343 results average 85 years among 13 donors. The odds of them averaging 85 years based solely on variation in the mutation rate are probably like 1 in 3.228*10^17. Maybe one single donor could have a one-off average of 85, but among all 13 donors, highly unlikely.
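
    The single-donor figure in this post can be checked with a short sketch. It assumes, as the post does, that the rate is normally distributed with a mean of 4.3*10^(-8) and a 95% interval of 2.3 to 6.3*10^(-8), and it reuses the 15 million base pair and 33-year assumptions. The standard deviation is backed out of the interval width, which is an approximation.

```python
import math

# Assumptions taken from the thread, not verified values.
MEAN_RATE = 4.3e-8
CI_LOW, CI_HIGH = 2.3e-8, 6.3e-8
COVERED_BP = 15_000_000
YEARS_PER_GENERATION = 33


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# A two-sided 95% interval spans about 1.96 standard deviations each way.
sd = (CI_HIGH - CI_LOW) / (2 * 1.96)

# Per-generation rate that would produce one SNP every 85 years.
rate_for_85 = YEARS_PER_GENERATION / (85 * COVERED_BP)

tail = normal_cdf(rate_for_85, MEAN_RATE, sd)

print(f"rate implied by 85 years per SNP: {rate_for_85:.2e}")
print(f"share of the curve below that rate: {tail:.3f}")
```

    The tail comes out a little under 5%, in line with the rough 21-to-1 figure for a single donor.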


    • #3
      I'll have to see whether I can formulate a less technical, more intuitive way to phrase this, but I think I've found an explanation for the observed good fit for an 85 year average mutation rate.

      If I calculate the inverse cumulative binomial probability of observing one mutation in 85 years given a base mutation rate of 1 every 51.16 years, I come up with a confidence level of 49.672% -- pretty darn close to the target 50% for the center of a normal distribution. All within the assumptions discussed above.

      That makes sense, right? If that's correct, then 85 years per SNP probably should be recommended as baseline expectation for Big Y 700 users--although there can be a shocking level of variation among individual donors. In any case, I could probably construct a table to estimate the probability of any specific observed scenario.


      • #4
        Originally posted by benowicz
        I'll have to see whether I can formulate a less technical, more intuitive way to phrase this, but I think . . . 85 years per SNP probably should be recommended as baseline expectation for Big Y 700 users . . .
        No, that can't be correct. That would only make sense if I were examining a period that extended only 85 years, but I'm looking at nearly 2,000 years. The binomial distribution is kind of a sampling correction that doesn't apply here. At least not in the way I calculated.

        There must be a mismatch between the region examined in that National Library of Medicine paper and the region tested by FTDNA that it considers reliable. And who knows what the typical # of Mbp is. I feel like I've identified the correct variables, but that I don't have any visibility as to what the true values really should be.


        • #5
          Originally posted by benowicz
          . . . There must be a mismatch between the region examined in that national library of medicine paper and the region tested by FTDNA that it considers reliable. . .
          This must be the key variable, the statistics related to that portion of the Y chromosome tested by FTDNA, as compared to that discussed in that scientific paper.

          I still would very much like to know where that 15,000,000 base pair reliably covered number comes from. But whatever the correct number is, I'd have to assume that FTDNA's quality control ensures that customer kits get pretty close to it before they publish results. So that probably can't move much.

          If FTDNA tested a large enough number, you'd expect the sample mean to approximate the population mean--although the standard deviation should be significantly higher for the sample versus the population at large. This must be the 'X' in my algebra equation.

          I made a naive guess as to what sample standard deviation should be for the region tested by FTDNA, assuming the currently observed results for FGC23343 are representative. I come up with a 95% confidence interval of 1.1 to 7.5*10^-8, versus the scientific paper's 2.3 to 6.3*10^-8. Same mean, just a much larger standard deviation.

          This allows me to infer a distribution curve to assess the probability associated with any scenario given an observed number of SNPs since the common ancestor and an estimate of their year of birth.

          It's just a fun little exercise for right now, but it seems like we ought to be able to gather data from customers in order to come up with a stronger population-wide validity.


          • #6
            Originally posted by benowicz
            . . . I made a naive guess as to what sample standard deviation should be for the region tested by FTDNA, assuming the currently observed results for FGC23343 are representative. I come up with a 95% confidence interval of 1.1 to 7.5*10^-8, versus the scientific paper's 2.3 to 6.3*10^-8. Same mean, just a much larger standard deviation. . . .
            I should probably convert that into a range of years per SNP: Based on my observations, the 95% two-sided confidence interval stretches from about 204.404 years to 29.245 years, versus the paper's 95.652 years to 34.921 years, with a mean in both cases of 51.163 years. Assuming, of course, that this 15,000,000 number is supportable.


            • #7
              I don't consider myself an expert, but from what little I know, it's only STRs that have useful average mutation rates.

              SNPs on the other hand have widely varying mutation rates and are not useful for estimating time to MRCA.
              SNPs are used for separating populations. Once separated, STRs are used for calculation of time to MRCA.


              • #8
                I think that's accurate. But I'm dealing with an SNP subclade where few donors have matches within the timeframe that STRs are reliable. After about 15 or 20 generations, the likelihood of back- or convergent- mutations renders STR comparisons almost useless. So the best I can do is derive a probability curve for SNP mutations with a really large standard deviation.

                It's not a complete guess. One sub-group within this clade does have STR matches with MRCAs born around 1650 and 1400 A.D. It's kind of useful as a benchmark for estimating the SNP mutation rates for the other members. My current best guess is that, for the total group, the average mutation rate is about one per 85 years, with the brackets for a 2-sided 95% confidence interval at about 341 and 49 years. It's far from precise, but I think it's the best available.
                Last edited by benowicz; 7 October 2020, 03:06 PM.


                • #9
                  Within my project, there are 9 men with the same surname who have Y111 STR results available.
                  MRCA calculated based on Y111 results is somewhere between 31-44 generations ago.
                  3 men have terminal SNP results publicly available, but they are all the same.
                  I'm sure we have private unique SNP mutations, but so far, they have not been considered significant by FTDNA.
                  Of course, we could also just be outliers.


                  • #10
                    I've done a bit more research concerning the dating of SNPs and it is as I thought that there is a huge amount of uncertainty. In general, the amount of uncertainty increases dramatically with newer SNPs. It's not until a SNP is about 800 years old, that we can assume the uncertainty is less than 100% at a 95% confidence level. Here is a link to a science paper that discusses it in more detail:

                    One must use STR statistics or paper genealogy records when dating to less than 800 years.
                    Last edited by AndrewRoss; 9 October 2020, 12:37 PM.


                    • #11
                      This is a terrific paper. It puts so much of my other research into context. I'll be working with this for a while. Thanks.

                      But STRs are only useful for dating if the TMRCA is less than 600 years or so, due to convergent mutations. I'm well past that benchmark for the group I'm working with. STRs are just not going to work here.

                      I definitely have paper records that reach "towards" my timeframe--about 1,000 years ago. There is also a very idiosyncratic geographic distribution, which points to a specific hypothesis. But sealing the deal with 100% solid paper confirmation probably won't happen. There's an outside chance that one or two pairs of donors might, but the vast majority simply won't.

                      This paper is great because it's more recent, comprehensive and specific than anything else I've seen. Yes, there is considerable uncertainty surrounding the calculation of MRCAs based on SNPs, but at least it allows me to quantify that uncertainty.


                      • #12
                        Here's what I ended up doing:

                        I framed SNP dating as a kind of least-squares problem, using Excel's goal seek function to find a TMRCA that resolved the array of observed private variants into the equilibrium position (i.e., 50% joint cumulative confidence) with regard to one of the per Mb mutation rate distributions described in that paper. It's really not as complicated as that may make it sound. It requires the use of only a couple bog-standard Excel functions and very little data entry. Once the donor data is input, the only independent variable is the TMRCA. You do have to perform a few manual iterations of the goal seek before the range of possibilities is digestible for Excel, but it's really not a big deal.
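
                        For anyone who would rather not use Excel, here is a rough Python sketch of one possible reading of this approach. It is not the spreadsheet itself: it swaps goal seek for a plain bisection search, treats the rate as normally distributed with the 2.3 to 6.3*10^-8 interval quoted earlier in the thread, and looks for the TMRCA at which the donors' average cumulative confidence is 50%. The donor SNP counts are invented for illustration, and the 15,000,000 bp and 33-year figures are the thread's assumptions.

```python
import math

# Assumptions from the thread, not verified values.
MEAN_RATE = 4.3e-8
SD_RATE = (6.3e-8 - 2.3e-8) / (2 * 1.96)
COVERED_BP = 15_000_000
YEARS_PER_GENERATION = 33


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mean_confidence(tmrca_years, snp_counts):
    """Average cumulative confidence of the rates implied by each donor."""
    generations = tmrca_years / YEARS_PER_GENERATION
    total = 0.0
    for count in snp_counts:
        implied_rate = count / (generations * COVERED_BP)
        total += normal_cdf(implied_rate, MEAN_RATE, SD_RATE)
    return total / len(snp_counts)


def solve_tmrca(snp_counts, target=0.5, low=100.0, high=10_000.0, tol=0.01):
    """Bisection stand-in for goal seek.

    A longer TMRCA implies slower rates, so mean_confidence falls as the
    candidate TMRCA rises; that makes bisection safe on this bracket.
    """
    while high - low > tol:
        mid = (low + high) / 2
        if mean_confidence(mid, snp_counts) > target:
            low = mid   # implied rates still too fast: move further back
        else:
            high = mid
    return (low + high) / 2


# Hypothetical private-variant counts for five donors (illustration only).
hypothetical_counts = [13, 18, 22, 27, 34]
tmrca = solve_tmrca(hypothetical_counts)
print(f"TMRCA at 50% average confidence: about {tmrca:.0f} years")
```

                        The bisection bracket of 100 to 10,000 years is arbitrary and would need widening for much older clades.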


                        • #13
                          After spending some time kicking the tires on my model, I tweaked it a bit--especially w/ regard to the assumed average generation span and calculation of standard deviation for descendant clades. But the most influential variable is still the mutation rates selected. I've been comparing two distributions:

                          1. Implied by the study of modern pedigrees per the 2015 Adamov paper (i.e., 2.56*10^-8; 95% CI: 2.03 to 3.16*10^-8), and

                          2. Implied by the 2009 Xue paper (i.e., 4.3*10^-8; 95% CI: 2.3 to 6.3*10^-8).

                          The Adamov paper looked at 41 samples, the Xue paper looked at only 2. However, the Xue paper only looked to test one observation against the rate distribution implied by another study that focused on a deep evolutionary hypothesis, rather than attempting to derive an estimate of the distribution in the modern human population. The underlying evolutionary study was more comprehensive, but also more theoretical. Probably it is the case that Adamov's sample is just too small.

                          I think this idea is borne out by my anecdotal observations. I looked for some other published exercises between people with known MRCA, but I could only find one who gave enough detail for me to test against my model.


                          So, Roberta Estes performed two comparisons, and derived average mutation rates of 54.25 years and 65.6 years for the Big Y 500 product. Both of these observations would be outside the 95% two-sided confidence interval under the Adamov distribution, but under what I'll call the Xue distribution, they come within 1% and 18% respectively. For my own part, the currently observed FGC23343+ folks returned a 10% lower standard deviation from the Xue distribution as compared to the Adamov distribution per my model. Also, the arithmetic average of the cumulative confidence figures was 59% under Xue as compared to 68% under Adamov. The Xue numbers look much more plausible.
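
                          The interval comparison can be roughed out in Python. This is a simplified check, not the model itself: it converts each paper's 95% rate interval into years per SNP using the 15,000,000 bp and 33-year figures from earlier in the thread, and those figures are assumptions that may not carry over to Big Y 500 coverage.

```python
# Assumptions carried over from earlier in the thread; Big Y 500 coverage
# may well differ from the 15,000,000 bp figure used here.
COVERED_BP = 15_000_000
YEARS_PER_GENERATION = 33


def years_per_snp(rate):
    """Average years between SNPs for a per-bp, per-generation rate."""
    return YEARS_PER_GENERATION / (rate * COVERED_BP)


def year_interval(rate_low, rate_high):
    """Convert a rate interval to years; a faster rate means fewer years."""
    return years_per_snp(rate_high), years_per_snp(rate_low)


intervals = {
    "Adamov": year_interval(2.03e-8, 3.16e-8),
    "Xue": year_interval(2.3e-8, 6.3e-8),
}
observed_years = [54.25, 65.6]  # the two Big Y 500 comparisons

results = {}
for name, (shortest, longest) in intervals.items():
    print(f"{name}: {shortest:.1f} to {longest:.1f} years per SNP")
    for years in observed_years:
        results[(name, years)] = shortest <= years <= longest
        status = "inside" if results[(name, years)] else "outside"
        print(f"  {years} years is {status} the interval")
```

                          With these inputs both observations land outside the Adamov interval and inside the Xue interval, which agrees with the comparison described above.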

                          If anything, the true distribution of rates in the population is probably a bit faster. You can't really tell for sure given the really small number of observations I'm looking at. But I think the binomial distribution analysis that my model is based on is conceptually appropriate, and within those parameters, the Xue rates make much more sense.

                          Which is not to discount the usefulness of the Adamov paper. That's the only paper I've seen to date that addresses the significance of varying test resolutions, which I didn't fully understand until I read it.