More nuanced understanding of YSTR mutation rates...


  • More nuanced understanding of YSTR mutation rates...

    Is anyone out there aware of folks working on refining TMRCA models specifically for relationships 200 yrs or more back?

    I believe that, in the interest of simplification, popularly published TMRCA models:
    1. Use an average mutation rate for the general population (i.e., as opposed to a median, or a rate tailored to a specific population such as a haplogroup or subclade)
    2. Take a linear rather than exponential approach.

    It seems unrealistic to expect that mutation rates do not vary from lineage to lineage, if not further from individual to individual, generation to generation, or even region to region. And clearly, the general population growth rate is exponential rather than linear.

    But I suppose that conventional approach makes sense, from the point of view that, in order to preserve the integrity of the formula, one should minimize the number of subjective assumptions. And perhaps this will result, on balance, in the most accurate assessment for the largest number of individuals.

    However, it seems to concede that it will result in 'improper exclusion' for a large number of individuals performing TMRCA analyses for putative relationships 200 years or more out. Maybe improperly excluding as many as 50 pct of these folks.

    I find that annoying, not only because I suspect that I am one of that out-of-scope 50 pct, but also because it leaves glaring, gaping holes in the model that I find conceptually unacceptable.

    For example, the Irish demographic history suggested to me by the conventional TMRCA model seems absurd. The land, given historical level of technology, could never have supported the population size implied by the diversity of Irish R1b1b2 under this model.

    Or so I intuited. I don't hold a PhD in genetics or agricultural history or demography. I kinda resigned myself to frustration on this point.

    But recently a friend mentioned to me that an SNP project he participated in showed a TMRCA of 15,000 yrs with another gent--this despite a generally accepted age of 3,000 for their shared subclade. So the conventional model returned a TMRCA FIVE times greater than that allowed by the existence of their subclade?

    My point is that maybe it is possible to better quantify the precision of TMRCA models. Even a naive person like me can see that the example above makes no sense under a literal interpretation of the model.

    So, anyone aware of folks working on TMRCA models that can do better? Maybe by cross-referencing other disciplines? I suppose genealogies of chief families will be as relevant as agriculture and demography, though maybe less persuasive to some folks.

    Jack

  • #2
    Originally posted by Clochaire
    Is anyone out there aware of folks working on refining TMRCA models specifically for relationships 200 yrs or more back? ...

    Jack
    Great ideas here, and I sure hope the scientists and researchers are listening! We ought to be getting more and more refined estimates as the data comes in.

    Jayne



    • #3
      Jack, check out the forum discussions at the E-M35 project - there's quite a bit of info by some very knowledgeable people.



      I usually use the McGee utility for estimating TMRCA; Bennett G. tested some haplotypes for me with a beta version of a utility FTDNA was working on this past summer, and his calcs came very close to McGee's... Vinnie



      • #4
        Originally posted by Clochaire
        But recently a friend mentioned to me that an SNP project he participated in showed a TMRCA of 15,000 yrs with another gent--this despite a generally accepted age of 3,000 for their shared subclade. So the conventional model returned a TMRCA FIVE times greater than that allowed by the existence of their subclade?


        Jack
        I am aware of what you mention above because the person who related this to you brought this up to me as a problem. He is referring to the R1b-U106 Project, of which I'm one of the co-administrators. The administrator is David Weston.

        As I explained to your friend, David uses the most conservative approach when computing TMRCA between two members. This means that he uses the 95% confidence interval (CI). Most people use the 50% CI, since, by definition, this is the most likely TMRCA as it's the midpoint of observed mutation rates. Only a few cases will be found where the TMRCA will be at the 95% CI, since this assumes the slowest mutation rate observed in any cases of the various father-son studies.

        So, it is the case that the age of R1b-U106 is about 3,500-4,000 years (not 3,000 as your friend stated) or perhaps a bit more. It is surely not the case that two members of that sub-clade have a TMRCA of 15,000 years - this is merely an artifact of the conservative method David uses in presenting TMRCA to members. In fact, the FTDNA TMRCA calculator provides different levels of CI which tell you the number of generations back to TMRCA for each CI. The McGee utility does the same thing, many think more accurately, as Vinnie posted in this thread. In essence, the main point of your posting has been answered - if you want to play around with different mutation rates in a TMRCA calculator, that capability already exists both at FTDNA and with the McGee utility.
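To make the confidence-level point concrete, here is a minimal sketch in Python of how one genetic distance can give very different TMRCA figures at the 50% and 95% levels. It is not the FTDNA or McGee algorithm. It assumes a simple infinite-alleles model (each mismatching marker counts once), a flat prior on the number of generations, and a single hypothetical mutation rate of 0.002 per marker per generation. The function names are made up for illustration.

```python
import math

def tmrca_posterior(n_markers, n_mismatch, mu, max_gen=2000):
    """Posterior probability of each generation count 0..max_gen.

    Assumptions (illustrative only): infinite-alleles model, flat prior,
    one mutation rate `mu` per marker per generation for every marker.
    """
    weights = []
    for t in range(max_gen + 1):
        # Chance a marker is unchanged on both lines after t generations.
        p_match = math.exp(-2.0 * mu * t)
        weights.append(p_match ** (n_markers - n_mismatch)
                       * (1.0 - p_match) ** n_mismatch)
    total = sum(weights)
    return [w / total for w in weights]

def tmrca_quantile(posterior, level):
    """Smallest generation count whose cumulative probability reaches `level`."""
    cum = 0.0
    for t, p in enumerate(posterior):
        cum += p
        if cum >= level:
            return t
    return len(posterior) - 1

# Hypothetical pair: 6 mismatching markers out of 66, assumed mu = 0.002.
post = tmrca_posterior(66, 6, 0.002)
gen_50 = tmrca_quantile(post, 0.50)
gen_95 = tmrca_quantile(post, 0.95)
print("generations at 50%:", gen_50)
print("generations at 95%:", gen_95)
```

Multiply the generation counts by whatever years-per-generation figure you prefer to get years. For the same pair of haplotypes the 95% figure always lands further back than the 50% figure, which is the effect described above.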



        • #5
          Originally posted by MMaddi
          I am aware of what you mention above because the person who related this to you brought this up to me as a problem. He is referring to the R1b-U106 Project, of which I'm one of the co-administrators. The administrator is David Weston.

          As I explained to your friend, David uses the most conservative approach when computing TMRCA between two members. This means that he uses the 95% confidence interval (CI). Most people use the 50% CI, since, by definition, this is the most likely TMRCA as it's the midpoint of observed mutation rates. Only a few cases will be found where the TMRCA will be at the 95% CI, since this assumes the slowest mutation rate observed in any cases of the various father-son studies.

          So, it is the case that the age of R1b-U106 is about 3,500-4,000 years (not 3,000 as your friend stated) or perhaps a bit more. It is surely not the case that two members of that sub-clade have a TMRCA of 15,000 years - this is merely an artifact of the conservative method David uses in presenting TMRCA to members. In fact, the FTDNA TMRCA calculator provides different levels of CI which tell you the number of generations back to TMRCA for each CI. The McGee utility does the same thing, many think more accurately, as Vinnie posted in this thread. In essence, the main point of your posting has been answered - if you want to play around with different mutation rates in a TMRCA calculator, that capability already exists both at FTDNA and with the McGee utility.
          Mike.

          I think we are, and always have been, in agreement on the fundamental facts of the situation, but perhaps not on interpreting their implications.

          The key facts as I see them are:

          1. The project has been correctly applying conventionally accepted TMRCA calculators.

          2. Evaluations of this type have 2 complementary risks that are unavoidable:

          a.) Type 1 Error: Risk of improperly accepting a factually incorrect proposition
          b.) Type 2 Error: Risk of improperly rejecting a factually correct proposition

          It may not be possible to reduce one type of risk without increasing the other risk.

          3. The 15,000-year TMRCA estimate was the result of a calculation using a 95% confidence interval with regard to controlling Type 1 Error. This practice would generally be regarded as appropriately conservative, I imagine, because one would generally be more interested in controlling Type 1 Errors than Type 2 Errors.

          4. Conventional SNP history dates origin of this particular subclade to somewhere around 3,500 to 4,500 years ago.


          Where I feel we may differ is with the implications for evaluating reliability of TMRCA calculators. I believe that the 15,000 year vs. 3,000 year discrepancy in TMRCA is an excellent example of the limitations of applying linear TMRCA models to an exponential phenomenon like population growth. If SNPs truly define a population and TMRCA models are reliable for remote periods, a TMRCA calc could not assign a relationship to any period significantly older than the origin of the subclade.
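One way to express that constraint is sketched below in Python, under loud assumptions: treat the subclade's age as a hard upper limit on the TMRCA by cutting a flat prior off at that point. This is an illustration only, not how any existing calculator is known to work. The model (infinite alleles, one mutation rate for all markers), the rate of 0.002, the 12-marker pair with 4 mismatches, and the cap of 160 generations (roughly 4,000 years at an assumed 25 years per generation) are all hypothetical.

```python
import math

def tmrca_quantile(n_markers, n_mismatch, mu, level, max_gen):
    """Generation count by which the cumulative posterior reaches `level`.

    Flat prior on 0..max_gen generations, so `max_gen` acts as a hard cap.
    Infinite-alleles likelihood with a single assumed rate `mu`.
    """
    weights = []
    for t in range(max_gen + 1):
        p_match = math.exp(-2.0 * mu * t)  # marker unchanged on both lines
        weights.append(p_match ** (n_markers - n_mismatch)
                       * (1.0 - p_match) ** n_mismatch)
    total = sum(weights)
    cum = 0.0
    for t, w in enumerate(weights):
        cum += w / total
        if cum >= level:
            return t
    return max_gen

MU = 0.002          # assumed rate per marker per generation
SUBCLADE_CAP = 160  # assumed subclade age in generations

# Same hypothetical pair, with and without the subclade cap on the prior.
uncapped = tmrca_quantile(12, 4, MU, 0.95, max_gen=5000)
capped = tmrca_quantile(12, 4, MU, 0.95, max_gen=SUBCLADE_CAP)
print("95% point, no cap:", uncapped, "generations")
print("95% point, capped at subclade age:", capped, "generations")
```

With the cap in place the 95% figure cannot fall earlier than the assumed origin of the subclade, whereas the uncapped calculation is free to run well past it when the haplotypes carry little information.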

          I'm not saying that I have the answer here--far from it. I quite understand that every tool has its limitations. Wouldn't it be weird to think, at our first go round, that we could directly observe only the last 100 or 200 years of genetic history, understand population growth to be a complicated, but ultimately exponential function, and yet somehow expect a high level of confidence about our understanding of events 1,000+ years ago?

          I guess what I'm asking for is kind of a 'unified theory of Y STR and SNP mutation and population growth'. A wayyyyy tall order, to be sure. But I think some of the folks at the link provided by Vinnie are thinking along the same lines as me -- working to share knowledge and incrementally at least, make our DNA clocks more accurate.

          Jack



          • #6
            Originally posted by Clochaire
            Where I feel we may differ is with the implications for evaluating reliability of TMRCA calculators. I believe that the 15,000 year vs. 3,000 year discrepancy in TMRCA is an excellent example of the limitations of applying linear TMRCA models to an exponential phenomenon like population growth. If SNPs truly define a population and TMRCA models are reliable for remote periods, a TMRCA calc could not assign a relationship to any period significantly older than the origin of the subclade.

            I'm not saying that I have the answer here--far from it. I quite understand that every tool has its limitations. Wouldn't it be weird to think, at our first go round, that we could directly observe only the last 100 or 200 years of genetic history, understand population growth to be a complicated, but ultimately exponential function, and yet somehow expect a high level of confidence about our understanding of events 1,000+ years ago?

            I guess what I'm asking for is kind of a 'unified theory of Y STR and SNP mutation and population growth'. A wayyyyy tall order, to be sure. But I think some of the folks at the link provided by Vinnie are thinking along the same lines as me -- working to share knowledge and incrementally at least, make our DNA clocks more accurate.

            Jack
            Reflecting on what you write above, I don't think we disagree much on the problem you're bringing up.

            We are dealing with two different measuring tools here - Y-STRs and SNPs. They both occur on the same Y chromosome, but aren't exactly the same.

            Here's one problem that's troubling, of which I'm aware of a few cases. If two men are in different subclades, how can they have a genetic distance so low that a TMRCA calculator will say they share a common ancestor within the last few hundred years? These are extreme examples, but I have seen a few. It's complicated by the possibility of an incorrect SNP test result. So where there is this sort of conflict, it's important that both people have a SNP test and the seemingly conflicting results are double-checked.

            Then there is the question you bring up which relates to the accuracy of mutation rates and how they're applied and what they tell us. I think that in the vast majority of cases, TMRCA calculators when considering two haplotypes will give a good ballpark estimate of when a common ancestor may have lived. Once you have that estimate, then the job is to find him in the paper trail documentation available. There have been so many success stories in genetic genealogy that I think for most people who resort to DNA testing they can expect to get an answer that's helpful to them.

            There are kinds of questions that need to be researched further, such as differing mutation rates for certain markers in different haplogroups. There are also fundamental questions involved in how and why mutations happen - are they truly random or are there external factors of whatever sort that affect mutation rates in certain circumstances.

            I think that more research and experience will help us understand these factors. In the meantime, I wouldn't worry too much about the usefulness or accuracy of TMRCA calculations, except in those isolated cases where there seems to be a serious problem.



            • #7
              Originally posted by Clochaire
              If SNPs truly define a population and TMRCA models are reliable for remote periods, a TMRCA calc could not assign a relationship to any period significantly older than the origin of the subclade.
              Just thought of a simplified but clear example of my thought process.

              Imagine that you and I sit across from one another at a table. I show you a row of 5 coins lined up on the table--4 pennies and a quarter.

              Then I place 1 coin in each of 5 identical boxes. 5 boxes, each with 1 coin inside.

              Then I shuffle the boxes around, a la streetside shell-gamer.

              Assuming that I allow you as many guesses as you like, and do not replace or reshuffle any boxes after each selection, how many boxes will you have to select in order to be 100 pct certain that you find the quarter?

              I say 4. But apparently, the logic of conventional TMRCA models says 20.



              • #8
                Originally posted by MMaddi
                Here's one problem that's troubling, of which I'm aware of a few cases. If two men are in different subclades, how can they have a genetic distance so low that a TMRCA calculator will say they share a common ancestor within the last few hundred years? These are extreme examples, but I have seen a few.
                Really? This would be interesting to me, with possible application to my research on classic Ui Neill genealogies.

                I think I would approach this by looking at a network graph of each of these anomalous individuals and their respective subclade modals. Then I would draw a perimeter around each subclade modal, representing the expected # of mutations during the conventional history of the SNP.

                I would expect that these anomalous individuals fall within the area of overlap of the subclade perimeters.

                Of course, the size of those perimeters is highly sensitive to the selection of a linear or exponential population mutation model. My guess is that the conceptually more accurate exponential model would return a freakin' huge perimeter. But it would still be interesting to see how close each individual is to the edge of his subclade.

                Maybe it could have implications for geographic origin of the subclade? I think conventional wisdom is that diversity is highest in geographic region of origin.

                Would you be able to post links to relevant Ysearch IDs?



                • #9
                  Originally posted by Clochaire
                  I would expect that these anomalous individuals fall within the area of overlap of the subclade perimeters.

                  Of course, the size of those perimeters is highly sensitive to the selection of a linear or exponential population mutation model. My guess is that the conceptually more accurate exponential model would return a freakin' huge perimeter.
                  Of course, if an individual falls outside of the perimeter of his subclade modal, it could suggest independent origins of the defining SNP.

                  There would probably be a lot of argument over whether the assumed population mutation rate is generous enough. But that could be a fruitful discussion, too.



                  • #10
                    Originally posted by Clochaire

                    Would you be able to post links to relevant Ysearch IDs?
                    This is the example that most perplexes and troubles me - http://tinyurl.com/5u8ryc

                    Both Misner and Sobolewski have DYS492=13, which in over 90% of cases means they should be U106+. Also, Misner has a "null 425" result, which is a distinct cluster in R1b-U106, with a common ancestor estimated to have lived about 2,500 years ago.

                    So Sobolewski should be U106+. Yet he has tested U106-. Well, perhaps he's part of a population out of which U106 and its null 425 cluster arose - call him pre-U106. Nope, he has tested P312+, which is on a different branch of R1b-M269 than U106.

                    It gets worse. The genetic distance, at 66 markers, between Misner and Sobolewski is just 6. According to the McGee Utility, this puts their common ancestor back just 480 years, at the 50% confidence interval. Yet, Sobolewski doesn't have a null 425 and Misner does. Misner's null 425 indicates a common ancestor with all other R1b-U106's with a null 425 about 2,500 years ago, as I stated above.

                    Misner has not SNP-tested yet, so this is all supposition, but pretty firm supposition. I've suggested that he test at least for P312, but he wasn't interested in that. If I am correct in my suppositions, Misner and Sobolewski are in different subclades, yet seem to have a common ancestor about 500 years ago, based on their GD.

                    So, how can Sobolewski and Misner share a common ancestor about 500 years ago when it seems impossible? This seems to be a remarkable case of convergence that tricks a good TMRCA calculator into computing a common ancestor between two men who don't share one in any meaningful sense.



                    • #11
                      Originally posted by MMaddi
                      This is the example that most perplexes and troubles me - http://tinyurl.com/5u8ryc

                      Both Misner and Sobolewski have DYS492=13, which in over 90% of cases means they should be U106+....
                      This is a good one. But apart from the fact that I still have to obtain and learn the haplotype network graphing software (been kinda busy here at work), there seem to be some information gaps that would prevent a very strong conclusion at this point:
                      1. Sobolewski's test shows him on a clearly U106- branch, Misner hasn't tested at all.
                      2. Our primary reason for suspecting U106+ are values for 2 individual STR markers. One fellow (Misner?) shows only 1 of the typical U106 values for these markers.
                      3. It is recognized that perhaps 10 pct of the folks with 492:13 are U106-. Do I have that right? It may become important to distinguish between correlation of 90 pct of U106+ to 492:13 on one hand, and 90 pct of 492:13 to U106+ on the other.

                      If I have the above correct, it could simply mean that these fellas belong to one of the 10 pct lineages where 492:13 does NOT equal U106.

                      This reminds me of the odds crunching I did earlier about my own discordant results and unexpected haplotype results for some other folks indicating M222+. I won't rehash that old horse's prayer here, but the upshot was that, despite what I still feel looked reasonably like an 85 pct probability of switch of our results by the lab, it turned out, in actuality, that the culprit was a less than 1 pct probable series of lab testing errors.

                      So, intuitively, I have no high difficulty in accepting those results as valid and consistent, at least given what I know now.

                      Still, it would make history if Misner turned out to be U106+. My personal bias would be to think that SNP history is much more complicated than current convention.

                      Jack



                      • #12
                        Originally posted by Clochaire
                        This is a good one. But apart from the fact that I still have to obtain and learn the haplotype network graphing software (been kinda busy here at work), there seem to be some information gaps that would prevent a very strong conclusion at this point:
                        1. Sobolewski's test shows him on a clearly U106- branch, Misner hasn't tested at all.
                        2. Our primary reason for suspecting U106+ are values for 2 individual STR markers. One fellow (Misner?) shows only 1 of the typical U106 values for these markers.
                        3. It is recognized that perhaps 10 pct of the folks with 492:13 are U106-. Do I have that right? It may become important to distinguish between correlation of 90 pct of U106+ to 492:13 on one hand, and 90 pct of 492:13 to U106+ on the other.

                        If I have the above correct, it could simply mean that these fellas belong to one of the 10 pct lineages where 492:13 does NOT equal U106.


                        Jack
                        I think it's more clear-cut than you portray it.

                        Here are the statistics for all the members in the R1b-P312 Project who have 67 markers:

                        DYS492 value   Number   Percentage
                        11 or 12          169         93.4
                        13                  5          2.8
                        14                  7          3.9

                        Here are the statistics for all the members in the R1b-U106 Project who have 67 markers:

                        DYS492 value   Number   Percentage
                        12                  3          1.1
                        13                254         95.1
                        14 or 15           10          3.7

                        Combining the two projects together, 98.1% of those with DYS492=13 (n=259) have been SNP-tested as U106+, while the other 1.9% have a P312+ result. That's as close to a "slam dunk" as you can get for a marker value predicting or indicating subclade.
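The two directions of the correlation that Jack asked about can be checked directly from the counts in the two project tables above. The short Python sketch below only redoes that arithmetic; the variable and function names are made up for illustration, and the second figure depends on the relative sizes of the two projects, which are not a random sample of R1b men.

```python
# Counts of 67-marker members by DYS492 value, copied from the two tables above.
p312_counts = {"11 or 12": 169, "13": 5, "14": 7}
u106_counts = {"12": 3, "13": 254, "14 or 15": 10}

def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole

# Direction 1: of U106 Project members, what share carry DYS492=13?
pct_13_given_u106 = share(u106_counts["13"], sum(u106_counts.values()))

# Direction 2: of all members with DYS492=13 in both projects,
# what share are in the U106 Project?
n_with_13 = u106_counts["13"] + p312_counts["13"]
pct_u106_given_13 = share(u106_counts["13"], n_with_13)

print(f"DYS492=13 among U106 members: {pct_13_given_u106:.1f}%")       # 95.1%
print(f"U106 among DYS492=13 (n={n_with_13}): {pct_u106_given_13:.1f}%")  # 98.1%
```

The first figure is the 95.1 in the U106 table and the second is the 98.1% quoted above, so in this data set the two directions happen to be close but are not the same number.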

                        In the case of Misner, it's true that he hasn't had a SNP test, so it's unsure that he's U106+. However, he has a "null 425" and DYS492=13, plus two of the three secondary characteristic marker values (DYS447=24, DYS460=10 but differs on DYS390) for the U106+/null 425 cluster. Every member of this cluster who's had a SNP test has been U106+.

                        Checking for Misner's closest matches (genetic distance of 15 or less, at 66 markers) in the ysearch database, only 4 of the U106+/null 425 cluster members show up among dozens of his closest matches. So it seems that on the surface Misner is a member of the cluster, but a closer look says he likely is not. (I had taken him out of my spreadsheet for the cluster months ago.)

                        Checking the DYS492 value for his 135 closest matches, 81 have 12 and 54 have 13. So it looks like Misner is probably U106-, like Sobolewski, and their small genetic distance is a true indication of a recent common ancestor. I'd still like to see Misner get a SNP test.

                        Even with the near "slam dunk" predictability factor of DYS492=13, this case proves that using modal STR values will always give you a certain percentage of wrong calls. SNPs rule.
