Big Y SNP mutation rate

  • Big Y SNP mutation rate

    Hi all. I recently got my Big Y results through FTDNA. Unfortunately, I am a bit of a novice when it comes to genealogy.

    My Big Y result shows a match with one other person in haplogroup D. According to the results, we:

    * Share 498 novel variants
    * Have 26,724 matching SNPs
    * Have 3 known SNP differences: CTS1530, PF2000, Y1251

    I have read an article about the mutation rate for Ashkenazi Levites at https://sites.google.com/site/levite...kenazi-levites

    where, on average, an SNP mutation occurs once every 150 years. Does anyone know whether this mutation rate is specific to the Levites because of their shared history, or is it a rough guide for all SNP mutation rates? Does my result of 3 known SNP differences mean that the person I am matched with and I share a common ancestor who lived between 90 and around 450 years ago? (90 because we can each trace our genealogy back a maximum of 2 to 3 generations.) Also, I have read elsewhere that the rate of mutation on the Y chromosome differs from area to area. How does this affect SNP mutations, as opposed to markers?

  • #2
    The rule-of-thumb that one reliable SNP occurs every 150 years applies to both known and novel SNPs. But there are so many caveats that...

    I strongly suggest that you ask FTDNA for your BAM file (raw data) and submit it to YFull for analysis. yDNA haplogroup D is rare among FTDNA customers, but YFull's D haplotree has 5 regular customers as well as many research samples.

    YFull's 5 customers in D come from:
    the Philippines
    Ukraine
    Kazakhstan
    Japan
    One has undeclared ancestry.

    The guys from Ukraine and Kazakhstan have an estimated TMRCA of only 1550 years.


    • #3
      I'm now using a rate of 160 years per Big Y-tested SNP in the combBED area, the rate suggested in D. Adamov et al., Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data, available at rjgg.molgen.org/index.php/RJGGRE/article/download/151/175

      YFull's calculations are based on that same methodology. http://www.yfull.com/faq/what-yfulls...n-methodology/ That's not surprising, given that at least two of the authors of that paper are founders of YFull.

      To use that methodology to calculate the time to an MRCA, it's necessary to eliminate those SNPs reported by Big Y that fall outside the combBED area (Table 1 to that article identifies the positions of SNPs in the combBED area), as well as SNPs that are otherwise unreliable (because, for example, they are shared by multiple men on different branches). (Inclusion of such SNPs will result in substantial overstatement of the calculated time to the MRCA.)

      Because there is considerable variance in the number of SNP mutations on an individual man's line, the margin of error in the calculated time to an MRCA will be quite large if the pool of Big Y-tested men is small. This can also be a significant issue when calculating the time from the start of a bottleneck to the end of a bottleneck, unless there are several lines branching out from the same bottleneck.

      Because YFull has automated this process and has access to a large pool of Big Y results, it's likely the most reliable way for members of a small cluster to identify their shared reliable SNPs and to calculate the range of time to an MRCA (unless the men in that cluster belong to an FTDNA project with administrators who are able to perform such an analysis).


      • #4
        Great, thanks all for the replies.


        • #5
          Jeff -

          Thanks for the info. I do have a question about the 160-year rate. I found it in the paper you linked, but I also noticed that they use a rate of 144.41 when calculating subclades. The text near the bottom of the YFull web page you linked reads:

          The second formula uses an assumed mutation rate of 144.41 years (0.8178*10^-9), which is the average of the mutation rates of the ancient Anzick-1 sample and of a group of known genealogies, and an assumed age of 60 years for living providers of YFull samples.

          So, I had originally thought to use 144.41 for my TMRCA estimates. Any thoughts on that?


          • #6
            Zen--

            That's a good question.

            As shown at, for example, http://www.yfull.com/tree-info/R-Y2619/, YFull is using two separate formulas. The first formula calculates, for each tested man, the ratio of all SNPs in the combBED area to that man's tested SNPs in the combBED area, and then multiplies that ratio by the number of Big Y-reported SNPs to derive a corrected number of SNPs. The second formula multiplies the corrected number of SNPs by 144.41 years and then adds 60 years (as the assumed age of tested men).
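
The two formulas described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not YFull's actual code. The function names are made up, the correction is computed here from covered positions rather than from SNP tallies, and the combBED length of 8,467,165 positions is an assumption taken from the worked figures quoted elsewhere in this thread.

```python
# Sketch of the two-step calculation described above (not YFull's actual code).
# COMBBED_LENGTH is an assumption taken from figures quoted elsewhere in this thread.
COMBBED_LENGTH = 8_467_165   # positions in the combBED area (assumed)
YEARS_PER_SNP = 144.41       # rate quoted from YFull's FAQ
ASSUMED_AGE = 60             # assumed age of living testers, in years


def corrected_snp_count(reported_snps: float, covered_length: int) -> float:
    """First formula: scale the reported SNP count up to full combBED coverage."""
    return reported_snps * COMBBED_LENGTH / covered_length


def age_estimate(corrected_snps: float) -> float:
    """Second formula: corrected SNPs times the rate, plus the assumed tester age."""
    return corrected_snps * YEARS_PER_SNP + ASSUMED_AGE


if __name__ == "__main__":
    # Illustrative kit: 23 reported SNPs with 7,567,730 covered positions.
    snps = corrected_snp_count(23, 7_567_730)
    print(f"corrected SNPs: {snps:.2f}")
    print(f"estimated age:  {age_estimate(snps):.0f} years")
```

A kit with nearly complete coverage of the combBED area gets almost no correction, while a kit with less coverage has its SNP count scaled up proportionally.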

            Based upon the small sample of results shown on the linked page, it looks as if there's fairly broad variation in the range of SNPs in the combBED area that are being tested by the Big Y test. Putting aside two FGC tests (YF01462 and YF01935) that tested almost all of the SNPs in the combBED area, the ratio of all SNPs in the combBED area to tested SNPs in the combBED area ranges from 1.025:1 (YF02919) to 1.16:1 (YF01611).

            My guess is that, in the samples considered by YFull when calculating its figure of 144.41 years/SNP, the average ratio of all SNPs in the combBED area to tested SNPs in the combBED area approached 1.1:1, which is approximately the ratio of the 160 years/SNP reported in Adamov et al. to the 144.41 years/SNP used by YFull.

            I think that the figure of 144.41 years/SNP should be very reliable for a calculation based upon corrected SNPs using YFull's methodology of considering the ratio of all SNPs in the combBED area to tested SNPs in the combBED area.

            However, the figure of 160 years/SNP should be quite accurate when considering the total number of reliable Big Y-reported SNPs, without YFull's adjustment. Even if that figure is off by, say, five or 10 years, that contribution to the margin of error would, I assume, often be less significant than, for example, the contributions resulting from small sample size (especially over a relatively short time period) or even the assumption as to the average age of tested men.


            • #7
              The difficulty with 150, 144, or 160 years per SNP is that, while it could be an accurate value for the average (statistical mean) of the available data, you can't apply this figure to an individual pedigree in the sense that every 150 years, on a Tuesday, one more SNP will mutate. The mutation of one more SNP is still just a random and rare event that could occur at any time. In a 150-year period, there might be no mutations at all, or half a dozen (exercise for the reader: compute those odds!). It's not until you are dealing with time spans well beyond the "genealogically relevant" timeframe, or with large sample sizes, that you can begin to use the average mutation rate as an analytical tool.
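
One way to attempt the exercise is to assume that SNP mutations arrive as a Poisson process with an average of one mutation per 150 years. That modelling assumption is not stated anywhere in this thread, and the function names below are made up, so treat this as a rough sketch rather than a definitive answer.

```python
import math


def poisson_pmf(k: int, expected: float) -> float:
    """Probability of exactly k events when `expected` events occur on average."""
    return math.exp(-expected) * expected**k / math.factorial(k)


def prob_at_least(k: int, expected: float) -> float:
    """Probability of k or more events (one minus the chance of fewer than k)."""
    return 1.0 - sum(poisson_pmf(i, expected) for i in range(k))


if __name__ == "__main__":
    # Assumed rate: one SNP per 150 years, observed over one 150-year period.
    expected = 150 / 150
    print(f"P(no mutations)      = {poisson_pmf(0, expected):.4f}")
    print(f"P(exactly 1)         = {poisson_pmf(1, expected):.4f}")
    print(f"P(6 or more)         = {prob_at_least(6, expected):.6f}")
```

Under this assumption, a 150-year period with no mutation at all is quite common, while half a dozen in the same period is very rare but not impossible.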


              • #8
                Jeff -

                Thanks for the explanation. I am now trying to come up with a corrected SNP count using the data I have which I received from the person who analyzes our BAM files for our haplogroup.

                Unfortunately, he does not calculate the read length that overlaps the combBED area. I am wondering if there is an alternative way to correct the SNPs? The summary data I have per kit looks like this:

                REF_N=33,591,060
                CALLABLE=8,969,383
                NO_COVERAGE=7,482,019
                LOW_COVERAGE=1,972,936
                POOR_MAPPING_QUALITY=5,249,257

                Any idea?

                John, good point! However, I have been told that TMRCA calculations using SNPs are more accurate than those using STRs. I'm not sure whether this is true. Could it be that SNPs are better for ancient calculations and STRs are the preferred approach for closer (genealogical) times?


                • #9
                  Originally posted by John McCoy
                  The difficulty with 150, 144, or 160 years per SNP is that, while it could be an accurate value for the average (statistical mean) of the available data, you can't apply this figure to an individual pedigree in the sense that every 150 years, on a Tuesday, one more SNP will mutate. The mutation of one more SNP is still just a random and rare event that could occur at any time. In a 150-year period, there might be no mutations at all, or half a dozen (exercise for the reader: compute those odds!). It's not until you are dealing with time spans well beyond the "genealogically relevant" timeframe, or with large sample sizes, that you can begin to use the average mutation rate as an analytical tool.
                  Exactly.

                  Averages for populations are averages for populations and not necessarily characteristics of any individual. Let's say that in a given population men on average have 2.5 children. So Joe Random has 2.5 kids???

                  Mr W


                  • #10
                    Originally posted by dna
                    Exactly.

                    Averages for populations are averages for populations and not necessarily characteristics of any individual. Let's say that in a given population men on average have 2.5 children. So Joe Random has 2.5 kids???

                    Mr W
                    Why the question mark about 2.5 children? It's actually very simple. It means that, on average, given two sets of parents, the number of children is 2 for one and 3 for the other.


                    • #11
                      Originally posted by thetick
                      Why the question mark about 2.5 children?
                      Because he was trying to make a point as he was agreeing with John McCoy. He used that question of 2.5 children as an example, not as an actual question he had.


                      • #12
                        Originally posted by zen4life

                        Thanks for the explanation. I am now trying to come up with a corrected SNP count using the data I have which I received from the person who analyzes our BAM files for our haplogroup.

                        Unfortunately, he does not calculate the read length that overlaps the combBED area. I am wondering if there is an alternative way to correct the SNPs?

                        . . .

                        Any idea?


                        Zen--

                        I'm not sure if there's any way, other than having YFull analyze each kit, to determine the SNP count within the combBED area for each man's kit.

                        With regard to the use of an individual man's results to try to calculate the time to an MRCA, in a cluster going back 1,300 years I've seen the number of reliable downstream SNPs in the tested men's combBED area varying from 1 to 12.

                        I think that it's necessary to use results from multiple men on multiple lines in a cluster to date the time to that cluster's MRCA with any degree of reliability (and, even then, I would expect the margin of error to be quite significant, especially in a genealogical timeframe).
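
To illustrate why a single line is a weak basis for a date, the sketch below averages SNP counts over several lines and reports a crude spread. The counts in the usage example are invented for illustration (they are not data from this cluster), the function name is hypothetical, and the standard error of the mean is only a rough stand-in for a proper confidence interval.

```python
import statistics

YEARS_PER_SNP = 160  # rate from Adamov et al., as discussed in this thread


def tmrca_from_lines(snp_counts, years_per_snp=YEARS_PER_SNP):
    """Return (estimate, standard_error) in years from per-line SNP counts.

    Needs at least two lines, because a spread cannot be computed from one.
    """
    if len(snp_counts) < 2:
        raise ValueError("need SNP counts from at least two lines")
    mean_snps = statistics.mean(snp_counts)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(snp_counts) / len(snp_counts) ** 0.5
    return mean_snps * years_per_snp, sem * years_per_snp


if __name__ == "__main__":
    # Hypothetical counts of reliable downstream SNPs on four lines.
    counts = [1, 5, 8, 12]
    estimate, spread = tmrca_from_lines(counts)
    print(f"estimate: {estimate:.0f} years, +/- about {spread:.0f} years")
```

With only a handful of lines, the spread stays large relative to the estimate, which is the point made above about small clusters.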


                        • #13
                          I agree it would be ideal to have everyone upload to YFull and have them do it. However, it is a hard sell to get all members of a project to pay another 50 bucks to upload to YFull. But let's be honest, FTDNA really should be doing this analysis. That is my opinion anyway.

                          And yes, I agree that a small sample size is going to have a wider margin of error. And especially going back to genealogical times.

                          Going back to the corrected SNPs formula, I think I may have been looking at it wrong. It appears to be a formula to compensate for the different coverage areas of Big Y and FGC. The FGC kits cover a much larger area, so the formula adds only a small amount to the number of SNPs, whereas for Big Y it adds more.

                          For example, this is probably a Big Y kit:

                          23.0/7567730*8467165 = 25.73 SNPs

                          Whereas this is probably an FGC kit:

                          24.0/8440400*8467165 = 24.08 SNPs

                          Does that sound about right? If so, I do not need this formula at all since I am just comparing Big Y kits at the moment. I just need to figure out the best mutation rate to use.


                          • #14
                            Zen--

                            It looks as if YFull's methodology is designed not only to calibrate Big Y test results to FGC test results, but to calibrate Big Y test results to one another.

                            Looking at the six Big Y tests reported on http://www.yfull.com/tree-info/R-Y2619/, the percentage of SNPs reported within the combBED region ranges from 86.5% (kit YF01611) to 97.5% (kit YF02919). The average percentage of SNPs reported within the combBED region on those six Big Y tests is 93.3%.

                            To take a larger sample size, the percentage of SNPs reported within the combBED region for 58 Big Y tests for Z93+ men at http://www.yfull.com/tree-info/R-Z93/ is 91.8%. Dividing 144.41 years/SNP by 91.8% yields 157.3 years, which is quite close to the 160 years/SNP from the paper.
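
The arithmetic in the last two paragraphs can be checked with a short sketch. The function names are made up for illustration; the rates and the 91.8% figure are the ones quoted in this thread.

```python
YFULL_RATE = 144.41   # years per corrected SNP (YFull)
PAPER_RATE = 160.0    # years per Big Y-reported SNP (Adamov et al.)


def uncorrected_rate(corrected_rate: float, coverage_fraction: float) -> float:
    """Years per reported SNP when only part of the combBED area is covered."""
    return corrected_rate / coverage_fraction


def implied_coverage(corrected_rate: float, reported_rate: float) -> float:
    """Coverage fraction at which the two rates would agree exactly."""
    return corrected_rate / reported_rate


if __name__ == "__main__":
    print(f"{uncorrected_rate(YFULL_RATE, 0.918):.1f} years per reported SNP")
    print(f"{implied_coverage(YFULL_RATE, PAPER_RATE):.1%} implied coverage")
```

Read this way, the two published rates differ mainly in whether the coverage correction is applied before or after counting SNPs.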

                            I'd be inclined to use the 160 years/SNP figure, rather than to try to replicate YFull's adjustment based upon the percentage of SNPs reported within the combBED region. For the reasons discussed earlier in this thread, the actual mutation rate per SNP may vary broadly, so we're dealing with approximations, which become less reliable with smaller sample sizes and shorter periods of time to an MRCA.


                            • #15
                              Jeff -

                              Ok, so it looks like you called it, unless it is some major coincidence.

                              I compared 6 Big Y kits belonging to Sept A to 1 Big Y kit belonging to Sept B. According to the Irish Annals, these two Septs roll up to a common ancestor who lived in the late 300s, as the father of the common ancestor took over kingship of the tribe in 379.

                              Plugging in 160 years per SNP and averaging out the results gives me a TMRCA of 377 AD. Coincidence? Hmmmmm.

                              I am going to try a couple other comparisons but for those, I don't have as specific of a date for the TMRCA, only ranges. Anyway, thanks for the help.
