DNA vs Genetic Grouping - Errors & Misunderstandings

  • DNA vs Genetic Grouping - Errors & Misunderstandings

    Hi there,

    I've had several family members, including myself, DNA tested through MyHeritage over the years. I tested several years ago, and my paternal grandmother tested most recently, at the beginning of this year.

    Genetic Grouping is a relatively new addition to MyHeritage, which I don't fully understand yet, but I will continue to look into it. Interestingly, for most of our results, the genetic groups MyHeritage has suggested have been far more accurate than the DNA data they're providing us with recently. Or perhaps this is where I'm not 100% clear on the difference between the two. Could someone who understands the principle behind this give me a Genetic Grouping vs DNA results guide for dummies?

    In regards to the possible DNA result errors I'm referring to - I'll use my grandmother, my dad and myself as an example. MyHeritage suggests that I am 58% NW European - which is accurate, and matches what I've been taught about my heritage so far. My father was tested several years ago as well - his results were 60% NW European - also fairly accurate going by what we already knew, although I would have expected a slightly higher NW European percentage for him, considering I have 58%, only 2% less than him, and my mother is Eastern European. Now the interesting part is my grandmother - MyHeritage has classed her as 0% NW European, even though she is my father's mother and we have a written record of her paternal lineage coming from the same place in Germany for close to a millennium. They all came from the same village or thereabouts, and at no point in the last 500 years left a ~50km radius. The records we have before this, from 1000-1500 AD, indicate some movement within Germany only. At the same time, MyHeritage has confirmed that she is the biological parent of my father. How is this possible?

    MyHeritage believes my grandmother to be close to 30% English - we have nothing to do with the English at all, and neither my father nor I have any DNA classed as English by MyHeritage. They've said that she's 50% Balkan (Hungary, Serbia, Bulgaria, etc.) - again, we have no written records of this at all, and my father has only 16% Balkan, while I have 0%. I would expect my father to perhaps have 16% Balkan or thereabouts through his paternal lineage, but certainly not anywhere close to 50%. My grandmother shouldn't have any, or hardly any, traces at all.

    HOWEVER - when it comes to the Genetic Grouping, MyHeritage was accurate to a T! They were even able to pinpoint the exact German state that my grandmother and her forefathers come from, which I thought was pretty amazing actually - to within a 50-100km radius. But at the same time, her DNA suggests that she is 0% German or NW European? That's not possible, is it? MyHeritage's Genetic Grouping feature had similarly close to perfect results when it pinpointed my mother's cousin's origin in Hungary - specifically around the Ukrainian/Slovakian border. They were again correct to within a 50-100km radius.

    How can it be that MyHeritage's DNA results & genetic grouping can be so far apart from each other in accuracy? Should I simply be asking MyHeritage to review my grandmother's DNA results?
    Last edited by Staebe; 7 March 2022, 03:42 AM.

  • #2
    So, ethnicity estimates are well known to be the least accurate part of DNA testing for genealogy. Don't stress over the differences, and I wouldn't bother asking for a review, because the estimates will probably change whenever the next update comes out. I'm sure there are many posts here in the forums about estimates being inaccurate.

    I don't know much about Genetic Groups, but here are two articles about Genetic Groups at MyHeritage:

    Here are two free webinars by MyHeritage about Genetic Groups:


    • #3
      One glaring issue with the "traditional" approach to admixture analysis is that the source data are individual SNP scores based on small DNA fragments. The important information about which SNP scores actually come from the same homologue (i.e., the maternal versus the paternal copy of each chromosome) is unknown. This issue is currently addressed by using something like the segment matching algorithm to infer whether the kit being tested could contain a sequence that is characteristic of a particular "reference group", whose inferred sequences in turn are derived with the same limitations of unconnected SNP scores. At some point in the future, "long-read sequencing" data will become widely available, and we may finally learn to what extent the data based on small DNA fragments have been misleading us.

      In any case, the inferred (and possibly artifactual) sequences used for admixture testing are apparently much shorter than the inferred sequences used for detecting segment matches between kits for the purpose of detecting genealogical relationships. For genealogical comparisons, we usually exclude segments shorter than about 10 cM, and with further experience, it has often turned out that segments as large as 15 cM or more are simply unhelpful and possibly not real. Why, then, would we give so much credence to shorter segments for admixture testing?

      An alternative approach would be FIRST to establish true "phased" sequences for each kit, by constructing the set of all inferred segment matches of sufficient length that triangulate, or that can be derived by comparing parent-child kits or other close relatives. Those segments are very likely real, based on SNP scores from the same homologues. Then we can compare those sequences with admixture "reference group" sequences (derived by a similar phasing process based on actual triangulating segments, not on population-level inferred sequences).
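
      As a rough illustration of the parent-child comparison mentioned above, here is a minimal Python sketch of phasing a child's genotypes against one parent. The function name and the two-letter genotype encoding are my own assumptions for illustration, not any vendor's format or method. It only resolves the easy cases and leaves everything else unphased.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: one two-letter genotype string per SNP, e.g. "AG".
def phase_with_parent(
    child: List[str], parent: List[str]
) -> List[Optional[Tuple[str, str]]]:
    """Return (allele from this parent, allele from the other parent) per SNP,
    or None where a single parent-child pair cannot resolve the phase."""
    phased: List[Optional[Tuple[str, str]]] = []
    for c, p in zip(child, parent):
        c1, c2 = c[0], c[1]
        if c1 == c2:
            # Child is homozygous: both homologues carry the same allele.
            phased.append((c1, c2))
        elif p[0] == p[1] and p[0] in (c1, c2):
            # Parent is homozygous, so that allele must be the one passed on.
            from_parent = p[0]
            other = c2 if from_parent == c1 else c1
            phased.append((from_parent, other))
        else:
            # Both heterozygous, or the genotypes are inconsistent: unresolved.
            phased.append(None)
    return phased


child_kit = ["AG", "CC", "AG", "CT"]
mother_kit = ["AA", "CT", "AG", "TT"]
result = phase_with_parent(child_kit, mother_kit)
```

      The third SNP stays unresolved because both kits are heterozygous there. That is the case where a second parent, other close relatives, or triangulating segments would be needed.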

      It may well be that the "genetic grouping" methodology is based on longer shared segments that are observed to triangulate, and if that is how it actually works, the results could well be more informative than the older "admixture testing" methodology. But we won't really be able to assess the extent to which any of these methodologies produces artifactual results until we can make some comparisons with "admixture" analysis based on true long-read sequences.

      One measure of how much art goes into this sort of analysis is the degree to which different vendors and different algorithm versions produce discrepant results for the same kit. Another measure is the degree to which the results from parents and their children disagree, when tested with the same algorithm version. It is also possible in some cases (GEDmatch provides this, for example) to construct "phased" kits (the set of all sequences from a child that are also found in the mother or father, for example) and run them through various admixture algorithms. To get a more general picture of how the results differ among these different approaches, it would be necessary for a large number of people to collaborate in some sort of systematic comparison.
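
      The parent-child disagreement can be made concrete with a simple bound. A child receives half of each parent's autosomal DNA, so if a parent's true share of some category is m, the child's share must fall between max(0, m - 0.5) and min(0.5, m) + 0.5 when nothing is known about the other parent. The sketch below applies that bound to the figures from the first post. It treats the reported percentages as if they were exact, which they are not, and the function names are hypothetical.

```python
from typing import Tuple

def child_share_bounds(parent_share: float) -> Tuple[float, float]:
    """Possible range of a child's share of one category, given one parent's
    share and no information about the other parent (shares are 0.0 to 1.0)."""
    lowest = max(0.0, parent_share - 0.5)    # parent passes on as little as possible
    highest = min(0.5, parent_share) + 0.5   # parent passes on as much as possible,
                                             # and the other parent supplies the rest
    return lowest, highest

def is_consistent(parent_share: float, child_share: float) -> bool:
    lowest, highest = child_share_bounds(parent_share)
    return lowest <= child_share <= highest

# Figures quoted in the first post: (grandmother's share, father's share).
reported = {
    "NW European": (0.00, 0.60),
    "Balkan": (0.50, 0.16),
    "English": (0.30, 0.00),
}
checks = {name: is_consistent(gm, dad) for name, (gm, dad) in reported.items()}
```

      Under these assumptions only the NW European pair falls outside the bound: a mother estimated at 0% could not have a child above 50%. The Balkan and English figures are unusual but not strictly impossible, which suggests the estimate for one of the two kits is off rather than the relationship.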