How is the analysis done?

  • How is the analysis done?

    Can anyone explain, or link me to an explanation of, how these origins estimates are produced? How are SNP genotypes such as AC, TT and AT matched to a reference population? Is a reference population a single amalgam of results, with a higher proportion of homozygous SNPs due to inbreeding? If so, is it matched to my data in the same way as a relative’s data would be?

    I see that a mixed-race person has been correctly phased into the two racial groups, though each group is sometimes shown as Haplotype 1 and sometimes as Haplotype 2. How has this been done? I didn’t think it was possible to know which letter of a genotype such as AC came from which haplotype.

  • #2
    As far as I know, the subject kit is matched to a derived consensus sequence (presumably haploid) for each reference group using the "normal" segment matching algorithm, and no actual sequence data seems to be available from any reference group. In other words, to the best of my knowledge, we don't know which specific SNPs, or which of their + and - scores, are being used (and which are not) to determine the degree of matching with any reference group.
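    To make the idea concrete, here is a minimal sketch of "half-identical" matching between an unphased kit and a haploid consensus. It is only an illustration under my own assumptions, not FTDNA's actual algorithm: the function names, the toy genotypes and the toy consensus are all made up. A genotype such as AC counts as compatible with a consensus allele if either of its letters equals that allele, which is why this kind of matching does not need to know which letter sits on which haplotype.

```python
def compatible(genotype: str, ref_allele: str) -> bool:
    """True if the unphased genotype (e.g. "AC") contains the reference allele."""
    return ref_allele in genotype


def longest_compatible_run(genotypes, consensus):
    """Return (start, length) of the longest run of consecutive SNPs where the
    kit's genotype is compatible with the haploid consensus allele."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, (g, r) in enumerate(zip(genotypes, consensus)):
        if compatible(g, r):
            if run_len == 0:
                run_start = i  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # an opposite homozygote breaks the run
    return best_start, best_len


# Hypothetical toy data: six SNPs in chromosome order.
kit = ["AC", "TT", "AT", "GG", "CT", "AA"]
consensus = ["A", "T", "T", "C", "C", "A"]
result = longest_compatible_run(kit, consensus)
```

    In this toy example only the fourth SNP (GG against a consensus C) breaks the run. Real matching would also apply thresholds on segment length and tolerate a few miscalls, and none of that is shown here.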

    There have been several explanations of the methodology used to construct reference groups and, in general terms, of how matching is determined, but none at the level of detail you describe. I'm not sure where to find the latest explanation here on FTDNA. The methodology involves a huge number of assumptions, most of them untestable. We might see very different results if the entire "ethnic origins" methodology were based on truly phased data, i.e., data from long-read sequencing.


    • #3
      Thanks for that. I have since found a brief explanation on Ancestry of how they attempt to phase data by using all the overlapping matching segments from other people who match me. They then somehow combine data from all chromosomes to give the ethnicity subdivisions of each parent, though without knowing which parent is which. I suppose that if several chromosomes have, say, Welsh origin on one side and Italian on the opposite side, they would assume that all the Welsh comes from one parent and all the Italian from the other. That, of course, might not be the case, as both parents might be mixed. Perhaps FTDNA use a similar method to phase the data.
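      The grouping assumption I am guessing at can be sketched as follows. This is purely my own illustration, not Ancestry's or FTDNA's published method: `assign_sides` and the toy labels are hypothetical. Each chromosome arrives as a pair of haplotype labels in arbitrary order, and the sketch flips a pair whenever that agrees better with the labels already placed on each side.

```python
from collections import Counter


def assign_sides(chromosomes):
    """chromosomes: list of (label_hap1, label_hap2), one pair per chromosome,
    where the order within each pair is arbitrary.
    Returns two Counters of labels, one per inferred (unnamed) parent."""
    parent_a, parent_b = Counter(), Counter()
    for h1, h2 in chromosomes:
        # Score both orientations against what has been assigned so far.
        keep = parent_a[h1] + parent_b[h2]
        swap = parent_a[h2] + parent_b[h1]
        if swap > keep:
            h1, h2 = h2, h1
        parent_a[h1] += 1
        parent_b[h2] += 1
    return parent_a, parent_b


# Hypothetical toy input: three chromosomes with inconsistent haplotype order.
chroms = [("Welsh", "Italian"), ("Italian", "Welsh"), ("Welsh", "Italian")]
a, b = assign_sides(chroms)
```

      Here every Welsh haplotype ends up on one side and every Italian one on the other. If both parents were in fact half Welsh and half Italian, the same input would produce the same output, so the result would be tidy but wrong.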

      My ethnicity results from FTDNA and Ancestry are mostly plausible, but show odd variations. On FTDNA one of my daughters has numerous Eastern European segments, which could only have come from me, yet I have none! So clearly pinning down genetic variations for what we perceive as ethnicities is not easy.


      • #4
        The fact that "admixture" results from different vendors, and from the same vendor at different times, are discrepant tells us that the methodology isn't particularly robust. Another sign of difficulty is when a child's kit shows "ethnic" components that are not present in either parent!

        Real "phased" data has to come from comparison with actual relatives. Pretending to "phase" data by assuming that the subject kit has the same sequences as a large number of other people (this is properly "imputation", not "phasing") isn't helpful for the analysis of individuals, even if it makes some sense for the analysis of populations.
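        As a minimal sketch of what phasing against a relative means at a single SNP, assuming we have only the child's and one parent's unphased genotypes (the function name and the two-letter genotype strings are my own illustration, not any vendor's tool):

```python
def phase_with_parent(child: str, parent: str):
    """Return (allele_from_this_parent, allele_from_other_parent) for one SNP,
    or None when these two genotypes alone cannot resolve the site."""
    c1, c2 = child
    if c1 == c2:
        # Homozygous child: both haplotypes carry the same allele,
        # provided the parent could actually have transmitted it.
        return (c1, c2) if c1 in parent else None
    shared = set(child) & set(parent)
    if len(shared) == 1:
        # Exactly one of the child's alleles can have come from this parent.
        (from_parent,) = shared
        other = c2 if from_parent == c1 else c1
        return (from_parent, other)
    # Both heterozygous for the same alleles (ambiguous), or no allele in
    # common (a miscall or a mislabelled relationship).
    return None


resolved = phase_with_parent("AC", "AA")
ambiguous = phase_with_parent("AC", "AC")
```

        Sites where the child and the parent are both heterozygous stay unresolved with one parent only. In practice they are filled in from the second parent or from neighbouring resolved SNPs, which this sketch does not attempt.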

        In fact, I think there's a serious methodological flaw in the way the analysis is done. Even if the "reference groups" are valid, in the sense that multivariate statistical analysis shows each group to be relatively homogeneous and distinct from the others, the comparison of a test kit needs to be based on the best available sequence data from that kit, not on isolated SNP scores. There is an obvious way to get much closer to the "real" sequence data for the test kit: compile all of the significant sequences that it shares with other kits, that is, the kit's matching segments! We want to find out whether the test kit's actual matched segments, each of which should almost always come from a single homologue, show a significant similarity to any reference group. A tool that considers all of a test kit's matching segments and produces a set of phased sequences would seem to be a useful step forward while we are waiting for long-read sequencing. Those who have kits for themselves and at least one parent also have the option of using the phasing tools available on several web sites.