What constitutes a match

  • What constitutes a match

    My sister and I both tested with FTDNA, and I'm trying to correlate our raw data files with the matches reported in the Chromosome Browser (CB) spreadsheet.

    CB reports we have a match of 11,800 SNPs at a given locus. I located the locus number in the build 36 data file and wrote a program to generate a matrix of rs names and the allele pair for each of us. I assumed a "match" would mean that we share at least one allele at every rs-named SNP in the region defined by the start/stop locus numbers.

    What I find is that the 11,800-SNP sequence is mostly a match (with an overlapping allele value in each position), but there's one rs position with no overlapping value (e.g., CC for me and GG for her).

    I've tried running the comparison on both build 36 and build 37 data files with similar results.

    So I assume the FTDNA algorithm is forgiving, in that a single difference doesn't break a match. I've modified my algorithm to accept a difference if the 50 SNPs on either side have a common (overlapping) allele. The results are better, but still not a perfect match to FTDNA's Chromosome Browser matches. Does anyone know how FTDNA computes matches?

    I've also noticed that the FTDNA CB spreadsheet reports several matches with exactly 500 SNPs; not 501 or 512, but exactly 500. Others have a non-round number, e.g., 1132. Any idea why there would be so many with a round number of matching SNPs?

    I've also looked at matches between locus 1 and locus 2 that contain perhaps 1500 SNPs, and CB will sometimes report more matching SNPs than exist in the raw data file between those loci.

    I'm just trying to make sense of the numbers that ftdna is reporting in its CB output. Any insight would be greatly appreciated.
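
The tolerant comparison described in this post can be sketched roughly as follows. This only illustrates the poster's own approach (a mismatch is forgiven when the 50 SNPs on either side all share an allele); it is not FTDNA's algorithm, which is not documented in this thread. The function names, the `window` parameter, and the choice to treat `--` as a no-call that never breaks a match are assumptions made for the example.

```python
def compatible(g1, g2):
    """True if two genotypes (e.g. "AG", "GG") share at least one allele.

    Assumption: a no-call, written "--" here, is treated as compatible.
    """
    if "-" in g1 or "-" in g2:
        return True
    return bool(set(g1) & set(g2))


def find_segments(a, b, window=50):
    """Return (start, end) index pairs of half-identical runs.

    a and b are genotype lists in the same SNP order. An isolated
    mismatch is forgiven only when the `window` SNPs immediately
    before and after it are all compatible.
    """
    ok = [compatible(x, y) for x, y in zip(a, b)]
    n = len(ok)

    forgiven = list(ok)
    for i, good in enumerate(ok):
        if good:
            continue
        left = ok[max(0, i - window):i]
        right = ok[i + 1:i + 1 + window]
        if len(left) == window and len(right) == window and all(left) and all(right):
            forgiven[i] = True

    # Collect maximal runs of compatible (or forgiven) positions.
    segments, start = [], None
    for i, good in enumerate(forgiven):
        if good and start is None:
            start = i
        elif not good and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, n - 1))
    return segments
```

With a small window, for example `find_segments(mine, hers, window=3)`, a single CC/GG position surrounded by shared alleles stays inside one segment, while two adjacent mismatches split it.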

  • #2
    Did you and your sister test on the same chip version?

    Current chip tests 693733 positions
    previous chip tested 696752 positions
    older chip tested 707269 positions

    One chip version may record a certain SNP in forward orientation while another chip version records it in reverse orientation. These are the opposite strands of the DNA double helix: if one side is A the other is T, and if one side is G the other is C.

    For example, on chromosome 1, rs1153103, position 1404875 (build 36):
    The chip version testing 707269 positions records the value in reverse orientation (TT, GG, or TG/GT), while the chip version testing 693733 positions records it in forward orientation (AA, CC, or AC).
    In this case AA is the same as TT, CC is the same as GG, and TG is the same as AC.

    I believe the algorithm will just count these discordant SNPs as no-calls. The algorithm allows for a certain number of no-calls/discordant SNPs within a matching segment.

    If there happen to be several of these within a matching segment, the algorithm will reject the segment as a match even though it is a true match. Currently I believe this issue, where a larger true-positive segment is rejected, occurs only on the X chromosome.

    Regarding your observation that "CB will sometimes report more matching snps that exist in the raw data file between those loci": the difference between the number of SNPs recorded for a segment and the number of SNPs listed in the raw data is probably due to the fact that FTDNA removes from the raw data those tested SNPs that are deemed medical.
    Last edited by prairielad; 23 January 2015, 05:59 PM.
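
The orientation issue described in post #2 can be illustrated with a small sketch that complements a genotype before comparing it. This is only an example of the idea, not something FTDNA is known to do; the helper names are made up for the illustration.

```python
# Base pairing on opposite strands: A<->T, C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(genotype):
    """Rewrite a genotype as read from the opposite strand (TG -> AC).

    Characters that are not bases, such as "-" in a no-call, are kept as-is.
    """
    return "".join(COMPLEMENT.get(base, base) for base in genotype)


def same_genotype(g1, g2):
    """Compare genotypes ignoring allele order (TG equals GT)."""
    return sorted(g1) == sorted(g2)


def matches_on_either_strand(g1, g2):
    """True if g1 equals g2 directly or after flipping g1 to the other strand."""
    return same_genotype(g1, g2) or same_genotype(flip_strand(g1), g2)
```

For rs1153103, `matches_on_either_strand("GT", "AC")` treats the reverse-orientation call GT as the same genotype as the forward-orientation call AC. Note that this trick cannot resolve A/T or C/G SNPs, where both strands use the same pair of letters; for those the orientation has to come from the chip documentation.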


    • #3
      Re the many segments that are exact multiples of 100, this FAQ has some background. It mentions sets of 50-100 SNPs, but my impression is that the overwhelming majority are sets of 100 SNPs.

      Re the few exceptions with an odd number of SNPs, I've wondered about them, too. They do seem to be larger, but that's the only thing I've noticed.

      For readers who don't want to write their own analysis routine, David Pike has a handy utility.
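
For anyone who wants to check the "multiples of 100" impression against their own download, a quick tally could look like the sketch below. It assumes you have already pulled the matching-SNP counts out of the Chromosome Browser spreadsheet into a plain list of integers; the function name and the `step` parameter are just for illustration.

```python
from collections import Counter


def tally_round_counts(snp_counts, step=100):
    """Count how many segment sizes are exact multiples of `step`."""
    tally = Counter()
    for n in snp_counts:
        key = "multiple" if n % step == 0 else "other"
        tally[key] += 1
    return dict(tally)
```

Calling `tally_round_counts(counts)` on the counts from a batch download gives a rough sense of how dominant the round-number segments are in your own data.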


      • #4
        Addendum: I just did a fresh download of a Chromosome Browser file. There's more variation in the size of segments with and without multiples of 100 than I recalled from the "olden days." One difference between now and then is that it's possible to download all segments as a batch file. Early on, the download would include data for just those people highlighted in the browser, so that wasn't such a good sample size.