Pseudo/False Segments under 5cM


  • Pseudo/False Segments under 5cM

    I thought I would share a little insight into why it is recommended not to set much store by the validity of segments under 5 cM/500 SNPs.

    I started a new thread because this is off topic for the original poster's question in the following thread:
    http://forums.familytreedna.com/showthread.php?t=38542

    I manually phased the raw data of 14 family members, chromosome by chromosome, based on how we all match each other (full base pair match, half base pair match, no match). See Document 1 in the link below, a partial file of my family's chromosome 22, as an example of what I did. This enabled me to break down our DNA into my grandparents' parent A and parent B chromosomes (I need more data to distinguish maternal from paternal, so I can only say Parent A and Parent B per chromosome).

    Document 2 in the link is an example of comparing two brothers' unphased data (chromosomes 1 through 22) with parameters set to 1 cM/100 SNPs at GEDmatch.
    I then compared their maternal phased files and their paternal phased files against each other.

    The unphased comparison finds 198 segments below 5 cM, while the maternal phased files yield only 13 and the paternal phased files only 17. That means about 85% of these small segments are false, because the algorithm is picking between each brother's two values, one from his maternal and one from his paternal chromosome.

    Code:
    https://onedrive.live.com/redir?resid=B112A5E875C2BB5D!305&authkey=!AD4Ltyj7QR1vHVk&ithint=folder%2cxlsx
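The 85% figure follows directly from the three counts quoted above. As a minimal sketch of that arithmetic (the function name is only illustrative, and it assumes every small segment that survives in the phased maternal or paternal comparison counts as confirmed):

```python
def false_fraction(unphased_count, phased_counts):
    """Share of unphased small segments not confirmed by any phased comparison."""
    confirmed = sum(phased_counts)
    return (unphased_count - confirmed) / unphased_count


# Counts from the post: 198 unphased, 13 maternal phased, 17 paternal phased.
fraction = false_fraction(198, [13, 17])
print(f"{fraction:.1%} of the sub-5 cM segments are not confirmed by phasing")
```

This gives 168 unconfirmed segments out of 198, which rounds to the 85% stated in the post.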

  • #2
    I should add that these brothers have a mother who was born in 1909 in Romania; her paternal side originated in Germany and moved to Romania (Bukovina, Austro-Hungarian Empire) in 1803.

    On their paternal side they have a Scottish grandfather (1850), with all lines going back to late-1700s Scotland, and an Irish grandmother (1860) whose parents were from Ireland.


    • #3
      Another thing about Document 2: small segments like those shown in the chromosome 10 phased maternal and paternal sections are due to the fact that I entered no-calls (-) at positions that I was unable to phase to a single value. In reality those segments are one continuous segment.


      • #4
        Thank you, Prairielad. Phasing of segments is the gold standard but it doesn't seem to be enough to convince the IT people or the public at large that these small segments are usually false positives due to pseudo-segments.

        Computer-generated pseudo-segments from unphased duplicate markers are not the only problem (but they are the main issue here). I have also shown in the previous thread how even tiny phased segments that have single sequences on one chromosome can be in common with too many people and can also be ambiguous because you don't know which ancestor to follow back in time. A distant cousin can match you, your mother and both maternal grandparents on the same tiny phased X segment; so where did it come from? This is a slightly different situation. You must rule in one grandparent and rule out the other grandparent and follow the segment back in time when there is a chance too many ancestors match that same segment.

        Until people really understand phasing, crossover rates, simulations over many generations, mapping, pile-ups, SNP-poor regions etc., we are going to have a problem with this concept. Even FTDNA has failed to convey this observation that small segments are unlikely to be relevant. Small segments continue to carry excessive weight in cousinship predictions. It is about time to discuss the various studies and observations on this subject. You can't just listen to archaic guesstimates of probabilities. We need an evidence-based approach.

        Here is an example of another common misconception. Let's say you upload your raw data to GEDmatch and look at the various biogeographical analyses. In one analysis you see that on 10 different chromosomes, there are tiny slivers of Native American. These are very small segments, less than 1 cM, and most of them are away from the centromere. You know you have one Native American princess in your pedigree back 10 generations. All others in your pedigree are well documented to have European heritage. You come to the conclusion that because the segments get sliced approximately in half with every generation, all these uniquely Amerindian segments must have come from this one specific lineage of Native American that is in the distant past. You think that it is unlikely that these segments could have been passed down IBD from anywhere else because of your well documented paper trail. You conclude that by process of elimination, one common ancestor actually gave you all these 10 segments passed down IBD through the same parent-to-child path in your pedigree. So my question to our genealogy group is, do you think these conclusions are mathematically sound? Why or why not? There can be more than one explanation so don't be shy in this debate.


        • #5
          Pseudo/False Segment Under 5 cM

          OTOH, I have dozens of small segment matches which overlap much larger matching segments and share common ancestors with those other kits. On chromosome 2, at location 180-220 Mb, any match over 4 cM is almost certain to have Boucher ancestry. Same with Guyon on chromosome 16 and Archambault on chromosome 20. I have a triple line of ancestry to Boucher and a double line to Archambault, but only a single line to Guyon.


          • #6
            Originally posted by Kathy Johnston View Post
            ....
            Until people really understand phasing, crossover rates, simulations over many generations, mapping, pile-ups, SNP-poor regions etc., we are going to have a problem with this concept. Even FTDNA has failed to convey this observation that small segments are unlikely to be relevant. Small segments continue to carry excessive weight in cousinship predictions. It is about time to discuss the various studies and observations on this subject. You can't just listen to archaic guesstimates of probabilities. We need an evidence-based approach.
            ....

            One benefit of these small segments (at FTDNA at least) is that FTDNA requires at least 20 cM of total shared DNA for a match to be declared. Many distant cousins may share only one valid segment under 20 cM, so these false segments allow the match to be shown on your match list. Sometimes these matches may be the key to figuring out a group of matches; other times these remote cousins may be extremely remote.

            Another thing: my understanding is that these little segments carry less weight than the segments over 5 cM/500 SNPs in determining cousinship. You can share more total DNA with a 5th to Remote Cousin (one segment over 5 cM/500 SNPs and multiple small segments under 5 cM/500 SNPs) than with a 3rd to 4th Cousin (one or more segments over 5 cM/500 SNPs and only a few small segments under 5 cM/500 SNPs).

            3rd Cousin 1x removed A (common ancestor is a shared father, different mothers); predicted as 3rd Cousin to 5th Cousin
            Chr  Start      End        cM     SNPs
            1    26594971   29406188   1.38   500
            1    71302722   75637307   2.2    800
            1    170453410  173357037  1.42   500
            2    23019126   31194150   8.88   2100
            4    130597537  133425346  2.16   500
            6    33053603   33472061   1.57   500
            6    115142017  120242476  2.86   1000
            7    21764587   30678486   12.85  2884
            7    86131629   87246282   1.47   500
            8    50532761   53292222   1.85   500
            8    64372741   66856177   2.52   500
            10   121579903  123084874  2.96   589
            11   50189333   57826280   2.7    800
            12   38223618   41139617   2.74   800
            19   15591034   16652443   2.02   500
            Total shared DNA: 49.58 cM

            3rd Cousin 1x removed B; predicted as 2nd to 4th Cousin
            Chr  Start      End        cM     SNPs
            1    38631015   41138032   3.52   600
            1    243907032  245665855  2.29   500
            2    18449496   20393429   2.34   600
            2    233444593  239674017  13.46  2071
            4    30142909   34709899   3.85   678
            5    44438874   51814954   1.39   500
            6    33096673   33927560   1.82   600
            6    93544268   95718940   1.88   500
            7    17778322   29363846   16.78  3884
            8    115080765  118304099  1.92   600
            11   48970261   56273717   1.09   500
            11   68577145   70101905   1.81   500
            12   20875620   21525661   2.46   500
            13   73935857   76154533   2.63   700
            15   40471747   43796695   2.2    600
            Total shared DNA: 59.44 cM

            Undetermined 5th to Remote Cousin
            Chr  Start      End        cM     SNPs
            1    10891317   13647613   4.21   593
            1    60138925   61623798   1.69   500
            1    159547087  162870526  4.61   1000
            1    202778327  204786830  3.33   575
            2    134556625  137439496  2.15   700
            2    183404764  190231010  2.92   1100
            3    111548022  113610833  1.54   500
            3    129723971  132221303  1.87   500
            6    85403976   88469505   2.75   600
            6    116492886  118791736  1.09   500
            7    75679525   77904733   2.29   500
            8    17782405   18875108   3.4    700
            8    27058580   28195157   2.77   500
            11   65075841   68173670   4.01   700
            11   92180200   94413959   3.13   600
            12   57716570   60521100   2.41   500
            12   86346855   89243803   3.11   500
            13   99682492   101019787  2.26   500
            18   72484246   74518781   9.2    840
            Total shared DNA: 58.74 cM
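To make the comparison between the three matches easier to see, here is a small sketch that splits each match's total into segments over 5 cM and segments at or under 5 cM. The cM values are copied from the three lists above; the dictionary labels, the function name and the choice to apply only the cM cutoff (ignoring the SNP count) are my own simplifications.

```python
# cM values per segment, copied from the three segment lists in this post.
SEGMENTS_CM = {
    "3rd Cousin 1x A": [1.38, 2.2, 1.42, 8.88, 2.16, 1.57, 2.86, 12.85,
                        1.47, 1.85, 2.52, 2.96, 2.7, 2.74, 2.02],
    "3rd Cousin 1x B": [3.52, 2.29, 2.34, 13.46, 3.85, 1.39, 1.82, 1.88,
                        16.78, 1.92, 1.09, 1.81, 2.46, 2.63, 2.2],
    "5th to Remote": [4.21, 1.69, 4.61, 3.33, 2.15, 2.92, 1.54, 1.87, 2.75,
                      1.09, 2.29, 3.4, 2.77, 4.01, 3.13, 2.41, 3.11, 2.26, 9.2],
}


def summarize(cms, cutoff_cm=5.0):
    """Return (total, cM in segments over the cutoff, cM in the rest)."""
    large = [c for c in cms if c > cutoff_cm]
    small = [c for c in cms if c <= cutoff_cm]
    return round(sum(cms), 2), round(sum(large), 2), round(sum(small), 2)


summary = {label: summarize(cms) for label, cms in SEGMENTS_CM.items()}
for label, (total, large, small) in summary.items():
    print(f"{label}: total {total} cM, over 5 cM {large} cM, small {small} cM")
```

On these numbers the remote cousin's 58.74 cM total is nearly the same as cousin B's 59.44 cM, but only 9.2 cM of it sits in a segment over 5 cM, against 30.24 cM for cousin B and 21.73 cM for cousin A.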


            • #7
              Originally posted by prairielad View Post
              One benefit of these small segments (at FTDNA at least) is that FTDNA requires at least 20 cM of total shared DNA for a match to be declared. Many distant cousins may share only one valid segment under 20 cM, so these false segments allow the match to be shown on your match list. Sometimes these matches may be the key to figuring out a group of matches; other times these remote cousins may be extremely remote.
              There are many times when a valid segment of, say, 8 cM to 15 cM doesn't show as a match here at FTDNA because there are not enough shared false small segments to reach the 20 cM matching threshold.
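The effect described here can be sketched with a toy version of the threshold rule. This models only the 20 cM total requirement mentioned in this thread; the function name and the segment values are hypothetical, and FTDNA's real matching criteria may include conditions not discussed here.

```python
def appears_on_match_list(segment_cms, total_threshold_cm=20.0):
    """Toy rule: a match is listed only if total shared cM reaches the threshold."""
    return sum(segment_cms) >= total_threshold_cm


# Hypothetical example: one valid 12 cM segment on its own is not listed...
valid_only = [12.0]
# ...but the same segment plus a few small (likely false) segments is.
with_small_segments = [12.0, 2.1, 1.9, 2.4, 1.8]

print(appears_on_match_list(valid_only))
print(appears_on_match_list(with_small_segments))
```

Under this rule, whether a genuine 8 cM to 15 cM segment shows up depends on how many small segments happen to accompany it, not on the quality of the segment itself.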


              • #8
                Originally posted by mattn View Post
                There are many times when a valid segment of, say, 8 cM to 15 cM doesn't show as a match here at FTDNA because there are not enough shared false small segments to reach the 20 cM matching threshold.
                I find this is particularly a problem with admixed individuals, for example, someone who is 89% African but has a valid distant ancestor of European descent. There can be an 18 cM IBD segment that is not showing up as a match because there are not enough tiny false positive matches between them.


                • #9
                  Originally posted by prairielad View Post

                  One benefit of these small segments (at FTDNA at least) is that FTDNA requires at least 20 cM of total shared DNA for a match to be declared. Many distant cousins may share only one valid segment under 20 cM, so these false segments allow the match to be shown on your match list.
                  Why not just remove the 20 cM requirement? It might have seemed like a good idea conceptually, but it isn't working out so well in practice. FTDNA may feel they're stuck with it, though, as it would take a massive amount of recomputation for all those pair-wise comparisons. If they did undertake such a task, maybe they could incorporate X matches at the same time.


                  • #10
                    Originally posted by prairielad View Post
                    I thought I would share a little insight into why it is recommended not to set much store by the validity of segments under 5 cM/500 SNPs.

                    I started a new thread because this is off topic for the original poster's question in the following thread:
                    http://forums.familytreedna.com/showthread.php?t=38542

                    I manually phased the raw data of 14 family members, chromosome by chromosome, based on how we all match each other (full base pair match, half base pair match, no match). See Document 1 in the link below, a partial file of my family's chromosome 22, as an example of what I did. This enabled me to break down our DNA into my grandparents' parent A and parent B chromosomes (I need more data to distinguish maternal from paternal, so I can only say Parent A and Parent B per chromosome).

                    Document 2 in the link is an example of comparing two brothers' unphased data (chromosomes 1 through 22) with parameters set to 1 cM/100 SNPs at GEDmatch.
                    I then compared their maternal phased files and their paternal phased files against each other.

                    The unphased comparison finds 198 segments below 5 cM, while the maternal phased files yield only 13 and the paternal phased files only 17. That means about 85% of these small segments are false, because the algorithm is picking between each brother's two values, one from his maternal and one from his paternal chromosome.

                    Code:
                    https://onedrive.live.com/redir?resid=B112A5E875C2BB5D!305&authkey=!AD4Ltyj7QR1vHVk&ithint=folder%2cxlsx
                    Isn't some of our DNA passed on in whole segments?


                    • #11
                      Originally posted by 1798 View Post
                      Isn't some of our DNA passed on in whole segments?
                      I am not sure what you are asking about.

                      Each parent passes on 50% of their DNA to a child in the form of 23 single chromosomes. With 23 single chromosomes from the father and 23 from the mother, we have 46 single chromosomes. Each single chromosome from a parent is a random mixture of that parent's own maternal and paternal chromosome pair, made up of alternating segments of each.
                      Sometimes a child will receive a full maternal or paternal single chromosome from a parent, one that did not recombine with the parent's opposite chromosome of the pair.

                      With regard to phased versus unphased data, at any given position a child never receives both of a parent's maternal and paternal values, only one or the other (a 50% chance of each).

                      For example, let's say the bolded letter is the value from the paternal chromosome and the unbolded letter the value from the maternal chromosome.
                      RSID        Chr  Position  Result
                      rs35940137  1    930066    GG
                      rs3128117   1    934427    TC
                      rs2465126   1    936897    AA
                      rs2341365   1    938555    AA
                      rs15842     1    938784    CC
                      rs6657048   1    947503    CC
                      rs2710888   1    949705    TC
                      rs3128126   1    952073    AG
                      rs2710875   1    967643    TT
                      rs2465136   1    980280    TC
                      rs2488991   1    984254    TT
                      rs7526076   1    988258    AG
                      rs3766192   1    1007060   TC
                      rs3766191   1    1007450   CC

                      The child will receive either the segment of bolded values or the segment of unbolded values, or part of each if a crossover happens within this stretch.
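A toy illustration of why unphased comparisons produce false small segments may help here. It assumes a simple half-identical rule (two people match at a SNP if they share at least one allele) and uses made-up haplotypes; it is not GEDmatch's or FTDNA's actual algorithm, and all names are hypothetical.

```python
from itertools import product


def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    best = run = 0
    for ok in flags:
        run = run + 1 if ok else 0
        best = max(best, run)
    return best


def unphased_run(person1, person2):
    """Longest half-identical run using genotypes only (allele order unknown)."""
    genotypes1 = zip(*person1)  # one (allele, allele) pair per SNP
    genotypes2 = zip(*person2)
    return longest_run(bool(set(g1) & set(g2))
                       for g1, g2 in zip(genotypes1, genotypes2))


def phased_run(person1, person2):
    """Longest run on which one single chromosome of each person is identical."""
    return max(longest_run(a == b for a, b in zip(hap1, hap2))
               for hap1, hap2 in product(person1, person2))


# Made-up haplotype pairs (two single chromosomes per person, 8 SNPs each).
person1 = ("ACACACAC", "CACACACA")
person2 = ("AAAAAAAA", "CCCCCCCC")

print(unphased_run(person1, person2))  # apparent match across all 8 SNPs
print(phased_run(person1, person2))    # no single chromosome pair matches beyond 1 SNP
```

In this toy case the unphased comparison reports an apparent match across the whole stretch because it can switch freely between each person's two values at every position, while no real single chromosome of one person matches a single chromosome of the other for more than one SNP in a row.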
                      Last edited by prairielad; 4th October 2015, 04:17 AM.


                      • #12
                        Thank you, Prairielad, for sharing the results of your very interesting study with us. I've included a link to this thread and your study in the ISOGG Wiki page on IBD, in the section on false positive matching:

                        http://www.isogg.org/wiki/Identical_...sitive_matches

                        I think many people will find your study of interest.


                        • #13
                          Originally posted by Ann Turner View Post
                          Why not just remove the 20 cM requirement? It might have seemed like a good idea conceptually, but it isn't working out so well in practice. FTDNA may feel they're stuck with it, though, as it would take a massive amount of recomputation for all those pair-wise comparisons. If they did undertake such a task, maybe they could incorporate X matches at the same time.
                          I agree and so does everyone else I know who has studied the issue extensively.


                          • #14
                            Originally posted by prairielad View Post
                            Document 2 in the link is an example of comparing two brothers' unphased data (chromosomes 1 through 22) with parameters set to 1 cM/100 SNPs at GEDmatch.
                            I then compared their maternal phased files and their paternal phased files against each other.

                            The unphased comparison finds 198 segments below 5 cM, while the maternal phased files yield only 13 and the paternal phased files only 17. That means about 85% of these small segments are false, because the algorithm is picking between each brother's two values, one from his maternal and one from his paternal chromosome.

                            Code:
                            https://onedrive.live.com/redir?resid=B112A5E875C2BB5D!305&authkey=!AD4Ltyj7QR1vHVk&ithint=folder%2cxlsx
                            I've been looking at Document 2 in some detail. At first I thought maybe I could do some calculations showing the difference in segment boundaries for phased vs unphased data ("fuzzy boundaries"), but it's a bit complicated because you're comparing brothers, who can be completely identical for some regions and half-identical for others. Maybe I'll do something similar for other relatives when I get a round tuit (on sale at eBay).

                            I wanted to ask about chromosome 10, which is broken up into quite a few smaller segments. When you look at the unphased data, do you see some mismatches?

                            Incidentally, Jim Bartlett just posted an essay for me on his blog, "Anatomy of an IBS segment," in which I dissect one (a longish one) to show how it can happen.

                            http://segmentology.org/2015/10/02/a...an-ibs-segment


                            • #15
                              Thank you Ann for that very interesting article. I've now added the link to the Wiki pages on IBD and IBS and will be sharing your article widely. There are some very difficult concepts to understand in autosomal DNA testing.
