Big Y enhancements started Oct 10th

  • Big Y enhancements started Oct 10th

    My Big Y results have not been updated yet, so I can't see much of this. There are a number of enhancements in Big Y. The new Big Y results will all use the updated reference model, Hg38. Think of it as a map that the raw Big Y results are compared against to look for changes (variants or SNPs).

    It's more complicated, but I don't know if everyone wants all the details or cares. All prior Big Y results are supposed to be converted to the Hg38 format. This is important so comparisons can be apples to apples. This is the essence of discovering the tree branching in genetic genealogy.

    There are new and improved on-line tools that make analysis easier for more people. There is a Big Y Chromosome Browser, and a Terminal SNP Guide has been added to the Big Y Matching. This is important as the old way was very limited.

    If interested, please become familiar with the Big Y Learning Center.

    https://www.familytreedna.com/learn/...g/big-y/big-y/

  • #2
    If the new Build 38 pages that have appeared are any guide to what will happen after all are converted, there is still totally insufficient info to be sure of generating a correct tree of people. One will still need BED files at the very least, and in some cases BAM files too, as the chromosome browser does not give read map quality info. That's important.

    Based on how many updated web pages have appeared so far,
    I calculate that the update will take 80 days.
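
A rough sketch of how an estimate like this can be made, assuming the conversion proceeds at a constant rate. The function name and the kit counts are made up for illustration; the actual figures behind the 80-day estimate are not given in this thread.

```python
def estimate_total_days(done, total, elapsed_days):
    """Linear extrapolation: if `done` of `total` kits were converted in
    `elapsed_days`, estimate how many days the whole conversion takes."""
    if done <= 0:
        raise ValueError("need at least one converted kit to extrapolate")
    return elapsed_days * total / done

# Hypothetical numbers only: 50 of 800 kits updated after 5 days.
print(estimate_total_days(done=50, total=800, elapsed_days=5))
```

The estimate is only as good as the constant-rate assumption; a rollout that speeds up after early kinks are fixed would finish sooner.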



    • #3
      Originally posted by dtvmcdonald View Post
      If the new Build 38 pages that have appeared are any guide
      to what will happen after all are converted, there is still totally insufficient info to be sure of generating a correct
      tree of people. One will still need BED files at the very least, and in some cases BAM files too, as the chromosome browser does not give read map quality info. That's important.
      It appears that the VCF files are much more robust in providing the quality of the test call per location, but they may not include locations with fewer than 4x reads.

      The Big Y Chromosome Browser needs to support going directly to a specified location, even if it has fewer than 4x reads, in my opinion. Otherwise, we have to go through all the BAM rigmarole anyway, at least down at the real genetic genealogy levels of the last several hundred years.

      Dtvmcdonald, are you of the opinion that if there is no derived call in the VCF and the location is within a BED region, we can reliably assume it is truly ancestral? We've had luck with that in past analyses, but it is sometimes wrong. With greater reporting in the VCF files, maybe down to 4x, maybe that just about solves the problem. Maybe not.

      What are the qualifications for BED region inclusion? Is FTDNA including locations with only 4x coverage? I don't know. I can probably live with that, but there will be cases where external BAM interpretations are needed anyway unless FTDNA allows one to go to a specific location on the Big Y Browser.

      Originally posted by dtvmcdonald View Post
      Based on how many updated web pages have appeared so far,
      I calculate that the update will take 80 days.
      I don't know if it will take that long, but it is going slower than I thought. I thought it was 5-7 days so we shall see how much progress is made by about Wednesday.

      In any case, I can easily see it taking 80 days to work out the kinks.
      Last edited by mwwalsh; 15th October 2017, 03:22 PM.



      • #4
        The "New Big Y Results" email sent by FTDNA is outdated. It uses the old (no longer existing) Big Y learning center page link among other outdated details. I reported it to FTDNA and I was told the following earlier today (Sunday).

        We are going through a Big Y update right now. Big Y 2.0 will be completed by the end of next week.



        • #5
          Originally posted by The_Contemplator View Post
          The "New Big Y Results" email sent by FTDNA is outdated. It uses the old (no longer existing) Big Y learning center page link among other outdated details. I reported it to FTDNA and I was told the following earlier today (Sunday).
          I'm not sure which link is the old one, but on Thursday I received this URL by email from the FTDNA project admin help desk:
          https://www.familytreedna.com/learn/...g/big-y/big-y/

          It looks relevant, as it talks about the new Terminal SNP Guide and the new Big Y on-line Browser.



          • #6
            Right. I'm aware of the new one. The old one is this one:
            http://www.familytreedna.com/learn/u...ts/big-y-page/



            • #7
              "Dtvmcdonald, are you of the opinion that if there is a no derived call in the VCF and the location is within a BED region we can reliably assume it is truly ancestral?"

              Of course not! There will always be locations that need both a look at the BAM file and simple logic involving the known state of other markers in the person involved. We do not yet know, and won't until we have numerous BAMs, where they set the "boundaries": how low the mapping quality of a "read" can get before they throw out that whole "read". Up until Build 38, that number is zero ... in other words, they use "reads" that map to multiple places!

              With good mapping quality locations, I have not found locations that are in the BED, are not called in the VCF, and are wrong. In other words ... barring bad mapping quality reads, the answer is YES. But ... they DO use bad mapping quality reads. I'm pretty sure that Build 38 adds new areas (around the centromere) that will join the Build 37 region around 22,220,000 where there are lots of bad mapping quality reads. There are clearly bad mapping quality reads showing up in that centromere area on their new chromosome browser.



              Caveat emptor!



              • #8
                Originally posted by dtvmcdonald View Post
                "Dtvmcdonald, are you of the opinion that if there is a no derived call in the VCF and the location is within a BED region we can reliably assume it is truly ancestral?"

                Of course not! There will always be locations that need both a look at the BAM file and simple logic involving the known state of other markers in the person involved. We do not yet know, and won't until we have numerous BAMs, where they set the "boundaries": how low the mapping quality of a "read" can get before they throw out that whole "read". Up until Build 38, that number is zero ... in other words, they use "reads" that map to multiple places!

                With good mapping quality locations, I have not found locations that are in the BED, are not called in the VCF, and are wrong. In other words ... barring bad mapping quality reads, the answer is YES. But ... they DO use bad mapping quality reads. I'm pretty sure that Build 38 adds new areas (around the centromere) that will join the Build 37 region around 22,220,000 where there are lots of bad mapping quality reads. There are clearly bad mapping quality reads showing up in that centromere area on their new chromosome browser.

                Caveat emptor!
                I have never studied this but one of the R1b1a2 citizen-science guys told me that if an SNP was absent from the VCF file AND within a BED region (not on the edge) then it was truly ancestral over 95% of the time. I don't think that was any widespread study, but this general assumption is used in a lot of tree building exercises.

                I agree reviewing the BAM file is better, but even then the answers are not black and white but subject to interpretation.

                If the VCF files are much more robust and have lower thresholds for inclusion of SNPs, we might see a diminished need to review the BAM files in every case. Sometimes it might be more useful just to go ahead and test via Sanger Sequencing to be sure. It's all a time vs cost vs accuracy trade-off.
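
For anyone who wants to apply that rule of thumb mechanically, here is a minimal sketch. It assumes the BED regions are already loaded as sorted, non-overlapping (start, end) pairs in BED convention (0-based start, exclusive end) and that the derived calls from the VCF are a set of 1-based positions. The function names and the `edge_margin` parameter are hypothetical, not anything FTDNA provides.

```python
from bisect import bisect_right

def in_bed_interior(pos, regions, edge_margin=0):
    """True if 1-based `pos` lies inside a BED region and at least
    `edge_margin` bases away from both ends of that region."""
    pos0 = pos - 1  # BED coordinates are 0-based
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos0) - 1  # last region starting at or before pos0
    if i < 0:
        return False
    start, end = regions[i]
    return start + edge_margin <= pos0 < end - edge_margin

def infer_state(pos, regions, derived_positions, edge_margin=0):
    """Apply the 'absent from VCF but inside BED' heuristic to one position."""
    if pos in derived_positions:
        return "derived"
    if in_bed_interior(pos, regions, edge_margin):
        return "presumed ancestral"  # a heuristic, not a confirmed call
    return "unknown"

# Made-up coordinates, for illustration only.
regions = [(100, 200), (300, 400)]
derived = {150}
for pos in (150, 160, 250):
    print(pos, infer_state(pos, regions, derived))
```

Raising `edge_margin` is one way to express the "not on the edge" condition; anything that comes back "presumed ancestral" in a known problem region still deserves a look at the BAM.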



                • #9
                  In my experience your 95% number is low. It's better than that. But 95% is abysmal.

                  Also, however, it is my very strong opinion that in certain regions, including the (former) 22,200,000 region and now probably centromere regions, Sanger can ONLY prove the correct results if it, Sanger, is homozygous for the entire length of its 500 or more base read! If not, which it likely won't be in the problem regions, Illumina 250 or longer reads are the gold standard. This is because Illumina reads (clumps from) single molecules while Sanger reads the average of (amplified copies of) many molecules. Hence with Illumina you can sort the reads out, by hand if necessary or by computer if you (well, actually "I") write a program that requires perfect matching except for the location in question.

                  FTDNA includes in their "known" SNPs perfect examples of the problem. For "some reason" (read: cluelessness) they include NONE of the good CLD SNPs but do include CLD57 and CLD31,
                  neither of which are REAL SNPs!!!!!!!

                  What they are is things that they call as SNPs but that are due to reads from two specific locations, which their caller is misassigning to certain places in the (former) 22,200,000 area. All these reads have "zero" map quality. That is, within their map criterion they map to multiple places. In people who have the REAL CLD57 mutation ... which is apparently a 1700 base long deletion ... there will be no "reads" from the real spot, and many from the wrong one, so they call the base as the wrong one. In people without the real CLD57 there will be a roughly equal number of reads from the two places and, et voila!, a nocall.

                  Sanger won't help. Illumina 250 base reads will, but in this
                  particular location, 150 will usually do as all men with CLD57+ (that is, who are CTS4179+) have another real mutation nearby.
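
To make the mechanism concrete, here is a toy sketch of a base caller with and without a mapping quality cutoff. It is not FTDNA's caller; the function name, the depth and fraction thresholds, and the read counts are all assumptions chosen only to show the effect described above.

```python
from collections import Counter

def call_base(reads, min_mapq=0, min_depth=4, min_fraction=0.8):
    """Toy caller. `reads` is a list of (base, mapq) pairs seen at one position.
    Returns the called base, or None for a no-call."""
    bases = [base for base, mapq in reads if mapq >= min_mapq]
    if len(bases) < min_depth:
        return None  # too little usable coverage
    base, count = Counter(bases).most_common(1)[0]
    if count / len(bases) >= min_fraction:
        return base
    return None  # reads disagree too much

# Man WITHOUT the deletion: good reads from the real spot plus
# MAPQ-0 reads misassigned from the other location.
without_deletion = [("G", 60)] * 5 + [("A", 0)] * 5
# Man WITH the deletion: only the misassigned MAPQ-0 reads remain.
with_deletion = [("A", 0)] * 6

print(call_base(without_deletion))              # None: the two sources cancel out
print(call_base(with_deletion))                 # A: a spurious "SNP"
print(call_base(without_deletion, min_mapq=1))  # G: true base once MAPQ-0 reads are dropped
print(call_base(with_deletion, min_mapq=1))     # None: nothing reliable left
```

With the cutoff at zero the caller reproduces both symptoms, the no-call and the false SNP. Dropping MAPQ-0 reads turns the false SNP into an honest gap in coverage.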

                  Whewwwww that's a long explanation, sorry.



                  • #10
                    Originally posted by dtvmcdonald View Post
                    In my experience your 95% number is low. It's better than that. But 95% is abysmal.

                    Also, however, it is my very strong opinion that in certain regions, including the (former) 22,200,000 region and now probably centromere regions, Sanger can ONLY prove the correct results if it, Sanger, is homozygous for the entire length of its 500 or more base read! If not, which it likely won't be in the problem regions, Illumina 250 or longer reads are the gold standard. This is because Illumina reads (clumps from) single molecules while Sanger reads the average of (amplified copies of) many molecules. Hence with Illumina you can sort the reads out, by hand if necessary or by computer if you (well, actually "I") write a program that requires perfect matching except for the location in question.

                    FTDNA includes in their "known" SNPs perfect examples of the problem. For "some reason" (read: cluelessness) they include NONE of the good CLD SNPs but do include CLD57 and CLD31,
                    neither of which are REAL SNPs!!!!!!!

                    What they are is things that they call as SNPs but that are due to reads from two specific locations, which their caller is misassigning to certain places in the (former) 22,200,000 area. All these reads have "zero" map quality. That is, within their map criterion they map to multiple places. In people who have the REAL CLD57 mutation ... which is apparently a 1700 base long deletion ... there will be no "reads" from the real spot, and many from the wrong one, so they call the base as the wrong one. In people without the real CLD57 there will be a roughly equal number of reads from the two places and, et voila!, a nocall.

                    Sanger won't help. Illumina 250 base reads will, but in this
                    particular location, 150 will usually do as all men with CLD57+ (that is, who are CTS4179+) have another real mutation nearby.

                    Whewwwww that's a long explanation, sorry.
                    This is just my memory, but the actual number given to me was more than 98%. I just didn't want to pin it down as he probably just looked at a couple dozen BAM and VCF/BED file sets to determine that.

                    Have you looked at the new REGIONS.BED files? Is this difficult region included? FTDNA has told me they considered some of these types of regions to be bad, and there was internal disagreement among folks, including the old lab director, during the original Big Y design. Since he is long gone on his own, they can apparently now eliminate the regions they don't like by internal consensus. I don't know if we are talking about the same regions, though.
                    Last edited by mwwalsh; 16th October 2017, 05:29 PM.



                    • #11
                      Originally posted by dtvmcdonald View Post
                      ...which it
                      likely won't be in the problem regions, Illumina 250
                      or longer reads are the gold standard. This is because Illumina reads (clumps from) single molecules while Sanger reads the average of (amplified copies of) many molecules. Hence with Illumina you can, by hand if necessary
                      or by computer if you (well, actually "I") write a program
                      that requires perfect matching except for the location in question.
                      They say they are using Illumina NovaSeq now. Does this have better or worse capability than the 250?



                      • #12
                        Confused

                        Originally posted by mwwalsh View Post
                        My Big Y results have not been updated yet, so I can't see much of this. There are a number of enhancements in Big Y. The new Big Y results will all use the updated reference model, Hg38. Think of it as a map that the raw Big Y results are compared against to look for changes (variants or SNPs).

                        It's more complicated, but I don't know if everyone wants all the details or cares. All prior Big Y results are supposed to be converted to the Hg38 format. This is important so comparisons can be apples to apples. This is the essence of discovering the tree branching in genetic genealogy.

                        There are new and improved on-line tools that make analysis easier for more people. There is a Big Y Chromosome Browser, and a Terminal SNP Guide has been added to the Big Y Matching. This is important as the old way was very limited.

                        If interested, please become familiar with the Big Y Learning Center.

                        https://www.familytreedna.com/learn/...g/big-y/big-y/
                        A sibling's Big Y was due last month and kept being delayed, which is understandable, all things considered. However, this just happened again, to Oct 30 - Nov 13 this time, but the test WAS completed before October 10, 2017. That night it posted his new terminal Haplogroup (correct, as it was the same as mine), and I was able to download the results and print out his SNP certificate, which looked complete. I think they were from the old "method", as the next day the download file was twice as big. The following day it all disappeared except his terminal Haplogroup, which still appeared on his main page and in project spreadsheets.

                        Does anybody know what is going on? Is there a FTDNA status available anywhere?

                        Any information would be appreciated.



                        • #13
                          There is no new regions.bed file that I can see, let alone get.

                          No files at all have yet appeared in the core of my project,
                          CTS4179+.



                          • #14
                            Originally posted by eztimes View Post
                            Does anybody know what is going on? Is there a FTDNA status available anywhere?
                            Ignore the new deadline. FTDNA is going through a major update of the Big Y result pages. Once your kit has been reprocessed you will have access to the new results page. It will look something like this:

                            https://www.familytreedna.com/learn/...g/big-y/big-y/



                            • #15
                              Originally posted by dtvmcdonald View Post
                              There is no new regions.bed file that I can see, let alone get.

                              No files at all have yet appeared in the core of my project,
                              CTS4179+.
                              I've seen a few of the late summer orders come in last week. The REGIONS.BED file is about 35% longer, so there is much more definition in the way of regions. There are some complaints that many of these are too short. I don't know. What's the location range of the region you are concerned about (in Hg38 parlance)?

