
BIG Y and SNP Discovery This area is for talk about BIG Y results.

  #1  
Old 15th October 2017, 02:57 PM
mwwalsh mwwalsh is offline
FTDNA Customer
 
Join Date: Oct 2007
Posts: 152
Big Y enhancements started Oct 10th

My Big Y results have not been updated yet, so I can't verify much of this. There are a number of enhancements in Big Y. The new Big Y results will all use the updated reference model, Hg38. Think of it as a map that the raw Big Y results are compared against to look for changes (variants or SNPs).

It's more complicated than that, but I don't know if everyone wants or cares about all the details. All prior Big Y results are supposed to be converted to the Hg38 format. This is important so that comparisons can be apples to apples. This is the essence of discovering the tree branching in genetic genealogy.

There are new and improved online tools that make analysis easier for more people. There is a Big Y Chromosome Browser, and a Terminal SNP Guide has been added to Big Y Matching. This is important, as the old way was very limited.

If interested, please become familiar with the Big Y Learning Center.

https://www.familytreedna.com/learn/...g/big-y/big-y/
  #2  
Old 15th October 2017, 03:57 PM
dtvmcdonald dtvmcdonald is offline
mtDNA: | Big Y Pending
 
Join Date: Mar 2011
Posts: 213
If the new Build 38 pages that have appeared are any guide to what will happen after all are converted, there is still totally insufficient info to be sure of generating a correct tree of people. One will still need BED files at the very least, and in some cases BAM files too, as the chromosome browser does not give read mapping quality info. That's important.

Based on how many updated web pages have appeared so far, I calculate that the update will take 80 days.
  #3  
Old 15th October 2017, 04:13 PM
mwwalsh mwwalsh is offline
FTDNA Customer
 
Join Date: Oct 2007
Posts: 152
Quote:
Originally Posted by dtvmcdonald View Post
If the new Build 38 pages that have appeared are any guide
to what will happen after all are converted, there is still totally insufficient info to be sure of generating a correct
tree of people. One will still need BED files at the very least, and in some cases BAM files too, as the chromosome browser does not give read map quality info. That's important.
It appears that the VCF files are much more robust in reporting the quality of the test call per location, but they may not include locations with fewer than 4x reads.

The Big Y Chromosome Browser needs to support going directly to a specified location, even if it has fewer than 4x reads, in my opinion. Otherwise, we have to go through all the BAM rigmarole anyway, or at least we do down at the real genetic genealogy levels of the last several hundred years.

Dtvmcdonald, are you of the opinion that if there is no derived call in the VCF and the location is within a BED region, we can reliably assume it is truly ancestral? We've had luck with that in past analyses, but it is sometimes wrong. With greater reporting in the VCF files, maybe down to 4x, maybe that just about solves the problem. Maybe not.

What are the qualifications for inclusion in the BED regions? Is FTDNA including locations with only 4x coverage? I don't know. I can probably live with that, but there will be cases where external BAM interpretations are needed anyway unless FTDNA allows one to go to a specific location on the Big Y Browser.

Quote:
Originally Posted by dtvmcdonald View Post
Based on how many updated web pages have appeared so far,
I calculate that the update will take 80 days.
I don't know if it will take that long, but it is going slower than I thought. I thought it would be 5-7 days, so we shall see how much progress is made by about Wednesday.

In any case, I can easily see it taking 80 days to work out the kinks.

Last edited by mwwalsh; 15th October 2017 at 04:22 PM.
  #4  
Old 15th October 2017, 08:44 PM
The_Contemplator The_Contemplator is offline
FTDNA Customer
 
Join Date: Jun 2015
Posts: 701
The "New Big Y Results" email sent by FTDNA is outdated. It uses the old (no longer existing) Big Y learning center page link among other outdated details. I reported it to FTDNA and I was told the following earlier today (Sunday).

Quote:
We are going through a Big Y update right now. Big Y 2.0 will be completed by the end of next week.
  #5  
Old 15th October 2017, 09:33 PM
mwwalsh mwwalsh is offline
FTDNA Customer
 
Join Date: Oct 2007
Posts: 152
Quote:
Originally Posted by The_Contemplator View Post
The "New Big Y Results" email sent by FTDNA is outdated. It uses the old (no longer existing) Big Y learning center page link among other outdated details. I reported it to FTDNA and I was told the following earlier today (Sunday).
I'm not sure which link is the old one, but on Thursday I received this URL by email from the FTDNA project admin help desk:
https://www.familytreedna.com/learn/...g/big-y/big-y/

It looks relevant, as it talks about the new Terminal SNP Guide and the new online Big Y Browser.
  #6  
Old 15th October 2017, 11:05 PM
The_Contemplator The_Contemplator is offline
FTDNA Customer
 
Join Date: Jun 2015
Posts: 701
Right. I'm aware of the new one. The old one is this one:
http://www.familytreedna.com/learn/u...ts/big-y-page/
  #7  
Old 16th October 2017, 11:06 AM
dtvmcdonald dtvmcdonald is offline
mtDNA: | Big Y Pending
 
Join Date: Mar 2011
Posts: 213
"Dtvmcdonald, are you of the opinion that if there is a no derived call in the VCF and the location is within a BED region we can reliably assume it is truly ancestral?"

Of course not! There will always be locations that need both a look at the BAM file and simple logic involving the known state of other markers in the person involved. We do not yet know, and won't until we have numerous BAMs, where they set the "boundary" on "read" mapping quality, that is, how low the MAP quality of a "read" can get before they throw out that whole "read". Up until Build 38, that number is zero ... in other words, they use "reads" that map to multiple places!

With good mapping quality locations, I have not found locations that are not called in the VCF, are in the BED, and are wrong. In other words ... barring bad mapping quality reads, the answer is YES. But ... they DO use bad mapping quality reads. I'm pretty sure that Build 38 adds new areas (around the centromere) that will add to the Build 37 regions around 22,220,000 where there are lots of bad mapping quality reads. There are what are clearly bad mapping quality reads showing up in that centromere area on their new chromosome browser.



Caveat emptor!
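
For anyone who wants to see what dropping the zero map quality "reads" would look like, here is a minimal sketch. It assumes the BAM has already been converted to SAM text, relies only on the fact that MAPQ is the fifth tab-separated SAM column, and uses a made-up function name (`filter_by_mapq`), a made-up threshold, and made-up example reads. It is not FTDNA's pipeline.

```python
def parse_sam_line(line):
    """Return (read_name, chrom, pos, mapq) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    # SAM columns: 1=QNAME, 3=RNAME, 4=POS (1-based), 5=MAPQ
    return fields[0], fields[2], int(fields[3]), int(fields[4])


def filter_by_mapq(sam_lines, min_mapq=1):
    """Keep alignment lines with MAPQ >= min_mapq; header lines ('@') are skipped.

    min_mapq=1 is a hypothetical threshold that only drops MAPQ 0 reads,
    which many aligners assign to reads that map equally well to several places.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        _, _, _, mapq = parse_sam_line(line)
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Made-up example records, not real Big Y data.
example = [
    "@SQ\tSN:chrY\tLN:57227415",
    "read1\t0\tchrY\t2781500\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchrY\t2781502\t0\t10M\t*\t0\t0\tGTACGTACGT\tIIIIIIIIII",
]
kept = filter_by_mapq(example, min_mapq=1)
for line in kept:
    print(parse_sam_line(line))
```

Raising `min_mapq` is stricter but throws away more coverage, so the right value is a judgment call.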
  #8  
Old 16th October 2017, 12:04 PM
mwwalsh mwwalsh is offline
FTDNA Customer
 
Join Date: Oct 2007
Posts: 152
Quote:
Originally Posted by dtvmcdonald View Post
"Dtvmcdonald, are you of the opinion that if there is a no derived call in the VCF and the location is within a BED region we can reliably assume it is truly ancestral?"

Of course not! There will always be locations that need both a look at the BAM file and simple logic involving the known state of other markers in the person involved. We do not yet know, and won't until we have numerous BAMs, where they set the "boundary" on "read" mapping quality, that is, how low the MAP quality of a "read" can get before they throw out that whole "read". Up until Build 38, that number is zero ... in other words, they use "reads" that map to multiple places!

With good mapping quality locations, I have not found locations that are not called in the VCF, are in the BED, and are wrong. In other words ... barring bad mapping quality reads, the answer is YES. But ... they DO use bad mapping quality reads. I'm pretty sure that Build 38 adds new areas (around the centromere) that will add to the Build 37 regions around 22,220,000 where there are lots of bad mapping quality reads. There are what are clearly bad mapping quality reads showing up in that centromere area on their new chromosome browser.

Caveat emptor!
I have never studied this but one of the R1b1a2 citizen-science guys told me that if an SNP was absent from the VCF file AND within a BED region (not on the edge) then it was truly ancestral over 95% of the time. I don't think that was any widespread study, but this general assumption is used in a lot of tree building exercises.

I agree reviewing the BAM file is better, but even then the answers are not black and white but subject to interpretation.

If the VCF files are much more robust and have lower thresholds for inclusion of SNPs, we might see a diminished need to review the BAM files in every case. Sometimes it might be more useful just to go ahead and test via Sanger sequencing to be sure. It's all a time vs. cost vs. accuracy trade-off.
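
The "absent from the VCF AND within a BED region (not on the edge)" rule of thumb can be sketched roughly as below. This is only an illustration: the function names, the `edge_margin` parameter, and the example coordinates are made up, the VCF is reduced to a set of derived positions, and the result is a presumption, not a confirmed call.

```python
def in_bed_interior(pos, regions, edge_margin=0):
    """True if 1-based pos lies inside a BED interval, at least
    edge_margin bases away from either end of that interval."""
    zero_based = pos - 1  # VCF positions are 1-based
    for start, end in regions:  # BED intervals are 0-based, half-open
        if start + edge_margin <= zero_based < end - edge_margin:
            return True
    return False


def presumed_state(pos, derived_positions, regions, edge_margin=0):
    """Apply the rule of thumb to one position.

    derived_positions: set of 1-based positions with a derived call in the VCF.
    regions: list of (start, end) tuples taken from the BED file.
    """
    if pos in derived_positions:
        return "derived"
    if in_bed_interior(pos, regions, edge_margin):
        return "presumed ancestral"
    return "unknown"


# Made-up coordinates for illustration only.
regions = [(1000, 2000)]
derived = {1500}
for pos in (1500, 1200, 1001, 5000):
    print(pos, presumed_state(pos, derived, regions, edge_margin=10))
```

Anything that comes back "presumed ancestral" in a known problem region is still a candidate for a BAM review or a Sanger test.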
  #9  
Old 16th October 2017, 04:54 PM
dtvmcdonald dtvmcdonald is offline
mtDNA: | Big Y Pending
 
Join Date: Mar 2011
Posts: 213
In my experience your 95% number is low. It's better than that. But 95% would be abysmal.

Also, however, it is my very strong opinion that in certain regions, including the (former) 22,200,000 region and now probably centromere regions, Sanger can ONLY prove the correct results if it, Sanger, is homozygous for the entire length of its 500 or more base read! If not, which it likely won't be in the problem regions, Illumina 250 or longer reads are the gold standard. This is because Illumina reads (clumps from) single molecules while Sanger reads the average of (amplified copies of) many molecules. Hence with Illumina you can sort the reads out, by hand if necessary or by computer if you (well, actually "I") write a program that requires perfect matching except for the location in question.
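
A toy version of the "perfect matching except for the location in question" idea might look like the following. It is not the actual program mentioned above. It assumes ungapped reads given as (sequence, 0-based start) pairs against a short reference string, and every name and sequence in it is hypothetical.

```python
from collections import Counter


def matches_except_at(read_seq, read_start, reference, site):
    """True if the read covers site and equals the reference at every
    other covered position. read_start and site are 0-based indices
    into reference; indels and clipping are not handled."""
    read_end = read_start + len(read_seq)
    if read_start < 0 or read_end > len(reference):
        return False
    if not (read_start <= site < read_end):
        return False  # the read does not cover the site at all
    for offset, base in enumerate(read_seq):
        ref_pos = read_start + offset
        if ref_pos != site and base != reference[ref_pos]:
            return False
    return True


def bases_at_site(reads, reference, site):
    """Count the bases seen at site, using only perfectly matching reads."""
    counts = Counter()
    for seq, start in reads:
        if matches_except_at(seq, start, reference, site):
            counts[seq[site - start]] += 1
    return counts


# Made-up reference and reads for illustration only.
reference = "ACGTACGTAC"
site = 4
reads = [
    ("GTTCG", 2),  # differs from the reference only at the site
    ("GTACG", 2),  # identical to the reference
    ("GAACG", 2),  # has a mismatch away from the site, so it is rejected
]
counts = bases_at_site(reads, reference, site)
print(counts)
```

A read that really belongs to a near-identical copy elsewhere tends to carry extra mismatches, which is why the strict match requirement filters it out.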

FTDNA includes in their "known" SNPs perfect examples of the problem. For "some reason" (read: cluelessness) they include NONE of the good CLD SNPs but do include CLD57 and CLD31, neither of which are REAL SNPs!!!!!!!

What they are is things that they call as SNPs but that are due to reading two specific locations which their caller is misassigning to certain places in the (former) 22,200,000 area. All these reads have "zero" map quality. That is, within their map criterion they map to multiple places. In people who have the REAL CLD57 mutation ... which is apparently a 1700 base long delete ... there will be no "reads" from the real spot, and many from the wrong one, so they call the base as the wrong one. In people without the real CLD57 there will be a roughly equal number of reads from the two places and et voila!, a nocall.

Sanger won't help. Illumina 250 base reads will, but in this particular location, 150 will usually do, as all men with CLD57+ (that is, who are CTS4179+) have another real mutation nearby.
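
The CLD57 effect can be illustrated with a toy caller. This is a sketch under stated assumptions, not FTDNA's caller: `call_base`, the 4x minimum depth, the 80% agreement threshold, and the base letters are all made up for illustration.

```python
from collections import Counter


def call_base(bases, min_depth=4, min_fraction=0.8):
    """Return the majority base if there is enough depth and agreement,
    otherwise None (a no-call). Both thresholds are hypothetical."""
    if len(bases) < min_depth:
        return None
    base, count = Counter(bases).most_common(1)[0]
    if count / len(bases) >= min_fraction:
        return base
    return None


# Made-up pileups: "A" stands for reads from the real spot,
# "G" for reads from the other copy that get mapped to the same place.
real_spot_reads = ["A"] * 5
other_copy_reads = ["G"] * 5

# Without the deletion, both copies contribute reads: mixed pileup.
without_deletion = call_base(real_spot_reads + other_copy_reads)

# With the deletion, the real spot yields no reads at all.
with_deletion = call_base([] + other_copy_reads)

print("without deletion:", without_deletion)
print("with deletion:", with_deletion)
```

The mixed pileup falls below the agreement threshold and becomes a no-call, while the pileup made up only of mis-mapped reads produces a confident but wrong base, which then looks like a SNP.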

Whewwwww that's a long explanation, sorry.
  #10  
Old 16th October 2017, 05:37 PM
mwwalsh mwwalsh is offline
FTDNA Customer
 
Join Date: Oct 2007
Posts: 152
Quote:
Originally Posted by dtvmcdonald View Post
In my experience your 95% number is low. It's better than that. But 95% would be abysmal.

Also, however, it is my very strong opinion that in certain regions, including the (former) 22,200,000 region and now probably centromere regions, Sanger can ONLY prove the correct results if it, Sanger, is homozygous for the entire length of its 500 or more base read! If not, which it likely won't be in the problem regions, Illumina 250 or longer reads are the gold standard. This is because Illumina reads (clumps from) single molecules while Sanger reads the average of (amplified copies of) many molecules. Hence with Illumina you can sort the reads out, by hand if necessary or by computer if you (well, actually "I") write a program that requires perfect matching except for the location in question.

FTDNA includes in their "known" SNPs perfect examples of the problem. For "some reason" (read: cluelessness) they include NONE of the good CLD SNPs but do include CLD57 and CLD31, neither of which are REAL SNPs!!!!!!!

What they are is things that they call as SNPs but that are due to reading two specific locations which their caller is misassigning to certain places in the (former) 22,200,000 area. All these reads have "zero" map quality. That is, within their map criterion they map to multiple places. In people who have the REAL CLD57 mutation ... which is apparently a 1700 base long delete ... there will be no "reads" from the real spot, and many from the wrong one, so they call the base as the wrong one. In people without the real CLD57 there will be a roughly equal number of reads from the two places and et voila!, a nocall.

Sanger won't help. Illumina 250 base reads will, but in this particular location, 150 will usually do, as all men with CLD57+ (that is, who are CTS4179+) have another real mutation nearby.

Whewwwww that's a long explanation, sorry.
This is just my memory, but the actual number given to me was more than 98%. I just didn't want to pin it down as he probably just looked at a couple dozen BAM and VCF/BED file sets to determine that.

Have you looked at the new REGIONS.BED files? Is this difficult region included? FTDNA has told me they considered some of the types of regions to be bad, and there was internal disagreement between folks, including the old lab director, during the original Big Y design. Since he is long gone on his own, they can apparently now eliminate the regions they don't like with internal consensus. I don't know if we are talking about the same regions, though.

Last edited by mwwalsh; 16th October 2017 at 06:29 PM.

All times are GMT -5.