After rerunning a TMRCA calculation for a historical benchmark clade under my new, slightly tweaked algorithm, I had an insight about Discover's (probable) methodology that I think very likely explains their very weird, shifting estimates for R-FGC23343 and its subclades. My guess is that their fundamental orientation is opposite to mine: they focus on the distance from a fixed, remote ancestor to a given clade, whereas I focus on the distance between the (only relatively) fixed dates associated with the living donors and a given clade.
Orientation.png
In theory, both approaches should return the same results--but only if mutation, and the observation and reporting of mutations, functioned like a perfect genetic clock instead of the highly variable stochastic processes that we know they actually are. For example, with respect to the MRCA of BY32575 and FGC23343, the (currently reported) typical donor variant counts are ~62.25 and ~45.16, respectively.
That's a discrepancy of about 17 variants, or about 27% of the longer (and in my opinion, more likely correct) typical count for BY32575. A discrepancy on this scale is shocking and raises some urgent questions about exactly how it arose. To date I've explored the possibility that some SNPs currently reported within the anomalously long block FGC28370 properly belong to a different, ancestral block, but at least for the data reported for my cousin's kit, FT37222, that doesn't seem likely. At the moment I have a lot of questions but no answers.
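For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch. The two counts are the ones quoted above; the function and variable names are just mine for illustration and have nothing to do with how Discover computes anything.

```python
# Typical donor variant counts as currently reported (values quoted in the post).
count_by32575 = 62.25
count_fgc23343 = 45.16


def discrepancy(longer, shorter):
    """Return the absolute gap and the gap as a fraction of the longer count."""
    gap = longer - shorter
    return gap, gap / longer


gap, share = discrepancy(count_by32575, count_fgc23343)
print(f"gap = {gap:.2f} variants, {share:.1%} of the longer count")
```

The percentage is taken against the longer BY32575 count because that is the one I consider more likely to be correct; measured against the shorter count the relative gap would be larger.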
However, I can make some important, objectively factual statements that strongly support the view that my dates are likely to be more accurate than Discover's.
First, all relaxed clock methods displace the adjustments for anomalous data further away from the point of origin. So methods that orient themselves from a fixed common ancestor, as I believe Discover may be doing, are going to attribute anomalies to dates closer to the living donor. There are a number of problems with this:
-The identity of the MRCA may be fixed, but the whole premise of the TMRCA exercise is that we don't know when he was born. Donor birth years do vary, but they are known, and they vary only within a very tight time frame. So MRCA-oriented approaches are working from an unknown quantity towards a known one, which is exactly the opposite of how estimation normally works: you start from what you know and solve for what you don't.
-The recognized, standard way to calculate the probability that several independent events all occur is to multiply the probabilities associated with each individual event.
For the purposes of illustration, let's say that the probability of observing an anomalous variant count over a fixed period of time is 1/4. Attributing this anomaly to a single ancestral clade, as my orientation from the living donor will tend to do, the probability will be 1/4. (1/4)^1=1/4.
But attributing it separately to 3 distinct descendant clades, as the MRCA-oriented approach (probably) used by Discover does, the probability becomes 1/64. (1/4)^3=1/64.
Would you rather play a lottery with a 25% chance of winning or one with only a 1.56% chance of winning?
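The multiplication above can be reproduced in a few lines. This is only a sketch of the illustration, not of anyone's actual dating method: the 1/4 figure is the made-up value from the example, the function name is hypothetical, and the calculation assumes the anomalies in the separate clades are independent of one another, which is what multiplying the probabilities requires.

```python
from fractions import Fraction


def joint_probability(p_single, n_events):
    """Probability that n independent events, each with probability p_single, all occur."""
    return p_single ** n_events


p_anomaly = Fraction(1, 4)  # illustrative value only, taken from the example above

one_clade = joint_probability(p_anomaly, 1)     # anomaly attributed to one ancestral clade
three_clades = joint_probability(p_anomaly, 3)  # anomaly attributed to three descendant clades

print(one_clade, three_clades, f"{float(three_clades):.2%}")
```

Using Fraction keeps the results exact, so the two cases come out as 1/4 and 1/64 rather than rounded decimals.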
By all means, whenever there is specific information available to confirm the specific range of dates within which an anomaly has occurred, overwrite the default, statistically calculated estimate. But if you don't know--and that is the whole premise of TMRCA estimation--why would you pick the objectively least probable estimate?
Just for reference, here is a link to the weird block tree data that started this all. I don't know for a fact that this is how Discover is calculating these dates, but I do think it is a framework that at least makes their weird, constantly shifting estimates understandable, if not reasonable in practice.