I've just spent some time reviewing some of the Y-STR calculators out there. Some conclusions and some questions follow. Informed commentary welcome.
I see that most of them have a very crude method for determining the modal value in ambiguous cases where, strictly speaking, the marker has no unique mode. Generally, they seem to assume that the lowest value is ancestral. I have heard that there is evidence that observed father-son mutations tend to be upward rather than downward, but I sincerely doubt this trend is strong enough to justify such a radical assumption when constructing modal haplotypes. You can maybe argue that the particular case I observed is not representative, but I found that this default produced absolute nonsense when computing genetic distances for individuals who are apparently equidistant from the MRCA according to high-resolution SNP tests. Instead, for my particular case, I performed a manual calculation using the median value.
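For anyone who wants to try the same thing, here is a minimal Python sketch of the median fallback described above. It is only an illustration, not what any particular calculator does: the function name `modal_or_median` is made up, and the choice of `median_low` (so that the result is always an allele value that was actually observed) is an assumption.

```python
from collections import Counter
from statistics import median_low


def modal_or_median(alleles):
    """Return the unique mode of the repeat counts at one marker.

    If several values tie for most frequent (no unique mode), fall back
    to the low median of all observed values instead of assuming that
    the lowest tied value is ancestral.
    """
    counts = Counter(alleles).most_common()
    top_count = counts[0][1]
    tied = [value for value, count in counts if count == top_count]
    if len(tied) == 1:
        return tied[0]
    # Assumption: median_low keeps the answer on an observed allele value.
    return median_low(alleles)


# One marker, five donors: 12 and 14 tie for most frequent.
marker_values = [12, 12, 14, 14, 15]
print(modal_or_median(marker_values))  # median fallback
print(min(marker_values))              # what a "lowest value" rule would pick
```

With this input the two rules disagree, which is exactly the kind of case where the choice of default matters.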
This solved one problem and uncovered another, although one that was easily resolved. Because I was using the infinite alleles method, there were some otherwise inexplicable patterns in the GDs between donors, even though each donor now appeared to be equidistant from the MRCA. I switched to the stepwise method, and (almost) all the anomalies appeared to be corrected.
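The difference between the two counting methods is easy to see in a small Python sketch. The haplotypes here are invented, and the functions ignore the special handling that real calculators apply to multi-copy and null markers, so treat this as a simplified illustration only.

```python
def gd_infinite_alleles(hap_a, hap_b):
    """Count each marker that differs as one mutation, whatever the gap."""
    return sum(1 for a, b in zip(hap_a, hap_b) if a != b)


def gd_stepwise(hap_a, hap_b):
    """Count each repeat of difference at a marker as one mutation."""
    return sum(abs(a - b) for a, b in zip(hap_a, hap_b))


# Hypothetical four-marker haplotypes (repeat counts per marker).
donor_1 = [13, 24, 14, 11]
donor_2 = [13, 26, 14, 12]

print(gd_infinite_alleles(donor_1, donor_2))  # two markers differ
print(gd_stepwise(donor_1, donor_2))          # 2 steps + 1 step
```

The two-step gap at the second marker counts once under infinite alleles and twice under the stepwise method, so the two GDs diverge as soon as any marker differs by more than one repeat.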
There was still the matter of one marker that a single donor reported at a two-step difference from the modal, which led me to revisit what I think I know about multi-step mutations. From what I understand (maybe not accurately, or at least not up to date), multi-step mutations are more likely on multi-copy markers, but overall they still make up somewhat less than 10% of all mutations. This, at least, is my inference from the exercise I describe here. But if that's true, why do the calculators default to the infinite alleles method? And it's not just the McGee calculator that does this.
Then there is the matter of assumed per-locus mutation rates. McGee's utility was created, I believe, before much research was published on this, so it's not surprising that its rates are a little arbitrary and, I think, way too high in light of more recent data (e.g., Heinila 2012). That is understandable.
A little harder to work through are the various adjustments that other calculators appear to make. I believe that most of the popular calculators do this, including FTDNA's "Tip", but I have never heard them described more specifically than as "inverse weighting" adjustments. All I can make of this is that it relates in some way to the wide disparity in mutation rates between individual loci.
I get that these disparities will have a significant impact on the probability of hidden "convergent" mutations, but it seems to me that a straightforward tweak involving a Lorenz analysis of the AVERAGE rates is all that is required. In fact, the figures I arrive at with such a method typically come within 5% of the post-adjustment rates used by some of these calculators.
What is currently inexplicable to me is why anyone felt it was necessary, or even a good idea, to adjust the rates in a way that produces different age estimates for comparisons with the same genetic distance, where the only thing that differs is which loci carry the mismatches. That just seems unrealistically cute to me. There is an irreducible randomness in this process, and becoming this specific seems like splitting hairs. It's probably not a huge deal, since, as I said, my independent calculations of an adjusted rate typically differ by less than 5%, but conceptually it is difficult for me to understand.
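To make the effect concrete, here is a toy Python sketch. It is emphatically not the method used by FTDNA, McGee, or any other calculator; it just assumes a simple infinite-alleles likelihood in which a locus with rate mu differs after t generations with probability 1 - exp(-2 * mu * t), and it uses invented rates. The names `log_likelihood` and `ml_generations` are hypothetical.

```python
import math


def log_likelihood(t, rates, differs):
    """Log-likelihood of the observed match/mismatch pattern at t generations.

    Assumed model: the two lineages each span t generations back to the
    MRCA, so a locus with rate mu is still identical with probability
    exp(-2 * mu * t).
    """
    total = 0.0
    for mu, is_different in zip(rates, differs):
        if is_different:
            total += math.log(1.0 - math.exp(-2.0 * mu * t))
        else:
            total += -2.0 * mu * t
    return total


def ml_generations(rates, differs, max_t=2000):
    """Grid search for the whole number of generations with highest likelihood."""
    return max(range(1, max_t + 1),
               key=lambda t: log_likelihood(t, rates, differs))


# Invented per-locus rates: two slow markers, two fast markers.
rates = [0.001, 0.001, 0.01, 0.01]
flat_rates = [sum(rates) / len(rates)] * len(rates)  # average rate everywhere

# Two comparisons, both with GD = 1, differing only in WHICH locus mismatches.
slow_mismatch = [True, False, False, False]
fast_mismatch = [False, False, True, False]

t_slow = ml_generations(rates, slow_mismatch)
t_fast = ml_generations(rates, fast_mismatch)
t_flat_slow = ml_generations(flat_rates, slow_mismatch)
t_flat_fast = ml_generations(flat_rates, fast_mismatch)

print(t_slow, t_fast)            # per-locus rates: estimates differ
print(t_flat_slow, t_flat_fast)  # average rate: estimates agree
```

Under these assumptions the per-locus version gives two different estimates for the same GD, while the average-rate version gives the same answer for both comparisons, which is the behavior being questioned above.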