Recently somebody showed me this excellent web utility, Hunter Provyn's STR match finder. It seems to be billed mainly as a website to help expand the range of STR marker haplotypes available for comparison, but I personally found it more useful as a tutorial in how to properly calculate genetic distance. The Provyn utility's output includes an intermediate check figure labelled "Log of Sum", which represents the natural log of the product of the mutation rates for the mutated markers, each individually raised to the power of the number of mutations observed, calculated according to the hybrid allele method.

I don't know why it had never occurred to me before, but that is actually the statistically correct method of calculating the genetic distance between two haplotypes. The joint probability of two independent events is calculated as the product of the probabilities associated with each event.
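To make the arithmetic concrete, here is a minimal sketch in Python of a log score of that kind: the natural log of the product of per-marker mutation rates, each raised to the number of mutations observed, computed as a sum of logs. The marker names, the rates and the function name `log_score` are all hypothetical, made up for illustration. The sketch also takes the per-marker mutation counts as given, so it does not try to reproduce how the hybrid allele method counts them or how the Provyn utility is actually implemented.

```python
import math

def log_score(rates, mutations):
    """Return ln(prod(rate ** count)), computed as sum(count * ln(rate)).

    Markers with zero observed mutations contribute nothing to the sum.
    """
    return sum(count * math.log(rates[marker])
               for marker, count in mutations.items() if count > 0)

# Hypothetical markers and per-generation mutation rates (illustrative only).
rates = {"slow_marker": 0.001, "medium_marker": 0.003, "fast_marker": 0.01}

# Two comparisons with the same raw mutation count (GD = 2 in both cases).
pair_fast = {"slow_marker": 0, "medium_marker": 0, "fast_marker": 2}
pair_slow = {"slow_marker": 2, "medium_marker": 0, "fast_marker": 0}

score_fast = log_score(rates, pair_fast)  # 2 * ln(0.01)
score_slow = log_score(rates, pair_slow)  # 2 * ln(0.001)

print(f"fast markers: {score_fast:.3f}  slow markers: {score_slow:.3f}")
```

Both pairs have the same raw count of two mutations, but the pair that differs on the slow marker gets the more negative score, i.e. the less probable outcome. A straight count of mutations cannot make that distinction.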

I was never quite satisfied with the rationales for using the infinite allele method in linear models, although I had successfully replicated (pretty closely) the results of other popular TMRCA calculators using infinite allele, a linear model and a GD-grossup factor to account for potential convergence, calculated from the coefficient of inequality inherent in the chosen mutation rate set. So it seemed 'good enough for government work', as they say.

But knowing what I know now, I figured that an optimal TMRCA calculator would both be based on a logarithmic model AND include an adjustment for potential convergence. Maybe I will one day find out that there is a better way of calculating that convergence adjustment, but for my present purposes I just used that coefficient of inequality. Probably not perfect, but way better than the methods I'd been using before, which were clearly sub-optimal in giving equal weight to mutations at all loci.

All the utilities that I have used to date to generate networks/pedigrees for an array of haplotypes seem to be based on that kind of linear algorithm, one that projects relationships based on the raw, unadjusted number of mutations observed. For the reasons discussed above, I now know that is sub-optimal from the point of view of good statistical methodology. I'm sure that those linear models are capable of returning "okay" estimates, especially for cases where the donors' SNP status is known and the actual MRCA is quite recent, but I like to work with much more remote relationships, in order to capture sample sizes that are large enough to approach statistical significance.

I think I've come up with a trick to optimize this networking exercise: derive the coefficients of correlation for the logarithmically estimated TMRCAs among a set of benchmark haplotypes selected on the basis of SNP status. The correspondence of a selected individual haplotype to this range can be expressed as a percentage, and might be thought of as a kind of STR fingerprinting index for the relevant SNP block. In other words, you can predict the (relative) likelihood of a given STR haplotype testing positive for a given SNP by comparing the consistency of TMRCA estimates with a set of reference STR haplotypes.
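As a rough illustration only, here is one possible reading of that index in Python: take the candidate's TMRCA estimates against each benchmark haplotype, take a typical TMRCA for each benchmark against the rest of the benchmark set, and express the Pearson correlation between the two vectors as a percentage. Everything here is an assumption made for the sketch: the haplotype labels, the TMRCA figures (in generations), the idea of a single "benchmark profile", and the names `pearson` and `fingerprint_index`. It is not a description of the actual spreadsheet behind the results discussed here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fingerprint_index(candidate, benchmark_profile):
    """Correlation, as a percentage, between a candidate's TMRCA estimates
    and the benchmark profile, matched up by reference haplotype label."""
    refs = sorted(benchmark_profile)
    x = [candidate[r] for r in refs]
    y = [benchmark_profile[r] for r in refs]
    return 100.0 * pearson(x, y)

# Hypothetical TMRCA estimates in generations (illustrative values only).
# benchmark_profile[r]: typical TMRCA of reference r against the other benchmarks.
benchmark_profile = {"ref_1": 20.0, "ref_2": 35.0, "ref_3": 50.0, "ref_4": 28.0}
# candidate[r]: TMRCA estimate between the candidate haplotype and reference r.
close_candidate = {"ref_1": 22.0, "ref_2": 36.0, "ref_3": 49.0, "ref_4": 30.0}
distant_candidate = {"ref_1": 48.0, "ref_2": 30.0, "ref_3": 25.0, "ref_4": 41.0}

close_index = fingerprint_index(close_candidate, benchmark_profile)
distant_index = fingerprint_index(distant_candidate, benchmark_profile)

print(f"close: {close_index:.1f}%  distant: {distant_index:.1f}%")
```

A candidate whose TMRCA estimates rise and fall in step with the benchmark profile scores near 100%, while one whose estimates run the other way scores below zero. With only a handful of benchmarks the correlation is very noisy, so in practice a reference set would need to be much larger than the four used here.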

It has its limitations. It's highly sensitive to possible over- and under-representation of haplotypes within the reference set. Like any method short of direct, high-resolution SNP testing, it can return an incorrect result, especially when comparisons are made between closely related clades. But it still seems way better than the naive methods I am familiar with, which emphasize the matching of allele values at specific markers selected as 'diagnostic', although I suppose this could be developed into a method of optimizing the identification of diagnostic markers.

This exercise also showed me that this fingerprinting index corresponded closely to the relative position of the donors' terminal clade in FTDNA's block tree. The higher the index %, the more defined that donor's terminal subclade will be with regard to the reference clade. Of course that's a reflection of skewing of the source data, but it's still useful as an objective way to make difficult decisions in deriving a cladogram or pedigree.

This method can also be applied to haplotypes occupying undifferentiated spaces under a given reference clade. Helpful, since it is very unlikely that all undifferentiated lineages actually diverged at exactly the same time.

The practical result of this exercise applied to my work on the FT372222 subclade of FGC23343 is that the Saddington donor really does seem to have a connection to Western Normandy, very consistent with the surname's historical origins, but much less consistent with the concentration of its ancestral clade, Z209, around the Basque country in Spain and France. Best estimate, based on 111 and 67 haplotypes, is an MRCA with a family from the Channel Islands born around 1100 A.D. Very useful in resolving an ambiguous case.

https://forums.familytreedna.com/for...522#post331522

https://forums.familytreedna.com/for...343#post332145
