Supposedly, multi-step STR mutations account for about 10% of all mutations. Adjusting for this dramatically improved the power of my algorithm to accurately predict SNP status.
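For illustration only, here is one simple way a multi-step adjustment could look. It is not necessarily the adjustment used in this post. The function name `effective_events` and the two-outcome model (every mutation event is either a 1-step or a 2-step change) are assumptions made for the sketch, and back-mutations are ignored.

```python
def effective_events(step_difference, multistep_share=0.10):
    """Estimate how many mutation events lie behind an observed repeat-count gap.

    Assumption (not from the post): each event is a 2-step change with
    probability `multistep_share` and a 1-step change otherwise, so the
    average event moves 1 + multistep_share repeats.
    """
    if not 0 <= multistep_share < 1:
        raise ValueError("multistep_share must be in [0, 1)")
    return abs(step_difference) / (1.0 + multistep_share)


# A 2-repeat gap counts as slightly fewer than 2 events under this model.
print(effective_events(2))
```

With the share set to 0, the function falls back to the plain stepwise count, which makes it easy to compare results with and without the adjustment.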
I've been working with a very small dataset, so take that for what it's worth. But also in that vein, some of the SNP blocks between which I'm attempting to discriminate are extremely closely related, separated by only 1 SNP, and are very old on top of that. That's exactly the result I'd hope to see in a robust model.
This seems to result in a more accurate TMRCA calculation overall. The center of the distribution approximates SNP-based techniques even more closely.
According to this iteration, Dorey's affinity with FT372222 appears even stronger, in comparison to FGC28383.
Maybe more trials will reveal latent weaknesses in this methodology, but right now I think a really good algorithm should be logarithmic and include, at a minimum, an adjustment for multi-step mutations. I'm not sure about the need for an additional adjustment specifically for convergence, but expanding the sample size through an additional iterative step, comparing the resulting coefficients of correlation for the preliminary TMRCA predictions, could make for a really strong fingerprinting tool.
STR haplotype networking and fingerprinting

Originally posted by benowicz:
. . . I think I've come up with a trick to optimize this networking exercise: derive the coefficients of correlation for the logarithmically estimated TMRCAs among a set of benchmark haplotypes selected on the basis of SNP status. The correspondence of a selected individual haplotype to this range can be expressed as a percentage, and might be thought of as a kind of an STR fingerprinting index for the relevant SNP block. In other words, you can predict the (relative) likelihood of a given STR haplotype testing positive for a given SNP by comparing the consistency of TMRCA estimates with a set of reference STR haplotypes.
It has its limitations. It's highly sensitive to possible over- and under-representation of haplotypes within the reference set. Like any method short of direct, high-resolution SNP testing, it can return an incorrect result, especially when comparisons are made between closely related clades. But it still seems way better than the naive methods I am familiar with, which emphasize the matching of allele values at specific markers selected as 'diagnostic', although I suppose this could be developed into a method of optimizing the identification of diagnostic markers.
This exercise also showed me that this fingerprinting index corresponded closely to the relative position of the donors' terminal clade in FTDNA's block tree. The higher the index %, the more defined that donor's terminal subclade will be with regard to the reference clade. Of course that's a reflection of skewing of the source data, but it's still useful as an objective way to make difficult decisions in deriving a cladogram or pedigree.
This method can also be applied to haplotypes occupying undifferentiated spaces under a given reference clade. Helpful, since it is very unlikely that all undifferentiated lineages actually diverged at exactly the same time.
The practical result of this exercise applied to my work on the FT372222 subclade of FGC23343 is that the Saddington donor really does seem to have a connection to Western Normandy, very consistent with the surname's historical origins, but much less consistent with the concentration of its ancestral clade, Z209, around the Basque country in Spain and France. Best estimate, based on 111- and 67-marker haplotypes, is an MRCA with a family from the Channel Islands born around 1100 A.D. Very useful in resolving an ambiguous case.
https://forums.familytreedna.com/for...522#post331522
https://forums.familytreedna.com/for...343#post332145
So maybe the situation is a little more ambiguous. Dorey has more consistent matches with FGC23343 than any other clade, and by the fingerprinting algorithm I described above, his haplotype is way more "typical" of FT372222 than FGC28383, but that could be because FT372222 is just less well-defined. Weirdly, Dorey would appear at an almost identically dated node whether you included him in a cladogram for FT372222 or FGC28383. The clades are just too closely related to feel super-confident on the basis of haplotype alone.
I think the geographic distribution is more consistent with the history of the Saddingtons, but we are talking about a TMRCA of nearly 1,000 years, so anything could happen. It would really be helpful if we had high-res SNP data for Dorey, but I guess that's not happening any time soon.
At least I learned how to draw a good cladogram. I was able to reproduce Provyn's results pretty easily, so there's that.

STR haplotype networking and fingerprinting
Recently somebody showed me this excellent web utility, Hunter Provyn's STR match finder. It seems to be billed mainly as a way to expand the range of STR marker haplotypes available for comparison, but I personally found it more useful as a tutorial in how to properly calculate genetic distance. The Provyn utility's output includes an intermediate check figure labelled "Log of Sum", which represents the natural log of the product of the mutation rates for the mutated markers, each individually raised to the power of the number of mutations observed, calculated according to the hybrid allele method.
I don't know why it had never occurred to me before, but that is actually the statistically correct method of calculating the genetic distance between two haplotypes. The joint probability of two independent events is calculated as the product of the probabilities associated with each event, so the rates of the mutated markers should be multiplied together rather than the mutations simply being counted.
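As a rough sketch of the "Log of Sum" idea described above: the log of a product of rates, each raised to its mutation count, equals the sum of count times log(rate). The marker names are real Y-STR loci, but the rate values and the function name `log_of_sum` are placeholders made up for illustration. They are not Provyn's code or a published rate set.

```python
import math

# Placeholder per-generation mutation rates (hypothetical values).
MUTATION_RATES = {"DYS393": 0.00076, "DYS390": 0.00311, "DYS19": 0.00151}


def log_of_sum(mutations, rates):
    """Natural log of the product of rate**count over the mutated markers.

    Summing count * ln(rate) gives the same number as taking the log of the
    product, but avoids underflow when many markers differ.
    """
    return sum(n * math.log(rates[m]) for m, n in mutations.items() if n > 0)


# Observed mutation counts between two haplotypes (hypothetical).
observed = {"DYS393": 1, "DYS390": 2, "DYS19": 0}
print(log_of_sum(observed, MUTATION_RATES))
```

A mutation at a slow marker pulls the figure down more than one at a fast marker, which is exactly the weighting a raw mutation count lacks.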
I was never quite satisfied with the rationales for using the infinite allele method in linear models, although I had successfully replicated (pretty closely) the results of other popular TMRCA calculators using infinite allele, a linear model and a GD gross-up factor to account for potential convergence, calculated from the coefficient of inequality inherent in the chosen mutation rate set. So it seemed 'good enough for government work', as they say.
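For readers unfamiliar with the counting conventions, the sketch below contrasts the infinite allele count (each differing marker counts once) with the stepwise count (sum of repeat differences). The function name `genetic_distance` and the sample haplotypes are made up for illustration. It ignores multi-copy markers, and it does not attempt the hybrid method or the gross-up factor, since neither is spelled out here.

```python
def genetic_distance(hap_a, hap_b, method="stepwise"):
    """Raw genetic distance between two haplotypes given as {marker: repeats}.

    stepwise: sum of absolute repeat differences over shared markers.
    infinite: each differing marker counts once, whatever the size of the gap.
    """
    shared = hap_a.keys() & hap_b.keys()
    if method == "stepwise":
        return sum(abs(hap_a[m] - hap_b[m]) for m in shared)
    if method == "infinite":
        return sum(1 for m in shared if hap_a[m] != hap_b[m])
    raise ValueError(f"unknown method: {method}")


# Hypothetical haplotypes differing by 2 repeats at one marker and 1 at another.
a = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
b = {"DYS393": 13, "DYS390": 26, "DYS19": 15}
print(genetic_distance(a, b, "stepwise"), genetic_distance(a, b, "infinite"))
```

Both counts give every locus the same weight, which is the shortcoming discussed in this thread.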
But knowing what I know now, I figured that an optimal TMRCA calculator would be based on a logarithmic model AND include an adjustment for potential convergence. Maybe I will one day find out that there is a better way of calculating that convergence adjustment, but for my present purposes I just used that coefficient of inequality. Probably not perfect, but way better than the methods I'd been using before, which were clearly suboptimal in giving equal weight to mutations at all loci.
All the utilities that I have used to date to generate networks/pedigrees for an array of haplotypes seem to be based on that kind of linear algorithm, which projects relationships based on the raw, unadjusted number of mutations observed. For the reasons discussed above, I now know that is suboptimal from the point of view of good statistical methodology. I'm sure that those linear models are capable of returning "okay" estimates, especially for cases where the donors' SNP status is known and the actual MRCA is quite recent, but I like to work with much more remote relationships, in order to capture sample sizes that are large enough to approach statistical significance.
I think I've come up with a trick to optimize this networking exercise: derive the coefficients of correlation for the logarithmically estimated TMRCAs among a set of benchmark haplotypes selected on the basis of SNP status. The correspondence of a selected individual haplotype to this range can be expressed as a percentage, and might be thought of as a kind of an STR fingerprinting index for the relevant SNP block. In other words, you can predict the (relative) likelihood of a given STR haplotype testing positive for a given SNP by comparing the consistency of TMRCA estimates with a set of reference STR haplotypes.
It has its limitations. It's highly sensitive to possible over- and under-representation of haplotypes within the reference set. Like any method short of direct, high-resolution SNP testing, it can return an incorrect result, especially when comparisons are made between closely related clades. But it still seems way better than the naive methods I am familiar with, which emphasize the matching of allele values at specific markers selected as 'diagnostic', although I suppose this could be developed into a method of optimizing the identification of diagnostic markers.
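One possible reading of this index, sketched under heavy assumptions: each haplotype gets a "profile" of TMRCA estimates against a common comparison panel, the benchmark haplotypes are correlated with each other to set a baseline, and the candidate's average correlation with them is expressed as a percentage of that baseline. The panel, the numbers, and the names `pearson` and `fingerprint_index` are all hypothetical. The exact formula behind the index isn't given in this thread, so treat this as a guess at the general shape only.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fingerprint_index(candidate, benchmarks):
    """Candidate's mean correlation with the benchmarks, as a % of the
    benchmarks' own mean pairwise correlation (assumed definition)."""
    internal = [pearson(a, b) for a, b in combinations(benchmarks, 2)]
    baseline = sum(internal) / len(internal)
    if baseline <= 0:
        raise ValueError("benchmark profiles are not positively correlated")
    external = [pearson(candidate, b) for b in benchmarks]
    return 100.0 * (sum(external) / len(external)) / baseline


# Hypothetical TMRCA estimates (in generations) against a 5-haplotype panel.
benchmarks = [
    [12, 30, 45, 60, 28],
    [14, 33, 41, 66, 25],
    [10, 29, 48, 58, 31],
]
candidate = [13, 35, 44, 62, 27]
print(round(fingerprint_index(candidate, benchmarks), 1))
```

Because the baseline comes from the reference set itself, adding or dropping benchmark haplotypes shifts the index, which is consistent with the sensitivity to over- and under-representation noted in this post.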
This exercise also showed me that this fingerprinting index corresponded closely to the relative position of the donors' terminal clade in FTDNA's block tree. The higher the index %, the more defined that donor's terminal subclade will be with regard to the reference clade. Of course that's a reflection of skewing of the source data, but it's still useful as an objective way to make difficult decisions in deriving a cladogram or pedigree.
This method can also be applied to haplotypes occupying undifferentiated spaces under a given reference clade. Helpful, since it is very unlikely that all undifferentiated lineages actually diverged at exactly the same time.
The practical result of this exercise applied to my work on the FT372222 subclade of FGC23343 is that the Saddington donor really does seem to have a connection to Western Normandy, very consistent with the surname's historical origins, but much less consistent with the concentration of its ancestral clade, Z209, around the Basque country in Spain and France. Best estimate, based on 111- and 67-marker haplotypes, is an MRCA with a family from the Channel Islands born around 1100 A.D. Very useful in resolving an ambiguous case.
https://forums.familytreedna.com/for...522#post331522
https://forums.familytreedna.com/for...343#post332145