Viking FGC23343

  • It's been a really productive couple of days. I think the aging algorithm I'm using now has some legs.

    https://forums.familytreedna.com/for...870#post331870

    But the bottom line for tracing the Saddingtons and the Vincents to a common ancestor in western Normandy born in or after 1000 A.D. remains about the same--29% probability. Consistent with STR analyses using consensus mutation rates. FT372222 was probably formed about 880 A.D., so there most likely are branches that predate the arrival of the Saddington/Vincent ancestors in Normandy, but the idea that the Saddingtons and the Vincents could be branches of the d'Aubigny/de Mowbray family is plausible--with the current information.


    Originally posted by benowicz

    That 25% represented a rough average of the confidence levels at 1000 A.D. calculated per an SNP and an STR analysis I performed a while ago.

    I just made an attempt to validate the SNP rate I used, and now I change my mind. It's certainly far faster than the rates typically bandied about in the genetic genealogy community, but it actually seems pretty well supported. Yeah, in an absolute sense there could be more sampling done to validate it, but relative to what's available for other estimates, it's pretty good. "Optimistic" is kind of wide of the mark.

    https://forums.familytreedna.com/for...833#post331833

    There's significant disagreement about the correct STR rates to use, too, and all that's complicated by convergence. And I just remembered that there is that fellow named Carroll who actually seems to be a Vincent. Tacking his 67 markers on to SDV's 111 marker profile actually brings the MRCA significantly closer vs. the other FT372222 guys.

    https://www.familytreedna.com/public...frame=yresults

    So, realistically, there is maybe a 36% chance that the MRCA between the Vincents and the Saddingtons could have been born as recently as 1000 A.D. That's not bad.

    Comment


    • I see that the new FT372222 fellow is a participant of the Z209+ project, earliest known ancestor named Sebastian Guerrero Molina, born about 1677 in Spain.

      https://www.familytreedna.com/public...frame=yresults

      I would be very happy if there were a further definition of FT372222 related to this fellow--maybe the manual review isn't quite done yet. But I'm not too hopeful.

      In the current situation, I'm not sure there's much I can add. It currently doesn't change the average number of private variants under FT372222, so my previous estimate of an MRCA of 880 A.D. is also unchanged. On the surface the geographic distribution looks a little unusual, but my last post did anticipate branches of FT372222 pre-dating the arrival of the de Rollos family's ancestors in Normandy, so I can't say that was a surprise.

      It's great to finally have some real information on one of the Spanish branches of FGC23343. I just wish there were a little more to work with. I'm far from an expert in the Spanish language or interpreting historical documents from the Spanish Empire, but I have had some success working with families of Costa Rican and Cuban origin, so I think I could learn to work my way around, given enough info. I have the subjective feeling that the family of this new donor is also from a colonial family rather than one with very recent roots in Spain itself, so maybe there are important contextual clues available about Sebastian's birthplace in the history of his nation. I don't know if my other experiences with Latin American genealogy are characteristic of the subject as a whole, but I think these families tend to be very well documented. The surnames Guerrero and Molina alone aren't very telling, being extremely common and widespread throughout Spain.


      Comment


      • Latest age estimates for the various clades, leveraging some info from public projects.

        FGC23343 is the best supported age estimate, with the son clades, naturally, having successively smaller sample sizes, with margins of error in excess of 100 years in each direction. It's pretty hard to be super-confident with such small sample sizes.

        The estimate for ZZ40 is just a really crude back-of-the-envelope thing performed only from an averaging of extremely high-level info reported in FTDNA's block tree. But it seems consistent with the differences between my estimates and the estimates of other popular projects, with my estimates typically about 1,000 years more recent. I can support my estimates with a very good pilot study of multiple, well-documented Medieval lineages, so I know that might be controversial to some, but you know, it is what it is.

        No two ways about it, FGC23343 is a weird clade. The two oldest documented families have roots in France, from the Middle Ages, but involving regions that I think were surprising for a lot of the donors. The Garnets seem to have a solid documentary connection to Aquitaine through Roger the Poitevin de Montgomery, and that's not entirely unprecedented, given the history of the R-BY32575 Burke family.

        But the FT372222 Saddingtons, with pretty well-documented origins in western Normandy, are probably still pretty shocked by this geographic distribution. There is a plausible explanation in the archaeological and contemporary documentary evidence pointing to southwestern France being part of an integrated network of Viking raiders/traders (aka the 'Loire' or 'Garonne' Viking bands) headquartered in Dublin, Ireland. It's pretty much a settled feature of history that western Normandy was settled in the 10th century by Vikings from the Irish Sea region.

        I think there is also some DNA evidence reinforcing this conclusion to be found in the DNA of the Bruce family of Clackmannan, Scotland. They're defined by R-FTB15831, which is currently a pretty long block founded (i.e., NOT expanded) around the same time that FGC23343's MRCA lived. Their history in the Conquest era crosses paths several times with the Saddingtons' de Rollos ancestors, from western Normandy (i.e., Brix near Cherbourg for the Bruces, Bricqueville-la-Blouette for the de Rollos) to northern Yorkshire. The geographic distribution of the brother clades of R-FTB15831 suggests that it originated in Shetland or the Western Isles, also being an example of a non-Scandinavian family integrated into the Viking aristocracy. While FGC23343 and FTB15831 are both descendants of DF27, they're pretty widely divergent from one another, so FGC23343 is still very conspicuous for its recent Basque origins, but the general point of the basic plausibility of a hypothesis involving Dublin remains.

        Probably also requiring some explanation is the relatively large number of Spanish donors given the apparent origin of FGC23343 on the French side of the border. You'd expect there to be some back-and-forth across the border over time, but the pattern here seems to be a consistent southern migration. That probably makes sense in the history of Gascony and Navarre during the 8th century. It seems a settled part of history that the Basque or proto-Basque people strongly resisted the encroaching Frankish kingdom, and the founding of the kingdom of Navarre is partly attributed to this.


        Pedigree of FGC23343.png

        Comment


        • Hi
          I’ve arrived late for this discussion but my Shetland family are allegedly descended from the Chester and Flintshire de Monte Altos (Mowat). Although this lineage is oft repeated, our Y-DNA haplogroup is I1, which would argue against it. The rest of the distantly related Mowats in the northern isles are I1 also. Many of the surnames which represent our nearest ‘cousins’ appear to have interpretations which imply roots in Brittany and Normandy, however. Our ancestors in Angus in the 15th century are credited with evolving the modern spelling of our surname from de Montalt or Montealto, although when we arrived in Shetland in the early 17th century, the Norse notaries corrupted the Mowat spelling to Movat, softened locally to Mouat eventually.

          Comment


          • Sorry for not responding earlier. It took me a while to catch up after a hiatus.

            This thread has become monstrously long, and even I forget where I pick up and leave off on certain topics. I'm not really an expert on the de Montalt family, so I had to do some additional catch-up on them to be able to engage with your comment.

            My original interest in the de Montalts was kind of peripheral, just as additional context surrounding one highly speculative--and since abandoned--line of inquiry about FGC28369. I was researching whether the de Montalts might be related only by marriage to the founder of FGC28369 in England. I'm really most interested in the subclades of FGC23343, and although others may have extended the speculation beyond my intention, I never believed that the de Montalts were FGC28369+.

            I speculated that the de Montalts most likely belonged to a subclade of R-U106, based on only two distantly related participants in the Mold DNA project--kits #N3481 and 222351, one bearing the name Moffet. William the Conqueror's original Earl of Chester, Gerbod the Fleming, was from the Low Countries, where R-U106 is particularly common. I heaped on the additional wild speculation that the de Montalts may have come over with him, as his supporting staff and vassals.

            https://en.wikipedia.org/wiki/Gerbod...arl_of_Chester

            But in reality, I don't think anybody has even reliably established whether there are genuine direct male line descendants of this de Montalt family surviving into the modern era, and if there are, whether any of them have done high-resolution DNA testing. So as far as I know--and this is not my focus of study--these de Montalts may indeed be some branch of I1.

            The complexity of the competing claims of descent from the Cheshire de Montalts confuses the heck out of me. Frankly, I have begun to think that any family whose surname begins with the letter 'M' may have claimed to descend from them at one time or another, the actual historical descent being obscure enough to render almost any story plausible. After a super-brief survey of the literature, I believe that the family with the most scholarly support as possible descendants are the Maude family, Viscounts Hawarden in the Peerage of Ireland, coming to that country from Riddlesden, Yorkshire in the 17th century.

            But the thread of contemporary documentation connecting the Riddlesden Maude/Mold family to the Cheshire de Montalt family is unclear to me. Maybe if I did more research it would be, but as it stands now, I suspect this connection may be unproven and possibly a bit speculative. Here is a quote from a very thoroughly cited paper by Ann Whiting, focusing on a genuine, historical branch of the Cheshire de Montalts at Castle Rising, Norfolk:

            "There were almost certainly distant de Montalt heirs still in existence that possibly started with Andomar de Montalt, founder of the Yorkshire branch of the family and family of Emma – any possible claim made as heirs-in-law failed for the main part. Apart from anything else any heirs were to be of Robert’s body. However other lands, excluded from the fines in 1327 were also held by the Crown after he died. A petition by Robert de Morle Morley, kinsman and heir of Robert de Mohaut to the King and council stated Therefore after Robert de Mohaut’s death, the esheator seized all his lands, tenements, fees and advowsons, into the king’s hand, both those included in the fine and the others."

            http://www.castle-rising-history.co....%20montalt.pdf

            So even Whiting seems to be hedging her bets, carefully adding the ambiguous combination of qualifiers "almost certainly".

            The significance of people of recent British or Irish descent having distant Y chromosome matches with people with recent Norman ancestry is very interesting to me, although I don't quite know what to think about most of the examples I've seen to date. From what I know of Norman history, I believe the general population were probably always very heterogeneous, comprising migrants from a wide variety of places all along the Western seaboard of Europe as well as Continental Europe over millennia. Certainly the Medieval aristocracy of Normandy were very heterogeneous, with probably only a very small minority descending from Viking settlers in the direct male line. "Normandy" as a Scandinavian colony is maybe a bit of a convenient-but-misleading oversimplification, and the Viking settlers were probably never more than a politically dominant minority that contributed relatively little to the genetic legacy of the region. I would expect such cross-channel DNA matches to be rare.

            But there may be exceptions to that. Like the Dorey family of the Channel Islands, who my recent research suggests likely share an MRCA born around 1100 A.D. with the historical de Saddington family of Leicestershire.

            Those few examples of unquestioned, historically-backed clade identification for Norman migrants to Britain and Ireland of which I am aware suggest that matches between modern donors living in Normandy and Shetland in particular have to be examined with special caution. Migration routes could be very complex and sometimes circular. The consensus of conventional historians seems to be that the Viking aristocracy of Western Normandy arrived there in the mid-10th century from the colonies around Man and the Hebrides.

            For example, R-FTB15831 is almost certainly the genetic signature of the de Bruce family of Brix near Cherbourg. From what I could tell, the most closely related brother clades seem to cluster in Shetland, Orkney and the Hebrides. Given the estimated ages for those clades, I strongly suspect this reflects the Bruces' origins among families native to the Scottish islands who became incorporated into the Viking aristocracy, migrated with them to Normandy, and only coincidentally found their way back to Scotland, even though from what I gather, the Bruce surname is particularly common in the northern islands. The northern islands are small, just the kind of place where you would expect a founding event to have a strongly skewing effect, but the diversity and ages of those clades related to R-FTB15831 in the region suggest to me that their presence long predates the Viking era.

            https://www.whodoyouthinkyouaremagaz...of-bruce-clan/

            Comment


            • I've had a bit of a breakthrough over at Wikitree: putting Nancy (Renick) Vincent as the mother of Joseph instead of Sarah Hoke, the second wife of John Frazier. This puts my DNA, as the 28th great-grandson of Guillaume de Normandie, through the Renick line. His grandfather John Vincent may not be in the direct paternal line, but marrying in Valley, Botetourt, Colony of Virginia puts things closer than I ever thought about. This is of course a paper trail on Wikitree, but it puts some of the other clades close to my FGC23343 and R-FT372222 brothers in perspective as immigrants. Some dates don't match and there are unknown confidences, but it looks like an exciting revelation.


              From Wikitree
              28th great grandson


              1. Sean is the son of [private father] [unknown confidence]
              2. [Private] is the son of Paul Vincent [unknown confidence]
              3. Paul is the son of Homer Levi Vincent (1910-1988) [unknown confidence]
              4. Homer is the son of James Henry Vincent (1876-1955) [unknown confidence]
              5. James is the son of Joseph G Vincent (abt.1819-1896) [unknown confidence]
              6. Joseph is the son of Nancy (Renick) Vincent (1783-1824) [unknown confidence]
              7. Nancy is the daughter of Letitia Wells Dalton (1756-1834) [unknown confidence]
              8. Letitia is the daughter of Samuel Dalton Sr. (1699-1805) [unknown confidence]
              9. Samuel is the son of William Dalton (1666-1733) [confident]
              10. William is the son of Tyrell Dalton (1646-1682) [confident]
              11. Tyrell is the son of Michael Dalton (abt.1619-) [unknown confidence]
              12. Michael is the son of Oliver Dalton (bef.1590-bef.1619) [unknown confidence]
              13. Oliver is the son of Francis (Thornton) Dalton (1568-bef.1601) [unknown confidence]
              14. Francis is the daughter of William Thornton Esq. (abt.1540-1570) [unknown confidence]
              15. William is the son of Oliver Thornton (abt.1510-1557) [confident]
              16. Oliver is the son of Thomas Thornton (1475-) [unknown confidence]
              17. Thomas is the son of Thomas Thornton (abt.1445-) [unknown confidence]
              18. Thomas is the son of Roger Thornton (1415-) [unknown confidence]
              19. Roger is the son of William Thornton (1385-) [unknown confidence]
              20. William is the son of Margaret (Stapleton) Thornton (1360-) [unknown confidence]
              21. Margaret is the daughter of Alice (St Philibert) Stapleton (abt.1330-abt.1383) [unknown confidence]
              22. Alice is the daughter of Ada (Botetourt) de St Philibert (-1349) [unknown confidence]
              23. Ada is the daughter of Maud (FitzThomas) Botetourt (1270-abt.1329) [unknown confidence]
              24. Maud is the daughter of Beatrice (Beauchamp) de Munchensy (1243-1285) [confident]
              25. Beatrice is the daughter of Ida (Longespée) de Beauchamp (abt.1208-bef.1270) [unknown confidence]
              26. Ida is the daughter of William (Plantagenet) Longespée (abt.1176-1226) [unknown confidence]
              27. William is the son of Henry Plantagenet (1133-1189) [confident]
              28. Henry is the son of Matilda (Normandie) of England (1102-1167) [confident]
              29. Matilda is the daughter of Henry (Normandie) of England (1068-1135) [confident]
              30. Henry is the son of Guillaume (Normandie) de Normandie (abt.1027-1087) [confident]
              This makes Guillaume the 28th great grandfather of Sean.
              Last edited by SDV; 28 June 2022, 11:34 AM.

              Comment


              • The Beta rollout of the Discover database, including MRCA estimates for clades in FTDNA's block tree came out before the holiday, and I have to say that there's a lot of good stuff there. I think they really nailed most of the benchmark clades I identified in my own pilot study that began late last year and wound down in January. Really good. There was some difficult to interpret topology in some of the clades I reviewed, like R-BY21154, but for the most part they seem to have gotten them correct.

                https://forums.familytreedna.com/for...312#post332312

                Unfortunately, R-FGC23343 and its subclades don't seem to be among the success stories. Discover has aged these about 700 years older than my own estimates (e.g., MRCA at 6 C.E. per Discover vs. 692 C.E. per me).

                I don't preclude the technical possibility that they know something I don't--I assume they have access to the complete actual coverage data for all of the donors in question, whereas I only have a few individuals I can use as 'spot checks'. But what data I do have is good, and suggests to me that Discover's estimates are almost a literal impossibility. As such, until more data is available showing me otherwise, for R-FGC23343 and subclades I'll stick to the estimates I've developed independently myself, as summarized in this speculative chart.

                Pedigree of FGC23343.png

                Given the otherwise excellent quality of estimates I've seen in Discover so far, I think this calls for some explanation. My current guess is that the Discover estimates, like mine, may be based on an indirect inference of coverage of data anonymized within the block tree, but that at least in the case of R-FGC23343, some pretty standard interpretive heuristics were not applied by Discover, which may have led to this sub-optimal estimate. For example, when identifying anomalous data to exclude from the TMRCA calculation, they may have failed to notice the inherent implausibility of the scenario they selected. See the chart and notes below.

                R-FGC23343 6 July 2022 - two alternate analyses.png


                The general form of the equation calculating the joint probability of more than one independent event occurring within a given period of time is X^Y, where X is the probability of the event occurring once, and Y is the number of times it happens within this time period. So while both analyses excluded two items from the calculation as anomalous, the first alternative (mine) is exponentially more probable than the second (what I believe Discover's process may approximate) because the first assumes only one anomalous event whereas the second assumes at least two (actually three, if you count each undifferentiated donor as a separate item, which I think you must).
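
                As a rough illustration of the X^Y comparison, here is a minimal Python sketch. It assumes the anomalous events are independent and equally likely, and the per-event probability of 0.05 is purely a placeholder chosen for illustration, not a figure from either analysis.

```python
def joint_probability(p_single: float, n_events: int) -> float:
    """Probability that n independent events, each with probability p_single, all occur (X^Y)."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    return p_single ** n_events


# Hypothetical probability of a single anomalous event (an assumption for illustration).
P_ANOMALY = 0.05

one_event = joint_probability(P_ANOMALY, 1)     # scenario assuming one anomalous event
three_events = joint_probability(P_ANOMALY, 3)  # scenario assuming three anomalous events
ratio = one_event / three_events

print(f"one anomaly:    {one_event:.6f}")
print(f"three anomalies: {three_events:.6f}")
print(f"ratio:           {ratio:.1f}")
```

                Whatever value is plugged in for the single-event probability p, the one-event scenario comes out 1/p^2 times more probable than the three-event scenario, so the rarer the anomaly, the wider the gap.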

                Drilling down deeper, I can verify that the assumptions embedded in Discover's reported dates seem extremely unlikely, if not literally impossible. For example, using algebra to derive the implicit coverage statistics implied by Discover, given data published for the single R-FT351092 donor at the Big Tree, suggests that FTDNA's QC department would never release kits with such low coverage. Right now I don't know exactly what the minimum acceptable level is, vis-a-vis the standard core coverage, but I'm pretty sure it's above 42%.

                Analysis of coverage inherent in Discovery - R-FT351092 5 July 2022 pt II.png


                So for now, until I can be convinced otherwise, I will stick to my own current estimates of these clades when discussing them, despite the otherwise very good quality of the Discover estimates for most of the clades I've reviewed to date.

                Just as a side note, I don't think this necessarily reflects a really serious or pervasive flaw in Discover's methodology. It has to be admitted that the reported variant counts for R-FGC23343 are kind of weird, and someone looking at them without researching actual donor coverage could easily have come to a conclusion different than mine. Maybe they would have noted the relative plausibility of having only one anomalous event vs. having three, but I could also see how that detail might have been missed in a very manual process for a Beta version.

                Comment


                • Originally posted by SDV
                  I've had a bit of breakthrough over at Wikitree. . . This makes Guillaume the 28th great grandfather of Sean.

                  Without digging into the specifics of every point in the chain, I'd say that descent sounds generally plausible to me. One of those old chestnuts that the BBC runs in its features section from time to time is that about half of all Britons are descended from Edward III.

                  Maybe the proportion for Americans is even higher. I've heard that emigrants generally tended to come from the higher end of the middling social strata, as they were more likely to be able to afford the cost of passage. I think there's a grain of truth to it. There were convicts, etc. as well, but they were less likely to strike root and leave a lot of descendants. It rings well with Famine studies in Ireland.

                  In any event, every election cycle the American papers run a story about how both candidates descend from Edward III. Like clockwork.
                  Last edited by benowicz; 6 July 2022, 07:53 AM.

                  Comment


                  • Originally posted by benowicz
                    . . . So for now, until I can be convinced otherwise, I will stick to my own current estimates of these clades when discussing them, despite the otherwise very good quality of the Discover estimates for most of the clades I've reviewed to date. . . .

                    No change in my opinion about the relative reliability of my own analysis vs. figures reported by Discover to date, but one note appreciating the complexity of the task of indirectly inferring coverage, which I suspect Discover may be doing as well. The most theoretically correct method would be to aggregate data for all immediate subclades by the geometric mean of the variant count, given the logarithmic character of the probability structure.

                    I myself definitely do not have the data at the granular level necessary to do this, and while I tend to believe the Discover team would, it seems likely that they are not doing so.

                    I don't truly know if this is what's going on at Discover, but it seems possible, and given the stark inconsistencies apparent between Discover's estimates for R-FGC23343 and subclades and the other clades in my pilot study, it seems worth considering.


                    More observations on Discovery MRCA estimates - 8 July 2022.png

                    I suspect that the Discover team may be taking too aggressive an approach to sampling. At every point in my pilot study I expressed admiration at the high compression of observations (i.e., aggregated by mathematical average at the level of the terminal clade) around the central 50% of the binomial probability curve associated with their MRC ancestral clade. About 90% of observations fell within this central 50%. While that does accurately reflect a strongly in-control process vis-a-vis conformity of individual kits to product standard, it definitely understates the true deviation of variant counts among individual donors due to averaging for aggregation within terminal clades.

                    In other words, if the Discover team is using this type of indirect approach to determining kit coverage, they should be targeting a sampling rate closer to 50% rather than 90% or even 70% if they are using disaggregated data for individual donor data counts. It seems hard to believe that they would be using such an inferential method. Presumably they have access to the actual coverage data. But it is even harder for me to otherwise explain the huge differential noted between the BY700 mutation rates implicit in their estimates for benchmark clades like R-BY32575 (i.e., ~60 years) and the ~100 years for R-FGC23343, and the inconsistency of the R-FGC23343 results with STR analysis. Maybe they hoped for some kind of processing efficiencies? That might be reasonable for very large clades like the benchmark clades I looked at in my pilot study, where outliers would be washed out in the sheer volume of data, but certainly it would be more problematic for small clades like R-FGC23343.

                    However any of that may actually be, the key thing to take away from all of this would be the optimal method for inferring coverage from individual donor variant counts. It would be to simply calculate the geometric mean for each subclade at each node up through to the ultimate ancestral clade whose MRCA is being estimated. While I don't have all the data necessary to fully perform this calculation for R-FGC23343, I do have enough to appreciate the challenges of intra-clade deviation, and from what I can tell, my original MRCA estimate of 691 C.E. is far closer to the results for a fully optimized calculation than Discover's estimates.
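
                    The node-by-node aggregation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Clade` structure, the clade names and every variant count are hypothetical, and all counts are assumed to be measured on the same basis (variants downstream of the clade being dated). It is not the procedure used by Discover or in the pilot study.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clade:
    """A node in a hypothetical block tree."""
    name: str
    # Variant counts of undifferentiated donors sitting directly at this node.
    donor_counts: List[float] = field(default_factory=list)
    children: List["Clade"] = field(default_factory=list)


def geometric_mean(values: List[float]) -> float:
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate(clade: Clade) -> float:
    """Collapse each child subclade to one value, then take the geometric mean at this node."""
    observations = list(clade.donor_counts)
    observations += [aggregate(child) for child in clade.children]
    return geometric_mean(observations)


def all_counts(clade: Clade) -> List[float]:
    """Flatten the tree into one list, ignoring phylogeny (for comparison only)."""
    counts = list(clade.donor_counts)
    for child in clade.children:
        counts += all_counts(child)
    return counts


# Hypothetical tree: one undifferentiated donor, a small subclade A,
# and a heavily sampled subclade B with unusually high counts.
tree = Clade("TOP", donor_counts=[20], children=[
    Clade("A", donor_counts=[16, 24]),
    Clade("B", donor_counts=[30, 30, 30, 30, 30]),
])

by_phylogeny = aggregate(tree)
pooled = geometric_mean(all_counts(tree))
print(f"aggregated by phylogeny: {by_phylogeny:.2f}")
print(f"pooled, ignoring nodes:  {pooled:.2f}")
```

                    In this made-up example the pooled figure is pulled toward subclade B simply because B has more donors, whereas the node-by-node figure counts B once, as a single observation of the ancestral clade.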

                    How would the application of this method affect the other clades I reviewed in my pilot study? Probably not by too much. If Discover was using the disaggregated sampling approach that I hypothesize here (maybe a weird, controversial theory), it doesn't seem to have resulted in a significant discrepancy with my own estimates. Probably because those clades are so large that outliers easily fall out in the wash.

                    Comment


                    • Yes, I think that problem with incorrect aggregation is the most likely explanation.

                      My pilot study was inspired by a critique of a paper by Iain McDonald from last year. I almost forgot about a very important observation I made there.

                      https://forums.familytreedna.com/for...nce#post332287

                      My notes were couched in language that was maybe at times a little less than diplomatic, but very early on I identified a critical mistake McDonald had made. I think the Discover team have likely repeated it. They have forgotten that they're not trying to measure the number of mutations in the Y chromosome of a typical clade donor, but the time elapsed since the most recent common ancestor, and those are two very different things.

                      In that paper, they weighted the average number of mutations for well-represented subclades more heavily, which is the exact opposite of what you want to do. By doing so you're introducing a new random variable that obscures, not elucidates, the phenomenon you want to isolate, which is the time lapsed since the overall most recent common ancestor. If by random chance your most heavily represented subclades also just happen to have a particularly unusual number of mutations vis-a-vis the typical mutation rate in the human population as a whole, you will be exacerbating the TMRCA measurement problem by weighting them more heavily.
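
                      A small Python sketch of the weighting problem, using invented numbers only: three hypothetical subclades, one of which is both heavily sampled and unusually mutation-rich. The function names and figures are assumptions for illustration and are not taken from the paper or from Discover.

```python
from typing import Dict, Tuple

# subclade name -> (average variant count, number of donors); all values hypothetical
Subclades = Dict[str, Tuple[float, int]]


def donor_weighted_mean(subclades: Subclades) -> float:
    """Average in which well-represented subclades count more heavily."""
    total_donors = sum(n for _, n in subclades.values())
    return sum(mean * n for mean, n in subclades.values()) / total_donors


def per_subclade_mean(subclades: Subclades) -> float:
    """Average in which each subclade counts once, as one independent observation."""
    return sum(mean for mean, _ in subclades.values()) / len(subclades)


example: Subclades = {
    "subclade_A": (30.0, 5),  # heavily sampled and unusually high
    "subclade_B": (20.0, 1),
    "subclade_C": (19.0, 1),
}

weighted = donor_weighted_mean(example)
unweighted = per_subclade_mean(example)
print(f"donor-weighted: {weighted:.1f}")
print(f"per-subclade:   {unweighted:.1f}")
```

                      With these made-up inputs the donor-weighted figure lands at 27 while the per-subclade figure lands at 23, so the heavier weighting drags the result toward the one atypical branch.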

                      Aggregation problems for R-S781 9 July 2022.png

                      The algorithm I used in my pilot study is not perfectly optimal either. I don't have access to the granular donor-level data to be able to perform a fully correct aggregation of data according to the clade's specific phylogeny. I suppose I could have done better than I did, though, by aggregating the data I did have in that way. But by using the geometric mean rather than the mathematical average, and inferring the appropriate mutation rate from the aggregate study of several unrelated clades, I'm sure that I have gotten very, very close indeed.

                      The difficulty presented by R-FGC23343 and subclades is both that it is very small and that a relatively large number of closely related donors (i.e., 5 donors under R-FGC28370, out of a total of 16 donors under R-FGC23343) have reported a freakishly high number of mutations as compared to other BY700 donors. It seems pretty clear to me that R-FGC28370 should not be considered 5 independent trials. The four undifferentiated donors under R-FT372222, all BY700, or the four donors under R-BY97678, also apparently BY700, who are all much more remotely related, are all much better candidates for the proper basis to measure the age of R-FGC23343. Surely it can't be a coincidence that their variant counts are much more similar to one another than R-FGC28370 is to any of them.

                      My MRCA estimate of ~700 C.E. for R-FGC23343 has to be much closer to the mark than Discover's. It's so much closer to the STR derived dates, too.
                      Last edited by benowicz; 9 July 2022, 07:08 AM.

                      Comment


                      • If the TMRCA calculations depend on the assumption that the samples are "random" (i.e., independent samples of the same phenomenon), and I don't see how you can proceed without this assumption, it is very clear that some adjustment has to be made when the assumption is violated. One indication that the samples are not random would be that they are on the same descendant node. It doesn't seem fair to treat such samples as independent. Rather, they (in the form of their "average", by whatever metric seems most appropriate) count as one measure (one sample) of the desired TMRCA. One would have to start at the most recent node and systematically estimate the age of that node, working back in time to the desired node, rather like the simplest method of constructing a cladogram. Independent sampling is basic!

                        Comment


                        • Exactly. That's what I tried to do to the best of my ability here.

                          Aggregation by phylogeny R-FGC23343 9 July 2022.png

                          You can identify the independent observations by their phylogeny. The immediate clades under R-FGC23343 are:

                          R-BY97678
                          R-FGC28383
                          R-FT372222

                          And, of course, undifferentiated R-FGC23343, of which there are two donors.

                          That much is very simple. None of them are any more closely related to one another than they are to R-FGC23343. Perfectly or almost perfectly independent.

                          It is theoretically possible that two or three of them actually are ever so slightly more closely related to one another than to R-FGC23343, but just through dumb luck no block-defining SNP mutation occurred during the interim. In any event, such a hypothetical event would not make them significantly more related to one another than to R-FGC23343. Certainly not within the context of the 1,000 + year time frame we're talking about.

                          The limit on the validity of the exercise that I performed here is that I don't have specific variant count info for all of the individual donors. I have enough culled together from public projects and a single kit that I administer to make some pretty good guesses to fill in the gaps, but it's not perfect. Not like what the Discover team could do if they wanted to.

                          Under ideal conditions there would be little variability in the variant counts of different donors tested on the same platform. Under those conditions, the mathematical average--which is what the block tree reports--would approximate the median, which would approximate the geometric mean. But as I noted in an earlier analysis of FT372222, that is not the case. There, the variant counts seem likely to swing around quite a bit, from 16 at the low end to 24 at the high end--but all the donors are BY700. That much I know for a fact.
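
                          To see how the three measures drift apart as the spread grows, here is a minimal Python sketch. Only the 16 and 24 endpoints come from the discussion above; the intermediate values and the "tight" comparison set are hypothetical.

```python
import statistics


def summarize(counts):
    """Arithmetic mean, median and geometric mean of a list of variant counts."""
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "geometric_mean": statistics.geometric_mean(counts),
    }


tight = [20, 20, 21, 21]   # hypothetical: little variability between donors
spread = [16, 18, 22, 24]  # hypothetical values spanning the 16 to 24 range

for label, counts in (("tight", tight), ("spread", spread)):
    s = summarize(counts)
    print(f"{label:6s} mean={s['mean']:.3f} median={s['median']:.3f} "
          f"geometric mean={s['geometric_mean']:.3f}")
```

                          The geometric mean never exceeds the arithmetic mean, and the gap between them widens as the counts become more dispersed, which is why the choice of average matters more for a small, noisy clade.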

                          So I would strongly recommend using the geometric mean, to facilitate normalization of values by the associated confidence levels at the top-level TMRCA calculation for R-FGC23343.

                          Why didn't McDonald in that paper or the Discover team do this? As you say, independence is fundamental to the validity of these calculations. My guess is that it seemed like too much work. I mean, they could front-load that work by writing up a macro or series of queries to automate the process, but they may have seen that usually, and especially for the large clades that get all the press, their very crude, disaggregated approach was 'good enough'. Even I, who normally bust people's chops for exactly this sort of thing, didn't get too worked up when I noticed the 200 year variance of Discover's estimate for R-S781 vs. my own calculations. I only got worked up when I saw a 700 year discrepancy for a different clade I do a lot of research on.

                          I have to imagine there are a few other people out there like me who are gobsmacked when a particular clade they're working on is way off. But there probably aren't too many of us, and the Discover team may consider that an acceptable price for efficiency.

                          Comment
