E3b project cladograms


  • E3b project cladograms

    I have opened this new thread to share our views, questions and opinions about the phylogeny of haplogroup E3b. As we know, E3b is an old and diverse haplogroup which comprises several subclades.

    My initial assumption is that just as haplogroups can be predicted by statistical analysis of marker values of individual haplotypes, we can also predict or "infer" the subclades within our haplogroup by using the proper software tools to analyze a sample of E3b haplotypes.

    The objective is to create a diagram or tree that visually represents the main bifurcations of haplogroup E3b into its main subclades and then identifies individual haplotypes by their ID labels at the terminal branches.

    The data I've been using come from the E3b project. Because new members are joining all the time and some haplotypes are being upgraded from 12 markers to 25 or 37, the E3b cladograms need to be updated periodically.


    To look at a sample E3b cladogram click here.
    The colors were added to highlight what I believe to be the three main subclades, that is E3b3, E3b1 and E3b2 in the order of appearance.
    (I'll be posting a cladogram with the latest data soon.)

    As to the software tools used to create the tree, there is a wide assortment of available packages, but I have opted for some of the same tools that have proven useful in other DNA projects, namely McGee's YDNA comparison tool, PHYLIP (Phylogeny Inference Package), TreeView and TreeExplorer, plus a couple of image conversion applications.

    For an excellent primer on phylogenetics I recommend visiting the National Center for Biotechnology Information.

    What follows is some basic terminology about phylogenetic trees copied from the above page.
    • Node: represents a taxonomic unit. This can be either an existing species or an ancestor.
    • Branch: defines the relationship between the taxa in terms of descent and ancestry.
    • Topology: the branching patterns of the tree.
    • Branch length: represents the number of changes that have occurred in the branch.
    • Root: the common ancestor of all taxa.
    • Distance scale: scale that represents the number of differences between organisms or sequences.
    • Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all of their descendents.
    • Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains.


    http://www.ncbi.nlm.nih.gov/About/pr.../treechart.gif

    More to come...

  • #2
    Root location on current E3b cladogram

    Victor,

    Ok - thanks for the taxonomic link and the definitions.

    If I am reading this correctly the "Root" for this tree would simply be an extension (to the left or "upstream") of the first horizontal line that joins the three clades - and that extension would be labeled E3b M-35+ as all ancestors of anyone in this specific tree would have this SNP as well as all descendants.

    Then the intersection of the first vertical line would be the node (M-123+) for E3b3 clade.

    The intersection with the 2nd vertical line would be the nodes designating E3b1 (M-78+) and E3b2 (M81+) being the upper and lower segments respectively of this vertical line.

    It does appear that the last four haplotypes included in the M-78+ group may be forming a sub-clade - it would be interesting to see one of the closely grouped threesome be "deeper" SNP tested.

    Thanks for your cladogram efforts!

    Bill



    • #3
      You're right, Bill. The diagram doesn't actually show the root, but it would be, as you say, an extension to the left of the first line that joins the three clades.

      The following new diagram offers a different perspective, and I have added my interpretation of which clades the branches represent.

      http://img.villagephotos.com/p/2005-...25_dec03cb.jpg

      What I've noticed since I started doing the cladograms when there were only close to 40 records is that as new records are added the haplotypes get rearranged a little bit within their own cluster. What has remained more or less constant are the three main branches, including the top one which at the moment represents a single haplotype.

      Victor



      • #4
        E3b median-joining diagram

        One more visual representation of the same data:
        http://img.villagephotos.com/p/2005-.../e3b_25_MJ.jpg



        • #5
          E3b median-joining diagram : for 37 markers also?

          Victor ,

          This newest version gives the same basic info as the prior format (my personal favorite of the three recent trees) but seems to accentuate the difference M81 shows in relationship to M78 and M123 - these latter indicate a distinct separation from one another but not as sharply as M81 shows a separation from the other two clades in your latest effort.

          Is this primarily due to only having one test in the M123 clade? - whereas the other two have numerous test samples.

          I would like to see the results of all 37 markers in a run using the same format as your prior tree. I would like to begin trying to sort out the potential M123 testees from all the rest and am curious as to whether the additional 12 markers will be of any assistance in determining specific modal differences?

          If it is a lot of work - just forget about doing it. I'll probably take forever to figure it out anyway!

          Bill



          • #6
            Bill,

            Regarding the separation of M123 and M78 in the median-joining diagram, I think it depends not on the number of test samples but on genetic distance. The branch length is supposed to be proportional to the number of steps between one haplotype and another. In other words, E3b3 seems to be closer to E3b1 than to E3b2.

            In the previous illustration, the curved-branches tree, the E3b2 cluster appears next to E3b3, but this proximity does not necessarily reflect genetic distance. That's why I made a note that the tree doesn't reflect the chronology of genetic events. The main usefulness of that diagram is to show the clustering of haplotypes. The ordinal position of the haplotypes is only relevant within their own clade.

            To get another idea about distances see this phylogenetic tree that shows the number of steps between each node.

            Next time I'll also do the 37 marker haplotypes to see if there is a similar or different pattern.

            Victor



            • #7
              New SNP results

              Hi Everyone,

              The E3b Project is now approaching the 100 haplotype mark. We currently have 5 confirmed SNPs (beyond M35): two M78+ (E3b1), one M81+ (E3b2) and two M123+ (E3b3).

              Below are the links to the latest 12 marker and 25 marker based cladograms.
              The comparison of these two diagrams helps to illustrate one important consideration about the inference of subclades by the genetic distance amongst haplotypes in our recordset.

              The objective is to find the optimal marker count to process the data by the inference software and generate the cladograms. Which has a higher prediction value, 12 or 25?

              12 marker cladogram:
              http://img.villagephotos.com/p/2005-...-12-051213.jpg

              25 marker cladogram:
              http://img.villagephotos.com/p/2005-...25-051213x.jpg

              The main observation I made on the 12 marker cladogram is that the second M123+ haplotype (32872) does not cluster next to, or even very near, our first M123+ (19310). This raises the question of whether all the haplotypes located between these two could also be M123+, or whether the 12 marker based cladogram simply doesn't provide enough resolution for the software to create the correct clustering.

              Unfortunately, the second confirmed M123+ haplotype (32872) has only been tested for 12 markers, so we can't know for now. Maybe when we get a few more SNP results we will know the answer. The other confirmed SNPs seem to support the model in both cladograms. I'm inclined to think that the 25 marker cladogram produces better results. Any comments?

              Victor


              p.s. @ Bill, I tried to run the 37 markers also, but I was getting a run-time error that I could not pinpoint and correct. I'll keep trying.



              • #8
                Originally posted by Victor
                Bill,



                To get another idea about distances see this phylogenetic tree that shows the number of steps between each node.



                Victor

                Please tell me you're using more than just the E3b group.

                How can you use that few people to decide anything?

                I look at the chart and I don't know what it represents. At least now it has some explanation.
                Why not just use FTDNA's? Maybe I am missing something. Please tell me what it is.



                • #9
                  Hello Jim,

                  With a couple of exceptions when I've inserted one or two external haplotypes, I've been using exclusively the dataset from the E3b project.

                  I agree that the sample size is too small to draw any definitive conclusions, but so far the latest SNP confirmations have not contradicted the clustering generated by the software application.

                  Maybe if I briefly describe the process used to get from haplotype dataset to cladogram, you'll get a better understanding.

                  For example, for the latest cladograms I start by creating two files out of the whole E3b project dataset: one with 12 marker records and another with 25 marker records. I run each one through the whole following process.

                  Generate a PHYLIP (Phylogeny Inference Package) compatible data file from the TMRCA table (infinite allele model) of McGee's YDNA comparison utility, with the following settings:
                  Probability 50%,
                  Mutation Rate FTDNATiP(tm) 0.004..0.009,
                  Units 25 years/generation.

                  Next, process the data using the Kitsch module of the PHYLIP package with the following settings:
                  Method Fitch-Margoliash
                  Lower triangular data matrix
                  Randomized input order (seed = 9, 11 times)
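To make the hand-off between the two steps concrete, here is a minimal sketch of what a lower-triangular PHYLIP distance file looks like and how one could be written. It is not McGee's utility; the function name `to_phylip_lower`, the labels `hapA`..`hapC` and the distance values are hypothetical placeholders, and the only assumption about the format is the usual PHYLIP layout (taxon count on the first line, names padded to 10 characters, each row listing distances to the preceding taxa only).

```python
def to_phylip_lower(names, dist):
    """Return a lower-triangular PHYLIP distance matrix as text.

    names: list of taxon labels (cut or padded to 10 characters).
    dist:  square, symmetric list of lists with pairwise distances.
    """
    lines = [f"{len(names):5d}"]
    for i, name in enumerate(names):
        label = name[:10].ljust(10)
        # Row i only lists distances to taxa 0..i-1 (nothing for the first row).
        row = " ".join(f"{dist[i][j]:.4f}" for j in range(i))
        lines.append(label + row)
    return "\n".join(lines) + "\n"


# Hypothetical labels and distances, for illustration only.
names = ["hapA", "hapB", "hapC"]
dist = [
    [0.0, 12.0, 15.0],
    [12.0, 0.0, 9.0],
    [15.0, 9.0, 0.0],
]
phylip_text = to_phylip_lower(names, dist)
print(phylip_text)
```

A file like this is what the lower triangular data matrix setting expects as input.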

                  Finally, draw the cladogram/phylogram/radial tree from the resulting "phylip" file using either TreeView or TreeExplorer.

                  The comments and shading are added with a regular graphics editor.

                  So, in essence, what the inference software does is rearrange and cluster the haplotypes on the basis of genetic distance amongst all haplotypes in the dataset, as measured by the TMRCA table. Our assumption is that the main branches in the diagram could correlate to corresponding subclades.
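As a rough illustration of what "genetic distance amongst all haplotypes" means, the sketch below builds a pairwise distance matrix from marker values. It uses raw mismatch counts as a stand-in for the TMRCA table, so it is an assumption-laden simplification rather than the actual procedure; the haplotype labels and marker values are hypothetical. Two counting rules are shown: the infinite allele model (any difference at a marker counts once) and a stepwise count (the size of the difference counts).

```python
def infinite_allele_distance(h1, h2):
    """Number of markers whose values differ (each difference counts once)."""
    return sum(1 for a, b in zip(h1, h2) if a != b)


def stepwise_distance(h1, h2):
    """Sum of absolute repeat-count differences across markers."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def distance_matrix(haplotypes, metric):
    """Return (sorted ids, square matrix) for a dict of id -> marker values."""
    ids = sorted(haplotypes)
    matrix = [[metric(haplotypes[a], haplotypes[b]) for b in ids] for a in ids]
    return ids, matrix


# Hypothetical six-marker haplotypes, for illustration only.
haplotypes = {
    "hapA": [13, 24, 13, 10, 16, 18],
    "hapB": [13, 24, 13, 8, 16, 18],
    "hapC": [13, 25, 14, 10, 19, 19],
}
ids, matrix = distance_matrix(haplotypes, infinite_allele_distance)
for name, row in zip(ids, matrix):
    print(name, row)
```

Swapping in `stepwise_distance` as the metric gives larger values wherever a marker differs by more than one repeat.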

                  Of course all of this is just experimental and an attempt to understand better the branching of our haplogroup.

                  Regards,

                  Victor



                  • #10
                    New SNP results - 12 marker cladogram questions?

                    Victor,

                    Regarding the placement of #19310 and #32872 on the 12 marker cladogram - an analysis of the non-modal STRs reveals that #19310 had a total of 7 out of 12 non-modal markers - each of the seven was one off from modal- whereas #32872 had three non-modal markers one of which was 8 where the modal was 10 (DYS 391).

                    Also of interest is #N15407 which actually is shown as a separate clade if I am reading the graph correctly. Here again there is a total of seven non-modal markers but DYS 385-a has a genetic distance of 3 from modal. In fact DYS 385-a and DYS 385-b being at a double 19,19 is unique in your sample base. The only sure way to determine whether this kit# indicates M123 or another sub-clade would be to extend the STR marker string for an educated guess or be SNP tested and be sure.

                    Interestingly, #19310 has a total of seven non-modal markers in the second panel of 13 markers with three of the seven each showing a genetic distance of 3 and a fourth mutation (DYS458) has a genetic distance of 4 from modal.
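The kind of non-modal analysis described above can be sketched in a few lines. This is only an illustration under stated assumptions: the modal value is taken as the most common value at each marker, and the haplotype labels and marker values below are hypothetical, not the project's data.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value at each marker position across all haplotypes."""
    columns = zip(*haplotypes.values())
    return [Counter(col).most_common(1)[0][0] for col in columns]


def non_modal_markers(haplotype, modal, marker_names):
    """Map marker name -> signed offset from modal, for non-modal markers only."""
    return {
        name: value - m
        for name, value, m in zip(marker_names, haplotype, modal)
        if value != m
    }


# Hypothetical values, for illustration only.
markers = ["DYS393", "DYS390", "DYS391", "DYS385a"]
haplotypes = {
    "hapA": [13, 24, 10, 16],
    "hapB": [13, 24, 10, 16],
    "hapC": [13, 24, 8, 16],
    "hapD": [13, 25, 10, 19],
}
modal = modal_haplotype(haplotypes)
for hap_id, values in haplotypes.items():
    print(hap_id, non_modal_markers(values, modal, markers))
```

The length of each returned dictionary is the non-modal marker count, and the absolute values are the per-marker distances from modal.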

                    When comparing the relatively clear and precise distinction of the known clade groupings shown at 25 marker level with the non-distinct and confused graphing at 12 markers, there doesn't appear to be much value given by the 12 marker level - considering the amount of work you put into the cladogram generation.

                    Now.... for 37 marker cladograms..... I believe I see some benefit to be gained in defining marker patterns for use in STR prognosticating but we undoubtedly need a larger database and many more SNP tested samples to say for sure.

                    Food for thought?

                    Bill



                    • #11
                      Limited value of 12 marker cladograms

                      Originally posted by Bill Harvey
                      Victor,

                      Regarding the placement of #19310 and #32872 on the 12 marker cladogram - an analysis of the non-modal STRs reveals that #19310 had a total of 7 out of 12 non-modal markers - each of the seven was one off from modal- whereas #32872 had three non-modal markers one of which was 8 where the modal was 10 (DYS 391).
                      Good observation, Bill. The uniqueness of these two haplotypes seems to support our assumption that distinct subclades should have accumulated enough distance from other subclades to make them stand apart. This pre-supposes that there's a correlation between the appearance of a defining Unique Event Polymorphism or SNP and haplotype allele values.

                      Also of interest is #N15407 which actually is shown as a separate clade if I am reading the graph correctly. Here again there is a total of seven non-modal markers but DYS 385-a has a genetic distance of 3 from modal. In fact DYS 385-a and DYS 385-b being at a double 19,19 is unique in your sample base. The only sure way to determine whether this kit# indicates M123 or another sub-clade would be to extend the STR marker string for an educated guess or be SNP tested and be sure.

                      Interestingly, #19310 has a total of seven non-modal markers in the second panel of 13 markers with three of the seven each showing a genetic distance of 3 and a fourth mutation (DYS458) has a genetic distance of 4 from modal.
                      The software works by first finding the two haplotypes with the greatest distance between them and placing them at opposite ends. It then arranges the remaining haplotypes according to their genetic distance from each other. At the 12 marker level it is #N15407 that appears in the top branch, although that doesn't necessarily mean it is a distinct subclade. At the 25 marker level it is #19310 that has the greatest distance from the bottom cluster. In the latter case we have confirmed that #19310 is indeed in a different subclade. In the former case (#N15407), when we get an SNP result it could also turn out to be in a different subclade, or it may simply be that the 12 marker based distance matrix doesn't bring out the haplotype distinctions enough. In other words, all our haplotypes are very similar at the 12 marker level regardless of what subclades we belong to. It is at a higher number of markers that the differences (or genetic distance) start to show.
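The behaviour described in this paragraph can be sketched as follows. This only mirrors the description given here (farthest pair at opposite ends, the rest ordered by distance); it is not the Fitch-Margoliash algorithm that Kitsch actually runs, and the labels and distances are hypothetical.

```python
from itertools import combinations


def farthest_pair(ids, dist):
    """Return the two ids with the largest pairwise distance."""
    i, j = max(
        combinations(range(len(ids)), 2),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    return ids[i], ids[j]


def order_from(end_id, ids, dist):
    """Order all ids by increasing distance from one chosen end."""
    k = ids.index(end_id)
    return sorted(ids, key=lambda name: dist[k][ids.index(name)])


# Hypothetical symmetric distance matrix, for illustration only.
ids = ["hapA", "hapB", "hapC", "hapD"]
dist = [
    [0, 1, 4, 7],
    [1, 0, 5, 6],
    [4, 5, 0, 3],
    [7, 6, 3, 0],
]
ends = farthest_pair(ids, dist)
order = order_from(ends[0], ids, dist)
print(ends, order)
```

With fewer markers the distances are smaller and closer together, so the choice of the two "ends" is more easily swayed by a single unusual marker value.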

                      When comparing the relatively clear and precise distinction of the known clade groupings shown at 25 marker level with the non-distinct and confused graphing at 12 markers, there doesn't appear to be much value given by the 12 marker level - considering the amount of work you put into the cladogram generation.
                      Agreed again. I make the 12 marker diagrams mainly for those who have only tested the 12 marker panel and might come browsing this forum. I would encourage them, if possible, to upgrade their haplotypes and to confirm their subclade by SNP testing.

                      Now.... for 37 marker cladograms..... I believe I see some benefit to be gained in defining marker patterns for use in STR prognosticating but we undoubtedly need a larger database and many more SNP tested samples to say for sure.

                      Food for thought?

                      Bill
                      Right. The E3b project database is slowly growing. We just broke the 100 haplotypes mark. As to the 37 marker cladograms, when I figure out the bug in the software I'll post a new diagram.

                      Victor



                      • #12
                        Median joining networks - E3b

                        As of today, the E3b project stands at 115 records.

                        This time I want to share with you some fluxus diagrams. The first shows the global picture generated by all haplotypes (at 12 marker level). As in a previous diagram with fewer haplotypes, the two-cluster pattern remains very similar.
                        http://img.villagephotos.com/p/2005-...23/fluxus4.jpg

                        The second diagram is a zoom-in of the cluster where the M81 haplotype is located. The numbers correspond to the individual haplotype id's.
                        http://img.villagephotos.com/p/2005-...23/fluxus2.jpg

                        The third graphic is a bit more crowded, and some ID numbers overlap, making them hard to identify. This cluster might need to be analyzed on its own, as it seems to show considerable variation.
                        http://img.villagephotos.com/p/2005-...23/fluxus3.jpg


                        p.s. An excellent document on network diagrams (although it doesn't always open correctly) is here:
                        http://dimacs.rutgers.edu/Workshops/...orksTREE01.pdf

                        Victor



                        • #13
                          Cladogram from 25 marker panel haplotypes.

                          http://img.villagephotos.com/p/2005-...1_25_phylo.jpg

                          Note: TMRCA calculated using a constant mutation rate of 0.0024; units are generations not years.
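For a feel of how a constant mutation rate turns mismatches into generations, here is a deliberately simplified estimator. It is an assumption, not the calculation the comparison tool performs: under an infinite allele model a marker still matches after t generations on both lineages with probability exp(-2 * rate * t), so t can be estimated from the observed fraction of matching markers. The function name and the example mismatch count are hypothetical; only the rate 0.0024 and the 25 marker panel come from the note above.

```python
import math


def tmrca_generations(mismatches, markers, mutation_rate):
    """Simplified TMRCA estimate in generations (infinite allele model).

    Solves exp(-2 * mutation_rate * t) = fraction of markers still matching.
    """
    if not 0 <= mismatches < markers:
        raise ValueError("mismatches must be at least 0 and less than markers")
    match_fraction = 1 - mismatches / markers
    return -math.log(match_fraction) / (2 * mutation_rate)


# Hypothetical pair differing at 5 of 25 markers, rate 0.0024 per marker per generation.
generations = tmrca_generations(5, 25, 0.0024)
print(round(generations, 1))
```

Because the result is in generations, converting to years needs a separate assumption about generation length, such as the 25 years per generation used earlier in the thread.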



                          • #14
                            The diagram from the URL above is missing the legend for the branch labels.

                            triangle = M123 (E3b3)
                            square = M81 (E3b2)
                            rhombus = M78 (E3b1)



                            • #15
                              Originally posted by Victor
                              To look at a sample E3b cladogram click here.
                              The colors were added to highlight what I believe to be the three main subclades, that is E3b3, E3b1 and E3b2 in the order of appearance.
                              (I'll be posting a cladogram with the latest data soon.)


                              Victor, you keep showing this, and I look at it and ask what it is saying.
                              I don't know the numbers, and it doesn't mean anything without something to explain it.
                              What is it supposed to be saying?
