Reported Regions in the BED file - Why?


  • Reported Regions in the BED file - Why?

    I have looked at 8 samples and have a concern about the reported regions in the BED files, specifically the number of called SNPs in the shorter regions. Can someone provide a technical reason why BED regions shorter than the NGS fragment length are being reported? In particular, I am concerned about "HIGH QUALITY" SNP calls in regions that are less than 4x the average sequence fragment length. One cannot get a perfect pile of fragments at exactly the same location. Why even look at and report information from at least 45% of the currently supplied BED regions?

    What filters are used to remove apparent SNPs from regions that are not long enough to be appropriate for the technology used?


    Length1  Length2  Number  % Total
    0        20       35708   31.5
    21       50       9574    8.4
    51       100      6182    5.5
    101      300      16134   14.2
    301      501      9302    8.2
    501      1000     13322   11.7
    1001     1500     7132    6.3
    1501     2000     4401    3.9
    2001     2500     3134    2.8
    2501     2800     1343    1.2
    2801     10000    7176    6.3
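For anyone who wants to reproduce a tally like the one above from their own BED file, here is a minimal sketch. The names `parse_bed`, `length_histogram` and `BIN_UPPER` are made up for illustration, the bin edges only approximate the table, and the code assumes the standard UCSC convention where each region's length is end minus start. It is not the method used to build the table.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the length bins; an approximation of the table.
BIN_UPPER = [20, 50, 100, 300, 500, 1000, 1500, 2000, 2500, 2800, 10000]

def parse_bed(lines):
    """Yield (start, end) integer pairs from BED-style lines.

    Assumes whitespace-separated columns: chrom, start, end.
    Header lines (track, browser, #) and short lines are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        yield int(fields[1]), int(fields[2])

def length_histogram(intervals, upper=BIN_UPPER):
    """Count regions per length bin; the final slot holds anything longer."""
    counts = [0] * (len(upper) + 1)
    for start, end in intervals:
        length = end - start  # assumed half-open interval
        counts[bisect_left(upper, length)] += 1
    return counts
```

With a real file this could be called as `length_histogram(parse_bed(open("sample.bed")))`, where the file name is a placeholder.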

  • #2
    I can't answer wkauffman's question, but I'd also like to know. I have a related observation about the BED file region lengths.

    For the only Big Y dataset I've seen, the BED file indicates that 10.015 million bases are of acceptable quality -- just above the target level. The median region length is 75 bases. 38% of the regions have a length of 20 or fewer bases.

    Excluding regions with fewer than 8 base pairs, the above tally becomes 9.9986 million bases.

    In other words, the inclusion of short regions enables this particular kit to meet the advertised coverage level.

    I could use some assurance that these short regions are meaningful, and that their inclusion is justified for reasons other than allowing this particular kit to surpass the 10M target.

    Also, is the BED file based on the VCF file -- summarising which regions have PASS variants and which have REJECTED variants? Or is the BED file computed first, and then PASS/REJECTED status in the VCF file based in part on the ranges in the BED file?
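The figures in this post (total bases, total after dropping very short regions, median length, share of short regions) can be recomputed along these lines. This is only a sketch: `summarize` and its parameters are hypothetical names, and the half-open length (end minus start) is an assumption about how the BED coordinates should be read.

```python
from statistics import median

def summarize(intervals, min_length=8, short_cutoff=20):
    """Summarize a list of (start, end) BED regions.

    min_length: regions shorter than this are excluded from the second total.
    short_cutoff: regions of this length or less count as "short".
    """
    lengths = [end - start for start, end in intervals]  # assumed half-open
    return {
        "total_bases": sum(lengths),
        "total_bases_min_length": sum(n for n in lengths if n >= min_length),
        "median_length": median(lengths),
        "fraction_short": sum(n <= short_cutoff for n in lengths) / len(lengths),
    }
```

Comparing `total_bases` with `total_bases_min_length` shows how much of the reported coverage comes from the shortest regions.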



    • #3
      Originally posted by wkauffman View Post
      Can someone provide a technical reason why BED regions less than the length of the NGS fragment are being reported?
      This is a really good question. The BigY BED file indicates regions of confident calls, not regions that were sequenced. So a region could be smaller than the read length. It could also be much larger. If you give me more information, I would be happy to look at the specific region that troubles you.



      • #4
        Originally posted by dmittelman View Post
        This is a really good question. The BigY BED file indicates regions of confident calls, not regions that were sequenced. So a region could be smaller than the read length. It could also be much larger. If you give me more information, I would be happy to look at the specific region that troubles you.
        Thanks. When the BAM files become available it will be easier to work through this topic. Still very curious how such small regions could be called high quality and not be influenced by the quality of the adjacent base calls.
        Attached is a file with short regions in two of the files. region sequences like
        10017761 10017763
        10017765 10017766
        10017767 10017768
        10017771 10017772
        10017774 10017776
        10017779 10017780

        in multiple files don't seem right to be reportable.
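Clusters like the one listed above can be pulled out of a BED file automatically. The sketch below flags runs of consecutive short regions separated by small gaps; `fragmented_runs` and its thresholds (`max_length`, `max_gap`, `min_run`) are arbitrary illustrative choices, not anything defined by the Big Y pipeline.

```python
def fragmented_runs(intervals, max_length=5, max_gap=5, min_run=3):
    """Return runs of consecutive short regions separated by small gaps.

    intervals: iterable of (start, end) pairs on one chromosome.
    A run is reported only if it contains at least min_run regions.
    """
    runs, current = [], []
    for start, end in sorted(intervals):
        is_short = end - start <= max_length
        if is_short and current and start - current[-1][1] <= max_gap:
            # Extends the run currently being collected.
            current.append((start, end))
        else:
            # Run is broken: keep it if long enough, then start over.
            if len(current) >= min_run:
                runs.append(current)
            current = [(start, end)] if is_short else []
    if len(current) >= min_run:
        runs.append(current)
    return runs
```

Run on the six regions quoted above, this returns them as a single run, which makes it easy to list every such cluster and compare them across kits.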
        Attached Files



        • #5
          Originally posted by dmittelman View Post
          This is a really good question. The BigY BED file indicates regions of confident calls, not regions that were sequenced. So a region could be smaller than the read length. It could also be much larger. If you give me more information, I would be happy to look at the specific region that troubles you.
          But since the bottom line for the end user is whether a SNP is reported, are we to take it that SNPs outside the listed ranges in the BED files will not appear in the reports? Or is it more complicated than that?
          I did receive a comment from someone who had cross checked the BED file info with what is claimed to be a BAM file from BigY and this seemed to indicate that the other areas weren't sequenced at all.
          Last edited by ikennedy; 6th March 2014, 02:40 AM.



          • #6
            Originally posted by wkauffman View Post
            Thanks. When the BAM files become available it will be easier to work through this topic. Still very curious how such small regions could be called high quality and not be influenced by the quality of the adjacent base calls.
            Attached is a file with short regions in two of the files. region sequences like
            10017761 10017763
            10017765 10017766
            10017767 10017768
            10017771 10017772
            10017774 10017776
            10017779 10017780

            in multiple files don't seem right to be reportable.
            Contact customer service and request your BAM. I will make sure you get it and then you will be able to see the nearby sequence.



            • #7
              Originally posted by ikennedy View Post
              But since the bottom line for the end user is whether a SNP is reported, are we to take it that SNPs outside the listed ranges in the BED files will not appear in the reports? Or is it more complicated than that?
              I did receive a comment from someone who had cross checked the BED file info with what is claimed to be a BAM file from BigY and this seemed to indicate that the other areas weren't sequenced at all.
              The BED file lists highly confident positions. If we find a SNP somewhere outside the BED, it is not reported on the website, however it will be in the VCF file you download with the BED. In the VCF file you will see it with the designation 'FAIL'.
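To check this against your own files, one option is to test each VCF position against the BED regions. The sketch below assumes the standard conventions (BED start is 0-based and the end is exclusive, VCF POS is 1-based) and assumes the regions do not overlap; `build_index` and `vcf_pos_in_bed` are illustrative names, and whether the Big Y files follow these conventions exactly is something to verify.

```python
from bisect import bisect_right

def build_index(intervals):
    """Sort (start, end) regions and return (starts, regions) for lookup."""
    regions = sorted(intervals)
    starts = [start for start, _ in regions]
    return starts, regions

def vcf_pos_in_bed(pos, starts, regions):
    """True if 1-based VCF position pos lies inside a BED region.

    Assumes 0-based, half-open, non-overlapping BED regions.
    """
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1  # last region starting at or before pos
    return i >= 0 and regions[i][0] <= zero_based < regions[i][1]
```

Under the description above, a variant for which `vcf_pos_in_bed` returns False would be expected to carry a non-PASS status in the VCF.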

              As for your second comment, the BED file covers confident calls, not sequenced positions. In general, we are sequencing more than 13 million positions but we tend to report close to 11.5 to 12.5 million as confident.

              You should request your BAM file from customer service and then you can confirm this for your specific case.

              Hope this helps.



              • #8
                Originally posted by dmittelman View Post
                In the VCF file you will see it with the designation 'FAIL'.
                Did you mean to write 'REJECTED' rather than 'FAIL'?

                Originally posted by dmittelman View Post
                The BED file lists highly confident positions.
                Originally posted by dmittelman View Post
                As for your second comment, the BED file covers confident calls, not sequenced positions. In general, we are sequencing more than 13 million positions but we tend to report close to 11.5 to 12.5 million as confident.
                Dr Mittelman, I'm confused. In one paragraph you say the BED file describes high confidence regions. In the next paragraph you say it covers confident regions (a superset of the high confidence regions?). What is the distinction between confidence and high confidence? Does the BED file describe regions of confidence or high confidence?

                If the Big Y confidently reports 11.5-12.5 Mb as "confident", then how come out of the 20 BED files I've seen, the maximum coverage is 10.78 Mbp and the mean is 10.31 Mbp?
                Last edited by Jessant; 7th March 2014, 02:16 PM. Reason: corrected spelling of Mittelman (sorry)



                • #9
                  Originally posted by Jessant View Post
                  Did you mean to write 'REJECTED' rather than 'FAIL'?

                  Yes.

                  Dr Mittelman, I'm confused. In one paragraph you say the BED file describes high confidence regions. In the next paragraph you say it covers confident regions (a superset of the high confidence regions?). What is the distinction between confidence and high confidence? Does the BED file describe regions of confidence or high confidence?

                  My fault -- I used high in one sentence and not the next sentence. Oversight on my part.

                  If the Big Y confidently reports 11.5-12.5 Mb as "confident", then how come out of the 20 BED files I've seen, the maximum coverage is 10.78 Mbp and the mean is 10.31 Mbp?
                  If you give me the IDs for the 20 BED files you looked at, I would be happy to confirm the math, but our team based our numbers on all samples we released on February 27th.

                  We are continuously rolling out new samples and later this month the team will update our stats on whatever samples are posted.



                  • #10
                    Apologies for the bizarre formatting!



                    • #11
                      Formatting?

                      Would you like me to fix that for you? I can, though I hesitate to use the word fix in the context of BIG Ys.
                      Originally posted by dmittelman View Post
                      Apologies for the bizarre formatting!



                      • #12
                        Originally posted by dmittelman View Post
                        If you give me the IDs for the 20 BED files you looked at, I would be happy to confirm the math, but our team based our numbers on all samples we released on February 27th.

                        We are continuously rolling out new samples and later this month the team will update our stats on whatever samples are posted.
                        The 20 BED files I inspected are the 20 BigY VCF/BED .zip packages currently uploaded to the R1b-L21 Yahoo group. So it's not like I was cherry-picking my data. I will PM you the kit numbers and my calculated coverages for each.

                        Each row in the BED file represents an interval of confident (or is that highly confident?) coverage. Assuming these intervals are inclusive -- that they include the endpoints -- then the length of each interval is

                        Code:
                        end position - start position + 1.
                        The coverage for a particular BED file is the sum of the lengths of its intervals. Correct?

                        If intervals are exclusive (exclude endpoints) then the length is
                        Code:
                        end position - start position - 1
                        and the overall coverage is several thousand base pairs less.



                        • #13
                          I see from http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1 that the intervals are left-closed, right-open, meaning that the length of each interval is simply
                          Code:
                          end_position - start_position
                          So the figures I reported are slightly higher than they should be.
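To see how much the choice of convention matters for a whole file, the three readings discussed in this thread can be compared side by side. A small sketch, with `total_length` and the convention labels being names invented here:

```python
def total_length(intervals, convention="half_open"):
    """Sum region lengths under one of three endpoint conventions.

    closed:    both endpoints included  -> end - start + 1
    half_open: start included, end not  -> end - start (UCSC BED)
    open:      both endpoints excluded  -> end - start - 1
    """
    offset = {"closed": 1, "half_open": 0, "open": -1}[convention]
    return sum(end - start + offset for start, end in intervals)
```

For a file with N regions, the closed total exceeds the half-open total by exactly N bases, so the correction grows with the number of regions rather than with their length.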



                          • #14
                            Short regions - poor assembly

                            An initial look at a Big-Y BAM file indicates that the groupings of short regions of the BED files represent areas where there appear to be sequence fragment assembly problems. In general there appear to be more assembly errors in these areas in Big-Y than in Full Y.



                            • #15
                              This looks like a good thread to ask this question in.

                              I want to generate a file that lists the whole (sequenceable) Y, with each position listed as either nocall or the base in the tested person. If I accept the FTDNA take on "PASS" versus not from the VCF file, I take it I could do this as follows:

                              In a computer program, start with the HUGO reference file. That has either ACGT or N for nocall. I would have an array covering the positions listed in the HUGO reference, with the values from HUGO in it.

                              I then look at the .bed file, and if a position is NOT in the bed file, I mark that position in my array as an N for nocall.

                              I then look at the VCF file. If a position is not PASS, I mark it as N (but as I read earlier in this thread, the previous step should already have done that). Then I look at the last column of the VCF file: if it has a 1 (and only a 1), my array is not already an N, and the REF and ALT columns have only one letter each, I use the value in the ALT column in my array. If the REF or ALT column has more than one letter, I have an insertion or deletion. If it's an insertion, I'm just going to use the letter I and put the inserted bases in an auxiliary file. If it is a deletion, I have to figure out the correct number and places to put a D.

                              Is this correct?

                              Doug McDonald
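The procedure described in this post could be sketched roughly as follows. This is an untested-on-real-data illustration, not a confirmed answer to the question: `build_calls` is a hypothetical name, the inputs are assumed to be already parsed, BED regions are assumed 0-based half-open and VCF positions 1-based, and indels are only collected in a side list rather than marked with I or D.

```python
def build_calls(reference, bed_intervals, vcf_records):
    """Build a per-position call string for one chromosome.

    reference:     str of A/C/G/T/N, where index 0 is position 1.
    bed_intervals: iterable of (start, end), assumed 0-based half-open.
    vcf_records:   iterable of (pos, ref, alt, status, genotype), pos 1-based.
    Returns (calls, indels): the call string and a list of (pos, ref, alt).
    """
    # Positions outside the BED regions stay N (nocall).
    calls = ["N"] * len(reference)
    for start, end in bed_intervals:
        for i in range(max(start, 0), min(end, len(reference))):
            calls[i] = reference[i]

    indels = []
    for pos, ref, alt, status, genotype in vcf_records:
        i = pos - 1
        if not 0 <= i < len(reference):
            continue
        if status != "PASS":
            calls[i] = "N"  # rejected variants become nocalls
            continue
        if genotype != "1" or calls[i] == "N":
            continue  # reference genotype, or position already a nocall
        if len(ref) == 1 and len(alt) == 1:
            calls[i] = alt  # simple SNP: take the ALT base
        else:
            # Longer REF/ALT (indel, or several ALT alleles): handled separately.
            indels.append((pos, ref, alt))
    return "".join(calls), indels
```

Keeping the indels in a separate list leaves the question of where to place I and D markers open, which seems to be the part of the plan that still needs a decision.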

