Selecting 65000 SNPs where AF is close to 0.5 in all or most populations












3












$begingroup$


I am evaluating the tool somalier (https://github.com/brentp/somalier) and I need to create a list of about 65,000 SNPs where the allele frequency (AF) is as close to 0.5 as possible across the most representative set of populations possible with today's data.



My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is available in the somalier website, which is in GRCh37.



I am starting as input with the vcfs from the 1000genomes project in GRCh38 positions, and I would like to come up with a filtering strategy that:
(1) Sorts all SNPs by the AF being close to 0.5 in as many populations as possible.
(2) Takes the first 65,000 SNPs from that list and produces a vcf file from them.



My simplistic strategy so far has been to take the global AF value if it's between 0.3-0.7 and print it if it's a SNP with this perl one-liner:



gunzip -c my.vcf.gz | grep -v '#' | perl -lne 'if ($_ =~ /;AF=0.[34567]/) { if ($_ =~ /w+td+tS+twtwt/) {print $_} }'


Any recommendations?










share|improve this question











$endgroup$












  • $begingroup$
    If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
    $endgroup$
    – terdon
    Feb 8 at 14:17










  • $begingroup$
    I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
    $endgroup$
    – 719016
    Feb 8 at 14:24










  • $begingroup$
    What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
    $endgroup$
    – terdon
    Feb 8 at 14:33










  • $begingroup$
    My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
    $endgroup$
    – 719016
    Feb 8 at 14:47
















3












$begingroup$


I am evaluating the tool somalier (https://github.com/brentp/somalier) and I need to create a list of about 65,000 SNPs where the allele frequency (AF) is as close to 0.5 as possible across the most representative set of populations possible with today's data.



My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is available in the somalier website, which is in GRCh37.



I am starting as input with the vcfs from the 1000genomes project in GRCh38 positions, and I would like to come up with a filtering strategy that:
(1) Sorts all SNPs by the AF being close to 0.5 in as many populations as possible.
(2) Takes the first 65,000 SNPs from that list and produces a vcf file from them.



My simplistic strategy so far has been to take the global AF value if it's between 0.3-0.7 and print it if it's a SNP with this perl one-liner:



gunzip -c my.vcf.gz | grep -v '#' | perl -lne 'if ($_ =~ /;AF=0.[34567]/) { if ($_ =~ /w+td+tS+twtwt/) {print $_} }'


Any recommendations?










share|improve this question











$endgroup$












  • $begingroup$
    If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
    $endgroup$
    – terdon
    Feb 8 at 14:17










  • $begingroup$
    I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
    $endgroup$
    – 719016
    Feb 8 at 14:24










  • $begingroup$
    What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
    $endgroup$
    – terdon
    Feb 8 at 14:33










  • $begingroup$
    My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
    $endgroup$
    – 719016
    Feb 8 at 14:47














3












3








3





$begingroup$


I am evaluating the tool somalier (https://github.com/brentp/somalier) and I need to create a list of about 65,000 SNPs where the allele frequency (AF) is as close to 0.5 as possible across the most representative set of populations possible with today's data.



My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is available in the somalier website, which is in GRCh37.



I am starting as input with the vcfs from the 1000genomes project in GRCh38 positions, and I would like to come up with a filtering strategy that:
(1) Sorts all SNPs by the AF being close to 0.5 in as many populations as possible.
(2) Takes the first 65,000 SNPs from that list and produces a vcf file from them.



My simplistic strategy so far has been to take the global AF value if it's between 0.3-0.7 and print it if it's a SNP with this perl one-liner:



gunzip -c my.vcf.gz | grep -v '#' | perl -lne 'if ($_ =~ /;AF=0.[34567]/) { if ($_ =~ /w+td+tS+twtwt/) {print $_} }'


Any recommendations?










share|improve this question











$endgroup$




I am evaluating the tool somalier (https://github.com/brentp/somalier) and I need to create a list of about 65,000 SNPs where the allele frequency (AF) is as close to 0.5 as possible across the most representative set of populations possible with today's data.



My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is available in the somalier website, which is in GRCh37.



I am starting as input with the vcfs from the 1000genomes project in GRCh38 positions, and I would like to come up with a filtering strategy that:
(1) Sorts all SNPs by the AF being close to 0.5 in as many populations as possible.
(2) Takes the first 65,000 SNPs from that list and produces a vcf file from them.



My simplistic strategy so far has been to take the global AF value if it's between 0.3-0.7 and print it if it's a SNP with this perl one-liner:



gunzip -c my.vcf.gz | grep -v '#' | perl -lne 'if ($_ =~ /;AF=0.[34567]/) { if ($_ =~ /w+td+tS+twtwt/) {print $_} }'


Any recommendations?







vcf snp genomics variation






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 8 at 14:48







719016

















asked Feb 8 at 14:10









719016719016

1,034415




1,034415












  • $begingroup$
    If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
    $endgroup$
    – terdon
    Feb 8 at 14:17










  • $begingroup$
    I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
    $endgroup$
    – 719016
    Feb 8 at 14:24










  • $begingroup$
    What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
    $endgroup$
    – terdon
    Feb 8 at 14:33










  • $begingroup$
    My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
    $endgroup$
    – 719016
    Feb 8 at 14:47


















  • $begingroup$
    If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
    $endgroup$
    – terdon
    Feb 8 at 14:17










  • $begingroup$
    I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
    $endgroup$
    – 719016
    Feb 8 at 14:24










  • $begingroup$
    What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
    $endgroup$
    – terdon
    Feb 8 at 14:33










  • $begingroup$
    My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
    $endgroup$
    – 719016
    Feb 8 at 14:47
















$begingroup$
If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
$endgroup$
– terdon
Feb 8 at 14:17




$begingroup$
If possible, I strongly urge you to use hg19 instead of GRCh38 because there are much better frequency data available. Basically, GnomAD uses hg19, so any gnomAD frequencies for hg38 will come from lifting over the positions and that doesn't work too well in my experience.
$endgroup$
– terdon
Feb 8 at 14:17












$begingroup$
I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
$endgroup$
– 719016
Feb 8 at 14:24




$begingroup$
I haven't been able to reliably liftover the sites.vcf.gz file that can be downloaded from the github.com/brentp/somalier site. I will create another Q for the liftover issue.
$endgroup$
– 719016
Feb 8 at 14:24












$begingroup$
What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
$endgroup$
– terdon
Feb 8 at 14:33




$begingroup$
What do you mean? Somalier's github page says it's for hg19, not hg38: "where sites is a VCF of variant sites (provided by somalier for hg19)."
$endgroup$
– terdon
Feb 8 at 14:33












$begingroup$
My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
$endgroup$
– 719016
Feb 8 at 14:47




$begingroup$
My alignments are in GRCh38, so I need a vcf input file for somalier also in GRCh38, not the one that is in GRCh37 (gh19).
$endgroup$
– 719016
Feb 8 at 14:47










2 Answers
2






active

oldest

votes


















5












$begingroup$

Just use bcftools view for filtering:



$ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz > output.vcf


To truncate this list to 65,000 SNPs count the header lines, sum them to the number of SNPs you like to have and use head -n.



$ bcftools view -h input.vcf.gz|wc -l
255

$ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz | head -n 65255 > output.vcf





share|improve this answer









$endgroup$





















    3












    $begingroup$

    Using bcftools is the right way, but you could also greatly simplify the command you're using. You can do the same thing with a single zgrep call:



    zgrep -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


    And to limit to the first 65000 results:



    zgrep -m 65000 -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


    Finally, to keep the header of the original vcf unchanged, you could do:



    zgrep -m 65000 -P '^(#|[^#]+td+tS+twtwt.*;AF=0.[3-7])' my.vcf.gz


    I also suggest you use hg19 instead of hg38 data since the best available source for frequencies, the gnomAD database, is based on hg19 and the tool you want to test also provides a site file for hg19. But that may not be possible for you.



    If your VCF file has multiple AF entries for different sub-populations and you want to use those, you'll probably need to script something. We can't really help with that unless you show us some example input lines and start writing the basic script yourself though.






    share|improve this answer









    $endgroup$









    • 1




      $begingroup$
      I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
      $endgroup$
      – finswimmer
      Feb 8 at 19:07












    • $begingroup$
      @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
      $endgroup$
      – terdon
      Feb 8 at 19:10













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "676"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f6988%2fselecting-65000-snps-where-af-is-close-to-0-5-in-all-or-most-populations%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    5












    $begingroup$

    Just use bcftools view for filtering:



    $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz > output.vcf


    To truncate this list to 65,000 SNPs count the header lines, sum them to the number of SNPs you like to have and use head -n.



    $ bcftools view -h input.vcf.gz|wc -l
    255

    $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz | head -n 65255 > output.vcf





    share|improve this answer









    $endgroup$


















      5












      $begingroup$

      Just use bcftools view for filtering:



      $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz > output.vcf


      To truncate this list to 65,000 SNPs count the header lines, sum them to the number of SNPs you like to have and use head -n.



      $ bcftools view -h input.vcf.gz|wc -l
      255

      $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz | head -n 65255 > output.vcf





      share|improve this answer









      $endgroup$
















        5












        5








        5





        $begingroup$

        Just use bcftools view for filtering:



        $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz > output.vcf


        To truncate this list to 65,000 SNPs count the header lines, sum them to the number of SNPs you like to have and use head -n.



        $ bcftools view -h input.vcf.gz|wc -l
        255

        $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz | head -n 65255 > output.vcf





        share|improve this answer









        $endgroup$



        Just use bcftools view for filtering:



        $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz > output.vcf


        To truncate this list to 65,000 SNPs count the header lines, sum them to the number of SNPs you like to have and use head -n.



        $ bcftools view -h input.vcf.gz|wc -l
        255

        $ bcftools view -i 'AF>0.3 && AF<0.7' input.vcf.gz | head -n 65255 > output.vcf






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Feb 8 at 14:31









        finswimmerfinswimmer

        87229




        87229























            3












            $begingroup$

            Using bcftools is the right way, but you could also greatly simplify the command you're using. You can do the same thing with a single zgrep call:



            zgrep -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            And to limit to the first 65000 results:



            zgrep -m 65000 -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            Finally, to keep the header of the original vcf unchanged, you could do:



            zgrep -m 65000 -P '^(#|[^#]+td+tS+twtwt.*;AF=0.[3-7])' my.vcf.gz


            I also suggest you use hg19 instead of hg38 data since the best available source for frequencies, the gnomAD database, is based on hg19 and the tool you want to test also provides a site file for hg19. But that may not be possible for you.



            If your VCF file has multiple AF entries for different sub-populations and you want to use those, you'll probably need to script something. We can't really help with that unless you show us some example input lines and start writing the basic script yourself though.






            share|improve this answer









            $endgroup$









            • 1




              $begingroup$
              I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
              $endgroup$
              – finswimmer
              Feb 8 at 19:07












            • $begingroup$
              @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
              $endgroup$
              – terdon
              Feb 8 at 19:10


















            3












            $begingroup$

            Using bcftools is the right way, but you could also greatly simplify the command you're using. You can do the same thing with a single zgrep call:



            zgrep -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            And to limit to the first 65000 results:



            zgrep -m 65000 -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            Finally, to keep the header of the original vcf unchanged, you could do:



            zgrep -m 65000 -P '^(#|[^#]+td+tS+twtwt.*;AF=0.[3-7])' my.vcf.gz


            I also suggest you use hg19 instead of hg38 data since the best available source for frequencies, the gnomAD database, is based on hg19 and the tool you want to test also provides a site file for hg19. But that may not be possible for you.



            If your VCF file has multiple AF entries for different sub-populations and you want to use those, you'll probably need to script something. We can't really help with that unless you show us some example input lines and start writing the basic script yourself though.






            share|improve this answer









            $endgroup$









            • 1




              $begingroup$
              I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
              $endgroup$
              – finswimmer
              Feb 8 at 19:07












            • $begingroup$
              @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
              $endgroup$
              – terdon
              Feb 8 at 19:10
















            3












            3








            3





            $begingroup$

            Using bcftools is the right way, but you could also greatly simplify the command you're using. You can do the same thing with a single zgrep call:



            zgrep -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            And to limit to the first 65000 results:



            zgrep -m 65000 -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            Finally, to keep the header of the original vcf unchanged, you could do:



            zgrep -m 65000 -P '^(#|[^#]+td+tS+twtwt.*;AF=0.[3-7])' my.vcf.gz


            I also suggest you use hg19 instead of hg38 data since the best available source for frequencies, the gnomAD database, is based on hg19 and the tool you want to test also provides a site file for hg19. But that may not be possible for you.



            If your VCF file has multiple AF entries for different sub-populations and you want to use those, you'll probably need to script something. We can't really help with that unless you show us some example input lines and start writing the basic script yourself though.






            share|improve this answer









            $endgroup$



            Using bcftools is the right way, but you could also greatly simplify the command you're using. You can do the same thing with a single zgrep call:



            zgrep -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            And to limit to the first 65000 results:



            zgrep -m 65000 -P '^[^#]+td+tS+twtwt.*;AF=0.[3-7]' my.vcf.gz


            Finally, to keep the header of the original vcf unchanged, you could do:



            zgrep -m 65000 -P '^(#|[^#]+td+tS+twtwt.*;AF=0.[3-7])' my.vcf.gz


            I also suggest you use hg19 instead of hg38 data since the best available source for frequencies, the gnomAD database, is based on hg19 and the tool you want to test also provides a site file for hg19. But that may not be possible for you.



            If your VCF file has multiple AF entries for different sub-populations and you want to use those, you'll probably need to script something. We can't really help with that unless you show us some example input lines and start writing the basic script yourself though.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Feb 8 at 14:49









            terdonterdon

            4,2401729




            4,2401729








            • 1




              $begingroup$
              I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
              $endgroup$
              – finswimmer
              Feb 8 at 19:07












            • $begingroup$
              @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
              $endgroup$
              – terdon
              Feb 8 at 19:10
















            • 1




              $begingroup$
              I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
              $endgroup$
              – finswimmer
              Feb 8 at 19:07












            • $begingroup$
              @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
              $endgroup$
              – terdon
              Feb 8 at 19:10










            1




            1




            $begingroup$
            I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
            $endgroup$
            – finswimmer
            Feb 8 at 19:07






            $begingroup$
            I have a different opinion about that. hg19 is outdated sind 2013. Whenever you start something new, you should move to the current reference genome, which is by now hg38. There are lifted over versions of 1000 Genomes, ExAC and gnomAD which provide at least the frequency data by ensembl. As long as the frequency data are given in the INFO column you can easily filter by them using bcftools. So there is no need to script something.
            $endgroup$
            – finswimmer
            Feb 8 at 19:07














            $begingroup$
            @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
            $endgroup$
            – terdon
            Feb 8 at 19:10






            $begingroup$
            @finswimmer not sure what you are referring to, but if it's about hg38, I am not saying hg19 is better (it isn't) only that the gnomAD data is based on hg19 and liftover is imperfect. As for the rest, yes of course bcftools is the way to go!
            $endgroup$
            – terdon
            Feb 8 at 19:10




















            draft saved

            draft discarded




















































            Thanks for contributing an answer to Bioinformatics Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f6988%2fselecting-65000-snps-where-af-is-close-to-0-5-in-all-or-most-populations%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Biblatex bibliography style without URLs when DOI exists (in Overleaf with Zotero bibliography)

            ComboBox Display Member on multiple fields

            Is it possible to collect Nectar points via Trainline?