Filter different identical characters in multiple words



























I have a very large wordlist. How can I use Unix (or possibly Python) to find instances of multiple words fitting specific character-sharing criteria? For example, I want Words 1 and 2 to have the same fourth and seventh characters, Words 2 and 3 to have the same fourth and ninth characters, and Words 3 and 4 to have the same second, fourth, and ninth characters.



Example:



aaadiigjlf
abcdefghij
aswdofflle
bbbbbbbbbb
bisofmlwpa
fsbdfopkld
gikfkwpspa
hogkellgis


might return



abcdefghij
aaadiigjlf
fsbdfopkld
aswdofflle


EDIT: For clarification, I need the code to return any words that share the same characters in given positions; I don't have specific characters (like "d" and "g" as given in the example) in mind. Also, I'd like it to be able to return words that don't fit ALL of the criteria; e.g. in the example given, Words 1 and 4 share a fourth character, but not necessarily the second, seventh, and ninth. With the program I'm running in its finished form, I'm expecting it to return a very small list of words (probably only ten) based on nine strict character-sharing criteria.


















asked Jan 29 at 22:16 by J.T. (edited Jan 29 at 23:09)






















          1 Answer














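          Use grep, which filters lines with regular expressions. Anchor each pattern with ^ so that the dots count positions from the start of the line; without the anchor the pattern can match anywhere in the line:

          # Find all lines where the fourth and seventh letters are "d" and "g"
          grep '^...d..g' somefile

          # Find all lines where the fourth and ninth letters are "d" and "l"
          grep '^...d....l' somefile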


          If you want to enforce both rules, chain the commands together with a pipe:

          grep '^...d..g' somefile | grep '^...d....l'
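Since the question also allows Python, here is a minimal sketch of the same idea as the piped grep commands: keep only the words that have given letters at given positions. The helper name `has_letters` and the `rules` dictionary are made up for illustration, and the word list is the sample from the question.

```python
def has_letters(word, required):
    """Return True if word has each required letter at its 1-based position."""
    return all(
        len(word) >= pos and word[pos - 1] == letter
        for pos, letter in required.items()
    )

words = ["aaadiigjlf", "abcdefghij", "aswdofflle", "bbbbbbbbbb",
         "bisofmlwpa", "fsbdfopkld", "gikfkwpspa", "hogkellgis"]

# Both rules at once: "d" in the 4th, "g" in the 7th, "l" in the 9th position.
rules = {4: "d", 7: "g", 9: "l"}
hits = [w for w in words if has_letters(w, rules)]
print(hits)  # prints ['aaadiigjlf']
```

Like the grep patterns, this only works when the letters are known in advance.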


          You can shorten a run of dots with a repetition count: .{3} means the same as three dots, for example:

          grep -E '^.{3}d.{2}g' somefile

          Note that the unescaped {n} repetition syntax belongs to extended regular expressions, so it needs grep -E (egrep is an older name for the same mode). As your regular expression gets more complicated, you may need -E for other syntax as well.
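The patterns above fix specific letters. For the case clarified in the question's edit, where only the positions are known, one possible approach is sketched below in Python. It is an illustration under assumptions, not a tested solution for the real wordlist: the names `key`, `find_chains` and `links` are hypothetical, positions are 1-based, and each entry of `links` lists the positions that two consecutive words must share. Indexing the words by the characters at those positions avoids comparing every word with every other word.

```python
from collections import defaultdict

def key(word, positions):
    """Characters of word at the given 1-based positions, or None if too short."""
    if len(word) < max(positions):
        return None
    return tuple(word[p - 1] for p in positions)

def find_chains(words, links):
    """Yield tuples of distinct words where words i and i+1 agree on links[i]."""
    # One index per link: characters at the linked positions -> words having them.
    indexes = []
    for positions in links:
        index = defaultdict(list)
        for w in words:
            k = key(w, positions)
            if k is not None:
                index[k].append(w)
        indexes.append(index)

    def extend(chain):
        depth = len(chain) - 1
        if depth == len(links):
            yield tuple(chain)
            return
        k = key(chain[-1], links[depth])
        if k is None:
            return
        for candidate in indexes[depth].get(k, []):
            if candidate not in chain:
                yield from extend(chain + [candidate])

    for first in words:
        yield from extend([first])

words = ["aaadiigjlf", "abcdefghij", "aswdofflle", "bbbbbbbbbb",
         "bisofmlwpa", "fsbdfopkld", "gikfkwpspa", "hogkellgis"]

# Words 1-2 share positions 4 and 7, words 2-3 share 4 and 9,
# words 3-4 share 2, 4 and 9 (the criteria from the question).
links = [(4, 7), (4, 9), (2, 4, 9)]
chains = list(find_chains(words, links))
for chain in chains:
    print(" ".join(chain))
```

On the sample list this yields two chains, one of which is the order shown in the question (abcdefghij, aaadiigjlf, fsbdfopkld, aswdofflle). Passing a shorter `links` list returns shorter chains, which may be one way to get the partial matches the question mentions.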






























          • Sorry, I need the code to return any words that share the same characters in given positions; I don't have specific characters (like "d" and "g" as given in the example) in mind. Also, I'd like it to be able to return multiple words that don't fit the same criteria; e.g. in the example given, Words 1 and 4 share a fourth character, but not necessarily the second, seventh, and ninth.

            – J.T.
            Jan 29 at 23:03













          • That's more complicated and likely would need to be done with a real programming language such as Python. It may be possible with awk but overall I can't think of a (clean) "unix" way to do that.

            – Kristopher Ives
            Jan 29 at 23:10











          • Hmm...then should I re-ask this in a Python forum, or would I still be able to ask here?

            – J.T.
            Jan 29 at 23:13











          • It's possible someone has a wizardly way of doing it that I'm not aware of, so I'm interested if anyone here can solve it. You might also want to post the exact question on Unix Stack Exchange as well as Stack Overflow.

            – Kristopher Ives
            Jan 29 at 23:14











          • All right, I'll try crossposting. Thanks!

            – J.T.
            Jan 29 at 23:19





















          answered Jan 29 at 22:57 by Kristopher Ives












