Count letter frequency in word list, excluding duplicates in the same word












28















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question




















  • 9





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    Jan 16 at 23:48











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    Jan 17 at 0:21






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    Jan 17 at 2:38











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    Jan 17 at 6:48
















28















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question




















  • 9





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    Jan 16 at 23:48











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    Jan 17 at 0:21






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    Jan 17 at 2:38











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    Jan 17 at 6:48














28












28








28


6






I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question
















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.







python algorithm






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 16 at 19:29









Prune

43.4k143456




43.4k143456










asked Jan 16 at 19:06









MattGeekMattGeek

17319




17319








  • 9





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    Jan 16 at 23:48











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    Jan 17 at 0:21






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    Jan 17 at 2:38











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    Jan 17 at 6:48














  • 9





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    Jan 16 at 23:48











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    Jan 17 at 0:21






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    Jan 17 at 2:38











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    Jan 17 at 6:48








9




9





I would describe the requirement as counting "the number of words which include each letter".

– Stobor
Jan 16 at 23:48





I would describe the requirement as counting "the number of words which include each letter".

– Stobor
Jan 16 at 23:48













Related stackoverflow.com/questions/46486462/…

– Kasrâmvd
Jan 17 at 0:21





Related stackoverflow.com/questions/46486462/…

– Kasrâmvd
Jan 17 at 0:21




1




1





@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

– mbj
Jan 17 at 2:38





@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

– mbj
Jan 17 at 2:38













@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

– Stobor
Jan 17 at 6:48





@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

– Stobor
Jan 17 at 6:48












8 Answers
8






active

oldest

votes


















47














A variation on @Primusa answer without using update:



from collections import Counter

words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


Output



Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


Basically convert each word to a set and then iterate over each set.






share|improve this answer

































    14














    Create a counter object and then update it with sets for each word:



    from collections import Counter
    c = Counter()

    for word in wordlist:
    c.update(set(word.lower()))





    share|improve this answer



















    • 2





      It would be helpful to compare the time complexity of this solution to the one provided by OP

      – Jordan Singer
      Jan 16 at 19:11






    • 2





      @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

      – Primusa
      Jan 16 at 19:58











    • Right, I suggested that because OP was interested in efficiency.

      – Jordan Singer
      Jan 16 at 20:00











    • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

      – Walter Tross
      Jan 16 at 20:33











    • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

      – Primusa
      Jan 16 at 20:35



















    13














    One without Counter



    words=["tree","bone","indigo","developer"]
    d={}
    for word in words: # iterate over words
    for i in set(word): # to remove the duplication of characters within word
    d[i]=d.get(i,0)+1


    Output



    {'b': 1,
    'd': 2,
    'e': 3,
    'g': 1,
    'i': 1,
    'l': 1,
    'n': 2,
    'o': 3,
    'p': 1,
    'r': 2,
    't': 1,
    'v': 1}





    share|improve this answer
























    • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

      – MattGeek
      Jan 16 at 20:36



















    8














    Comparing speed of the solutions presented so far:



    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    My timing function (using different sizes for the list of words):



    word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
    ]

    for exp in range(5):
    words = word_list * 10**exp

    result_list =
    for i in range(1, 4):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import words, f{} as f'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


    The results:



            10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
    100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
    1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
    10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
    100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


    The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






    share|improve this answer


























    • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

      – Ralf
      Jan 16 at 20:28











    • Thanks for this good comparison.

      – MattGeek
      Jan 16 at 20:32



















    0














    Try using a dictionary comprehension:



    import string
    print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





    share|improve this answer































      0














      A bit too late to the party, but here you go:



      freq = {k: sum(k in word for word in words) for k in set(''.join(words))}


      which returns:



      {'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}





      share|improve this answer































        0














        from collections import Counter  
        import string

        words=["tree","bone","indigo","developer"]
        y=Counter(string.ascii_lowercase)
        new_dict=dict(y)

        for k in new_dict:
        new_dict[k]=0
        trial = 0
        while len(words) > trial:
        for let in set(words[trial]):
        if let in new_dict:
        new_dict[str(let)]=new_dict[str(let)]+1

        trial = trial +1
        print(new_dict)





        share|improve this answer

































          -1














          The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



          import string
          counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


          which produces a dict like this:



          {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



          Here's my update of Ralf's timings:



                  10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
          100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
          1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
          10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
          100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


          based on the following code and the word list from https://github.com/dwyl/english-words/



          import string
          import timeit
          import random
          from collections import Counter

          def f1(words):
          c = Counter()
          for word in words:
          c.update(set(word.lower()))
          return c

          def f2(words):
          return Counter(
          c
          for word in words
          for c in set(word.lower()))

          def f3(words):
          d = {}
          for word in words:
          for i in set(word.lower()):
          d[i] = d.get(i, 0) + 1
          return d


          def f4(words):
          d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
          return d


          with open('words.txt') as word_file:
          valid_words = set(word_file.read().split())

          for exp in range(5):

          result_list =
          for i in range(1, 5):
          t = timeit.timeit(
          'f(words)',
          'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
          number=100)
          result_list.append((i, t))

          print('{:10,d} words | {}'.format(
          len(words),
          ' | '.join(
          'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

          print(f4(random.sample(valid_words, 10000)))
          print(f4(random.sample(valid_words, 1000)))
          print(f4(random.sample(valid_words, 100)))
          print(f4(random.sample(valid_words, 10)))






          share|improve this answer



















          • 1





            But this is ASCII only -- in this day and age?

            – Janne Karila
            Jan 17 at 7:38











          • I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

            – Ev. Kounis
            Jan 17 at 15:56













          • It seems that yours is actually faster. Anyway. +1

            – Ev. Kounis
            Jan 17 at 16:02











          • @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

            – Stobor
            2 days ago











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          8 Answers
          8






          active

          oldest

          votes








          8 Answers
          8






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          47














          A variation on @Primusa answer without using update:



          from collections import Counter

          words = ["tree", "bone", "indigo", "developer"]
          counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


          Output



          Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


          Basically convert each word to a set and then iterate over each set.






          share|improve this answer






























            47














            A variation on @Primusa answer without using update:



            from collections import Counter

            words = ["tree", "bone", "indigo", "developer"]
            counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


            Output



            Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


            Basically convert each word to a set and then iterate over each set.






            share|improve this answer




























              47












              47








              47







              A variation on @Primusa answer without using update:



              from collections import Counter

              words = ["tree", "bone", "indigo", "developer"]
              counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


              Output



              Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


              Basically convert each word to a set and then iterate over each set.






              share|improve this answer















              A variation on @Primusa answer without using update:



              from collections import Counter

              words = ["tree", "bone", "indigo", "developer"]
              counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


              Output



              Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


              Basically convert each word to a set and then iterate over each set.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Jan 16 at 20:48

























              answered Jan 16 at 19:14









              Daniel MesejoDaniel Mesejo

              16.5k21430




              16.5k21430

























                  14














                  Create a counter object and then update it with sets for each word:



                  from collections import Counter
                  c = Counter()

                  for word in wordlist:
                  c.update(set(word.lower()))





                  share|improve this answer



















                  • 2





                    It would be helpful to compare the time complexity of this solution to the one provided by OP

                    – Jordan Singer
                    Jan 16 at 19:11






                  • 2





                    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                    – Primusa
                    Jan 16 at 19:58











                  • Right, I suggested that because OP was interested in efficiency.

                    – Jordan Singer
                    Jan 16 at 20:00











                  • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                    – Walter Tross
                    Jan 16 at 20:33











                  • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                    – Primusa
                    Jan 16 at 20:35
















                  14














                  Create a counter object and then update it with sets for each word:



                  from collections import Counter
                  c = Counter()

                  for word in wordlist:
                  c.update(set(word.lower()))





                  share|improve this answer



















                  • 2





                    It would be helpful to compare the time complexity of this solution to the one provided by OP

                    – Jordan Singer
                    Jan 16 at 19:11






                  • 2





                    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                    – Primusa
                    Jan 16 at 19:58











                  • Right, I suggested that because OP was interested in efficiency.

                    – Jordan Singer
                    Jan 16 at 20:00











                  • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                    – Walter Tross
                    Jan 16 at 20:33











                  • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                    – Primusa
                    Jan 16 at 20:35














                  14












                  14








                  14







                  Create a counter object and then update it with sets for each word:



                  from collections import Counter
                  c = Counter()

                  for word in wordlist:
                  c.update(set(word.lower()))





                  share|improve this answer













                  Create a counter object and then update it with sets for each word:



                  from collections import Counter
                  c = Counter()

                  for word in wordlist:
                  c.update(set(word.lower()))






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Jan 16 at 19:10









                  PrimusaPrimusa

                  5,6851627




                  5,6851627








                  • 2





                    It would be helpful to compare the time complexity of this solution to the one provided by OP

                    – Jordan Singer
                    Jan 16 at 19:11






                  • 2





                    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                    – Primusa
                    Jan 16 at 19:58











                  • Right, I suggested that because OP was interested in efficiency.

                    – Jordan Singer
                    Jan 16 at 20:00











                  • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                    – Walter Tross
                    Jan 16 at 20:33











                  • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                    – Primusa
                    Jan 16 at 20:35














                  • 2





                    It would be helpful to compare the time complexity of this solution to the one provided by OP

                    – Jordan Singer
                    Jan 16 at 19:11






                  • 2





                    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                    – Primusa
                    Jan 16 at 19:58











                  • Right, I suggested that because OP was interested in efficiency.

                    – Jordan Singer
                    Jan 16 at 20:00











                  • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                    – Walter Tross
                    Jan 16 at 20:33











                  • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                    – Primusa
                    Jan 16 at 20:35








                  2




                  2





                  It would be helpful to compare the time complexity of this solution to the one provided by OP

                  – Jordan Singer
                  Jan 16 at 19:11





                  It would be helpful to compare the time complexity of this solution to the one provided by OP

                  – Jordan Singer
                  Jan 16 at 19:11




                  2




                  2





                  @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                  – Primusa
                  Jan 16 at 19:58





                  @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

                  – Primusa
                  Jan 16 at 19:58













                  Right, I suggested that because OP was interested in efficiency.

                  – Jordan Singer
                  Jan 16 at 20:00





                  Right, I suggested that because OP was interested in efficiency.

                  – Jordan Singer
                  Jan 16 at 20:00













                  I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                  – Walter Tross
                  Jan 16 at 20:33





                  I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

                  – Walter Tross
                  Jan 16 at 20:33













                  @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                  – Primusa
                  Jan 16 at 20:35





                  @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

                  – Primusa
                  Jan 16 at 20:35











                  13














                  One without Counter



                  words=["tree","bone","indigo","developer"]
                  d={}
                  for word in words: # iterate over words
                  for i in set(word): # to remove the duplication of characters within word
                  d[i]=d.get(i,0)+1


                  Output



                  {'b': 1,
                  'd': 2,
                  'e': 3,
                  'g': 1,
                  'i': 1,
                  'l': 1,
                  'n': 2,
                  'o': 3,
                  'p': 1,
                  'r': 2,
                  't': 1,
                  'v': 1}





                  share|improve this answer
























                  • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                    – MattGeek
                    Jan 16 at 20:36
















                  13














                  One without Counter



                  words=["tree","bone","indigo","developer"]
                  d={}
                  for word in words: # iterate over words
                  for i in set(word): # to remove the duplication of characters within word
                  d[i]=d.get(i,0)+1


                  Output



                  {'b': 1,
                  'd': 2,
                  'e': 3,
                  'g': 1,
                  'i': 1,
                  'l': 1,
                  'n': 2,
                  'o': 3,
                  'p': 1,
                  'r': 2,
                  't': 1,
                  'v': 1}





                  share|improve this answer
























                  • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                    – MattGeek
                    Jan 16 at 20:36














                  13












                  13








                  13







                  One without Counter



                  words=["tree","bone","indigo","developer"]
                  d={}
                  for word in words: # iterate over words
                  for i in set(word): # to remove the duplication of characters within word
                  d[i]=d.get(i,0)+1


                  Output



                  {'b': 1,
                  'd': 2,
                  'e': 3,
                  'g': 1,
                  'i': 1,
                  'l': 1,
                  'n': 2,
                  'o': 3,
                  'p': 1,
                  'r': 2,
                  't': 1,
                  'v': 1}





                  share|improve this answer













                  One without Counter



                  words=["tree","bone","indigo","developer"]
                  d={}
                  for word in words: # iterate over words
                  for i in set(word): # to remove the duplication of characters within word
                  d[i]=d.get(i,0)+1


                  Output



                  {'b': 1,
                  'd': 2,
                  'e': 3,
                  'g': 1,
                  'i': 1,
                  'l': 1,
                  'n': 2,
                  'o': 3,
                  'p': 1,
                  'r': 2,
                  't': 1,
                  'v': 1}






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Jan 16 at 19:17









                  mad_mad_

                  4,08711022




                  4,08711022













                  • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                    – MattGeek
                    Jan 16 at 20:36



















                  • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                    – MattGeek
                    Jan 16 at 20:36

















                  Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                  – MattGeek
                  Jan 16 at 20:36





                  Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

                  – MattGeek
                  Jan 16 at 20:36











                  8














                  Comparing speed of the solutions presented so far:



                  def f1(words):
                  c = Counter()
                  for word in words:
                  c.update(set(word.lower()))
                  return c

                  def f2(words):
                  return Counter(
                  c
                  for word in words
                  for c in set(word.lower()))

                  def f3(words):
                  d = {}
                  for word in words:
                  for i in set(word.lower()):
                  d[i] = d.get(i, 0) + 1
                  return d


                  My timing function (using different sizes for the list of words):



                  word_list = [
                  'tree', 'bone', 'indigo', 'developer', 'python',
                  'language', 'timeit', 'xerox', 'printer', 'offset',
                  ]

                  for exp in range(5):
                  words = word_list * 10**exp

                  result_list =
                  for i in range(1, 4):
                  t = timeit.timeit(
                  'f(words)',
                  'from __main__ import words, f{} as f'.format(i),
                  number=100)
                  result_list.append((i, t))

                  print('{:10,d} words | {}'.format(
                  len(words),
                  ' | '.join(
                  'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


                  The results:



                          10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
                  100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
                  1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
                  10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
                  100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


                  The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






                  share|improve this answer


























                  • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                    – Ralf
                    Jan 16 at 20:28











                  • Thanks for this good comparison.

                    – MattGeek
                    Jan 16 at 20:32
















                  8














                  Comparing speed of the solutions presented so far:



                  def f1(words):
                  c = Counter()
                  for word in words:
                  c.update(set(word.lower()))
                  return c

                  def f2(words):
                  return Counter(
                  c
                  for word in words
                  for c in set(word.lower()))

                  def f3(words):
                  d = {}
                  for word in words:
                  for i in set(word.lower()):
                  d[i] = d.get(i, 0) + 1
                  return d


                  My timing function (using different sizes for the list of words):



                  word_list = [
                  'tree', 'bone', 'indigo', 'developer', 'python',
                  'language', 'timeit', 'xerox', 'printer', 'offset',
                  ]

                  for exp in range(5):
                  words = word_list * 10**exp

                  result_list =
                  for i in range(1, 4):
                  t = timeit.timeit(
                  'f(words)',
                  'from __main__ import words, f{} as f'.format(i),
                  number=100)
                  result_list.append((i, t))

                  print('{:10,d} words | {}'.format(
                  len(words),
                  ' | '.join(
                  'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


                  The results:



                          10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
                  100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
                  1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
                  10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
                  100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


                  The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






                  share|improve this answer


























                  • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                    – Ralf
                    Jan 16 at 20:28











                  • Thanks for this good comparison.

                    – MattGeek
                    Jan 16 at 20:32














                  8












                  8








                  8







                  Comparing speed of the solutions presented so far:



                  def f1(words):
                  c = Counter()
                  for word in words:
                  c.update(set(word.lower()))
                  return c

                  def f2(words):
                  return Counter(
                  c
                  for word in words
                  for c in set(word.lower()))

                  def f3(words):
                  d = {}
                  for word in words:
                  for i in set(word.lower()):
                  d[i] = d.get(i, 0) + 1
                  return d


                  My timing function (using different sizes for the list of words):



                  word_list = [
                  'tree', 'bone', 'indigo', 'developer', 'python',
                  'language', 'timeit', 'xerox', 'printer', 'offset',
                  ]

                  for exp in range(5):
                  words = word_list * 10**exp

                  result_list =
                  for i in range(1, 4):
                  t = timeit.timeit(
                  'f(words)',
                  'from __main__ import words, f{} as f'.format(i),
                  number=100)
                  result_list.append((i, t))

                  print('{:10,d} words | {}'.format(
                  len(words),
                  ' | '.join(
                  'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


                  The results:



                          10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
                  100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
                  1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
                  10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
                  100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


                  The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






                  share|improve this answer















                  Comparing speed of the solutions presented so far:



                  def f1(words):
                  c = Counter()
                  for word in words:
                  c.update(set(word.lower()))
                  return c

                  def f2(words):
                  return Counter(
                  c
                  for word in words
                  for c in set(word.lower()))

                  def f3(words):
                  d = {}
                  for word in words:
                  for i in set(word.lower()):
                  d[i] = d.get(i, 0) + 1
                  return d


                  My timing function (using different sizes for the list of words):



                  word_list = [
                  'tree', 'bone', 'indigo', 'developer', 'python',
                  'language', 'timeit', 'xerox', 'printer', 'offset',
                  ]

                  for exp in range(5):
                  words = word_list * 10**exp

                  result_list =
                  for i in range(1, 4):
                  t = timeit.timeit(
                  'f(words)',
                  'from __main__ import words, f{} as f'.format(i),
                  number=100)
                  result_list.append((i, t))

                  print('{:10,d} words | {}'.format(
                  len(words),
                  ' | '.join(
                  'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


                  The results:



                          10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
                  100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
                  1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
                  10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
                  100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


                  The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jan 16 at 20:28

























                  answered Jan 16 at 20:01









                  RalfRalf

                  5,31141133




                  5,31141133













                  • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                    – Ralf
                    Jan 16 at 20:28











                  • Thanks for this good comparison.

                    – MattGeek
                    Jan 16 at 20:32



















                  • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                    – Ralf
                    Jan 16 at 20:28











                  • Thanks for this good comparison.

                    – MattGeek
                    Jan 16 at 20:32

















                  @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                  – Ralf
                  Jan 16 at 20:28





                  @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

                  – Ralf
                  Jan 16 at 20:28













                  Thanks for this good comparison.

                  – MattGeek
                  Jan 16 at 20:32





                  Thanks for this good comparison.

                  – MattGeek
                  Jan 16 at 20:32











                  0














                  Try using a dictionary comprehension:



                  import string
                  print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





                  share|improve this answer




























                    0














                    Try using a dictionary comprehension:



                    import string
                    print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





                    share|improve this answer


























                      0












                      0








                      0







                      Try using a dictionary comprehension:



                      import string
                      print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





                      share|improve this answer













                      Try using a dictionary comprehension:



                      import string
                      print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})






                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Jan 17 at 9:49









                      U9-ForwardU9-Forward

                      14.5k21338




                      14.5k21338























                          0














                          A bit too late to the party, but here you go:



                          freq = {k: sum(k in word for word in words) for k in set(''.join(words))}


                          which returns:



                          {'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}





                          share|improve this answer




























                            0














                            A bit too late to the party, but here you go:



                            freq = {k: sum(k in word for word in words) for k in set(''.join(words))}


                            which returns:



                            {'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}





                            share|improve this answer


























                              0












                              0








                              0







                              A bit too late to the party, but here you go:



                              freq = {k: sum(k in word for word in words) for k in set(''.join(words))}


                              which returns:



                              {'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}





                              share|improve this answer













                              A bit too late to the party, but here you go:



                              freq = {k: sum(k in word for word in words) for k in set(''.join(words))}


                              which returns:



                              {'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}






                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Jan 17 at 15:49









                              Ev. KounisEv. Kounis

                              10.9k21546




                              10.9k21546























                                  0














                                  from collections import Counter  
                                  import string

                                  words=["tree","bone","indigo","developer"]
                                  y=Counter(string.ascii_lowercase)
                                  new_dict=dict(y)

                                  for k in new_dict:
                                  new_dict[k]=0
                                  trial = 0
                                  while len(words) > trial:
                                  for let in set(words[trial]):
                                  if let in new_dict:
                                  new_dict[str(let)]=new_dict[str(let)]+1

                                  trial = trial +1
                                  print(new_dict)





                                  share|improve this answer






























                                    0














                                    from collections import Counter  
                                    import string

                                    words=["tree","bone","indigo","developer"]
                                    y=Counter(string.ascii_lowercase)
                                    new_dict=dict(y)

                                    for k in new_dict:
                                    new_dict[k]=0
                                    trial = 0
                                    while len(words) > trial:
                                    for let in set(words[trial]):
                                    if let in new_dict:
                                    new_dict[str(let)]=new_dict[str(let)]+1

                                    trial = trial +1
                                    print(new_dict)





                                    share|improve this answer




























                                      0












                                      0








                                      0







                                      from collections import Counter  
                                      import string

                                      words=["tree","bone","indigo","developer"]
                                      y=Counter(string.ascii_lowercase)
                                      new_dict=dict(y)

                                      for k in new_dict:
                                      new_dict[k]=0
                                      trial = 0
                                      while len(words) > trial:
                                      for let in set(words[trial]):
                                      if let in new_dict:
                                      new_dict[str(let)]=new_dict[str(let)]+1

                                      trial = trial +1
                                      print(new_dict)





                                      share|improve this answer















                                      from collections import Counter  
                                      import string

                                      words=["tree","bone","indigo","developer"]
                                      y=Counter(string.ascii_lowercase)
                                      new_dict=dict(y)

                                      for k in new_dict:
                                      new_dict[k]=0
                                      trial = 0
                                      while len(words) > trial:
                                      for let in set(words[trial]):
                                      if let in new_dict:
                                      new_dict[str(let)]=new_dict[str(let)]+1

                                      trial = trial +1
                                      print(new_dict)






                                      share|improve this answer














                                      share|improve this answer



                                      share|improve this answer








                                      edited yesterday

























                                      answered yesterday









                                      Edward KaharoEdward Kaharo

                                      113




                                      113























                                          -1














                                          The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



                                          import string
                                          counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


                                          which produces a dict like this:



                                          {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



                                          Here's my update of Ralf's timings:



                                                  10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
                                          100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
                                          1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
                                          10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
                                          100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


                                          based on the following code and the word list from https://github.com/dwyl/english-words/



                                          import string
                                          import timeit
                                          import random
                                          from collections import Counter

                                          def f1(words):
                                          c = Counter()
                                          for word in words:
                                          c.update(set(word.lower()))
                                          return c

                                          def f2(words):
                                          return Counter(
                                          c
                                          for word in words
                                          for c in set(word.lower()))

                                          def f3(words):
                                          d = {}
                                          for word in words:
                                          for i in set(word.lower()):
                                          d[i] = d.get(i, 0) + 1
                                          return d


                                          def f4(words):
                                          d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
                                          return d


                                          with open('words.txt') as word_file:
                                          valid_words = set(word_file.read().split())

                                          for exp in range(5):

                                          result_list =
                                          for i in range(1, 5):
                                          t = timeit.timeit(
                                          'f(words)',
                                          'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
                                          number=100)
                                          result_list.append((i, t))

                                          print('{:10,d} words | {}'.format(
                                          len(words),
                                          ' | '.join(
                                          'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

                                          print(f4(random.sample(valid_words, 10000)))
                                          print(f4(random.sample(valid_words, 1000)))
                                          print(f4(random.sample(valid_words, 100)))
                                          print(f4(random.sample(valid_words, 10)))






                                          share|improve this answer



















                                          • 1





                                            But this is ASCII only -- in this day and age?

                                            – Janne Karila
                                            Jan 17 at 7:38











                                          • I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                            – Ev. Kounis
                                            Jan 17 at 15:56













                                          • It seems that yours is actually faster. Anyway. +1

                                            – Ev. Kounis
                                            Jan 17 at 16:02











                                          • @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                            – Stobor
                                            2 days ago
















                                          -1














                                          The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



                                          import string
                                          counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


                                          which produces a dict like this:



                                          {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



                                          Here's my update of Ralf's timings:



                                                  10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
                                          100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
                                          1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
                                          10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
                                          100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


                                          based on the following code and the word list from https://github.com/dwyl/english-words/



                                          import string
                                          import timeit
                                          import random
                                          from collections import Counter

                                          def f1(words):
                                          c = Counter()
                                          for word in words:
                                          c.update(set(word.lower()))
                                          return c

                                          def f2(words):
                                          return Counter(
                                          c
                                          for word in words
                                          for c in set(word.lower()))

                                          def f3(words):
                                          d = {}
                                          for word in words:
                                          for i in set(word.lower()):
                                          d[i] = d.get(i, 0) + 1
                                          return d


                                          def f4(words):
                                          d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
                                          return d


                                          with open('words.txt') as word_file:
                                          valid_words = set(word_file.read().split())

                                          for exp in range(5):

                                          result_list =
                                          for i in range(1, 5):
                                          t = timeit.timeit(
                                          'f(words)',
                                          'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
                                          number=100)
                                          result_list.append((i, t))

                                          print('{:10,d} words | {}'.format(
                                          len(words),
                                          ' | '.join(
                                          'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

                                          print(f4(random.sample(valid_words, 10000)))
                                          print(f4(random.sample(valid_words, 1000)))
                                          print(f4(random.sample(valid_words, 100)))
                                          print(f4(random.sample(valid_words, 10)))






                                          share|improve this answer



















                                          • 1





                                            But this is ASCII only -- in this day and age?

                                            – Janne Karila
                                            Jan 17 at 7:38











                                          • I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                            – Ev. Kounis
                                            Jan 17 at 15:56













                                          • It seems that yours is actually faster. Anyway. +1

                                            – Ev. Kounis
                                            Jan 17 at 16:02











                                          • @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                            – Stobor
                                            2 days ago














                                          -1












                                          -1








                                          -1







                                          The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



                                          import string
                                          counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


                                          which produces a dict like this:



                                          {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



                                          Here's my update of Ralf's timings:



                                                  10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
                                          100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
                                          1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
                                          10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
                                          100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


                                          based on the following code and the word list from https://github.com/dwyl/english-words/



                                          import string
                                          import timeit
                                          import random
                                          from collections import Counter

                                          def f1(words):
                                          c = Counter()
                                          for word in words:
                                          c.update(set(word.lower()))
                                          return c

                                          def f2(words):
                                          return Counter(
                                          c
                                          for word in words
                                          for c in set(word.lower()))

                                          def f3(words):
                                          d = {}
                                          for word in words:
                                          for i in set(word.lower()):
                                          d[i] = d.get(i, 0) + 1
                                          return d


                                          def f4(words):
                                          d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
                                          return d


                                          with open('words.txt') as word_file:
                                          valid_words = set(word_file.read().split())

                                          for exp in range(5):

                                          result_list =
                                          for i in range(1, 5):
                                          t = timeit.timeit(
                                          'f(words)',
                                          'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
                                          number=100)
                                          result_list.append((i, t))

                                          print('{:10,d} words | {}'.format(
                                          len(words),
                                          ' | '.join(
                                          'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

                                          print(f4(random.sample(valid_words, 10000)))
                                          print(f4(random.sample(valid_words, 1000)))
                                          print(f4(random.sample(valid_words, 100)))
                                          print(f4(random.sample(valid_words, 10)))






                                          share|improve this answer













                                          The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



                                          import string
                                          counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


                                          which produces a dict like this:



                                          {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



                                          Here's my update of Ralf's timings:



                                                  10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
                                          100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
                                          1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
                                          10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
                                          100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


                                          based on the following code and the word list from https://github.com/dwyl/english-words/



                                          import string
                                          import timeit
                                          import random
                                          from collections import Counter

                                          def f1(words):
                                          c = Counter()
                                          for word in words:
                                          c.update(set(word.lower()))
                                          return c

                                          def f2(words):
                                          return Counter(
                                          c
                                          for word in words
                                          for c in set(word.lower()))

                                          def f3(words):
                                          d = {}
                                          for word in words:
                                          for i in set(word.lower()):
                                          d[i] = d.get(i, 0) + 1
                                          return d


                                          def f4(words):
                                          d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
                                          return d


                                          with open('words.txt') as word_file:
                                          valid_words = set(word_file.read().split())

                                          for exp in range(5):

                                          result_list =
                                          for i in range(1, 5):
                                          t = timeit.timeit(
                                          'f(words)',
                                          'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
                                          number=100)
                                          result_list.append((i, t))

                                          print('{:10,d} words | {}'.format(
                                          len(words),
                                          ' | '.join(
                                          'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

                                          print(f4(random.sample(valid_words, 10000)))
                                          print(f4(random.sample(valid_words, 1000)))
                                          print(f4(random.sample(valid_words, 100)))
                                          print(f4(random.sample(valid_words, 10)))







                                          share|improve this answer












                                          share|improve this answer



                                          share|improve this answer










                                          answered Jan 17 at 0:56









                                          StoborStobor

                                          32.7k55357




                                          32.7k55357








                                          • 1





                                            But this is ASCII only -- in this day and age?

                                            – Janne Karila
                                            Jan 17 at 7:38











                                          • I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                            – Ev. Kounis
                                            Jan 17 at 15:56













                                          • It seems that yours is actually faster. Anyway. +1

                                            – Ev. Kounis
                                            Jan 17 at 16:02











                                          • @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                            – Stobor
                                            2 days ago














                                          • 1





                                            But this is ASCII only -- in this day and age?

                                            – Janne Karila
                                            Jan 17 at 7:38











                                          • I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                            – Ev. Kounis
                                            Jan 17 at 15:56













                                          • It seems that yours is actually faster. Anyway. +1

                                            – Ev. Kounis
                                            Jan 17 at 16:02











                                          • @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                            – Stobor
                                            2 days ago








                                          1




                                          1





                                          But this is ASCII only -- in this day and age?

                                          – Janne Karila
                                          Jan 17 at 7:38





                                          But this is ASCII only -- in this day and age?

                                          – Janne Karila
                                          Jan 17 at 7:38













                                          I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                          – Ev. Kounis
                                          Jan 17 at 15:56







                                          I would replace len([w for w in words if c in w.lower()]) with sum(c in w.lower() for w in words). Should be a bit faster. Thanks for the timings btw. +1

                                          – Ev. Kounis
                                          Jan 17 at 15:56















                                          It seems that yours is actually faster. Anyway. +1

                                          – Ev. Kounis
                                          Jan 17 at 16:02





                                          It seems that yours is actually faster. Anyway. +1

                                          – Ev. Kounis
                                          Jan 17 at 16:02













                                          @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                          – Stobor
                                          2 days ago





                                          @JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.

                                          – Stobor
                                          2 days ago


















                                          draft saved

                                          draft discarded




















































                                          Thanks for contributing an answer to Stack Overflow!


                                          • Please be sure to answer the question. Provide details and share your research!

                                          But avoid



                                          • Asking for help, clarification, or responding to other answers.

                                          • Making statements based on opinion; back them up with references or personal experience.


                                          To learn more, see our tips on writing great answers.




                                          draft saved


                                          draft discarded














                                          StackExchange.ready(
                                          function () {
                                          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
                                          }
                                          );

                                          Post as a guest















                                          Required, but never shown





















































                                          Required, but never shown














                                          Required, but never shown












                                          Required, but never shown







                                          Required, but never shown

































                                          Required, but never shown














                                          Required, but never shown












                                          Required, but never shown







                                          Required, but never shown







                                          Popular posts from this blog

                                          Biblatex bibliography style without URLs when DOI exists (in Overleaf with Zotero bibliography)

                                          ComboBox Display Member on multiple fields

                                          Is it possible to collect Nectar points via Trainline?