Commonest[] function doesn't actually show commonest elements












4












$begingroup$


I'm using Commonest function on a list with 500000 words to get the 10 most frequent elements. Then by using WordCloud, I found out that the most frequent word is actually "far", and then checked it by StringCount. So the thing I would like to know is why results from WordCloud and Commonest are so different, and how to make Commonest work properly?



File with words I used, also here



(Sorry for all mistakes, English is only my 3rd language..)



session










share|improve this question











$endgroup$












  • $begingroup$
    It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
    $endgroup$
    – Carl Woll
    Mar 7 at 16:24










  • $begingroup$
    I removed the bugs tag for now, in accordance with the tag description.
    $endgroup$
    – Szabolcs
    Mar 7 at 16:35








  • 1




    $begingroup$
    @CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
    $endgroup$
    – Szabolcs
    Mar 7 at 16:37












  • $begingroup$
    @Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
    $endgroup$
    – Carl Lange
    Mar 7 at 16:47












  • $begingroup$
    It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
    $endgroup$
    – Carl Lange
    Mar 7 at 16:56


















4












$begingroup$


I'm using Commonest function on a list with 500000 words to get the 10 most frequent elements. Then by using WordCloud, I found out that the most frequent word is actually "far", and then checked it by StringCount. So the thing I would like to know is why results from WordCloud and Commonest are so different, and how to make Commonest work properly?



File with words I used, also here



(Sorry for all mistakes, English is only my 3rd language..)



session










share|improve this question











$endgroup$












  • $begingroup$
    It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
    $endgroup$
    – Carl Woll
    Mar 7 at 16:24










  • $begingroup$
    I removed the bugs tag for now, in accordance with the tag description.
    $endgroup$
    – Szabolcs
    Mar 7 at 16:35








  • 1




    $begingroup$
    @CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
    $endgroup$
    – Szabolcs
    Mar 7 at 16:37












  • $begingroup$
    @Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
    $endgroup$
    – Carl Lange
    Mar 7 at 16:47












  • $begingroup$
    It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
    $endgroup$
    – Carl Lange
    Mar 7 at 16:56
















4












4








4





$begingroup$


I'm using Commonest function on a list with 500000 words to get the 10 most frequent elements. Then by using WordCloud, I found out that the most frequent word is actually "far", and then checked it by StringCount. So the thing I would like to know is why results from WordCloud and Commonest are so different, and how to make Commonest work properly?



File with words I used, also here



(Sorry for all mistakes, English is only my 3rd language..)



session










share|improve this question











$endgroup$




I'm using Commonest function on a list with 500000 words to get the 10 most frequent elements. Then by using WordCloud, I found out that the most frequent word is actually "far", and then checked it by StringCount. So the thing I would like to know is why results from WordCloud and Commonest are so different, and how to make Commonest work properly?



File with words I used, also here



(Sorry for all mistakes, English is only my 3rd language..)



session







list-manipulation string-manipulation filtering






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Mar 10 at 23:38









J. M. is slightly pensive

97.9k10304464




97.9k10304464










asked Mar 7 at 16:18









AplefullAplefull

503




503












  • $begingroup$
    It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
    $endgroup$
    – Carl Woll
    Mar 7 at 16:24










  • $begingroup$
    I removed the bugs tag for now, in accordance with the tag description.
    $endgroup$
    – Szabolcs
    Mar 7 at 16:35








  • 1




    $begingroup$
    @CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
    $endgroup$
    – Szabolcs
    Mar 7 at 16:37












  • $begingroup$
    @Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
    $endgroup$
    – Carl Lange
    Mar 7 at 16:47












  • $begingroup$
    It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
    $endgroup$
    – Carl Lange
    Mar 7 at 16:56




















  • $begingroup$
    It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
    $endgroup$
    – Carl Woll
    Mar 7 at 16:24










  • $begingroup$
    I removed the bugs tag for now, in accordance with the tag description.
    $endgroup$
    – Szabolcs
    Mar 7 at 16:35








  • 1




    $begingroup$
    @CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
    $endgroup$
    – Szabolcs
    Mar 7 at 16:37












  • $begingroup$
    @Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
    $endgroup$
    – Carl Lange
    Mar 7 at 16:47












  • $begingroup$
    It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
    $endgroup$
    – Carl Lange
    Mar 7 at 16:56


















$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
Mar 7 at 16:24




$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
Mar 7 at 16:24












$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
Mar 7 at 16:35






$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
Mar 7 at 16:35






1




1




$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
Mar 7 at 16:37






$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
Mar 7 at 16:37














$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
Mar 7 at 16:47






$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
Mar 7 at 16:47














$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
Mar 7 at 16:56






$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
Mar 7 at 16:56












2 Answers
2






active

oldest

votes


















10












$begingroup$

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".



Commonest[TextWords[txt], 10]



{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}




WordCloud[TextWords[txt]]


enter image description here



You can use Counts to get the counts of each word as well:



TakeLargest[Counts[TextWords[txt]], 20]



<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28,
"crispness" -> 28, "knacker" -> 27, "validly" -> 27,
"squander" -> 27, "nematode" -> 27, "autoimmune" -> 27,
"calligrapher" -> 27, "pus" -> 26, "sledding" -> 26,
"tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26,
"gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26,
"reconnoitering" -> 26|>




It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.



enter image description here



You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:



enter image description here



It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:



Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]



{"a", "about", "above", "across", "add-on", "after", "again", 
"against", "all", "almost", "alone", "along", "already", "also",
"although", "always", "among", "an", "and", "another", "any",
"anyone", "anything", "anywhere", "are", "around", "as", "at",
"back", "back-to-back", "be", "because", "become", "before",
"behind", "being", "below", "between", "born-again", "both",
"built-in", "but", "by", "can-do", "custom-made", "do", "done",
"down", "during", "each", "either", "enough", "even", "ever",
"every", "everyone", "everything", "everywhere", "far-off",
"far-out", "few", "find", "first", "for", "four", "from", "full",
"further", "get", "give", "go", "have-not", "he", "head-on", "her",
"here", "hers", "herself", "him", "himself", "his", "how", "however",
"if", "in", "interest", "into", "it", "its", "itself", "keep",
"laid-back", "last", "least", "less", "ma'am", "made", "man-made",
"many", "may", "me", "might", "more", "most", "mostly", "much",
"must", "my", "myself", "never", "next", "nobody", "nor", "no-show",
"not", "nothing", "now", "nowhere", "of", "off", "often", "on",
"once", "one", "only", "other", "our", "ours", "ourselves", "out",
"over", "own", "part", "per", "perhaps", "put", "rather",
"runner-up", "same", "seem", "seeming", "see-through",
"self-interest", "self-made", "several", "she", "show", "side",
"since", "sit-in", "so", "some", "someone", "something", "somewhere",
"still", "such", "take", "than", "that", "the", "their", "theirs",
"them", "themselves", "then", "there", "therefore", "these", "they",
"this", "those", "though", "three", "through", "thus", "to",
"together", "too", "toward", "two", "under", "until", "up", "upon",
"us", "very", "we", "well", "well-to-do", "what", "when", "where",
"where's", "whether", "which", "while", "who", "whole", "whom",
"whose", "why", "will", "with", "within", "without", "would-be",
"write-off", "yet", "you", "your", "yours", "yourself"}



I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.



What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:



Counts[TextWords[txt]]["far"]



19




Counts[TextWords[DeleteStopwords[txt]]]["far"]



39




We can see that this behaviour is weird by comparing the following:



Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort



<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18,
"farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
"faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
"farthermost" -> 14, "far-off" -> 14, "farming" -> 14,
"farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12,
"farce" -> 11, "farsighted" -> 11, "farmland" -> 10,
"farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9,
"farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7,
"farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>




Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort



<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18,
"farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
"faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
"farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13,
"farcically" -> 13, "farrowing" -> 12, "farce" -> 11,
"farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10,
"farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8,
"farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6,
"farmyard" -> 6|>




Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.






share|improve this answer











$endgroup$









  • 1




    $begingroup$
    What you observed about DeleteStopwords is the same issue I touched on here.
    $endgroup$
    – J. M. is slightly pensive
    Mar 8 at 3:29



















4












$begingroup$

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:



words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];

AllTrue[words, StringQ]
True

Take[words, 10]
{"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
"disgruntled", "ninja", "artificially", "seduce"}


Note that this import format already yielded a list of strings as opposed to a single string.



Commonest[words, 10]
{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
"veronica", "crispness", "ashen"}

TakeLargest[Counts[words], 10]
<|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
"squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
"nematode" -> 27|>

WordCloud[words]


word cloud






share|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "387"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f192806%2fcommonest-function-doesnt-actually-show-commonest-elements%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    10












    $begingroup$

    You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".



    Commonest[TextWords[txt], 10]



    {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}




    WordCloud[TextWords[txt]]


    enter image description here



    You can use Counts to get the counts of each word as well:



    TakeLargest[Counts[TextWords[txt]], 20]



    <|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28,
    "crispness" -> 28, "knacker" -> 27, "validly" -> 27,
    "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27,
    "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26,
    "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26,
    "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26,
    "reconnoitering" -> 26|>




    It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.



    enter image description here



    You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:



    enter image description here



    It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:



    Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]



    {"a", "about", "above", "across", "add-on", "after", "again", 
    "against", "all", "almost", "alone", "along", "already", "also",
    "although", "always", "among", "an", "and", "another", "any",
    "anyone", "anything", "anywhere", "are", "around", "as", "at",
    "back", "back-to-back", "be", "because", "become", "before",
    "behind", "being", "below", "between", "born-again", "both",
    "built-in", "but", "by", "can-do", "custom-made", "do", "done",
    "down", "during", "each", "either", "enough", "even", "ever",
    "every", "everyone", "everything", "everywhere", "far-off",
    "far-out", "few", "find", "first", "for", "four", "from", "full",
    "further", "get", "give", "go", "have-not", "he", "head-on", "her",
    "here", "hers", "herself", "him", "himself", "his", "how", "however",
    "if", "in", "interest", "into", "it", "its", "itself", "keep",
    "laid-back", "last", "least", "less", "ma'am", "made", "man-made",
    "many", "may", "me", "might", "more", "most", "mostly", "much",
    "must", "my", "myself", "never", "next", "nobody", "nor", "no-show",
    "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "other", "our", "ours", "ourselves", "out",
    "over", "own", "part", "per", "perhaps", "put", "rather",
    "runner-up", "same", "seem", "seeming", "see-through",
    "self-interest", "self-made", "several", "she", "show", "side",
    "since", "sit-in", "so", "some", "someone", "something", "somewhere",
    "still", "such", "take", "than", "that", "the", "their", "theirs",
    "them", "themselves", "then", "there", "therefore", "these", "they",
    "this", "those", "though", "three", "through", "thus", "to",
    "together", "too", "toward", "two", "under", "until", "up", "upon",
    "us", "very", "we", "well", "well-to-do", "what", "when", "where",
    "where's", "whether", "which", "while", "who", "whole", "whom",
    "whose", "why", "will", "with", "within", "without", "would-be",
    "write-off", "yet", "you", "your", "yours", "yourself"}



    I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.



    What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:



    Counts[TextWords[txt]]["far"]



    19




    Counts[TextWords[DeleteStopwords[txt]]]["far"]



    39




    We can see that this behaviour is weird by comparing the following:



    Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "far-off" -> 14, "farming" -> 14,
    "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12,
    "farce" -> 11, "farsighted" -> 11, "farmland" -> 10,
    "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9,
    "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7,
    "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>




    Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13,
    "farcically" -> 13, "farrowing" -> 12, "farce" -> 11,
    "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10,
    "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8,
    "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6,
    "farmyard" -> 6|>




    Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.






    share|improve this answer











    $endgroup$









    • 1




      $begingroup$
      What you observed about DeleteStopwords is the same issue I touched on here.
      $endgroup$
      – J. M. is slightly pensive
      Mar 8 at 3:29
















    10












    $begingroup$

    You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".



    Commonest[TextWords[txt], 10]



    {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}




    WordCloud[TextWords[txt]]


    enter image description here



    You can use Counts to get the counts of each word as well:



    TakeLargest[Counts[TextWords[txt]], 20]



    <|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28,
    "crispness" -> 28, "knacker" -> 27, "validly" -> 27,
    "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27,
    "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26,
    "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26,
    "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26,
    "reconnoitering" -> 26|>




    It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.



    enter image description here



    You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:



    enter image description here



    It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:



    Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]



    {"a", "about", "above", "across", "add-on", "after", "again", 
    "against", "all", "almost", "alone", "along", "already", "also",
    "although", "always", "among", "an", "and", "another", "any",
    "anyone", "anything", "anywhere", "are", "around", "as", "at",
    "back", "back-to-back", "be", "because", "become", "before",
    "behind", "being", "below", "between", "born-again", "both",
    "built-in", "but", "by", "can-do", "custom-made", "do", "done",
    "down", "during", "each", "either", "enough", "even", "ever",
    "every", "everyone", "everything", "everywhere", "far-off",
    "far-out", "few", "find", "first", "for", "four", "from", "full",
    "further", "get", "give", "go", "have-not", "he", "head-on", "her",
    "here", "hers", "herself", "him", "himself", "his", "how", "however",
    "if", "in", "interest", "into", "it", "its", "itself", "keep",
    "laid-back", "last", "least", "less", "ma'am", "made", "man-made",
    "many", "may", "me", "might", "more", "most", "mostly", "much",
    "must", "my", "myself", "never", "next", "nobody", "nor", "no-show",
    "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "other", "our", "ours", "ourselves", "out",
    "over", "own", "part", "per", "perhaps", "put", "rather",
    "runner-up", "same", "seem", "seeming", "see-through",
    "self-interest", "self-made", "several", "she", "show", "side",
    "since", "sit-in", "so", "some", "someone", "something", "somewhere",
    "still", "such", "take", "than", "that", "the", "their", "theirs",
    "them", "themselves", "then", "there", "therefore", "these", "they",
    "this", "those", "though", "three", "through", "thus", "to",
    "together", "too", "toward", "two", "under", "until", "up", "upon",
    "us", "very", "we", "well", "well-to-do", "what", "when", "where",
    "where's", "whether", "which", "while", "who", "whole", "whom",
    "whose", "why", "will", "with", "within", "without", "would-be",
    "write-off", "yet", "you", "your", "yours", "yourself"}



    I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.



    What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:



    Counts[TextWords[txt]]["far"]



    19




    Counts[TextWords[DeleteStopwords[txt]]]["far"]



    39




    We can see that this behaviour is weird by comparing the following:



    Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "far-off" -> 14, "farming" -> 14,
    "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12,
    "farce" -> 11, "farsighted" -> 11, "farmland" -> 10,
    "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9,
    "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7,
    "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>




    Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13,
    "farcically" -> 13, "farrowing" -> 12, "farce" -> 11,
    "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10,
    "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8,
    "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6,
    "farmyard" -> 6|>




    Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.






    share|improve this answer











    $endgroup$









    • 1




      $begingroup$
      What you observed about DeleteStopwords is the same issue I touched on here.
      $endgroup$
      – J. M. is slightly pensive
      Mar 8 at 3:29














    10












    10








    10





    $begingroup$

    You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".



    Commonest[TextWords[txt], 10]



    {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}




    WordCloud[TextWords[txt]]


    enter image description here



    You can use Counts to get the counts of each word as well:



    TakeLargest[Counts[TextWords[txt]], 20]



    <|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28,
    "crispness" -> 28, "knacker" -> 27, "validly" -> 27,
    "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27,
    "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26,
    "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26,
    "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26,
    "reconnoitering" -> 26|>




    It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.



    enter image description here



    You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:



    enter image description here



    It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:



    Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]



    {"a", "about", "above", "across", "add-on", "after", "again", 
    "against", "all", "almost", "alone", "along", "already", "also",
    "although", "always", "among", "an", "and", "another", "any",
    "anyone", "anything", "anywhere", "are", "around", "as", "at",
    "back", "back-to-back", "be", "because", "become", "before",
    "behind", "being", "below", "between", "born-again", "both",
    "built-in", "but", "by", "can-do", "custom-made", "do", "done",
    "down", "during", "each", "either", "enough", "even", "ever",
    "every", "everyone", "everything", "everywhere", "far-off",
    "far-out", "few", "find", "first", "for", "four", "from", "full",
    "further", "get", "give", "go", "have-not", "he", "head-on", "her",
    "here", "hers", "herself", "him", "himself", "his", "how", "however",
    "if", "in", "interest", "into", "it", "its", "itself", "keep",
    "laid-back", "last", "least", "less", "ma'am", "made", "man-made",
    "many", "may", "me", "might", "more", "most", "mostly", "much",
    "must", "my", "myself", "never", "next", "nobody", "nor", "no-show",
    "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "other", "our", "ours", "ourselves", "out",
    "over", "own", "part", "per", "perhaps", "put", "rather",
    "runner-up", "same", "seem", "seeming", "see-through",
    "self-interest", "self-made", "several", "she", "show", "side",
    "since", "sit-in", "so", "some", "someone", "something", "somewhere",
    "still", "such", "take", "than", "that", "the", "their", "theirs",
    "them", "themselves", "then", "there", "therefore", "these", "they",
    "this", "those", "though", "three", "through", "thus", "to",
    "together", "too", "toward", "two", "under", "until", "up", "upon",
    "us", "very", "we", "well", "well-to-do", "what", "when", "where",
    "where's", "whether", "which", "while", "who", "whole", "whom",
    "whose", "why", "will", "with", "within", "without", "would-be",
    "write-off", "yet", "you", "your", "yours", "yourself"}



    I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.



    What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:



    Counts[TextWords[txt]]["far"]



    19




    Counts[TextWords[DeleteStopwords[txt]]]["far"]



    39




    We can see that this behaviour is weird by comparing the following:



    Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "far-off" -> 14, "farming" -> 14,
    "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12,
    "farce" -> 11, "farsighted" -> 11, "farmland" -> 10,
    "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9,
    "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7,
    "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>




    Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13,
    "farcically" -> 13, "farrowing" -> 12, "farce" -> 11,
    "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10,
    "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8,
    "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6,
    "farmyard" -> 6|>




    Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.






    share|improve this answer











    $endgroup$



    You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".



    Commonest[TextWords[txt], 10]



    {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}




    WordCloud[TextWords[txt]]


    enter image description here



    You can use Counts to get the counts of each word as well:



    TakeLargest[Counts[TextWords[txt]], 20]



    <|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28,
    "crispness" -> 28, "knacker" -> 27, "validly" -> 27,
    "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27,
    "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26,
    "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26,
    "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26,
    "reconnoitering" -> 26|>




    It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.



    enter image description here



    You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:



    enter image description here



    It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:



    Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]



    {"a", "about", "above", "across", "add-on", "after", "again", 
    "against", "all", "almost", "alone", "along", "already", "also",
    "although", "always", "among", "an", "and", "another", "any",
    "anyone", "anything", "anywhere", "are", "around", "as", "at",
    "back", "back-to-back", "be", "because", "become", "before",
    "behind", "being", "below", "between", "born-again", "both",
    "built-in", "but", "by", "can-do", "custom-made", "do", "done",
    "down", "during", "each", "either", "enough", "even", "ever",
    "every", "everyone", "everything", "everywhere", "far-off",
    "far-out", "few", "find", "first", "for", "four", "from", "full",
    "further", "get", "give", "go", "have-not", "he", "head-on", "her",
    "here", "hers", "herself", "him", "himself", "his", "how", "however",
    "if", "in", "interest", "into", "it", "its", "itself", "keep",
    "laid-back", "last", "least", "less", "ma'am", "made", "man-made",
    "many", "may", "me", "might", "more", "most", "mostly", "much",
    "must", "my", "myself", "never", "next", "nobody", "nor", "no-show",
    "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "other", "our", "ours", "ourselves", "out",
    "over", "own", "part", "per", "perhaps", "put", "rather",
    "runner-up", "same", "seem", "seeming", "see-through",
    "self-interest", "self-made", "several", "she", "show", "side",
    "since", "sit-in", "so", "some", "someone", "something", "somewhere",
    "still", "such", "take", "than", "that", "the", "their", "theirs",
    "them", "themselves", "then", "there", "therefore", "these", "they",
    "this", "those", "though", "three", "through", "thus", "to",
    "together", "too", "toward", "two", "under", "until", "up", "upon",
    "us", "very", "we", "well", "well-to-do", "what", "when", "where",
    "where's", "whether", "which", "while", "who", "whole", "whom",
    "whose", "why", "will", "with", "within", "without", "would-be",
    "write-off", "yet", "you", "your", "yours", "yourself"}



    I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.



    What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:



    Counts[TextWords[txt]]["far"]



    19




    Counts[TextWords[DeleteStopwords[txt]]]["far"]



    39




    We can see that this behaviour is weird by comparing the following:



    Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "far-off" -> 14, "farming" -> 14,
    "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12,
    "farce" -> 11, "farsighted" -> 11, "farmland" -> 10,
    "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9,
    "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7,
    "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>




    Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort



    <|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18,
    "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17,
    "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15,
    "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13,
    "farcically" -> 13, "farrowing" -> 12, "farce" -> 11,
    "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10,
    "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8,
    "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6,
    "farmyard" -> 6|>




    Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Mar 11 at 3:43









    Mr.Wizard

    232k294761061




    232k294761061










    answered Mar 7 at 16:38









    Carl LangeCarl Lange

    4,65311139




    4,65311139








    • 1




      $begingroup$
      What you observed about DeleteStopwords is the same issue I touched on here.
      $endgroup$
      – J. M. is slightly pensive
      Mar 8 at 3:29














    • 1




      $begingroup$
      What you observed about DeleteStopwords is the same issue I touched on here.
      $endgroup$
      – J. M. is slightly pensive
      Mar 8 at 3:29








    1




    1




    $begingroup$
    What you observed about DeleteStopwords is the same issue I touched on here.
    $endgroup$
    – J. M. is slightly pensive
    Mar 8 at 3:29




    $begingroup$
    What you observed about DeleteStopwords is the same issue I touched on here.
    $endgroup$
    – J. M. is slightly pensive
    Mar 8 at 3:29











    4












    $begingroup$

    As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:



    words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];

    AllTrue[words, StringQ]
    True

    Take[words, 10]
    {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
    "disgruntled", "ninja", "artificially", "seduce"}


    Note that this import format already yielded a list of strings as opposed to a single string.



    Commonest[words, 10]
    {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
    "veronica", "crispness", "ashen"}

    TakeLargest[Counts[words], 10]
    <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
    "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
    "nematode" -> 27|>

    WordCloud[words]


    word cloud






    share|improve this answer









    $endgroup$


















      4












      $begingroup$

      As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:



      words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];

      AllTrue[words, StringQ]
      True

      Take[words, 10]
      {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
      "disgruntled", "ninja", "artificially", "seduce"}


      Note that this import format already yielded a list of strings as opposed to a single string.



      Commonest[words, 10]
      {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
      "veronica", "crispness", "ashen"}

      TakeLargest[Counts[words], 10]
      <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
      "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
      "nematode" -> 27|>

      WordCloud[words]


      word cloud






      share|improve this answer









      $endgroup$
















        4












        4








        4





        $begingroup$

        As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:



        words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];

        AllTrue[words, StringQ]
        True

        Take[words, 10]
        {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
        "disgruntled", "ninja", "artificially", "seduce"}


        Note that this import format already yielded a list of strings as opposed to a single string.



        Commonest[words, 10]
        {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
        "veronica", "crispness", "ashen"}

        TakeLargest[Counts[words], 10]
        <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
        "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
        "nematode" -> 27|>

        WordCloud[words]


        word cloud






        share|improve this answer









        $endgroup$



        As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:



        words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];

        AllTrue[words, StringQ]
        True

        Take[words, 10]
        {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
        "disgruntled", "ninja", "artificially", "seduce"}


        Note that this import format already yielded a list of strings as opposed to a single string.



        Commonest[words, 10]
        {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
        "veronica", "crispness", "ashen"}

        TakeLargest[Counts[words], 10]
        <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
        "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
        "nematode" -> 27|>

        WordCloud[words]


        word cloud







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Mar 8 at 1:59









        J. M. is slightly pensiveJ. M. is slightly pensive

        97.9k10304464




        97.9k10304464






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Mathematica Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f192806%2fcommonest-function-doesnt-actually-show-commonest-elements%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            How to change which sound is reproduced for terminal bell?

            Can I use Tabulator js library in my java Spring + Thymeleaf project?

            Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents