How to make spaCy's statistical models load faster
I am using spaCy's pretrained statistical models such as en_core_web_md. I am trying to find similar words between two lists. The code works fine, but it takes a lot of time to load the statistical model each time the script runs.

Here is the code I am using.

How can I make the models load faster? Is there a way to save the loaded model to disk?

import spacy
from operator import itemgetter

nlp = spacy.load('en_core_web_md')
list1 = ['mango', 'apple', 'tomato', 'orange', 'papaya']
list2 = ['mango', 'fig', 'cherry', 'apple', 'dates']

s_words = []
for token1 in list1:
    list_to_sort = []
    for token2 in list2:
        list_to_sort.append((token1, token2,
                             nlp(str(token1)).similarity(nlp(str(token2)))))
    # Keep only (token1, token2) from the best-scoring pair for this token1
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)

similar_words = list(zip(*s_words))[1]
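Separately from load time, the loop above re-parses every word once per pair, so `nlp()` runs `len(list1) * len(list2)` times; parsing each word once and reusing the result cuts that to `len(list1) + len(list2)` calls. The sketch below shows the caching pattern using hypothetical toy vectors and a hand-rolled cosine similarity in place of the real model, so it runs without `en_core_web_md`; with spaCy you would build `docs = {w: nlp(w) for w in words}` and call `doc1.similarity(doc2)` instead.

```python
from math import sqrt
from operator import itemgetter

# Hypothetical 2-d vectors standing in for nlp(word).vector (illustration only).
toy_vectors = {
    'mango': (1.0, 0.1), 'apple': (0.9, 0.3), 'tomato': (0.2, 1.0),
    'orange': (0.8, 0.4), 'papaya': (0.95, 0.15),
    'fig': (0.7, 0.5), 'cherry': (0.6, 0.6), 'dates': (0.5, 0.7),
}

def cosine(v1, v2):
    # The same quantity Doc.similarity computes for single-word docs with vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))

list1 = ['mango', 'apple', 'tomato', 'orange', 'papaya']
list2 = ['mango', 'fig', 'cherry', 'apple', 'dates']

# "Parse" (here: look up) each word exactly once, then reuse the cached result.
vecs1 = {w: toy_vectors[w] for w in list1}
vecs2 = {w: toy_vectors[w] for w in list2}

s_words = []
for w1, v1 in vecs1.items():
    best = max(((w1, w2, cosine(v1, v2)) for w2, v2 in vecs2.items()),
               key=itemgetter(2))
    s_words.append(best[:2])

similar_words = list(zip(*s_words))[1]
print(similar_words)
```

The structure mirrors the original loop; only the per-pair `nlp(...)` calls are hoisted out into the two dict comprehensions.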
  • Model loading is IO bound. If you want it to go faster, load a smaller model. You are using `web_md`, where "md" stands for medium; there is also `en_core_web_sm`.

    – mbatchkarov
    Nov 19 '18 at 19:43
  • @mbatchkarov Thanks for your answer, but I think I do require at least the medium model. I also tried `disable=['parser', 'ner', 'tagger']`. It definitely speeds up loading, but what if in some case the user does require the parser, NER, etc.? Model loading has to be faster by default in such a case. I believe disabling is just a hack, and I wonder whether there is a better solution. This was the gist of my question above.

    – venkatttaknev
    Nov 19 '18 at 20:15
  • A spaCy model is a large and complex data structure. To make deserialisation faster you need to attack either the large part (by finding a smaller model) or the complex part (by rewriting spaCy or training your own models). Outside of these two options, we'd have to rethink the fundamentals of what you are doing. Here are a few questions: Do you have evidence to suggest you absolutely require at least the medium model? Can you compromise on accuracy? Can you preload the model once and query it repeatedly (e.g. in a web service or a Jupyter notebook cell)? Do you have a fast SSD?

    – mbatchkarov
    Nov 20 '18 at 9:05
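The "preload once, query repeatedly" suggestion can be sketched with `functools.lru_cache`: every caller in the process shares one loaded model instead of paying the deserialisation cost per call. The loader body below is a stand-in (a `time.sleep` plus a dummy object) so the sketch runs without spaCy installed; in a real script it would be `return spacy.load(name)`.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def get_model(name):
    """Load a model once per process; later calls return the cached object.

    Stand-in body for illustration; with spaCy installed this would be
    `import spacy; return spacy.load(name)`.
    """
    time.sleep(0.2)          # simulate the slow model deserialisation
    return {'model': name}   # dummy object in place of the spaCy Language

t0 = time.perf_counter()
nlp = get_model('en_core_web_md')        # slow: first call actually "loads"
t1 = time.perf_counter()
nlp_again = get_model('en_core_web_md')  # fast: cache hit, same object back
t2 = time.perf_counter()

print(f'first call: {t1 - t0:.3f}s, cached call: {t2 - t1:.6f}s')
```

This is essentially what a long-lived web service or a Jupyter session gives you for free: the load happens once when the process starts, and every later request reuses the in-memory `nlp` object.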
python-3.x nlp spacy
asked Nov 19 '18 at 12:40
venkatttaknev