Re-processing data for Elasticsearch with a new pipeline












I have an ELK-stack server used to analyse Apache web log data. We're loading ALL of the logs, going back several years; the purpose is to look at some application-specific trends over this time period.



The data-processing pipeline is still being tweaked: this is the first time anyone has looked at this data in detail, and people are still deciding how they want it processed.



Some changes have been suggested, and while they're easy enough to make in the Logstash pipeline for new, incoming data, I'm not sure how to apply them to the data that's already in Elasticsearch. It took several days to load the current data set, and quite a bit more data has been added since, so re-processing everything through Logstash with the modified pipeline would probably take even longer.



What's the best way to apply these changes to data that has already been ingested into Elasticsearch? In the early stages of testing this setup, I would just delete the index and rebuild from scratch, but that was done with very limited data sets; with the amount of data in use here, I'm not sure that's feasible. Is there a better way?










      elasticsearch logstash






asked Nov 21 '18 at 21:48 by FrustratedWithFormsDesigner
























1 Answer
































Set up an ingest pipeline and use the Reindex API to move the data from the current index to a new index, with the pipeline configured on the destination index.

See: Ingest Node (Elasticsearch reference documentation)
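Here's a minimal sketch of that approach. The pipeline id, index names, and the single processor shown are hypothetical placeholders (the question doesn't name any of them); the processor list would need to mirror whatever the updated Logstash filters actually do:

    # Hypothetical pipeline id and processor -- swap in the real updates.
    PUT _ingest/pipeline/apache-reprocess
    {
      "description": "Re-apply updated processing to already-indexed docs",
      "processors": [
        { "set": { "field": "reprocessed", "value": true } }
      ]
    }

    # Reindex the existing index through that pipeline into a new index.
    # wait_for_completion=false returns a task id you can poll with the
    # Task API (GET _tasks/<task_id>) instead of holding the request open.
    POST _reindex?wait_for_completion=false
    {
      "source": { "index": "apache-logs" },
      "dest": { "index": "apache-logs-v2", "pipeline": "apache-reprocess" }
    }

Because this copies documents inside the cluster, it skips the file reading, parsing, and network hops of a full Logstash reload, which is why it is typically much faster.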






answered Nov 22 '18 at 0:31 by ben5556
• It sounds like this would process all the existing data through the updated pipeline, populating a new index, and then I'd drop the old one when the operation is finished. I guess this is better than reloading all of the log files from scratch. Would it be much faster? I was hoping for a way to update the index in place, but I suppose that's because I'm used to doing that in relational databases. ;)
  – FrustratedWithFormsDesigner, Nov 22 '18 at 15:44











• ...or should I create a special pipeline that contains only the updates, and reindex the existing data through that, while new data goes through the regular (and newly updated) Logstash pipeline?
  – FrustratedWithFormsDesigner, Nov 22 '18 at 15:59











• Another issue is that some of the changes to the Logstash pipeline involve the Aggregate filter plugin, and there doesn't appear to be an equivalent Ingest Node processor.
  – FrustratedWithFormsDesigner, Nov 22 '18 at 17:00











• Yep, it will be much faster. And yes, create a pipeline that has only the updates and use it while reindexing; new data will continue to use your Logstash pipeline. Unfortunately, while ingest processors cover most use cases, they are not as powerful as Logstash pipelines and can't do things like aggregation.
  – ben5556, Nov 22 '18 at 18:47











• To end up with the same index name holding the updated data, after reindexing you can take a snapshot of the new index and restore it to your old index name; the new index can be deleted after the restore. (A sketch follows below.)
  – ben5556, Nov 22 '18 at 18:49
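A hedged sketch of that snapshot-and-restore step, assuming a snapshot repository named "backup" is already registered and reusing the hypothetical index names from the earlier example:

    # Snapshot only the newly reindexed index.
    PUT _snapshot/backup/reprocess-1?wait_for_completion=true
    {
      "indices": "apache-logs-v2"
    }

    # Drop the old index, then restore the copy under the old name.
    DELETE apache-logs

    POST _snapshot/backup/reprocess-1/_restore
    {
      "indices": "apache-logs-v2",
      "rename_pattern": "apache-logs-v2",
      "rename_replacement": "apache-logs"
    }

    # After the restore completes, the interim index can be deleted.
    DELETE apache-logs-v2

An index alias pointed at whichever copy is current is often a simpler way to keep a stable name, without the extra snapshot round-trip.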










