Re-processing data for Elasticsearch with a new pipeline












I have an ELK-stack server used to analyse Apache web log data. We're loading ALL of the logs, going back several years; the purpose is to look at some application-specific trends over this time period.



The data-processing pipeline is still being tweaked: this is the first time anyone has looked at this data in detail, and people are still deciding how they want it processed.



Some changes have been suggested, and while they're easy enough to make in the Logstash pipeline for new, incoming data, I'm not sure how to apply them to the data that's already in Elasticsearch. It took several days to load the current data set, and quite a bit more data has been added since, so re-processing everything through Logstash with the modified pipeline would probably take even longer.



What's the best way to apply these changes to data that has already been ingested into Elasticsearch? In the early stages of testing this setup, I would just delete the index and rebuild from scratch, but that was done with very limited data sets; with the amount of data in use here, I'm not sure that's feasible. Is there a better way?










      elasticsearch logstash






asked Nov 21 '18 at 21:48 by FrustratedWithFormsDesigner
























1 Answer
































Set up an ingest pipeline and use the Reindex API to move the data from the current index to a new index, with the pipeline configured on the destination index.

See: Ingest Node (Elasticsearch reference documentation)
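Here's a minimal sketch of that approach. The pipeline id, index names, and the single processor shown are hypothetical placeholders (the question doesn't name any of them); the processor list would need to mirror whatever the updated Logstash filters actually do:

    # Hypothetical pipeline id and processor -- swap in the real updates.
    PUT _ingest/pipeline/apache-reprocess
    {
      "description": "Re-apply updated processing to already-indexed docs",
      "processors": [
        { "set": { "field": "reprocessed", "value": true } }
      ]
    }

    # Reindex the existing index through that pipeline into a new index.
    # wait_for_completion=false returns a task id you can poll with the
    # Task API (GET _tasks/<task_id>) instead of holding the request open.
    POST _reindex?wait_for_completion=false
    {
      "source": { "index": "apache-logs" },
      "dest": { "index": "apache-logs-v2", "pipeline": "apache-reprocess" }
    }

Because this copies documents inside the cluster, it skips the file reading, parsing, and network hops of a full Logstash reload, which is why it is typically much faster.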






answered Nov 22 '18 at 0:31 by ben5556
• It sounds like this would process all the existing data through the updated pipeline, populating a new index, and then I'd drop the old one when the operation is finished. I guess this is better than reloading all of the log files from scratch. Would it be much faster? I was hoping for a way to update the index in place, but I suppose that's because I'm used to doing that in relational databases. ;)
  – FrustratedWithFormsDesigner, Nov 22 '18 at 15:44











• ...or should I create a special pipeline that contains only the updates, and reindex the existing data through that, while new data goes through the regular (and newly updated) Logstash pipeline?
  – FrustratedWithFormsDesigner, Nov 22 '18 at 15:59











• Another issue is that some of the changes to the Logstash pipeline involve the Aggregate filter plugin, and there doesn't appear to be an equivalent Ingest Node processor.
  – FrustratedWithFormsDesigner, Nov 22 '18 at 17:00











• Yep, it will be much faster. And yes, create a pipeline that has only the updates and use it while reindexing; new data will continue to use your Logstash pipeline. Unfortunately, while ingest processors cover most use cases, they are not as powerful as Logstash pipelines and can't do things like aggregation.
  – ben5556, Nov 22 '18 at 18:47











• To end up with the same index name holding the updated data, after reindexing you can take a snapshot of the new index and restore it to your old index name; the new index can be deleted after the restore. (A sketch follows below.)
  – ben5556, Nov 22 '18 at 18:49
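A hedged sketch of that snapshot-and-restore step, assuming a snapshot repository named "backup" is already registered and reusing the hypothetical index names from the earlier example:

    # Snapshot only the newly reindexed index.
    PUT _snapshot/backup/reprocess-1?wait_for_completion=true
    {
      "indices": "apache-logs-v2"
    }

    # Drop the old index, then restore the copy under the old name.
    DELETE apache-logs

    POST _snapshot/backup/reprocess-1/_restore
    {
      "indices": "apache-logs-v2",
      "rename_pattern": "apache-logs-v2",
      "rename_replacement": "apache-logs"
    }

    # After the restore completes, the interim index can be deleted.
    DELETE apache-logs-v2

An index alias pointed at whichever copy is current is often a simpler way to keep a stable name, without the extra snapshot round-trip.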










