What should a morphline for MapReduceIndexerTool look like?
I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.
For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will get the logs from a couple of machines and will put them into HDFS.
Now, I want to be able to index those logs using a MapReduce job so that I can search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.
I know that a morphline, in general, performs a set of operations on the data it takes in, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?
I can't find any example of a morphline adapted to this MapReduce job.
Thank you.
Tags: hadoop, mapreduce, morphline
asked Mar 5 '18 at 12:35 by Cosmin Ioniță
See this section: flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
– cricket_007
Mar 5 '18 at 13:44
I've added a reference to the Cloudera doc, which has a similar use case example. Hope it helps.
– Gyanendra Dwivedi
Mar 5 '18 at 16:22
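For reference, the Flume MorphlineSolrSink that the first comment points to is configured in a Flume agent roughly as follows. This is a minimal sketch; the agent name a1, channel c1, and file path are illustrative:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.channel = c1
# Local path of the morphline configuration file
a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
# Selects a morphline if the file defines more than one
a1.sinks.k1.morphlineId = morphline1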
2 Answers
Cloudera has a guide covering an almost identical use case in its morphline documentation. From the guide:
In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
This use case is what production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation.
answered Mar 5 '18 at 16:21 by Gyanendra Dwivedi
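For illustration, a minimal morphline along the lines the guide describes might look like the sketch below. The collection name, dictionary path, and grok expression are assumptions to adapt to your own log format:

SOLR_LOCATOR : {
  collection : logs                # illustrative collection name
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Emit one record per input line
      { readLine { charset : UTF-8 } }

      # Use pattern matching to pull structured fields out of the raw syslog line
      {
        grok {
          dictionaryFiles : [/path/to/grok-dictionaries]    # illustrative path
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # Load the structured record into Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]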
This doesn't really answer the question what kind of operations should I perform... unless you are referring to the grok command
– cricket_007
Mar 5 '18 at 23:04
@cricket_007 The link has the details on how to do that, including sample code. I can't replicate the tutorial here, so I've put in an abstract and a possible approach. The OP still needs to go through the complete Cloudera guide.
– Gyanendra Dwivedi
Mar 6 '18 at 7:05
Thank you for the answer. The answer is more or less correct and complete. The point is that I had already read the documentation you mentioned, but now I've spent a bit more time on it. However, I have one question: what is the actual purpose of the morphline? It just transforms the data into tokens so they can be easily indexed, am I correct?
– Cosmin Ioniță
Mar 6 '18 at 8:48
A morphline in the big data world is similar to ETL in the classical world. The purpose of a morphline is to transform data from one state to another using a program/command. There has been development along these lines to make it config-driven and standardized, but I feel it is still emerging. Maybe someday we will have a morphline framework which supports plug-your-transformation in a better way.
– Gyanendra Dwivedi
Mar 6 '18 at 9:02
Okay, great, but I still need a clear answer to my question. Is the purpose of a morphline, in the case of MapReduceIndexerTool, to transform the data into an easily indexable format? What is its actual purpose when we want to index data using that MapReduce job?
– Cosmin Ioniță
Mar 6 '18 at 10:03
In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.
For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:
SOLR_LOCATOR : {
  # Name of the target Solr collection
  collection : collection1
  # ZooKeeper ensemble used by SolrCloud
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Parse the Avro container file and emit one record per Avro object
      { readAvroContainer {} }

      # Map Avro paths to Solr document fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      # Drop record fields that are unknown to the Solr schema
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # Load the record into Solr
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
This is the command I'm using to index files and merge them into the running collection:
sudo -u hdfs hadoop --config /etc/hadoop/conf \
  jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /local/path/morphlines_file \
  --output-dir hdfs://localhost/mrit/out \
  --zk-host localhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:/mrit/in/my-avro-file.avro
Solr should be configured to work with HDFS, and the collection should exist.
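If you are on CDH, the collection can be created beforehand with the solrctl utility. A sketch, assuming the default instance directory layout; the directory and collection names are yours to choose:

# Generate a template instance directory (solrconfig.xml, schema.xml, ...)
solrctl instancedir --generate $HOME/collection1
# Upload it to ZooKeeper under the name collection1
solrctl instancedir --create collection1 $HOME/collection1
# Create the collection with, for example, 2 shards
solrctl collection --create collection1 -s 2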
All of this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.
edited Nov 21 '18 at 21:13
answered Nov 21 '18 at 20:54 by arghtype