What should a morphline for MapReduceIndexerTool look like?

I want to search through a large volume of logs (about 1 TB in size, spread across multiple machines) efficiently.

For that purpose, I want to build an infrastructure composed of Flume, Hadoop, and Solr. Flume will collect the logs from several machines and put them into HDFS.

Now I want to index those logs with a MapReduce job so that I can search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.

I know that a morphline, in general, performs a set of operations on the data it receives, but what kind of operations should I perform if I want to use MapReduceIndexerTool?

I can't find any example of a morphline adapted to this MapReduce job.

Thank you.

hadoop mapreduce morphline

asked Mar 5 '18 at 12:35 by Cosmin Ioniță

  • You can find a morphline example in this section of the Flume user guide: flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

    – cricket_007
    Mar 5 '18 at 13:44

  • I have added a reference to the Cloudera docs, which include a similar use-case example. Hope it helps.

    – Gyanendra Dwivedi
    Mar 5 '18 at 16:22

2 Answers

Cloudera has a guide that covers an almost identical use case in its morphlines section.



[Figure: a Flume source feeding syslog events through a morphline (readLine → grok → loadSolr) into Solr]




In this figure, a Flume Source receives syslog events and sends them
to a Flume Morphline Sink, which converts each Flume event to a record
and pipes it into a readLine command. The readLine command extracts
the log line and pipes it into a grok command. The grok command uses
regular expression pattern matching to extract some substrings of the
line. It pipes the resulting structured record into the loadSolr
command. Finally, the loadSolr command loads the record into Solr,
typically a SolrCloud. In the process, raw data or semi-structured
data is transformed into structured data according to application
modelling requirements.
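
To make that concrete: a morphline for MapReduceIndexerTool over raw log files could follow the same readLine → grok → loadSolr shape the quote describes. The sketch below is illustrative rather than the official example: the collection name, ZooKeeper address, dictionary path, and grok expression are placeholders for syslog-style lines that you would adapt to your own data and Solr schema.

SOLR_LOCATOR : {
  collection : logs                  # assumed collection name
  zkHost : "127.0.0.1:2181/solr"     # assumed ZooKeeper ensemble
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # emit one record per log line
      { readLine { charset : UTF-8 } }

      # parse the line into named fields via regular expressions
      {
        grok {
          dictionaryFiles : [grok-dictionaries]   # placeholder path to pattern definitions
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # drop fields that the Solr schema does not declare
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # hand the record over for indexing; when run under
      # MapReduceIndexerTool this writes index shards in HDFS
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]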




The example's use case is essentially what production tools such as MapReduceIndexerTool, the Apache Flume MorphlineSolrSink, the Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:



[Figure: morphlines embedded in MapReduceIndexerTool, the Flume MorphlineSolrSink/MorphlineInterceptor, and the Lily HBase Indexer]
answered Mar 5 '18 at 16:21 by Gyanendra Dwivedi

  • This doesn't really answer the question "what kind of operations should I perform"... unless you are referring to the grok command

    – cricket_007
    Mar 5 '18 at 23:04

  • @cricket_007 the link has the details on how to do that, including the sample code. I can't replicate the tutorial here, so I have put in an abstract and a possible approach. The OP still needs to go through the complete Cloudera guide.

    – Gyanendra Dwivedi
    Mar 6 '18 at 7:05

  • Thank you for the answer. The answer is broadly correct and complete. The point is that I had already read the documentation you mentioned, but now I have spent a bit more time on it. However, I have one question: what is the actual purpose of the morphline? Does it just transform the data into tokens so that it is easily indexable? Am I correct?

    – Cosmin Ioniță
    Mar 6 '18 at 8:48

  • A morphline in the big-data world is similar to ETL in the classical world. The purpose of a morphline is to transform data from one state to another using a program/command. There has been development along this line to make it config-driven and standardized, but I feel it is still emerging. Maybe someday we will have a morphline framework that supports plugging in your own transformations in a better way.

    – Gyanendra Dwivedi
    Mar 6 '18 at 9:02

  • Okay, great, but I still need a clear answer to my question. Is the purpose of a morphline, in the case of MapReduceIndexerTool, to transform the data into an easily indexable format? What is its actual purpose when we want to index data using that MapReduce job?

    – Cosmin Ioniță
    Mar 6 '18 at 10:03

In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.



For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:



SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # parse each input file as an Avro object container
      { readAvroContainer {} }

      # map Avro record paths onto Solr document fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      # remove fields that are not declared in the Solr schema
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # load the record into Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]


When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
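
A note on the field names: the _s suffixes in the paths block are not interpreted by the morphline itself. They only work because typical Solr schemas declare a *_s dynamic field as a stored string type, so field1_s and field2_s are accepted without explicit field definitions. If your schema defines concrete field names, map the Avro paths to those names instead.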



This is the command I'm using to index the files and merge them into the running collection:



sudo -u hdfs hadoop --config /etc/hadoop/conf \
  jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /local/path/morphlines_file \
  --output-dir hdfs://localhost/mrit/out \
  --zk-host localhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:/mrit/in/my-avro-file.avro


Solr should be configured to work with HDFS, and the collection should exist before the job runs. The --go-live flag then merges the generated index shards into that live collection.
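
On CDH specifically, creating that collection beforehand could look roughly like the following. This is a sketch under the assumption that you use the solrctl helper; the config directory path, collection name, and shard count are placeholders:

# generate a local config template, register it, then create the collection
# (collection1 and the shard count -s 2 are placeholders for your setup)
solrctl instancedir --generate /tmp/collection1_config
solrctl instancedir --create collection1 /tmp/collection1_config
solrctl collection --create collection1 -s 2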



All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.
answered Nov 21 '18 at 20:54 (edited Nov 21 '18 at 21:13) by arghtype