What should a morphline for MapReduceIndexerTool look like?

I want to search through a large volume of logs (about 1 TB in size, spread across multiple machines) efficiently.

For that purpose, I want to build an infrastructure composed of Flume, Hadoop, and Solr. Flume will collect the logs from several machines and put them into HDFS.

Now I want to index those logs with a MapReduce job so that I can search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.

I know that a morphline, in general, performs a set of operations on the data it receives, but what kind of operations should I perform if I want to use MapReduceIndexerTool?

I can't find any example of a morphline adapted to this MapReduce job.

Thank you.

hadoop mapreduce morphline

asked Mar 5 '18 at 12:35 by Cosmin Ioniță

  • You can find a morphline example in this section of the Flume user guide: flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

    – cricket_007
    Mar 5 '18 at 13:44

  • I have added a reference to the Cloudera docs, which include a similar use-case example. Hope it helps.

    – Gyanendra Dwivedi
    Mar 5 '18 at 16:22

2 Answers

Cloudera has a guide that covers an almost identical use case in its morphlines section.



[Figure: a Flume source feeding syslog events through a morphline (readLine → grok → loadSolr) into Solr]




In this figure, a Flume Source receives syslog events and sends them
to a Flume Morphline Sink, which converts each Flume event to a record
and pipes it into a readLine command. The readLine command extracts
the log line and pipes it into a grok command. The grok command uses
regular expression pattern matching to extract some substrings of the
line. It pipes the resulting structured record into the loadSolr
command. Finally, the loadSolr command loads the record into Solr,
typically a SolrCloud. In the process, raw data or semi-structured
data is transformed into structured data according to application
modelling requirements.
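
To make that concrete: a morphline for MapReduceIndexerTool over raw log files could follow the same readLine → grok → loadSolr shape the quote describes. The sketch below is illustrative rather than the official example: the collection name, ZooKeeper address, dictionary path, and grok expression are placeholders for syslog-style lines that you would adapt to your own data and Solr schema.

SOLR_LOCATOR : {
  collection : logs                  # assumed collection name
  zkHost : "127.0.0.1:2181/solr"     # assumed ZooKeeper ensemble
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # emit one record per log line
      { readLine { charset : UTF-8 } }

      # parse the line into named fields via regular expressions
      {
        grok {
          dictionaryFiles : [grok-dictionaries]   # placeholder path to pattern definitions
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # drop fields that the Solr schema does not declare
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # hand the record over for indexing; when run under
      # MapReduceIndexerTool this writes index shards in HDFS
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]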




The example's use case is essentially what production tools such as MapReduceIndexerTool, the Apache Flume MorphlineSolrSink, the Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:



[Figure: morphlines embedded in MapReduceIndexerTool, the Flume MorphlineSolrSink/MorphlineInterceptor, and the Lily HBase Indexer]
answered Mar 5 '18 at 16:21 by Gyanendra Dwivedi

  • This doesn't really answer the question "what kind of operations should I perform"... unless you are referring to the grok command

    – cricket_007
    Mar 5 '18 at 23:04

  • @cricket_007 the link has the details on how to do that, including the sample code. I can't replicate the tutorial here, so I have put in an abstract and a possible approach. The OP still needs to go through the complete Cloudera guide.

    – Gyanendra Dwivedi
    Mar 6 '18 at 7:05

  • Thank you for the answer. The answer is broadly correct and complete. The point is that I had already read the documentation you mentioned, but now I have spent a bit more time on it. However, I have one question: what is the actual purpose of the morphline? Does it just transform the data into tokens so that it is easily indexable? Am I correct?

    – Cosmin Ioniță
    Mar 6 '18 at 8:48

  • A morphline in the big-data world is similar to ETL in the classical world. The purpose of a morphline is to transform data from one state to another using a program/command. There has been development along this line to make it config-driven and standardized, but I feel it is still emerging. Maybe someday we will have a morphline framework that supports plugging in your own transformations in a better way.

    – Gyanendra Dwivedi
    Mar 6 '18 at 9:02

  • Okay, great, but I still need a clear answer to my question. Is the purpose of a morphline, in the case of MapReduceIndexerTool, to transform the data into an easily indexable format? What is its actual purpose when we want to index data using that MapReduce job?

    – Cosmin Ioniță
    Mar 6 '18 at 10:03

In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.



For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:



SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # parse each input file as an Avro object container
      { readAvroContainer {} }

      # map Avro record paths onto Solr document fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      # remove fields that are not declared in the Solr schema
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # load the record into Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]


When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
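
A note on the field names: the _s suffixes in the paths block are not interpreted by the morphline itself. They only work because typical Solr schemas declare a *_s dynamic field as a stored string type, so field1_s and field2_s are accepted without explicit field definitions. If your schema defines concrete field names, map the Avro paths to those names instead.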



This is the command I'm using to index the files and merge them into the running collection:



sudo -u hdfs hadoop --config /etc/hadoop/conf \
  jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /local/path/morphlines_file \
  --output-dir hdfs://localhost/mrit/out \
  --zk-host localhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:/mrit/in/my-avro-file.avro


Solr should be configured to work with HDFS, and the collection should exist before the job runs. The --go-live flag then merges the generated index shards into that live collection.
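
On CDH specifically, creating that collection beforehand could look roughly like the following. This is a sketch under the assumption that you use the solrctl helper; the config directory path, collection name, and shard count are placeholders:

# generate a local config template, register it, then create the collection
# (collection1 and the shard count -s 2 are placeholders for your setup)
solrctl instancedir --generate /tmp/collection1_config
solrctl instancedir --create collection1 /tmp/collection1_config
solrctl collection --create collection1 -s 2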



All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.
answered Nov 21 '18 at 20:54 (edited Nov 21 '18 at 21:13) by arghtype