How do I arrange single cardinality for vertex properties imported via CSV into AWS Neptune?



























The Neptune documentation says that only "Set" property cardinality is supported for property data imported via CSV, which means a newly arrived property value can never overwrite the old value of the same property on the same vertex.



For example, if the first CSV imports



~id,~label,age
Marko,person,29


then Marko has a birthday and a second CSV imports



~id,~label,age
Marko,person,30


the 'Marko' vertex's 'age' property will contain both age values, which doesn't seem useful.
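The Set-cardinality merge behaviour can be sketched in plain Python (a simulation for illustration, not the Neptune API): each load adds to the value set instead of replacing it.

```python
# Simulate Neptune's Set-cardinality behaviour across two CSV loads.
props = {}

def load_row(props, vertex_id, key, value):
    # Set cardinality: a new value is added alongside the existing ones.
    props.setdefault((vertex_id, key), set()).add(value)

load_row(props, "Marko", "age", 29)  # first CSV load
load_row(props, "Marko", "age", 30)  # second CSV load, after the birthday
print(sorted(props[("Marko", "age")]))  # -> [29, 30]: both values survive
```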



AWS says that collapsing Set-cardinality properties to Single cardinality (keeping only the last-arrived value) needs to be done with post-processing, via Gremlin traversals.



Does this mean there should be a traversal that continuously scans vertices with multiple (Set) property values and re-sets each property with Single cardinality, keeping the last value? If so, what is the optimal Gremlin query to do that?



In pseudo-Gremlin I'd imagine something like:



g.V().property(single, properties(*), _.tail())


Is there any guarantee that Set-cardinality property values are always listed in order of arrival?



Or am I completely on the wrong track here?



Any help would be appreciated.



Update:
The best thing I was able to come up with so far is still far from a perfect solution, but it might be useful for someone in my shoes.



Plan A: if we happen to know the property names, and the order of arrival does not matter at all (we just want single cardinality on these properties), the traversal for all vertices could be something like:



g.V().has(${propname}).where(property(single, ${propname}, properties(${propname}).value().order().tail() ) )


Plan B is to collect new property values under temporary property names on the same vertex (e.g. starting with _), then traverse the vertices having such temporary properties and set the original properties to their tailed values with single cardinality:



g.V().has(${temp_propname}).where(property(single, ${propname}, properties(${temp_propname}).value().order().tail() ) ).properties(${temp_propname}).drop()


Plan C, which would be the coolest but unfortunately does not work, is to keep collecting property values on a dedicated vertex, with epoch timestamps as property names and property values as their values:



g.V(${vertexid}).out('has_propnames').properties()
==>vp[1542827843->value1]
==>vp[1542827798->value2]
==>vp[1542887080->latestvalue]


and then sort the property names (keys), take the last one, and use its value to keep the main vertex's property up to date with the latest value:



g.V().has(${propname}).where(out(${has_these_properties}).count().is(gt(0))).where(property(single, ${propname}, out(${has_these_properties}).properties().value(  out(${has_these_properties}).properties().keys().order().tail()  ) ) )


It looks like the parameter of the value() step must be a constant; it can't take the outcome of another traversal as a parameter, so I could not get this working. Perhaps someone with more Gremlin experience knows a workaround for this.
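Since value() can't take a dynamic key, one possible workaround for Plan C is to do the key selection client-side: read the timestamp->value pairs from the side vertex, pick the latest key in application code, and then send a single property(single, ...) update. A minimal Python sketch of just the selection step (plain Python, not a Gremlin call):

```python
def latest_value(ts_to_value):
    """Pick the value stored under the largest (latest) epoch-timestamp key."""
    latest_ts = max(ts_to_value, key=int)  # keys are epoch seconds as strings
    return ts_to_value[latest_ts]

# Timestamp->value pairs as they would be read from the side vertex:
props = {"1542827843": "value1", "1542827798": "value2", "1542887080": "latestvalue"}
print(latest_value(props))  # -> latestvalue
```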










      amazon-web-services csv gremlin cardinality amazon-neptune






      edited Dec 18 '18 at 0:40







      user10796762

















      asked Nov 16 '18 at 15:29









Balazs David Molnar

1 Answer






It would probably be more performant to read the file from which you are bulk loading and set the property using the vertex id, rather than scanning for vertices with multiple values for that property.



So your Gremlin update query would be as follows:



          g.V(${id})
          .property(single,${key},${value})


As for whether Set-cardinality values come back in a guaranteed order, I do not know. :(
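The suggestion above can be sketched in Python (an illustrative sketch; the final Gremlin submission is left as a comment since it needs a live Neptune endpoint): collapse the bulk-load CSV to the last value per vertex id, then issue one property(single, ...) update per vertex.

```python
import csv
from io import StringIO

def last_values(csv_text, prop):
    """Map each ~id to the last value seen for `prop`, in file order."""
    latest = {}
    for row in csv.DictReader(StringIO(csv_text)):
        latest[row["~id"]] = row[prop]  # later rows overwrite earlier ones
    return latest

data = "~id,~label,age\nMarko,person,29\nMarko,person,30\n"
updates = last_values(data, "age")
print(updates)  # -> {'Marko': '30'}

# One update per vertex (needs a live connection; shown as pseudocode):
# for vid, value in updates.items():
#     g.V(vid).property(single, "age", value).iterate()
```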






• Thank you for your answer! The problem is that vertices in my setup arrive very fast: CSVs containing over 100,000 vertices arrive every minute (and get processed in 2-3 seconds, so that part works amazingly fast), and that's only the beginning. On the other hand, I see Gremlin queries complete in the 10-1000 ms range, so I'm afraid that if I started to send a property-update Gremlin query for each vertex by id, one by one, at that volume, I'd probably have a massive backlog in no time.

            – Balazs David Molnar
            Nov 20 '18 at 22:34











          • Yes, it might not keep up without some further optimization. You would think that since they allow a distinction between single and array types in the bulk load headers that it would factor into Single vs Set. Maybe in a newer version if enough people request it.

            – Dave Zabriskie
            Nov 21 '18 at 16:17











answered Nov 20 '18 at 18:54

Dave Zabriskie
