Spark failing because S3 files are updated. How to eliminate this error?











My Spark script is failing because the S3 bucket from which the DataFrame is drawn gets updated with new files while the script is running. I don't care about the newly arriving files, but apparently Spark does.



I've tried adding the REFRESH TABLE command per the error message, but that doesn't work because there is no way to know at execution time when the new files will arrive, and therefore no way to know where to put that command. I have tried putting the REFRESH command in four different places in the script (in other words, invoking it four times at different points), all with the same failure message:



Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.


I create the df with: df = spark.table('data_base.bal_daily_posts')
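
For reference, the REFRESH TABLE call can be issued from PySpark in either of the following forms, using the table name and the same spark session from the line above. This only illustrates the invocation itself, not where in the script it belongs, which is exactly the open problem:

spark.sql("REFRESH TABLE data_base.bal_daily_posts")
# or equivalently, through the catalog API:
spark.catalog.refreshTable("data_base.bal_daily_posts")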



So what can I do to make sure that S3 files arriving at S3 post-script-kickoff are ignored and do not error out the script?










      apache-spark amazon-s3 apache-spark-sql






asked Nov 14 at 21:55 by Thom Rogers
2 Answers






Move the files you're going to process to a different folder (key) and point Spark at that folder only.
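
A minimal sketch of that idea, assuming boto3 is available and AWS credentials are configured; the bucket and prefix names below are placeholders, and the server-side copy may itself be costly for very large inputs:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
bucket = boto3.resource("s3").Bucket("my-bucket")   # placeholder bucket name

src_prefix = "bal_daily_posts/"                     # placeholder: live prefix that keeps receiving files
dst_prefix = "bal_daily_posts_run_20181114/"        # placeholder: run-specific frozen prefix

# Copy only the objects that exist right now; files arriving later are never
# copied, so the file listing Spark works from stays stable for the whole run.
for obj in bucket.objects.filter(Prefix=src_prefix):
    bucket.copy({"Bucket": bucket.name, "Key": obj.key},
                obj.key.replace(src_prefix, dst_prefix, 1))

df = spark.read.parquet("s3a://my-bucket/" + dst_prefix)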






answered Nov 15 at 6:14 by Roman Kesler
I'm not sure whether this will work, but give it a try:



Once you read your input files from S3, immediately persist that DataFrame, something like below:



import org.apache.spark.storage.StorageLevel

// keep the data in memory and spill to local disk if it does not fit
val inputDataFrame = sparkSession.read.json("s3a://bucket_name/file_path/")
  .persist(StorageLevel.MEMORY_AND_DISK)


Here, even if your DataFrame gets evicted from memory, it is still available on disk, so it will be reloaded from disk instead of being fetched from S3 again.
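
One caveat: persist is lazy, so the cache is only populated by the first action, and until that action completes reads still go to S3. A minimal PySpark sketch of the same idea, using the table and spark session from the question:

from pyspark import StorageLevel

df = spark.table('data_base.bal_daily_posts').persist(StorageLevel.MEMORY_AND_DISK)
df.count()   # force an action so the data is actually materialized in the cache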






answered Nov 15 at 6:05 by Prasad Khode
• There are over 100 billion rows in the table, so moving it and then processing isn't really an option.
– Thom Rogers
Nov 20 at 17:24










• By this approach you are not moving any data; you are asking Spark to persist the DataFrame you have created, so that even if it gets evicted from memory, the persisted copy is reused instead of reading from S3 again.
– Prasad Khode
Nov 21 at 5:12










