Spark failing because S3 files are updated. How to eliminate this error?
My Spark script is failing because the S3 bucket the DataFrame is drawn from gets updated with new files while the script is running. I don't care about the newly arriving files, but apparently Spark does.
I've tried adding the REFRESH TABLE command, as the error message suggests, but that doesn't help: it is impossible to know at execution time when the new files will arrive, so there is no obvious place to put the command. I have tried invoking REFRESH TABLE at four different points in the script, all with the same failure message:
Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I create the df with: df = spark.table('data_base.bal_daily_posts')
So what can I do to make sure that S3 files arriving at S3 post-script-kickoff are ignored and do not error out the script?
apache-spark amazon-s3 apache-spark-sql
asked Nov 14 at 21:55 by Thom Rogers
2 Answers
Move the files you're going to process to a different folder (key prefix) and point Spark at that folder only.
answered Nov 15 at 6:14 by Roman Kesler
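To make that idea concrete, here is a minimal sketch of the staging approach. The bucket name and prefixes below are placeholders, not details from the question: the job lists the source prefix once, copies the objects that exist at kickoff to a frozen prefix, and reads only from that prefix, so files arriving later are never seen.

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staged-read").getOrCreate()
s3 = boto3.client("s3")

bucket = "my-bucket"                      # placeholder bucket name
src_prefix = "bal_daily_posts/"           # prefix that keeps receiving new files
dst_prefix = "bal_daily_posts_snapshot/"  # frozen prefix this job will read

# Copy the objects that exist right now; files landing later are ignored
# because the job never lists the source prefix again.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.copy_object(
            Bucket=bucket,
            Key=dst_prefix + key[len(src_prefix):],
            CopySource={"Bucket": bucket, "Key": key},
        )

# Read only the frozen snapshot prefix, not the live table location.
df = spark.read.parquet(f"s3a://{bucket}/{dst_prefix}")

copy_object performs a server-side copy, so nothing is downloaded to the driver, but for a very large table even the copy step may be too slow or costly.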
I'm not sure whether this will work, but give it a try: once you read your input files from S3, immediately persist the resulting DataFrame, something like below:
import org.apache.spark.storage.StorageLevel
// persist so Spark keeps the data it has already read instead of going back to S3
val inputDataFrame = sparkSession.read.json("s3a://bucket_name/file_path/")
  .persist(StorageLevel.MEMORY_AND_DISK)
Here, even if the DataFrame gets evicted from memory, it is still available on disk, so Spark will load it from disk instead of fetching it from S3 again.
answered Nov 15 at 6:05 by Prasad Khode
There are over 100 billion rows in the table, so moving it and then processing isn't really an option. – Thom Rogers Nov 20 at 17:24
With this approach you are not moving any data; you are asking Spark to persist the DataFrame you have created, so that even if it is evicted from memory, the persisted copy is reused instead of reading from S3 again. – Prasad Khode Nov 21 at 5:12
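Since the question's script is PySpark rather than Scala, a rough equivalent of the persist approach discussed above, applied to the asker's table read, might look like the sketch below. Note that persist() is lazy, so an action is needed to materialize the cache up front; files that change during that initial read can still cause the error, so this narrows the window rather than removing it, which is an assumption worth testing against this workload.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persisted-read").getOrCreate()

# Read the table once and keep it in memory, spilling to local disk if needed.
df = spark.table("data_base.bal_daily_posts").persist(StorageLevel.MEMORY_AND_DISK)

# persist() is lazy: trigger an action so the data is materialized now,
# before new files land in the bucket. Later stages reuse the cached partitions.
df.count()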