Spark failing because S3 files are updated. How to eliminate this error?
My Spark script is failing because the S3 bucket the DataFrame is drawn from gets updated with new files while the script is running. I don't care about the newly arriving files, but apparently Spark does.
I've tried adding the REFRESH TABLE command, as the error message suggests, but that doesn't help: it is impossible to know at execution time when the new files will arrive, so there is no obvious place to put the command. I have tried invoking REFRESH TABLE at four different points in the script, all with the same failure message:
Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I create the df with: df = spark.table('data_base.bal_daily_posts')
So what can I do to make sure that S3 files arriving at S3 post-script-kickoff are ignored and do not error out the script?
apache-spark amazon-s3 apache-spark-sql
asked Nov 14 at 21:55 by Thom Rogers
2 Answers
Move the files you're going to process to a different folder (key prefix) and point Spark at that folder only.
answered Nov 15 at 6:14 by Roman Kesler
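To make that idea concrete, here is a minimal sketch of the staging approach. The bucket name and prefixes below are placeholders, not details from the question: the job lists the source prefix once, copies the objects that exist at kickoff to a frozen prefix, and reads only from that prefix, so files arriving later are never seen.

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staged-read").getOrCreate()
s3 = boto3.client("s3")

bucket = "my-bucket"                      # placeholder bucket name
src_prefix = "bal_daily_posts/"           # prefix that keeps receiving new files
dst_prefix = "bal_daily_posts_snapshot/"  # frozen prefix this job will read

# Copy the objects that exist right now; files landing later are ignored
# because the job never lists the source prefix again.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.copy_object(
            Bucket=bucket,
            Key=dst_prefix + key[len(src_prefix):],
            CopySource={"Bucket": bucket, "Key": key},
        )

# Read only the frozen snapshot prefix, not the live table location.
df = spark.read.parquet(f"s3a://{bucket}/{dst_prefix}")

copy_object performs a server-side copy, so nothing is downloaded to the driver, but for a very large table even the copy step may be too slow or costly.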
I'm not sure whether this will work, but give it a try: once you read your input files from S3, immediately persist the resulting DataFrame, something like below:
import org.apache.spark.storage.StorageLevel
// persist so Spark keeps the data it has already read instead of going back to S3
val inputDataFrame = sparkSession.read.json("s3a://bucket_name/file_path/")
  .persist(StorageLevel.MEMORY_AND_DISK)
Here, even if the DataFrame gets evicted from memory, it is still available on disk, so Spark will load it from disk instead of fetching it from S3 again.
answered Nov 15 at 6:05 by Prasad Khode
There are over 100 billion rows in the table, so moving it and then processing isn't really an option. – Thom Rogers Nov 20 at 17:24
With this approach you are not moving any data; you are asking Spark to persist the DataFrame you have created, so that even if it is evicted from memory, the persisted copy is reused instead of reading from S3 again. – Prasad Khode Nov 21 at 5:12
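Since the question's script is PySpark rather than Scala, a rough equivalent of the persist approach discussed above, applied to the asker's table read, might look like the sketch below. Note that persist() is lazy, so an action is needed to materialize the cache up front; files that change during that initial read can still cause the error, so this narrows the window rather than removing it, which is an assumption worth testing against this workload.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persisted-read").getOrCreate()

# Read the table once and keep it in memory, spilling to local disk if needed.
df = spark.table("data_base.bal_daily_posts").persist(StorageLevel.MEMORY_AND_DISK)

# persist() is lazy: trigger an action so the data is materialized now,
# before new files land in the bucket. Later stages reuse the cached partitions.
df.count()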