Usage of 'for loop' in R to split a dataframe into several dataframes
I have a problem with a for loop.
I have a dataframe with 120 unique IDs. I want to split the dataframe into 120 different dataframes based on the ID. I split it using the following code:
split_part0 <- split(PART0_DF, PART0_DF$sysid)
Now I want to do something like
for (i in 1:120) {
  sys[i] <- as.data.frame(split_part0[[i]])
}
This way I would have 120 dataframes with unique names that I can use for further analysis.
Is a for loop not possible in this particular case? If so, what other commands can I use?
Dummy data for PART0_DF:
Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11
I want the output to be like
>>sys1
Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
>>sys2
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11
r for-loop
If you provide a small dummy example of PART0_DF it would be easier to understand what it is you need.
– rookie
Nov 21 '18 at 11:47
edited Nov 21 '18 at 14:42
asked Nov 21 '18 at 11:04
Shruthi Patil
2 Answers
An easy way to do this is to build a vector of names by prepending the string sys to the id numbers, and to use it to split the data. There is no need for a for() loop to produce the desired output, since split() returns a list of data frames when its input is a data frame.
The grouping values are used to name each element in the list generated by split(). In the case of the OP, since sysid is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split().
Using the data from the OP we'll illustrate how to use the sysid column to create a grouping variable that combines the string sys with the id values (split() converts it to a factor internally), and use it to split the data into a list of data frames that can be accessed by name.
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData,header=TRUE)
sysidName <- paste0("sys",data$sysid)
splitData <- split(data,sysidName)
splitData
...and the output:
> splitData
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
>
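One thing to watch with 120 ids: names built with paste0() are plain character strings, so split() orders the list alphabetically and sys10 lands before sys2. The sketch below uses small made-up ids (toy, byChar and byFactor are hypothetical names, not from the original post) to show the effect and one possible way around it, namely passing a factor with explicit levels.

```r
# Hypothetical toy data chosen only to show the ordering issue
toy <- data.frame(sysid = c(1, 2, 10, 10, 2, 1),
                  power = c(100, 200, 300, 400, 500, 600))

# Character names are sorted alphabetically when split() builds its factor
byChar <- split(toy, paste0("sys", toy$sysid))
names(byChar)

# A factor with explicit levels keeps the numeric order of the ids
lev <- paste0("sys", sort(unique(toy$sysid)))
byFactor <- split(toy, factor(paste0("sys", toy$sysid), levels = lev))
names(byFactor)
```

This only affects the order of the list elements. Access by name, such as byChar$sys10, works the same either way.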
At this point one can access individual data frames in the list by using the $ form of the extract operator:
> splitData$sys1
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
>
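Since the goal is further analysis per system, it may be simpler to keep the list and apply a function to every element than to handle 120 objects one by one. A minimal sketch, assuming mean power per system is the kind of analysis wanted (meanPower and wanted are illustrative names):

```r
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData, header = TRUE)
splitData <- split(data, paste0("sys", data$sysid))

# Apply the same analysis to every data frame; sapply() returns a named vector
meanPower <- sapply(splitData, function(d) mean(d$power))
meanPower

# [[ ]] with a string does the same job as $ when the name is held in a variable
wanted <- "sys2"
splitData[[wanted]]
```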
Also, by using the names() function one can obtain a vector of the names of all elements in the list of data frames.
> names(splitData)
[1] "sys1" "sys2"
>
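If separate objects called sys1, sys2, ... really are needed, base R's list2env() can copy each named list element into an environment. The sketch below is one possible approach, not part of the original answer. It uses a scratch environment (env, a hypothetical name) so nothing in the workspace is overwritten.

```r
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData, header = TRUE)
splitData <- split(data, paste0("sys", data$sysid))

# Copy every list element into an environment under its list name
env <- new.env()
list2env(splitData, envir = env)

ls(env)    # names of the objects that were created
env$sys1   # the data frame for sysid 1
```

Passing envir = globalenv() instead would create sys1 and sys2 directly in the workspace, at the cost of silently replacing any existing objects with those names.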
Reiterating the main point from the top of the answer, when split() is used with a data frame, the result is a list whose elements are objects of class data.frame. For example:
> str(splitData["sys1"])
List of 1
$ sys1:'data.frame': 3 obs. of 4 variables:
..$ Date : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3
..$ sysid : int [1:3] 1 1 1
..$ power : int [1:3] 1000 1200 800
..$ temperature: int [1:3] 14 16 18
>
If you must use a for() loop...
Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."
# create a vector containing unique values of sysid
ids <- unique(data$sysid)
# initialize output data frame list
dfList <- list()
# loop thru unique values and generate named data frames in list()
for(i in ids){
dfname <- paste0("sys",i)
dfList[[dfname]] <- data[data$sysid == i,]
}
dfList
...and the output:
> for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ }
> dfList
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
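For comparison, here is a sketch of how the loop from the question could be repaired while keeping its shape. As far as I can tell, that loop fails for two reasons: sys is never created before the loop, and single-bracket assignment sys[i] <- cannot store a whole data frame in one slot. The version below assumes the data frame is called data, as in this answer, rather than PART0_DF.

```r
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData, header = TRUE)
split_part0 <- split(data, data$sysid)

# Create the container first, with one empty slot per group
sys <- vector("list", length(split_part0))
names(sys) <- paste0("sys", names(split_part0))

for (i in seq_along(split_part0)) {
  # [[ ]] stores the whole data frame in slot i;
  # as.data.frame() is not needed because split() already returns data frames
  sys[[i]] <- split_part0[[i]]
}
sys$sys2
```

The loop only copies split_part0 into another list, which is why split() alone is enough.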
Choosing the "best" answer
Between split(), for() and the other answer using by(), how do we choose the best answer?
One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.
We can use the microbenchmark package to compare the performance of the three different approaches.
split() performance
library(microbenchmark)
> microbenchmark(splitData <- split(data,sysidName),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507 100
>
for() performance
> microbenchmark(for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ },unit="us")
Unit: microseconds
expr min lq mean
for (i in ids) { dfname <- paste0("sys", i) dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642
median uq max neval
3099.064 3479.311 8511.609 100
>
by() performance
> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372 100
>
...and the winner is: split(), with a mean runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.
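The timings above come from the six-row sample. To check how the approaches behave at something closer to the real size, one could time them on synthetic data. The sketch below assumes 120 ids with 500 rows each (big, nIds and rowsPerId are made-up names and sizes) and uses base R's system.time() instead of microbenchmark so that no extra package is needed. No timings are quoted because they depend on the machine.

```r
set.seed(1)
nIds <- 120        # assumed number of systems, as in the question
rowsPerId <- 500   # assumed rows per system, purely illustrative
big <- data.frame(
  sysid = rep(seq_len(nIds), each = rowsPerId),
  power = round(runif(nIds * rowsPerId, 500, 1500))
)

timeSplit <- system.time(
  bySplit <- split(big, paste0("sys", big$sysid))
)

timeLoop <- system.time({
  byLoop <- list()
  for (i in unique(big$sysid)) {
    byLoop[[paste0("sys", i)]] <- big[big$sysid == i, ]
  }
})

# Both lists hold the same data frames; only the element order differs
all.equal(bySplit[names(byLoop)], byLoop)

# Elapsed seconds for each approach on this machine
c(split = timeSplit[["elapsed"]], loop = timeLoop[["elapsed"]])
```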
I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.
– Shruthi Patil
Nov 23 '18 at 9:58
Another option is using the function by():
df <- data.frame(
Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
sysid = c(1, 1, 1, 2, 2, 2),
power = c(1000, 1200, 800, 1500, 800, 1300)
)
df
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
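Now split df into as many dataframes as you have distinct ('unique') sysid values, using by() with a function that returns each group unchanged (its argument is merely named unique; base R's unique() is not called):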
df_list <- by(df, df$sysid, function(unique) unique)
df_list
df$sysid: 1
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
----------------------------------------------------------------------------------------------
df$sysid: 2
Date sysid power
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
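The object returned by by() is not a plain list: it has class "by", which explains the printed layout above. A short sketch of how its elements might be reached and how it relates to split(), using x as the argument name to avoid confusion with unique() (plain is a hypothetical name):

```r
df <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300)
)

# Same call as above; the function simply returns each group unchanged
df_list <- by(df, df$sysid, function(x) x)

class(df_list)   # "by"
names(df_list)   # group labels taken from sysid
df_list[[1]]     # data frame for the first group

# split() produces the same groups as an ordinary named list
plain <- split(df, df$sysid)
plain[["1"]]
```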
I tried this too. But, the solution provided by @LenGreski is more suited for my requirement. Thanks for your help
– Shruthi Patil
Nov 23 '18 at 10:00
On Stack Overflow it is customary to click the upward arrow if a given answer is useful.
– Chris Ruehlemann
Nov 23 '18 at 14:49
I have already. It says I don't have enough reputation for it to get displayed. However, it is recorded.
– Shruthi Patil
Nov 24 '18 at 16:17
edited Nov 22 '18 at 6:39
answered Nov 21 '18 at 14:56
Len Greski (author of the split() answer above)
I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.
– Shruthi Patil
Nov 23 '18 at 9:58
add a comment |
I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.
– Shruthi Patil
Nov 23 '18 at 9:58
I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.
– Shruthi Patil
Nov 23 '18 at 9:58
I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.
– Shruthi Patil
Nov 23 '18 at 9:58
add a comment |
Another option is using the function by()
:
df <- data.frame(
Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
sysid = c(1, 1, 1, 2, 2, 2),
power = c(1000, 1200, 800, 1500, 800, 1300)
)
df
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
Now split df
in as many dataframes as you have distinct ('unique') sysid
values using by()
and calling unique
:
df_list <- by(df, df$sysid, function(unique) unique)
df_list
df$sysid: 1
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
----------------------------------------------------------------------------------------------
df$sysid: 2
Date sysid power
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
I tried this too. But, the solution provided by @LenGreski is more suited for my requirement. Thanks for your help
– Shruthi Patil
Nov 23 '18 at 10:00
On Stack Overflow it is customary to click the upward arrow if a given answer is useful.
– Chris Ruehlemann
Nov 23 '18 at 14:49
I have already. It says I don't have enough reputation for it to get displayed. However, it is recorded.
– Shruthi Patil
Nov 24 '18 at 16:17
answered Nov 21 '18 at 15:52
Chris Ruehlemann