How to remove extra | (pipe) separator from rows when loading | (pipe)-separated text into R
I am reading text from a file in which the text is separated by | (pipes).
The text table looks like this (tweet id|date and time|tweet):
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
I am reading this information using the following code:
nyt <- read.table(file=".../nytimeshealth.txt",
sep="|",
header = F,
quote="",
fill=T,
stringsAsFactors = F,
numerals ="no.loss",
encoding = "UTF-8",
na.strings = "NA")
Now, while most of the rows in the original file have 3 columns, each separated by a '|', a few of the rows have an additional '|' separator. That is to say, they have four columns, because some of the tweets themselves contain a | symbol.
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
I know that using the fill=T option in the read.table function above allows me to read rows of unequal length (blank fields are implicitly added in the empty cells).
So, the row above becomes
71 545074589374881792 Wed Dec 17 04:34:43 +0000 2014 National Briefing
72 New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
However, now column 3 of row 71 has incomplete information, and columns 2 and 3 of row 72 are empty while column 1 does not contain the tweet ID but a part of the tweet. Is there any way I can avoid this? I would like to remove the extra | separator wherever it appears, so that I do not lose any information.
Is this possible while reading the text file into R? Or is it something I will have to take care of before I start loading the text? What would be my best course of action?
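Before deciding how to repair the file, it can help to find out which lines actually carry an extra separator. The following is a minimal sketch using base R's count.fields(); the two sample lines are copied from the question and written to a temporary file purely for illustration, and the variable names (sample_lines, n_fields, bad_lines) are made up for this example.

```r
# Sample data from the question, written to a temporary file for illustration.
sample_lines <- c(
  "545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# count.fields() reports how many sep-delimited fields each line holds.
# quote = "" and comment.char = "" keep quotes and '#' from being interpreted.
n_fields <- count.fields(tmp, sep = "|", quote = "", comment.char = "")

# Line numbers that hold more than the expected 3 fields.
bad_lines <- which(n_fields > 3)
print(bad_lines)
```

On a real file, the same call with the actual path would list every line that needs attention, however many extra | characters it contains.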
@hrbrmstr: point noted regarding the use of T and F for TRUE and FALSE. Please also see other comment about your suggested method.
– Anonymouse
Nov 15 at 2:56
asked Nov 15 at 1:48
Anonymouse
2 Answers
accepted
I created a text file called text.txt with the 3 lines you provide as example of your data (the 2 easy lines without any | in the tweet as well as the one which has a | inside the tweet).
Here is the content of this file:
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
Code
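library(tidyverse)

# Read the raw lines, then split each one on the first two pipes only,
# so any later | stays inside the tweet text (third column).
# Note: recent tibble versions may warn that the matrix has no column names.
readLines("text.txt", encoding = "UTF-8") %>%
  map(., str_split_fixed, "\\|", 3) %>%
  map_df(., as_tibble)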
Result
# A tibble: 3 x 3
V1 V2
<chr> <chr>
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
<chr>
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Say…
3 National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to …
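For readers who prefer not to load the tidyverse, the same idea (split on the first two pipes only and leave the rest of the line intact) can be sketched in base R with a regular expression. This is an illustrative alternative, not the answer's own method; the column names id, created and text are hypothetical.

```r
# Two sample lines from the question; the second has a | inside the tweet.
lines <- c(
  "545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx"
)

# Group 1: everything up to the first |, group 2: up to the second |,
# group 3: the remainder of the line, including any further | characters.
pattern <- "^([^|]*)\\|([^|]*)\\|(.*)$"
matches <- regmatches(lines, regexec(pattern, lines))

# Each match holds the full string followed by the three groups; drop the first.
# Lines with fewer than two pipes would not match and would need separate handling.
fields <- do.call(rbind, lapply(matches, function(m) m[-1]))

tweets <- data.frame(
  id      = fields[, 1],   # hypothetical column names
  created = fields[, 2],
  text    = fields[, 3],
  stringsAsFactors = FALSE
)
print(tweets$text[2])
```

Keeping the ID as a character string avoids any loss of precision in the 18-digit tweet IDs.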
but why not just join any extra columns ex post facto?
– hrbrmstr
Nov 15 at 2:14
Sorry, I am not sure I understand your question
– prosoitos
Nov 15 at 2:15
Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)
– prosoitos
Nov 15 at 2:21
This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures
– prosoitos
Nov 15 at 2:22
@hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!
– Anonymouse
Nov 15 at 2:52
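The "split at every |, then join the extra pieces" idea raised in these comments could look roughly like the sketch below. It works on raw lines rather than on an already-read data frame, and it uses made-up sample lines and a hypothetical helper called rejoin; treat it as one possible reading of the suggestion, not as code from the thread.

```r
# Hypothetical sample lines: none, one, and two extra | inside the tweet.
lines <- c(
  "1001|Wed Dec 17 16:25:40 +0000 2014|plain tweet",
  "1002|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England",
  "1003|Wed Dec 17 03:00:00 +0000 2014|a | b | c"
)

# Split at every literal pipe (fixed = TRUE avoids regex escaping).
pieces <- strsplit(lines, "|", fixed = TRUE)

# Keep the first two pieces and glue everything after them back together.
# Caveat: strsplit() drops a trailing empty piece, so a | at the very end
# of a line would not be restored.
rejoin <- function(p) c(p[1], p[2], paste(p[-(1:2)], collapse = "|"))

fields <- do.call(rbind, lapply(pieces, rejoin))
print(fields[, 3])
```

Because the pieces are rejoined with the separator itself, the tweet text comes back unchanged no matter how many pipes it contained.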
Here is another solution, to get back to your comment and to use your initial code. But this solution will only work if you have one | per tweet (you can have tweets with none as long as at least one tweet has one |). If you don't have any | in your tweets, or if some tweets have more than one |, it will break and you will have to edit it. So the other answer, which will work regardless of the structure of your tweets, is better IMO.
I am still using my text.txt file:
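library(dplyr)  # provides %>%, mutate() and select()

df <- read.table(file = "text.txt",
                 sep = "|",
                 header = FALSE,
                 quote = "",
                 fill = TRUE,
                 stringsAsFactors = FALSE,
                 numerals = "no.loss",
                 encoding = "UTF-8",
                 na.strings = "NA")

# Glue the overflow column V4 back onto V3, then drop V4.
# The | itself is not restored by paste0().
df %>%
  mutate(V3 = paste0(V3, V4)) %>%
  select(-V4)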
Result
V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
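One reason this approach is fragile on a larger file: read.table() decides how many columns to create from the first five lines of input, unless col.names is given and is longer. If the first line with an extra | appears later (row 71 in the question), no V4 column is created and the line wraps onto a new row instead. The sketch below is one way around that under stated assumptions: the six sample lines are hypothetical, and it sizes col.names from count.fields() and rejoins the overflow columns with the separator.

```r
# Hypothetical six-line sample: only the last line has an extra |.
sample_lines <- c(
  "1001|Wed Dec 17 16:25:40 +0000 2014|first tweet",
  "1002|Wed Dec 17 15:13:44 +0000 2014|second tweet",
  "1003|Wed Dec 17 14:02:10 +0000 2014|third tweet",
  "1004|Wed Dec 17 13:45:31 +0000 2014|fourth tweet",
  "1005|Wed Dec 17 12:30:05 +0000 2014|fifth tweet",
  "1006|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Widest line in the whole file, not just in the first five lines.
n_max <- max(count.fields(tmp, sep = "|", quote = "", comment.char = ""))

# col.names forces read.table() to create n_max columns up front;
# colClasses = "character" keeps the long IDs intact as strings.
df <- read.table(tmp, sep = "|", header = FALSE, quote = "", fill = TRUE,
                 comment.char = "", colClasses = "character",
                 col.names = paste0("V", seq_len(n_max)))

# Rejoin V3..Vn with the separator, skipping the empty padding added by fill.
extra <- as.matrix(df[, 3:n_max, drop = FALSE])
df$V3 <- apply(extra, 1, function(r) paste(r[r != ""], collapse = "|"))
df <- df[, 1:3]
print(df$V3)
```

A limitation of this sketch is that genuinely empty pieces between two adjacent pipes are dropped along with the padding, so the readLines() approach remains the safer choice when the text must be preserved exactly.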
Thanks for your reply, @prosoitos. I am running into another problem: when using the read.table function, a row like this
585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods – from leeks to beets: http://xxxxxxxx
appears as
585625669805219840 Wed Apr 08 02:10:14 +0000 2015 12 spring
All the text after the hashtag #superfoods is not read into the data frame. This is true for other rows containing hashtags as well. Cannot understand why this would be the case.
– Anonymouse
Nov 15 at 3:29
That's because # is a comment character in R, and read.table() treats it as one by default. But if you use my first answer, you won't have this problem
– prosoitos
Nov 15 at 3:30
Thanks for clarifying that...seems obvious now that you have mentioned it.
– Anonymouse
Nov 15 at 3:31
read.table() interprets your data. So anything after # is considered a comment and is thus omitted. readLines(), however, reads lines of strings as is, without any interpretation. So I really suggest that you use my first answer and forget about read.table() in your case. There are countless things that could go wrong with read.table() if you have funky characters in your tweets, while the readLines() answer is pretty bomb proof
– prosoitos
Nov 15 at 3:31
Was just going to write that... I did check ?readLines and read that. Thanks a lot for your time!
– Anonymouse
Nov 15 at 3:45
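The hashtag truncation discussed in these comments comes from read.table()'s comment.char argument, which defaults to "#". For anyone who still wants to use read.table(), the small sketch below shows the effect of switching it off; the input line is a simplified, hypothetical stand-in for the tweet quoted above.

```r
# Simplified, hypothetical line containing a hashtag.
line <- "1001|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods from leeks to beets"

# Default behaviour: everything from '#' onwards is treated as a comment.
default_read <- read.table(text = line, sep = "|", quote = "",
                           colClasses = "character")

# comment.char = "" disables comment handling, so the hashtag survives.
full_read <- read.table(text = line, sep = "|", quote = "",
                        comment.char = "", colClasses = "character")

print(default_read$V3)
print(full_read$V3)
```

The first result stops before the hashtag, while the second keeps the whole tweet. This only addresses the # issue; embedded | characters still need one of the approaches above.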
First answer (accepted): answered Nov 15 at 2:12, edited Nov 15 at 3:44, by prosoitos
Second answer: answered Nov 15 at 3:09, edited Nov 15 at 5:43, by prosoitos
Thanks for your reply, @prosoitos. I am running into another problem: when using the read.table function, a row like this
585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods – from leeks to beets: http://xxxxxxxx
appears as
585625669805219840 Wed Apr 08 02:10:14 +0000 2015 12 spring
All the text after the hashtag #superfoods is not read into the data frame. This is true for other rows containing hashtags as well. Cannot understand why this would be the case.
– Anonymouse
Nov 15 at 3:29
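That's because # is the comment character in R, and read.table() treats everything after it as a comment by default (comment.char = "#"). But if you use my first answer, you won't have this problem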
– prosoitos
Nov 15 at 3:30
Thanks for clarifying that...seems obvious now that you have mentioned it.
– Anonymouse
Nov 15 at 3:31
read.table() interprets your data, so anything after # is considered a comment and is thus omitted. readLines(), however, reads lines of strings as is, without any interpretation. So I really suggest that you use my first answer and forget about read.table() in your case. There are countless things that could go wrong with read.table() if you have funky characters in your tweets, while the readLines() answer is pretty bombproof
– prosoitos
Nov 15 at 3:31
Was just going to write that... I did check ?readLines and read that. Thanks a lot for your time!
– Anonymouse
Nov 15 at 3:45
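To make the hashtag behaviour discussed in these comments concrete, here is a small sketch. It is not the code from the first answer: the temporary file, the variable names, the plain hyphen in the sample tweet and the use of strcapture() (available in base R since 3.4.0) are assumptions made for illustration. It reads the same two rows three ways: read.table() with its defaults, read.table() with comment.char = "", and readLines() followed by a split on the first two pipes only.

```r
rows <- c(
  "585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods - from leeks to beets: http://xxxxxxxx",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx"
)
path <- tempfile(fileext = ".txt")   # stand-in for the real file
writeLines(rows, path)

read_args <- list(file = path, sep = "|", header = FALSE, quote = "",
                  fill = TRUE, stringsAsFactors = FALSE,
                  colClasses = "character")

# 1. Defaults: comment.char = "#", so the first tweet is cut at the hashtag.
default_read <- do.call(read.table, read_args)

# 2. Comment handling switched off: the hashtag and the rest of the tweet survive.
no_comment_read <- do.call(read.table, c(read_args, list(comment.char = "")))

# 3. readLines() does no interpretation at all; the regular expression then
#    splits on the first two pipes and leaves the rest of the line as the tweet.
raw <- readLines(path, encoding = "UTF-8")
parsed <- strcapture(
  "^([^|]*)\\|([^|]*)\\|(.*)$", raw,
  proto = data.frame(id = character(), date = character(),
                     tweet = character(), stringsAsFactors = FALSE)
)
# Remove any pipes that were part of the tweet text itself.
parsed$tweet <- gsub("|", "", parsed$tweet, fixed = TRUE)

default_read$V3[1]      # truncated before "#superfoods"
no_comment_read$V3[1]   # full tweet text
parsed$tweet            # full tweets, extra pipes removed
```

Setting comment.char = "" fixes the hashtag problem but still leaves the extra | to deal with, which is why the readLines() route is the more robust of the three.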
@hrbrmstr: point noted regarding the use of T and F for TRUE and FALSE. Please also see other comment about your suggested method.
– Anonymouse
Nov 15 at 2:56