Scraping data from PDF using R
I'd like to extract data (ski jumping) from this PDF: http://medias4.fis-ski.com/pdf/2019/JP/3088/2019JP3088RL.pdf
I'm interested in all of the data except bib, club and date of birth.
I tried the pdftools library:
    pdf_text("raw/data.pdf") %>% strsplit(split = "\n")
and I got stuck here. The problem is that the points (gate compensation) column is sometimes empty and sometimes not, and I don't know how to handle that.
My desired output is something like this:
Rank|Athlete |Nation|(...)|Jump_1|Round_1|Jump_2|Round_2|Tot_points
1 |KLIMOV Evgeniy|RUS |(...)|127.5 |130 |131.5 |133.4 |263.4
Can anyone help me?
r pdf web-scraping screen-scraping
asked Nov 19 '18 at 23:14
amikoma
1 Answer
Check this out:
    library(tidyverse)

    # Read the PDF; pdf_text() returns one string per page
    text <- pdftools::pdf_text("http://medias4.fis-ski.com/pdf/2019/JP/3088/2019JP3088RL.pdf")

    # Drop each page's header, split into one record per athlete, then into fields
    pages <- str_remove_all(text, "\\X+?TOTAL\\s+RANK\\n") %>%
      str_trim() %>%
      str_split("\\n\\s{10,}(?=\\p{L})") %>%
      modify_depth(1, ~ str_split(.x, "\\s{2,}") %>%
        map(~ .x[1:13] %>%
          set_names(paste0("x", 1:13))))

    ## Just the first page
    df <- bind_rows(!!!pages[[1]])
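To see what each regular expression in the pipeline does, here is a small sketch on made-up text rather than the real PDF. The `page` string and its contents are assumptions for illustration only; the real `pdf_text()` output has many more columns and lines per athlete.

```r
library(stringr)

# Made-up page text: a header ending in "TOTAL  RANK", then two records
# that each start on a line indented by ten spaces.
page <- paste0(
  "Results header\n",
  "RANK  NAME  NSA  TOTAL  RANK\n",
  "          KLIMOV Evgeniy  RUS  127.5  263.4\n",
  "          KOBAYASHI Ryoyu  JPN  126.0  260.1\n"
)

# 1. Remove everything up to and including the "TOTAL RANK" header line.
body <- str_trim(str_remove_all(page, "\\X+?TOTAL\\s+RANK\\n"))

# 2. Start a new record at each newline followed by 10+ spaces and a letter.
records <- str_split(body, "\\n\\s{10,}(?=\\p{L})")[[1]]

# 3. Split each record into fields on runs of two or more whitespace characters.
fields <- str_split(records, "\\s{2,}")

print(fields)
```

Splitting on two or more spaces keeps a name such as "KLIMOV Evgeniy" together, since it contains only a single space.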
It's not a definitive solution, but it's some progress.
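One possible way to deal with the sometimes-empty gate compensation column is to cut each line at fixed character positions instead of splitting on whitespace, so a blank cell becomes `NA` rather than shifting the later fields left. The sketch below uses made-up fixed-width lines, not the real PDF; `make_line()`, the column names and the widths are hypothetical. It assumes the values line up under the header, which real `pdf_text()` output may not do exactly.

```r
# Hypothetical helper that builds one fixed-width line (widths are made up).
make_line <- function(...) sprintf("%-6s%-17s%-8s%-7s%-9s%s", ...)

lines <- c(
  make_line("Rank", "Name", "Nation", "Dist", "GatePts", "Points"),
  make_line("1", "KLIMOV Evgeniy", "RUS", "127.5", "-3.4", "130.0"),
  make_line("2", "KOBAYASHI Ryoyu", "JPN", "126.0", "", "128.2")
)

# Splitting on whitespace loses the empty cell: only 5 fields come back.
naive <- strsplit(lines[3], "\\s{2,}", perl = TRUE)[[1]]

# Take column boundaries from where each header word starts.
header    <- lines[1]
hits      <- gregexpr("\\S+", header)
col_names <- regmatches(header, hits)[[1]]
starts    <- as.integer(hits[[1]])
ends      <- c(starts[-1] - 1L, max(nchar(lines)))

# Cut a line at the header positions; blank cells become NA.
parse_line <- function(line) {
  cells <- trimws(substring(line, starts, ends))
  cells[cells == ""] <- NA_character_
  setNames(cells, col_names)
}

df <- as.data.frame(do.call(rbind, lapply(lines[-1], parse_line)),
                    stringsAsFactors = FALSE)
print(df)
```

If the columns in the real file do not line up with the header text, the word coordinates returned by `pdftools::pdf_data()` (available in recent pdftools versions) might be a more reliable basis for assigning words to columns.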
edited Nov 20 '18 at 0:54
answered Nov 20 '18 at 0:40
José