Scrapy - How to scrape a weblink within a weblink using python?
I'm trying to extract the URL
https://webmd.com/oral-health/oral-lichen-planus#1 from the following markup on a WebMD page:
<li class="global-nav-sign-in global-nav-hide-mobile" data-metrics-module="">
<a href="https://member.webmd.com/signin?appid=1&returl=https://www.webmd.com/oral-health/oral-lichen-planus#1" data-metrics-link="reg-login">Sign In</a>
</li>
I use the following Scrapy code to achieve this:
import scrapy
import re
import string
import pandas as pd


class HealthItem(scrapy.Item):
    link = scrapy.Field()


def urls_getter():
    fname = "/home/phil/fd/webmd/health.csv"
    pds = pd.read_csv(fname)
    pds_link = pds['link']
    pds_link = pds_link.drop_duplicates(keep="first", inplace=False)
    pds_link = pds_link.tolist()
    return pds_link


class SymptommdSpider(scrapy.Spider):
    name = "symptommd"
    allowed_domains = ["webmd.com"]
    start_urls = urls_getter()

    def parse(self, response):
        titles = response.xpath('//li[contains(@class, "global-nav-sign-in")]/a[contains(@href, "https:")]')
        for title in titles:
            item = HealthItem()
            item['link'] = title.xpath('@href').extract()
            yield item
However, this code returns only the first part of the href, namely https://member.webmd.com/signin. How do I get only the second URL, the one that follows returl=?
python html web-scraping scrapy
asked Nov 21 '18 at 19:23
Phillip1982
This is exactly how @href appears in the page source. What is your desired output?
– Andersson
Nov 21 '18 at 19:34
The outcome is member.webmd.com/signin but I need webmd.com/oral-health/oral-lichen-planus#1. Here is the sample url: webmd.com/oral-health/oral-lichen-planus#1
– Phillip1982
Nov 21 '18 at 19:36
Looks like this part of URL is added with JavaScript.
– stasdeep
Nov 22 '18 at 7:09
1 Answer
As mentioned in the comments, the returl part of the URL is added with JavaScript. The raw HTML that Scrapy downloads contains only the base sign-in URL, which is why your spider returns https://member.webmd.com/signin.
Does it really matter anyway? The URL https://member.webmd.com/signin points to a valid page.
If it does matter, you need some extra logic to extract the information from the JavaScript, or you can hardcode the full URL in your code.
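A minimal sketch of that extra logic, using only the standard library. The helper name extract_returl and its page_url fallback are hypothetical, not part of Scrapy or of the original spider. The fallback assumes that returl simply mirrors the address of the page being scraped, as it does in the sample URL from the question. When the href already carries returl (as in the rendered markup), the helper reads it from the query string; when it does not (as in the raw HTML), it returns the fallback.

```python
from urllib.parse import parse_qs, urlsplit


def extract_returl(href, page_url=None):
    """Return the URL carried in the returl query parameter of href.

    Hypothetical helper: returns page_url when href has no returl
    parameter (assumption: returl mirrors the scraped page's own URL).
    """
    parts = urlsplit(href)
    values = parse_qs(parts.query).get("returl")
    if not values:
        return page_url
    target = values[0]
    # urlsplit treats "#1" as the fragment of the outer sign-in URL,
    # so re-attach it to the inner URL.
    if parts.fragment:
        target = f"{target}#{parts.fragment}"
    return target


# href as shown in the question (after JavaScript has run)
rendered = "https://member.webmd.com/signin?appid=1&returl=https://www.webmd.com/oral-health/oral-lichen-planus#1"
# href as the spider actually receives it
raw = "https://member.webmd.com/signin"

print(extract_returl(rendered))
print(extract_returl(raw, page_url="https://www.webmd.com/oral-health/oral-lichen-planus#1"))
```

Both calls print https://www.webmd.com/oral-health/oral-lichen-planus#1. Inside parse, this could be used as item['link'] = extract_returl(title.xpath('@href').extract_first(), response.url), again under the assumption that the page's own URL is the value you want.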
answered Nov 22 '18 at 15:41
Guillaume