Scrapy - How to scrape a weblink within a weblink using python?

I'm trying to scrape the link

https://webmd.com/oral-health/oral-lichen-planus#1

from the WebMD website. It appears inside the href of the following element in the page code:

<li class="global-nav-sign-in global-nav-hide-mobile" data-metrics-module="">
    <a href="https://member.webmd.com/signin?appid=1&amp;returl=https://www.webmd.com/oral-health/oral-lichen-planus#1" data-metrics-link="reg-login">Sign In</a>
</li>

I use the following Scrapy code to achieve this:

import scrapy
import re
import string
import pandas as pd


class HealthItem(scrapy.Item):
    link = scrapy.Field()


def urls_getter():
    # Read the start URLs from the 'link' column of a CSV file, dropping duplicates.
    fname = "/home/phil/fd/webmd/health.csv"
    pds = pd.read_csv(fname)
    pds_link = pds['link']
    pds_link = pds_link.drop_duplicates(keep="first", inplace=False)
    pds_link = pds_link.tolist()
    return pds_link


class SymptommdSpider(scrapy.Spider):
    name = "symptommd"
    allowed_domains = ["webmd.com"]
    start_urls = urls_getter()

    def parse(self, response):
        # Select the "Sign In" anchor inside the global navigation list item.
        titles = response.xpath('//li[contains(@class, "global-nav-sign-in")]/a[contains(@href, "https:")]')
        for title in titles:
            item = HealthItem()
            # extract() returns a list of all matching href values.
            item['link'] = title.xpath('@href').extract()
            yield item

However, this code returns only the first part of the href, namely https://member.webmd.com/signin. How do I get only the second link, the one that follows returl=?

asked Nov 21 '18 at 19:23 by Phillip1982

This is exactly how @href appears in page source. What is your desired output?

– Andersson
Nov 21 '18 at 19:34

The output is member.webmd.com/signin, but I need webmd.com/oral-health/oral-lichen-planus#1. Here is the sample URL: webmd.com/oral-health/oral-lichen-planus#1

– Phillip1982
Nov 21 '18 at 19:36

Looks like this part of the URL is added with JavaScript.

– stasdeep
Nov 22 '18 at 7:09

Tags: python, html, web-scraping, scrapy


1 Answer

As mentioned in the comments, the second part of the URL is added with JavaScript. Scrapy does not execute JavaScript, so it only sees the raw HTML, which looks like this:

[screenshot: Raw HTML]

Does it really matter anyway? The URL https://member.webmd.com/signin points to a valid page.

If it does matter, then you need some extra logic to extract the info from the JavaScript, or you can hardcode the full URL in your code.
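
One possible reading of "extra logic" is sketched below. It is only a sketch and rests on two assumptions taken from the question's snippet rather than from WebMD itself: that the wanted link always travels in a returl query parameter, and that returl is simply the address of the page being scraped, so the page's own URL is an acceptable substitute when the raw HTML lacks the parameter. The helper name extract_return_url and its arguments are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def extract_return_url(href, fallback=None, param="returl"):
    """Return the URL carried in the `param` query argument of `href`.

    Returns `fallback` when `href` has no such argument, which is what
    happens if the argument is only added later by JavaScript.
    """
    parts = urlsplit(href)
    values = parse_qs(parts.query).get(param)
    if not values:
        return fallback
    target = values[0]
    # An unencoded "#1" is parsed as the fragment of the outer URL.
    # Assumption: it really belongs to the inner URL, so re-attach it.
    if parts.fragment and "#" not in target:
        target = f"{target}#{parts.fragment}"
    return target


# href as shown in the question (after HTML entity decoding of &amp;)
full = ("https://member.webmd.com/signin?appid=1"
        "&returl=https://www.webmd.com/oral-health/oral-lichen-planus#1")
# href without the JavaScript-added part
bare = "https://member.webmd.com/signin"
# hypothetical URL of the page being scraped
page = "https://www.webmd.com/oral-health/oral-lichen-planus"

print(extract_return_url(full, fallback=page))
print(extract_return_url(bare, fallback=page))
```

Inside the spider's parse method this could be used as item['link'] = extract_return_url(title.xpath('@href').get(), fallback=response.url), so the item still gets a usable link when the href in the downloaded HTML is just the sign-in address.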

answered Nov 22 '18 at 15:41 by Guillaume
