Scrapy - How to scrape a weblink within a weblink using python?












I'm trying to scrape https://webmd.com/oral-health/oral-lichen-planus#1 from the WebMD website, out of the following page markup:



<li class="global-nav-sign-in global-nav-hide-mobile" data-metrics-module="">
<a href="https://member.webmd.com/signin?appid=1&amp;returl=https://www.webmd.com/oral-health/oral-lichen-planus#1" data-metrics-link="reg-login">Sign In</a>
</li>


I use the following Scrapy code to do this:



import scrapy
import re
import string
import pandas as pd

class HealthItem(scrapy.Item):
    link = scrapy.Field()


def urls_getter():
    fname = "/home/phil/fd/webmd/health.csv"
    pds = pd.read_csv(fname)
    pds_link = pds['link']
    pds_link = pds_link.drop_duplicates(keep = "first", inplace = False)
    pds_link = pds_link.tolist()
    return pds_link


class SymptommdSpider(scrapy.Spider):
    name = "symptommd"
    allowed_domains = ["webmd.com"]
    start_urls = urls_getter()
    def parse(self, response):
        titles = response.xpath('//li[contains(@class, "global-nav-sign-in")]/a[contains(@href, "https:")]')
        for title in titles:
            item = HealthItem()
            item['link'] = title.xpath('@href').extract()
            yield item


However, this code returns only the first part of the href, namely https://member.webmd.com/signin. How do I get only the second URL, the one that follows returl=?










  • This is exactly how @href appears in page source. What is your desired output?

    – Andersson
    Nov 21 '18 at 19:34











  • The outcome is member.webmd.com/signin but I need webmd.com/oral-health/oral-lichen-planus#1. Here is the sample url: webmd.com/oral-health/oral-lichen-planus#1

    – Phillip1982
    Nov 21 '18 at 19:36













  • Looks like this part of URL is added with JavaScript.

    – stasdeep
    Nov 22 '18 at 7:09
















python html web-scraping scrapy






asked Nov 21 '18 at 19:23









Phillip1982













1 Answer
As mentioned in the comments, the URL is built with JavaScript. If you look at the raw HTML, it looks like this:

Raw HTML

Does it really matter anyway? The URL https://member.webmd.com/signin points to a valid page.

If it does matter, you need some extra logic to extract the information from the JavaScript, or you can hardcode the full URL in your code.
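One possible shape for that "extra logic", as a minimal sketch rather than a tested fix for WebMD: when the href does carry a returl query parameter (as in the markup quoted in the question), the nested URL can be pulled out with the standard library's urllib.parse. The helper name extract_returl and its fallback argument are hypothetical, introduced only for illustration. The sketch also assumes that an unescaped "#1" at the end of the href belongs to the nested URL, which is how the question's markup reads.

```python
from urllib.parse import urlsplit, parse_qs


def extract_returl(href, fallback=None):
    """Return the URL nested in the 'returl' query parameter of href.

    Hypothetical helper: returns `fallback` when href has no returl
    parameter (for example, when JavaScript adds it after page load).
    """
    parts = urlsplit(href)
    values = parse_qs(parts.query).get("returl")
    if not values:
        return fallback
    nested = values[0]
    # Assumption: an unescaped '#...' is parsed as the fragment of the
    # outer sign-in URL, but it was meant for the nested page URL.
    if parts.fragment and "#" not in nested:
        nested = f"{nested}#{parts.fragment}"
    return nested


href = (
    "https://member.webmd.com/signin?appid=1"
    "&returl=https://www.webmd.com/oral-health/oral-lichen-planus#1"
)
print(extract_returl(href))
```

Inside parse, this could be called on a single string such as title.xpath('@href').extract_first() instead of the list that extract() returns. If the raw HTML only contains https://member.webmd.com/signin, passing fallback=response.url would store the address of the page being parsed instead, assuming that page address is the link you actually want.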






answered Nov 22 '18 at 15:41

Guillaume































