Getting a JavaScript variable's value while scraping with Python

I know this has been asked before, but I am new to scraping and Python. Any help would be very valuable for my learning.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page that I am scraping (only the script part):

    <!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

        var min_news_id = "d7zlgjdu-1"; // line 1
        function loadMoreNews(){
            $("#load-more-btn").hide();
            $("#load-more-gif").show();
            $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
                data = JSON.parse(data);
                min_news_id = data.min_news_id||min_news_id; // line 2
                $(".card-stack").append(data.html);
            })
            .fail(function(){alert("Error : unable to load more news");})
            .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
        }
        jQuery.scrollDepth();
    </script>


From the above, I want to get the value of min_news_id in Python.
I also need the value of the same variable after it is updated at line 2.

Here is how I am doing it:

    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)


But I never get the value of the variable. Is the problem with my regular expression or with something else? Please help me.
Thank you in advance.

  • Some dynamic content is not rendered when scraping with BeautifulSoup. What you see in the browser and what your scraper gets can be markedly different. (You can export page.content and compare.) You'll need a different module, like selenium or requests-html, that can handle dynamic content.
    – Idlehands
    Nov 13 at 14:58

  • @Idlehands Thank you very much for the information. If you have an example reference, please add it.
    – Anil
    Nov 13 at 15:00

  • Can you share the URL?
    – QHarr
    Nov 13 at 15:24

  • inshorts.com/en/read/politics
    – Anil
    Nov 13 at 15:26

  • Using requests, is the JavaScript data ALWAYS there? Also, is the value in your example above, d7zlgjdu-1, what you're looking for?
    – Kamikaze_goldfish
    Nov 13 at 15:37


python web-scraping beautifulsoup python-3.6

asked Nov 13 at 14:55 by Anil


3 Answers

up vote
1
down vote

You can't monitor a JavaScript variable for changes using BeautifulSoup. Here is how to get the next page of news using a while loop, re, and the JSON response:

    from bs4 import BeautifulSoup
    import requests, re

    page_url = 'https://inshorts.com/en/read/politics'
    ajax_url = 'https://inshorts.com/en/ajax/more_news'

    htmlPage = requests.get(page_url).text
    # extract the article summaries with BeautifulSoup
    # page = BeautifulSoup(htmlPage, "html.parser")
    # ...

    # get the current min_news_id from the inline script
    min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1)  # result: d7zlgjdu-1

    customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

    while min_news_id:
        # change "politics" for a different category
        reqBody = {'category': 'politics', 'news_offset': min_news_id}
        # request the next page through the Ajax endpoint
        ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json()  # parse the response body as JSON
        # again, extract the article summaries
        page = BeautifulSoup(ajax_response["html"], "html.parser")
        # ....
        # ....

        # new min_news_id
        min_news_id = ajax_response["min_news_id"]

        # remove this to loop over all pages (thousands?)
        break
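If the `break` is removed, the loop above runs until the server stops returning an offset. A minimal sketch of a bounded version follows, assuming the response is a dict with "html" and "min_news_id" keys as in the answer. The names `iter_pages`, `fetch_page`, `fake_fetch`, and `max_pages` are hypothetical and only for illustration; the fetch step is passed in as a function so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_pages(
    first_offset: str,
    fetch_page: Callable[[str], Dict[str, str]],
    max_pages: int = 5,
) -> Iterator[str]:
    """Yield the HTML fragment of each page, following min_news_id."""
    offset: Optional[str] = first_offset
    seen = set()
    for _ in range(max_pages):
        # Stop on a missing offset or one that was already requested.
        if not offset or offset in seen:
            break
        seen.add(offset)
        response = fetch_page(offset)
        yield response.get("html", "")
        offset = response.get("min_news_id")


def fake_fetch(offset: str) -> Dict[str, str]:
    """Stand-in for the real POST request; returns canned data."""
    canned = {
        "a-1": {"html": "<div>page 1</div>", "min_news_id": "b-1"},
        "b-1": {"html": "<div>page 2</div>", "min_news_id": ""},
    }
    return canned[offset]


pages = list(iter_pages("a-1", fake_fetch))
```

With requests, `fetch_page` could wrap the same `requests.post(ajax_url, headers=customHead, data=reqBody).json()` call used in the answer, with `news_offset` set to the offset argument.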






  • That's not hard in selenium: driver.execute_script("return min_news_id")
    – pguardiario
    Nov 14 at 0:41

  • That returns the current value; it doesn't monitor the value for changes. But it's not hard if you watch for an element change.
    – ewwink
    Nov 14 at 8:33

  • Just put it in a loop with a sleep
    – pguardiario
    Nov 14 at 23:57

  • I hadn't thought of that, but you're right
    – ewwink
    Nov 15 at 0:01

  • I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
    – pguardiario
    Nov 15 at 7:53
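The "loop with a sleep" idea from these comments can be sketched as a small polling helper. This is only an illustration under stated assumptions: `wait_for_change`, `read_value`, `timeout`, and `interval` are hypothetical names, and the reader function is a stand-in so the sketch runs without a browser.

```python
import time
from typing import Callable, Optional


def wait_for_change(
    read_value: Callable[[], str],
    previous: str,
    timeout: float = 10.0,
    interval: float = 0.5,
) -> Optional[str]:
    """Poll read_value() until it differs from previous, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        current = read_value()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in reader: returns the old value twice and then a new one.
_values = iter(["d7zlgjdu-1", "d7zlgjdu-1", "afk0bz0p-1"])
new_value = wait_for_change(
    lambda: next(_values), "d7zlgjdu-1", timeout=5.0, interval=0.01
)
```

With selenium, `read_value` could be `lambda: driver.execute_script("return min_news_id")`, the call suggested in the first comment, where `driver` is an already opened WebDriver session.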

up vote
0
down vote

    import re

    html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
    $("#load-more-btn").hide();
    $("#load-more-gif").show();
    $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
    data = JSON.parse(data);
    min_news_id = data.min_news_id||min_news_id; // line 2
    $(".card-stack").append(data.html);
    })
    .fail(function(){alert("Error : unable to load more news");})
    .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
    </script>'''

    # find every assignment to min_news_id
    finder = re.findall(r'min_news_id = .*;', html)
    print(finder)

Output:

    ['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

Or, to print only the value:

    print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:

    d7zlgjdu-1

Or match the shape of the ID directly:

    finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
    print(finder)

Output:

    ['d7zlgjdu-1']
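One likely reason the pattern in the question never matched is that `re.match` only matches at the very start of the string, while the script text begins with other characters before `var`. The sketch below uses `re.search` with a capture group instead. The helper name `extract_min_news_id` and the whitespace-tolerant pattern are assumptions for illustration, not the method of any answer here.

```python
import json
import re
from typing import Optional

# Allow flexible whitespace and capture only the quoted string literal.
MIN_NEWS_ID_RE = re.compile(r'var\s+min_news_id\s*=\s*("[^"]*")\s*;')


def extract_min_news_id(html: str) -> Optional[str]:
    """Return the initial min_news_id from the page source, or None."""
    found = MIN_NEWS_ID_RE.search(html)  # search scans the whole string
    if found is None:
        return None
    # The captured text is a double-quoted literal, which json.loads decodes.
    return json.loads(found.group(1))


sample = (
    '<script type="text/javascript">\n\n'
    'var min_news_id = "d7zlgjdu-1"; // line 1\n'
    '</script>'
)
value = extract_min_news_id(sample)
# match is anchored at position 0, so it fails on the same text.
anchored = MIN_NEWS_ID_RE.match(sample)
```

This only reads the initial value from the static HTML. The updated value still has to come from the Ajax response or from a browser session.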






  • It doesn't handle the value of the variable once it has been updated.
    – Anil
    Nov 13 at 17:03

  • What do you mean by "handle the value"? What are you trying to accomplish?
    – Kamikaze_goldfish
    Nov 13 at 17:19

  • First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call that the button makes and get more articles. The form data for the HTTP request is category: politics, news_offset: afk0bz0p-1, and the URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news
    – Anil
    Nov 13 at 17:27

  • So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?
    – Kamikaze_goldfish
    Nov 13 at 17:36

up vote
0
down vote

Thank you for the responses. I finally solved it using the requests package after reading its documentation.

Here is my code:

    if InShortsScraper.firstLoad:
        self.pattern = re.compile('var min_news_id = (.+?);')
    else:
        self.pattern = re.compile('min_news_id = (.+?);')
    page = None
    # print("Pattern: " + str(self.pattern))
    if news_offset is None:
        htmlPage = urlopen(url)
        page = bs(htmlPage, "html.parser")
    else:
        self.loadMore['news_offset'] = InShortsScraper.newsOffset
        # print("payload : " + str(self.loadMore))
        try:
            r = myRequest.post(
                url = url,
                data = self.loadMore
            )
        except TypeError:
            print("Error in loading")

        InShortsScraper.newsOffset = r.json()["min_news_id"]
        page = bs(r.json()["html"], "html.parser")
    # print(page)
    if InShortsScraper.newsOffset is None:
        scripts = page.find_all("script")
        for script in scripts:
            for line in script:
                scriptString = str(line)
                if "min_news_id" in scriptString:
                    finder = re.findall(self.pattern, scriptString)
                    InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()






First answer above: answered Nov 13 at 18:19 by ewwink, edited Nov 15 at 13:36












    • That's not hard in selenium: driver.execute_script("return min_news_id")
      – pguardiario
      Nov 14 at 0:41












    • that's return current value, not monitor value on change. but its not hard if using element change.
      – ewwink
      Nov 14 at 8:33












    • Just put it in a loop with a sleep
      – pguardiario
      Nov 14 at 23:57










    • missed thinking about that, but you're right
      – ewwink
      Nov 15 at 0:01






    • 1




      I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
      – pguardiario
      Nov 15 at 7:53


















    • That's not hard in selenium: driver.execute_script("return min_news_id")
      – pguardiario
      Nov 14 at 0:41












    • that's return current value, not monitor value on change. but its not hard if using element change.
      – ewwink
      Nov 14 at 8:33












    • Just put it in a loop with a sleep
      – pguardiario
      Nov 14 at 23:57










    • missed thinking about that, but you're right
      – ewwink
      Nov 15 at 0:01






    • 1




      I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
      – pguardiario
      Nov 15 at 7:53
















    That's not hard in selenium: driver.execute_script("return min_news_id")
    – pguardiario
    Nov 14 at 0:41






    That's not hard in selenium: driver.execute_script("return min_news_id")
    – pguardiario
    Nov 14 at 0:41














    that's return current value, not monitor value on change. but its not hard if using element change.
    – ewwink
    Nov 14 at 8:33






    that's return current value, not monitor value on change. but its not hard if using element change.
    – ewwink
    Nov 14 at 8:33














    Just put it in a loop with a sleep
    – pguardiario
    Nov 14 at 23:57




    Just put it in a loop with a sleep
    – pguardiario
    Nov 14 at 23:57












    missed thinking about that, but you're right
    – ewwink
    Nov 15 at 0:01




    missed thinking about that, but you're right
    – ewwink
    Nov 15 at 0:01




    1




    1




    I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
    – pguardiario
    Nov 15 at 7:53




    I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
    – pguardiario
    Nov 15 at 7:53












    up vote
    0
    down vote













    html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
    $("#load-more-btn").hide();
    $("#load-more-gif").show();
    $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
    data = JSON.parse(data);
    min_news_id = data.min_news_id||min_news_id; // line 2
    $(".card-stack").append(data.html);
    })
    .fail(function(){alert("Error : unable to load more news");})
    .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
    </script>'''

    #1 Find every assignment to min_news_id in the page source:

    finder = re.findall(r'min_news_id = .*;', html)
    print(finder)

    Output:
    ['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']


    #2 Or strip the first match down to the bare value:

    print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

    Output:
    d7zlgjdu-1


    #3 Or match the shape of the ID itself (eight lowercase letters or digits, a hyphen, then one digit):

    finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
    print(finder)

    Output:
    ['d7zlgjdu-1']
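
    As a variation on the patterns above, a capture group can pull out only the quoted value of the initial declaration, so no string replacing is needed afterwards. This is a minimal sketch: initial_min_news_id is a hypothetical helper name, and it assumes the value is always written as a double-quoted literal, as it is in the page shown in the question.

```python
import re

# The relevant lines of the inline script from the question.
script_text = '''
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
min_news_id = data.min_news_id||min_news_id; // line 2
}
'''


def initial_min_news_id(source):
    """Return the quoted value of the var declaration, or None if it is absent."""
    match = re.search(r'var\s+min_news_id\s*=\s*"([^"]*)"', source)
    return match.group(1) if match else None


print(initial_min_news_id(script_text))
```

    Note that this only reads the value present in the downloaded HTML. The reassignment on "line 2" happens in the browser at run time, so no regex over the static page can see the updated value.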













    answered Nov 13 at 15:39, edited Nov 13 at 15:50
    Kamikaze_goldfish












    • It doesn't handle the value of the variable once it has been updated.
      – Anil
      Nov 13 at 17:03

    • What do you mean by "handle the value"? What are you trying to accomplish?
      – Kamikaze_goldfish
      Nov 13 at 17:19

    • First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call that button makes and get more articles. The form data for the HTTP request is category: politics, news_offset: afk0bz0p-1, and the URL for the POST request is https://inshorts.com/en/ajax/more_news
      – Anil
      Nov 13 at 17:27

    • So that's more than your original question implies. What role does the variable you're scraping play in submitting the form data?
      – Kamikaze_goldfish
      Nov 13 at 17:36









































    Thank you for the responses. I finally solved it using the requests package after reading its documentation.

    Here is my code (a fragment of a method on my InShortsScraper class):

    # Assumes: import re; import requests as myRequest;
    # from bs4 import BeautifulSoup as bs; from urllib.request import urlopen
    # On the first load the variable is declared with "var"; afterwards it is only reassigned.
    if InShortsScraper.firstLoad:
        self.pattern = re.compile('var min_news_id = (.+?);')
    else:
        self.pattern = re.compile('min_news_id = (.+?);')
    page = None
    if news_offset is None:
        # First request: fetch the full page and parse it.
        htmlPage = urlopen(url)
        page = bs(htmlPage, "html.parser")
    else:
        # Later requests: POST the current offset to the "load more" endpoint.
        self.loadMore['news_offset'] = InShortsScraper.newsOffset
        try:
            r = myRequest.post(url=url, data=self.loadMore)
            payload = r.json()
        except (myRequest.RequestException, ValueError):
            print("Error in loading")
        else:
            # The JSON reply carries the next offset and the HTML of the new articles.
            InShortsScraper.newsOffset = payload["min_news_id"]
            page = bs(payload["html"], "html.parser")
    if InShortsScraper.newsOffset is None:
        # No offset yet: read the initial value from the inline script.
        for script in page.find_all("script"):
            for line in script:
                scriptString = str(line)
                if "min_news_id" in scriptString:
                    finder = re.findall(self.pattern, scriptString)
                    # The capture group already excludes the name and the semicolon.
                    InShortsScraper.newsOffset = finder[0].replace('"', '').strip()
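
    The same idea can be written as a small standalone loop: each reply's min_news_id becomes the news_offset of the next request. The sketch below is an assumption-laden illustration, not the code above. post_form, iter_more_news and fake_post are hypothetical names, it uses the standard library instead of requests, and the reply keys ("min_news_id", "html") are taken from the page's own script. A real server may also expect extra headers or cookies, which are not handled here. The demo runs offline against a stand-in for the server.

```python
import json
from urllib import parse, request

MORE_NEWS_URL = "https://inshorts.com/en/ajax/more_news"  # endpoint quoted in the comments


def post_form(url, form):
    """POST form-encoded data with the standard library and decode the JSON reply."""
    body = parse.urlencode(form).encode("utf-8")
    with request.urlopen(request.Request(url, data=body)) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_more_news(first_offset, category="politics", max_pages=3, post=post_form):
    """Yield the HTML of each 'load more' batch, chaining offsets between requests."""
    offset = first_offset
    for _ in range(max_pages):
        reply = post(MORE_NEWS_URL, {"category": category, "news_offset": offset})
        yield reply["html"]
        # Mirror the page's script: keep the old offset if the reply has none.
        offset = reply.get("min_news_id") or offset


# Offline demonstration with a hypothetical stand-in for the server.
def fake_post(url, form):
    next_offset = {"d7zlgjdu-1": "afk0bz0p-1"}.get(form["news_offset"])
    return {"min_news_id": next_offset,
            "html": "<div>batch after %s</div>" % form["news_offset"]}


batches = list(iter_more_news("d7zlgjdu-1", max_pages=3, post=fake_post))
print(batches)
```

    Passing the posting function in as a parameter keeps the offset-chaining logic testable without touching the network.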















        answered Nov 15 at 13:36
        Anil





























