Getting javascript variable value while scraping with python
I know this has been asked before, but I am new to scraping and Python, so any help would be very valuable for my learning.
I am scraping a news site using Python with packages such as Beautiful Soup.
I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.
Here is the part of the HTML page that I am scraping (only the script part):
<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>
From the above part, I want to get the value of min_news_id in Python.
I also need the value of the same variable after it is updated at line 2.
Here is how I am doing it:
self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
page = bs(htmlPage, "html.parser")
# find all the script tags
scripts = page.find_all("script")
for script in scripts:
    for line in script:
        scriptString = str(line)
        if "min_news_id" in scriptString:
            scriptString.replace('"', '\"')
            print(scriptString)
            if(self.pattern.match(str(scriptString))):
                print("matched")
                data = self.pattern.match(scriptString)
                jsVariable = json.loads(data.groups()[0])
                InShortsScraper.newsOffset = jsVariable
                print(InShortsScraper.newsOffset)
But I never get the value of the variable. Is the problem with my regular expression or with something else? Please help me.
Thank you in advance.
python web-scraping beautifulsoup python-3.6
Some dynamic content is not rendered when scraping with BeautifulSoup. What you see in the browser can be markedly different from what your scraper receives (you can export page.content and compare). You'll need a different module, such as selenium or requests-html, that can handle dynamic content.
– Idlehands
Nov 13 at 14:58
@Idlehands Thank you very much for the information. If you have any example reference please add it.
– Anil
Nov 13 at 15:00
Can you share the URL?
– QHarr
Nov 13 at 15:24
inshorts.com/en/read/politics
– Anil
Nov 13 at 15:26
When using requests, is the JavaScript data always there? Also, is the value in your example above, d7zlgjdu-1, the one you're looking for?
– Kamikaze_goldfish
Nov 13 at 15:37
asked Nov 13 at 14:55
Anil
3 Answers
You can't monitor a change to a JavaScript variable using BeautifulSoup. Here is how to get the next page of news using a while loop, re, and json:
from bs4 import BeautifulSoup
import requests, re
page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'
htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...
# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1
customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}
while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....
    # new min_news_id
    min_news_id = ajax_response["min_news_id"]
    # remove this to loop all page (thousand?)
    break
That's not hard in selenium: driver.execute_script("return min_news_id")
– pguardiario
Nov 14 at 0:41
That returns the current value; it doesn't monitor the value for changes. But it's not hard if you watch for an element change.
– ewwink
Nov 14 at 8:33
Just put it in a loop with a sleep
– pguardiario
Nov 14 at 23:57
missed thinking about that, but you're right
– ewwink
Nov 15 at 0:01
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 at 7:53
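The suggestion in the comments above (read the variable with Selenium's execute_script, then poll it in a loop with a sleep) could look roughly like this. It is only a minimal sketch: wait_for_change, read_value, interval, and timeout are hypothetical names that do not come from the thread, and the reader is passed in as a plain callable so the loop can be tried without a browser.

```python
import time


def wait_for_change(read_value, previous, interval=0.5, timeout=10.0):
    """Poll read_value() until it returns something other than `previous`.

    Returns the new value, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_value()
        if current != previous:
            return current
        time.sleep(interval)
    return None


# Stand-in reader for demonstration: returns the old value twice, then a new one.
_values = iter(["d7zlgjdu-1", "d7zlgjdu-1", "afk0bz0p-1"])
new_value = wait_for_change(lambda: next(_values), "d7zlgjdu-1", interval=0.01)
print(new_value)
```

With a Selenium driver, read_value could be something like lambda: driver.execute_script("return min_news_id"). Something in the page still has to trigger loadMoreNews() (for example, clicking the load-more button) before the value changes.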
import re
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
It doesn't handle the value of the variable once it has been updated
– Anil
Nov 13 at 17:03
What do you mean handle the value? What are you trying to accomplish?
– Kamikaze_goldfish
Nov 13 at 17:19
First I get the articles by opening the URL. The page has a load-more button, so I want to make the same call that button makes and get more articles. Here is the form data for the HTTP request: category: politics, news_offset: afk0bz0p-1. The URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news
– Anil
Nov 13 at 17:27
So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?
– Kamikaze_goldfish
Nov 13 at 17:36
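One likely reason the pattern in the question never matches: re.match only matches at the very start of the string, and the script text begins with whitespace before var min_news_id, whereas re.search scans the whole string. Here is a minimal sketch, assuming the inline script looks like the snippet in the question (extract_min_news_id and MIN_NEWS_ID_RE are hypothetical names):

```python
import json
import re

# Capture the quoted string literal assigned in the `var` declaration.
MIN_NEWS_ID_RE = re.compile(r'var\s+min_news_id\s*=\s*("[^"]*")\s*;')


def extract_min_news_id(script_text):
    """Return the initial min_news_id from inline script text, or None."""
    found = MIN_NEWS_ID_RE.search(script_text)
    if found is None:
        return None
    # The captured group is a plain double-quoted literal, so it parses as JSON here.
    return json.loads(found.group(1))


sample = '''
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
min_news_id = data.min_news_id||min_news_id; // line 2
}
'''
print(extract_min_news_id(sample))   # value found by search
print(MIN_NEWS_ID_RE.match(sample))  # match fails: the text starts with a newline
```

This only recovers the value embedded in the served HTML. The assignment on line 2 happens in the browser at run time, so later values have to come from the Ajax response instead.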
Thank you for the responses. I finally solved it using the requests package after reading its documentation.
Here is my code:
if InShortsScraper.firstLoad:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset is None:
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url = url,
            data = self.loadMore
        )
    except TypeError:
        print("Error in loading")
    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
#print(page)
if InShortsScraper.newsOffset is None:
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
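The same load-more flow can also be written as a small generator. This is a sketch under assumptions, not the code used above: iter_news_pages, post_page, and max_pages are hypothetical names, and the response is assumed to be a dict with the html and min_news_id keys that the answers read from the Ajax response. The POST is injected as a callable so the loop can be exercised without network access.

```python
def iter_news_pages(post_page, first_offset, category="politics", max_pages=5):
    """Yield the HTML of successive 'load more' pages.

    post_page(form_data) must return a dict with "html" and "min_news_id".
    """
    offset = first_offset
    for _ in range(max_pages):
        data = post_page({"category": category, "news_offset": offset})
        yield data["html"]
        next_offset = data.get("min_news_id")
        # Stop when the server returns no new offset, to avoid re-requesting the same page.
        if not next_offset or next_offset == offset:
            break
        offset = next_offset


# Fake endpoint for demonstration: two pages, then no further offset.
_fake = {
    "d7zlgjdu-1": {"html": "<div>page 2</div>", "min_news_id": "afk0bz0p-1"},
    "afk0bz0p-1": {"html": "<div>page 3</div>", "min_news_id": None},
}
pages = list(iter_news_pages(lambda form: _fake[form["news_offset"]], "d7zlgjdu-1"))
print(pages)
```

With requests, post_page could be something like lambda form: requests.post(ajax_url, data=form).json(), where ajax_url is the more_news endpoint quoted earlier in the thread. The max_pages cap is there so a run does not walk thousands of pages by accident.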
The first answer above (the requests while loop) was answered Nov 13 at 18:19 by ewwink and edited Nov 15 at 13:36.
edited Nov 13 at 15:50
answered Nov 13 at 15:39
Kamikaze_goldfish
463311
It's not handling the value of the variable once it has been updated.
– Anil
Nov 13 at 17:03
What do you mean by "handle the value"? What are you trying to accomplish?
– Kamikaze_goldfish
Nov 13 at 17:19
First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call that button makes and get more articles. Here is the form data for the HTTP request:
category: politics
news_offset: afk0bz0p-1
The URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news
– Anil
Nov 13 at 17:27
So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?
– Kamikaze_goldfish
Nov 13 at 17:36
Thank you for the response. I finally solved it using the requests package after reading its documentation.
Here is my code:
    # Fragment of a method on InShortsScraper. It assumes `import re`,
    # `from urllib.request import urlopen`, `from bs4 import BeautifulSoup as bs`,
    # and that myRequest is the requests module (or a requests session).
    if InShortsScraper.firstLoad:
        self.pattern = re.compile('var min_news_id = (.+?);')
    else:
        self.pattern = re.compile('min_news_id = (.+?);')
    page = None
    # print("Pattern: " + str(self.pattern))
    if news_offset is None:
        # First load: fetch the page itself and parse the full HTML.
        htmlPage = urlopen(url)
        page = bs(htmlPage, "html.parser")
    else:
        # Later loads: POST the last known offset to the "load more" endpoint.
        self.loadMore['news_offset'] = InShortsScraper.newsOffset
        # print("payload : " + str(self.loadMore))
        try:
            r = myRequest.post(
                url=url,
                data=self.loadMore
            )
        except TypeError:
            print("Error in loading")
        else:
            # Only read the response if the request went through;
            # the updated offset comes back in the JSON body.
            InShortsScraper.newsOffset = r.json()["min_news_id"]
            page = bs(r.json()["html"], "html.parser")
    # print(page)
    if InShortsScraper.newsOffset is None:
        # No offset yet: pull the initial value out of the inline script.
        scripts = page.find_all("script")
        for script in scripts:
            for line in script:
                scriptString = str(line)
                if "min_news_id" in scriptString:
                    finder = re.findall(self.pattern, scriptString)
                    InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
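The same flow can be sketched without class-level state. This is an illustration under stated assumptions, not the code above: iter_news_pages, post_json, post_with_requests and fake_post are hypothetical names, the response is assumed to be a JSON object with "html" and "min_news_id" keys (as the page's own JavaScript and the code above suggest), and the fake server mapping between the two IDs is invented purely to show the control flow.

```python
import re

MIN_ID_RE = re.compile(r'var\s+min_news_id\s*=\s*"([^"]+)"\s*;')


def iter_news_pages(first_html, post_json, category="politics", max_pages=3):
    """Yield the first page's HTML, then up to max_pages "load more" fragments.

    post_json is a hypothetical callable: it takes the form-data dict and
    returns the decoded JSON response as a dict.
    """
    yield first_html
    match = MIN_ID_RE.search(first_html)
    if match is None:
        return  # no offset in the page, so nothing more can be requested
    offset = match.group(1)
    for _ in range(max_pages):
        data = post_json({"category": category, "news_offset": offset})
        yield data.get("html", "")
        # Mirror the page's own JS: keep the old offset if none is returned.
        new_offset = data.get("min_news_id") or offset
        if new_offset == offset:
            break  # offset did not advance; stop rather than refetch the same page
        offset = new_offset


def post_with_requests(payload):
    """Real transport, using the endpoint URL quoted in the comments."""
    import requests  # third-party; imported lazily so the sketch loads without it
    response = requests.post(
        "https://inshorts.com/en/ajax/more_news", data=payload, timeout=10
    )
    response.raise_for_status()
    return response.json()


def fake_post(payload):
    # Stand-in for the server, used only to demonstrate the loop offline.
    next_id = {"d7zlgjdu-1": "afk0bz0p-1"}.get(payload["news_offset"])
    return {
        "html": "<div>page after %s</div>" % payload["news_offset"],
        "min_news_id": next_id,
    }


pages = list(iter_news_pages('var min_news_id = "d7zlgjdu-1";', fake_post))
print(len(pages))
```

For a live run, pass post_with_requests instead of fake_post. Keeping the transport as a parameter means the paging logic can be checked without touching the network.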
answered Nov 15 at 13:36
Anil
4831725
Some dynamic content is not rendered when scraping with BeautifulSoup. What you're seeing in the browser and what your scraper is getting can be markedly different. (You can export page.content and compare.) You'll need a different module, like selenium or requests-html, that can handle dynamic content.
– Idlehands
Nov 13 at 14:58
@Idlehands Thank you very much for the information. If you have an example or reference, please add it.
– Anil
Nov 13 at 15:00
Can you share the URL?
– QHarr
Nov 13 at 15:24
inshorts.com/en/read/politics
– Anil
Nov 13 at 15:26
By using requests, is the JavaScript data ALWAYS there? Also, is it the value of the variable, d7zlgjdu-1 in your example above, that you're looking for?
– Kamikaze_goldfish
Nov 13 at 15:37