Are spaces around CSS combinators really optional?
I'm a bit confused about using CSS selectors with combinators in BeautifulSoup. Below is some simple code to illustrate what I mean:
from bs4 import BeautifulSoup as bs
import requests
response = requests.get('https://stackoverflow.com/questions/tagged/python')
soup = bs(response.text)
print(len(soup.select('#mainbar > div')))
returns 6 children, but
print(len(soup.select('#mainbar>div')))
returns 0 children.
The same happens with '#mainbar ~ div' (found 1 sibling) and '#mainbar~div' (found nothing).
According to the documentation those spaces are optional, but in fact I get different output from BeautifulSoup for what I thought were the same selectors.
So is this a bs4 bug, or does this behavior depend on the CSS version or something else?
python web-scraping beautifulsoup css-selectors
asked Nov 20 '18 at 21:21 by JaSON
Why don't you just not do that? If I inherited code like that it would make me unhappy.
– pguardiario
Nov 21 '18 at 1:23
2 Answers
This is confirmed as a bug here: https://bugs.launchpad.net/beautifulsoup/+bug/1717851
From a CSS perspective, the selector is fine with or without the spaces.
I will see if I can find further evidence.
The individual reporting the bug states:
"The issue, as far as I see, is that since the code is only doing a shlex.split, it doesn't treat div, >, and span as separate entities if a space is left out on either side of >."
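The tokenization described in the quote can be reproduced with the standard library alone. This is only a sketch: it assumes, as the bug report states, that the selector string is split with shlex.split and nothing else, and the helper name tokens_for is made up for illustration (it is not part of bs4).

```python
import shlex

def tokens_for(selector):
    # Hypothetical helper: splits a selector purely on whitespace,
    # which is what the bug report says the affected bs4 code did.
    return shlex.split(selector)

# With spaces, the combinator becomes its own token.
print(tokens_for('#mainbar > div'))

# Without spaces, the whole selector stays a single token,
# so the combinator is never seen as a separate entity.
print(tokens_for('#mainbar>div'))
```

If the selector code only recognizes a combinator when it appears as a standalone token, the unspaced form cannot be interpreted as "child of #mainbar", which would be consistent with the differing results reported in the question.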
answered Nov 20 '18 at 21:36 by QHarr
Thanks for the link. However, in the bug description the user gets a ValueError, while I just got an empty list... Maybe this was some kind of quick fix to avoid breaking scripts...
– JaSON
Nov 20 '18 at 21:43
I can't honestly say, though it sounds plausible. I am looking to see if I can find anything more up to date.
– QHarr
Nov 20 '18 at 21:43
There is no additional mention in the development log: code.launchpad.net/beautifulsoup
– QHarr
Nov 20 '18 at 21:59
Thank you for the help. I just wanted to understand whether bs4 is good for scraping or not... and as far as I can see, not so good :)
– JaSON
Nov 20 '18 at 22:04
Bs4 is great for scraping in my limited experience. It would seem you just need to remember the spaces. The appropriate spaces also make for more legible selectors.
– QHarr
Nov 20 '18 at 22:05
In case you want to patch it, see bs4/element.py, line 1440. Replace
tokens = shlex.split(selector)
with
selector = re.sub(r'\s*([+>~])\s*', r' \1 ', selector)
tokens = shlex.split(selector)
Demo:
import re, shlex

def testSelect(selector):
    # Pad each combinator with single spaces so shlex.split sees it as its own token
    selector = re.sub(r'\s*([+>~])\s*', r' \1 ', selector)
    tokens = shlex.split(selector)
    print(tokens)

testSelect('#mainbar > div ~ p')  # default
testSelect('#mainbar>div~p')
testSelect('#mainbar >div+ p')
testSelect('#mainbar.classA')
testSelect('#mainbar p')
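If patching the installed library is not an option, the same idea could be applied to the selector string before it is passed to select(). The sketch below is an assumption-laden variation, not bs4 code: normalize_combinators is a hypothetical helper, and it additionally skips anything inside [...] or (...), because a plain substitution on +, > and ~ would also pad the ~ in an attribute selector such as [rel~=next] or the + in :nth-child(2n+1). Nested brackets or parentheses are not handled.

```python
import re

# Either a bracketed/parenthesized chunk (left untouched) or a combinator
# with optional surrounding whitespace (normalized to single spaces).
_PROTECTED_OR_COMBINATOR = re.compile(r'(\[[^\]]*\]|\([^)]*\))|\s*([+>~])\s*')

def normalize_combinators(selector):
    # Hypothetical helper, not part of bs4.
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)          # attribute selector or pseudo-class argument
        return ' ' + match.group(2) + ' '  # combinator padded with single spaces
    return _PROTECTED_OR_COMBINATOR.sub(repl, selector)

for raw in ('#mainbar>div~p', '#mainbar >div+ p',
            'a[rel~=next]>span', 'li:nth-child(2n+1)+li'):
    print(normalize_combinators(raw))
```

A call such as soup.select(normalize_combinators('#mainbar>div')) would then hand the spaced form to the library, leaving the installed package unmodified.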
answered Nov 20 '18 at 22:51 by ewwink (edited Nov 22 '18 at 6:25)