Scala Set intersection performance
Using Scala's scala.collection.Set[T]. Given a small set s with only a few elements and another big set b with lots of elements, is there any performance difference between:
s & b // s intersect b
and
b & s // b intersect s.
If yes, which is the fastest?
scala performance
add a comment |
Using Scala's scala.collection.Set[T]. Given a small set s with only a few elements and another big set b with lots of elements, is there any performance difference between:
s & b // s intersect b
and
b & s // b intersect s.
If yes, which is the fastest?
scala performance
2
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
You can usefilterto take advantage of the size difference to get better performance:smallSet filter bigSet. This will make a new set by adding the items insmallSetare inbigSetwithout you needing to check each element ofbigSet.&may do this under the hood but better to be explicit in my opinion.
– morsecodist
Nov 16 at 6:13
@morsecodist; According to the source code,intersect()actually is thefilter()op.
– jwvh
Nov 16 at 8:38
@jwvh It is probably overriden, as the implementation which is called eventually isscala.collection.immutable.HashSet.HashTrieSet#intersect0and looks quite complex to me. Based on my rough benchmarking its peformance is comparable toa filter band it is much faster thanb filter a.
– Suma
Nov 16 at 11:38
add a comment |
Using Scala's scala.collection.Set[T]. Given a small set s with only a few elements and another big set b with lots of elements, is there any performance difference between:
s & b // s intersect b
and
b & s // b intersect s.
If yes, which is the fastest?
scala performance
Using Scala's scala.collection.Set[T]. Given a small set s with only a few elements and another big set b with lots of elements, is there any performance difference between:
s & b // s intersect b
and
b & s // b intersect s.
If yes, which is the fastest?
scala performance
scala performance
asked Nov 16 at 0:12
pgrandjean
18439
18439
2
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
You can usefilterto take advantage of the size difference to get better performance:smallSet filter bigSet. This will make a new set by adding the items insmallSetare inbigSetwithout you needing to check each element ofbigSet.&may do this under the hood but better to be explicit in my opinion.
– morsecodist
Nov 16 at 6:13
@morsecodist; According to the source code,intersect()actually is thefilter()op.
– jwvh
Nov 16 at 8:38
@jwvh It is probably overriden, as the implementation which is called eventually isscala.collection.immutable.HashSet.HashTrieSet#intersect0and looks quite complex to me. Based on my rough benchmarking its peformance is comparable toa filter band it is much faster thanb filter a.
– Suma
Nov 16 at 11:38
add a comment |
2
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
You can usefilterto take advantage of the size difference to get better performance:smallSet filter bigSet. This will make a new set by adding the items insmallSetare inbigSetwithout you needing to check each element ofbigSet.&may do this under the hood but better to be explicit in my opinion.
– morsecodist
Nov 16 at 6:13
@morsecodist; According to the source code,intersect()actually is thefilter()op.
– jwvh
Nov 16 at 8:38
@jwvh It is probably overriden, as the implementation which is called eventually isscala.collection.immutable.HashSet.HashTrieSet#intersect0and looks quite complex to me. Based on my rough benchmarking its peformance is comparable toa filter band it is much faster thanb filter a.
– Suma
Nov 16 at 11:38
2
2
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
You can use
filter to take advantage of the size difference to get better performance: smallSet filter bigSet. This will make a new set by adding the items in smallSet are in bigSet without you needing to check each element of bigSet. & may do this under the hood but better to be explicit in my opinion.– morsecodist
Nov 16 at 6:13
You can use
filter to take advantage of the size difference to get better performance: smallSet filter bigSet. This will make a new set by adding the items in smallSet are in bigSet without you needing to check each element of bigSet. & may do this under the hood but better to be explicit in my opinion.– morsecodist
Nov 16 at 6:13
@morsecodist; According to the source code,
intersect() actually is the filter() op.– jwvh
Nov 16 at 8:38
@morsecodist; According to the source code,
intersect() actually is the filter() op.– jwvh
Nov 16 at 8:38
@jwvh It is probably overriden, as the implementation which is called eventually is
scala.collection.immutable.HashSet.HashTrieSet#intersect0 and looks quite complex to me. Based on my rough benchmarking its peformance is comparable to a filter b and it is much faster than b filter a.– Suma
Nov 16 at 11:38
@jwvh It is probably overriden, as the implementation which is called eventually is
scala.collection.immutable.HashSet.HashTrieSet#intersect0 and looks quite complex to me. Based on my rough benchmarking its peformance is comparable to a filter b and it is much faster than b filter a.– Suma
Nov 16 at 11:38
add a comment |
3 Answers
3
active
oldest
votes
The generic implementation seen in the GenSetLike using intersect is overriden for HashSet with an implementation which looks quite complex to me (see scala.collection.immutable.HashSet.HashTrieSet#intersect0). Based on my rough benchmark its performance is similar for both a & b and b & a and it is similar to the performance of a filter b, which is an order of magnitude faster than b filter a. My testing code is:
object Sets extends App {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e6 + "ms")
result
}
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet
time {a & b}
time {b & a}
time {a & b}
time {b & a}
time {a & b}
time {b & a}
println("Filter")
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
}
Result is:
Elapsed time: 6.990442ms
Elapsed time: 4.25017ms
Elapsed time: 4.1089ms
Elapsed time: 4.480789ms
Elapsed time: 3.71588ms
Elapsed time: 3.160159ms
Filter
Elapsed time: 7.781613ms
Elapsed time: 68.33023ms
Elapsed time: 5.858472ms
Elapsed time: 42.491131ms
Elapsed time: 2.982364ms
Elapsed time: 52.762474ms
add a comment |
The answer is: it's complicated.
The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.
Very small s (1..4)
So let's say you have a large set b and a small set s, with s containing <4 elements.
Then it will make a large difference:
b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.
s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.
Small s (n>4)
As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.
General advice
If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.
add a comment |
Let us create two sets as per the condition mentioned
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet //Relatively larger data
we can define a time function to check the execution time as below
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
Now we can check the intersection conditon
scala> time {a & b}
Elapsed time: 5895220ns
res2: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ..)
scala> time {b & a}
Elapsed time: 6038271ns
res3: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ...)
So by this we can conclude that intersection between a smaller and large dataset has performance difference and it is better to have the smaller dataset on the left side for faster execution for Scala set
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53329633%2fscala-set-intersection-performance%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
The generic implementation seen in the GenSetLike using intersect is overriden for HashSet with an implementation which looks quite complex to me (see scala.collection.immutable.HashSet.HashTrieSet#intersect0). Based on my rough benchmark its performance is similar for both a & b and b & a and it is similar to the performance of a filter b, which is an order of magnitude faster than b filter a. My testing code is:
object Sets extends App {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e6 + "ms")
result
}
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet
time {a & b}
time {b & a}
time {a & b}
time {b & a}
time {a & b}
time {b & a}
println("Filter")
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
}
Result is:
Elapsed time: 6.990442ms
Elapsed time: 4.25017ms
Elapsed time: 4.1089ms
Elapsed time: 4.480789ms
Elapsed time: 3.71588ms
Elapsed time: 3.160159ms
Filter
Elapsed time: 7.781613ms
Elapsed time: 68.33023ms
Elapsed time: 5.858472ms
Elapsed time: 42.491131ms
Elapsed time: 2.982364ms
Elapsed time: 52.762474ms
add a comment |
The generic implementation seen in the GenSetLike using intersect is overriden for HashSet with an implementation which looks quite complex to me (see scala.collection.immutable.HashSet.HashTrieSet#intersect0). Based on my rough benchmark its performance is similar for both a & b and b & a and it is similar to the performance of a filter b, which is an order of magnitude faster than b filter a. My testing code is:
object Sets extends App {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e6 + "ms")
result
}
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet
time {a & b}
time {b & a}
time {a & b}
time {b & a}
time {a & b}
time {b & a}
println("Filter")
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
}
Result is:
Elapsed time: 6.990442ms
Elapsed time: 4.25017ms
Elapsed time: 4.1089ms
Elapsed time: 4.480789ms
Elapsed time: 3.71588ms
Elapsed time: 3.160159ms
Filter
Elapsed time: 7.781613ms
Elapsed time: 68.33023ms
Elapsed time: 5.858472ms
Elapsed time: 42.491131ms
Elapsed time: 2.982364ms
Elapsed time: 52.762474ms
add a comment |
The generic implementation seen in the GenSetLike using intersect is overriden for HashSet with an implementation which looks quite complex to me (see scala.collection.immutable.HashSet.HashTrieSet#intersect0). Based on my rough benchmark its performance is similar for both a & b and b & a and it is similar to the performance of a filter b, which is an order of magnitude faster than b filter a. My testing code is:
object Sets extends App {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e6 + "ms")
result
}
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet
time {a & b}
time {b & a}
time {a & b}
time {b & a}
time {a & b}
time {b & a}
println("Filter")
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
}
Result is:
Elapsed time: 6.990442ms
Elapsed time: 4.25017ms
Elapsed time: 4.1089ms
Elapsed time: 4.480789ms
Elapsed time: 3.71588ms
Elapsed time: 3.160159ms
Filter
Elapsed time: 7.781613ms
Elapsed time: 68.33023ms
Elapsed time: 5.858472ms
Elapsed time: 42.491131ms
Elapsed time: 2.982364ms
Elapsed time: 52.762474ms
The generic implementation seen in the GenSetLike using intersect is overriden for HashSet with an implementation which looks quite complex to me (see scala.collection.immutable.HashSet.HashTrieSet#intersect0). Based on my rough benchmark its performance is similar for both a & b and b & a and it is similar to the performance of a filter b, which is an order of magnitude faster than b filter a. My testing code is:
object Sets extends App {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e6 + "ms")
result
}
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet
time {a & b}
time {b & a}
time {a & b}
time {b & a}
time {a & b}
time {b & a}
println("Filter")
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
time {a filter b}
time {b filter a}
}
Result is:
Elapsed time: 6.990442ms
Elapsed time: 4.25017ms
Elapsed time: 4.1089ms
Elapsed time: 4.480789ms
Elapsed time: 3.71588ms
Elapsed time: 3.160159ms
Filter
Elapsed time: 7.781613ms
Elapsed time: 68.33023ms
Elapsed time: 5.858472ms
Elapsed time: 42.491131ms
Elapsed time: 2.982364ms
Elapsed time: 52.762474ms
edited Nov 16 at 12:13
answered Nov 16 at 11:49
Suma
21.3k890154
21.3k890154
add a comment |
add a comment |
The answer is: it's complicated.
The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.
Very small s (1..4)
So let's say you have a large set b and a small set s, with s containing <4 elements.
Then it will make a large difference:
b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.
s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.
Small s (n>4)
As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.
General advice
If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.
add a comment |
The answer is: it's complicated.
The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.
Very small s (1..4)
So let's say you have a large set b and a small set s, with s containing <4 elements.
Then it will make a large difference:
b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.
s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.
Small s (n>4)
As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.
General advice
If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.
add a comment |
The answer is: it's complicated.
The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.
Very small s (1..4)
So let's say you have a large set b and a small set s, with s containing <4 elements.
Then it will make a large difference:
b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.
s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.
Small s (n>4)
As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.
General advice
If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.
The answer is: it's complicated.
The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.
Very small s (1..4)
So let's say you have a large set b and a small set s, with s containing <4 elements.
Then it will make a large difference:
b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.
s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.
Small s (n>4)
As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.
General advice
If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.
answered Nov 16 at 22:10
Rüdiger Klaehn
10k13149
10k13149
add a comment |
add a comment |
Let us create two sets as per the condition mentioned
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet //Relatively larger data
we can define a time function to check the execution time as below
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
Now we can check the intersection conditon
scala> time {a & b}
Elapsed time: 5895220ns
res2: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ..)
scala> time {b & a}
Elapsed time: 6038271ns
res3: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ...)
So by this we can conclude that intersection between a smaller and large dataset has performance difference and it is better to have the smaller dataset on the left side for faster execution for Scala set
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
add a comment |
Let us create two sets as per the condition mentioned
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet //Relatively larger data
we can define a time function to check the execution time as below
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
Now we can check the intersection conditon
scala> time {a & b}
Elapsed time: 5895220ns
res2: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ..)
scala> time {b & a}
Elapsed time: 6038271ns
res3: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ...)
So by this we can conclude that intersection between a smaller and large dataset has performance difference and it is better to have the smaller dataset on the left side for faster execution for Scala set
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
add a comment |
Let us create two sets as per the condition mentioned
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet //Relatively larger data
we can define a time function to check the execution time as below
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
Now we can check the intersection conditon
scala> time {a & b}
Elapsed time: 5895220ns
res2: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ..)
scala> time {b & a}
Elapsed time: 6038271ns
res3: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ...)
So by this we can conclude that intersection between a smaller and large dataset has performance difference and it is better to have the smaller dataset on the left side for faster execution for Scala set
Let us create two sets as per the condition mentioned
val a = (0 until 10000 by 1).toSet //smaller data
val b = (0 until 1000000 by 2).toSet //Relatively larger data
we can define a time function to check the execution time as below
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
Now we can check the intersection conditon
scala> time {a & b}
Elapsed time: 5895220ns
res2: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ..)
scala> time {b & a}
Elapsed time: 6038271ns
res3: scala.collection.immutable.Set[Int] = Set(892, 5810, 8062, ...)
So by this we can conclude that intersection between a smaller and large dataset has performance difference and it is better to have the smaller dataset on the left side for faster execution for Scala set
answered Nov 16 at 10:34
prasanna kumar
798
798
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
add a comment |
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
2
2
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
The difference seem to be quite small given the huge size difference between the two sets. As the benchmark does not contain any warm-up and it performed from REPL, I would say it is inconclusive.
– Suma
Nov 16 at 11:20
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
When I try this 100 Times I also get no conclusive answer.
– pme
Nov 16 at 11:23
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53329633%2fscala-set-intersection-performance%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
"If yes, which is the fastest?" -- why not benchmark it and see?
– John Coleman
Nov 16 at 0:14
You can use
filterto take advantage of the size difference to get better performance:smallSet filter bigSet. This will make a new set by adding the items insmallSetare inbigSetwithout you needing to check each element ofbigSet.&may do this under the hood but better to be explicit in my opinion.– morsecodist
Nov 16 at 6:13
@morsecodist; According to the source code,
intersect()actually is thefilter()op.– jwvh
Nov 16 at 8:38
@jwvh It is probably overriden, as the implementation which is called eventually is
scala.collection.immutable.HashSet.HashTrieSet#intersect0and looks quite complex to me. Based on my rough benchmarking its peformance is comparable toa filter band it is much faster thanb filter a.– Suma
Nov 16 at 11:38