Project

General

Profile

Actions

Feature #16989

open

Sets: need ♥️

Added by marcandre (Marc-Andre Lafortune) 12 months ago. Updated about 2 months ago.

Status:
Assigned
Priority:
Normal
Target version:
-
[ruby-core:98964]

Description

I am opening a series of feature requests on Set, all of them based on this usecase.

The main usecase I have in mind is my recent experience with RuboCop. I noticed a big number of frozen arrays being used only to later call include? on them. This is O(n) instead of O(1).

Trying to convert them to Sets causes major compatibility issues, as well as very frustrating situations and some cases that would make them much less efficient.

Because of these incompatibilities, RuboCop is in the process of using a custom class based on Array with optimized include? and ===. RuboCop runs multiple checks on Ruby code. Those checks are called cops. RuboCop performance is (IMO) pretty bad and some cops currently are in O(n^2) where n is the size of the code being inspected. Even given these extremely inefficient cops, optimizing the 100+ such arrays (most of which are quite small btw) gave a 5% speed boost.

RuboCop PRs for reference: https://github.com/rubocop-hq/rubocop-ast/pull/29
https://github.com/rubocop-hq/rubocop/pull/8133

My experience tells me that there are many other opportunities to use Sets that are missed because Sets are not builtin, not known enough and have no shorthand notation.

In this issue I'd like to concentrate the discussion on the following request: Sets should be core objects, in the same way that Complex were not and are now. Some of the upcoming feature requests would be easier (or only possible) to implement were Sets builtin.


Related issues

Related to Ruby master - Feature #16990: Sets: operators compatibility with ArrayOpenActions
Related to Ruby master - Feature #16991: Sets: add Set#joinOpenActions
Related to Ruby master - Feature #16992: Sets: officially orderedOpenmatz (Yukihiro Matsumoto)Actions
Related to Ruby master - Feature #16993: Sets: from hash keys using Hash#key_setOpenActions
Related to Ruby master - Feature #16994: Sets: shorthand for frozen sets of symbols / stringsFeedbackActions
Related to Ruby master - Feature #16995: Sets: <=> should be specializedOpenActions
Actions #1

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

  • Related to Feature #16990: Sets: operators compatibility with Array added
Actions #2

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

Actions #3

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

Actions #4

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

  • Related to Feature #16993: Sets: from hash keys using Hash#key_set added
Actions #5

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

  • Related to Feature #16994: Sets: shorthand for frozen sets of symbols / strings added
Actions #6

Updated by marcandre (Marc-Andre Lafortune) 12 months ago

Updated by nobu (Nobuyoshi Nakada) 12 months ago

  • Assignee set to knu (Akinori MUSHA)
  • Status changed from Open to Assigned

Updated by mame (Yusuke Endoh) 10 months ago

We discussed this issue at the last-month meeting.

https://github.com/ruby/dev-meeting-log/blob/master/DevelopersMeeting20200720Japan.md

matz: Positive to introduce Set into core. But we need to first introduce a set literal. { x, y, z } is good, but JavaScript uses it as another meanings, so it would bring confusion. I have no idea.
knu: As a set library maintainer, I'll first communicate, and try to solve the easy issues that can be sorted out in the library side.

knu (Akinori MUSHA) ?

Updated by marcandre (Marc-Andre Lafortune) 10 months ago

That's great news.

Was there discussion for a frozen set literal of symbols / strings as I proposed in #16994? For sets that need to be mutable or containing different types of objects, I find the existing Set[...] quite convenient.
Moreover, I would like to avoid having to add a magic comment # frozen_set_literals: true...

Updated by Dan0042 (Daniel DeLorme) 10 months ago

matz: Positive to introduce Set into core. But we need to first introduce a set literal. { x, y, z } is good, but JavaScript uses it as another meanings, so it would bring confusion.

I have to ask, is [lack of] similarity with javascript really a problem? Apparently Python uses { x, y, z } (and set() for an empty set). #5478 has a lot of discussion on Set literals but ultimately nothing beats the simplicity and obviousness of { x, y, z }

Updated by Eregon (Benoit Daloze) 10 months ago

IMHO there is no need for a literal Set notation (which would anyway restrict the type of keys, etc).

Some of the upcoming feature requests would be easier (or only possible) to implement were Sets builtin.

marcandre (Marc-Andre Lafortune) Which feature requests except having a literal notation (#16994) require Set to be builtin? I think only #16993.
I think we could make good progress here by addressing the other feature requests first.

Also moving Set to core makes might make it more complicated to backport to older Rubies.
(Unfortunately Set is only a default gem in 3.0, but it's probably easy enough to require some Set additions with a different require name)

Updated by marcandre (Marc-Andre Lafortune) 10 months ago

Eregon (Benoit Daloze) wrote in #note-11:

IMHO there is no need for a literal Set notation (which would anyway restrict the type of keys, etc).

I agree. Set[...] works very well already for constructing sets at runtime out of any time of elements.

Some of the upcoming feature requests would be easier (or only possible) to implement were Sets builtin.

marcandre (Marc-Andre Lafortune) Which feature requests except having a literal notation (#16994) require Set to be builtin? I think only #16993.

Indeed, the static notation requires a change to the Ruby parser.

#16993 would be easier and faster if builtin, although it is implementable with a transform_values as I showed in the issue.

A bigger issue (and more important feature imo) is interoperability with Array #16990. If not in core, this would require monkeypatching a bunch of Array methods if not in core and would also require looping in Ruby so would not be as efficient.

I think we could make good progress here by addressing the other feature requests first.

Also moving Set to core makes might make it more complicated to backport to older Rubies.

If we want to backport to older Rubies, this could still partly be done with a gem, we'll just need some conditional definitions.

Updated by Dan0042 (Daniel DeLorme) 10 months ago

which would anyway restrict the type of keys

I don't get that part. If you have a literal like { x, y, z } then x/y/z can be any type of object. There's no restriction. But I agree Set[] is short and good enough.

Updated by Eregon (Benoit Daloze) 9 months ago

Also, does moving Set to core mean rewriting it to C?
I think that would be suboptimal, because every implementation would have to maintain its own native implementation then (or reuse the C extension).
It would also hurt readability.
Much of Ruby's performance can be achieved by JIT'ing Ruby code, and Hash is well optimized, so I'm not sure it's a performance gain either.

BTW SortedSet has this rbtree usage, that seems problematic to move in core (core should not depend on RubyGems):
https://github.com/ruby/ruby/blob/eada6350332155972f19bad52bd8621f607520a2/lib/set.rb#L708
Only moving Set but not SortedSet in core would avoid that issue though, but be somewhat inconsistent.

Updated by Eregon (Benoit Daloze) 9 months ago

Dan0042 (Daniel DeLorme) wrote in #note-13:

I don't get that part. If you have a literal like { x, y, z } then x/y/z can be any type of object. There's no restriction. But I agree Set[] is short and good enough.

Right, I was thinking of #16994.
Set[] seems already good enough as a general syntax, and no confusion with JS object literals / potentially future Ruby Hash shorthand literals.

Updated by marcandre (Marc-Andre Lafortune) 9 months ago

Eregon (Benoit Daloze) wrote in #note-14:

Also, does moving Set to core mean rewriting it to C?
I think that would be suboptimal, because every implementation would have to maintain its own native implementation then (or reuse the C extension).

I'm not sure, but Set is quite small, most of the processing uses Hash...

Only moving Set but not SortedSet in core would avoid that issue though, but be somewhat inconsistent.

I should have clarified: while I am convinced that Set is a basic fundamental class, I have yet to use SortedSet or see it in use. I would keep it out of core (in set).

Updated by matz (Yukihiro Matsumoto) 9 months ago

I agree with some of your proposals (#16990, #16991, #16993, #16995). I want knu (Akinori MUSHA) to work on this. If I missed something, he will tell us.

I strongly disagree with #16994. There's no evidence we need frozen sets of strings or symbols that much. Even if we do, I think frozen arrays should come first.

I weakly disagree with #16992. Currently set orders are determined by the internal hash. We may change the implementation in the future to improve performance or memory overhead. Fixing the order could possibly restrict the future implementation choice.

Matz.

Updated by knu (Akinori MUSHA) 9 months ago

OK, I think it's good for us now to diverge from the original philosophy of Set when I first wrote it and pursue the performance and integrity with other parts of Ruby. There are many parts in Set where I avoided optimization in order to retain extensibility (like subclassing and stuff), but I'll unlock the bar.

I'm also planning to remove SortedSet and leave it to an external gem because of the partial dependency on rbtree and the fallback implementation which performs quite poorly.

I'm not absolutely sure about introducing literals in the form of { a, b, c } because I myself is the one who is quite familiar with the shorthand notation introduced in ES6 and would like to have something similar in Ruby. 😆

Updated by marcandre (Marc-Andre Lafortune) 9 months ago

knu (Akinori MUSHA) wrote in #note-18:

OK, I think it's good for us now to diverge from the original philosophy of Set when I first wrote it and pursue the performance and integrity with other parts of Ruby. There are many parts in Set where I avoided optimization in order to retain extensibility (like subclassing and stuff), but I'll unlock the bar.

I'm also planning to remove SortedSet and leave it to an external gem because of the partial dependency on rbtree and the fallback implementation which performs quite poorly.

I'm not absolutely sure about introducing literals in the form of { a, b, c } because I myself is the one who is quite familiar with the shorthand notation introduced in ES6 and would like to have something similar in Ruby. 😆

This sounds great! 👍
Let me know if I can help.

Updated by marcandre (Marc-Andre Lafortune) 9 months ago

matz (Yukihiro Matsumoto) wrote in #note-17:

I agree with some of your proposals (#16990, #16991, #16993, #16995). I want knu (Akinori MUSHA) to work on this. If I missed something, he will tell us.

Thank you for reviewing these :-)

I strongly disagree with #16994. There's no evidence we need frozen sets of strings or symbols that much. Even if we do, I think frozen arrays should come first.

Right, I agree that there is a larger discussion about frozen & static literals to be had.

Would a non-frozen notation for Set of strings / symbols have a chance of being accepted? I would like to avoid Set.new(%w[a list of words]) as can be currently seen in RuboCop or in Rails:

https://github.com/rails/rails/blob/master/actionpack/lib/action_dispatch/http/cache.rb#L128
https://github.com/rails/rails/blob/master/actionpack/lib/action_dispatch/http/headers.rb#L25-L44

I think that many gems have simply not really taken care about Sets. For example, this file in Bundler defines a 40-element array constant, only to call include? on it...

https://github.com/rubygems/rubygems/blob/master/bundler/lib/bundler/settings.rb#L9

I feel that a builtin way to write this might help.

Actions #21

Updated by mame (Yusuke Endoh) 6 months ago

FYI: SortedSet has been extracted out from set.rb since 3.0. So there is no longer the dependency problem.

Updated by akr (Akira Tanaka) 5 months ago

I like extending Hash instead of incorporating Set.
For example, #inspect omit hash values.
(Of course, modifying Hash#inspect is clearly not acceptable, I think some compromise is possible)

Updated by knu (Akinori MUSHA) 5 months ago

akr (Akira Tanaka) wrote in #note-22:

I like extending Hash instead of incorporating Set.
For example, #inspect omit hash values.
(Of course, modifying Hash#inspect is clearly not acceptable, I think some compromise is possible)

How would you implement methods and operators of Set in Hash? Would they only work for hashes that meet special conditions?

Updated by marcandre (Marc-Andre Lafortune) 5 months ago

akr (Akira Tanaka) wrote in #note-22:

I like extending Hash instead of incorporating Set.

I don't understand the impacts

One question thing that having builtin set would change:

# shareable_constant_value: literal
X = Set[1, 2, 3]  # => error, or acceptable (and frozen)

I would like it to be acceptable and frozen.

Updated by akr (Akira Tanaka) 5 months ago

knu (Akinori MUSHA) wrote in #note-23:

akr (Akira Tanaka) wrote in #note-22:

I like extending Hash instead of incorporating Set.
For example, #inspect omit hash values.
(Of course, modifying Hash#inspect is clearly not acceptable, I think some compromise is possible)

How would you implement methods and operators of Set in Hash? Would they only work for hashes that meet special conditions?

  • define a special constant (Hash::DummyValue = Object.new)
  • Hash#inspect doesn't show the value: {1 => Hash::DummyValue}.inspect #=> "{1}"
  • {} can be used as an empty set
  • hash << v would be defined as hash[v] = Hash::DummyValue
  • Hash#each doesn't work well as Set but we can use Hash#each_key.
  • Hash#to_a doesn't work well as Set but we can use Hash#keys.

Updated by marcandre (Marc-Andre Lafortune) 5 months ago

I still don't see the benefit of not simply having Set in core.

Moreover, does this make ary & set possible? efficient?

Updated by duerst (Martin Dürst) 5 months ago

marcandre (Marc-Andre Lafortune) wrote in #note-26:

I still don't see the benefit of not simply having Set in core.

One direction that distinguishes Ruby from some other programming languages is big classes. A typical example is that there's no Stack class, the stack interface is just part of Array. I don't know the history, but I think the original idea was to make the set interface also be part of Array. Your ary & set example also alludes to that idea. Most of that interface (maybe all) is currently available, but it's not very efficient. Using Hash for sets eliminates the efficiency problem, as we know.

Moreover, does this make ary & set possible? efficient?

I haven't checked the details, but it shouldn't be too difficult to implement this, in a reasonably efficient way. If there are any efficiency problems, they would be on the Array side, not on the Hash side.

Another advantage of using Hash may be that we can get to a syntax that is very close to the actual Mathematical set syntax. We would like to write {1, 2, 3}, which may even be possible. At the least, it should be possible to move to the following: {1 => Hash::DummyValue, 2, 3}, i.e. if the first key has a special value, then the remaining values can be left out. We may want to choose a better name for Hash::DummyValue, maybe something like Hash::SetDummy.

Updated by akr (Akira Tanaka) 5 months ago

akr (Akira Tanaka) wrote in #note-25:

  • Hash#each doesn't work well as Set but we can use Hash#each_key.

To be fair, "Hash#each doesn't work well" means Enumerable methods doesn't work well.
enum_for(:each_key) can be used as {1, 2}.enum_for(:each_key).max, though.

(I forgot to mention that Ruby can interpret {1, 2} as { 1 => Hash::SetDummy, 2 => Hash::SetDummy }. I feel the name SetDummy is better than DummyValue. Thanks.)

It may be possible to solve this by Hash#each yields key instead of [key, value] if value is Hash::SetDummy.
I doubt it will be accepted, though.

Updated by knu (Akinori MUSHA) 5 months ago

I don't like the idea of making array & hash possible, as that would go against the trends of making Ruby type-friendly and error-prone.

akr (Akira Tanaka) wrote in #note-25:

  • define a special constant (Hash::DummyValue = Object.new)
  • Hash#inspect doesn't show the value: {1 => Hash::DummyValue}.inspect #=> "{1}"
  • {} can be used as an empty set
  • hash << v would be defined as hash[v] = Hash::DummyValue
  • Hash#each doesn't work well as Set but we can use Hash#each_key.
  • Hash#to_a doesn't work well as Set but we can use Hash#keys.

I'd appreciate the list as implementation details, but if it were like some hashes are used as sets, programs might get less readable.
For example, a program that goes hash = {}; hash << 1; hash[:key] = :value would likely be a mistake, but that couldn't be easily detected without explicit type annotation.

Updated by knu (Akinori MUSHA) 5 months ago

knu (Akinori MUSHA) wrote in #note-29:

For example, a program that goes hash = {}; hash << 1; hash[:key] = :value would likely be a mistake, but that couldn't be easily detected without explicit type annotation.

Well, even with type annotation as long as it is a direct Hash instance.

Updated by marcandre (Marc-Andre Lafortune) 5 months ago

I'm sorry, I am completely confused by this discussion, I can't make sense of it.

From what I read:

set = {:a, :b, :c} # or whatever notation is used
set.class # => Hash ???
set[:a] # => Hash::SetDummy ???
set[:elem] = :something_else # => what does that mean?
set.to_a # => [:a, :b, :c, [:elem, :something_else]] ???
set.transform_values { ... } ???
set + [1, 2, 3] # => ???
set + Set[:d, :e] # => ???

Am I understanding correctly?

duerst (Martin Dürst) wrote in #note-27:

Using Hash for sets eliminates the efficiency problem, as we know.

Are you talking about using Hash internally, or having Hash's class and API?

Another advantage of using Hash may be that we can get to a syntax that is very close to the actual Mathematical set syntax. We would like to write {1, 2, 3}, which may even be possible.

I don't understand how this might be related in any way to using a Hash? What meaning we decide to give to {a, b, c} is completely independent to what {a: :a, b: :b} means, or to what %w{a, b, c} means, or to {|a, b| c} for that matter.

At the least, it should be possible to move to the following: {1 => Hash::DummyValue, 2, 3}, i.e. if the first key has a special value, then the remaining values can be left out. We may want to choose a better name for Hash::DummyValue, maybe something like Hash::SetDummy.

Are you suggesting that instead of writing Set[1, 2, 3] we would literally write {1 => Hash::SetDummy, 2, 3}?

Updated by matz (Yukihiro Matsumoto) 5 months ago

I agree with adding Set by default. But no set literals nor SortedSet at the moment.

Matz.

Updated by Student (Nathan Zook) 4 months ago

marcandre (Marc-Andre Lafortune) wrote in #note-31:

I'm sorry, I am completely confused by this discussion, I can't make sense of it.

From what I read:

set = {:a, :b, :c} # or whatever notation is used
set.class # => Hash ???
set[:a] # => Hash::SetDummy ???
set[:elem] = :something_else # => what does that mean?
set.to_a # => [:a, :b, :c, [:elem, :something_else]] ???
set.transform_values { ... } ???
set + [1, 2, 3] # => ???
set + Set[:d, :e] # => ???

Am I understanding correctly?

duerst (Martin Dürst) wrote in #note-27:

Using Hash for sets eliminates the efficiency problem, as we know.

Are you talking about using Hash internally, or having Hash's class and API?

Another advantage of using Hash may be that we can get to a syntax that is very close to the actual Mathematical set syntax. We would like to write {1, 2, 3}, which may even be possible.

I don't understand how this might be related in any way to using a Hash? What meaning we decide to give to {a, b, c} is completely independent to what {a: :a, b: :b} means, or to what %w{a, b, c} means, or to {|a, b| c} for that matter.

At the least, it should be possible to move to the following: {1 => Hash::DummyValue, 2, 3}, i.e. if the first key has a special value, then the remaining values can be left out. We may want to choose a better name for Hash::DummyValue, maybe something like Hash::SetDummy.

Are you suggesting that instead of writing Set[1, 2, 3] we would literally write {1 => Hash::SetDummy, 2, 3}?

THIS.

Sets are in no way related to Hashes. The fact that they are/can be implemented as hash keys is convenient, but surfacing such an implementations detail violates programming best practice to a crazy amount.

Updated by Eregon (Benoit Daloze) 4 months ago

I don't think mixing Hash and Set is good at all. They have fundamentally different APIs.

Updated by greggzst (Grzegorz Jakubiak) about 2 months ago

Is there a chance this feature lands in 3.1?

Actions

Also available in: Atom PDF