Feature #5478

import Set into core, add syntax

Added by Konstantin Haase over 2 years ago. Updated over 1 year ago.

[ruby-core:40311]
Status:Open
Priority:Normal
Assignee:-
Category:-
Target version:Next Major

Description

=begin
A set is a central data structure. However, a lot of Ruby developers use arrays for situations where it would be more reasonable to use a set. One reason for that is that it is way easier to use Array then Set at the moment, another one is that developers are simply not aware it exists.

I propose moving Set from the stdlib to core and possibly add a syntax or a method on array for creating Set literals.

First class syntax suggestions:

<1, 2, 3>  # might be tricky to parse
#[1, 2, 3] # would collide with comments
$[1, 2, 3]
${1, 2, 3}

Method suggestions:

~[1, 2, 3]
+[1, 2, 3]

Whitespace separated String Sets could look like this:

%w<foo bar blah> # creates an array at the moment 
#w[foo bar blah] # would collide with comments
$w[foo bar blah] # would collide with sending :[] to $w
$w{foo bar blah}

~%w[foo bar blah] # not really shorter than using an array with strings
+%w[foo bar balh] # not really shorter than using an array with strings

Maybe it's ok to not have a whitespace separated syntax, I'm just brainstorming here.

The issue with the method approach is that it would create an Array to send the message to first.

I favor the syntax, possibly without the ability to create a whitespace separated version.

I'd be willing to work on a patch not only for MRI but also for JRuby and Rubinius if you would consider this to be useful.
Although I would need help with the parser.
=end

History

#1 Updated by George Koehler over 2 years ago

=begin
Ruby stdlib already has Set::[], a method that creates a literal set.

irb(main):001:0> require 'set'
=> true
irb(main):002:0> Set[1, 2, 3]
=> #

The proposed syntax might be impossible to parse, because it conflicts with < and > operators. Consider this example:

def c(*args); args.length; end

p c
4

Current Ruby (with no set literal syntax) parses this example like so:

p(c() < 1, 2, (3 >
4))

If became a set literal, then this example would become ambiguous. No parser would know whether this code has a set literal, or calls < and > operators.
=end

#2 Updated by Alexey Muranov over 2 years ago

What about simply { :a, :b, :c } ? It should be obvious to Ruby that this is not a hash.

In fact, how about specifying that internally Set is a form of Hash, and to have a lossless conversion Set->Hash->Set? I do not know the current implementation of Set, but the most natural in my opinion would be the following:
{ :a => nil, :b => nil, :c => nil } # is the set of { :a, :b, :c }.

How about making Ruby interpret { :a, :b, :c => 1 } as an alternative form of { :a => nil, :b => nil, :c => 1}, and view Set as a special Hash, where all values are nil? That is, make ruby understand that { :a, :b, :c => 1 } is a hash, but { :a, :b, :c } is a Set (which inherits from Hash and internally is { :a => nil, :b => nil, :c => nil }).


Update. I've got another idea: maybe the best would be to make Set and Hash inherit from some kind of a BasicHash class? Then Hash would extend BasicHash by, for example #default= and #defaultproc= methods, and Set would simply undefine the #[] and #[]= methods, rename #haskey? to #include?, etc...


Update 2011-11-09. A related proposal: how about having a series of BasicClasses like: BasicArray, BasicHash, BasicRange for functionality which is essential to the classes and is unlikely to be questioned or changed in future versions of Ruby. Then to have: class Array < BasicArray ..., class Hash < BasicHash ..., class Set < BasicHash ..., class DirectedRange < BasicRange ... (for ranges 1..3 and 3..1, see discussion in #5534), class OrderedHash < BasicOrderedHash ... (or class OrderedHash < BasicArray ..., see #5123), etc.

#3 Updated by Konstantin Haase over 2 years ago

I don't think it's impossible to parse. Ruby also manages to parse p /a/m as regexp rather than p / a / m.

Konstantin

On Nov 6, 2011, at 17:05 , George Koehler wrote:

Issue #5478 has been updated by George Koehler.

¾gin
Ruby stdlib already has Set::[], a method that creates a literal set.

irb(main):001:0> require 'set'

#4 Updated by Alexey Muranov over 2 years ago

I forgot to say that i really like this proposal. Please, let us have Set in core with {1, "two", 3} literal syntax, mathematicians will be crazy.

Alexey.

#5 Updated by Anonymous over 2 years ago

+1

Michel Demazure
michel@demazure.com

-----Message d'origine-----
De : Alexey Muranov [mailto:muranov@math.univ-toulouse.fr]
Envoyé : jeudi 1 décembre 2011 09:54
À : ruby-core@ruby-lang.org
Objet : [ruby-trunk - Feature #5478] import Set into core, add syntax

Issue #5478 has been updated by Alexey Muranov.

I forgot to say that i really like this proposal. Please, let us have Set in core with {1, "two", 3} literal syntax, mathematicians will be crazy.

Alexey.


Feature #5478: import Set into core, add syntax
http://redmine.ruby-lang.org/issues/5478

Author: Konstantin Haase
Status: Open
Priority: Normal
Assignee:
Category:
Target version: 3.0

=begin
A set is a central data structure. However, a lot of Ruby developers use arrays for situations where it would be more reasonable to use a set. One reason for that is that it is way easier to use Array then Set at the moment, another one is that developers are simply not aware it exists.

I propose moving Set from the stdlib to core and possibly add a syntax or a method on array for creating Set literals.

First class syntax suggestions:

 <1, 2, 3>  # might be tricky to parse
 #[1, 2, 3] # would collide with comments
 $[1, 2, 3]
 ${1, 2, 3}

Method suggestions:

 ~[1, 2, 3]
 +[1, 2, 3]

Whitespace separated String Sets could look like this:

 %w<foo bar blah> # creates an array at the moment 
 #w[foo bar blah] # would collide with comments
 $w[foo bar blah] # would collide with sending :[] to $w
 $w{foo bar blah}

 ~%w[foo bar blah] # not really shorter than using an array with strings
 +%w[foo bar balh] # not really shorter than using an array with strings

Maybe it's ok to not have a whitespace separated syntax, I'm just brainstorming here.

The issue with the method approach is that it would create an Array to send the message to first.

I favor the syntax, possibly without the ability to create a whitespace separated version.

I'd be willing to work on a patch not only for MRI but also for JRuby and Rubinius if you would consider this to be useful.
Although I would need help with the parser.
=end

--
http://redmine.ruby-lang.org

#6 Updated by Charles Nutter over 2 years ago

I must point out that Ruby 1.8.x allowed {:a, :b, :c, :d} as literal Hash syntax. 1.9 removed that syntax. If it is added back again, it could be confusing...and code which used that syntax and failed fast on 1.9 would now create a Set and fail at some arbitrary time in the future.

All that said, I do like this syntax as a way of representing a literal Set.

#7 Updated by Jos Backus over 2 years ago

How would one represent an empty Set?

'{}' can't be used as it would be ambiguous.

#8 Updated by Joshua Ballanco over 2 years ago

Perhaps in the vein of OCaml, we could use {[]} for set literals?

On Sat, Dec 3, 2011 at 4:53 PM, Jos Backus jos@catnook.com wrote:

Issue #5478 has been updated by Jos Backus.

How would one represent an empty Set?
'{}' can't be used as it would be ambiguous.


Feature #5478: import Set into core, add syntax
http://redmine.ruby-lang.org/issues/5478

Author: Konstantin Haase
Status: Open
Priority: Normal
Assignee:
Category:
Target version: 3.0

=begin
A set is a central data structure. However, a lot of Ruby developers use
arrays for situations where it would be more reasonable to use a set. One
reason for that is that it is way easier to use Array then Set at the
moment, another one is that developers are simply not aware it exists.

I propose moving Set from the stdlib to core and possibly add a syntax or
a method on array for creating Set literals.

First class syntax suggestions:

# might be tricky to parse
#[1, 2, 3] # would collide with comments
$[1, 2, 3]
${1, 2, 3}

Method suggestions:

~[1, 2, 3]
+[1, 2, 3]

Whitespace separated String Sets could look like this:

%w # creates an array at the moment
#w[foo bar blah] # would collide with comments
$w[foo bar blah] # would collide with sending :[] to $w
$w{foo bar blah}

~%w[foo bar blah] # not really shorter than using an array with strings
+%w[foo bar balh] # not really shorter than using an array with strings

Maybe it's ok to not have a whitespace separated syntax, I'm just
brainstorming here.

The issue with the method approach is that it would create an Array to
send the message to first.

I favor the syntax, possibly without the ability to create a
whitespace separated version.

I'd be willing to work on a patch not only for MRI but also for JRuby and
Rubinius if you would consider this to be useful.
Although I would need help with the parser.
=end

http://redmine.ruby-lang.org

#9 Updated by Alexey Muranov over 2 years ago

Jos Backus wrote:

How would one represent an empty Set?

'{}' can't be used as it would be ambiguous.

Good point. Need to think about this. I would be in favor of using {} for the empty set and Hash::new, or Hash[], or {}.to_hash for the empty hash. To keep partial compatibility, Set can convert itself to Hash if a #[] or #[]= method is called on it. Actually, can it? Can an object replace itself with an object of another class in Ruby? Also, {} can be an object of some common ancestor class of Hash and Set, if this can help.

I think it should be possible to use a single class as both Set and Hash, but having set methods and hash methods both defined on the same object somehow seems strange to me.

#10 Updated by Alexey Muranov over 2 years ago

I've let my imagination go free and got another crazy(/stupid) idea: abandon Set, add methods to Hash, and use it as Set.

The main drawback seemed at first to be the name, but the technical term for Ruby's hash is "associative array", so there shall be no abuse of terminology if Hash class is used to represent both sets and associative arrays.

In fact, the #include? instance method already exists in Hash and for "set-like" hashes it works as expected:

{ 1 => nil, 2 => nil, 3 => nil }.include?(2) # => true

To make Hash implement both sets and associative arrays, it can be made to implement first of all sets, while implementing associative arrays can be a byproduct: store associative arrays as sets in which every element is, for example, an instance of Arrow, where

class Arrow
  attr_reader :key
  attr_accessor :value

  def initialize(k, v=nil)
     @key = k
     self.value = v
  end

  def hash
    key.hash
  end
end

This will make the literal syntax consistent:

{} == Hash.new                                                                                 # => true
{1, 2} == { 1 => nil, 2 => nil }                                                         # => *false*
{ 1 => nil, 2 => nil }  == { Arrow.new(1), Arrow.new(2) }                 # => true
{ 1 => "a", 2 => "b" } == { Arrow.new(1, "a"), Arrow.new(2, "b") }   # => true
{ 1, 2 => "b" } == { 1, Arrow.new(2, "b") }                                       # => true
Arrow.new(2, "b") == (2 => "b")                                                          # => true

(Note that this is different from Ruby 1.8 interpretation of {1, 2}.)

This does not make impossible to use the current syntax foo x, y, :option1 => true, :option2 => 0, but there may be an issue: shall foo x, y, (:option1 => true), :option2 => 0 be equivalent to foo x, y, Arrow.new(:option1, true), { :option2 => 0 } or to foo x, y, { :option1 => true, :option2 => 0 }?

#11 Updated by Thomas Sawyer over 2 years ago

Do you think perhaps you've allowed a certain notion to lead you to left field? You're talking about shoehorning Set into Hash just so it can have a literal notation akin to the one Mathematicians use. Sorry, but {} is already taken in Ruby. Mathematicians can get over it or use another language. And honestly, what's wrong with:

Set[1,2,3]

#12 Updated by Alexey Muranov over 2 years ago

Thomas Sawyer wrote:

Do you think perhaps you've allowed a certain notion to lead you to left field? You're talking about shoehorning Set into Hash just so it can have a literal notation akin to the one Mathematicians use. Sorry, but {} is already taken in Ruby. Mathematicians can get over it or use another language. And honestly, what's wrong with:

Set[1,2,3]

But what is wrong with having one class instead of two, if internally they behave mostly the same, and one literal notation instead of two? Anyway, this was just an idea.

#13 Updated by Joshua Ballanco over 2 years ago

On Sun, Dec 4, 2011 at 1:05 PM, Eero Saynatkari ruby-ml@kittensoft.orgwrote:

On 2011-12-04, at 16:15:00, Alexey Muranov wrote:

But what is wrong with having one class instead of two, if internally
they behave mostly the same, and one literal notation instead of two?
Anyway, this was just an idea.

Hash tables aren't Sets, nor the other way around. I'd like to avoid
messing up data structure definitions any more than they already are.

The only compelling reason for bringing Set into core would be a literal
syntax – and I agree, as a data structure it would be a useful addition and
encourage its usage – but I don't really have anything to contribute to the
particular bikeshed of syntax thereof.

Actually, the bulk of Set's functionality is already built on top of Hash.
Personally, since the ability to create Hashes from comma-delimited
key,value lists has been removed in 1.9, I think reintroducing it to create
Set literals is not the worst idea in the world.

Additionally, implementing Set functionality directly on top of Hash would
remove the unfortunate method call overhead that currently exists in the
Set library (note that the implementation of Set#include? is just a call
through to Hash#include?):

Benchmark.bmbm do |x|
?> h = {}
s = Set.new
10000.times { |i| h[i] = nil; s << i }
x.report("Hash#include?") { 1000000.times { h.include? rand(20000) } }
x.report("Set#include?") { 1000000.times { s.include? rand(20000) } }
end
Rehearsal -------------------------------------------------
Hash#include? 0.370000 0.000000 0.370000 ( 0.370516)
Set#include? 0.440000 0.000000 0.440000 ( 0.436806)
---------------------------------------- total: 0.810000sec

                 user     system      total        real

Hash#include? 0.370000 0.000000 0.370000 ( 0.361795)
Set#include? 0.430000 0.000000 0.430000 ( 0.426560)

#14 Updated by Adam Prescott over 2 years ago

On Sun, Dec 4, 2011 at 22:34, Joshua Ballanco jballanc@gmail.com wrote:

Actually, the bulk of Set's functionality is already built on top of Hash.
Personally, since the ability to create Hashes from comma-delimited
key,value lists has been removed in 1.9, I think reintroducing it to create
Set literals is not the worst idea in the world.

Additionally, implementing Set functionality directly on top of Hash would
remove the unfortunate method call overhead that currently exists in the
Set library (note that the implementation of Set#include? is just a call
through to Hash#include?):

This really does seem to just be an argument via implementation details. I
don't think that's the best approach. The point of implementation details
is that they're just that.

#15 Updated by Alexey Muranov over 2 years ago

From the point of view of behavior/interface, both sets and associative arrays are just unordered collections, associative arrays being unordered collections of (key,value) pairs, where each key is unique.

(This is why i thought that it might be possible to combine them in one class Hash as they both need hash -table- function.)

#16 Updated by Alexey Muranov over 2 years ago

If you had enough of my comments on this thread, please tell me stop, i will listen.

Here is my another argument in favor of pure Set as a base class or replacement for Hash: on top of Set (or Hash modified in a way to be able to hold pure sets), one can build not only associative arrays, but also a class (call it Relation) that can naturally hold tables from relational databases. A relation, as well as an associative array, is just a form of a set.

#17 Updated by Magnus Holm over 2 years ago

On Wed, Dec 7, 2011 at 10:01, Alexey Muranov
muranov@math.univ-toulouse.fr wrote:

Issue #5478 has been updated by Alexey Muranov.

If you had enough of my comments on this thread, please tell me stop, i will listen.

Here is my another argument in favor of pure Set as a base class or replacement for Hash: on top of Set (or Hash modified in a way to be able to hold pure sets), one can build not only associative arrays, but also a class (call it Relation) that can naturally hold tables from relational databases.  A relation, as well as an associative array, is just a form of a set.

From a theoretical point of view, a relation has a heading + a set
that holds tuples (arrays):

class Relation
attr_reader :header, :values

 def inittialize(header, values)
   @header = header
   @values = values
 end

end

Relation.new(%w[id name], Set[[1, 'Magnus'], [2, 'Alexy']])

Of course, there are many other ways to represent it (depending on
your requirements). Sometimes you want to use a set of hashes. I don't
quite see how the merging of Set and Hash makes it easier to represent
relations though. It's still a two-dimensional data-structure, so you
either need a Set that contains a Hash or a tuple (Set, Array/Hash).

#18 Updated by Alexey Muranov over 2 years ago

Magnus Holm wrote:

From a theoretical point of view, a relation has a heading + a set
that holds tuples (arrays):

class Relation
attr_reader :header, :values

 def inittialize(header, values)
   @header = header
   @values = values
 end

end

Relation.new(%w[id name], Set[[1, 'Magnus'], [2, 'Alexy']])

Of course, there are many other ways to represent it (depending on
your requirements). Sometimes you want to use a set of hashes. I don't
quite see how the merging of Set and Hash makes it easier to represent
relations though. It's still a two-dimensional data-structure, so you
either need a Set that contains a Hash or a tuple (Set, Array/Hash).

I agree, i was simply pointing out that set is, in my opinion, a more basic data structure than associative array, and both associative arrays and relations can be either built on top of it, or simply represented by sets. In particular, it seems wrong to me that Set uses Hash in implementation, and not the other way around.

As Hash class already exists and is the native class for associative arrays, i proposed a way to transform it into set so that to still be able to use it as an associative array and so that most of the current syntax works. Also this would give a common WYSIWYG :) syntax for both sets and associative arrays: { 1, { 2 => 3 }, 4, {5} }.

I was thinking like this: how to simplify Hash to use it as a base class for both Set and Hash? And i ended up with a data structure that was just a set. Then i thought: how now to build Hash back on top of it? And i decided that there were almost nothing to build, sets can be used as associative arrays, just store (key,value) pairs in them and add methods for fetching and inserting by key. Of course, there need to be a way to set default_value for associative arrays, etc.

I agree with your examples of how to represent Relation, they both use Set as top-level structure. As you pointed out, you can just use a set of associative arrays, in my proposed literal syntax:
{ { "id" => "1", "name" => "Magnus" },
{ "id" => "2", "name" => "Alexey", "country" => "France" } }
This might not be the most efficient way, and you would use instead:
Relation.new(%w[id name country], { [1, "Magnus", nil], [2, "Alexey", "France"] })

Update: As identical symbols do not take up extra space, i think the first implementation of Relation is better:
{ { :id => "1", :name => "Magnus" },
{ :id => "2", :name => "Alexey", :country => "France" } }
It can have a separate attribute for meta-data, like column types.

#19 Updated by Eero Saynatkari over 2 years ago

On 2011-12-09, at 12:05:41, Alexey Muranov wrote:

I agree, i was simply pointing out that set is, in my opinion, a more basic data structure than associative array, and both associative arrays and relations can be either built on top of it, or simply represented by sets. In particular, it seems wrong to me that Set uses Hash in implementation, and not the other way around.

A hash table isn't an associative array (also known as a map). The latter can be implemented using the former, as can a set. Perhaps it'd have been better to abstract Hash more from its implementation, but it wasn't.

I don't see any practical benefit in unifying the two, given the assumption that Hash is to remain the default – and fast – data structure to use. Have you run any numbers to see how much of a hit you're going to take from deconstructing Hash?

--
Magic is indistinguishable from insufficiently advanced technology.

#20 Updated by Alexey Muranov over 1 year ago

=begin
I think that unifying (({Hash})) with (({Set})) could also allow to use the ((splat)) star operator more intelligently. For example, currently the following does not work:

h = { 1 => 2, 3 => 4}
k = { 0 => 0, *h } # => SyntaxError: unexpected tSTAR, expecting '}'

I would like to see here (({k})) = (({{ 0 => 0, 1 => 2, 3 => 4 }})).

Such behavior could be nice for merging several hashes:

def f(h); ...; end

h1 = { :a => 1 }; h2 = { :b => 2 }

f({ *h1, *h2, :name => 'foo' })

Instead of:

h = h1.merge(h2)
h[:name] = 'foo'
f(h)

Update: I also think unifying (({Hash})) with (({Set})), and maybe with other thing, agrees with Ruby philosophy: if everything is an (({Object})), why not to make every finite collection to be a (({Set}))? I know there could be some issues with making (({Array})) a (({Set})) (probably it would first need to become a (({Hash}))), but i do not think there are big issues with making (({Hash})) a (({Set})).

There is also the Square-Rectangle inheritance problem with this approach, i do not really know how to resolve it. However, in Ruby already (({Class})) is a (({Module})), but a class cannot be ((included)) into another class. So Ruby does not look very strict here.
=end

#21 Updated by Nobuyoshi Nakada over 1 year ago

=begin
Seems off-topic.

h1 = {a: 1}; h2 = {b: 2}
def foo(name: 'unknown', *rest) name; end
foo *
h1, **h2, name: "foo" #=> "foo"
=end

#22 Updated by Alexey Muranov over 1 year ago

nobu (Nobuyoshi Nakada) wrote:

=begin
Seems off-topic.

h1 = {a: 1}; h2 = {b: 2}
def foo(name: 'unknown', *rest) name; end
foo *
h1, **h2, name: "foo" #=> "foo"
=end

Ok nobu, thanks for the idea. Then back to the topic. If combining Hash and Set is somehow unfeasible, the question stays about the literal notation for Set, and if using {1, 2, 3}, how to distinguish the empty set from the empty hash.

I can propose {-} for the empty set.

I think it would be even better to use {} for the empty set and {=>} or {:} for the empty hash, but this would be incompatible with the current use.

#23 Updated by Xavier Noria over 1 year ago

I don't know which is the general position of core with respect to non-ASCII in source code, but the natural literal to represent an empty set is ∅.

#24 Updated by Thomas Sawyer over 1 year ago

=begin
@fxn

{/}

Close enough?
=end

#25 Updated by Thomas Sawyer over 1 year ago

I don't see why Hash and Set need to be merged, however. It certainly could be useful for Ruby supported Set out-of-the-box. As it is I think it is not uncommon to just use an Array and pretend as if it is a Set. But Set can be it's own thing.

#26 Updated by Alexey Muranov over 1 year ago

trans (Thomas Sawyer) wrote:

{/}

Close enough?

Thomas, unfortunately here it looks like two set notations overlap (empty set is either {} or ∅). Then maybe (/) ? :D

#27 Updated by Boris Stitnicky over 1 year ago

I am for having Set in the core, with Set[ a, b, c ] and Set.new as basic constructors. I am against mingling it with existing collection classes. I think that the most frequent usec for Set would be order-independent comparison (instad of eg. "array1.sort == array2.sort"). I find { a, b, c } syntax tempting, otherwise I agree on this issue with headius (response #6).

#28 Updated by Nathan Broadbent over 1 year ago

I really like ~[1, 2, 3] as a shortcut for Set.new([1, 2, 3]):

class Array
def ~@
Set.new self
end
end

My other preferences would be:

1) <1, 2, 3>

2) {1, 2, 3}, and {/} for Set.new.

Also available in: Atom PDF