Feature #21005
openUpdate the source location method to include line start/stop and column start/stop details
Description
Why¶
π Hello. After discussing with Kevin Newton and Benoit Daloze in Feature 20999, I'd like to propose adding line start/stop and column start/stop information to the #source_location
method for the following objects:
At the moment, when using #source_location
, you only get the following information:
def demo = "A demonstration."
# From disk.
method(:demo).source_location # ["/Users/bkuhlmann/Engineering/Misc/demo", 15]
# From memory.
method(:demo).source_location # ["(irb)", 3]
Notice, when asking for the source location, we only get the path/location as the first element and the line number as the second element but I'd like to obtain a much richer set of data which includes line start/stop and column start/stop so I can avoid leaning on the RubyVM
for this information. Example:
def demo = "A demonstration."
# From disk.
instructions = RubyVM::InstructionSequence.of method(:demo)
puts [instructions.absolute_path, *instructions.to_a.dig(4, :code_location)]
[
"/Users/bkuhlmann/Engineering/Misc/demo", # Source path.
15, # Line start.
0, # Column start.
15, # Line stop.
29 # Column stop.
]
# From memory.
instructions = RubyVM::InstructionSequence.of method(:demo)
puts instructions.script_lines
[
"def demo = \"A demonstration.\"\n",
""
]
By having access to the path (or lack thereof in case of IRB), line start/stop, and column start/stop, this means we could avoid using the RubyVM to obtain raw source code for any of these objects. This would not only enhance debugging situations but also improve Domain Specific Languages that wish to leverage this information for introducing new features and/or new debugging capabilities to the language.
How¶
Building upon the examples provided above, I'd like to see Binding
, Proc
, Method
, and UnboundMethod
respond to #source_location
as follows:
[
"/Users/bkuhlmann/Engineering/Misc/demo", # Source path.
15, # Line start.
15, # Line stop.
0, # Column start.
29 # Column stop.
]
Notice, for data grouping purposes, I changed the array structure to always start with the path as the first element, followed by line information, and ending with column information. Alternatively, it could might be nice to improve upon the above by answering a hash each time, instead, for a more self-describing data structure. Example:
{
path: "/Users/bkuhlmann/Engineering/Misc/demo",
line_start: 15,
line_stop: 15,
column_start: 0,
column_stop: 29
}
For in-memory, situations like IRB, it would be nice to answer the equivalent of RubyVM::InstructionSequence#script_lines
which would always be an Array
with no line or column information since only the source code is necessary. Example:
[
"def demo = \"A demonstration.\"\n",
""
]
From a pattern matching perspective, this could provide the best of both worlds especially if information is answered as either a Hash
or and Array
. Example:
def demo = "A demonstration."
case method(:demo).source_location
in Hash then puts "Source information obtained from disk."
in Array then puts "Source obtained from memory."
else fail TypeError, "Unrecognized source location type."
end
This above is only a simple example but there's a lot we could do with this information if the above pattern match was enhanced to deal with the extraction and formatting of the actual source code!
Notes¶
This feature request is related to the following discussions in case more context is of help:
Updated by Eregon (Benoit Daloze) 3 days ago
- Related to Feature #20999: Add RubyVM object source support added
Updated by Eregon (Benoit Daloze) 3 days ago
- Related to Feature #6012: Proc#source_location also return the column added
Updated by Eregon (Benoit Daloze) 3 days ago Β· Edited
I think adding end line and start/end columns is a straightforward and compatible extension of source_location
.
I will add that to the dev meeting agenda: https://bugs.ruby-lang.org/issues/20949
Changing to return a Hash (or an Array of 2 Strings) instead would be too incompatible, that would need to be a new method then.
If it's a new method, I think we should return a "code location" object (could be Ruby::CodeLocation
or Ruby::Location
or Ruby::SourceLocation
or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):
- start_line
- start_column
- start_offset
- end_line
- end_column
- end_offset
- code or maybe source_code: gets the source of the Proc/Method/UnboundMethod if available,
nil
otherwise
That last one seems particularly nice because it abstracts how it gets that source code, and allows for multiple implementations.
For instance if the Ruby implementation keeps the source code in memory (or a compressed version of it) it could just use that, if it doesn't it would re-read from disk, etc.
Updated by bkuhlmann (Brooke Kuhlmann) 2 days ago Β· Edited
Yeah, eager to see what the consensus on this becomes from the next dev meeting.
I like the idea of a Ruby::SourceLocation
or even a Ruby::Source
object? The latter would be nice because you could message the instance as source.code
for situations in which the source could be obtained from memory, re-read from disk, or answer nil
as you point out. Super helpful for situations where you're using IRB or Binding
.
Updated by mame (Yusuke Endoh) 2 days ago
I read the use case for #20999.
In short, by the following DSL,
class Demo
include Initable[[:key, :default, proc { Object.new }]]
end
you want to generate the following code string, right?
def initialize(default = Object.new)
@default = default
end
I believe that the last position of Proc is not sufficient for your case, because of a here document. Consider the following proc.
proc { <<END }
heredoc
END
Where do you expect the end position of the Proc to be? The end brace? Or the end of βENDβ delimiter? Either way, your DSL will generate a broken Ruby code string.
What you really need is not the end position information, but an AST node of Method, Proc, etc.
f = proc { <<END }
heredoc
END
# What you really need
node = Prism.node_for(f)
# You can get all the locations including the here document
pp node.block.body.body
# =>
# [@ StringNode (location: (1,7)-(1,12))
# βββ flags: newline
# βββ opening_loc: (1,7)-(1,12) = "<<END"
# βββ content_loc: (2,0)-(3,0) = " heredoc\n"
# βββ closing_loc: (3,0)-(3,3) = "END"
# βββ unescaped: " heredoc\n"]
# You can even get the end position of the proc, if it is really sufficient for you
p [node.location.end_line, node.location.end_column] #=> [1, 14]
Updated by byroot (Jean Boussier) 1 day ago
I think adding end line and start/end columns is a straightforward and compatible extension of source_location.
I fear it may break code that splats the array, e.g.:
some_method(*proc.source_location)
So perhaps it should use an argument, e.g. proc.source_location(true)
?
I like the idea of a Ruby::SourceLocation
Me too, because otherwise it would be an array with many positional numbers, and that would be quite cryptic. Returning an actual object with named methods make it easier to understand what's being used and easier to extend later on.
Updated by bkuhlmann (Brooke Kuhlmann) 1 day ago Β· Edited
Yusuke: I went hunting for the Prism.node_for
method/documentation but couldn't find it. If I understand you correctly, I think you are proposing adding the .node_for
implementation to Prism? If so, I like the detailed information in your example especially when dealing with heredocs and being able to access content location. That would yield even greater flexibility for anyone implementing atop this data. To be more exact, I'm guessing the following could be amended to what Benoit proposed above then?:
- content_line_start
- content_line_end
- content_column_start
- content_column_end
By the way, one slight correction to what you were asking above. This is what my DSL generates:
# DSL
class Demo
include Initable[[:key, :default, proc { Object.new }]]
end
# End result.
class Demo
def initialize(default: Object.new)
@default = default
end
private
attr_reader :default
end
The emphasis is on key
because key
is an optional keyword parameter as defined in the Method#parameters implementation. Essentially, my DSL reconstitutes the raw parameters answered back into a method signature where the third element in the 3-tuple is additional sugar (specific to my DSL) for providing a default value since Method#parameters
only answers an array of tuples.
Jean
I fear it may break code that splats the array
Ugh, yes. What about this instead?
proc.source_location(as: :verbose)
This would give us room to expand the as
keyword with different values to support different object shapes in the future (assuming this design expands and grows in future Ruby versions) without backing us into a corner.
Returning an actual object with named methods make it easier to understand what's being used and easier to extend later on.
Yeah, this would be most welcome. I know I gave an example of using a Hash
above but icing on the cake would be to answer a Data
object instead so you could simply ask for whatever attribute you need.
Updated by byroot (Jean Boussier) 1 day ago
Ugh, yes. What about [keyword argument] instead?
I think it may be a bit much.
But if we decide with returning a new class, instead of just retunring a longer array, I think the better API would be another method entirely.
Perhaps #source
, #source_info
, or something along these lines.
Updated by Earlopain (Earlopain _) 1 day ago
But if we decide with returning a new class, instead of just retunring a longer array, I think the better API would be another method entirely.
Yes, I agree with that. caller
and caller_location
are also different methods. I don't think source
would be a good name, it could just as easily return the source string. source_info
definitly sounds better to me.
Updated by tenderlovemaking (Aaron Patterson) 1 day ago
It seems like a very similar solution to what @mame (Yusuke Endoh) is proposing was merged at one point as Feature #14836. It seems RubyVM::AST
doesn't exist anymore, but it should be possible to do something similar.
For example, this code should work:
f = proc { <<END }
heredoc
END
iseq = RubyVM::InstructionSequence.of(f)
require "prism"
node_id = iseq.to_a[4][:node_id]
ast = Prism.parse(File.binread(__FILE__))
p ast.value.breadth_first_search { |node| node.node_id == node_id }
It does not currently work due to a bug in Prism, but I'm trying to address that in #21014
Updated by tenderlovemaking (Aaron Patterson) 1 day ago
Also just to be clear, I don't think the code I pasted above is nice code, but it does work after the Prism bug is addressed.
In particular this line is pretty heavy:
node_id = iseq.to_a[4][:node_id]
I think we should add a reader method on RubyVM::InstructionSequence to get the node id. It already has some methods for accessing location information, so I think adding the node id could be acceptable. However, I think such an addition is outside the scope of this ticket, so I will open a new ticket.
Updated by mame (Yusuke Endoh) 1 day ago
I think you are proposing adding the .node_for implementation to Prism?
Yes.
To be more exact, I'm guessing the following could be amended to what Benoit proposed above then?
Prism.node_for
is my proposal, but the others are not my proposal but what Prism already provides.
require "prism"
node = Prism.parse("proc { <<END }
heredoc
END")
pp node
$ ruby test.rb
#<Prism::ParseResult:0x00007f17ceafc278
@comments=[],
@data_loc=nil,
@errors=[],
@magic_comments=[],
@source=#<Prism::ASCIISource:0x00007f17cdd4a2d8 @offsets=[0, 15, 25], @source="proc { <<END }\n heredoc\nEND", @start_line=1>,
@value=
@ ProgramNode (location: (1,0)-(1,14))
βββ flags: β
βββ locals: []
βββ statements:
@ StatementsNode (location: (1,0)-(1,14))
βββ flags: β
βββ body: (length: 1)
βββ @ CallNode (location: (1,0)-(1,14))
βββ flags: newline, ignore_visibility
βββ receiver: β
βββ call_operator_loc: β
βββ name: :proc
βββ message_loc: (1,0)-(1,4) = "proc"
βββ opening_loc: β
βββ arguments: β
βββ closing_loc: β
βββ block:
@ BlockNode (location: (1,5)-(1,14))
βββ flags: β
βββ locals: []
βββ parameters: β
βββ body:
β @ StatementsNode (location: (1,7)-(1,12))
β βββ flags: β
β βββ body: (length: 1)
β βββ @ StringNode (location: (1,7)-(1,12))
β βββ flags: newline
β βββ opening_loc: (1,7)-(1,12) = "<<END"
β βββ content_loc: (2,0)-(3,0) = " heredoc\n"
β βββ closing_loc: (3,0)-(3,3) = "END"
β βββ unescaped: " heredoc\n"
βββ opening_loc: (1,5)-(1,6) = "{"
βββ closing_loc: (1,13)-(1,14) = "}",
@warnings=[]>
tenderlovemaking (Aaron Patterson) wrote in #note-10:
It seems like a very similar solution to what @mame (Yusuke Endoh) is proposing was merged at one point as Feature #14836. It seems
RubyVM::AST
doesn't exist anymore, but it should be possible to do something similar.
Yeah, we need to reimplement Prism version of RubyVM::AbstractSyntaxTree.of
. This is what I tried to mean by Prism.node_for
.
tenderlovemaking (Aaron Patterson) wrote in #note-11:
I think we should add a reader method on RubyVM::InstructionSequence to get the node id.
I don't think it is good for a user to handle node_id
explicitly.
Rather, I think Prism should provide a simple method to directly retrieve a node subtree of a Method/Proc object given, like RubyVM::AST.of
.
Updated by tenderlovemaking (Aaron Patterson) about 23 hours ago
mame (Yusuke Endoh) wrote in #note-12:
tenderlovemaking (Aaron Patterson) wrote in #note-11:
I think we should add a reader method on RubyVM::InstructionSequence to get the node id.
I don't think it is good for a user to handle
node_id
explicitly.
Rather, I think Prism should provide a simple method to directly retrieve a node subtree of a Method/Proc object given, likeRubyVM::AST.of
.
I agree. The problem is that RubyVM::AST
is built in to Ruby, so it has access to rb_iseq_t
members. Prism cannot access fields on rb_iseq_t
.
My idea is to apply a patch like this to iseq.c: https://github.com/tenderlove/ruby/commit/9a54230012d8837a981e0ddec88384ab6ef4db89
Then add code to Prism like this:
require "prism"
# Put the following code in Prism
module Prism
def self.ast_for iseq
ast = Prism.parse(File.read(iseq.absolute_path))
node_id = iseq.node_id
ast.value.breadth_first_search { |node| node.node_id == node_id }
end
end
# User code is below
f = proc { <<END }
heredoc
END
iseq = RubyVM::InstructionSequence.of(f)
tree = Prism.ast_for(iseq)
p tree
I implemented the Prism method as a monkey patch just as demonstration. But the idea is the same as RubyVM::AbstractSyntaxTree.of
, where Prism.ast_for
takes an iseq object and returns the AST for that iseq.
Updated by Eregon (Benoit Daloze) about 17 hours ago Β· Edited
It's very important that this new feature does not expect users to use RubyVM::InstructionSequence
or anything under RubyVM
since RubyVM
is CRuby-only.
The feature itself is possible on any Ruby implementation.
So something like Prism.node_for(Proc|Method|UnboundMethod)
is good, and Prism.ast_for(RubyVM::InstructionSequence)
is not.
Internally Prism can of course use RubyVM::InstructionSequence.of(Proc|Method|UnboundMethod).node_id
on CRuby, and something else on other Ruby implementations.
Note that if it's enough to locate a node by its start/end line/column, we might not need node_id
at all, and then just providing start/end line/column to source_location
would be enough to find the right node with Prism.
Are there cases where this would be a problem, i.e. where 2 Prism AST nodes would have the same start/end line/column?
Actually since we are only talking about Proc|Method|UnboundMethod
here it would need to be two nodes which define a proc/lambda/method with the same start/end line/column.
I think that's not possible.
If that holds, then the original proposal to provide start/end line/column is enough, and we can add a convenience method in Prism using those.
That would work on all Ruby implementations, without needing a low-level implementation-specific concept of node_id
:
module Prism
def self.node_for callable
start_line, end_line, start_column, end_column = callable.source_location(true)
ast.value.breadth_first_search { |node|
loc = node.location
loc.start_line == start_line and loc.end_line == end_line and
loc.start_column == start_column and loc.end_column == end_column
}
end
end
Maybe CRuby does not currently preserve the information of end line and start/end column for procs and methods?
For def
it would be trivial to preserve it but I guess for blocks and define_method
it might be trickier.
For such cases source_location
could internally use the node_id
stuff if that's easier or deemed a better trade-off on CRuby.
In summary:
- I think we can build
Prism.node_for(Proc|Method|UnboundMethod)
on(Proc|Method|UnboundMethod)#source_location
with start/end line/column. - Those would all be public APIs working on all Ruby implementations.
- Users don't need to know about low-level implementation-specific (i.e. CRuby-only) concepts like
node_id
.
Updated by Dan0042 (Daniel DeLorme) about 13 hours ago
Eregon (Benoit Daloze) wrote in #note-3:
If it's a new method, I think we should return a "code location" object (could be
Ruby::CodeLocation
orRuby::Location
orRuby::SourceLocation
or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):
- start_line
- end_line
I really like the idea of a #source
method that returns a Ruby::SourceLocation
object. However, when there's a start and end, I believe Ruby should ideally align with its own core conventions and return a Range
. For example, method.source.lines => start...end
. While I understand concerns about the allocation of Range
objects and performance, I feel that: 1) this might be an example of premature micro-optimization, and 2) from an API design perspective, a Range
object feels like the natural default. Separate start
/end
accessors could remain a low-level, performance-focused API if truly necessary.
As an alternative to start
/end
accessors, it would be even better if the Range
-returning method could be optimized via opcode. Referring back to the idea of Range#bounds
in #20080, we could have something like start_line, end_line = method.source.lines.bounds
, which could be optimized via opcode to avoid allocating a Range
object entirely.
It would be great to see Ruby continue to embrace its own language idioms and explore such optimizations for a more elegant API.
Updated by tenderlovemaking (Aaron Patterson) about 10 hours ago
Eregon (Benoit Daloze) wrote in #note-14:
Maybe CRuby does not currently preserve the information of end line and start/end column for procs and methods?
I think we do, but I can investigate. IIRC it's on the InstructionSequence object (so we'd still have to use RubyVM::InstructionSequence on CRuby).
For
def
it would be trivial to preserve it but I guess for blocks anddefine_method
it might be trickier.
For such casessource_location
could internally use thenode_id
stuff if that's easier or deemed a better trade-off on CRuby.
I think start / column info is always there, but not 100% sure.
In summary:
- I think we can build
Prism.node_for(Proc|Method|UnboundMethod)
on(Proc|Method|UnboundMethod)#source_location
with start/end line/column.- Those would all be public APIs working on all Ruby implementations.
- Users don't need to know about low-level implementation-specific (i.e. CRuby-only) concepts like
node_id
.
π
Dan0042 (Daniel DeLorme) wrote in #note-15:
Eregon (Benoit Daloze) wrote in #note-3:
If it's a new method, I think we should return a "code location" object (could be
Ruby::CodeLocation
orRuby::Location
orRuby::SourceLocation
or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):
- start_line
- end_line
I really like the idea of a
#source
method that returns aRuby::SourceLocation
object. However, when there's a start and end, I believe Ruby should ideally align with its own core conventions and return aRange
. For example,method.source.lines => start...end
. While I understand concerns about the allocation ofRange
objects and performance, I feel that: 1) this might be an example of premature micro-optimization, and 2) from an API design perspective, aRange
object feels like the natural default. Separatestart
/end
accessors could remain a low-level, performance-focused API if truly necessary.As an alternative to
start
/end
accessors, it would be even better if theRange
-returning method could be optimized via opcode. Referring back to the idea ofRange#bounds
in #20080, we could have something likestart_line, end_line = method.source.lines.bounds
, which could be optimized via opcode to avoid allocating aRange
object entirely.It would be great to see Ruby continue to embrace its own language idioms and explore such optimizations for a more elegant API.
As @mame (Yusuke Endoh) was pointing out, I don't think a single range for lines makes sense.
Consider the following code:
def foo; <<FOO; end; def bar; <<BAR; end
foo method
FOO
bar method
BAR
What should the lines
method report for the source code for bar
? It cannot be a single Range
because lines 2 and 3 and part of the foo
method. bar
is only defined line lines 1, 4, and 5. If we were to provide a source
method to return the text of the bar
method object, what text would it return?
Since heredocs are allowed to extend beyond the end
of the method / block, I really don't think it makes sense to try to provide a single start / end line. In order to truly provide the source location of the method, we would have to return multiple objects and I think that type of interface would just be too cumbersome to use. I completely agree that we should provide a way to get the AST for a method rather than try to provide line / column information alone.