Feature #19452
open`Thread::Backtrace::Location` should have column information if possible.
Description
I discussed this with @mame (Yusuke Endoh) and it would be pretty useful if we could also get the column information from exception backtrace location, even if it was slow.
A POC:
class Thread::Backtrace::Location
if defined?(RubyVM::AbstractSyntaxTree)
def first_column
RubyVM::AbstractSyntaxTree.of(self, keep_script_lines: true).first_column
end
else
def first_column
raise NotImplementedError
end
end
end
It would be good to have a standard interface, so we follow the same interface as https://bugs.ruby-lang.org/issues/19451 and vice versa where it makes sense. I'll investigate it.
Updated by ioquatix (Samuel Williams) over 1 year ago
- Tracker changed from Bug to Feature
- Backport deleted (
2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN)
Updated by ioquatix (Samuel Williams) over 1 year ago
- Related to Feature #19451: Extract path and line number from SyntaxError? added
Updated by rubyFeedback (robert heiler) over 1 year ago
Perhaps this could be discussed as "one item" in an upcoming
dev meeting, e. g. the extensions ioquatix suggested in the
last some issues raised here. Since I love introspection I
am all in favour of what can be useful for us, when writing
ruby code, and wanting to get more useful information. (Note
that I have not thought through if I may need this, or not;
I am trusting that ioquatix is convinced it may be useful.
I have had very little exposure to Threads actually, and
none at all with Threads::Backtrace, so to me this is a bit
more of a niche topic. Guess I have to update my ruby knowledge
with zverok's new documentation. :D)
Updated by mame (Yusuke Endoh) over 1 year ago
@ioquatix (Samuel Williams) Please, please write a use case in every proposal.
First of all, I think the POC itself is very naive for daily use. Consider the following method call. For Thread::Bactrace::Location of the call to foo, #first_column
will point before the receiver, not before the method name. And #last_column
will point after all arguments.
very_very_long_receiver.foo(many_many_arguments)
^ ^
| |
+--- #first_column #last_columns ---+
Without a use case, it would be impossible to discuss whether and how it is useful. This is a heavier feature than you might think to introduce for the sake of "better than nothing".
ErrorHighlight.spot
is a bit more intelligent: it cuts out the method name part.
very_very_long_receiver.foo(many_many_arguments)
^^^^
|
+--- ErrorHighlight.spot
To achieve this, ErrorHighlight uses RubyVM::AST.of
: it reparses the source code, identify the node in the method call, and draw an underline after its receiver and before its arguments.
Because of using RubyVM::AST.of
, ErrorHighlight works only with CRuby. That is unfortunate. Currently @kddeisz is working on a new Ruby parser project called yarp. As far as I know, its goal is to be a common parser for all major Ruby implementations including CRuby and TruffleRuby. If the goal is accomplished, I will use yarp for ErrorHighlight, which will (hopefully) allow ErrorHighlight to work on Ruby interpreters rather than CRuby.
Even in that case, it is still necessary to map Thread::Backtrace::Location
to the AST node. We may use Thread::Backtrace::Location#first_lineno
, etc. to identify the AST node corresponding to the Thread::Backtrace::Location
. But this should be considered after yarp actually becomes a CRuby parser. (Implementation note: currently, Thread::Backtrace::Location
has the ID of the AST node, which is a number assigned internally to the nodes in the AST in (approximate) pre-order. RubyVM::AST.of
re-parses the source code to recover the full AST, and then identifies the AST node by using the node ID. This is the method suggested by @ko1 (Koichi Sasada). The four values, first_lineno / first_column / last_lineno / last_column, could be another way to identify nodes. But this is slightly less accurate: if there is another node in the exact same code range, it is not uniquely identifiable. I don't think this is a problem for ErrorHighlight, though.)
@Eregon (Benoit Daloze) @kddeisz Any opinions?
Updated by Eregon (Benoit Daloze) over 1 year ago
YARP for sure will be useful for this.
However I think it is somewhat independent of YARP and more whether AST nodes/bytecodes/etc keep track of column information or not.
TruffleRuby already tracks the column information, so it would be trivial to provide that (except that the current parser doesn't keep column info, but YARP does).
Regarding finding out the actually method call name and the .
before, maybe YARP could/should remember those two locations? WDYT @kddeisz?
That would avoid the need to reparse, at the cost of having to store 2 extra uint32 "byte offsets" per call node.
I like the proposed API because it's straightforward.
I think it's also enough to identity which call it is, e.g. for foo(1).bar(2).baz(3)
, i.e. it's always the rightmost call inside first_column...last_columns
.
But that's indeed not as clear or obvious as underlining the operator + method name alone.
Updated by mame (Yusuke Endoh) over 1 year ago
@Eregon (Benoit Daloze) Thank you for your comment.
Eregon (Benoit Daloze) wrote in #note-5:
That would avoid the need to reparse, at the cost of having to store 2 extra uint32 "byte offsets" per call node.
I think there is a confusion of what the word "reparse" means here.
- When CRuby compiles a AST to a bytecode, it copies just node_id and lineno to each instruction but discards the column information. (@ko1 (Koichi Sasada) is not willing to copy column information due to memory consumption concerns.) So
RubyVM::AST.of
needs to "reparse" the whole source code to recover the whole AST. - CRuby's
RubyVM::AST
does not keep the column information of the method name because it converts the name to a symbol. It needs to "reparse" (or "re-tokenize"?) the code to identify where the method name is.
Here I am talking about 1, which will not change even if YARP is introduced. If YARP's AST keeps the method name column information, 2 will be unnecessary, which is good.
I like the proposed API because it's straightforward.
I think it's also enough to identity which call it is, e.g. forfoo(1).bar(2).baz(3)
, i.e. it's always the rightmost call insidefirst_column...last_columns
.
But that's indeed not as clear or obvious as underlining the operator + method name alone.
This is somewhat a matter of taste, but since this is a user interface, I believe it is important to provide pinpoint information at a glance. So I think it is worthwhile to elaborate it, even if it is somewhat tedious.
Updated by mame (Yusuke Endoh) over 1 year ago
mame (Yusuke Endoh) wrote in #note-6:
This is somewhat a matter of taste, but since this is a user interface, I believe it is important to provide pinpoint information at a glance. So I think it is worthwhile to elaborate it, even if it is somewhat tedious.
Sorry, I'm getting off topic. I think it is possible to use four values of first_lineno, etc. as information to connect Thread::Backtrace::Location
and AST node. (@ko1 (Koichi Sasada) is not so keen on it, though.)
However, we would need to clearly agree that this is a use case of Thread::Backtrace::Location#first_column
. If we change the return value arbitrarily, it would not be able to match it up with the AST node. Anyway, I think this should be discussed after YARP is completed and incorporated into CRuby.
Updated by kddnewton (Kevin Newton) over 1 year ago
YARP keeps column information around for this. In the call node there's column information at each of the points below:
very_very_long_receiver.foo(many_many_arguments)
^ ^^ ^^ ^
This may be too much at the end of the day, but for now it's working out nicely. In terms of memory consumption, I would think it would be 1-to-1 if we dropped node_id and replaced it with column
Updated by Eregon (Benoit Daloze) over 1 year ago
mame (Yusuke Endoh) wrote in #note-6:
I think there is a confusion of what the word "reparse" means here.
I meant it as 1.
as well. I think copying/keeping the column information is the most reliable way to have this information.
Indeed it should be encoded efficiently to avoid too much of a memory increase.
I think we should measure, e.g. by how much % does it increase the memory of a Ruby program and a Rails app?
I suspect it's not much higher than keeping the node id.
In fact if we assumed e.g. no call expression is longer than 2^16 bytes long, we could fit 2 column offsets (from the start of the call) in 32-bit. Or even 4 column offsets if we need more, and then have some fallback encoding if longer than 256 (likely rare) or so.
I don't know what CRuby uses currently to keep line information though.
I think it is possible to use four values of first_lineno, etc. as information to connect Thread::Backtrace::Location and AST node.
I think this makes sense, although it's even better if we don't need to reparse to find out column information.
It's probably more efficient to internally have the first_byteoffset and last_byteoffset, that's just two 32-bit integers and we can derive {first,last}_{lineno,column} from it.
It's also how YARP records locations.
Updated by Eregon (Benoit Daloze) over 1 year ago
Anyway, I think this should be discussed after YARP is completed and incorporated into CRuby.
I'm OK with that, I don't need these methods urgently. It will be nice to have public APIs for this kind of functionality vs experimental RubyVM:: stuff.
Updated by ufuk (Ufuk Kayserilioglu) over 1 year ago
If memory increase is a concern, could we not store line/column information as varints (e.g. like how Protocol Buffers encode integers) in the bytecode, so that smaller numbers (which would be way more common, especially for column numbers) take up less memory to begin with? Since access to this data would probably not be on any hot-path, the fact that they'd have to be converted to regular ints when needed should be a very small concern.
Updated by mame (Yusuke Endoh) over 1 year ago
Thank you all!
kddeisz (Kevin Newton) wrote in #note-8:
In terms of memory consumption, I would think it would be 1-to-1 if we dropped node_id and replaced it with column
What concerns me slightly is that it makes it impossible to uniquely identify nodes that have the same column information, such as NODE_SCOPE or its child node. Perhaps it is not a very big deal for ErrorHighlight, though.
Eregon (Benoit Daloze) wrote in #note-9:
I think we should measure, e.g. by how much % does it increase the memory of a Ruby program and a Rails app?
I suspect it's not much higher than keeping the node id.
When adding node_id, I evaluated the memory usage with rails s
. (See #17930.) It increased about 3 % (97 MB -> 100 MB). Not very small to ignore.
ufuk (Ufuk Kayserilioglu) wrote in #note-11:
could we not store line/column information as varints
That makes sense. Although some efforts have already been made in compression, I guess that there would be room for improvement.
Updated by ko1 (Koichi Sasada) over 1 year ago
Let me clear why column information is better than node_id
My understanding:
- Better than
node_id
- B1. We can get one column information (maybe most wider information, for example:
a.b.c.d.func(...)
forfunc()
. - B2. We can get (maybe) corresponding node by column information. not exact same with
node_id
but similar. - B3. Column information is not magical than
node_id
- B4. Can detect file modification on reparsing (not all modification though) because we can not get corresponding node by column information.
- B1. We can get one column information (maybe most wider information, for example:
- Worse than
node_id
- W1. More memory consumption
- W2. Can not get exact corresponding node with column information because some nodes can have same column information.
- W3. Can not get corresponding node on reparsing when the script was modified, even if spacing or commenting (<=> B4)
- W4. (personal concern) if we introduce it, I'm afraid that people ask to hold more and more information than the column information
- Both worse
- S1. Need to reparse a script and analyze the script to get more information like method name, parameters and so on.
Updated by janosch-x (Janosch Müller) over 1 year ago
TracePoint
also lacks column information AFAICT so it might be worthwhile to consider it in search of a consistent solution.