Bug #21783: {Method,UnboundMethod,Proc}#source_location returns columns in bytes and not in characters

Added by Eregon (Benoit Daloze) about 17 hours ago. Updated about 1 hour ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) +PRISM [x86_64-linux]
[ruby-core:124206]

Description

The documentation says:

= Proc.source_location

(from ruby core)
------------------------------------------------------------------------
  prc.source_location  -> [String, Integer, Integer, Integer, Integer]

------------------------------------------------------------------------

Returns the location where the Proc was defined. The returned Array
contains:
  (1) the Ruby source filename
  (2) the line number where the definition starts
  (3) the column number where the definition starts
  (4) the line number where the definition ends
  (5) the column number where the definitions ends

This method will return nil if the Proc was not defined in Ruby (i.e.
native).

The documentation talks about column numbers, so they should be counted in characters, not in bytes.

But currently it's a number of bytes:

$ ruby --parser=prism -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) +PRISM [x86_64-linux]
["-e", 1, 0, 1, 14]

$ ruby --parser=parse.y -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) [x86_64-linux]
["-e", 1, 0, 1, 14]

The last number should be 12 because "def été; end".size is 12 characters.
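
The two counts can be checked on the string itself. This is a minimal sketch, assuming the source is UTF-8 (where each "é" takes two bytes); the variable names are only for illustration:

```ruby
# Assumes UTF-8, the default encoding for Ruby source files.
source = "def été; end"

char_count = source.size      # number of characters
byte_count = source.bytesize  # number of bytes; each "é" is 2 bytes in UTF-8

puts "characters: #{char_count}"  # 12
puts "bytes:      #{byte_count}"  # 14
```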

This is a Ruby-level API, so I would never expect "byte columns" here. I think it's clear it should be a number of "editor columns", i.e. a number of characters.


Related issues: 2 (1 open, 1 closed)

Related to Ruby - Feature #21005: Update the source location method to include line start/stop and column start/stop details (Open)
Related to Ruby - Feature #6012: Proc#source_location also return the column (Closed, nobu (Nobuyoshi Nakada))

Updated by Eregon (Benoit Daloze) about 17 hours ago #1

  • Description updated (diff)

Updated by Eregon (Benoit Daloze) about 17 hours ago #2

  • Related to Feature #21005: Update the source location method to include line start/stop and column start/stop details added

Updated by Eregon (Benoit Daloze) about 17 hours ago #3

  • Related to Feature #6012: Proc#source_location also return the column added

Updated by kddnewton (Kevin Newton) about 13 hours ago #4 [ruby-core:124212]

I think this is a documentation issue, as both parsers/compilers operate in terms of bytes. Changing this to characters would likely cause a noticeable slowdown and require quite a bit of code change. (Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.) All of that is to say, please do not change this; it will be a ton of work for minimal benefit.

Updated by Eregon (Benoit Daloze) about 9 hours ago #5 [ruby-core:124216]

Updating the docs is one solution, so at least it's consistent between docs and behavior.

I think as a Ruby-facing API it's weird that it operates in terms of bytes (and source_location does not have a byte prefix to indicate that).
I think most programmers, when they hear "line 4, column 6", expect the 6th character on the 4th line, not the character starting at the 6th byte. A byte column is actually hard to find in an editor: most editors don't show "byte columns", and it's not even possible to place the cursor at some byte positions. Programmers think in characters when looking at source code.

For example, one might expect that highlighting with ^ based on the return values from source_location works, but it doesn't:

def underline(callable)
  # The columns are used here as if they were character columns.
  file, start_line, start_column, end_line, end_column = callable.source_location
  raise "multi-line definitions are not supported" unless start_line == end_line
  source = File.readlines(file)[start_line - 1]
  puts source
  puts ' ' * start_column + '^' * (end_column - start_column)
end

my_proc = proc { ascii-only }
underline my_proc

my_proc = proc { il était une fois un été }
underline my_proc

gives

$ ruby underline.rb
my_proc = proc { ascii-only }
               ^^^^^^^^^^^^^^
my_proc = proc { il était une fois un été }
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
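
For comparison, here is a rough sketch of how a caller could work around byte columns when the source line is available. The helper names are hypothetical, and the byte columns 15 and 46 are inferred from the output above (15 spaces followed by 31 carets) rather than taken from a live source_location call:

```ruby
# Hypothetical helper: convert a byte column into a character column by
# counting the characters in the first byte_column bytes of the line.
def byte_to_char_column(line, byte_column)
  line.byteslice(0, byte_column).size
end

# Build the marker line from byte columns.
def underline_marker(line, start_byte_column, end_byte_column)
  start_column = byte_to_char_column(line, start_byte_column)
  end_column = byte_to_char_column(line, end_byte_column)
  ' ' * start_column + '^' * (end_column - start_column)
end

line = "my_proc = proc { il était une fois un été }"
# Assumed byte columns, inferred from the output shown in this comment.
marker = underline_marker(line, 15, 46)
puts line
puts marker
```

With this conversion the marker is 28 carets wide, matching the 28 characters of the block, instead of 31.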

> Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.

This is a good point, I didn't realize that.
I think it would still be worth changing the parsers/compilers to compute the proper character column for literal lambdas, blocks and methods. It probably wouldn't be very expensive, given that most source files are ASCII-only. The parsers could even use the knowledge that a given line is ASCII-only, so it would remain as fast even if the file contains a few non-ASCII characters.
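
The ASCII-only shortcut could look roughly like this. It is only an illustration in Ruby with a hypothetical method name; CRuby's parsers are written in C, and this is not how either of them is implemented:

```ruby
# Hypothetical sketch of the ASCII-only fast path.
def char_column(line, byte_column)
  # In an ASCII-only line every character is one byte, so both columns agree
  # and no counting is needed.
  return byte_column if line.ascii_only?

  # Otherwise count the characters in the first byte_column bytes.
  line.byteslice(0, byte_column).size
end

ascii_col = char_column("my_proc = proc { ascii-only }", 29)
utf8_col = char_column("def été; end", 14)
puts ascii_col
puts utf8_col
```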

If columns appeared in error messages, for example, I think everyone would expect them to be character columns, not byte columns.
For example, gcc shows character columns, as one would expect:

int main() {
    /* été */ notexist
}

$ gcc test.c
test.c: In function ‘main’:
test.c:2:15: error: ‘notexist’ undeclared (first use in this function)
    2 |     /* été */ notexist
      |               ^~~~~~~~

Note it's 2:15 (i.e. character columns), not 2:17 (byte columns).
The highlighting also needs to use character columns of course.

Updated by Eregon (Benoit Daloze) about 8 hours ago #6 [ruby-core:124217]

From https://bugs.ruby-lang.org/issues/6012#note-25 @matz (Yukihiro Matsumoto) said adding column was OK, but not byte offsets.
I'm not sure what his reasons were, but maybe it's that byte offsets are too low-level for source_location?
If so, I would think byte columns are also too low-level, and it should be character columns instead.
From a user POV character columns seem better and more expected.

OTOH, I understand the reservation from @kddnewton (Kevin Newton) and I share it as a Ruby implementer: it's much simpler to return byte columns.
For example, in TruffleRuby we currently save location information by having int32_t start_offset; int32_t length; in every Truffle AST node, i.e. byte offset and byte length.
Returning byte columns from that is easy and only requires the "newline offsets" array, and not the actual source code.
To return character columns, TruffleRuby would need to read from the beginning of the line to the byte offset to find how many characters that is, and keep the source code in memory (currently TruffleRuby does keep it in memory, but it might not in the future).
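
To make the difference concrete, here is a small Ruby model of the two lookups. It is a hypothetical sketch, not TruffleRuby's actual code, and all method names are made up for illustration:

```ruby
# Byte offset at which each line starts (the "newline offsets" array).
def line_start_offsets(source)
  offsets = [0]
  source.each_byte.with_index { |byte, i| offsets << i + 1 if byte == 0x0A }
  offsets
end

# Needs only the offsets array: returns [1-based line, 0-based byte column].
def line_and_byte_column(offsets, byte_offset)
  # Find the first line start beyond byte_offset; the line before it is ours.
  index = offsets.bsearch_index { |start| start > byte_offset }
  index = index ? index - 1 : offsets.size - 1
  [index + 1, byte_offset - offsets[index]]
end

# Needs the source text as well: counts characters from the line start.
def char_column(source, offsets, byte_offset)
  line, = line_and_byte_column(offsets, byte_offset)
  line_start = offsets[line - 1]
  source.byteslice(line_start, byte_offset - line_start).size
end

source = "x = 1\ndef été; end\n"
offsets = line_start_offsets(source)
end_offset = source.bytesize - 1 # byte offset just past "end" on line 2

p line_and_byte_column(offsets, end_offset)
p char_column(source, offsets, end_offset)
```

The first lookup works without `source`, while the second has to read the bytes of the line, which is why the source would have to stay in memory.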

I have also seen this in the context of adding Prism.node_for. For that usage, byte columns are actually easier than character columns. OTOH, it's not hard to convert from character columns to byte columns in that case, and I already wrote the logic for that (because I expected source_location to return character columns, even before reading the docs).

It is of course possible to convert from character column to byte column and vice versa, but it requires access to the source code, which is not always available (e.g. eval).

Updated by kddnewton (Kevin Newton) about 7 hours ago #7 [ruby-core:124218]

Honestly, if we're interpreting column as something visual, as you're implying, we're also going to run into issues with grapheme clusters, East Asian width, and all the other implications of whatever "character" actually means. I think we would also have to return the encoding of the source file inside that array in order for it to make any sense.
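
A short sketch of how the possible meanings of "character" diverge, using only core String methods. The example strings are chosen for illustration and assume UTF-8 source; display width is not computed here:

```ruby
# "é" written as "e" followed by a combining acute accent (U+0301).
decomposed = "e\u0301"

bytes = decomposed.bytesize                   # bytes in UTF-8
code_points = decomposed.size                 # code points
graphemes = decomposed.grapheme_clusters.size # user-perceived characters

puts "bytes: #{bytes}, code points: #{code_points}, graphemes: #{graphemes}"

# The byte count also depends on the encoding of the source.
precomposed = "\u00E9"
utf8_bytes = precomposed.bytesize
latin1_bytes = precomposed.encode("ISO-8859-1").bytesize

puts "UTF-8: #{utf8_bytes} bytes, ISO-8859-1: #{latin1_bytes} byte"

# A wide character such as "日" is one code point and three UTF-8 bytes,
# but is typically drawn two cells wide in a terminal.
wide = "日"
puts "#{wide}: #{wide.size} code point, #{wide.bytesize} bytes"
```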

Updated by matz (Yukihiro Matsumoto) about 1 hour ago #8 [ruby-core:124221]

I'd like to cancel having source_location return column information in 4.0, due to this concern. In my personal opinion, I am leaning toward byte index, though.

Matz.
