Project

General

Profile

Bug #16675

Regression on Ripper in Ruby 2.7 when parsing new line

Added by Benoit_Tigeot (Benoit Tigeot) 9 months ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:97373]

Description

Hello

While using migrating RSpec documentation to last Yard. I noticed an issue in code parsing and Ripper. The regression appears on Ruby 2.7 and Head.

require 'pp'
require 'ripper'

SOURCE = "def name\n  # comment\nend"

class RipperParser < Ripper
  attr_accessor :tokens

  SCANNER_EVENTS.each do |event|
    define_method("on_#{event}") do |*args|
      puts "TOKEN:     #{event}"
      (@tokens ||= []) << [event, args]
      super(*args)
    end
  end
end

parser = RipperParser.new(SOURCE, '(stdin)')

puts "PARSING:"
parser.parse

puts "\nTOKENS:"
pp parser.tokens

puts "\nRIPPER SAYS"
pp Ripper.lex(SOURCE)
--- a/2-6-3_ripper_lex.txt
+++ b/2-7-0_ripper_lex.txt
@@ -1,27 +1,27 @@
->> RUBY_VERSION: 2.6.3
+>> RUBY_VERSION: 2.7.0
 PARSING:
 TOKEN:     kw
 TOKEN:     sp
 TOKEN:     ident
-TOKEN:     nl
 TOKEN:     sp
 TOKEN:     comment
+TOKEN:     nl
 TOKEN:     kw

 TOKENS:
 [[:kw, ["def"]],
  [:sp, [" "]],
  [:ident, ["name"]],
- [:nl, ["\n"]],
  [:sp, ["  "]],
  [:comment, ["# comment\n"]],
+ [:nl, ["\n"]],
  [:kw, ["end"]]]

 RIPPER SAYS
-[[[1, 0], :on_kw, "def", EXPR_FNAME],
- [[1, 3], :on_sp, " ", EXPR_FNAME],
- [[1, 4], :on_ident, "name", EXPR_ENDFN],
- [[1, 8], :on_nl, "\n", EXPR_BEG],
- [[2, 0], :on_sp, "  ", EXPR_BEG],
- [[2, 2], :on_comment, "# comment\n", EXPR_BEG],
- [[3, 0], :on_kw, "end", EXPR_END]]
+[[[1, 0], :on_kw, "def", FNAME],
+ [[1, 3], :on_sp, " ", FNAME],
+ [[1, 4], :on_ident, "name", ENDFN],
+ [[1, 8], :on_nl, "\n", BEG],
+ [[2, 0], :on_sp, "  ", ENDFN],
+ [[2, 2], :on_comment, "# comment\n", ENDFN],
+ [[3, 0], :on_kw, "end", END]]

As Loren Segal mentionned

Note that "comment" is detected before "nl" in both the event and the collected tokens, which is different from the results in Ripper.lex

Updated by nobu (Nobuyoshi Nakada) 9 months ago

  • Status changed from Open to Closed

This is because comment lines can be placed between fluent dot.
That means it is undecidable if a newline before a comment line is ignored until comment lines end.

Updated by lsegal (Loren Segal) 9 months ago

  • Status changed from Closed to Open

I'm not sure why this was closed so quickly. There is a real bug here affecting many users who indirectly rely on this core library.

It may be the case that comments can be placed between dots, but that should not affect parsing order. As it stands there are multiple important points:

1) This is either a regression from 2.6.3 or a breaking change in 2.7.0, but in either case, that is a pretty big deal. Users of the ripper library, part of core, get one behavior on one version and another on the next. Given that Ruby 2.x was supposed to be free of breaking changes, I think this deserves more attention.

2) If this is "expected behavior", it is a fundamentally broken API in which an end user cannot actually determine the correct order of tokens while parsing. In other words, this change makes Ripper completely unreliable for parsing comments, which is really odd. Looking purely at the output from the example's PARSING section, it seems clear that the output generated using the API is not representative of the source provided. That to me is a clear reproduction of a bug. Ripper should be able to provide an API that represents source correctly, and this is not happening.

3) Most importantly, I'm not entirely clear on the justification for the change. Scanner events in Ripper represent tokens, not semantic AST nodes, and should therefore represent the token stream as-is. Even if comments can separate an expression, the tokens should not require extra context to tokenize-- this is the whole premise of grammar tokenization. Specifically, you can still have an AST generated in which on_comment is called between AST nodes, since on_comment is not actually used to determine where the newline lives.

4) Finally, comments have always been allowed to separate expressions, but this only broke as of 2.7.0. Ripper even all the way back in Ruby 2.3.3 was able to handle the "fluent dot" scenario just fine, so I'm not sure why this is an issue all of a sudden:

irb(main):006:0> Ripper.sexp("foo. # xxx\nbar")
=> [:program, [[:call, [:vcall, [:@ident, "foo", [1, 0]]], :".", [:@ident, "bar", [2, 0]]]]]
irb(main):007:0> Ripper.lex("foo. # xxx\nbar")
=> [[[1, 0], :on_ident, "foo"], [[1, 3], :on_period, "."], [[1, 4], :on_sp, " "], [[1, 5], :on_comment, "# xxx\n"], [[2, 0], :on_ident, "bar"]]
irb(main):008:0> RUBY_VERSION
=> "2.3.3"

I think this should be revisited as a regression given that the example above shows a clear case of something not working as intended.

If there is no intention to fix this, I'm curious what the correct way is to use the Ripper API to determine comment order alongside AST nodes is that works without behavioral change across all Ripper releases?

Updated by hsbt (Hiroshi SHIBATA) 9 months ago

  • Status changed from Open to Closed

Feel free to comment on this issue. But do not change the status of tracker's issue without the maintainer's decision.

Updated by nobu (Nobuyoshi Nakada) 9 months ago

lsegal (Loren Segal) wrote in #note-2:

4) Finally, comments have always been allowed to separate expressions, but this only broke as of 2.7.0. Ripper even all the way back in Ruby 2.3.3 was able to handle the "fluent dot" scenario just fine, so I'm not sure why this is an issue all of a sudden:

This was a syntax error till 2.6, and valid since 2.7.

foo
  # comment
  .bar

I think this should be revisited as a regression given that the example above shows a clear case of something not working as intended.

Ripper doesn't fire the events in the order of the source, typically around here-documents.

If there is no intention to fix this, I'm curious what the correct way is to use the Ripper API to determine comment order alongside AST nodes is that works without behavioral change across all Ripper releases?

Sort the tokens by location, like Ripper.lex does.

Updated by lsegal (Loren Segal) 9 months ago

Ripper doesn't fire the events in the order of the source, typically around here-documents.

Can you explain issue with here-documents? AFAIK, having used Ripper for over a decade now, this is the first time we've identified Ripper firing lexical events out of order. Given that YARD is probably one of the earliest adopters of the library and has likely parsed a huge chunk of all publicly distributed Ruby code (i.e., an enormous amount of Ruby code), we have a pretty wide range of data that indicates that this is a new problem.

Updated by nobu (Nobuyoshi Nakada) 9 months ago

For example,

<<FOO.chomp
foo
FOO

Here-doc beginning token is just followed by its content and terminator.

Also available in: Atom PDF