Backport #8210

Multibyte character interfering with end-line character within a regex

Added by Tsuyoshi Sawada about 2 years ago. Updated about 2 years ago.

[ruby-core:53944]
Status:Closed
Priority:Normal
Assignee:Usaku NAKAMURA

Description

=begin
With this regex:

regex1 = /\z/

the following strings match as expected:

"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5

but with these regexes:

regex2 = /#$/?\z/
regex3 = /\n?\z/

the results differ:

"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil

The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). I expect them to behave the same, and believe this is a bug.
=end
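The report above can be condensed into a small reproduction script. This is only a sketch: the names `SUBJECTS`, `REGEXES`, and `match_table` are made up for illustration, and regex2 is built with `Regexp.new` and `Regexp.escape($/)` rather than the `#$/` literal interpolation used in the report, on the assumption that `$/` is `"\n"`.

```ruby
# Reproduction sketch; constant and helper names are illustrative only.
SUBJECTS = ["hello", "こんにちは"]

# regex2 is built from $/ explicitly (assumed to be "\n") instead of
# interpolating it inside a regex literal.
REGEXES = {
  "regex1" => /\z/,
  "regex2" => Regexp.new("#{Regexp.escape($/)}?\\z"),
  "regex3" => /\n?\z/
}

# Returns { regex name => { subject => match position or nil } }.
def match_table(regexes, subjects)
  regexes.each_with_object({}) do |(name, regex), table|
    table[name] = subjects.each_with_object({}) do |subject, row|
      row[subject] = (subject =~ regex)
    end
  end
end

# Print one line per regex/subject pair so differences are easy to spot.
match_table(REGEXES, SUBJECTS).each do |name, row|
  row.each { |subject, pos| puts "#{name} #{subject.inspect}: #{pos.inspect}" }
end
```

On an affected build, the rows for regex2 and regex3 would be expected to show nil for the multibyte subject; on a build with the fix, both subjects should report the same position.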

fix-8210-1.diff (742 Bytes) Ken Takata, 04/10/2013 12:41 AM

fix-8210-2.diff (491 Bytes) Ken Takata, 04/10/2013 12:41 AM

fix-8210-1-update.diff (834 Bytes) Ken Takata, 04/13/2013 07:31 PM

Associated revisions

Revision 40276
Added by Yui NARUSE about 2 years ago

  • Merge Onigmo 5.13.4 f22cf2e566712cace60d17f84d63119d7c5764ee. [bug] fix problem with optimization of \z (Issue #16) [Bug #8210]

Revision 40713
Added by Usaku NAKAMURA about 2 years ago

  • regexec.c (onig_search): fix problem with optimization of \z. [Backport #8210] patched by k_tanaka at .

History

#1 Updated by Tsuyoshi Sawada about 2 years ago

=begin
A different regex:

regex4 = /[[:space:]]?\z/

seems to work as expected:

"hello" =~ regex4 # => 5
"こんにちは" =~ regex4 # => 5

=end

#2 Updated by Tsuyoshi Sawada about 2 years ago

=begin
Yet another regex:

regex5 = /\n?$/

seems to work as expected:

"hello" =~ regex5 # => 5
"こんにちは" =~ regex5 # => 5

=end

#3 Updated by Tsuyoshi Sawada about 2 years ago

=begin
The problem seems to occur with the combination of certain tokens, ?, and \z.

"こんにちは" =~ /a?\z/ # => nil
"こんにちは" =~ / ?\z/ # => nil
"こんにちは" =~ /\t?\z/ # => nil
"こんにちは" =~ /\n?\z/ # => nil
"こんにちは" =~ /\s?\z/ # => nil
"こんにちは" =~ /.?\z/ # => 4
"こんにちは" =~ /\S?\z/ # => 4
"こんにちは" =~ /\W?\z/ # => 5
"こんにちは" =~ /あ?\z/ # => 5
"こんにちは" =~ /\w?\z/ # => 5

=end
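A pattern of the form X?\z can always match the empty string at the very end of the subject, so nil should never be a correct result for any of the cases above. The sketch below checks that invariant for the same tokens; `TOKENS` and `failing_tokens` are hypothetical names introduced for this example.

```ruby
# Tokens from the comment above, written as regex source fragments.
TOKENS = ["a", " ", "\\t", "\\n", "\\s", ".", "\\S", "\\W", "あ", "\\w"]

# Returns the tokens for which /<token>?\z/ fails to match the subject.
# Any entry in the result indicates a failure, because the empty match
# at the end of the string is always available.
def failing_tokens(subject, tokens)
  tokens.reject { |token| subject =~ Regexp.new("#{token}?\\z") }
end

puts failing_tokens("こんにちは", TOKENS).inspect
```

An empty array means the invariant holds for every token; on an affected build the list would be expected to contain the tokens reported as nil above.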

#4 Updated by Tsuyoshi Sawada about 2 years ago

Is this bug report wrong? If so, please note so.

#5 Updated by Yui NARUSE about 2 years ago

  • Target version set to 2.1.0
  • Category set to M17N
  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE

sawa (Tsuyoshi Sawada) wrote:

Is this bug report wrong? If so, please note so.

This really looks like a bug in Oniguruma/Onigmo.

#6 Updated by Andrew Cheong about 2 years ago

Contributing notes regarding this bug can be found here: http://stackoverflow.com/a/15885857/925913.

#7 Updated by Franco Rondini about 2 years ago

I just edited the answer; test code is now available.

#8 Updated by Ken Takata about 2 years ago

This problem was caused by optimization of \z.
I wrote two patches to fix this problem.

Maybe fix-8210-1.diff is more efficient than fix-8210-2.diff,
but the former one tries to do backward search when 'start==range'
after 'start' is adjusted. This behavior is a little bit confusing.
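
One possible reading of the 'start==range' remark can be modelled in a few lines of Ruby. This is a guess at the mechanism, not a transcription of regexec.c: it assumes the optimization moves the search start to a fixed number of bytes before the end and then shifts it right to the next character boundary. All names below (`right_adjust_char_head`, `adjusted_start`, `start_reaches_end?`, `max_len`) are hypothetical.

```ruby
# Hypothetical model in byte offsets; this is NOT Onigmo's code.

# Move right to the first byte at or after pos that is not a UTF-8
# continuation byte (10xxxxxx). Returns bytes.size if none is found.
def right_adjust_char_head(bytes, pos)
  pos += 1 while pos < bytes.size && (bytes[pos] & 0xC0) == 0x80
  pos
end

# max_len: assumed maximum byte length of the text that may precede \z
# (assumed to be at least 1; the bare /\z/ case is not modelled).
def adjusted_start(str, max_len)
  bytes = str.bytes.to_a
  return 0 if bytes.size <= max_len
  right_adjust_char_head(bytes, bytes.size - max_len)
end

# True when the adjusted start lands exactly on the end of a non-empty
# string, i.e. the assumed 'start==range' situation.
def start_reaches_end?(str, max_len)
  !str.empty? && adjusted_start(str, max_len) == str.bytesize
end

puts start_reaches_end?("hello", 1)
puts start_reaches_end?("こんにちは", 1)
puts start_reaches_end?("こんにちは", 3)
```

Under these assumptions, a one-byte token before \z puts the start in the middle of the last three-byte character, so the right adjustment pushes it to the end of the string. A search that gives up when the start reaches the end would then report no match, which would be consistent with the nil results for one-byte tokens.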

#9 Updated by Tsuyoshi Sawada about 2 years ago

Is either of k_takata's bug fixes going to be incorporated?

#10 Updated by Yui NARUSE about 2 years ago

k_takata (Ken Takata) wrote:

This problem was caused by optimization of \z.
I wrote two patches to fix this problem.

Maybe fix-8210-1.diff is more efficient than fix-8210-2.diff,
but the former one tries to do backward search when 'start==range'
after 'start' is adjusted. This behavior is a little bit confusing.

I think fix-8210-1.diff is suitable because it seems to preserve the original intention better than fix-8210-2.diff.

#11 Updated by Ken Takata about 2 years ago

I think fix-8210-1.diff is suitable because it seems to preserve the original intention better than fix-8210-2.diff.

Thanks for your comment.
I have updated onigmo's tmp/ruby-2.0.x branch.
https://github.com/k-takata/Onigmo/tree/f22cf2e566712cace60d17f84d63119d7c5764ee

I have also attached an updated patch that can be applied to Ruby 1.9.3.

#12 Updated by Yui NARUSE about 2 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r40276.
Tsuyoshi, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • Merge Onigmo 5.13.4 f22cf2e566712cace60d17f84d63119d7c5764ee. [bug] fix problem with optimization of \z (Issue #16) [Bug #8210]

#13 Updated by Yui NARUSE about 2 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby trunk to Backport200
  • Category deleted (M17N)
  • Status changed from Closed to Assigned
  • Assignee changed from Yui NARUSE to Tomoyuki Chikanaga
  • Target version deleted (2.1.0)

#14 Updated by Ken Takata about 2 years ago

I think it's better to backport this patch to Ruby 1.9.3 too.

#15 Updated by Tomoyuki Chikanaga about 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r40384.
Tsuyoshi, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 40276: [Backport #8210]

* Merge Onigmo 5.13.4 f22cf2e566712cace60d17f84d63119d7c5764ee.
  [bug] fix problem with optimization of \z (Issue #16) [Bug #8210]

#16 Updated by Tomoyuki Chikanaga about 2 years ago

  • Project changed from Backport200 to Backport193
  • Status changed from Closed to Assigned
  • Assignee changed from Tomoyuki Chikanaga to Usaku NAKAMURA

Moved to Backport193.
But Onigmo was merged only from 2.0 onward. I haven't confirmed whether this patch can be merged into ruby_1_9_3...

#17 Updated by Usaku NAKAMURA about 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r40713.
Tsuyoshi, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • regexec.c (onig_search): fix problem with optimization of \z. [Backport #8210] patched by k_tanaka at .

#18 Updated by Ken Takata about 2 years ago

Hi usa,

  • regexec.c (onig_search): fix problem with optimization of \z. [Backport #8210] patched by k_tanaka at .

Thank you for merging my patch.
BTW, my name is not tanaka...
