Bug #16108

closed

gsub gives wrong results with regex backreferencing and triple backslash

Added by VivianUnger (Vivian Unger) over 4 years ago. Updated over 4 years ago.

Status: Rejected
Assignee: -
Target version: -
ruby -v: ruby 2.6.3p62 (2019-04-16 revision 67580) [x64-mingw32]
[ruby-core:94402]

Description

I have written a script to convert LaTeX indexing files (.idx) to Macrex backup format (.mbk), so that I can import LaTeX-embedded indexes into the Macrex indexing program. A problem arises when I try to convert bolded text. LaTeX indicates bolded text with the tag \textbf{[bold text]} while Macrex wraps it in backslashes: \[bold text]\.

In my test case, the input string is "\indexentry{\textbf{bold}|hyperpage}{2}", which I need to convert into "\indexentry{\bold\|hyperpage}{2}". For this I am using:

record.gsub(/\\textbf\{([^\}]+)\}/, '\\\1\\')

But instead of the expected output, I get:

\indexentry{\1\|hyperpage}{2}

...as if I only had two backslashes rather than three.

I have tried the same Regex in a search-and-replace in Notepad++ and it works as expected. It's only in Ruby that I get this unexpected result.

The kludgey workaround I have found is to leave a space before the \1:

record.gsub(/\\textbf\{([^\}]+)\}/, '\\ \1\\')

...giving the result:

\indexentry{\ bold\|hyperpage}{2}

But this won't do. Macrex complains and the extra space has to be edited out. Imagine if you have hundreds of lines with bold text in them!

Updated by VivianUnger (Vivian Unger) over 4 years ago

I have written a script to convert LaTeX indexing files (.idx) to Macrex backup format (.mbk), so that I can import LaTeX-embedded indexes into the Macrex indexing program. A problem arises when I try to convert bold text. LaTeX indicates bold text with the tag \textbf{[bold text]} while Macrex wraps it in backslashes: \[bold text]\.

In my test case, the input string is:

\indexentry{\textbf{bold}|hyperpage}{2}

I need to convert this into:

\indexentry{\bold\|hyperpage}{2}

For this I am using the following code:

record.gsub(/\\textbf\{([^\}]+)\}/, '\\\1\\')

But instead of the expected output, I get:

\indexentry{\1\|hyperpage}{2}

...as if I only had 2 backslashes rather than three.

I have tried using the same Regex in a search-and-replace in Notepad++ and it works as expected. It's only in Ruby that I get this unexpected result.

The kludgey workaround I have found is to leave a space after the first two backslashes:

record.gsub(/\\textbf\{([^\}]+)\}/, '\\ \1\\')

...giving the result:

\indexentry{\ bold\|hyperpage}{2}

But this won't do. Macrex complains and the extra space has to be edited out. Imagine if you have hundreds of lines with bold text in them!

Updated by alanwu (Alan Wu) over 4 years ago

The source of your problem seems to be the behavior below:

p ' \1 '.bytes # => [32, 92, 49, 32]
p ' \\ '.bytes # => [32, 92, 32]
p ' \ '.bytes  # => [32, 92, 32]

As you can see, two backslashes in a single-quoted string literal give only one backslash in the resulting string.

This is further complicated by gsub interpreting the content of the second argument as a replacement directive. This means the backslashes are interpreted a second time. You want the final replacement to be "one backslash, followed by the first match group, then another backslash", which written out naively is \\1\ ([92, 92, 49, 92]). The replacement directive that actually expresses this is \\\1\\ ([92, 92, 92, 49, 92, 92]), as we need to escape the first and last backslash by doubling them. We don't want to double the backslash right before "1", as we are not looking for a literal backslash there.
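To see gsub's own pass in isolation, the sketch below builds both directives from raw code points, so no string-literal escaping is involved. The input and pattern are the ones from the report; the names `naive` and `escaped` are illustrative only.

```ruby
input   = '\indexentry{\textbf{bold}|hyperpage}{2}'
pattern = /\\textbf\{([^\}]+)\}/

# Build each directive from code points, bypassing string-literal escaping.
naive   = [92, 92, 49, 92].pack('U*')          # \\1\
escaped = [92, 92, 92, 49, 92, 92].pack('U*')  # \\\1\\

# gsub reads "\\" as one literal backslash, so the "1" after it is plain text.
naive_result = input.gsub(pattern, naive)
# Here the third backslash pairs with "1" and forms the group reference.
escaped_result = input.gsub(pattern, escaped)

puts naive_result    # \indexentry{\1\|hyperpage}{2}
puts escaped_result  # \indexentry{\bold\|hyperpage}{2}
```

The first result reproduces the output from the report, which suggests that the literal '\\\1\\' reaches gsub as the four-byte naive directive.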

Now we need to construct a Ruby string literal we can put in the source code that would give us the replacement directive we want, which we could do by doubling all the backslashes:

p '\\\\\\1\\\\'.bytes # => [92, 92, 92, 49, 92, 92]

We could drop one of the backslashes before "1", since the single-quoted literal '\1' already gives [92, 49]:

p '\\\\\1\\\\'.bytes # => [92, 92, 92, 49, 92, 92]

We could also get rid of two backslashes after the 1 as gsub interprets the lone backslash at the end as a literal backslash.
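As a quick check of the three spellings discussed above, the following sketch compares their bytes and their gsub results. The variable names are illustrative only.

```ruby
input   = '\indexentry{\textbf{bold}|hyperpage}{2}'
pattern = /\\textbf\{([^\}]+)\}/

full     = '\\\\\\1\\\\'  # every backslash of the directive doubled
shorter  = '\\\\\1\\\\'   # one fewer before "1": '\1' is already backslash + 1
shortest = '\\\\\\1\\'    # trailing pair reduced to a single lone backslash

p full.bytes      # [92, 92, 92, 49, 92, 92]
p shorter.bytes   # [92, 92, 92, 49, 92, 92]
p shortest.bytes  # [92, 92, 92, 49, 92]

# All three directives should produce the same substitution.
results = [full, shorter, shortest].map { |r| input.gsub(pattern, r) }
puts results.uniq  # \indexentry{\bold\|hyperpage}{2}
```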

This is too many backslashes for my taste, so I would prefer the block form. It takes the return value of the block and substitutes that for the match verbatim. The special $1 variable is set within the gsub block, which we can use to build the replacement we want:

input.gsub(pattern) { ["\\", $1, "\\"].join }
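A small sketch of why the block form sidesteps the second pass: whatever the block returns is inserted as-is, so even a string that looks like a group reference is not expanded. `Regexp.last_match(1)` is an equivalent spelling of `$1`; the names `wrapped` and `verbatim` are illustrative only.

```ruby
input   = '\indexentry{\textbf{bold}|hyperpage}{2}'
pattern = /\\textbf\{([^\}]+)\}/

# "\\" in a double-quoted string is one backslash; gsub does not reinterpret it.
wrapped = input.gsub(pattern) { "\\#{Regexp.last_match(1)}\\" }
puts wrapped   # \indexentry{\bold\|hyperpage}{2}

# The block's return value is not scanned for \1, so this inserts
# the two characters backslash and 1.
verbatim = input.gsub(pattern) { '\1' }
puts verbatim  # \indexentry{\1|hyperpage}{2}
```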

Here is a test program for you:

input = '\indexentry{\textbf{bold}|hyperpage}{2}'
pattern = /\\textbf\{([^\}]+)\}/

test = ->(replacement) {
  puts "result: #{input.gsub(pattern, replacement)}, replacement: #{replacement.bytes.map(&:chr).join}"
}
test.call('\\\1\\')
test.call('\\ \1\\')
test.call('\\\\\\1\\\\')
test.call('\\\\\\1\\')
test.call('\\\\\1\\')

$stdout.write "alternative: "
puts input.gsub(pattern) { ["\\", $1, "\\"].join }

Updated by shyouhei (Shyouhei Urabe) over 4 years ago

  • Status changed from Open to Rejected

This is the designed behaviour. A backslash character is first cooked by the ruby interpreter (to handle \' etc), then cooked again by gsub's own preprocessor (to handle \1 etc). You have to understand exactly what is going on to play with it.

Don't hesitate to resort to the alternative solution shown in @alanwu's comment.
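The two stages can also be observed separately. The sketch below is an illustration only: `escape_replacement` is a hypothetical helper, not part of Ruby's API, that doubles every backslash so that gsub's pass turns the text back into its original form.

```ruby
# Stage 1: the Ruby parser turns the four-backslash literal into two backslashes.
stage1 = '\\\\'
p stage1.bytes  # [92, 92]

# Stage 2: gsub reads those two backslashes as one escaped, literal backslash.
stage2 = 'x'.gsub(/x/, stage1)
p stage2.bytes  # [92]

# Hypothetical helper: double each backslash so gsub inserts the text literally.
# The block form is used so that this replacement is itself taken verbatim.
def escape_replacement(text)
  text.gsub('\\') { '\\\\' }
end

input     = '\indexentry{\textbf{bold}|hyperpage}{2}'
pattern   = /\\textbf\{([^\}]+)\}/
directive = escape_replacement('\\') + '\1' + escape_replacement('\\')
result    = input.gsub(pattern, directive)
puts result  # \indexentry{\bold\|hyperpage}{2}
```

Escaping only the literal parts and leaving '\1' alone keeps the group reference active, which is the same directive that the doubled-backslash literals spell out by hand.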
