Bug #15839
Closed
mixed encoding heredoc should be a syntax error regardless the order
Description
This heredoc isn't a syntax error,
#encoding: cp932
p <<-STR
\xe9\x9d
\u1234
STR
whereas this is.
#encoding: cp932
"
\xe9\x9d
\u1234
"
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
Heredocs are parsed line-by-line, and mixed encoding is
already detected if it is on the same line:
#encoding: cp932
p <<-STR
\xe9\x9d\u1234
STR
# UTF-8 mixed within Windows-31J source
# \xe9\x9d\u1234
# syntax error, unexpected end-of-input, expecting tSTRING_CONTENT or tSTRING_DBEG or tSTRING_DVAR or tSTRING_END
In order to handle mixed content on separate lines, we need to
keep track of the temporary encoding of the string, which was
previously done via a local variable in tokadd_string. The
attached patch adds a second rb_encoding ** argument to tokadd_string
for keeping track of the temporary encoding, so that here_document
can store the value between lines.
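The idea of carrying the temporary encoding from one heredoc line to the next can be sketched as a toy model in Ruby. This is only an illustration, not the parse.y code: scan_line, check_heredoc, MixedEncodingError and the :source / :utf8 symbols are hypothetical names, and the model assumes that only escapes with a value of 0x80 or above pin an encoding.

```ruby
# Toy model of the parser state described above; NOT the real parse.y logic.
# All names here are hypothetical and exist only for illustration.
class MixedEncodingError < StandardError; end

# Matches \xHH (captured in group 1) or \uHHHH (captured in group 2).
ESCAPE = /\\x(\h{1,2})|\\u(\h{4})/

# Scans one heredoc line and returns the updated temporary encoding:
# :source, :utf8, or nil while still undecided.
def scan_line(line, temp_enc)
  line.scan(ESCAPE) do |hex, uni|
    # Assumption: ASCII-range escapes do not pin an encoding.
    next if (hex || uni).hex < 0x80
    seen = uni ? :utf8 : :source
    if temp_enc && temp_enc != seen
      raise MixedEncodingError, "UTF-8 mixed within source encoding"
    end
    temp_enc = seen
  end
  temp_enc
end

# Threads the temporary encoding through every line of the heredoc body,
# which is what lets a mix on separate lines be detected.
def check_heredoc(lines)
  lines.reduce(nil) { |enc, line| scan_line(line, enc) }
end
```

In this model, check_heredoc(['\xe9\x9d', '\u1234']) raises MixedEncodingError, whereas calling scan_line(line, nil) separately for each line (forgetting the state between lines, as in the behavior reported above) raises nothing.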
Updated by nobu (Nobuyoshi Nakada) over 4 years ago
Thank you, but it doesn't work for the reverse order, \u followed by \x.
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
- File mixed-encoding-heredoc-reverse-order-fix.patch added
nobu (Nobuyoshi Nakada) wrote:
Thank you, but it doesn't work for the reverse order, \u followed by \x.
That is because the \x escape does not do the same type of encoding voodoo that the \u escape does. Not sure if we want to change that, or if we do, how exactly it would work.
Attached is a patch with a less invasive approach that will still raise the syntax error. It should be applied on top of the previous patch. It checks that the string generated by the heredoc has a valid encoding, after the heredoc has been fully parsed.
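A Ruby-level analogue of that post-parse check might look like the sketch below. It is not the patch itself: heredoc_encoding_ok? is a hypothetical helper, and the sketch assumes the \u escape leaves the finished heredoc string tagged as UTF-8.

```ruby
# Bytes contributed by the two escapes, built explicitly so the sketch does
# not depend on the encoding of the file it is run from.
cp932_bytes = [0xE9, 0x9D].pack("C*")  # what \xe9\x9d contributes
utf8_bytes  = "\u1234".b               # what \u1234 contributes (E1 88 B4)

# Assumption: the \u escape leaves the heredoc string tagged as UTF-8.
heredoc_body = (cp932_bytes + utf8_bytes).force_encoding(Encoding::UTF_8)

# Hypothetical helper: the check performed once the whole heredoc is parsed.
def heredoc_encoding_ok?(str)
  str.valid_encoding?
end

puts heredoc_encoding_ok?(heredoc_body)
puts heredoc_encoding_ok?("\u1234")
```

The first call prints false because E9 9D followed by E1 is not a well-formed UTF-8 sequence, so a parser doing this check would report an error; the second prints true.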
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
After additional analysis, I found that I only needed to add one line to my initial patch to fix it to work with both \u before \x and \u after \x. With the attached patch (which supersedes the previous patches):
$ ruby -e '#encoding: cp932
p((<<-STR))
\u1234
\xe9\x9d
STR
'
-e:4: UTF-8 mixed within Windows-31J source
\xe9\x9d
$ ruby -e '#encoding: cp932
p((<<-STR))
\xe9\x9d
\u1234
STR
'
-e:4: UTF-8 mixed within Windows-31J source
\u1234
-e:2: syntax error, unexpected end-of-input, expecting literal content or terminator or tSTRING_DBEG or tSTRING_DVAR
Updated by nobu (Nobuyoshi Nakada) over 4 years ago
Would you commit that patch by yourself?
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
nobu (Nobuyoshi Nakada) wrote:
Would you commit that patch by yourself?
Assuming matz approves a commit bit for me at the next developer meeting, I would be happy to.
Updated by jeremyevans (Jeremy Evans) over 4 years ago
- Status changed from Open to Closed
Applied in changeset git|c05eaa93258ddc01e685b6cc3a0da82998a2af48.
Fix mixed encoding in heredoc
Heredocs are parsed line-by-line, so we need to keep track of the
temporary encoding of the string. Previously, a heredoc would
only detect mixed encoding errors if they were on the same line,
this changes things so they will be caught on different lines.
Fixes [Bug #15839]
Updated by nagachika (Tomoyuki Chikanaga) over 4 years ago
- Backport changed from 2.4: REQUIRED, 2.5: REQUIRED, 2.6: REQUIRED to 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE
ruby_2_6 r67724 merged revision(s) 6375c68f8851e1e0fee8a95afba91c4555097127,c05eaa93258ddc01e685b6cc3a0da82998a2af48.
Updated by usa (Usaku NAKAMURA) over 4 years ago
- Backport changed from 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE to 2.4: REQUIRED, 2.5: DONE, 2.6: DONE
ruby_2_5 r67763 merged revision(s) 6375c68f8851e1e0fee8a95afba91c4555097127,c05eaa93258ddc01e685b6cc3a0da82998a2af48.