sed/tests/sed-reports-syntax-errors-with-some-multibyte/PURPOSE
2017-09-25 16:43:53 -05:00

PURPOSE of /CoreOS/sed/Regression/sed-reports-syntax-errors-with-some-multibyte
Description: Test for sed reports syntax errors with some multibyte
Author: Marek Polacek <mpolacek@redhat.com>
Bug summary: sed reports syntax errors with some multibyte characters
Description:
Description of problem:
Using a multibyte character that ends with 0x5c (backslash) can cause sed to report syntax errors.
Version-Release number of selected component (if applicable): sed-4.1.5-5
How reproducible: Always
Steps to Reproduce:
1. Start with your shell in a UTF-8 locale, e.g. en_US.UTF-8 (this can probably be done from a different locale, but it definitely works from a UTF-8 locale).
2. Run the following commands to construct a shell script that invokes sed:
U2010=$(echo -ne '\x20\x10' | iconv -f ucs-2be)
echo "echo '$U2010' | sed 's/$U2010/hyphen/g'" | iconv -t gbk > /tmp/script
3. Run the shell script in a locale that uses the gbk character set:
LC_ALL=zh_CN.gbk sh /tmp/script 2>&1 | iconv -f gbk
Actual results:
The script reports an error:
sed：-e 表达式 #1，字符 13：unterminated `s' command
(In English: sed: -e expression #1, char 13: unterminated `s' command)
Expected results:
The single word "hyphen"
Additional info:
The error arises because the character U+2010 (HYPHEN) is encoded as \xa9\x5c in the GBK encoding. Sed treats the "\x5c" byte as a backslash that escapes the following character, which in this case is the "/" meant to terminate the pattern. The pattern is therefore never terminated, and we get a syntax error.
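The byte-level reading described above can be sketched without a GBK locale. This is only an illustration under two assumptions: the GBK bytes for U+2010 are taken from this report (0xa9 0x5c, written below as octal 251 134), and the C locale, where every byte is its own character, is used as a stand-in for the faulty multibyte handling. The variable names are made up for this example.

```shell
#!/bin/sh
# Assumption (from the report): U+2010 is 0xa9 0x5c in GBK, octal 251 134.

# Hex dump of the expression s/<U+2010>/hyphen/g as GBK bytes.
expr_hex=$(printf 's/\251\134/hyphen/g' | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "$expr_hex"
# → 73 2f a9 5c 2f 68 79 70 68 65 6e 2f 67

# Feed the same bytes to sed in the C locale (stand-in for the bug):
# the 5c byte is read as a backslash escaping the 2f ("/") after it.
script=$(printf 's/\251\134/hyphen/g')
if printf '\251\134\n' | LC_ALL=C sed "$script" >/dev/null 2>&1; then
  result="parsed"
else
  result="syntax error"
fi
echo "$result"
```

In the hex dump, the pair "5c 2f" is the backslash byte followed by the delimiter that was supposed to close the pattern. With the sed implementations I am aware of, the second part should print "syntax error", since the s command is left without its final delimiter.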
Of course, this is just one character in one encoding; there are likely to be many others. I have another example for SJIS (U+8868), but SJIS isn't a good encoding to use for reporting bugs :-).
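A rough way to check whether a given encoded character is a candidate for this problem is to look at its last byte. The helper below is hypothetical (its name is not part of sed or of this test) and only inspects raw bytes, so it cannot tell a trailing byte of a multibyte character from a genuine ASCII backslash. The SJIS bytes for U+8868 (0x95 0x5c) are stated from general knowledge of that encoding, not from this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the bytes on stdin end in 0x5c.
ends_in_backslash_byte() {
  last=$(tail -c 1 | od -An -tx1 | tr -d ' \n')
  [ "$last" = "5c" ]
}

# U+2010 in GBK, per the report: a9 5c (octal 251 134).
if printf '\251\134' | ends_in_backslash_byte; then
  echo "GBK U+2010: trailing 0x5c"
fi

# U+8868 in SJIS: 95 5c (octal 225 134); an assumption added here.
if printf '\225\134' | ends_in_backslash_byte; then
  echo "SJIS U+8868: trailing 0x5c"
fi

# A plain ASCII letter, for contrast.
if ! printf 'x' | ends_in_backslash_byte; then
  echo "ASCII x: no trailing 0x5c"
fi
```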