sed/tests/sed-reports-syntax-errors-with-some-multibyte/PURPOSE

2017-09-25 21:43:53 +00:00
PURPOSE of /CoreOS/sed/Regression/sed-reports-syntax-errors-with-some-multibyte
Description: Test for sed reports syntax errors with some multibyte
Author: Marek Polacek <mpolacek@redhat.com>
Bug summary: sed reports syntax errors with some multibyte characters
Description:
Description of problem:
Using a multibyte character that ends with 0x5c (backslash) can cause sed to report syntax errors.
Version-Release number of selected component (if applicable): sed-4.1.5-5
How reproducible: Always
Steps to Reproduce:
1. Start with your shell in a UTF-8 locale, e.g. en_US.UTF-8 (you can probably do this in a different locale, but it definitely works if you start in a UTF-8 locale).
2. Run the following commands to construct a sed script:
U2010=$(echo -ne '\x20\x10' | iconv -f ucs-2be)
echo "echo '$U2010' | sed 's/$U2010/hyphen/g'" | iconv -t gbk > /tmp/script
3. Run the shell script in a locale that uses the gbk character set:
LC_ALL=zh_CN.gbk sh /tmp/script 2>&1 | iconv -f gbk
Actual results:
The script reports an error:
sed:-e 表达式 #1,字符 13:unterminated `s' command
(in English: sed: -e expression #1, char 13: unterminated `s' command)
Expected results:
The single word "hyphen"
Additional info:
The error arises because the character U+2010 (HYPHEN) is encoded as \xa9\x5c in the gbk encoding. sed treats the trailing "\x5c" byte as a backslash escaping the following character, which in this case is the "/" that is meant to terminate the pattern. The pattern is therefore never terminated, and we get a syntax error.
Of course, this is just one character in one encoding; there are likely to be many others. I have another example for SJIS (U+8868), but SJIS isn't a good encoding to use for reporting bugs :-).
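
The byte-level claim can be checked without involving sed at all. The following is a minimal sketch, assuming an iconv that knows the GBK and SHIFT_JIS encoding names (glibc's does); the helper enc_hex and the variable names are hypothetical and used only for illustration. It prints the encoded bytes of each character in hex so that the trailing 5c (ASCII backslash) is visible.

```shell
#!/bin/sh
# Hypothetical helper: print UTF-8 text $2, converted to encoding $1,
# as a string of hex digits (one pair per byte).
enc_hex() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hyphen=$(printf '\342\200\220')     # U+2010 HYPHEN as UTF-8 (e2 80 90)
sjis_char=$(printf '\350\241\250')  # U+8868 as UTF-8 (e8 a1 a8)

enc_hex GBK "$hyphen"               # the second byte should be 5c
enc_hex SHIFT_JIS "$sjis_char"      # the second byte should be 5c as well
```

If the last two hex digits are 5c, the character ends in the same byte value as an ASCII backslash in that encoding, which is the condition described above.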