Hi Adam,

Yes, I am aware of the utf8 library; I was puzzled by the appearance of SYN.

The SYN character is not in the strings themselves. It shows up when they
are printed, apparently as a stand-in for bytes that cannot be displayed on
their own. Obvious now.

Writing the strings to binary files and opening them in a hex editor reveals
that string.sub is doing exactly what is expected: it splits on bytes.  The
world makes sense again.

local space = string.byte(' ')
local text = utf8.char(0x92e,0x947,0x930,0x93e, space, 0x928, 0x93e, 0x92e, space, 0x932,0x942,0x905, space, 0x939,0x948, 0x964)

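-- Write str to filename in binary mode so the bytes land on disk unchanged.
local function wb(filename, str)
    local fh = assert(io.open(filename, 'wb'))  -- fail loudly if open fails
    fh:write(str)
    fh:close()
end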

local split = 1
local lh = string.sub(text, 1, split)
local rh = string.sub(text, split+1)

wb('text', text)
wb('lh', lh)
wb('rh', rh)

-- text: E0 A4 AE E0 A5 87 E0 A4 B0 E0 A4 BE 20 E0 A4 A8 E0 A4 BE E0 A4 AE 20 E0 A4 B2 E0 A5 82 E0 A4 85 20 E0 A4 B9 E0 A5 88 E0 A5 A4

-- lh: E0
-- rh: A4 AE E0 A5 87 E0 A4 B0 E0 A4 BE 20 E0 A4 A8 E0 A4 BE E0 A4 AE 20 E0 A4 B2 E0 A5 82 E0 A4 85 20 E0 A4 B9 E0 A5 88 E0 A5 A4
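
The same bytes can be checked without a hex editor. This is only a minimal
sketch: the helper name hex is made up for illustration, and it assumes
Lua 5.3 for utf8.char.

```lua
-- Hypothetical helper: show every byte of s as two hex digits.
local function hex(s)
    local out = {}
    for i = 1, #s do
        out[i] = string.format('%02X', s:byte(i))
    end
    return table.concat(out, ' ')
end

local sample = utf8.char(0x92e, 0x947)   -- first two code points of text
print(hex(sample))                       -- E0 A4 AE E0 A5 87
print(hex(string.sub(sample, 1, 1)))     -- E0
print(hex(string.sub(sample, 2)))        -- A4 AE E0 A5 87
```

Because the output is plain ASCII hex, the terminal has nothing to
substitute.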
 
0x92e           0000 100100 101110    (code point, 16 bits)
0xE0            1110 0000             (lead byte,    1110xxxx)
0xA4            10 100100             (continuation, 10xxxxxx)
0xAE            10 101110             (continuation, 10xxxxxx)
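
The bit layout of a three-byte UTF-8 sequence can also be checked in code.
This is a rough sketch, not how the utf8 library is implemented: encode3 is
a hypothetical name, it uses the Lua 5.3 bitwise operators, and it handles
only the three-byte range without rejecting surrogates.

```lua
-- Hypothetical helper: encode a code point in U+0800..U+FFFF as 3 bytes.
local function encode3(cp)
    assert(cp >= 0x800 and cp <= 0xFFFF, 'three-byte range only')
    local b1 = 0xE0 | (cp >> 12)           -- 1110xxxx: top 4 bits
    local b2 = 0x80 | ((cp >> 6) & 0x3F)   -- 10xxxxxx: middle 6 bits
    local b3 = 0x80 | (cp & 0x3F)          -- 10xxxxxx: low 6 bits
    return string.char(b1, b2, b3)
end

local enc = encode3(0x92e)
print(string.format('%02X %02X %02X', enc:byte(1, -1)))  -- E0 A4 AE
print(enc == utf8.char(0x92e))                           -- true
```

A split after byte 1 therefore leaves a lone lead byte on one side and two
orphaned continuation bytes on the other; neither is valid UTF-8 by itself.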

Todd

On Fri, Jan 15, 2016 at 9:49 PM, Coda Highland <chighland@gmail.com> wrote:
On Fri, Jan 15, 2016 at 7:44 PM, Todd Wegner <twwegner@gmail.com> wrote:
> I would like to understand why the following code produces SYN characters
> (0x16) in Lua53 on Linux.
> The SYNs occur whenever split falls inside a multi-byte character.
> Why does string.sub return SYN rather than the respective bytes?
>
>
> Code:
>
> local space = string.byte(' ')
> local text = utf8.char(0x92e,0x947,0x930,0x93e, space, 0x928, 0x93e, 0x92e,
> space, 0x932,0x942,0x905, space, 0x939,0x948, 0x964)
>
> local split = 1
> local lh = string.sub(text, 1, split)
> local rh = string.sub(text, split+1)
>
> print('text', text)
> print('lh', lh)
> print('rh', rh)
>
>
> Output:
>
> text    मेरा नाम लूअ है।
> lh    म SYN SYN
> rh    SYN रा नाम लूअ है।
>
> Thanks

string.sub is not UTF-8 aware. It operates on byte strings, not
Unicode character strings.

Look at the utf8 module for Unicode-aware functionality.
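
One possible way to split on character boundaries is utf8.offset, which
maps a character index to a byte index. A minimal sketch, assuming Lua 5.3;
split_chars is a made-up name, not part of any library:

```lua
-- Hypothetical helper: split s after its first n code points, not bytes.
local function split_chars(s, n)
    local pos = utf8.offset(s, n + 1)   -- byte index where character n+1 starts
    if not pos then return s, '' end    -- s has fewer than n characters
    return string.sub(s, 1, pos - 1), string.sub(s, pos)
end

local text = utf8.char(0x92e, 0x947, 0x930, 0x93e)
local lh, rh = split_chars(text, 1)
print(#lh, utf8.len(lh))   -- 3    1
print(#rh, utf8.len(rh))   -- 9    3
```

Note that this counts code points, not what a reader sees as one character:
splitting after 0x92e still separates the vowel sign 0x947 from its
consonant, so both halves are valid UTF-8 but may render differently.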

/s/ Adam