Split Join
split("a,b,c", ",") --> {"a", "b", "c"}
join({"a", "b", "c"}, ",") --> "a,b,c"
There are various ways to design and implement these functions in Lua, as described below.
With Lua 5.x you can use table.concat [3] for joining: table.concat(tbl, delimiter_str).
table.concat({"a", "b", "c"}, ",") --> "a,b,c"
Other interfaces are possible, largely dependent on the choice of split interface since join is often intended to be the inverse operation of split.
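One caveat when joining: table.concat raises an error if the list contains values that are neither strings nor numbers. The following is a minimal sketch of a wrapper that converts each element first; the name join and its argument order are assumptions chosen to match the examples at the top of this page, not an established API.

```lua
-- Hypothetical helper (not part of the standard library):
-- join any list of values by converting each one with tostring first.
function join(list, delimiter)
  local parts = {}
  for i = 1, #list do
    parts[i] = tostring(list[i])
  end
  return table.concat(parts, delimiter or "")
end

print(join({"a", "b", "c"}, ","))  --> a,b,c
print(join({1, true, "x"}, "-"))   --> 1-true-x
```

Plain table.concat stays the faster choice when the list is known to hold only strings and numbers.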
A split [1] function separates a string into a list of substrings, breaking the original string on occurrences of some separator (character, character set, or pattern). There are various ways to design a string split function. A summary of the design decisions is listed below.
Should split return a table array, a list, or an iterator?
split("a,b,c", ",") --> {"a", "b", "c"}
split("a,b,c", ",") --> "a","b","c" (not scalable: Lua has a limit of a few thousand return values)
for x in split("a,b,c", ",") do ..... end
Should the separator be a string, Lua pattern, LPeg pattern, or regular expression?
split("a +b c", " +") --> {"a ", "b c"}
split("a +b c", " +") --> {"a", "+b", "c"}
split("a +b c", some_other_object) --> .....
How should empty separators be handled?
split("abc", "") --> {"a", "b", "c"}
split("abc", "") --> {"", "a", "b", "c", ""}
split("abc", "") --> error
split("abc", "%d*") --> what about patterns that can evaluate to empty strings?
split(s, "") is a convenient idiom for splitting a string into characters. In Lua, we can alternatively do for c in s:gmatch"." do ..... end.
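The gmatch idiom can be wrapped to return a table. This is a small sketch; the name chars is hypothetical. Note that "." matches one byte, so a multi-byte UTF-8 character comes back as several pieces (Lua 5.3 and later provide utf8.charpattern for that case).

```lua
-- Hypothetical helper: collect the bytes of s as one-character strings.
function chars(s)
  local t = {}
  for c in s:gmatch(".") do
    t[#t + 1] = c
  end
  return t
end

print(table.concat(chars("abc"), " "))  --> a b c
print(#chars(""))                       --> 0
```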
How should empty values be handled?
split(",,a,b,c,", ",") --> {"a", "b", "c"}
split(",,a,b,c,", ",") --> {"", "", "a", "b", "c", ""}
split(",", ",") --> {} or {""} or {"", ""} ?
split("", ",") --> {} or {""} ?
join({"",""}, ""), join({""}, "") and join({}, "") all result in the same string "". Therefore, the choice of what the inverse operation split("", "") should return is not immediately clear.
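The ambiguity is easy to check with table.concat standing in for join:

```lua
-- All three inputs collapse to the same empty string, so the result
-- alone cannot tell a split function which table to reconstruct.
print(table.concat({"", ""}, "") == "")  --> true
print(table.concat({""}, "") == "")      --> true
print(table.concat({}, "") == "")        --> true

-- With a non-empty delimiter only the two-element case becomes distinguishable.
print(table.concat({"", ""}, ","))       --> ,
print(table.concat({""}, ",") == "")     --> true
print(table.concat({}, ",") == "")       --> true
```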
Should there be an argument to limit the number of splits?
split("a,b,c", ",", 2) --> {"a", "b,c"}
Should the separator be returned? This is more useful when the separator is a pattern, in which case the separator can vary:
split("a b c", " +") --> {"a", " ", "b", " ", "c"}
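A minimal sketch of a split that keeps the matched separators in the result, interleaved with the fields. The name split_keep is hypothetical, and the function simply refuses patterns that match the empty string rather than trying to handle them.

```lua
-- Hypothetical helper: split on a Lua pattern and keep the matched separators.
function split_keep(str, pat)
  local t, pos = {}, 1
  while true do
    local first, last = str:find(pat, pos)
    if not first then break end
    if last < first then
      error("separator pattern matched the empty string")
    end
    t[#t + 1] = str:sub(pos, first - 1)  -- text before the separator
    t[#t + 1] = str:sub(first, last)     -- the separator itself
    pos = last + 1
  end
  t[#t + 1] = str:sub(pos)               -- trailing text (may be "")
  return t
end

print(table.concat(split_keep("a b  c", " +"), "|"))  --> a| |b|  |c
```

Fields sit at the odd indices and separators at the even ones, so a caller that wants only the fields can step through the table by two.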
string.gmatch [4] is in a way a dual of split, returning the substrings that match a pattern and discarding strings between them rather than the other way around. A function that returns both is sometimes called partition [5].
string.gsub/string.match
Break a string up at occurrences of a single character. If the number of fields is known:
str:match( ("([^"..sep.."]*)"..sep):rep(nsep) )
If the number of fields is not known:
fields = {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))}
Some might call the above a hack :) sep will need to be escaped if it is a
pattern metacharacter, and you'd probably be better off precomputing and/or
memoizing the patterns.
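One way to do the escaping is to prefix every punctuation character with "%", which Lua patterns always read as "take the next non-alphanumeric character literally". The helper name escape_pattern below is an assumption for illustration.

```lua
-- Hypothetical helper: prefix every punctuation character with "%"
-- so that it is taken literally inside a Lua pattern.
function escape_pattern(s)
  return (s:gsub("%p", "%%%0"))
end

local sep = escape_pattern(".")   -- "%."
local str = "a.b.c."
local fields = {}
for field in str:gmatch("([^" .. sep .. "]*)" .. sep) do
  fields[#fields + 1] = field
end
print(table.concat(fields, ","))  --> a,b,c
```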
string.gsub
fields = {}
str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end)
Does not work as expected (the last field is dropped because no separator follows it):
str, sep = "1:2:3", ":"
fields = {}
str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end)
for i,k in ipairs(fields) do print(i,k) end
-- output:
-- 1 1
-- 2 2
Fix:
function string:split(sep)
  local sep, fields = sep or ":", {}
  local pattern = string.format("([^%s]+)", sep)
  self:gsub(pattern, function(c) fields[#fields+1] = c end)
  return fields
end
Example: split a string into words, or return nil
function justWords(str)
  local t = {}
  local function helper(word) table.insert(t, word) return "" end
  if not str:gsub("%w+", helper):find"%S" then return t end
end
This splits a string using the pattern sep. It calls func for each
segment. When func is called, the first argument is the segment and the
remaining arguments are the captures from sep, if any. On the last
segment, func will be called with just one argument. (This could be used
as a flag, or you could use two different functions). sep must not match
the empty string. Enhancements are left as an exercise :)
func((str:gsub("(.-)("..sep..")", func)))
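As a rough sketch of how this idiom can be used, the function below collects the segments and the separator text into two tables. The name split_with_seps is hypothetical. The callback must return "" so that gsub removes each consumed piece; only the unmatched tail is then left over for the final one-argument call.

```lua
-- Hypothetical helper built on the idiom above.
-- sep is a Lua pattern that must not match the empty string.
function split_with_seps(str, sep)
  local segments, seps = {}, {}
  local function collect(segment, matched)
    segments[#segments + 1] = segment
    -- matched is nil on the final call, which receives only the last segment
    if matched then seps[#seps + 1] = matched end
    return ""
  end
  collect((str:gsub("(.-)(" .. sep .. ")", collect)))
  return segments, seps
end

local segments, seps = split_with_seps("a, b;c", "[,;] *")
print(table.concat(segments, "|"))  --> a|b|c
print(#seps)                        --> 2
```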
Example: Split a string into lines separated by either DOS or Unix line endings, creating a table out of the results.
function lines(str)
  local t = {}
  local function helper(line) table.insert(t, line) return "" end
  helper((str:gsub("(.-)\r?\n", helper)))
  return t
end
The problem with using gsub as above is that it can't handle the case when the separator pattern doesn't appear at the end of the string. In that case the final "(.-)" never gets to capture the end of the string, because the overall pattern fails to match. To handle that case you have to do something a little more complicated. The split function below behaves more or less like split in Perl or Python. In particular, single matches at the beginning and end of the string do not create new elements. Multiple matches in a row create empty string elements.
-- Compatibility: Lua-5.1
function split(str, pat)
  local t = {}  -- NOTE: use {n = 0} in Lua-5.0
  local fpat = "(.-)" .. pat
  local last_end = 1
  local s, e, cap = str:find(fpat, 1)
  while s do
    if s ~= 1 or cap ~= "" then
      table.insert(t,cap)
    end
    last_end = e+1
    s, e, cap = str:find(fpat, last_end)
  end
  if last_end <= #str then
    cap = str:sub(last_end)
    table.insert(t, cap)
  end
  return t
end
Example: Split a file path string into components.
function split_path(str)
  return split(str,'[\\/]+')
end
parts = split_path("/usr/local/bin")
  --> {'usr','local','bin'}
Test Cases:
split('foo/bar/baz/test','/') --> {'foo','bar','baz','test'}
split('/foo/bar/baz/test','/') --> {'foo','bar','baz','test'}
split('/foo/bar/baz/test/','/') --> {'foo','bar','baz','test'}
split('/foo/bar//baz/test///','/') --> {'foo','bar','','baz','test','',''}
split('//foo////bar/baz///test///','/+') --> {'foo','bar','baz','test'}
split('foo','/+') --> {'foo'}
split('','/+') --> {}
split('foo','') -- oops! infinite loop!
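The infinite loop in the last test case comes from a separator that matches the empty string. A cheap guard, sketched here under the hypothetical name matches_empty, is to try the pattern against "" before splitting. It is not complete: a pattern that matches nothing only in certain positions, such as the frontier pattern "%f[%w]", is not detected this way.

```lua
-- Hypothetical guard: report whether a separator pattern matches the
-- empty string. Catches "" and patterns like "%d*", but not every
-- pattern that can produce a zero-length match inside a longer string.
function matches_empty(pat)
  return string.find("", pat) ~= nil
end

print(matches_empty(""))     --> true
print(matches_empty("%d*"))  --> true
print(matches_empty("/+"))   --> false
```

A caller could run this check first and raise an error instead of looping forever.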
After a discussion on this topic in the mailing list, I made my own function... I took, unknowingly, a way similar to the function above, except I use gfind to iterate, and I see the single matches at beginning and end of string as empty fields. As above, multiple successive delimiters create empty string elements.
-- Compatibility: Lua-5.0 (string.gfind was renamed string.gmatch in Lua 5.1)
function Split(str, delim, maxNb)
  -- Eliminate bad cases...
  if string.find(str, delim) == nil then
    return { str }
  end
  if maxNb == nil or maxNb < 1 then
    maxNb = 0 -- No limit
  end
  local result = {}
  local pat = "(.-)" .. delim .. "()"
  local nb = 0
  local lastPos
  for part, pos in string.gfind(str, pat) do
    nb = nb + 1
    result[nb] = part
    lastPos = pos
    if nb == maxNb then
      break
    end
  end
  -- Handle the last field
  if nb ~= maxNb then
    result[nb + 1] = string.sub(str, lastPos)
  end
  return result
end
Test Cases:
ShowSplit("abc", '') --> { [1] = "", [2] = "", [3] = "", [4] = "", [5] = "" } -- No infinite loop... but garbage in, garbage out...
ShowSplit("", ',') --> { [1] = "" }
ShowSplit("abc", ',') --> { [1] = "abc" }
ShowSplit("a,b,c", ',') --> { [1] = "a", [2] = "b", [3] = "c" }
ShowSplit("a,b,c,", ',') --> { [1] = "a", [2] = "b", [3] = "c", [4] = "" }
ShowSplit(",a,b,c,", ',') --> { [1] = "", [2] = "a", [3] = "b", [4] = "c", [5] = "" }
ShowSplit("x,,,y", ',') --> { [1] = "x", [2] = "", [3] = "", [4] = "y" }
ShowSplit(",,,", ',') --> { [1] = "", [2] = "", [3] = "", [4] = "" }
ShowSplit("x!yy!zzz!@", '!', 4) --> { [1] = "x", [2] = "yy", [3] = "zzz", [4] = "@" }
ShowSplit("x!yy!zzz!@", '!', 3) --> { [1] = "x", [2] = "yy", [3] = "zzz" }
ShowSplit("x!yy!zzz!@", '!', 1) --> { [1] = "x" }
ShowSplit("a:b:i:p:u:random:garbage", ":", 5) --> { [1] = "a", [2] = "b", [3] = "i", [4] = "p", [5] = "u" }
ShowSplit("hr , br ; p ,span, div", '%s*[;,]%s*') --> { [1] = "hr", [2] = "br", [3] = "p", [4] = "span", [5] = "div" }
Many people miss Perl-like split/join functions in Lua. Here are mine:
-- Concat the contents of the parameter list,
-- separated by the string delimiter (just like in perl)
-- example: strjoin(", ", {"Anna", "Bob", "Charlie", "Dolores"})
function strjoin(delimiter, list)
  local len = getn(list)
  if len == 0 then
    return ""
  end
  local string = list[1]
  for i = 2, len do
    string = string .. delimiter .. list[i]
  end
  return string
end

-- Split text into a list consisting of the strings in text,
-- separated by strings matching delimiter (which may be a pattern).
-- example: strsplit(",%s*", "Anna, Bob, Charlie,Dolores")
function strsplit(delimiter, text)
  local list = {}
  local pos = 1
  if strfind("", delimiter, 1) then -- this would result in endless loops
    error("delimiter matches empty string!")
  end
  while 1 do
    local first, last = strfind(text, delimiter, pos)
    if first then -- found?
      tinsert(list, strsub(text, pos, first-1))
      pos = last+1
    else
      tinsert(list, strsub(text, pos))
      break
    end
  end
  return list
end
Here's my own split function, for comparison. It's largely the same as the above; not quite as DRY but (IMO) slightly cleaner. It doesn't use gfind (as suggested below) because I wanted to be able to specify a pattern for the split string, not a pattern for the data sections. If speed is paramount, it might be made faster by caching string.find as a local 'strfind' variable, as the above does.
--Written for 5.0; could be made slightly cleaner with 5.1
--Splits a string based on a separator string or pattern;
--returns an array of pieces of the string.
--(May optionally supply a table as the third parameter which will be filled with the results.)
function string:split( inSplitPattern, outResults )
  if not outResults then
    outResults = { }
  end
  local theStart = 1
  local theSplitStart, theSplitEnd = string.find( self, inSplitPattern, theStart )
  while theSplitStart do
    table.insert( outResults, string.sub( self, theStart, theSplitStart-1 ) )
    theStart = theSplitEnd + 1
    theSplitStart, theSplitEnd = string.find( self, inSplitPattern, theStart )
  end
  table.insert( outResults, string.sub( self, theStart ) )
  return outResults
end
Explode string into table with separator (moved from TableUtils):
-- explode(separator, string)
-- The separator is matched as plain text, not as a pattern.
function explode(d, p)
  assert(d ~= "", "separator must not be empty")
  local t = {}
  local ll = 1
  if #p == 1 then
    return {p}
  end
  while true do
    local s, e = string.find(p, d, ll, true) -- find the next d in the string
    if s then -- found a separator
      table.insert(t, string.sub(p, ll, s-1)) -- Save the piece before it in our array.
      ll = e + 1 -- continue just after the separator next time.
    else
      table.insert(t, string.sub(p, ll)) -- Save what's left in our array.
      break -- Break at end, as it should be, according to the lua manual.
    end
  end
  return t
end
This function uses a metatable's __index function to populate the table of split parts. This function does not try to (correctly) invert the pattern, and so really doesn't work as most string split functions do.
--[[ written for Lua 5.1
split a string by a pattern, take care to create the "inverse" pattern yourself.
default pattern splits by white space.
]]
string.split = function(str, pattern)
  pattern = pattern or "[^%s]+"
  if pattern:len() == 0 then pattern = "[^%s]+" end
  local parts = {__index = table.insert}
  setmetatable(parts, parts)
  str:gsub(pattern, parts)
  setmetatable(parts, nil)
  parts.__index = nil
  return parts
end
-- example 1
str = "no separators in this string"
parts = str:split( "[^,]+" )
print( # parts )
table.foreach(parts, print)
--[[ output:
1
1 no separators in this string
]]
-- example 2
str = " split, comma, separated , , string "
parts = str:split( "[^,%s]+" )
print( # parts )
table.foreach(parts, print)
--[[ output:
4
1 split
2 comma
3 separated
4 string
]]
This is the Python behavior:
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
>>> 'x!yy!zzz!@'.split('!')
['x', 'yy', 'zzz', '@']
>>> 'x!yy!zzz!@'.split('!', 3)
['x', 'yy', 'zzz', '@']
>>> 'x!yy!zzz!@'.split('!', 2)
['x', 'yy', 'zzz!@']
>>> 'x!yy!zzz!@'.split('!', 1)
['x', 'yy!zzz!@']
And IMHO this Lua function implements these semantics:
function string:split(sSeparator, nMax, bRegexp)
  assert(sSeparator ~= '')
  assert(nMax == nil or nMax >= 1)
  local aRecord = {}
  if self:len() > 0 then
    local bPlain = not bRegexp
    nMax = nMax or -1
    local nField, nStart = 1, 1
    local nFirst,nLast = self:find(sSeparator, nStart, bPlain)
    while nFirst and nMax ~= 0 do
      aRecord[nField] = self:sub(nStart, nFirst-1)
      nField = nField+1
      nStart = nLast+1
      nFirst,nLast = self:find(sSeparator, nStart, bPlain)
      nMax = nMax-1
    end
    aRecord[nField] = self:sub(nStart)
  end
  return aRecord
end
Note that the delimiter is treated as a plain string by default; pass a true value for bRegexp to have it interpreted as a Lua pattern instead.
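The switch between the two modes relies on the fourth argument of string.find, which turns pattern matching off. A short illustration with a separator that happens to be a pattern metacharacter:

```lua
-- plain search: "." is the literal dot
print(("a.b.c"):find(".", 1, true))  --> 2   2
-- pattern search: "." matches any character
print(("a.b.c"):find("."))           --> 1   1
-- pattern search with the dot escaped
print(("a.b.c"):find("%."))          --> 2   2
```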
Test Cases:
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
...
> for k,v in next, string.split('x!yy!zzz!@', '!') do print(v) end
x
yy
zzz
@
> for k,v in next, string.split('x!yy!zzz!@', '!', 3) do print(v) end
x
yy
zzz
@
> for k,v in next, string.split('x!yy!zzz!@', '!', 2) do print(v) end
x
yy
zzz!@
> for k,v in next, string.split('x!yy!zzz!@', '!', 1) do print(v) end
x
yy!zzz!@
function gsplit(s,sep)
  return coroutine.wrap(function()
    if s == '' then return end
    local lasti = 1
    for v,i in s:gmatch('(.-)'..sep..'()') do
      coroutine.yield(v)
      lasti = i
    end
    coroutine.yield(s:sub(lasti))
  end)
end
--same idea without coroutines
function gsplit2(s,sep)
  local lasti, done, g = 1, false, s:gmatch('(.-)'..sep..'()')
  return function()
    if done then return end
    local v,i = g()
    if s == '' then return end
    if v == nil then done = true return s:sub(lasti) end
    lasti = i
    return v
  end
end
local function test_split()
  local function test(s,sep,expect)
    local function testwith(whichsplit)
      local t = {}
      for v in whichsplit(s,sep) do t[#t+1]=v end
      local ss = table.concat(t, ',')
      print(s,ss,ss==(expect or s) and 'pass' or 'fail',#t..' slices')
    end
    testwith(gsplit)
    testwith(gsplit2)
  end
  test('', ',')
  test(',', ',')
  test('a', ',')
  test('a,b', ',')
  test('a,b,', ',')
  test(',a,b', ',')
  test(',a,b,', ',')
  test(',a,,b,', ',')
  test('a,,b', ',')
  test('asd , fgh ,; qwe, rty. ,jkl', '%s*[,.;]%s*', 'asd,fgh,,qwe,rty,,jkl')
  test('Spam eggs spam spam and ham', 'spam', 'Spam eggs , , and ham')
end
output of test_split():
pass 0 slices
pass 0 slices
, , pass 2 slices
, , pass 2 slices
a a pass 1 slices
a a pass 1 slices
a,b a,b pass 2 slices
a,b a,b pass 2 slices
a,b, a,b, pass 3 slices
a,b, a,b, pass 3 slices
,a,b ,a,b pass 3 slices
,a,b ,a,b pass 3 slices
,a,b, ,a,b, pass 4 slices
,a,b, ,a,b, pass 4 slices
,a,,b, ,a,,b, pass 5 slices
,a,,b, ,a,,b, pass 5 slices
a,,b a,,b pass 3 slices
a,,b a,,b pass 3 slices
asd , fgh ,; qwe, rty. ,jkl asd,fgh,,qwe,rty,,jkl pass 7 slices
asd , fgh ,; qwe, rty. ,jkl asd,fgh,,qwe,rty,,jkl pass 7 slices
Spam eggs spam spam and ham Spam eggs , , and ham pass 3 slices
Spam eggs spam spam and ham Spam eggs , , and ham pass 3 slices
The gsplit() above returns an iterator, so other API variants can be easily derived from it:
function iunpack(i,s,v1)
  local function pass(...)
    local v = i(s,v1)
    if v == nil then return ... end
    v1 = v -- advance the control variable so stateless iterators work too
    return v, pass(...)
  end
  return pass()
end
function split(s,sep)
  return iunpack(gsplit(s,sep))
end
function accumulate(t,i,s,v)
  for v in i,s,v do
    t[#t+1] = v
  end
  return t
end
function tsplit(s,sep)
  return accumulate({}, gsplit(s,sep))
end
I mean no disrespect, of course, but.. does anyone actually have a working split function without glitches like infinite loops, wrong matches, or error cases? Are all those "takes" of any help here? -- CosminApreutesei
Try Rici Lake's split function: LuaList:2006-12/msg00414.html -- Jörg Richter