On Fri, Oct 3, 2014 at 10:24 PM, Hisham <h@hisham.hm> wrote:
> On 3 October 2014 15:13, Alexander Gladysh <agladysh@gmail.com> wrote:
>> Hi, Geoff, all,
>>
>> Basically, the discussion that day strayed more to the big-data analytics in
>> general than to Lua specifically (which is, I think, more of a good thing).
>>
>> As for Lua, we discussed that Lua (or rather LuaJIT) is a good instrument
>> for ad-hoc data pre-processing.
>>
>> At LogicEditor we (for many reasons) don't use Hadoop for the big-data
>> analysis (we have about 1TB/day of uncompressed data to analyze).
>>
>> For quick data pre-processing and analysis we use a simple combination of
>> standard Linux tools (parallel, grep, sort, uniq, cut and some awk). A
>> typical command looks something like this:
>>
>> time pv uid-time-ref-post.gz \
>> | pigz -cdp 4 \
>> | cut -d$'\t' -f 1,3 \
>> | parallel --gnu --progress -P 10 --pipe --block=16M \
>>   $(cat <<"EOF"
>>     luajit ~/url-to-normalized-domain.lua
>> EOF
>>   ) \
>> | LC_ALL=C sort -u -t$'\t' -k2 --parallel 6 -S20% \
>> | luajit ~/simple-reduce-key-counter.lua \
>> | LC_ALL=C sort -t$'\t' -nrk2 --parallel 6 -S20% \
>> | pigz -cp4 > domain-uniqs_count-www-merged.gz
>
> Just curious, but: how many cores do you have in the machine that runs this?

Something like 24 cores.

Note that this command is not tuned for peak performance; I picked it
at random from the work log.

I forgot to note two things:

1. The Lua scripts are simple "for l in io.lines() do ... end" loops.

2. The output gz file is then processed in something like R. (I would
like to stay in Lua even for that, but is there a set of Lua
libraries for data mining that could replace R here?)
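
To illustrate point 1, a reducer of that kind could look roughly like
the sketch below. This is not the actual simple-reduce-key-counter.lua:
the helper names (field, count_keys), the choice of the second
tab-separated field as the key, and the "key<TAB>count" output format
are all assumptions made for the example. It relies on the input
already being sorted on the key, so equal keys arrive as consecutive
lines and only one counter has to be kept in memory.

```lua
-- Hypothetical sketch, NOT the real simple-reduce-key-counter.lua.
-- Assumes tab-separated input that is already sorted on the key field.

-- Return the n-th tab-separated field of a line, or nil if it is missing.
local function field(line, n)
  local i = 0
  for value in (line .. "\t"):gmatch("(.-)\t") do
    i = i + 1
    if i == n then return value end
  end
  return nil
end

-- Count runs of consecutive equal keys.
-- `lines` is an iterator function (such as the one io.lines() returns);
-- `emit(key, count)` is called once per run. Lines without the key
-- field are skipped.
local function count_keys(lines, key_field, emit)
  local current, count = nil, 0
  for line in lines do
    local key = field(line, key_field)
    if key ~= nil then
      if key == current then
        count = count + 1
      else
        if current ~= nil then emit(current, count) end
        current, count = key, 1
      end
    end
  end
  if current ~= nil then emit(current, count) end
end

-- Main: read stdin, write "key<TAB>count" lines to stdout.
-- Field 2 as the key is an assumption for this example.
count_keys(io.lines(), 2, function(key, count)
  io.write(key, "\t", count, "\n")
end)
```

Such a script would sit in the pipeline right after a sort on the same
key field, since unsorted input would produce several partial counts
for one key.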

Best,
Alexander.