- Subject: Re: [ANN] Lua in Moscow meetup, this Wednesday #lua #lualang #luainmoscow
- From: Alexander Gladysh <agladysh@...>
- Date: Fri, 3 Oct 2014 23:25:35 +0400
On Fri, Oct 3, 2014 at 10:24 PM, Hisham <h@hisham.hm> wrote:
> On 3 October 2014 15:13, Alexander Gladysh <agladysh@gmail.com> wrote:
>> Hi, Geoff, all,
>>
>> Basically, the discussion that day strayed more toward big-data analytics in
>> general than toward Lua specifically (which is, I think, a good thing).
>>
>> As for Lua, we discussed that Lua (or rather LuaJIT) is a good instrument
>> for ad-hoc data pre-processing.
>>
>> At LogicEditor we (for many reasons) don't use Hadoop for big-data
>> analysis (we have about 1TB/day of uncompressed data to analyze).
>>
>> For quick data pre-processing and analysis we use a simple combination of
>> standard Linux tools (parallel, grep, sort, uniq, cut and some awk). A
>> typical command looks something like this:
>>
>> time pv uid-time-ref-post.gz \
>> | pigz -cdp 4 \
>> | cut -d$'\t' -f 1,3 \
>> | parallel --gnu --progress -P 10 --pipe --block=16M \
>> $(cat <<"EOF"
>> luajit ~/url-to-normalized-domain.lua
>> EOF
>> ) \
>> | LC_ALL=C sort -u -t$'\t' -k2 --parallel 6 -S20% \
>> | luajit ~/simple-reduce-key-counter.lua \
>> | LC_ALL=C sort -t$'\t' -nrk2 --parallel 6 -S20% \
>> | pigz -cp4 > domain-uniqs_count-www-merged.gz
>
> Just curious, but: how many cores do you have in the machine that runs this?
Something like 24 cores.
Note that this command is not tuned for peak performance; I
picked it at random from the work log.
I forgot to note two things:
1. The Lua scripts are simple "for l in io.lines() do ... end" loops.
2. The output gz file is then processed in something like R. (I would
like to stay in Lua even for that, but is there a set of Lua
libraries for data mining that can replace R here?)
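
To give an idea of the shape of such scripts, here is a minimal sketch. It is not the actual url-to-normalized-domain.lua or simple-reduce-key-counter.lua. The function names (normalize_domain, count_runs), the "map"/"reduce" command-line argument, the normalization rules, and the assumption that each line is "uid<TAB>value" are all made up for illustration.

```lua
-- Hypothetical sketch, not the scripts used in the pipeline above.
-- Assumes tab-separated input lines of the form "uid<TAB>value".

-- Map step: reduce a URL to a lowercase host name without a leading "www.".
local function normalize_domain(url)
  local host = url:lower()
  host = host:gsub("^%a[%w+.%-]*://", "") -- drop the scheme, if any
  host = host:match("^[^/?#]*")           -- keep only the authority part
  host = host:gsub("^.*@", "")            -- drop userinfo, if any
  host = host:gsub(":%d*$", "")           -- drop the port, if any
  host = host:gsub("^www%.", "")          -- merge "www." with the bare domain
  return host
end

-- Reduce step: count runs of consecutive lines sharing the same key.
-- The key is assumed to be the second tab-separated field, and the input
-- is assumed to be sorted by that key already.
local function count_runs(lines, emit)
  local current, count = nil, 0
  for line in lines do
    local key = line:match("^[^\t]*\t([^\t]*)") or line
    if key == current then
      count = count + 1
    else
      if current ~= nil then emit(current, count) end
      current, count = key, 1
    end
  end
  if current ~= nil then emit(current, count) end
end

-- Driver: runs only when a mode is given on the command line.
local mode = arg and arg[1]
if mode == "map" then
  for l in io.lines() do
    local uid, url = l:match("^([^\t]*)\t(.*)$")
    if uid then io.write(uid, "\t", normalize_domain(url), "\n") end
  end
elseif mode == "reduce" then
  count_runs(io.lines(), function(key, n) io.write(key, "\t", n, "\n") end)
end
```

Saved as, say, sketch.lua (a hypothetical file name), it would slot into a pipeline as "luajit sketch.lua map" before the first sort and "luajit sketch.lua reduce" after it. The reducer keeps only one key and one counter in memory, which is why it relies on sorted input.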
Best,
Alexander.