2013/3/25 Chris Datfung <chris.datfung@gmail.com>:
> With some help from the list, I wrote a script that reads in various
> strings from a file containing various payloads and generates SVM data for
> further classification. The idea is that you specify how many grams should
> be used when breaking up the payload, then the script creates a table with
> all possible grams for the parameter specified based on ASCII characters
> 32-126. The script then updates the hash value in the table holding all
> possible grams with the number of times that specific gram appears in the
> payload file. The script works fine when I specify to use two grams, but
> when I move to three grams the script uses up all the CPU on my machine and
> runs out of memory as well. The script is listed below. My question is how
> can I improve the script to use less CPU/memory and still be able to track
> the number of grams a payload has when the number of grams is greater than
> two?

The problem is that the number of ways to split up an integer grows very
fast, even if you restrict the size of the parts, and your `GenerateNGrams`
creates a non-nil table entry for every one of them. That is what eats
the memory.
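
To get a feel for the sizes involved, here is a rough count for the setup
described in the question. It assumes the grams are drawn from the 95
characters with ASCII codes 32-126; `ALPHABET_SIZE` and `PossibleGrams` are
names made up for this illustration, not part of the original script.

```lua
-- Assumption: grams are built from the characters with ASCII codes 32..126.
local ALPHABET_SIZE = 126 - 32 + 1   -- 95 characters

-- Number of distinct grams of length n over that alphabet.
local function PossibleGrams(n)
  return ALPHABET_SIZE ^ n
end

for n = 1, 4 do
  -- %.0f prints the value without a fractional part on any Lua version.
  print(n, string.format("%.0f", PossibleGrams(n)))
end
```

Under that assumption, two grams give 9025 possible keys and three grams
give 857375, so pre-filling the table gets roughly 95 times more expensive
with each extra character.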

Most of them will not be encountered in an input file of reasonable
size.  Lua is tailor-made for this situation, with its default value of
nil for something that has never been assigned to.

Your program logic should be:

1. Write a routine "GramKey(...)" that takes a line from the input file
   and converts it to a table key. There is not much to be gained by
   coding to ASCII characters. You may as well encode 1,2,0,4,3,1 by
   "1,2,0,4,3,1" as by "BCAEDB", since Lua interns short strings like these.

2. GramCounter = {}

3. for line in io.lines(PayloadFile) do
      local Gram = GramKey(line)
      GramCounter[Gram] = (GramCounter[Gram] or 0) + 1
   end

That way, a combination that did not occur will not have a physical
table entry at all.
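
The three steps above could be put together roughly as follows. This is only
a sketch: it assumes each input line holds one gram written as
comma-separated integers, and `CountGrams` and `iter` are helper names
invented here so the example runs without a payload file. The real `GramKey`
depends on what your lines actually look like.

```lua
-- Hypothetical GramKey: turn one input line into a table key.
-- Assumption: a line such as "1, 2,0" names one gram; whitespace is
-- stripped so that equal grams always map to the same key string.
local function GramKey(line)
  return (line:gsub("%s+", ""))   -- parentheses drop gsub's second result
end

-- Count how often each key occurs. `lines` is any iterator that yields
-- strings, for example io.lines(PayloadFile).
local function CountGrams(lines)
  local GramCounter = {}
  for line in lines do
    local Gram = GramKey(line)
    -- A key that was never seen reads as nil, so start from 0.
    GramCounter[Gram] = (GramCounter[Gram] or 0) + 1
  end
  return GramCounter
end

-- In-memory stand-in for io.lines, used only for this demonstration.
local function iter(list)
  local i = 0
  return function()
    i = i + 1
    return list[i]
  end
end

local counts = CountGrams(iter({ "1,2,0", "1, 2, 0", "4,3,1" }))
print(counts["1,2,0"])   -- 2
print(counts["4,3,1"])   -- 1
print(counts["9,9,9"])   -- nil: never seen, so no entry exists
```

With a real file you would call CountGrams(io.lines(PayloadFile)) instead.
When you later write out the SVM data, treat a nil lookup as a count of 0.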

HTH.

Dirk