lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


> -----Message d'origine-----
> De : lua-bounces@bazar2.conectiva.com.br [mailto:lua-
> bounces@bazar2.conectiva.com.br] De la part de Roberto Ierusalimschy
> You do not need anything specific to sort an array. The order in the
> array is instrinsic to the keys (1,2,etc.); it has nothing to do with
> the implementation. When sorting, you know that the smaller element
> must
> be at index 1, no matter where/how this index is represented
> internally.

(Sorry, this is going to be a bit long)

Yes, this is true for a vector. Sorting is the action of shuffling the values inside the internal implementation so that whatever iterating algorithm we use produces keys and their associated values in the desired order. It happens that for the vector, the iterating algorithm is simple enough: go over all indexes from beginning to end.

But in Lua we have hashes. A hash hashes its keys to produce indexes. These indexes address a vector of lists where colliding keys are stored. Iterating over the hash entries with pairs() scans this vector of lists in whatever order it chooses. When inserting a new key in the hash, a new index is computed, which has nothing to do with any kind of ordering, and the entry is appended to the corresponding list.

Now comes the sorting problem. How could one coerce Lua to update the internal storage so that the available iteration algorithm produces a user-defined sorted output if the scalar-keys-dedicated vector part was absent from the implementation? One can hardly move some entry from one collision list to another, since this would break table indexing because the computed index from the key wouldn't correspond to the collision list where the entry is located after the sort.

I can see 3 solutions to achieve this:

A - Restrict the problem to scalar keyed hashes, and cause the hash function to hash scalar keys to themselves. Sorting the hash then just moves values around while keeping the hash entries in place internally, and the iterating algorithm will scan all keys according to their natural order. Iterating with ipairs() can then scan the hash entries like pairs() does, and stop at the first non-scalar key. The only problem is memory usage when indexing with large scalar keys...

B - Restrict the problem to scalar keyed hashes, and cause (some of the) entries to be stored in a separate vector part in the implementation, as decided by some vector sparsity threshold. This is nothing more than a special case of the above, with a dedicated storage implementation. Of course, regular indexing has to take this separate storage into account. Sorting only applies to this storage area, as well as iterating with ipairs(), which stops at the first nil value in the vector part.

C - Have yet another internal data structure in the table implementation to manage sorting. Iterating facilities exposed to the language make use of this information when it is available, else they directly iterate over storage implementation. Sorting functions could then be applied to keys *or* values, as opposed to only values in the current implementation (where keys are known since they are scalars). This has the drawback of having to do some bookkeeping on this sorting data on table write accesses, the simplest being that any write access to the table flushes the sorting data :-).

So, I agree that with sorting only made available to scalar-keyed tables, and used on tables without hole, the script cannot decide which underlying solution is implemented between the above 3. But throw holes in the mix, and this feature of having operator # counting to any one of them in an undefined manner when the contents change strongly reeks of B:

> for i=1,10 do t[i]=i end
> t[8] = nil
> t[3] = nil
> print (#t)
7

Which, again, IMO, makes using operator # on tables pretty useless on tables with holes. Therefore in practice I restrict its usage to scalar-keyed tables without holes, and its meaning then becomes "the number of entries in the table". Wherefore I would vote for an implementation of C, and operator # simply returning the number of entries in a table, whatever the keys might be. Note that this doesn't go against the current vector storage implementation for scalar keyed entries, has this certainly has memory advantages when dealing with such tables containing a huge amount of data. But it eliminates the special handling this vector gets from operator # and ipairs(). 5.2 is the perfect time for such a change :D.


Benoit.