On Tuesday, September 10, 2002, at 09:29 am, Reuben Thomas wrote:
<OT>To think MS chose XML as the native format for Office 11... Unless they always compress it, I expect the files to bloat excessively. Hard disk is cheap these days, but I still have a 6GB HD on my personal computer, and applications already eat a lot of it... </OT> -- See, even mailing list/newsgroup notations are infected :-)
I'd far rather they did that than the binary brain dump they usually choose. Binary files are dreadful for forwards, backwards, and external compatibility.
That's really just a question of how you encode XML. If you store it as ASCII, yes, it will bloat (but Gnumeric does this, and it compresses well: a spreadsheet I have is ~120Kb of bloated XML, but only 2.5Kb when gzipped).
Compressed XML is an efficient storage format for general data. The repeated tags are exactly what zip-like compressors exploit. I'd imagine that compressed XML could be read from disk and decompressed a lot faster than the uncompressed XML could be read, since there is so much less data to fetch.
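To make the point about repeated tags concrete, here is a rough sketch in Python. The element names and the build_sheet_xml helper are made up for illustration and are not Gnumeric's actual format; it just generates tag-heavy XML and compares its size before and after gzip.

```python
import gzip


def build_sheet_xml(rows: int, cols: int) -> bytes:
    """Build a toy spreadsheet as verbose ASCII XML (hypothetical schema)."""
    cells = []
    for r in range(rows):
        for c in range(cols):
            # Every cell repeats the same tag and attribute names.
            cells.append(
                f'<cell row="{r}" col="{c}"><value>{r * c}</value></cell>'
            )
    return ("<sheet>" + "".join(cells) + "</sheet>").encode("ascii")


def compression_ratio(data: bytes) -> float:
    """Return original size divided by gzipped size."""
    return len(data) / len(gzip.compress(data))


raw = build_sheet_xml(200, 10)
packed = gzip.compress(raw)

# The round trip is lossless, so nothing is given up for the smaller file.
assert gzip.decompress(packed) == raw

print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes, "
      f"ratio: {compression_ratio(raw):.1f}x")
```

The exact ratio depends on the data, so treat the printed numbers as indicative only. Real documents with more varied cell contents will compress less well than this toy one.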
I recall hearing about research into processing directly on entropy-compressed (zip-like) files. With that approach, it may be possible to parse and search directly on a compressed file, giving tight storage and fast processing.
Cheers, Benjohn