Hi all.

The site http://www2.camara.gov.br/glossario/ has a lot of definitions that I want to save to a text file.

I installed LuaSocket today, and I thought: "Why not do this in Lua?"

The keywords in the HTML source look like this:
<H4 class=sessaoPagina>Abertura de crédito adicional</H4>
And the definitions look like this:

<TD>some word here</TD>

However, I found that in some cases my patterns match more definitions than keywords, or more keywords than definitions.

I made a little test: a Lua script that prints the number of keywords and definitions found on each page.

The results are below (the first column is the letter, the second the number of keywords, and the third the number of definitions):

a - 60 - 60
b - 11 - 11
c - 101 - 96
d - 58 - 57
e - 64 - 64
f - 10 - 9
g - 8 - 8
h - 1 - 1
i - 36 - 35
j - 4 - 4
l - 32 - 33
m - 20 - 19
n - 15 - 15
o - 40 - 40
p - 105 - 103
q - 9 - 9
r - 73 - 70
s - 58 - 56
t - 18 - 18
u - 10 - 10
v - 16 - 16
z - 1 - 1

The source is:

____________

http = require("socket.http")
--print(http)
--table.foreach(http, print)

local letters={'a','b','c','d','e','f','g','h','i','j','l','m','n','o','p','q','r','s','t','u','v','z'}

for x, element in letters do
    local words={}
    local definitions={}
    local page = http.request("http://www2.camara.gov.br/glossario/" .. element ..".html")
    for w in string.gfind(page, "<H4 class=sessaoPagina>(.[^H4]+)<\/H4>") do
        table.insert(words,w)
    end

    for q in string.gfind(page, "<TD>(.[^H4]+)</TD>") do
        table.insert(definitions,q)
    end

    print(element .." - " .. table.getn(words) .. " - " .. table.getn(definitions))
end
__________________________________
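
One thing that may be worth checking, offered only as a guess: in a Lua pattern, [^H4] is a character class meaning "any character except H or 4", not "anything except the string H4". So a keyword or definition that contains an H or a 4 after its first character would be skipped. The sketch below uses made-up sample markup (not copied from the real pages) to compare that class with the lazy capture (.-). The count helper and the sample strings are hypothetical, and string.gmatch is the Lua 5.1+ name for string.gfind. I am not sure this explains every difference in the table.

```lua
-- string.gfind (Lua 5.0) was renamed string.gmatch in Lua 5.1+.
local gmatch = string.gmatch or string.gfind

-- Hypothetical sample markup, invented for this test only.
local sample = "<H4 class=sessaoPagina>Abertura de credito</H4><TD>first definition</TD>"
  .. "<H4 class=sessaoPagina>Sessao Hipotetica</H4><TD>second definition</TD>"
  .. "<H4 class=sessaoPagina>Lei 4320</H4><TD>art. 4 definition</TD>"

-- Count how many times a pattern matches in a text.
local function count(text, pattern)
  local n = 0
  for _ in gmatch(text, pattern) do
    n = n + 1
  end
  return n
end

-- The class [^H4] stops at any 'H' or '4', so some entries never match.
local class_words = count(sample, "<H4 class=sessaoPagina>(.[^H4]+)</H4>")
local class_defs  = count(sample, "<TD>(.[^H4]+)</TD>")

-- The lazy capture (.-) takes the shortest run up to the closing tag.
local lazy_words = count(sample, "<H4 class=sessaoPagina>(.-)</H4>")
local lazy_defs  = count(sample, "<TD>(.-)</TD>")

print("class: " .. class_words .. " - " .. class_defs)  -- class: 1 - 2
print("lazy:  " .. lazy_words .. " - " .. lazy_defs)    -- lazy:  3 - 3
```

With this sample the class-based patterns miss the entries that contain an H or a 4, while the lazy patterns find all three of each. Whether the real pages behave the same way is an assumption.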

If someone can help me see where the error is, or show me how to write a better pattern, I will be very grateful.

(Sorry for my very poor English :P)
[]'s
- Walter