lua-users home
lua-l archive



Wrong: attributes are typically enclosed in quotation marks, and the link's anchor tag is followed by text that can itself be "MP3", so you would match too much.
The URL can use various extensions (including with variable letter case). Also note that the dot pattern does match spaces. And .mp3 links can also contain non-ASCII characters (not necessarily URL-encoded), and you cannot safely guess which text encoding is used in the path or query string of the URL, as it is not necessarily the same as the encoding of the HTML page itself (including when it is URL-encoded to become ASCII).
URLs are designed to be opaque for most purposes, except that HTTP(S) URLs are designed to be "hierarchical" and give special treatment only to "/" and to ".." relative references; "." alone is supported by the target filesystems on which the web server is installed, but it is not needed for HTTP(S), which defines its own filesystem space with web semantics, not the local-OS semantics of the server. Besides that, path elements in HTTP are opaque binary: they do not support any control characters or whitespace that have not been URL-encoded in %xx hexadecimal form, and they carry no semantics of their own, since that is conveyed separately in MIME type headers.
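To make the over-matching and letter-case points concrete, here is a minimal sketch using plain Lua string patterns. The HTML fragment and the example.com URLs are made up for illustration; they do not come from the thread.

```lua
-- Hypothetical HTML fragment, invented for this illustration.
local html = '<a href="http://example.com/a.html">see b.mp3</a> '
          .. '<a href="http://example.com/c.MP3">C</a>'

-- "." matches any character, including spaces, quotes and ">", so the
-- match runs from the first "http://" up to the ".mp3" in the link text.
local greedy = html:match("http://.+%.mp3")
print(greedy)   -- http://example.com/a.html">see b.mp3

-- %S stops at whitespace, but the pattern is case-sensitive, so the
-- only real MP3 link (".MP3") is not found at all.
local nospace = html:match("http://%S+%.mp3")
print(nospace)  -- nil

-- Spelling out both letter cases finds the link, but this is still
-- only a heuristic on the raw text, not URL parsing.
local anycase = html:match("http://%S+%.[mM][pP]3")
print(anycase)  -- http://example.com/c.MP3
```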
Once again, you need to stick to the URI RFC. Don't reinvent the wheel: there are already tons of URL parsers. Also, many MP3s on the web are never accessible through a URL ending with ".mp3" in its path or query string, and a URL ending in ".mp3" may just as well return no valid MP3 at all but plain HTML, plain text or some other file format (as properly indicated by the MIME type header and the HTTP status code).
So please read the RFC! https://tools.ietf.org/html/rfc3986
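As a rough illustration of what the RFC asks for, the sketch below splits a URI reference into the five components described in RFC 3986, Appendix B (scheme, authority, path, query, fragment). The function name `split_uri` and the example URLs are hypothetical, and the code does no validation or normalization; an existing URL parser remains the better choice.

```lua
-- Split a URI reference into the components of RFC 3986, Appendix B.
-- Hypothetical helper for illustration; absent components stay nil.
local function split_uri(uri)
  local parts, rest = {}, uri
  -- The fragment is everything after the first "#".
  local i = rest:find("#", 1, true)
  if i then parts.fragment, rest = rest:sub(i + 1), rest:sub(1, i - 1) end
  -- The query is everything after the first "?" that precedes the fragment.
  i = rest:find("?", 1, true)
  if i then parts.query, rest = rest:sub(i + 1), rest:sub(1, i - 1) end
  -- The scheme is a leading run of characters other than ":/?#", then ":".
  local scheme, after = rest:match("^([^:/?#]+):(.*)$")
  if scheme then parts.scheme, rest = scheme, after end
  -- The authority follows "//" and ends at the next "/".
  local authority, path = rest:match("^//([^/]*)(.*)$")
  if authority then parts.authority, rest = authority, path end
  parts.path = rest
  return parts
end

local p = split_uri("http://example.com/get?file=a.mp3#top")
print(p.scheme, p.authority, p.path, p.query, p.fragment)

local q = split_uri("HTTP://example.com/Song.MP3")
-- Test the extension on the path only, ignoring letter case.
print(q.path:lower():match("%.mp3$"))
```

In the first URL the ".mp3" sits in the query string and the path is just "/get", which a text pattern over the whole URL cannot tell apart. Even a path ending in ".mp3" says nothing certain about the content, which only the MIME type header of the response can settle.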
Then look at HTML references to see how these URLs are further re-encoded by another layer in HTML, applying additional escaping when needed (like character references "&name;", "&#numericDecimal;" or "&#xnumericHexadecimal;"), using Unicode code points independently of the encoding in the URI itself. Three distinct encoding layers are applied to encapsulate the actual resource names, two of them being standard, but one of them being resolved only on the server side.
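The two standard layers can be sketched as follows, assuming Lua 5.3 or later (for `utf8.char`). The helper names `html_unescape` and `percent_decode` are hypothetical, and only five named character references are handled, whereas HTML defines many more. The third layer, the text encoding of the decoded bytes, cannot be resolved here because only the server knows it.

```lua
-- Layer 1 (HTML): resolve character references in an attribute value.
-- Only a few named references are listed; real HTML defines many more.
local named = { amp = "&", lt = "<", gt = ">", quot = '"', apos = "'" }

local function html_unescape(s)
  -- A single pass, so that decoded text is never decoded a second time.
  return (s:gsub("&(#?%w+);", function(ref)
    local hex = ref:match("^#[xX](%x+)$")
    if hex then return utf8.char(tonumber(hex, 16)) end
    local dec = ref:match("^#(%d+)$")
    if dec then return utf8.char(tonumber(dec)) end
    return named[ref]  -- nil leaves an unknown reference untouched
  end))
end

-- Layer 2 (URI): turn %xx escapes back into raw bytes. The result is a
-- byte string whose text encoding is not known to the client.
local function percent_decode(s)
  return (s:gsub("%%(%x%x)", function(h)
    return string.char(tonumber(h, 16))
  end))
end

-- Hypothetical attribute value as it might appear in an HTML page.
local attr = "http://example.com/a%20b.mp3?x=1&amp;y=&#233;"
local url = html_unescape(attr)  -- "&amp;" becomes "&", "&#233;" becomes U+00E9 in UTF-8
print(url)
print(percent_decode("/a%20b.mp3"))  -- path bytes with a literal space
```

The order matters: character references are resolved first because they belong to the HTML layer, and percent-decoding is applied afterwards to individual URI components rather than to the whole URL.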



On Wed, 25 Dec 2019 at 23:22, nobody <nobody+lua-list@afra-berlin.de> wrote:

> On 25. Dec 2019, at 22:44, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
>
> It matches too many things […]

When dealing with _valid HTML,_ heuristically, any string that starts with 'http://', ends with '.mp3' and doesn't contain spaces is almost certainly exactly a URL pointing at (something that claims to be) an MP3. (The other pattern works, too.) [So a somewhat better pattern than what I initially suggested would be "http://%S+%.mp3" – also excluding line breaks.]
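For the one-off extraction described here, the suggested pattern can be applied with `string.gmatch`. This is only a sketch: the HTML snippet and the example.com URLs are invented for illustration, and the result is meant to be reviewed by hand.

```lua
-- Hypothetical page content, invented for this illustration.
local html = [[
<a href="http://example.com/one.mp3">One</a>
<a href="http://example.com/two.mp3">Two</a> <a href="http://example.com/page.html">Page</a>
]]

-- Collect every run that starts with "http://", contains no whitespace
-- and ends with ".mp3".
local links = {}
for url in html:gmatch("http://%S+%.mp3") do
  links[#links + 1] = url
end

for _, url in ipairs(links) do print(url) end
```

On this input the loop yields the two .mp3 links and skips the .html one. It stays a heuristic: %S+ is greedy, so link text without spaces that also ends in ".mp3" would be swallowed into the match.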

When you're not dealing with random / adversarial strings, that is good enough and you don't have to care about all those intricacies. From what I gathered, the goal is one-off semi-manual extraction of links from HTML generated by some other party, so even potential errors don't really matter… (The human in the loop can notice / fix things.)

-- nobody