realbasic-nug

Re: finding links with RegEx?

To: REALbasic Network Users Group <realbasic-nug at lists dot realsoftware dot com>
Subject: Re: finding links with RegEx?
From: Kevin Ballard <kevin at sb dot org>
Date: Wed, 27 Feb 2002 18:45:48 -0500
On 2/27/02 5:35 PM, "Thomas Reed" <thomasareed at earthlink dot net> wrote:

> Does anyone have an example of how to find links in HTML files using
> RegEx?  I'm not all that knowledgeable in regular expressions, and I'm
> curious about whether someone else has already invented this particular
> wheel...
> 
> Note that I need to find any kind of link to an external file -- such as
> <A HREF="link">, <IMG SRC="link">, <BODY BACKGROUND="link">, etc.  And I
> need to be able to quickly and easily isolate the "link" text.

Hrm. Assuming the link tag properties are always HREF, SRC, or BACKGROUND,
and assuming that all tags that use these properties have them set to a
link, then I guess the regex searchstring would be this:

<[^>]*(SRC|HREF|BACKGROUND)=(\x22[^\x22]*\x22|[^\x20>]+)[^>]*>

(\x22 is quote ("))
(I use \x20 to match space because I don't like spaces in regex strings)

The SubExpressionString to check for would be 2 (for a quoted link it still
includes the surrounding quotes, so strip those). SubExpressionString 1 would
be the tag property that this link is set to. Note, this is untested, but it
seems like it would match things like: (with parentheses around the link)

<A HREF="(testing, testing)">
<IMG SRC=(thetest.jpg)>
<BODY BACKGROUND="(whee.jpg)" WHATEVER="blah">
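For anyone who wants to try the pattern before wiring it into a project, here
is a minimal sketch in Python, whose re module stands in for REALbasic's RegEx
class (group(1) and group(2) play the role of SubExpressionString 1 and 2).
The find_links name and the sample text are made up for illustration, the
unquoted branch is assumed to be written [^\x20>]+ so it matches a run of
characters, and case-insensitive matching is my assumption too.

```python
import re

# Assumption: the unquoted branch is [^\x20>]+ (one or more characters).
LINK_RE = re.compile(
    r'<[^>]*(SRC|HREF|BACKGROUND)=(\x22[^\x22]*\x22|[^\x20>]+)[^>]*>',
    re.IGNORECASE,  # assumption: HTML attribute names may be in any case
)

def find_links(html):
    """Return (attribute, link) pairs, one per matching tag."""
    links = []
    for m in LINK_RE.finditer(html):
        value = m.group(2)
        # Group 2 keeps the surrounding quotes on a quoted value; drop them.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        links.append((m.group(1).upper(), value))
    return links

sample = '''<A HREF="testing, testing">
<IMG SRC=thetest.jpg>
<BODY BACKGROUND="whee.jpg" WHATEVER="blah">'''

for attribute, link in find_links(sample):
    print(attribute, link)
```

Because the leading [^>]* is greedy, a tag that carries more than one of these
properties yields only the last one, so this sketch reports one link per tag.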

Note: it will also match some strange tags, but if these are in an HTML file
then the file is incorrect HTML (parentheses around the match):

<HREF=(blah.hrm")> // This matches closing " because there's no opening "
<IMG SRC="(testing.gif)"ONMOUSEOVER="whatever">
<BODYBACKGROUND=(something)>
<A HREF=(testing,) testing> // The comma is kept; only a space or > ends it
<A HREF=("testing> <A HREF=")blah"> // No closing " on first tag
// The last one runs from the first tag's opening quote to the second tag's
// opening quote, so both tags are swallowed into one match with a bogus link
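Since these annotations are easy to get wrong by eye, a quick way to check
them is to run each strange tag through the pattern and print what the second
subexpression holds. This is a rough Python sketch rather than REALbasic: the
second_group helper is a made-up name, and the unquoted branch is again
assumed to be [^\x20>]+. Under Python's engine the last input does produce a
match, one that spans both tags.

```python
import re

# Assumption: the unquoted branch is [^\x20>]+ (one or more characters).
PATTERN = re.compile(
    r'<[^>]*(SRC|HREF|BACKGROUND)=(\x22[^\x22]*\x22|[^\x20>]+)[^>]*>',
    re.IGNORECASE,
)

cases = [
    '<HREF=blah.hrm">',
    '<IMG SRC="testing.gif"ONMOUSEOVER="whatever">',
    '<BODYBACKGROUND=something>',
    '<A HREF=testing, testing>',
    '<A HREF="testing> <A HREF="blah">',
]

def second_group(text):
    """Return subexpression 2 of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(2) if m else None

results = [second_group(c) for c in cases]
for case, result in zip(cases, results):
    print(repr(case), '->', repr(result))
```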

In most of these cases, it matches the correct link anyway, because it
checks for either a quoted string or a series of characters up until the
first space or >. Note, if the string is quoted and missing the closing
quote, it will include the > in the link and continue up until it sees
another quote. The trailing [^>]*> then accepts anything up to the next >, so
the usual result is one bogus link spanning several tags rather than a
skipped link. If there is no later quote at all, the unquoted branch takes
over and the link keeps its leading quote. I chose to let > appear inside
quotes because both BBEdit and OmniWeb allow > to show up in quoted strings
in tags without ending the tag.
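To see the effect of allowing > inside a quoted value, here is a small Python
sketch (again standing in for REALbasic's RegEx class, with the unquoted
branch assumed to be [^\x20>]+). The tag below is a made-up example.

```python
import re

# Assumption: the unquoted branch is [^\x20>]+ (one or more characters).
PATTERN = re.compile(
    r'<[^>]*(SRC|HREF|BACKGROUND)=(\x22[^\x22]*\x22|[^\x20>]+)[^>]*>',
    re.IGNORECASE,
)

tag = '<A HREF="a>b.html" TARGET="x">'
m = PATTERN.search(tag)

whole_match = m.group(0)  # the entire tag, not cut short at the first >
link = m.group(2)         # the quoted value, > and quotes included

print(whole_match)
print(link)
```

The quoted branch consumes everything up to the closing quote, so the > in
the link does not end the tag early.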

Try out the RegEx and tell me if it works. I'm too busy (read: lazy) to
build a project to try out the RegEx.

-- 
Kevin Ballard
kevin at sb dot org
Email from Korea or China must go to <kevin dot nb at sb dot org>
http://kevin.sb.org/


