realbasic-nug
[Top] [All Lists]

Re: String Similiarity

To: realbasic-nug at lists dot realsoftware dot com
Subject: Re: String Similiarity
From: "Theodore H.Smith" <delete at elfdata dot com>
Date: Sat, 30 Oct 2004 17:42:53 +0100
Delivered-to: realbasic-nug at lists dot realsoftware dot com
References: <20041030143357 dot 8A20751BE37 at lists dot realsoftware dot com>
Take a look at:
<http://www.seg.rmit.edu.au/research/download.php?manuscript=14>. It's
a tree structure called a burst trie. If I understand how your tree
works, it's very similar (though not exactly the same) to a burst trie
(I could of course be very wrong). Your structure is more suited to an
in-memory tree, whilst burst tries are more suited to on disk trees,
though the paper compares its speed with other in-memory map
structures. BTW this paper is an example of an academic paper written
in an extremely accessible manner.

<http://www.dcs.shef.ac.uk/~sam/stringmetrics.html> I had to use google's cache to read this page, as it went down.

"Employing Trainable String Similarity Metrics for Information Integration". No it wasn't that one. I forgot the title, but most of the code was actually garbled, it looked like they used a very faulty OCR. Not to mention that even the letters that were OCR'd properly, looked awful. PDF is supposed to be beautiful, not look like a badly photocopied text. And it displays very slowly, which is odd. Its like they've done something very wrong to the file format of the PDF. I just don't get why they don't use HTML! They have a website, but don't serve their documents in HTML??? HTML is the most accessible format, more so than PDF. PDF is great for keeping the layout consistent, but its still less accessible than HTML.

"Employing Trainable String Similarity Metrics for Information
Integration" uses a lot of math equations. Thats another qualm I have with academics. Computing is not maths. They try to reduce all programs down to a one line mathematical equation, when a proper normal program using comments, would be much nicer and be more much understandable. What if I don't know what they mean by their funny "E" or other strange things like that? I've seen a lot of source code in my time, and I don't remember needing to use "funny E's" in them.

Math equations are not bad, they have their place, but I think that place is for things including media processing (graphics, sound), and 3d/2d modelling. At least in these cases, there are no better alternatives, and in fact the equations do the job just great. But the way these academics apply these equations to most code, is just not nice.

I'l take a look at the material you sent to me, thanks very much. For my ElfData plugin, I think I may keep all the existing functionality, but delete the lexer class. I can imagine for example, that using a quicksort it would be nice to have some kind of "Lexer" functionality, so that the quicksort can sort by soundex, or case insensitive, etc. The thing is, that as Charles has demonstrated, a quicksort of that sophistication is better done with class interfaces, that way it works on all sorts of data not just strings.

I can't think of any case where this "lexer" functionality would definitely be the most advantageous way.

Are you familiar with the writings of James Clerk Maxwell? I like his writings, I find it easier to understand electromagnetism from his textbooks, despite that they are hundreds of years old, compared to modern books. Why? Because he has that "pioneering yet simple" attitude.

He doesn't cut off awkward pieces, he leaves things raw. Like for example his leaving in his books all sorts of examples like "this is an interesting electrical phenomena, however it is currently unexplained" (refering to "disk lightening"). And thats interspersed with his formula.

Modern academic text books are dry and dead by comparison. It doesn't have that life in them, the enthusiasm for exploring a new subject, that you find in the writings of the pioneers. They just want you to be 100% sure about how to regurgitate some cases, its like trying to treat the human as a "cp" program instead of a "gcc" program. The "cp" program can copy data, but the "gcc" program can understand (some kinds of) data. (In fact I have more in mind a logical processing program than gcc, but I don't know of any that currently exist.)

While we may never/rarely meet a Maxwell, I find the writings of pragmatic coders who are people who have needed to implement professional real-world systems, more accessible than those who are divorced from having to put their tests out into the real world. Where the review of a professor or peers is more important than how the code actually affects the rest of the system. "Writing Solid Code" by that MicroSoft employee is a great example.

About my structure, its not the same as a burst trie, in that mine doesn't "burst". The "burst trie" The structure is composed of nodes and leaves. I think I explained it on www.elfdata.com/dictionary/ however this page is badly out of date, and the content is not good. I intended my code to be a very fast dictionary look up, and while it is fast, it is not the fastest. It is however more flexible than hashing. It maintains binary sort order as an implicit property of its structure also, so the sort order comes "for free".

--
    Theodore H. Smith - Software Developer.
    http://www.elfdata.com

_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>

Search the archives of this list here:
<http://www.realsoftware.com/listarchives/lists.html>

<Prev in Thread] Current Thread [Next in Thread>