    GPDL Regular Expressions



    DungeonCraft Help Home



    Books have been written on this subject. But it ain't that bad for our purpose here.

    A regular expression is a 'human-readable' (almost and sometimes) program for searching a string. A simple example:

        "cat"

    is a nice regular expression. This particular expression says to find the letters 'c', 'a', and 't' immediately adjacent to each other anywhere in the string.

    Anywhere? Yes, anywhere. The search engine starts at the first character of the string and tests to see if it begins with "cat". If so, it is done and returns 'true'. If not, then the first character of the string is dropped and the process repeated. The obvious result is to test whether the word 'cat' appears in the string.

    But wait! What if the string is "My catalytic converter is broken."? Well, the answer is that the search will succeed. Maybe you should have used the regular expression:

       " cat "

    That is a nice regular expression, too. And it says that there must be a space both before and after the letters "cat".
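
    If you want to try this away from DungeonCraft, here is a minimal sketch in Python. It assumes that Python's re.search scans the string the same way for these simple patterns; the helper name 'found' is made up for this illustration and is not part of GPDL.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

sentence = "My catalytic converter is broken."

print(found("cat", sentence))                 # True: "cat" hides inside "catalytic"
print(found(" cat ", sentence))               # False: no space follows "cat"
print(found(" cat ", "I fed my cat today."))  # True: space on both sides
```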

    But wait! What if the string is "Here. Take my cat. Please." Now the search will fail because "cat" is followed by a period instead of a space. Solution? As follows:

       " cat[ .,;]"

    Oh, dear. This is not quite as human-readable. The pair of brackets means that there is a choice of characters. In this case the choice is a space, a period, a comma, or a semicolon.
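
    The choice of characters can be tried out the same way. Again a sketch in Python rather than GPDL, assuming brackets behave alike in both; 'has_cat' is a made-up name.

```python
import re

def has_cat(text):
    """Hypothetical helper: " cat" followed by a space, period, comma, or semicolon."""
    return re.search(" cat[ .,;]", text) is not None

print(has_cat("Here. Take my cat. Please."))  # True: a period follows
print(has_cat("Take my cat; it is yours."))   # True: a semicolon follows
print(has_cat("Take my cat!"))                # False: '!' is not one of the choices
```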

    Is the problem of finding the word "cat" solved? I don't think so. What about this text:

       Cats and dogs and toads.

    Maybe we should add an 's' as a possible terminating character. As follows:

       " cats?[ .,;]"

    What is that question-mark doing there? A question-mark following a character (or choice of characters) means that the character may or may not be there. The asterisk and plus are very much like the question-mark:



        ?    Character appears 0 or 1 times
        *    Character appears 0 or more times
        +    Character appears 1 or more times
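
    To see the three counts side by side, here is a small Python sketch. It uses re.fullmatch, which insists that the whole string match (unlike the search-anywhere behavior described above), only so that the number of 's' characters is what decides the result. The helper name 'whole' is an invention for this example.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True only if the entire text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# ?  the 's' appears 0 or 1 times
print(whole("cats?", "cat"), whole("cats?", "cats"), whole("cats?", "catss"))
# True True False

# *  the 's' appears 0 or more times
print(whole("cats*", "cat"), whole("cats*", "catsss"))
# True True

# +  the 's' appears 1 or more times
print(whole("cats+", "cat"), whole("cats+", "catsss"))
# False True
```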



    So does our latest attempt work? No, because there is no space in front of the word "cat" when it is the first word on the line. We can fix this as follows:

       "(^| )cats?[ .,;]"

    At least the solution is beginning to look unintelligible. Like an expert did it. But it still should not work! We asked for the word "cat" and what was in the sentence was "Cat". An upper-case 'C'. But in GPDL this works just fine because we convert everything to upper case before doing the actual pattern match.

    And what is that character '^' and how does it help? That 'hat' will match the empty string at the beginning of the line so we can find 'cat' when it is the first word on the line. (A Cat in the Hat?) The parentheses and the vertical bar offer a choice between the hat and a space; more about them below. Be careful where you put the hat: as the first character inside brackets, as in "[^ ]", it means something else entirely, namely any single character that is not a space.
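
    You can check the hat and the upper-case business for yourself. This is a sketch in Python, not GPDL: re.IGNORECASE is only a stand-in for GPDL's conversion to upper case, and the hat is written outside the brackets, as "(^| )", because Python treats a hat placed first inside brackets as 'any character except'.

```python
import re

text = "Cats and dogs and toads."
pattern = "(^| )cats?[ .,;]"  # start of line or a space, then cat or cats

# Without help, 'c' does not match 'C'.
case_sensitive = re.search(pattern, text) is not None
# Assumption: IGNORECASE plays the role of GPDL's upper-casing.
ignoring_case = re.search(pattern, text, re.IGNORECASE) is not None
# "catalytic" is still rejected: 'a' is not one of the terminating choices.
catalytic = re.search(pattern, "My catalytic converter", re.IGNORECASE) is not None

print(case_sensitive, ignoring_case, catalytic)  # False True False
```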

    All of this is made easier by the following 'meta-characters'.

        \<    Matches empty string at front of word
        \>    Matches empty string at end of word
        \b    Matches empty string at edge of word
        $     Matches empty string at end of line



    So a simpler solution to our problem might look like this:

       "\bcats?\b"

    Meaning the edge of a word followed by "cat" followed (perhaps) by an 's' followed by the edge of a word.
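
    Here is the word-edge version run against the sentences from earlier, as a Python sketch. Python's re module understands \b but not \< and \>, so only \b is shown. The r in front of the pattern keeps Python from reading \b as a backspace, and re.IGNORECASE again stands in for GPDL's conversion to upper case. The name 'mentions_cat' is made up.

```python
import re

def mentions_cat(text):
    """Hypothetical helper: True if "cat" or "cats" appears as a whole word."""
    return re.search(r"\bcats?\b", text, re.IGNORECASE) is not None

print(mentions_cat("Cats and dogs and toads."))           # True
print(mentions_cat("Here. Take my cat. Please."))         # True
print(mentions_cat("My catalytic converter is broken."))  # False: no word edge after "cat"
```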

    Another interesting problem we can try to solve. Say that we want to know if the player typed a sentence with the word "cat" and the word "dog". Here is what we might try:

       "\b(cats?\b.*\bdogs?|dogs?\b.*\bcats?)\b"

    This probably needs no explanation. But I will try to explain anyway.

    The parentheses simply group the two items between them. The two items are separated by a vertical-bar, which means that either of the two items can be used to make a match. So we will try to match the first '\b', then whatever is in the parentheses, and then the final '\b'.

    The first item within the parentheses will match the word "cat" or "cats", followed by zero or more of anything, followed by the word "dog" or "dogs".

    Do you see that special character, the period? It means any single character. And when followed by an asterisk it means 0 or more of whatever the period can match. We might have used a plus instead of an asterisk. Would it have made any difference?

    The second item similarly will match the two words in the reverse order. The word edges were placed outside the parentheses so that I would have to type them fewer times. Do you see why it works that way? We could have written:

       "\bcats?\b.*\bdogs?\b|\bdogs?\b.*\bcats?\b"

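    If you would rather test than trust, this Python sketch runs both spellings of the cat-and-dog expression over a few made-up sentences and checks that they agree. As before, re.IGNORECASE is an assumption standing in for GPDL's conversion to upper case, and the names are invented for the example.

```python
import re

grouped = r"\b(cats?\b.*\bdogs?|dogs?\b.*\bcats?)\b"
spelled_out = r"\bcats?\b.*\bdogs?\b|\bdogs?\b.*\bcats?\b"

def matches(pattern, text):
    """Hypothetical helper: True if pattern is found anywhere in text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

samples = [
    "My dog chased the cat.",    # both words, dog first
    "Cats and dogs and toads.",  # both words, cat first
    "The cat sat on the mat.",   # no dog
    "A catalog of dogs.",        # "cat" only as part of a longer word
]

results = [(matches(grouped, s), matches(spelled_out, s)) for s in samples]
print(results)
# [(True, True), (True, True), (False, False), (False, False)]
```
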
    A good book on the subject of egrep will tell you a lot more. A copy of the GNU documentation is at More Regular Expressions.



    See $GREP, $GCASE, and $WIGGLE.