Writing a Regular Expression parser in Haskell: Part 2
The first module in my simple regular expression parse is called RegexToNFA. This module exposes the types that make up a finite state machine and also the functions to convert a regular expression string into a finite state machine.
My structure for a FSM follows closely from the mathematical definition:
1: data FiniteMachine = FiniteMachine{ table :: [Transition],
2: alphabet :: Set Char,
3: start :: Node,
4: final :: Set Node
5:
6:
7: -- NFA node
8: type Node = Integer
9:
10: -- The value for an edge in a NFA
11: type TransitionValue = Maybe Char
12:
13: -- A transition in a NFA is a tuple of
14: -- StartNode , DestinationNode, Value to transition on
15: type Transition = (Node,Node,TransitionValue)
16:
17: -- The value of the edge in the NFA is a Maybe Char
18: -- Where Nothing is the epsilon transition
19: -- therefore lets just rename Nothing to epsilon
I have the value which you transition on as a Maybe Char (which I alias as TransitionValue). This allowed me to define epsilon as Nothing data constructor.
With this structure defined my goal now is to convert a regular expression pattern such as: (a|b)* into a FiniteMachine. In order to do this there is a lot of state that I need to keep track of which naturally leads to the use of the State monad. To do this I set up a structure for what data I want to be kept track of and then create a state monad using that structure:
1: -- The state that gets passed around which we used to build up the NFA
2: data ParseContext = Context
3: {
4: nodeList :: [Node],
5: transitions :: [Transition],
6: operators :: OperatorList,
7: nextNode :: Node,
8: values :: Set Char
9: } deriving (Show, Eq)
10:
11: -- Alias the State data constructor with a more friendly name
12: type RegexParseState a = State ParseContext a
This structure is passed between functions to allow them to see the current state of the parsing and create a new state. I define many functions, each which deal with a piece of the puzzle of converting the input string into a FSM. I am not going to address them all but I will point out some which is note worth:
convertToNFA - This is the top level function, it is exposed externally and lets you convert a regex to a NFA.
processOperator - This function determines when we should execute an operator given its precedence. We assign each operator a precedence which lets us determine when we should execute an operator. For example in the expression a|b*, we want to execute star before we execute union.
Last but not least are the methods which execute the operators. For example, there is one called doConcat, which performs the concatenation of two values in the regular expression. doConcat isn't pretty, since its doing the dirty work of examining the state and create a new state to reflect a partially completed FSM.
1: -- Execute the concat operator
2: doConcat :: RegexParseState ()
3: doConcat = do
4: st <- get
5: let nodes = nodeList st
6: newNodes = (nodes !! 0) : (nodes !! 3) : (drop 4 nodes)
7: newTransitions = transitions st ++ [(nodes !! 2, nodes !! 1, epsilon)]
8: newOperators = tail $ operators st
9: put $ st { nodeList = newNodes,
10: transitions = newTransitions ,
11: operators = newOperators}
With all this in place, lets finally see what this module actually outputs.
> convertToNFA "a|(bc)*"
FiniteMachine {table =
[(4,5,Just 'c'),(2,3,Just 'b'),(0,1,Just'a'),
(3,4,Nothing),(6,2,Nothing),(6,0,Nothing),
(1,7,Nothing),(5,7,Nothing),(8,6,Nothing),
(8,7,Nothing),(7,9,Nothing),(9,8,Nothing)],
alphabet = fromList "abc",
start = 8,
final = fromList [9]}
If you examine the table list in the output, you will see all the transitions for the NFA that accepts "a|(bc)*" and that the start state is node 8 and the accept state is node 9.
I uploaded the RegexToNFA.hs file for your examination. I tried to comment it a good amount and I feel it should be pretty easy to read and understand.
In the next part I will talk about the next modules: NFAtoDFA
Comments
- Anonymous
June 02, 2008
A few weeks ago I read this article about writing a simple regular expression parser. That article does