Figure 3 on page 121 shows an overview of the important class relationships defined by the C++ interface. The description of each element of the class hierarchy includes rules, behaviors, and design tips that help you build and use well-designed hierarchies.
Token Classes
Each token object passed to the parser must satisfy at least the interface defined by class ANTLRAbstractToken if ANTLR is to compile and report errors for you. Specifically, ANTLR token objects know their token type, line number, and associated input text.
C++ Interface
class ANTLRAbstractToken {
public:
    virtual ANTLRTokenType getType();
    virtual void setType(ANTLRTokenType t);    // optional
    virtual int getLine();
    virtual void setLine(int line);            // optional
    virtual ANTLRChar *getText();
    virtual void setText(ANTLRChar *);         // optional
    virtual ANTLRAbstractToken *makeToken(ANTLRTokenType t, ANTLRChar *txt, int ln);
};
Most of the time you will want your token objects to be garbage collected to avoid memory leaks. The ANTLRRefCountToken class is provided for this purpose. All subclasses are garbage collected (assuming you use the provided "smart pointer" class ANTLRTokenPtr).
The common case is that you will subclass the ANTLRRefCountToken interface. For your convenience, however, a token object class, ANTLRCommonToken, is provided that will work "out of the box." It does garbage collection and has a fixed text field that stores the text of the token found in the input stream.
Why function makeToken() is required at all, and why you have to pass the address of an ANTLRToken into the DLG-based scanner during parser initialization, may not be obvious. Why can the constructor not be used to create a token? The reason lies with the scanner, which must construct the token objects. The DLG support routines are typically in a precompiled object file that is linked in regardless of your token definition. Hence, DLG must be able to create tokens of any type.
Because objects in C++ are not "self-conscious" (i.e., they do not know their own type), DLG has no idea of the appropriate constructor. Constructors cannot be virtual anyway, so we provide a virtual member function, makeToken(), that plays the role of a constructor and acts like a factory. It returns the address of a new token object upon each invocation rather than just initializing an existing object.
Because classes are not first-class objects in C++ (i.e., you cannot pass class names around), we must pass DLG the address of an ANTLRToken token object so DLG has access to the appropriate virtual table and is, thus, able to call the appropriate makeToken(). This weirdness would disappear if all objects knew their type or if class names were first-class objects. Here is the code fragment in DLG that constructs the token objects that are passed to the parser via the ANTLRTokenBuffer:
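ANTLRAbstractToken *DLGLexerBase::getToken()
{
    if ( token_to_fill==NULL ) panic("NULL token_to_fill");
    ANTLRTokenType tt = nextTokenType();
    // ask the user's token object to build a new token of its own type
    ANTLRAbstractToken *tk = token_to_fill->makeToken(tt, _lextext, _line);
    return tk;
}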
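The factory idea can be seen in isolation in the following sketch. It does not use the PCCTS headers: Token, MyToken, and scannerMakes() are hypothetical stand-ins for ANTLRAbstractToken, a user-defined token class, and the precompiled DLG routine, respectively.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ANTLRAbstractToken.
class Token {
public:
    virtual ~Token() {}
    virtual int getType() const = 0;
    virtual std::string getText() const = 0;
    // "Virtual constructor": returns a new object of the receiver's dynamic type.
    virtual Token *makeToken(int type, const std::string &text, int line) const = 0;
};

// Hypothetical user-defined token class; the scanner code never names it.
class MyToken : public Token {
    int type_;
    std::string text_;
    int line_;
public:
    MyToken() : type_(0), line_(0) {}
    MyToken(int t, const std::string &s, int ln) : type_(t), text_(s), line_(ln) {}
    int getType() const override { return type_; }
    std::string getText() const override { return text_; }
    int line() const { return line_; }
    Token *makeToken(int t, const std::string &s, int ln) const override {
        return new MyToken(t, s, ln);
    }
};

// Stand-in for the precompiled scanner: it knows only the abstract Token class,
// yet it creates objects of the user's class through the prototype it was given.
Token *scannerMakes(const Token *token_to_fill, int type,
                    const std::string &text, int line) {
    return token_to_fill->makeToken(type, text, line);
}
```

The caller owns the returned object in this sketch; in the real interface, ownership is handled by the reference counting described in the next section.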
Token Object Garbage Collection
Token objects are created via ANTLRToken::makeToken(), but how are they deleted? The class ANTLRCommonToken is garbage collected through a "smart pointer" called ANTLRTokenPtr using reference counting. Any token object not referenced by your grammar's actions is destroyed by the ANTLRTokenBuffer when it makes room for more token objects. (Calling function ANTLRParser::noGarbageCollectTokens() will turn off this mechanism.) Token objects referenced by your actions are destroyed when local ANTLRTokenPtr objects are deleted. For example,
a : label:ID ;
would be converted to something like:
void yourclass::a(void)
{
    zzRULE;
    ANTLRTokenPtr label=NULL;
    zzmatch(ID);
    label = (ANTLRTokenPtr)LT(1);
    consume();
    ...
}
When the label object is destroyed (it is just a pointer to your input token object obtained from LT(1)), it decrements the reference count on the object created for the ID. If the count goes to zero, the object pointed to by label is deleted.
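The following is a minimal sketch of this reference-counting scheme, not the PCCTS implementation. CountedToken and TokenPtr are hypothetical stand-ins for a reference-counted token and ANTLRTokenPtr; the static counter live exists only so the effect can be observed.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted token object.
class CountedToken {
    unsigned refcnt_;
public:
    static int live;                  // objects currently alive (demo only)
    CountedToken() : refcnt_(0) { ++live; }
    ~CountedToken() { --live; }
    unsigned nref() const { return refcnt_; }
    void ref() { ++refcnt_; }
    void deref() { if (refcnt_ > 0) --refcnt_; }
};
int CountedToken::live = 0;

// Hypothetical stand-in for ANTLRTokenPtr.
class TokenPtr {
    CountedToken *ptr_;
    void attach(CountedToken *p) {
        ptr_ = p;
        if (ptr_) ptr_->ref();
    }
    void release() {
        if (ptr_) {
            ptr_->deref();
            if (ptr_->nref() == 0) delete ptr_;   // last reference is gone
        }
        ptr_ = 0;
    }
public:
    TokenPtr(CountedToken *p = 0) { attach(p); }
    TokenPtr(const TokenPtr &other) { attach(other.ptr_); }
    TokenPtr &operator=(const TokenPtr &other) {
        if (other.ptr_ != ptr_) { release(); attach(other.ptr_); }
        return *this;
    }
    ~TokenPtr() { release(); }
    CountedToken *operator->() const { return ptr_; }
};
```

A local TokenPtr such as label in the generated rule above behaves like any other automatic object: when it goes out of scope its destructor runs, and the token is deleted only if no other TokenPtr still refers to it.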
To correctly manage the garbage collection, use ANTLRTokenPtr instead of "ANTLRToken *". Unfortunately, the smart pointers can only be pointers to the abstract token class, which causes trouble in your actions. If you subclass ANTLRCommonToken and then attempt to refer to one of your token members via a token pointer in your grammar actions, the C++ compiler will complain that your token object does not have that member. For example, the following results in a compile-time error:
<<
class ANTLRToken : public ANTLRCommonToken {
    int muck;
    ...
};
>>
class Foo {
a : t:ID << t->muck = ...; >> ;
}
In the t->muck reference, ANTLRTokenPtr::operator->() converts t to an "ANTLRAbstractToken *", and the abstract token class has no member muck. Instead, you must do the following:
a : t:ID << mytoken(t)->muck = ...; >> ;
in order to downcast t to be an “ANTLRToken *”. Macro mytoken(aSmartTokenPtr)
gets an “ANTLRToken *” from a smart pointer.
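The downcast can be illustrated with a small sketch that is independent of the PCCTS headers. AbstractToken, UserToken, TokenPtr, and asUserToken() are hypothetical stand-ins for ANTLRAbstractToken, your ANTLRToken, ANTLRTokenPtr, and the mytoken() macro; the real macro may be defined differently.

```cpp
#include <cassert>

// Stand-in for ANTLRAbstractToken.
struct AbstractToken {
    virtual ~AbstractToken() {}
};

// Stand-in for a user-defined ANTLRToken with an extra member.
struct UserToken : AbstractToken {
    int muck;
    UserToken() : muck(0) {}
};

// Minimal non-owning pointer wrapper: operator-> only knows the abstract type.
class TokenPtr {
    AbstractToken *ptr_;
public:
    explicit TokenPtr(AbstractToken *p) : ptr_(p) {}
    AbstractToken *operator->() const { return ptr_; }
};

// Hypothetical helper playing the role of mytoken(): recover the derived type.
// The cast is only safe when the pointer really refers to a UserToken.
inline UserToken *asUserToken(const TokenPtr &t) {
    return static_cast<UserToken *>(t.operator->());
}
```

With these definitions, t->muck does not compile because AbstractToken has no such member, whereas asUserToken(t)->muck does.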
The reference counting interface used by ANTLRTokenPtr is as follows:
class ANTLRRefCountToken : public ANTLRAbstractToken {
    /* Define to satisfy ANTLRTokenBuffer's need to determine
       whether or not a token object can be destroyed. If nref()==0,
       no one has a reference, and the object may be destroyed. This
       function defaults to 1; hence, if you use the
       ANTLRParser::garbageCollectTokens() message with a token object
       not derived from ANTLRCommonRefCountToken, the parser will compile
       but will not delete objects after they leave the token buffer. */
protected:
    unsigned refcnt_;
public:
    // these 3 functions are called by ANTLRTokenPtr class
    virtual unsigned nref() { return 1; }
    virtual void ref();
    virtual void deref();
};
Scanners and Token Streams
The raw stream of tokens coming from a scanner is accessed via an ANTLRTokenStream. The required interface is simply that the token stream must be able to answer the message getToken():
class ANTLRTokenStream {
public:
    virtual ANTLRAbstractToken *getToken() = 0;
};
To use your own scanner, subclass ANTLRTokenStream and define getToken() or have getToken() call the appropriate function in your scanner. For example,
class MyLexer : public ANTLRTokenStream {
private:
    virtual ANTLRAbstractToken *getToken();
};
DLG scanners are all subclasses of ANTLRTokenStream.
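As an illustration of a hand-written scanner, here is a self-contained sketch that splits its input on whitespace. It uses hypothetical stand-ins (TokenStream, SimpleToken, WordLexer, and the token types Eof and WORD) rather than the PCCTS classes, and the caller owns the returned tokens.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum TokenType { Eof = 1, WORD = 2 };    // hypothetical token types

struct SimpleToken {                     // stand-in for a token class
    TokenType type;
    std::string text;
};

class TokenStream {                      // stand-in for ANTLRTokenStream
public:
    virtual ~TokenStream() {}
    virtual SimpleToken *getToken() = 0;
};

// Hand-written scanner: each call returns the next whitespace-delimited word,
// or an Eof token once the input is exhausted.
class WordLexer : public TokenStream {
    std::string input_;
    std::string::size_type pos_;
    bool atSpace() const {
        return std::isspace(static_cast<unsigned char>(input_[pos_])) != 0;
    }
public:
    explicit WordLexer(const std::string &input) : input_(input), pos_(0) {}
    SimpleToken *getToken() override {
        while (pos_ < input_.size() && atSpace()) ++pos_;       // skip blanks
        if (pos_ == input_.size()) return new SimpleToken{Eof, ""};
        std::string::size_type start = pos_;
        while (pos_ < input_.size() && !atSpace()) ++pos_;      // scan a word
        return new SimpleToken{WORD, input_.substr(start, pos_ - start)};
    }
};
```

Code that holds only a TokenStream reference can pull tokens from WordLexer without knowing how they are produced, which is the property the token buffer relies on.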
Token Buffer
The parser is "attached" to an ANTLRTokenBuffer by the interface functions getToken() and bufferedToken(). The object that actually consumes characters and constructs tokens, a subclass of ANTLRTokenStream, is connected to the ANTLRTokenBuffer via interface function ANTLRTokenStream::getToken(). This strategy isolates the infinite lookahead mechanism (used for syntactic predicates) from the parser and provides a "sliding window" into the token stream.
The ANTLRTokenBuffer begins with k token object pointers, where k is the size of the lookahead specified on the ANTLR command line. The buffer is circular when the parser is not evaluating a syntactic predicate (that is, when ANTLR is not guessing during the parse); when a new token is consumed, the least recently read token pointer is discarded. When the end of the token buffer is reached during a syntactic predicate evaluation, however, the buffer grows so that the token stream can be rewound to the point at which the predicate was initiated. The buffer can only grow, never shrink.
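The sliding-window behavior can be sketched as follows. This is a simplified model, not the ANTLRTokenBuffer implementation: it uses a std::deque instead of a circular array (so, unlike the real buffer, its storage does shrink again), tokens are plain integers with 0 marking end of input, and the names SlidingTokenBuffer, mark(), and rewind() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

class SlidingTokenBuffer {
    std::vector<int> source_;    // stand-in for the token stream
    std::size_t next_;           // index of the next token to pull from source_
    std::deque<int> window_;     // tokens currently buffered
    std::size_t cursor_;         // position of LT(1) inside window_
    bool guessing_;              // true while evaluating a syntactic predicate

    // Make sure at least n tokens are buffered from the cursor onward.
    void fill(std::size_t n) {
        while (window_.size() < cursor_ + n)
            window_.push_back(next_ < source_.size() ? source_[next_++] : 0);
    }
public:
    explicit SlidingTokenBuffer(const std::vector<int> &tokens)
        : source_(tokens), next_(0), cursor_(0), guessing_(false) {}

    int LT(std::size_t i) { fill(i); return window_[cursor_ + i - 1]; }

    void consume() {
        fill(1);
        if (guessing_) ++cursor_;        // keep the token so we can rewind
        else window_.pop_front();        // normal parse: discard oldest token
    }
    void mark() { guessing_ = true; }    // a syntactic predicate starts here
    void rewind() { cursor_ = 0; guessing_ = false; }   // back to the mark
    std::size_t buffered() const { return window_.size(); }
};
```

While guessing, consumed tokens stay in the window, so the window grows; after rewind() the parser sees the same tokens again starting at the marked position.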
By default, the token buffer deletes token objects when they are no longer needed. A reference count is used to determine how many references exist to each token object. When the count reaches zero, the token object is subject to deletion. If your grammar references a token object in a grammar action, the token buffer will not delete that object. The "smart pointer" to the token object used by your action will delete it.
The token object pointers in the token buffer may be accessed from your actions with ANTLRParser::LT(i), where i=1..n and n is the number of token objects remaining in the file; LT(1) is a pointer to the next token to be recognized. This function can be used to write sophisticated semantic predicates that look deep into the rest of the input token stream to make complicated decisions. For example, the C++ qualified item construct is difficult to match because there may be an arbitrarily large sequence of scopes before the object can be identified (e.g., A::B::~B()).
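A predicate of this kind might be sketched as below. The Lookahead class is a hypothetical stand-in for the parser's LT(i) access, the token types are invented for the example, and isQualifiedDestructor() is an illustrative helper, not part of ANTLR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum TokenType { Eof, ID, SCOPE, TILDE, LPAREN };   // hypothetical token types

struct Tok { TokenType type; };                     // stand-in for a token object

// Stand-in for the parser's deep lookahead: LT(i) is valid for any i >= 1 and
// yields an Eof token once i runs past the remaining input.
class Lookahead {
    std::vector<Tok> tokens_;
    Tok eof_;
public:
    explicit Lookahead(const std::vector<TokenType> &types) : eof_{Eof} {
        for (TokenType t : types) tokens_.push_back(Tok{t});
    }
    const Tok *LT(std::size_t i) const {
        return (i >= 1 && i <= tokens_.size()) ? &tokens_[i - 1] : &eof_;
    }
};

// True if the upcoming tokens look like (ID SCOPE)+ TILDE ID, e.g. A::B::~B.
bool isQualifiedDestructor(const Lookahead &p) {
    std::size_t i = 1;
    if (p.LT(i)->type != ID || p.LT(i + 1)->type != SCOPE) return false;
    while (p.LT(i)->type == ID && p.LT(i + 1)->type == SCOPE)
        i += 2;                                     // skip "A::", "B::", ...
    return p.LT(i)->type == TILDE && p.LT(i + 1)->type == ID;
}
```

The loop may look arbitrarily far ahead, which is why such a test needs LT(i) rather than the k-token cache behind LA(i).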
The ANTLRParser::LA(i) function returns the token type of the ith lookahead symbol, but is valid only for i=1..k. This function uses a cache of k tokens stored in the parser itself. The token buffer itself is not queried.
The commonly used ANTLRTokenBuffer functions are:
virtual ANTLRAbstractToken *getToken();
Return the next token from the buffer.
virtual ANTLRAbstractToken *bufferedToken(int i);
Return the token i ahead where i = 1..n with n equal to the number of tokens remaining in the input.
void noGarbageCollectTokens();
Turn off deletion of token objects by buffer.
void garbageCollectTokens();
Turn on deletion of token objects by buffer; this is the default.
virtual void setMinTokens(int k_new);
Specify the minimum number of token objects held by the buffer. The k_new element must be at least as large as the k specified to the ANTLRTokenBuffer constructor.
Parsers
ANTLR generates a subclass of ANTLRParser called P for definitions in your grammar file of the form:
class P { ... }
The commonly used functions that you may wish to invoke or override are:
class ANTLRParser {
public:
virtual void init();
Note: you must call ANTLRParser::init() if you override init().
ANTLRTokenType LA(int i);
The token type of the ith symbol of lookahead where i=1..k.
ANTLRAbstractToken *LT(int i);
The token object pointer of the ith symbol of lookahead where i=1..n (n is the number of tokens remaining in the input).
void setEofToken(ANTLRTokenType t);
When using non-DLG-based scanners, you must inform the parser what token type should be considered end-of-input. This token type is then used by the error recovery facilities to scan past bogus tokens without going beyond the end of the input.
void garbageCollectTokens();
Any token pointer discarded from the token buffer is deleted if this function is called (assuming the reference count is zero for that token). This is the default.
void noGarbageCollectTokens();
The token buffer does not delete any tokens.
virtual void syn(ANTLRAbstractToken *tok, ANTLRChar *egroup, SetWordType *eset, ANTLRTokenType etok, int k);
You can redefine syn() to change how ANTLR reports error messages; see edecode() below.
virtual void panic(char *msg);
Call this if something really bad happens. The parser will terminate.
virtual void consume();
Get another token of input.
void consumeUntil(SetWordType *st); // for exceptions
This function forces the parser to consume tokens until a token in the token class specified (or end-of-input) is found. That token is not consumed. You may want to call consume() afterwards.
void consumeUntilToken(int t);
Consume tokens until the specified token is found (or end of input). That token is not consumed; you may want to call consume() afterwards.
protected:
void edecode(SetWordType *);
Print out in set notation the specified token class. Given a token class called T in your grammar, the set name will be called T_set in an action.
virtual void tracein(char *r);
This function is called upon entry to rule r.
virtual void traceout(char *r);
This function is called upon exit from rule r.
};
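The resynchronization behavior of consumeUntilToken() can be modeled with a short sketch. MiniParser, LA1(), and the token type constants are hypothetical; the sketch reads token types from a vector and mirrors only the documented behavior: stop at the requested token or at the end-of-input token, and leave that token unconsumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum { END = 1, ID = 2, SEMI = 3, BOGUS = 4 };   // hypothetical token types

// Hypothetical stand-in for a parser that reads token types from a vector.
class MiniParser {
    std::vector<int> input_;
    std::size_t pos_;
    int eofToken_;
public:
    explicit MiniParser(const std::vector<int> &input)
        : input_(input), pos_(0), eofToken_(0) {}

    // Tell the parser which token type means end-of-input.
    void setEofToken(int t) { eofToken_ = t; }

    // Token type of the first lookahead symbol.
    int LA1() const { return pos_ < input_.size() ? input_[pos_] : eofToken_; }

    void consume() { if (pos_ < input_.size()) ++pos_; }

    // Skip tokens until t or end-of-input is next; that token stays unconsumed.
    void consumeUntilToken(int t) {
        while (LA1() != t && LA1() != eofToken_) consume();
    }
};
```

A typical recovery step is consumeUntilToken(SEMI) followed by consume(), which discards the rest of a malformed statement together with its terminator.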
AST Classes
ANTLR's AST definitions are subclasses of ASTBase, which is derived from PCCTS_AST (so that the SORCERER and ANTLR trees have a common base). The interesting functions are as follows:
class PCCTS_AST {
// minimal SORCERER interface
virtual PCCTS_AST *right();
Return next sibling.
virtual PCCTS_AST *down();
Return first child.
virtual void setRight(PCCTS_AST *t);
Set the next sibling.
virtual void setDown(PCCTS_AST *t);
Set the first child.
virtual int type();
Return the node type (used by SORCERER).
virtual void setType(int t);
Set the node type (used by SORCERER).
virtual PCCTS_AST *shallowCopy();
Return a copy of the node (used for SORCERER in transform mode). When you implement this, you must NULL the child-sibling pointers. You can define a copy constructor and have shallowCopy() call that. If you want to use dup() with either ANTLR or SORCERER, or -transform mode with SORCERER, you must define shallowCopy().
// not needed by ANTLR; support functions (see SORCERER doc)
virtual PCCTS_AST *deepCopy();
virtual void addChild(PCCTS_AST *t);
virtual void insert_after(PCCTS_AST *a, PCCTS_AST *b);
virtual void append(PCCTS_AST *a, PCCTS_AST *b);
virtual PCCTS_AST *tail(PCCTS_AST *a);
virtual PCCTS_AST *bottom(PCCTS_AST *a);
virtual PCCTS_AST *cut_between(PCCTS_AST *a, PCCTS_AST *b);
virtual void tfree(PCCTS_AST *t);
virtual int nsiblings(PCCTS_AST *t);
virtual PCCTS_AST *sibling_index(PCCTS_AST *t, int i);
virtual void panic(char *err);
};
ASTBase is a subclass of PCCTS_AST and adds the functionality:
class ASTBase : public PCCTS_AST {
public:
ASTBase *dup();
Return a duplicate of the tree.
void destroy();
Delete the entire tree.
static ASTBase *tmake(ASTBase *, ...);
Construct a tree from a possibly NULL root (first argument) and a list of children, followed by a NULL argument.
void preorder();
Preorder traversal of a tree (normally used to print out a tree in LISP form).
virtual void preorder_action();
What to do at each node during the traversal.
virtual void preorder_before_action();
What to do before descending a down pointer link (i.e., before visiting the children list). Prints a left parenthesis by default.
virtual void preorder_after_action();
What to do upon return from visiting a children list. Prints a right parenthesis by default.
};
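To show how the three preorder hooks combine into a LISP-style printout, here is a sketch over a bare child-sibling tree. Node and the free function preorder() are hypothetical; the output is collected in a string instead of being printed, and the exact spacing is an assumption of this sketch rather than ANTLR's format.

```cpp
#include <cassert>
#include <string>

// Hypothetical child-sibling tree node: down = first child, right = next sibling.
struct Node {
    std::string text;
    Node *down;
    Node *right;
    explicit Node(const std::string &t) : text(t), down(0), right(0) {}
};

// Walk a sibling list in preorder, appending a LISP-like form to out.
void preorder(const Node *tree, std::string &out) {
    for (; tree != 0; tree = tree->right) {
        if (tree->down) out += " (";      // role of preorder_before_action()
        out += " " + tree->text;          // role of preorder_action()
        if (tree->down) {
            preorder(tree->down, out);    // visit the children list
            out += " )";                  // role of preorder_after_action()
        }
    }
}
```

For a root "+" with children "a" and "b", the sketch produces " ( + a b )"; subtrees nest in the same way.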
To use doubly linked child-sibling trees, subclass ASTDoublyLinkedBase instead:
class ASTDoublyLinkedBase : public ASTBase {
public:
void double_link(ASTBase *left,ASTBase *up);
Set the parent (up) and previous child (left) pointers of the whole tree. Initially, left and up arguments to this function must be NULL.
PCCTS_AST *left() { return _left; }
Return the previous child.
PCCTS_AST *up() { return _up; }
Return the parent (works for any sibling in a sibling list).
};
Note, however, that the tree routines from ASTBase do not update the left and up pointers. You must call double_link() to update all the links in the tree.
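The effect of double_link() can be sketched on a plain struct. DNode and doubleLink() are hypothetical names, and the recursion is one plausible way to set the links, not necessarily how ASTDoublyLinkedBase does it.

```cpp
#include <cassert>

// Hypothetical doubly linked child-sibling node.
struct DNode {
    DNode *down;    // first child
    DNode *right;   // next sibling
    DNode *left;    // previous sibling
    DNode *up;      // parent
    DNode() : down(0), right(0), left(0), up(0) {}
};

// Set left and up for the whole tree rooted at t.
// The initial call passes NULL (0) for both left and up.
void doubleLink(DNode *t, DNode *left, DNode *up) {
    if (t == 0) return;
    t->left = left;
    t->up = up;
    doubleLink(t->down, 0, t);      // first child: no left sibling, parent is t
    doubleLink(t->right, t, up);    // next sibling: same parent, left is t
}
```

Because routines that only edit down and right leave left and up stale, a call such as doubleLink(root, 0, 0) has to be repeated after the tree is modified.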