Nearly every modern game requires some sort of text parser. This gem, along with the sample code on the CD, demonstrates a powerful but easy-to-use text parsing system designed to handle any type of file format.
Text files have a number of advantages when representing data:
• They are easy to read and edit using any standard text editor. Binary data usually requires a custom-built tool that must be created, debugged, and maintained.
• They are flexible—the same parser can be used for simple variable assignment or a more complex script.
• They can share constants between code and data (more on this later).
Unfortunately, text data has a few drawbacks as well:
• Unlike most binary formats, text must first be tokenized and interpreted, slowing the loading process.
• Stored text is not space efficient; it wastes disk space and slows file loading.
Because many game parameters only need to be tweaked during development, it may be practical to use a text-based format during development, and then switch to a more optimized binary format for use in the shipping product. This provides the best of both worlds: the ease of use of text files, and the loading speed of binary data. We'll discuss a method for compiling text files into a binary format later in the gem.
The Parsing System
Here's what our parser will support:
• Native support for basic data types: keywords, operators, variables, strings, integers, floats, bools, and GUIDs
• Unlimited user-definable keyword and operator recognition
• Support for both C (block) and C++ (single-line) style comments
• Compiled binary read and write ability
• Debugging support, able to point back to a source file and line number in case of error
• #include file preprocessing support
• #define support for macro substitution
Most of the preceding items are self-explanatory, but #include files and #define support may seem a bit out of place when discussing a text parser. We'll discuss how these features can greatly simplify scripts, as well as provide an additional mechanism to prevent scripts and code from getting out of sync.
Macros, Headers, and Preprocessing Magic
Preprocessing data files in the same manner as C or C++ code can have some wonderful benefits. The concept is perhaps best explained by a simple example. Let's assume that we wish to create a number of unique objects using a script file, which will provide the necessary data to properly initialize each object and create unique handles for use in code. Here's what such a script might look like:
CreateFoo(1) { Data = 10 }
CreateFoo(2) { Data = 20 }
CreateFoo(3) { Data = 30 }
CreateBar(4) { Foo = 1 }
Assuming that the CreateFoo() keyword triggers the creation of a Foo object in code, we now have three Foo objects in memory, each with unique member data, created by a script. Also, assuming that we're referencing these objects with handles, we can now access these objects in code with the values of 1, 2, and 3 as unique handles.
Note that in our example, the script can also use these numeric handles. The Bar class requires a valid Foo object as a data member, and so we use a reference to the first Foo object created when creating our first Bar object.
It is easy to lose track of the various handle values after creating several hundred of them. Any time an object is added in the script, the programmer must change the same values in code. There are no safeguards to prevent the programmer from accidentally referencing the wrong script object. This problem has already been solved in C and C++ through the use of header files in which variables and other common elements can be defined for many source files to share. If we think of the text script as simply another source file, the advantages of a C-like preprocessor quickly become apparent. Let's look again at our example using a header file instead of magic numbers.
Header File
// ObjHandles.h
// Define all our object handles
#define SmallFoo 1
#define MediumFoo 2
#define LargeFoo 3
#define SmallBar 4
#define FooTypeX 10
#define FooTypeY 20
#define FooTypeZ 30
Script File
// Directs the parser to scan the header file
#include "ObjHandles.h"
CreateFoo(SmallFoo) { Data = FooTypeX }
CreateFoo(MediumFoo) { Data = FooTypeY }
CreateFoo(LargeFoo) { Data = FooTypeZ }
CreateBar(SmallBar) { Foo = SmallFoo }
In addition to being much easier to read and understand without the magic numbers, the text script and the source code now share the same header file, so their handle values cannot get out of sync.
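To see the code side of this arrangement, here is a minimal sketch of how game code might look up script-created objects through the shared names. The header contents are inlined so the example stands alone, and Foo and FooRegistry are hypothetical stand-ins, not classes from the sample code on the CD.

```cpp
#include <cassert>
#include <map>

// Contents of ObjHandles.h, inlined here so the sketch is self-contained.
// In a real project this would be: #include "ObjHandles.h"
#define SmallFoo 1
#define MediumFoo 2
#define LargeFoo 3
#define FooTypeX 10
#define FooTypeY 20
#define FooTypeZ 30

// Hypothetical object type created by the CreateFoo script keyword.
struct Foo {
    int data;
};

// Hypothetical registry that maps numeric handles to objects.
class FooRegistry {
public:
    // Called while interpreting a CreateFoo(handle) { Data = value } block.
    void Create(int handle, int data) {
        assert(foos_.count(handle) == 0);  // handles must be unique
        foos_[handle] = Foo{data};
    }

    // Returns the object for a handle, or nullptr if the script never created it.
    const Foo* Find(int handle) const {
        std::map<int, Foo>::const_iterator it = foos_.find(handle);
        return it == foos_.end() ? nullptr : &it->second;
    }

private:
    std::map<int, Foo> foos_;
};
```

Game code can then write registry.Find(SmallFoo) instead of registry.Find(1), and a change to the header affects the script and the code together.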
Because we're already performing a simple preprocessing substitution with #define, it's just one more step to actually parse and use more complex macros. By recognizing generic argument-based macros, we can now make complex script operations simpler by substituting arguments. Macros are also handy to use for another reason. Because macros are not compiled in code unless they are actually used (like a primitive form of templates), we can create custom script-based macros without breaking C++ compatibility in the header file.
Note that although we're processing macros and #defines, the parser does not recognize other commands such as #ifdef, #ifndef, and #endif.
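Argument-based substitution can be sketched in a few lines. The version below works on plain strings rather than real tokens, and the Macro struct and ExpandMacro function are illustrative assumptions; the Macro class on the CD may be organized quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical macro record: formal parameter names plus the body as a token sequence.
struct Macro {
    std::vector<std::string> params;
    std::vector<std::string> body;
};

// Returns a copy of the macro body in which every token that names a formal
// parameter is replaced by the corresponding actual argument.
std::vector<std::string> ExpandMacro(const Macro& macro,
                                     const std::vector<std::string>& args) {
    assert(args.size() == macro.params.size());
    std::vector<std::string> out;
    for (const std::string& tok : macro.body) {
        bool replaced = false;
        for (std::size_t i = 0; i < macro.params.size(); ++i) {
            if (tok == macro.params[i]) {
                out.push_back(args[i]);
                replaced = true;
                break;
            }
        }
        if (!replaced) {
            out.push_back(tok);  // ordinary token, copied through unchanged
        }
    }
    return out;
}
```

For a hypothetical macro such as #define MAKE_FOO(id, value) CreateFoo(id) { Data = value }, expanding MAKE_FOO(SmallFoo, 10) yields the same tokens as writing the CreateFoo block by hand.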
The Parsing System Explained
There are five classes in our parsing system: Parser, Token, TokenList, TokenFile, and Macro. The Macro class is a helper class used internally in Parser, so we only need to worry about it in regard to how it's used inside Parser. TokenFile is an optional class used to read and write binary tokens to and from a standard token list. This leaves the heart of the parsing system: Parser, Token, and TokenList. Because Token is the basic building block produced by the parser, let's examine it first.
The Token Class
The basic data type of the parsing system is the Token class. There are eight possible data types represented by the class: keywords, operators, variables, strings, integers, real numbers, Booleans, and GUIDs. Keywords, operators, variables, and strings are all represented by C-strings, and so the only real difference among them is semantic.
Integers, real numbers, and Booleans are represented by signed integers, doubles, and bools. For most purposes, this should be sufficient for data representation. GUIDs, or Globally Unique IDentifiers, are also given native data type status, because it's often handy to have a data type that is guaranteed unique, such as for identifying classes to create from scripts.
The Token class consists of a type field and a union of several different data types. A single class represents all basic data types. Data is accessed by first checking what type of token is being dealt with, and then calling the appropriate Get() function. Asserts ensure that inappropriate data access is not attempted.
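The type-field-plus-union layout might be sketched as follows. All names here (TokenType, MakeInteger, GetInteger, and so on) are assumptions made for illustration, GUIDs are left out for brevity, and the text payload is kept in a std::string outside the union rather than as a C-string inside it.

```cpp
#include <cassert>
#include <string>

// Hypothetical type tag; the real class also has a GUID type.
enum class TokenType { Keyword, Operator, Variable, String, Integer, Real, Boolean };

class Token {
public:
    static Token MakeInteger(int v)  { Token t(TokenType::Integer); t.data_.i = v; return t; }
    static Token MakeReal(double v)  { Token t(TokenType::Real);    t.data_.d = v; return t; }
    static Token MakeBoolean(bool v) { Token t(TokenType::Boolean); t.data_.b = v; return t; }
    static Token MakeVariable(const std::string& s) {
        Token t(TokenType::Variable);
        t.text_ = s;
        return t;
    }

    TokenType GetType() const { return type_; }

    // Each accessor asserts that the caller checked the type first.
    int GetInteger() const  { assert(type_ == TokenType::Integer); return data_.i; }
    double GetReal() const  { assert(type_ == TokenType::Real);    return data_.d; }
    bool GetBoolean() const { assert(type_ == TokenType::Boolean); return data_.b; }
    const std::string& GetText() const {
        assert(type_ == TokenType::Keyword || type_ == TokenType::Operator ||
               type_ == TokenType::Variable || type_ == TokenType::String);
        return text_;
    }

private:
    explicit Token(TokenType t) : type_(t) { data_.d = 0.0; }

    TokenType type_;
    union { int i; double d; bool b; } data_;  // numeric payloads share storage
    std::string text_;                         // payload for the character-based types
};
```

Calling GetInteger() on a variable token trips the assert in a debug build, which is the kind of safeguard the text describes.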
Each of the data types has a role to play in the parser, and it's important to understand how they work so that script errors are avoided. In general, the type definitions match similar definitions in C++. All keywords and tokens are case sensitive.
Keyword
Keywords are specially defined words that are stored in the parser. Two predefined keywords are include and define. User-defined keywords are used primarily to aid in lexicographical analysis of the tokens after the scanning phase.
Operator
An operator is usually a one- or two-character symbol such as an assignment operator or a comma. Operators are unique in that they act like white space in their ability to separate other data types. Because of this, operators always have the highest priority in the scanning routines, meaning that the symbols used in operators cannot be used as part of a keyword or variable name. Thus, using any digit or letter as part of an operator should be avoided. Operators in this parsing system also have an additional restriction: because of the searching method used, any operator that is larger than a single character must be composed of smaller operators. The larger symbol will always take precedence over the smaller symbols when they are not separated by white space or other tokens.
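The precedence rule, in which the larger symbol wins, amounts to a longest-match test at the current scan position. The sketch below uses a plain linear scan over the reserved operators; this is not the searching method used by the parser on the CD, and MatchOperator is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the longest reserved operator that begins at text[pos],
// or an empty string if no operator starts there.
std::string MatchOperator(const std::string& text, std::size_t pos,
                          const std::vector<std::string>& operators) {
    assert(pos <= text.size());
    std::string best;
    for (const std::string& op : operators) {
        // compare() clamps the substring to the end of text, so a partial
        // match at the very end simply fails to compare equal.
        if (op.size() > best.size() && text.compare(pos, op.size(), op) == 0) {
            best = op;
        }
    }
    return best;
}
```

With "=" and "==" both reserved, scanning "a==b" at position 1 yields "==" rather than two separate "=" tokens.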
Variable
A variable is any character-based token that was not found in the keyword list.
String
A string must be surrounded by double quotes. This parser supports strings of lengths up to 1024 characters (this buffer constant is adjustable in the parser) and does not support multiple-line strings.
Integers
The parser recognizes both positive and negative numbers and stores them in a signed integer value. It also recognizes hexadecimal numbers by the 0x prefix. No range checking is performed.
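One way to recognize decimal and 0x-prefixed integers is to lean on the standard std::strtol function, as in this sketch. ParseInteger is a hypothetical helper, and it stores into a long because that is what std::strtol returns; the parser on the CD may do its own conversion.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Tries to interpret an entire lexeme as a decimal or 0x-prefixed hexadecimal
// integer. Returns true and stores the value on success; returns false if the
// lexeme is empty or contains any character that is not part of the number.
bool ParseInteger(const std::string& lexeme, long* out) {
    assert(out != nullptr);
    if (lexeme.empty()) {
        return false;
    }
    std::size_t start = (lexeme[0] == '-' || lexeme[0] == '+') ? 1 : 0;
    bool isHex = lexeme.size() > start + 2 && lexeme[start] == '0' &&
                 (lexeme[start + 1] == 'x' || lexeme[start + 1] == 'X');
    char* end = nullptr;
    long value = std::strtol(lexeme.c_str(), &end, isHex ? 16 : 10);
    if (end == lexeme.c_str() || *end != '\0') {
        return false;  // nothing converted, or trailing characters remain
    }
    *out = value;
    return true;
}
```

A lexeme such as 3.14 is rejected here because the decimal point is left over, which lets the scanner fall through to its floating-point test.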
Floats
Floating-point numbers are called floats and are represented by a double value. The parser will recognize any number with a decimal point as a float. It will not recognize scientific notation, and no range checking is performed on the floating-point number.
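The float rule (digits with a decimal point, no exponent) can be checked character by character before handing the lexeme to std::strtod. ParseFloat is a hypothetical helper written for illustration, and requiring exactly one decimal point is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Accepts an optional sign followed by digits containing exactly one decimal
// point. Anything else, including scientific notation such as 1e5, is rejected.
bool ParseFloat(const std::string& lexeme, double* out) {
    assert(out != nullptr);
    std::size_t i = (!lexeme.empty() && (lexeme[0] == '-' || lexeme[0] == '+')) ? 1 : 0;
    int digits = 0;
    int points = 0;
    for (; i < lexeme.size(); ++i) {
        if (std::isdigit(static_cast<unsigned char>(lexeme[i]))) {
            ++digits;
        } else if (lexeme[i] == '.') {
            ++points;
        } else {
            return false;  // letters such as 'e' end up here
        }
    }
    if (digits == 0 || points != 1) {
        return false;
    }
    *out = std::strtod(lexeme.c_str(), nullptr);
    return true;
}
```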
Booleans
Boolean values are represented as a native C++ bool type, and true and false are built-in keywords. As with C++, these values are case sensitive.
GUIDs
By making use of the macro-expansion code, we can support GUIDs without too much extra work. Note that unless the macro is expanded with ProcessMacros(), the GUID will remain a series of separate primitive types. This function is described later.
The TokenList Class
The TokenList class is publicly derived from a standard STL list of Tokens. It behaves exactly like a standard STL list of tokens, with a couple of additional features: it allows viewing of the file and line number that any given token comes from. This is exclusively an aid for debugging, and can be removed with a compile-time flag.
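One plausible way to arrange the debugging support is shown below. The stand-in Token struct, the TOKEN_DEBUG_INFO flag, and the AddFile, GetFileName, and GetLineNumber members are all assumptions; the source does not say how the real class stores this information.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

#define TOKEN_DEBUG_INFO 1  // hypothetical compile-time flag; set to 0 to strip debug data

// Minimal stand-in for the real Token class.
struct Token {
    std::string text;
#if TOKEN_DEBUG_INFO
    int fileIndex = -1;  // index into the owning list's file table
    int line = 0;        // line number within that file
#endif
};

// Behaves like std::list<Token>, plus optional source-location queries.
class TokenList : public std::list<Token> {
public:
#if TOKEN_DEBUG_INFO
    // Registers a source file name and returns its index for use in tokens.
    int AddFile(const std::string& name) {
        files_.push_back(name);
        return static_cast<int>(files_.size()) - 1;
    }

    const std::string& GetFileName(const_iterator it) const {
        assert(it->fileIndex >= 0 &&
               static_cast<std::size_t>(it->fileIndex) < files_.size());
        return files_[static_cast<std::size_t>(it->fileIndex)];
    }

    int GetLineNumber(const_iterator it) const { return it->line; }

private:
    std::vector<std::string> files_;
#endif
};
```

Storing a small file index in each token, rather than a copy of the file name, keeps the per-token overhead low when many tokens come from the same file.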
The Parser Class
This is the heart of the parsing functionality. We first create a parser object and call the Create() function. Note that all functions return a bool value, using true for success and false for failure. Next, we must reserve any additional operators or keywords beyond the defaults required for the text parsing.
After this comes the actual parsing. The parsing phase is done in three passes, handled by three functions. Splitting the functionality up gives the user more control over the parsing process. Often, for simple parsing jobs, #include file processing and macro substitution are not needed. The first pass reads the files and translates the text directly into a TokenList using the function ProcessSource(). The next function, ProcessHeaders(), looks for any header files embedded in the source, and then parses and substitutes the contents of those headers into the original source. The third function, ProcessMacros(), performs both simple and complex C-style macro substitution.
This can be a very powerful feature, and is especially useful for scripting languages.
Let's see what this whole process looks like. Note that for clarity and brevity's sake, we are not doing any error checking.
// We need a Parser and TokenList object to start
TokenList toklist;
Parser parser;
// Create the parser and reserve some more keywords and tokens
parser.Create();
parser.ReserveKeyword("special_keyword");
parser.ReserveOperator("[");
parser.ReserveOperator("]");
// Now parse the file, any includes, and process macros
parser.ProcessSource("data\\scripts\\somescript.txt", &toklist);
parser.ProcessHeaders(&toklist);
parser.ProcessMacros(&toklist);
The TokenFile Class
Because parsing and processing human readable text files can be a bit slow, it may be necessary to use a more efficient file format in the shipping code. The TokenFile class can convert processed token lists into a binary form. This avoids having to parse the text file multiple times, doing #include searches, macro substitutions, and so forth.
Character-based values, such as keywords, operators, and variables, are stored in a lookup table. All numeric values are stored in binary form, providing additional space and efficiency savings. In general, this binary form can be expected to load five to ten times as fast as the text-based form.
Using the TokenFile class is simple as well. The Write() function takes a TokenList object as an argument, and creates the binary form using either the output stream or filename that was specified. The class can also store the file in either a case-sensitive or case-insensitive manner. If both the variable "Foo" and "foo" appear in the script, turning the case sensitivity off will merge them together in the binary format, providing further space savings. Case sensitivity defaults to off.
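The lookup table and the case-merging behavior can be illustrated with a small string table that hands out one index per distinct string. StringTable and its members are hypothetical names, and the actual binary layout written by TokenFile is not shown in the text, so treat this only as a sketch of the idea.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical lookup table: each distinct string is stored once, and tokens
// in the binary file would refer to it by index.
class StringTable {
public:
    explicit StringTable(bool caseSensitive) : caseSensitive_(caseSensitive) {}

    // Returns the index for a string, adding it the first time it is seen.
    // With case sensitivity off, "Foo" and "foo" share a single entry.
    std::uint32_t Intern(const std::string& s) {
        std::string key = caseSensitive_ ? s : ToLower(s);
        std::map<std::string, std::uint32_t>::const_iterator it = index_.find(key);
        if (it != index_.end()) {
            return it->second;
        }
        std::uint32_t id = static_cast<std::uint32_t>(strings_.size());
        strings_.push_back(s);
        index_[key] = id;
        return id;
    }

    const std::string& Lookup(std::uint32_t id) const {
        assert(id < strings_.size());
        return strings_[id];
    }

    std::size_t Size() const { return strings_.size(); }

private:
    static std::string ToLower(std::string s) {
        for (char& c : s) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        return s;
    }

    bool caseSensitive_;
    std::vector<std::string> strings_;            // entries in index order
    std::map<std::string, std::uint32_t> index_;  // key -> index
};
```

Note that with merging enabled, the table keeps whichever spelling it saw first, so reading the file back returns "Foo" for both original spellings.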
Reading the file is performed with the Read() function. Here's how it looks in code:
TokenFile tf;
// Write a file to disk
tf.Write("somefile.pcs", &toklist);
// Or read it
tf.Read("somefile.pcs", &toklist);
Wrapping Up
Text file processing at its simplest level is a trivial problem requiring only a few lines of code. For anything more complex than this, however, it's beneficial to have a comprehensive text-parsing system that can be as flexible and robust as the job demands.