Annotator Documentation

KNOWN PROBLEMS
===============

OpenC++
--------

### Lexer

+ Problem: wide characters are not supported: `wchar_t', L'x', L"foo".
  Solution: Handled by Annotator.

+ Problem: `restrict' keyword not recognized.
  Solution: Handled by Annotator, with semantics similar to `const'
  or `volatile' (i.e. probably non-conforming).

+ Problem: `explicit' keyword not recognized.
  Solution: Handled by Annotator. It can only appear in constructor
  definitions, and OpenC++ parses it as a return type. It does not
  enforce that `explicit' isn't used as a name, though.

+ Problem: Digraphs and trigraphs are not supported.
  Solution: NONE.

+ Problem: iso646 transcriptions of operators are not supported.
  Solution: use the preprocessor, as C does.

### Parser

+ Problem: type-specifiers in non-canonical order are not recognized.
  Examples include `int typedef a' or `long const int i'.
  Solution: Annotator handles these, except for mix-in typedefs.

+ Problem: namespace alias `namespace A = B' not supported.
  Solution: NONE.

+ Problem: function-try-blocks not supported
     a::a() try : mem_init() { } catch(...) { }
  Solution: NONE.

Annotator
----------

+ Problem: we don't make assumptions about the underlying machine.
  Implications: we don't know anything about alignment, sizeof,
  underlying types of an enumeration, real types of a numeric
  literal. Grep for "MACHINE" to find such things.

+ Problem: no expression evaluation. This is a direct consequence of
  the above.
  Possible fix: simple expressions (not involving sizeof) can still
  be done.

+ Problem: a null-pointer constant (NPC) is only recognized if it is
  a literal "0", "0x0", "0ul" or something like that. In particular,
  `const int NULL = 0' cannot be used as an NPC. This is a direct
  consequence of the above.

+ Problem: Array dimensions are currently not handled at all. This is
  a direct consequence of the above.
  Implications:
     void foo(int (&a)[10]);
     void foo(int (&a)[20]);
  declares one function signature, namely "function taking reference
  to array of int". Even worse,
     new int[99]
  calls the array-new operator, but loses the dimension.

+ Problem: namespace support is very incomplete. We're missing
  `using' and Koenig lookup.

+ Problem: String literals have type "char[]", not "const char[]".
  This avoids the need for the deprecated "const char[] -> char*"
  cast rule.
  Implication: might call a wrong function if something is overloaded
  by "char*" and "const char*".

Misinterpreted source code
---------------------------

+ Problem: Partially braced array initializers are handled wrong.
     int i[2][2] = {1,2,3,4};
  is treated as
     int i[][] = {{1,2,3,4}};
  Likewise,
     struct a { int i, j[2], k; } x = { 1, 2, 3, 4 };
  is handled as
     ... = { 1, {2, 3, 4}, 0}
  Solution: add correct braces to source code. You'll get a warning
  when this happens.

+ Problem: accessing a static member of an object as "obj.member" is
  translated into "obj, class::member", which adds a sequencing
  constraint to the output tree which the input does not have.
  Solution: convert to "class::" syntax manually. You'll get a
  warning.

+ Problem: direct-initialisation is handled wrong when it is
  syntactically a declaration, because OpenC++ parses it wrong.
     int i = 0, j(i);
  bombs out with "no type i defined". A workaround is possible but
  currently not implemented.
  Solution: add extra parens:
     int i = 0, j((i));

+ Problem: Some expressions are parsed as C-style casts. Function
  calls are resolved correctly by the Annotator:
     int foo(int,int);
     int i = (foo)(1, 2);
  Binary operators are *not* resolved correctly in all cases:
     int j = (i) + 3*2;
  Since OpenC++ parses this as an expression `((i) +3) * 2' (C-style
  cast of the value `+3'), we get the wrong bindings. You'll get a
  warning when this might happen.
  Solution: add parentheses, as in `int j = ((i)) + 3*2'.

+ Problem: Language linkage is ignored.
  According to ISO C++,
     namespace a { extern "C" void foo(); }
     extern "C" void foo();
  declares the same function twice. We declare two different
  functions. Likewise, it's not detected when two different function
  signatures are declared "extern "C"".

+ Problem: in "if(condition)" etc., expressions which are
  syntactically equivalent to declarations are parsed as
  declarations. For example,
     struct a { int& operator*(a); } the_a;
     // ...
     if (the_a * the_a = 9) { ... }
  will complain that "the_a" is not a type. I think this is too rare
  to worry about.
  Solution: add parentheses.

Suboptimal Rendering
---------------------

+ Problem: Anonymous namespaces are equivalent to
  "namespace __unnamed". This way, we don't guarantee uniqueness of
  names across translation units.
  Fix: change `Symbol_name::get_unnamed_namespace_name'.

+ Problem: Blocks are numbered on a per-file basis. Two translation
  units
     void foo(int) { int i; }
  and
     void foo(char*) { long i; }
  define two entities "foo.1::i" resp. "i?Q3foo?1".

+ Problem: Template support is rudimentary only. We only support
  class templates, which are instantiated as a whole when used, and
  not looked at at all when they are not used. This might violate
  some rules regarding name binding, and can cause some code to be
  rejected (e.g., "std::list" can't be instantiated when the member
  type has no "<" operator, because "std::list::sort" needs it).

Undetected constraint violations
---------------------------------

I assume that input is a valid program; the compiler should detect all
constraint violations. For robustness, we also detect many violations
but by far not all. Here are the major goofs.

+ A typedef name which refers to a class is 100% equivalent to that
  class. That is,
     struct X { };
     typedef X Y;
     struct Y* p;
  works. 7.1.3p4 says it should not. Similarly (7.1.3p5):
     typedef struct { typedef int a; } X;
     X::X::a var;

+ Protection is completely ignored. Protection doesn't affect the
  semantics of the program, only its well-formedness.

+ Class rescanning rules are ignored.
     typedef int X;
     struct A { X p; typedef char* X; };
  is accepted (and A::p has type int), 3.3.6p1.5 says it's wrong.

+ Initialisation of a wchar_t[] with a narrow literal, or vice versa,
  is accepted:
     char x[] = L"foo";
     wchar_t y[] = "bar";

Constructs not handled by Annotator
------------------------------------

+ "using".

+ A class can't be direct base class and virtual base class at the
  same time.
     struct A {};
     struct B : virtual A {};
     struct Foo : A, B {};
  This construction makes little sense anyway because the direct base
  is not accessible.

+ Default parameters. These are not allowed in SafeC++ anyway.

+ "try"/"catch"

+ nameless unions

+ destructor invocation magic: using a typedef-name to invoke a
  destructor (12.4p12) or destroying a builtin type (12.4p15).

+ dynamic_cast, const_cast

+ operator overloading for most binary operators

COMMAND LINE
=============

Files with extension ".i" or ".ii" are fed into OpenC++ directly;
everything else goes through the preprocessor (cpp).

-Dxxx, -Uxxx, -Ixxx
        these are passed directly to the preprocessor and
        define/undefine symbols, or specify include paths.

-dxxx   specify debug flags. Non-cumulative, one "-d" cancels the
        previous one.
          a   show addresses of nodes in dumps, i.e.
                 [sym, type] Class @address
          b   show function bodies
          c   use color in dumps
          i   show initializers of variables
          s   show "semantics" of translation unit
          t   show symbol table
          v   verbose, show stages of translation
          V   more verbose

OUTPUT FORMATS
===============

All operators appearing in output are builtin operators with the
builtin meaning (ISO C++ clause 4). User-defined operators are
translated into function calls.

The semantics of a translation unit is the semantics of all its
initializers in the correct order. Declarations, definitions of
namespaces and classes, and so on, have no semantics.

Output falls into two categories: statements and expressions. Only
those are output. Output is a (restricted) OpenC++ parse tree.
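
As a rough illustration of the rule that user-defined operators are
output as function calls, the following C++ sketch shows the two
equivalent spellings. The names `Vec', `written_form', `call_form' and
`check_operator_forms' are made up for this example and are not part
of Annotator or OpenC++; the real output is a parse tree, not C++
source.

```cpp
#include <cassert>

// Hypothetical user type, only for illustration.
struct Vec { int x, y; };

// User-defined operator+; its body uses only the builtin int "+".
Vec operator+(const Vec& a, const Vec& b) {
    return Vec{a.x + b.x, a.y + b.y};
}

// Source form, as a programmer would write it.
Vec written_form(const Vec& a, const Vec& b) { return a + b; }

// Call form: the same operation spelled as an explicit function call.
Vec call_form(const Vec& a, const Vec& b) { return operator+(a, b); }

// Both spellings must produce the same value.
void check_operator_forms() {
    Vec w = written_form(Vec{1, 2}, Vec{3, 4});
    Vec c = call_form(Vec{1, 2}, Vec{3, 4});
    assert(w.x == 4 && w.y == 6);
    assert(c.x == w.x && c.y == w.y);
}
```

In the call form, the only operators left are the builtin int
additions inside the operator function, which matches the statement
that every operator remaining in output is a builtin one.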

### Notation

Format
| [symbol, type] ClassName
|   a
|   b

  This is a node of type `Annotated<ClassName>', where `ClassName' is
  not a leaf. The encoded symbol is "symbol" which has type "type".
  The node has children "a" and "b".

| [symbol, type] Token

  This is Annotated<XX> where XX is (derived from) Leaf. The Token is
  the "print name". You get almost-C++ by concatenating the Tokens.
  The token is informational only. The actual unique symbol name is
  in the symbol.

In color dumps ("-dc"), actual symbol names are red, types are green,
and printable tokens are blue.

Expressions
------------

Expressions will contain ONLY those elements listed here. In
particular:
+ user-defined operators are resolved as function calls.
+ "a->b" is turned into "(*a).b", "a->*b" into "(*a).*b".
+ parentheses are removed.

Every expression (sub-)tree is annotated with at least its type.

| [SYMBOL, TYPE] name

| [SYMBOL, TYPE] PtreeName
|   a
|   ::
|   b

  The actual form of the tree is irrelevant, it's just the annotation
  which counts. Not all structured names are preserved.

  Names may have more types than allowed in C++. For example, the
  initializer in "int (a::*p)() = &a::fun" is translated into
     [a::fun, MF_iQ1a] a::fun
  that is, just a literal of type member-function-of-a. In addition,
  function signatures may appear as themselves or as pointers.

  The symbol may be
  + a Variable_symbol
  + a Function_signature
  See also "function call".

| [no symbol, TYPE] elem

  Literal, for example "1", "'x'", "0.9", "true", etc.

| [no symbol, TARGET] PtreeFstyleCastExpr
|   [no symbol, TARGET] TARGET
|   (
|   EXPR
|   )

  A built-in cast. This is used for all sorts of (static) casts which
  are permitted in C++. In particular, none of these nodes actually
  encodes a function call. This is also used for derived-to-base
  conversions.

| [no symbol, RESULT_TYPE] PtreeFuncallExpr
|   FUNCTION
|   (
|   NonLeaf
|     EXPR1
|     ,
|     EXPR2
|   )

  Function call. This is used for all sorts of function calls.
  + Normal function: "function" is a Name node
       [symbol, type] symbol
    where the "type" is a function type.
  + Member function: the first parameter is the implicit "this"
    parameter. The call is a virtual call(!) if the annotation does
    not have the af_DirectCall flag and the function is virtual.
  + The "function" may also be of type "pointer to function" or
    "pointer to member function". In the latter case, the first
    argument is the object this member pointer is bound to.
  + Constructors are encoded as `function taking (ctor args)
    returning object'. The real semantics of a constructor call
    differ slightly, but Annotator makes no assumptions about how it
    is finally implemented.
  The list is empty (=null) when no args are passed.

| [no symbol, TYPE] PtreeUnaryExpr
|   OP
|   EXPR

  OP can be any unary operator: ~ ! - + ++ -- * &. This one is used
  for builtin operators only.

  Special case: pointer to member literal ("&foo::bar").
     [no symbol, TYPE] PtreeUnaryExpr
       &
       [varsym, VAR_TYPE] PtreeName or LeafName
  This case can be detected because it has a TYPE kind of "k_Member".
  The builtin "&" never returns that.

| [no symbol, TYPE] PtreePostfixExpr
|   EXPR
|   OP

  This is used to represent the builtin postfix "++" and "--"
  operators.

| [no symbol, SIZE_T] PtreeSizeofExpr
|   sizeof
|   (
|   [no symbol, TYPE] TYPE
|   )

  Expressions are translated into the "sizeof(TYPE)" form. This is
  NOT equivalent to what OpenC++ would generate.

| [no symbol, RTYPE] PtreeCommaExpr
|   LHS
|   ,
|   RHS

  Note that the RHS of a comma expression is never itself a comma
  expression. This simplifies overload resolution.
  + "(a, (b, c))" is turned into "((a, b), c)".
  + "(a, foo)()" is turned into "(a, foo())".

| [SYM, TYPE] PtreeDotMemberExpr
|   EXPR
|   .
|   [SYM, TYPE] NAME

  "name" is always a nonstatic data member. Static data member access
  is translated into "(expr, class::name)". Function access is
  translated differently. "a->b" access is translated into "(*a).b".
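
The two member-access rewrites described for PtreeDotMemberExpr can be
written out in plain C++ as a sketch. The names `Counter', `touch',
`member_source_form', `member_canonical_form' and `check_member_forms'
are hypothetical and exist only for this example; Annotator emits a
tree, not source text.

```cpp
#include <cassert>

struct Counter {
    int value;              // nonstatic data member
    static int instances;   // static data member
};
int Counter::instances = 0;

int touches = 0;            // counts evaluations of the object expression

// Object expression with a visible side effect.
Counter& touch(Counter& c) { ++touches; return c; }

// Source form: "->" access, and a static member accessed via an object.
int member_source_form(Counter* p) {
    return p->value + touch(*p).instances;
}

// Canonical form: "(*p).value" and "(expr, Counter::instances)".
int member_canonical_form(Counter* p) {
    return (*p).value + (touch(*p), Counter::instances);
}

// Both forms yield the same value and evaluate touch() once each.
void check_member_forms() {
    Counter c{5};
    Counter::instances = 2;
    touches = 0;
    assert(member_source_form(&c) == 7);
    assert(member_canonical_form(&c) == 7);
    assert(touches == 2);
}
```

The comma form keeps the evaluation of the object expression, which
is why the rewrite preserves side effects but introduces an explicit
ordering between the two operands.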

| [no symbol, TYPE] PtreePmExpr
|   EXPR
|   .*
|   MPTR

  This is only used for normal member pointers, not for
  member-function pointers. There are no expressions with "->*";
  these are converted into the canonical form using ".*".

| [no symbol, VOID] PtreeThrowExpr
|   throw
|   EXPRESSION-OR-NULL

  The expression is null when we rethrow an exception unchanged.

| [no symbol, TYPE] PtreeAssignExpr
|   LHS
|   OPER
|   RHS

  OPER is "=" or a compound assignment operator ("+=", ...). These
  are used for built-in assignment and compound assignment operators.
  They are not used for classes.

  Note that, for things like "-=", operands are *not* cast into a
  common type.
     float f; int i;
     i = i - f;       // this is "i = int(float(i) - f)"
     i -= f;          // this is *not* "i -= int(f)". Try i=9, f=2.5
  Enums on the RHS are cast into ints, however.

| [no symbol, TYPE] PtreeInfixExpr
|   LHS
|   OP
|   RHS

  This is used for the builtin infix operators: "+", "-", "*", "/",
  "%", "&", "|", "^", "&&", "||", "<<", ">>", "==", "<=", "<", ">",
  ">=", "!=".

  Note that pointer additions are brought into the canonical form
  with the pointer on the left-hand side. This is expected to
  simplify post-processor code, but might change evaluation order.

| [no symbol, VOID] PtreeDeleteExpr
|   delete
|   [
|   ]
|   EXPR

| [no symbol, VOID] PtreeDeleteExpr
|   delete
|   EXPR

  Delete expression. So far, no attempt is made to look up the delete
  function to use.

| [function, PTR_TYPE] PtreeNewExpr
|   new
|   PLACEMENT-ARGS
|   NonLeaf
|   [no symbol, ALLOC_TYPE] TYPE-NAME
|   PtreeDeclarator
|   null
|   INITIALIZER

  PLACEMENT-ARGS is either null, or a standard function argument
  list: [( [arg1 , arg2 , arg3] )]. This contains the parameters to
  pass to the allocation function.

  ALLOC_TYPE is the type to allocate. The interesting syntax
  corresponds to OpenC++'s parse tree for a type consisting only of
  a typedef-name.

  INITIALIZER is the initializer, in the usual form. It may be
  missing.

  Note that for array allocation the array bounds are currently lost.
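
The PtreeAssignExpr note on "-=" above suggests trying i=9, f=2.5. A
minimal C++ sketch of that experiment follows; the function names
`compound_minus', `truncated_minus' and `check_compound_minus' are
made up for illustration.

```cpp
#include <cassert>

// "i -= f" with builtin semantics: computed as i = int(float(i) - f).
int compound_minus(int i, float f) {
    i -= f;
    return i;
}

// The incorrect rewrite "i -= int(f)", which truncates f first.
int truncated_minus(int i, float f) {
    i -= static_cast<int>(f);
    return i;
}

// The two differ for i=9, f=2.5.
void check_compound_minus() {
    assert(compound_minus(9, 2.5f) == 6);   // 9 - 2.5 = 6.5, truncated to 6
    assert(truncated_minus(9, 2.5f) == 7);  // 9 - 2 = 7
}
```

A post-processor that converted the right-hand side to the type of the
left-hand side before subtracting would therefore compute a different
result, which is why the operands are left uncast in the output.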

| [no symbol, TYPE] PtreeCondExpr
|   IF-EXPR
|   ?
|   THEN-EXPR
|   :
|   ELSE-EXPR

  IF-EXPR is of type bool.

| [no symbol, TYPE] PtreeArrayExpr
|   ARRAY-EXPR
|   [
|   INDEX-EXPR
|   ]

  Note that currently all array expressions are converted into the
  canonical form with the array on the left side. This is expected to
  simplify post-processing, but might modify evaluation order.

| [no symbol, TYPE] "this"

  This leaf has type "LeafThis".

Code
-----

Statements generally have no annotation; only expressions contained in
them. You can assume that the scope of every PtreeDeclaration ends at
the enclosing PtreeBlock (to call destructors).

| PtreeWhileStatement
|   while
|   (
|   BOOL-EXPR
|   )
|   STATEMENT

  OpenC++ doesn't recognize the form "while(declaration)", so neither
  do we. You can assume that the argument to "while" will always be
  an expression; when it is a declaration, Annotator will modify the
  source accordingly.

| PtreeIfStatement
|   if
|   (
|   BOOL-EXPR
|   )
|   STATEMENT
|   else
|   STATEMENT

  The same as said for PtreeWhileStatement holds here, too. The last
  two parts may be missing.

| PtreeSwitchStatement
|   switch
|   (
|   INT-OR-ENUM-EXPR
|   )
|   STATEMENT

  The same as said for PtreeWhileStatement holds here, too.

| PtreeDoStatement
|   do
|   STATEMENT
|   while
|   (
|   BOOL-EXPR
|   )
|   ;

| PtreeForStatement
|   for
|   (
|   NULL-STATEMENT
|   EXPRESSION-OR-NULL
|   ;
|   EXPRESSION-OR-NULL
|   )
|   STATEMENT

  C++ syntax: "for (; expr1; expr2) stmt"

  "for (a; b; c) x" is translated into "{ a; for (; b; c) x }".
  Therefore, post-processors need not deal with the scope of the
  induction variable. The NULL-STATEMENT is an expression statement
  (see below) with no expression.

| PtreeBlock
|   {
|   NonLeaf
|     STATEMENT
|     STATEMENT
|     ...
|   }

  Every block establishes a scope (including lifetime control) for
  variables defined in it.

| PtreeExprStatement
|   ANNOTATED-EXPRESSION
|   ";"

  A null statement has a null pointer instead of the
  ANNOTATED-EXPRESSION. Null statements are never generated in output
  EXCEPT as...
  + sub-statement to labels / default / case
  + sub-statement in `for'
  Null statements in source code are translated into nothing, or into
  empty PtreeBlocks.

| [TYPE, SYMBOL] PtreeDeclaration
|   null
|   TYPE
|   PtreeDeclarator
|   NonLeaf
|   NAME
|   ;

  A declaration statement. All information is in the annotation for
  the PtreeDeclaration; the other stuff is just to get an
  almost-correct syntax for OpenC++. The actual initializer is in the
  SYMBOL.

| PtreeLabelStatement
|   NAME
|   :
|   NULL-STATEMENT

| PtreeDefaultStatement
|   default
|   :
|   NULL-STATEMENT

| PtreeCaseStatement
|   case
|   EXPR
|   ":"
|   NULL-STATEMENT

  Syntactically, the "foo;" in "label: foo;" is a sub-statement of
  the label, like it is a sub-statement of the "if" in "if(x) foo;".
  The Annotator splits this because, unlike in the if case, "foo"
  doesn't have a scope of its own. Therefore, "label: foo;" is
  translated into two statements, namely "label:;" and "foo;". That's
  reflected by the NULL-STATEMENT, which is a PtreeExprStatement with
  no expression.

  No validity checks are performed so far. In particular, no symbol
  name handling is done for labels.

| PtreeGotoStatement
|   goto
|   LABEL
|   ;

| PtreeBreakStatement
|   break
|   ;

| PtreeContinueStatement
|   continue
|   ;

  No validity checks so far.

| PtreeReturnStatement
|   return
|   EXPR
|   ;

| PtreeReturnStatement
|   return
|   ;

  The first form is used in functions returning non-void, the second
  one in functions returning void. The statement "return expr();" in
  a void function is translated into "expr(); return;".

Initializers
-------------

An initializer for a variable of type T is either an expression of
type T, or a PtreeBrace containing initializers for all members. This
holds recursively.

If the initializer expression is a constructor call (i.e. a
PtreeFuncallExpr whose first child is annotated with a ctor symbol),
it's an in-place construction; otherwise, it is copy-initialisation
from a matching expression.

Sample initializers (<x> means the expression x):

   struct a { int i, j; } the_a = { 99, 98 };
        ==> "{ <99>, <98> }"
   struct b { a a1, a2; };
   b the_b = { 1, 2, 3, 4 };
        ==> "{ { <1>, <2> }, { <3>, <4> } }"
   b the_b2 = { the_a, the_a };
        ==> "{ <the_a>, <the_a> }"
   struct x {} the_x;
   struct y { int i; x the_x; char* b; };
   y the_y = { 1, "foo" };
        ==> "{ <1>, { }, <"foo"> }"
   y the_y2 = { 1, the_x, "bar" };
        ==> "{ <1>, <the_x>, <"bar"> }"
   struct z { operator x(); operator char*(); } the_z;
   y the_y3 = { 1, the_z, "blub" };
        ==> "{ <1>, , <"blub"> }"
   y the_y4 = { 1, the_z };
        ==> "{ <1>, , <0> }"
   b b_copy((the_b));
        ==> ""
   b b_copy = the_b;
        ==> ""

TRANSFORMATIONS
================

These are all the transformations which Annotator performs. All which
do not preserve semantics 100% should yield a warning.

+ Comma expressions:
     (a, (b, c))
     => ((a, b), c)
  Rationale: simplify overload resolution in case "c" is an
  overloaded function name
  Semantics: no change

+ Comma expressions:
     (a, foo)(x)
     => (a, foo(x))
  Rationale: simplify function call. When a function F is called, its
  symbol should appear on the LHS of the PtreeFuncallExpr node to
  avoid taking an unnecessary function pointer.
  Semantics: this adds an evaluation ordering constraint which the
  original code does not contain. You'll get a warning
     calling a comma expression
  Workaround: modify code, or remove that optimisation at the cost of
  making post-processing harder

+ Call of static method, or access static variable:
     obj.static_member()
     => (obj, class::static_member())
  Rationale: simplify call (PtreeFuncallExpr) and access
  (PtreeDotMemberExpr) nodes
  Semantics: this adds an evaluation ordering constraint which the
  original code does not contain. You'll get a warning.
     using a static variable with object
     calling a static function with object
  Workaround: modify input code

+ Recovery from parse error:
     (function)(a, b, c)        [a C cast]
     => function(a, b, c)       [a function call]
  Problem: parenthesized function name parsed as C cast
  Semantics: no change

+ Recovery from parse error:
     (value) + x                [C cast of unary expression]
     => value + x               [binary operator]
  Problem: parenthesized variable name parsed as C cast
  Semantics: When the expression is contained inside a binary
  expression, the operator ranks are handled wrong. "(i)+3*4" turns
  into "(i+3)*4" instead of "i+(3*4)". You'll get a warning.
     expression of style `(expr) + expr' might be translated wrong
  Workaround: add extra parentheses.

+ Pointer arithmetic:
     scalar[pointer]
     => pointer[scalar]
  Rationale: simplify post-processing
  Semantics: might change evaluation order. You'll get a warning.
     swapping sides of array subscript expression
  Workaround: modify code, or remove that optimisation at the cost of
  making post-processing harder

+ Pointer arithmetic:
     scalar + pointer
     => pointer + scalar
  Rationale: simplify post-processing
  Semantics: might change evaluation order. You'll get a warning.
     swapping sides of pointer addition
  Workaround: modify code, or remove that optimisation at the cost of
  making post-processing harder

+ Void return:
     return void_expr();
     => void_expr(); return;
  Rationale: simplify post-processing
  Semantics: no change

+ Labels:
     label: statement;
     => label:; statement;
  Rationale: simplify post-processing
  Semantics: no change

+ Null statements are removed.
  Rationale: simplify post-processing
  Semantics: no change

+ Sub-statements are turned into blocks:
     if (cond) foo();
     => if (cond) { foo(); }
  Rationale: simplify post-processing
  Semantics: no change

+ For is simplified:
     for (init; cond; step) { }
     => { init; for( ; cond; step) { } }
  Rationale: simplify post-processing
  Semantics: no change

+ If is simplified:
     if (declaration) { }
     => { declaration; if (var) { } }
  Rationale: simplify post-processing
  Semantics: no change

+ While is simplified:
     while (declaration) body;
     => while(true) { if (!bool(declaration)) break; body; }
  Rationale: simplify post-processing
  Semantics: no change

+ Switch is simplified:
     switch (declaration) body;
     => { declaration; switch(var) body; }
  Rationale: simplify post-processing
  Semantics: no change

+ static_cast, const_cast, C-style cast, fstyle cast and implicit
  casts are turned into fstyle casts.
  Rationale: simplify post-processing
  Semantics: no change

DATA STRUCTURES
================

The symbol table (Symbol_table::get_instance()) contains all symbols
defined in the program; the Annotator contains the semantics of the
whole translation unit (i.e. a list of PtreeDeclaration nodes as
described above under "Code").

General
--------

Each symbol table entry has two slots, "tag" and "untag". The "tag"
slot contains the "struct tag" of that name (i.e. this symbol contains
a user-defined type), the "untag" slot contains the "other" symbol.
Possible combinations are:

     untag   tag
       0      0      empty
       p      0      function, variable, ...
       q      q      just a struct/enum/union/class
       p      q      a struct and a function

Every symbol can have one of three states:
+ undefined (st_Undefined): symbol is known to exist, but has not
  been declared yet. An example would be "namespace std" or
  "class std::typeinfo".
+ declared (st_Declared): we've seen a declaration.
+ defined (st_Defined): we've seen the definition.

Functions
----------

A Function_symbol entry in the symbol table contains possibly many
signatures.
The signatures themselves don't appear in the symbol table.

Function_symbol::signatures - all signatures
Function_symbol::declared_scope - parent scope
Function_symbol::fun_kind - kind (see Symbol_name::Kind)

Function_signature::proto_type - type that appears in the prototype
    (e.g. "void foo::bar(int)" has a proto-type of "void (int)").
Function_signature::this_type - type of "*this", or invalid type if
    this is not a member function
Function_signature::call_type - type used when calling this function,
    e.g. a member function "void foo::bar(int)" has call-type
    "void (foo&, int)"
Function_signature::storage_spec - storage class specifier. Valid
    values are "s_Member", "s_Static" and "s_Extern". Note that static
    members actually are s_Extern because they are externally visible.
Function_signature::function_spec - function specifier set. All
    combinations permitted.
Function_signature::builtin - this one's only used during overload
    resolution to mark builtin operators. Don't use.
Function_signature::generated - true iff this is a generated function,
    e.g. a default constructor or destructor. These will NOT have a
    function body.
Function_signature::backlink - link to the Function_symbol
Function_signature::definition - function body; a list of statements.
    To check whether a function is defined, use the symbol status; the
    definition may be null (=an empty list) if the function has an
    empty body!
Function_signature::initializers - list of member initializers, for
    constructors only. Each of these initializers is a list of the
    form
       NonLeaf
         [SYMBOL, TYPE] NAME
         =
         INITIALIZER
    where the SYMBOL is either a (base) class symbol or a member
    variable. The INITIALIZER has the usual form, and may be NULL when
    a POD member is not initialized. The INITIALIZER is *NOT*
    restricted to the syntactic forms of C++.
    For example,
       struct a { int i, j; };
       struct b { a the_a; b() {} };
    will generate a member initializer for the_a of the form
       [the_a = [{ [null , null] }]]
Function_signature::parameters - list of parameter variables

Variables
----------

These symbols represent global, member and local variables, and enum
values.

Variable_symbol::type - type of variable
Variable_symbol::storage_class - storage class. Everything possible
    but s_None.
Variable_symbol::initializer - initializer. May be null if variable is
    not (explicitly) initialized
Variable_symbol::bitsize - unannotated tree for bitsize in a bitfield
    struct
Variable_symbol::has_addr - true iff this is a real variable, false if
    enumerator.
Variable_symbol::in_class - for member variables, the associated class
    symbol.

Classes
--------

Class_symbol::k - kind (k_Union or k_ClassOrStruct)
Class_symbol::in_scope - parent scope
Class_symbol::real_name - real name of the class; that's the name of
    the constructor (i.e. "X" is in the symbol table as "X__i" resp.
    "X?T1i", but its real name is "X").
Class_symbol::base_classes - normal base classes
Class_symbol::virtual_base_classes - virtual base classes
Class_symbol::members - member variables (static and non-static)
Class_symbol::member_functions - member functions
Class_symbol::pod - true iff this is a POD class
Class_symbol::aggregate - true iff this is an aggregate class

Rest
-----

Namespace_symbol, Typedef_symbol, Enum_symbol, Template_class_symbol:
these don't contribute to semantics.

Typedef_symbol::type - base type
Enum_symbol::values - all enumerators, in order
Template_class_symbol::definition - the class definition, a tree
    containing an elaborated type specifier describing a class
Template_class_symbol::members - members defined out-of-line, a list
    of declarations
Template_class_symbol::special - specialisations generated so far
Template_class_symbol::defined_in_scope - scope in which template is
    defined

LISP
=====

(defun underlined-outline-mode ()
  (interactive)
  (setq outline-regexp "^.*\n\\([-=]\\)+\\|##+ .*$")
  (setq outline-level
        (lambda ()
          (save-excursion
            (looking-at outline-regexp)
            (if (looking-at "^##")
                3
              (let ((WHAT (substring (match-string 1) 0 1)))
                (if (equal WHAT "=")
                    1
                  (if (equal WHAT "-") 2 3)))))))
  (outline-mode))

Local Variables:
--mode: underlined-outline
End: