From 9683cfea6dbbdf8f82e6cd58d52360f958b2322c Mon Sep 17 00:00:00 2001 From: chriseth Date: Tue, 3 Jan 2017 15:19:14 +0100 Subject: Update to new assembly specification. --- docs/assembly.rst | 412 ++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 381 insertions(+), 31 deletions(-) diff --git a/docs/assembly.rst b/docs/assembly.rst index 71fe4027..8ba3f56c 100644 --- a/docs/assembly.rst +++ b/docs/assembly.rst @@ -27,6 +27,12 @@ arising when writing manual assembly by the following features: * assembly-local variables: ``let x := add(2, 3) let y := mload(0x40) x := add(x, y)`` * access to external variables: ``function f(uint x) { assembly { x := sub(x, 1) } }`` * labels: ``let x := 10 repeat: x := sub(x, 1) jumpi(repeat, eq(x, 0))`` +* loops: ``for { let i := 0 } lt(i, x) { i := add(i, 1) } { y := mul(2, y) }`` +* switch statements: ``switch x case 0: { y := mul(x, 2) } default: { y := 0 }`` +* function calls: ``function f(x) -> (y) { switch x case 0: { y := 1 } default: y := mul(x, f(sub(x, 1))) }`` + +.. note:: + Of the above, loops, function calls and switch statements are not yet implemented. We now want to describe the inline assembly language in detail. @@ -91,40 +97,165 @@ you really know what you are doing. Standalone Assembly =================== -Grammar -------- +This assembly language tries to achieve several goals: + +1. Programs written in it should be readable, even if the code is generated by a compiler from Solidity. +2. The translation from assembly to bytecode should contain as few "surprises" as possible. +3. Control flow should be easy to detect to help in formal verification and optimization. + +In order to achieve the first and last goal, assembly provides high-level constructs +like ``for`` loops, ``switch`` statements and function calls. It should be possible +to write assembly programs that do not make use of explicit ``SWAP``, ``DUP``, +``JUMP`` and ``JUMPI`` statements, because the first two obfuscate the data flow +and the last two obfuscate control flow. Furthermore, functional statements of +the form ``mul(add(x, y), 7)`` are preferred over pure opcode statements like +``7 x y add mul`` because in the first form, it is much easier to see which +operand is used for which opcode. + +The second goal is achieved by introducing a desugaring phase that only removes +the higher level constructs in a very regular way and still allows inspecting +the generated low-level assembly code. The only non-local operation performed +by the assembler is name lookup of user-defined identifiers (functions, variables, ...), +which follow very simple and regular scoping rules and cleanup of local variables from the stack. + +Scoping: An identifier that is declared (label, variable, function, assembly) +is only visible in the block where it was declared (including nested blocks +inside the current block). It is not legal to access local variables across +function borders, even if they would be in scope. Shadowing is allowed, but +two identifiers with the same name cannot be declared in the same block. +Local variables cannot be accessed before they were declared, but labels, +functions and assemblies can. Assemblies are special blocks that are used +for e.g. returning runtime code or creating contracts. No identifier from an +outer assembly is visible in a sub-assembly. + +If control flow passes over the end of a block, pop instructions are inserted +that match the number of local variables declared in that block, unless the +``}`` is directly preceded by an opcode that does not have a continuing control +flow path. The stack height is reduced by the number of local variables +regardless of that. This mean that labels in the next block will have the +same height as before the block that just ended. + +If at the end of a block, the stack is not balanced, a warning is issued, +unless the last instruction in the block did not have a continuing control flow path. + +Why do we use higher-level constructs like ``switch``, ``for`` and functions: + +Using ``switch``, ``for`` and functions, it should be possible to write +complex code without using ``jump`` or ``jumpi`` manually. This makes it much +easier to analyze the control flow, which allows for improved formal +verification and optimization. + +Furthermore, if manual jumps are allowed, computing the stack height is rather complicated. +The position of all local variables on the stack needs to be known, otherwise +neither references to local variables nor removing local variables automatically +from the stack at the end of a block will work properly. Because of that, +every label that is preceded by an instruction that ends or diverts control flow +should be annotated with the current stack layout. This annotation is performed +automatically during the desugaring phase. + +Example: + +We will follow an example compilation from Solidity to desugared assembly. +We consider the runtime bytecode of the following Solidity program:: -The assembly lexer follows the one defined by Solidity itself. + contract C { + function f(uint x) returns (uint y) { + y = 1 + for (uint i = 0; i < x; i++) + y = 2 * y; + } + } -Whitespace is used to delimit tokens and it consists of the characters -Space, Tab and Linefeed. Comments as defined below, are interpreted in the -same way as Whitespace. -Furthermore, the following tokens exist: - -TODO: escapes inside strings, decimal literals, hex literals, hex string literals - -``OneLineComment := "//" [^\n]*` -``MultiLineComment := "/*" .*? "*/"`` - -``String := '"' [^"]* '"' | "'" [^']* "'"`` -``Identifier := [_$a-zA-Z][_$a-zA-Z0-9]*`` -``Opcodes := -"add" | "addmod" | "address" | "and" | "balance" | "blockhash" | "byte" | "call" | -"callcode" | "calldatacopy" | "calldataload" | "calldatasize" | "caller" | "callvalue" | -"codecopy" | "codesize" | "coinbase" | "create" | "delegatecall" | "difficulty" | -"div" | "dup1" | "dup2" | "dup3" | "dup4" | "dup5" | "dup6" | "dup7" | "dup8" | "dup9" | -"dup10" | "dup11" | "dup12" | "dup13" | "dup14" | "dup15" | "dup16" | "eq" | "exp" | -"extcodecopy" | "extcodesize" | "gas" | "gaslimit" | "gasprice" | "gt" | "iszero" | -"jump" | "jumpi" | "log0" | "log1" | "log2" | "log3" | "log4" | "lt" | "mload" | "mod" | -"msize" | "mstore" | "mstore8" | "mul" | "mulmod" | "not" | "number" | "or" | "origin" | -"pc" | "pop" | "return" | "sdiv" | "selfdestruct" | "sgt" | "sha3" | "signextend" | -"sload" | "slt" | "smod" | "sstore" | "stop" | "sub" | "swap1" | "swap2" | "swap3" | -"swap4" | "swap5" | "swap6" | "swap7" | "swap8" | "swap9" | "swap10" | "swap11" | -"swap12" | "swap13" | "swap14" | "swap15" | "swap16" | "timestamp" | "xor"`` - -TODO: Define functional instruction, label, assignment, functional assignment, -variable declaration, ... +The following assembly will be generated:: + + { + mstore(0x40, 0x60) // store the "free memory pointer" + // function dispatcher + switch div(calldataload(0), exp(2, 226)) + case 0xb3de648b: { + let (r,) = f(calldataload(4)) + let ret := $allocate(0x20) + mstore(ret, r) + return(ret, 0x20) + } + default: { jump(invalidJumpLabel) } + // memory allocator + function $allocate(size) -> (pos) { + pos := mload(0x40) + mstore(0x40, add(pos, size)) + } + // the contract function + function f(x) -> (y) { + y := 1 + for { let i := 0 } lt(i, x) { i := add(i, 1) } { + y := mul(2, y) + } + } + } +After the desugaring phase it looks as follows:: + + { + mstore(0x40, 0x60) + { + let $0 := div(calldataload(0), exp(2, 226)) + jumpi($case1, eq($0, 0xb3de648b)) + jump($caseDefault) + $case1: + { + // the function call - we put return label and arguments on the stack + $ret1 calldataload(4) jump($fun_f) + $ret1 [r]: // a label with a [...]-annotation resets the stack height + // to "current block + number of local variables". It also + // introduces a variable, r: + // r is at top of stack, $0 is below (from enclosing block) + $ret2 0x20 jump($fun_allocate) + $ret2 [ret]: // stack here: $0, r, ret (top) + mstore(ret, r) + return(ret, 0x20) + // although it is useless, the jump is automatically inserted, + // since the desugaring process does not analyze control-flow + jump($endswitch) + } + $caseDefault: + { + jump(invalidJumpLabel) + jump($endswitch) + } + $endswitch: + } + jump($afterFunction) + $fun_allocate: + { + $start[$retpos, size]: + let pos := 0 + { + pos := mload(0x40) + mstore(0x40, add(pos, size)) + } + swap1 pop swap1 jump + } + $fun_f: + { + start [$retpos, x]: + let y := 0 + { + let i := 0 + $for_begin: + jumpi($for_end, iszero(lt(i, x))) + { + y := mul(2, y) + } + $for_continue: + { i := add(i, 1) } + jump($for_begin) + $for_end: + } // Here, a pop instruction is inserted for i + swap1 pop swap1 jump + } + $afterFunction: + stop + } Syntax ------ @@ -159,6 +290,8 @@ In the following, ``mem[a...b)`` signifies the bytes of memory starting at posit The opcodes ``pushi`` and ``jumpdest`` cannot be used directly. +In the grammar, opcodes are represented as pre-defined identifiers. + +-------------------------+------+-----------------------------------------------------------------+ | stop + `-` | stop execution, identical to return(0,0) | +-------------------------+------+-----------------------------------------------------------------+ @@ -508,3 +641,220 @@ first slot of the array and then only the array elements follow. Statically-sized memory arrays do not have a length field, but it will be added soon to allow better convertibility between statically- and dynamically-sized arrays, so please do not rely on that. + + +Specification +============= + +Assembly happens in four stages: + +1. Parsing +2. Desugaring (removes switch, for and functions) +3. Opcode stream generation +4. Bytecode generation + + +Parsing / Grammar +----------------- + +The tasks of the parser are the following: + +- Turn the byte stream into a token stream, discarding C++-style comments + (a special comment exists for source references, but we will not explain it here). +- Turn the token stream into an AST according to the grammar below +- Register identifiers with the block they are defined in (annotation to the + AST node) and note from which point on, variables can be accessed. + +The assembly lexer follows the one defined by Solidity itself. + +Whitespace is used to delimit tokens and it consists of the characters +Space, Tab and Linefeed. Comments are regular JavaScript/C++ comments and +are interpreted in the same way as Whitespace. + +Grammar:: + + AssemblyBlock = '{' AssemblyItem* '}' + AssemblyItem = + Identifier | + AssemblyBlock | + FunctionalAssemblyExpression | + AssemblyLocalDefinition | + FunctionalAssemblyAssignment | + AssemblyAssignment | + LabelDefinition | + AssemblySwitch | + AssemblyFunctionDefinition | + AssemblyFor | + 'break' | 'continue' | + SubAssembly | 'dataSize' '(' Identifier ')' | + LinkerSymbol | + 'errorLabel' | 'bytecodeSize' | + NumberLiteral | StringLiteral | HexLiteral + Identifier = [a-zA-Z_$] [a-zA-Z_0-9]* + FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')' + AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression + FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression + IdentifierOrList = Identifier | '(' IdentifierList ')' + IdentifierList = Identifier ( ',' Identifier)* + AssemblyAssignment = '=:' Identifier + LabelDefinition = Identifier ( '[' ( IdentifierList | NumberLiteral ) ']' )? ':' + AssemblySwitch = 'switch' FunctionalAssemblyExpression AssemblyCase* + ( 'default' ':' AssemblyBlock )? + AssemblyCase = 'case' FunctionalAssemblyExpression ':' AssemblyBlock + AssemblyFunctionDefinition = 'function' Identifier '(' IdentifierList? ')' '->' + ( '(' IdentifierList ')' AssemblyBlock + AssemblyFor = 'for' ( AssemblyBlock | FunctionalAssemblyExpression) + FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock + SubAssembly = 'assembly' Identifier AssemblyBlock + LinkerSymbol = 'linkerSymbol' '(' StringLiteral ')' + NumberLiteral = HexNumber | DecimalNumber + HexLiteral = 'hex' ('"' ([0-9a-fA-F]{2})* '"' | '\'' ([0-9a-fA-F]{2})* '\'') + StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"' + HexNumber = '0x' [0-9a-fA-F]+ + DecimalNumber = [0-9]+ + + +Desugaring +---------- + +An AST transformation removes for, switch and function constructs. The result +is still parseable by the same parser, but it will not use certain constructs. +If jumpdests are added that are only jumped to and not continued at, information +about the stack content is added, unless no local variables of outer scopes are +accessed or the stack height is the same as for the previous instruction. + +Pseudocode:: + + desugar item: AST -> AST = + match item { + AssemblyFunctionDefinition('function' name '(' arg1, ..., argn ')' '->' ( '(' ret1, ..., retm ')' body) -> + : + { + $_start [$retPC, $argn, ..., arg1]: + let ret1 := 0 ... let retm := 0 + { desugar(body) } + swap and pop items so that only ret1, ... retn, $retPC are left on the stack + jump + } + AssemblyFor('for' { init } condition post body) -> + { + init // cannot be its own block because we want variable scope to extend into the body + // find I such that there are no labels $forI_* + $forI_begin: + jumpi($forI_end, iszero(condition)) + { body } + $forI_continue: + { post } + jump($forI_begin) + $forI_end: + } + 'break' -> + { + // find nearest enclosing scope with label $forI_end + pop all local variables that are defined at the current point + but not at $forI_end + jump($forI_end) + } + 'continue' -> + { + // find nearest enclosing scope with label $forI_continue + pop all local variables that are defined at the current point + but not at $forI_continue + jump($forI_continue) + } + AssemblySwitch(switch condition cases ( default: defaultBlock )? ) -> + { + // find I such that there is no $switchI* label or variable + let $switchI_value := condition + for each of cases match { + case val: -> jumpi($switchI_caseJ, eq($switchI_value, val)) + } + if default block present: -> + { defaultBlock jump($switchI_end) } + for each of cases match { + case val: { body } -> $switchI_caseJ: { body jump($switchI_end) } + } + $switchI_end: + } + FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) -> + { + if identifier is function with n args and m ret values -> + { + // find I such that $funcallI_* does not exist + $funcallI_return argn ... arg2 arg1 jump() + if the current context is `let (id1, ..., idm) := f(...)` -> + $funcallI_return [id1, ..., idm]: + else -> + $funcallI_return[m - n - 1]: + turn the functional expression that leads to the function call + into a statement stream + } + else -> desugar(children of node) + } + default node -> + desugar(children of node) + } + +Opcode Stream Generation +------------------------ + +During opcode stream generation, we keep track of the current stack height, +so that accessing stack variables by name is possible. + +Pseudocode:: + + codegen item: AST -> opcode_stream = + match item { + AssemblyBlock({ items }) -> + join(codegen(item) for item in items) + if last generated opcode has continuing control flow: + POP for all local variables registered at the block (including variables + introduced by labels) + warn if the stack height at this point is not the same as at the start of the block + Identifier(id) -> + lookup id in the syntactic stack of blocks + match type of id + Local Variable -> + DUPi where i = 1 + stack_height - stack_height_of_identifier(id) + Label -> + // reference to be resolved during bytecode generation + PUSH + SubAssembly -> + PUSH + FunctionalAssemblyExpression(id ( arguments ) ) -> + join(codegen(arg) for arg in arguments.reversed()) + id (which has to be an opcode, might be a function name later) + AssemblyLocalDefinition(let (id1, ..., idn) := expr) -> + register identifiers id1, ..., idn as locals in current block at current stack height + codegen(expr) - assert that expr returns n items to the stack + FunctionalAssemblyAssignment((id1, ..., idn) := expr) -> + lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables + codegen(expr) + for j = n, ..., i: + SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj) + POP + AssemblyAssignment(=: id) -> + look up id in the syntactic stack of blocks, assert that it is a variable + SWAPi where i = 1 + stack_height - stack_height_of_identifier(id) + POP + LabelDefinition(name [id1, ..., idn] :) -> + JUMPDEST + // register new variables id1, ..., idn and set the stack height to + // stack_height_at_block_start + number_of_local_variables + LabelDefinition(name [number] :) -> + JUMPDEST + // adjust stack height by +number (can be negative) + NumberLiteral(num) -> + PUSH + HexLiteral(lit) -> + PUSH32 + StringLiteral(lit) -> + PUSH32 + SubAssembly(assembly block) -> + append codegen(block) at the end of the code + dataSize() -> + assert that is a subassembly -> + PUSH32> + linkerSymbol() -> + PUSH32 and append position to linker table + } -- cgit