CSC 250: Final Project Post
This blog post is a bit different in that it actually serves as a part of my final project for theory of computation (CSC 250). You can read about my other spring 2021 courses here. The objective is to explain a real-world result or application of theoretical computer science. In addition to publishing a blog post, we were tasked with creating a 5-minute video to be shared with the class. While the intended audience is my fellow computer science majors, I’ve tried to make the material more broadly accessible. Furthermore, even if you don’t fully understand everything, this post still serves my blog’s general purpose of providing insight into my Smith College experience. Now on to the actual content!
Checking Addressing Modes with Context-Free Grammars
When you hear the word theoretical, you may assume there are limited real-world applications. In actuality, concepts in theoretical computer science can be applied to fields ranging from biology to the arts. Given my career interests in computer engineering, robotics, or software engineering, I was curious to explore results relevant to electrical engineering and/or computer architecture. The research I will be presenting here is the work of Hawraa S. Hamza of Babylon University. The full paper, Automatic system Design for finding the Addressing Modes based on Context Free Grammar, can be read here.
To understand addressing modes, we must start with a quick introduction to assembly code. Assembly code is generated by a compiler from modified source code. (The entire compilation process of a C program involves the preprocessor, the compiler, the assembler, and the linker). Assembly is human-readable text and has a one-to-one correspondence with machine code. When we say that assembly is human-readable, it’s worth pointing out that it’s significantly more difficult to understand than the equivalent code in a high-level language. Pictured below are a simple C program and the corresponding assembly code.
Each line of the assembly code is a single instruction to be performed by the central processing unit (CPU) and includes an op-code, an address (or addresses), and an addressing mode (or modes). The op-code is the specific operation to be performed by the CPU. The address is to the operand of the instruction. The mode indicates the way in which the operand of an instruction is specified. In other words, the mode tells the computer how to modify the address field before the instruction is executed.
Examples of one-operand instructions are increment and decrement which in the x86 instruction set architecture (ISA) are incl and decl respectively. The operand is the destination that is specified by the address and the addressing mode. Examples of two operand instructions in the x86 ISA include addl (add), subl (subtract), and movl (copy). While unimportant here, note that the l stands for long which is a large integer datatype. In these two-operand instructions, the first argument is the source (specified by an address and addressing mode) and the second argument is the destination (also specified by an address and an addressing mode).
The main point here is that addressing modes indicate the way in which an address is specified and tell the computer how to calculate an effective address. This is useful because it reduces the number of bits required in the address field. Recall that a bit is simply a 0 or 1 and is the most basic unit of information in computing. Addressing modes are important to assembly programmers, compiler writers, and computer architects.
We will now consider some specific addressing modes. With some familiarity with computer architecture and systems-level computer science, this should make things a bit clearer. The simplest addressing mode, specified with a dollar sign in the x86 ISA, is immediate and indicates that the operand is supplied directly. As an example, $25 means that the operand’s value is simply the number 25. When the percent symbol is used, it specifies that the operand is to be found in the given register. (Registers are simply the small and fast units of storage found in the CPU. They are for short-term storage of program data and instructions currently in use). For example, %eax means the operand is the value found in register eax. Finally, parentheses around the given register indicate the operand’s value is to be found at the memory location specified in a given registrar. In the assembly code that may look like (%eax). If you are still feeling a bit confused by what an addressing mode is, check out this webpage.
Context-free grammars (CFGs) are formally defined by a 4-tuple (V, Σ, R, S) where V is a finite set of variables, Σ finite is a finite set of terminals (disjoint from V), R is a finite set of rules, and S is the start variable. The rules have the form A → u where a is a variable and u is composed of variables and terminals. The language of a given context-free grammar is the set of strings containing only terminals that can be derived from the start symbol using the rules. Such languages, unsurprisingly called context-free languages, are an important topic in theoretical computer science. Their associated machine is the push-down automata. Once again, here is a helpful webpage if you’re feeling a bit lost.
Checking Addressing Modes with Context-Free Grammars:
With all of that background information under our belts, we are finally ready to tie it all together. The system proposed in the paper included two parts: checking the addressing mode with a context-free grammar (focused on here) and implementing the addressing mode. Here are the rules of the context-free grammar outlined in the paper.
Observe that in stage 4 INST is a variable that is used as a placeholder for the various op-codes and that R is used as a placeholder for the various numbered registers. Likewise, in stage 2 each addressing mode type is mapped to the corresponding structure. Note that to check an addressing mode we work backward from a string of terminals up to the addressing mode type. To better understand this process, take a look at the following example.
As you will see the first step is to replace MOV with INST. Next, R1 is replaced by R, the comma is replaced with SC, and the R2 is replaced with R. Moving to stage 3 we map the pattern INST R SC to CONS. Finally, we map CONS [R] with the register indirect addressing mode. (Technically the last step is to say that we successfully mapped to an addressing mode, but what matters is the specific addressing mode). As a proof of concept, the paper shows screenshots of a simple program that performs the addressing mode checking.
In summary, the result presented here was a system for using a context-free grammar to parse the addressing mode from an instruction written in assembly language. The paper succeeded in demonstrating this potential use of CFGs. To fully evaluate this result, I would ideally have access actual proof of concept program and the underlying code. The main area of future research is implementing a more robust error detection system that is capable of determining the type of error and suggest ways to fix it.
Through reading the research paper and preparing my video and this blog post, I gained a better understanding of the way in which various aspects of computer science are applicable to one another. Specifically, in addition to this theoretical computer science course, I was concurrently taking an introductory computer systems course at UMass. In that course, I learned the basics of assembly language and computer architecture among other things. (For more about my specific experience taking a Five College course, click here). As I am generally more interested in computer systems than theory, I really enjoyed exploring a computer systems relevant application of theoretical computer science. In short, while this specific result isn’t the most ground-breaking and exciting, studying it has been meaningful in rounding out my semester.
In addition to the research paper, linked again here, my main informational sources were course notes and materials from CSC 250 with R. Jordan Crouser and COMPSCI 230 with Tim Richards. The final informational source, linked again here, was the GeeksforGeeks article on addressing modes.