This project contains all the logic for processing filtering in EPCC. This document provides an introduction into how this conceptually works.
A problem we often have is that we have a string, and want to classify it as being valid or invalid. For instance here are some things you might care about strings:
- Does a string start with a capital letter.
- Is a string a valid email address.
- Is a string a palindrome (written the same forwards and backwards like redivider, civic, radar, level)
- Is a string a valid Java program.
There are different techniques that can be used to check the above, including writing code directly and using a regular expression. But some things are more complicated, like case #3 and #4.
In the case of the last one, you would likely define a grammar which is a set of rules that say what a valid string looks like. A common technique is to proceed in two phases:
- Lexing - Converting the string into chunks of things called tokens, each token representing a some part of the string (e.g., in a java program you might have a token for each keyword). More details are in the
lexer.js
file - Parsing - Write rules for checking the output of the lexer. We use a recursive decent parser, which means that each grammar rule calls a function that checks for a match and may recursively call other functions.
The grammar that we are attempting to parse in Augmented Backus-Naur Form (https://www.rfc-editor.org/rfc/rfc5234) is the following:
(Some of the way this is structured is to make the code easier to read)
root = term *( ":" term )
term /= binary_term
term /= unary_term
term /= vararg_term
binary_term = binary_operator "(" literal "," literal ")"
unary_term = unary_operator "(" literal ")"
vararg_term = vararg_operator "(" literal "," literal *( "," literal ) ")"
binary_operator = "eq" / "like" / "gt" / "ge " / "lt" / "le" / "text"
unary_operator = "is_null"
vararg_operator = "in"
; these literal represent standard unencoded literals.
literal = 1*(ALPHA / DIGIT / "$" / "-" / "_" / "*" / "." / " " / "{" / "}" / "@" / "|")
; we treat binary, unary & vararg operators as literals in cases where people want to do something like eq(foo, ne).
literal /= binary_operator
literal /= unary_operator
literal /= vararg_operator
; 0x22 is a quotation mark
literal /= 0x22 *string_characters 0x22
; there is probably a better way to write this in ABNF involving hex like 0x01-0x21 0x23-0x5B 0x5D-0x7E
string_character = any_character but not " or \
string_character /= escape_sequence
escape_sequence = "\" 0x22
A more interested reader is encouraged to read: Domain-Specific Languages, by Fowler, the two techniques we use here are: (Regex Table Lexer p.239-244, Recursive Descent Parser p.245-255).