Tokens
DataAxesFormats.Tokens
—
Module
The only exported functions from this module are `escape_value` and `unescape_value`, which are useful when embedding values into query strings. The rest of the module is documented to give insight into how a query string is broken into `Token`s.
Ideally `Daf` should have used an established parser generator module for parsing queries, making all this unnecessary. However, as of writing this code, Julia doesn't seem to have such a parser generator solution. Therefore, this module provides a simple `tokenize` function with rudimentary pattern matching, which is all we need to parse queries (whose structure is "trivial").
Escaping
DataAxesFormats.Tokens.escape_value
—
Function
escape_value(value::AbstractString)::String
Given some raw `value` (name of an axis, axis entry or property, or a parameter value), which may contain special characters, return an escaped version to be used as a single value `Token`.
We need to consider the following kinds of characters:
- Safe (`is_value_char`) characters include `a`-`z`, `A`-`Z`, `0`-`9`, `_`, `+`, `-`, and `.`, as well as any non-ASCII (that is, Unicode) characters. Any sequence of these characters is considered a single value `Token`. These cover all the common cases (including signed integer and floating point values).
- All other ASCII characters are (at least potentially) special, that is, they may be used to describe an operation.
- Prefixing any character with a `\` allows using it inside a value `Token`. This is useful if some name or value contains a special character. For example, if you have a cell whose name is `ACTG:Plate1`, and you want to access the name of the batch of this specific cell, you will have to write `/ cell = ACTG\:Plate1 : batch`.
The `\` character is also used by Julia inside `"..."` string literals, to escape writing non-printable characters. For example, `"\n"` is a single-character string containing a line break, and therefore `"\\"` is used to write a single `\`. Thus the above example would have to be written as `"cell = ACTG\\:Plate1 : batch"`. This isn't nice.
Luckily, Julia also has `raw"..."` string literals that work similarly to Python's `r"..."` strings (in Julia, `r"..."` is a regular expression, not a string). Inside raw string literals, a `\` is a `\` (unless it precedes a `"`). Therefore the above example could also be written as `raw"/ cell = ACTG\:Plate1 : batch"`, which is more readable.
Back to `escape_value`: it prefixes any special character with a `\`. This is useful if you want to programmatically inject a value, which often happens when using `$(...)` to embed values into a query string. For example, do not write the query `/ $(axis) @ $(property)`, as it is unsafe: any of the embedded variables may contain special characters. Instead, write something like `/ $(escape_value(axis)) @ $(escape_value(property))`.
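The escaping rule described above is simple enough to sketch in a few lines of Julia. This is an illustrative reimplementation, not the package's actual code; the names `is_value_char_sketch` and `escape_value_sketch` are made up here:

```julia
# Illustrative sketch of the escaping rule; not the actual
# DataAxesFormats.Tokens implementation.

# A character is "safe" if it is non-ASCII, alphanumeric, or one of `_+-.`.
is_value_char_sketch(c::Char)::Bool =
    !isascii(c) || c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_+-."

# Prefix every unsafe character with a backslash.
escape_value_sketch(value::AbstractString)::String =
    join(is_value_char_sketch(c) ? string(c) : "\\" * string(c) for c in value)

escape_value_sketch("ACTG:Plate1")  # the string ACTG\:Plate1
escape_value_sketch("batch_1.v2")   # unchanged: all characters are safe
```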
DataAxesFormats.Tokens.unescape_value
—
Function
unescape_value(escaped::AbstractString)::String
Undo `escape_value`; that is, given an `escaped` value with `\` characters escaping special characters, drop the `\` to get back the original string value.
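As an illustrative sketch (a hypothetical helper, not the package's code), unescaping just drops each `\` and keeps the character it protects, so it round-trips with the escaping above:

```julia
# Illustrative sketch: drop every `\` and keep the character it protects.
function unescape_value_sketch(escaped::AbstractString)::String
    chars = Char[]
    escaping = false
    for c in escaped
        if escaping || c != '\\'
            push!(chars, c)
            escaping = false
        else
            escaping = true  # drop the backslash, keep the next character
        end
    end
    return String(chars)
end

unescape_value_sketch("ACTG\\:Plate1")  # "ACTG:Plate1"
```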
DataAxesFormats.Tokens.is_value_char
—
Function
is_value_char(character::Char)::Bool
Return whether a character is safe to use inside a value `Token` (name of an axis, axis entry or property, or a parameter value).
The safe characters are `a`-`z`, `A`-`Z`, `0`-`9`, `_`, `+`, `-`, and `.`, as well as any non-ASCII (that is, Unicode) characters.
DataAxesFormats.Tokens.VALUE_REGEX
—
Constant
VALUE_REGEX = r"^(?:[0-9a-zA-Z_.+-]|[^\x00-\xFF])+"
A sequence of `is_value_char` characters is considered to be a single value `Token`. This set of characters was chosen to allow expressing numbers, Booleans and simple names. Any other (ASCII, non-space) character may in principle be used as an operator (possibly in a future version of the code). Therefore, use `escape_value` to protect any value you embed into the expression.
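For example, matching this regex against the start of an expression grabs the longest run of safe characters and stops at the first special character (the query text below is made up for illustration):

```julia
# VALUE_REGEX as documented above.
value_regex = r"^(?:[0-9a-zA-Z_.+-]|[^\x00-\xFF])+"

m = match(value_regex, "batch_1 : value")
m.match  # "batch_1" - the match stops at the space

match(value_regex, ": batch") === nothing  # true: `:` is not a value character
```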
Encoding
DataAxesFormats.Tokens.encode_expression
—
Function
encode_expression(expr_string::AbstractString)::String
Given an expression string to parse, encode any non-ASCII (that is, Unicode) character, as well as any character escaped by a `\`, such that the result will only use `is_value_char` characters. Every encoded character is replaced by `_XX` using URI encoding, but with the `%` replaced by a `_`, so that an unescaped `%` can still serve as an operator. This in turn means we need to encode `_` itself as `_5F`, and therefore `\_` as `_5C_5F`. Isn't encoding fun?
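The `_XX` scheme can be illustrated with a small helper (hypothetical, handling a single byte only; a real encoder must also deal with multi-byte UTF-8 sequences):

```julia
# Illustrative: encode one byte as `_XX` (URI-style %XX with `_` instead of `%`).
encode_byte(byte::UInt8)::String = "_" * uppercase(string(byte, base = 16, pad = 2))

encode_byte(UInt8('_'))   # "_5F" - `_` itself must be encoded
encode_byte(UInt8('\\'))  # "_5C" - so `\_` becomes "_5C_5F"
```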
DataAxesFormats.Tokens.decode_expression
—
Function
decode_expression(encoded_string::AbstractString)::String
Given the results of `encode_expression`, decode it back to its original form.
Tokenization
DataAxesFormats.Tokens.Token
—
Type
struct Token
is_operator::Bool
value::AbstractString
token_index::Int
first_index::Int
last_index::Int
encoded_string::AbstractString
end
A parsed token of an expression.
We distinguish between "value" tokens and "operator" tokens using `is_operator`. A value token holds the name of an axis, axis entry or property, or a parameter value, while an operator token is used to identify a query operation to perform. In both cases, `value` contains the token string. This string goes through both `decode_expression` and `unescape_value`, so it can be used as-is for value tokens.
We also keep the location (`first_index` .. `last_index`) and the (encoded) expression string, to enable generating friendly error messages. There are no line numbers in locations because in `Daf` we squash our queries into a single line, under the assumption they are "relatively simple". This allows us to simplify the code.
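Because queries are squashed to a single line, a friendly error message can simply underline the offending token. A hypothetical rendering helper (not part of the package) might look like:

```julia
# Hypothetical: underline a token at `first_index` .. `last_index` (1-based,
# inclusive) of the squashed (encoded) expression string.
function underline_token(expression::AbstractString, first_index::Int, last_index::Int)::String
    return expression * "\n" * " "^(first_index - 1) * "^"^(last_index - first_index + 1)
end

print(underline_token("/ cell = ACTG", 10, 13))
# / cell = ACTG
#          ^^^^
```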
DataAxesFormats.Tokens.tokenize
—
Function
tokenize(string::AbstractString, operators::Regex)::Vector{Token}
Given an expression string, convert it into a vector of `Token`s.
We first convert everything that matches the `SPACE_REGEX` into a single space. This squashes the expression into a single line (discarding line breaks and comments), and the squashed expression is used for reporting errors. This is reasonable for dealing with `Daf` queries, which are expected to be "relatively simple".
When tokenizing, we discard the spaces. Anything that matches the `operators` regex is considered to be an operator `Token`. Anything that matches the `VALUE_REGEX` is considered to be a value `Token`. As a special case, `''` is converted to an empty string, which is otherwise impossible to represent (write `\'\'` to prevent this). Anything else is reported as an invalid character.
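The overall loop can be sketched as follows. This is a simplified, hypothetical version: it skips the escaping, encoding, and `''` handling, and only collects the token strings rather than full `Token` records:

```julia
# Simplified sketch of the tokenization loop; the real `tokenize` also
# handles escaping, encoding, `''`, and records token locations.
const VALUE_REGEX_SKETCH = r"^(?:[0-9a-zA-Z_.+-]|[^\x00-\xFF])+"

function tokenize_sketch(string::AbstractString, operators::Regex)::Vector{String}
    tokens = String[]
    index = 1
    while index <= lastindex(string)
        rest = SubString(string, index)
        if (m = match(r"^\s+", rest)) !== nothing
            nothing  # discard spaces
        elseif (m = match(VALUE_REGEX_SKETCH, rest)) !== nothing
            push!(tokens, String(m.match))  # value token
        elseif (m = match(operators, rest)) !== nothing
            push!(tokens, String(m.match))  # operator token
        else
            error("invalid character at index $(index): $(rest[1])")
        end
        index += ncodeunits(m.match)
    end
    return tokens
end

tokenize_sketch("/ cell = ACTG", r"^[/=:@]")  # ["/", "cell", "=", "ACTG"]
```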
DataAxesFormats.Tokens.SPACE_REGEX
—
Constant
SPACE_REGEX = r"(?:[\s\n\r]|#[^\n\r]*(?:[\r\n]|$))+"sm
Optional white space can separate `Token`s. It is required between two consecutive value tokens, but is typically optional around operators. White space includes spaces, tabs, line breaks, and a `# ...` comment suffix of a line.
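For example, replacing every match of this regex with a single space squashes a multi-line, commented query into one line (the query text here is made up for illustration):

```julia
# SPACE_REGEX as documented above: white space runs, including comments.
space_regex = r"(?:[\s\n\r]|#[^\n\r]*(?:[\r\n]|$))+"sm

replace("/ cell  # select the cell axis\n: batch", space_regex => " ")
# "/ cell : batch"
```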
Index
- `DataAxesFormats.Tokens`
- `DataAxesFormats.Tokens.SPACE_REGEX`
- `DataAxesFormats.Tokens.VALUE_REGEX`
- `DataAxesFormats.Tokens.Token`
- `DataAxesFormats.Tokens.decode_expression`
- `DataAxesFormats.Tokens.encode_expression`
- `DataAxesFormats.Tokens.escape_value`
- `DataAxesFormats.Tokens.is_value_char`
- `DataAxesFormats.Tokens.tokenize`
- `DataAxesFormats.Tokens.unescape_value`