Let's Make A Silly JSON-like Parser
Hey there everybody! The tl;dr: we are going to make a silly parser for a JSON-like syntax. I call it silly because the main aim is to write one purely for fun, without focusing on blazing efficiency. It is JSON-like because the syntax closely resembles JSON but isn't exactly JSON. For the sake of brevity, though, we drop the "-like" suffix and simply call it JSON.
Understanding Our Data
Before writing our parser, we must know what our data is and what it looks like. If you know JSON, there won't be many surprises.
- Our notation supports numbers, strings, lists, JSONs and `null`.
- The numbers can be positive or negative. Both integers and floats are supported, but only in base 10 representation. Examples include `123`, `2.17`, `-3.14`, `+20.2`.
- Strings must start and end with double quotes, like `"Alice"`, `"Hello, World"`.
- A list consists of comma separated elements within square brackets `[ ]`. The elements can be heterogeneous, so `[1, 2.3, "hello", [0, 10], null]` is valid.
- JSON is enclosed inside curly braces `{ }`. It contains key-value pairs of the syntax `key: value`. A colon `:` separates a key from its value. Keys must be strings. A value can be of any valid type. The key-value pairs are separated by commas. Examples include `{"name": "Joe", "age": 25, "numbers": [10, 20, 30]}`.
- `null` is included as it is. Fully lowercase and no quotes.
- Space can be used liberally. For numbers, there can be multiple spaces between the sign and the value, for example `+   19.27` (we will see why later). Here space means any whitespace character, including newline, tab etc.
Phew! With the above seven rules, an example data could be as follows.
{
    "key": "value",
    "number": -3.14,
    "list": [10, 20, 30],
    "json": {
        "key": null
    }
}
Jason - The Parser
Meet Jason! It is the name of our JSON parser. We will use Python to create it. So let's begin by creating a file called `jason.py` and call it a day! Please don't name it `json.py`, as that conflicts with Python's standard module `json`.
Our parser will consist of tiny functions, each meant to extract some value. After parsing, we get a Python object that matches the data we parsed.
# | JSON Type | Python Type |
---|---|---|
1. | List | list |
2. | JSON | dict |
3. | Number | float or int |
4. | String | str |
5. | null | None |
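For comparison, Python's standard `json` module maps real JSON onto the same Python types. The sketch below uses that standard module (not Jason, which we have not written yet) just to show what kind of objects we are aiming for. The sample text is made up for illustration.

```python
import json

# Parse a small document with the standard library to see the target types.
sample = '{"list": [1, 2.5], "text": "hi", "nothing": null}'
parsed = json.loads(sample)

print(type(parsed).__name__)             # dict
print(type(parsed["list"]).__name__)     # list
print(type(parsed["list"][0]).__name__)  # int
print(type(parsed["list"][1]).__name__)  # float
print(type(parsed["text"]).__name__)     # str
print(parsed["nothing"] is None)         # True
```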
The Preparation
Our data is a string, that is, text. It can come from two places: either a file or a Python `str`. We won't read the data in one go; we will read it character by character. This presents some problems:
- We need a short and standard API to read both strings and files. Python strings are sequences, which means we can access individual characters by their indices. This syntax is short, just two square brackets with a number inside.

      name = "Joe"
      print(name[1])  # o

- But when we try to read an index beyond the string's length, we get an `IndexError`.

      name = "Joe"
      print(name[100])  # IndexError: string index out of range

  Wrapping every index access inside a `try except` block is too verbose. So we need a workaround for this noise.

For the first problem, we will create a reader that takes an index and returns the character at it. For the second problem, we will create a special class `EOF` (end of file), which is returned whenever we try to read beyond the length.
This is what our `EOF` looks like:
class EOF:
    pass
The abstract class (or template) for our reader is:
class Data:
    def __getitem__(self, i: int):
        raise NotImplementedError
Here `__getitem__` is a magic method. It allows us to access the data at an index via the `data[i]` syntax. Our `Data` class is abstract, which is why it raises `NotImplementedError`.
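To see the magic method in isolation, here is a tiny sketch with a made-up class `Squares` (it is not part of Jason). It only shows that the square bracket syntax ends up calling `__getitem__`.

```python
class Squares:
    """Toy class: indexing with [] calls __getitem__ behind the scenes."""

    def __getitem__(self, i: int):
        return i * i


squares = Squares()
print(squares[4])              # 16
print(squares.__getitem__(4))  # 16, the same call written out by hand
```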
Let’s write the concrete classes now. For a string data:
class StringData(Data):
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            # reading beyond the length gives EOF instead of an exception
            return EOF
We store the data in a private variable `_txt`. See how `__getitem__` has been overridden with new code. We wrap the index access inside a `try except` and return `EOF` if we get an `IndexError`.
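To show what the `EOF` marker buys us, here is a small sketch that reads a whole string through the reader without a single `try except` at the call site. The helper `read_all` is made up for this illustration, and the two classes are repeated so the snippet runs on its own.

```python
class EOF:
    pass


class StringData:
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            return EOF


def read_all(data) -> str:
    """Hypothetical helper: collect characters until the reader returns EOF."""
    chars = []
    pos = 0
    while data[pos] is not EOF:
        chars.append(data[pos])
        pos += 1
    return "".join(chars)


print(read_all(StringData("Joe")))  # Joe
```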
For data from file:
from typing import IO

class FileData(Data):
    def __init__(self, fp: IO):
        self._fp = fp

    def __getitem__(self, i: int):
        self._fp.seek(i, 0)
        char = self._fp.read(1)
        if char == "":
            return EOF
        return char
On the first line, we import `IO` from `typing` to set a type hint for our file object. Our class takes a file pointer and stores it in `_fp`. To get the character at an index, we first seek to that index via `fp.seek`. The first argument `i` is the position and the second argument `0` means that the position `i` is counted from the beginning of the file.
We have to seek because Python automatically moves the file cursor when we read data, so two consecutive read calls return consecutive pieces of the file. Also, unlike indexing a `str`, reading a file beyond its length returns an empty string. To be consistent with our reader API, we check for an empty string and return `EOF` in its place.
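The cursor behaviour is easy to try out. In this sketch `io.StringIO` stands in for a real text file (an assumption made only so the snippet needs no file on disk); the variable names are made up for illustration.

```python
import io

# io.StringIO behaves like an opened text file kept in memory.
fp = io.StringIO("abc")

first = fp.read(1)     # 'a', the cursor moves to position 1
second = fp.read(1)    # 'b', the cursor moves to position 2

fp.seek(0, 0)          # jump back to the beginning
again = fp.read(1)     # 'a' once more

fp.seek(10, 0)         # seek past the end of the data
past_end = fp.read(1)  # '' (empty string), no exception is raised

print(first, second, again, repr(past_end))
```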
Putting this all together, our `jason.py` should look like this:
from typing import IO


class EOF:
    pass


class Data:
    def __getitem__(self, i: int):
        raise NotImplementedError


class StringData(Data):
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            return EOF


class FileData(Data):
    def __init__(self, fp: IO):
        self._fp = fp

    def __getitem__(self, i: int):
        self._fp.seek(i, 0)
        char = self._fp.read(1)
        if char == "":
            return EOF
        return char
Let's do a little testing of our API. You need to launch an interpreter and load the module in it. Assuming the module is in your terminal's current directory, you can simply run `python -i jason.py`. (On some systems it may be `python3` or `py` or `py3`.)
~/Scratch ❯ python3 -i jason.py
>>> string = "Hello World!"
>>> data = StringData(string)
>>> data[0]
'H'
>>> data[5]
' '
>>> data[7]
'o'
>>> data[2]
'l'
>>> data[100] # beyond length
<class '__main__.EOF'>
>>> data[100] == EOF
True
>>> fp = open("test.json", "w") # prepare a sample
>>> fp.write('"Hello"') # write data in it
7
>>> fp.close() # save the sample
>>> fp = open("test.json") # open it back for reading
>>> data = FileData(fp)
>>> data[0]
'"'
>>> data[1]
'H'
>>> data[2]
'e'
>>> data[10]
<class '__main__.EOF'>
>>> data[10] == EOF
True
Yay! Our reader is ready. Now we are moving to the core, where we extract the values.
Defining Constants
Our syntax has a few literals like comma, colon, quotes etc. Let us define them as constants.
COMMA = ","
JSON_START = "{"
JSON_END = "}"
JSON_SEP = ":"
LIST_START = "["
LIST_END = "]"
PERIOD = "."
MINUS = "-"
PLUS = "+"
NULL = "null"
STRING_START = STRING_END = '"'
TERMINALS = (EOF, COMMA, LIST_END, JSON_END)
Most of the constants should be self-explanatory. `TERMINALS` is a tuple of literals that indicate the end of a value. For example, when parsing data like `123,`, we know the number stops at the comma, hence it is a terminal.
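Since `TERMINALS` mixes a class (`EOF`) with plain strings, it may help to see that a simple `in` check works for both. This is a standalone sketch that repeats only the constants it needs.

```python
class EOF:
    pass


COMMA = ","
LIST_END = "]"
JSON_END = "}"
TERMINALS = (EOF, COMMA, LIST_END, JSON_END)

print("," in TERMINALS)  # True
print(EOF in TERMINALS)  # True
print("7" in TERMINALS)  # False
print(" " in TERMINALS)  # False, whitespace is not a terminal
```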
Our Functions
We will write separate functions to extract different values. Each of these functions takes two arguments: a `Data` object `data` and an integer `pos`. The function has to read `data` starting from position `pos` and extract the value from it.
These functions read until they meet a terminal character. Then they convert the string they read to the appropriate data type and return it along with the position they have reached. This position is used by the next function to know where the next value should be read from.
It could be a little confusing, but once you see it in practice, I hope it will be clear.
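Here is the "return the value and the new position" idea in miniature. The helper `take_digits` is made up for this sketch and works on a plain `str`; Jason's real functions follow later, but they pass positions along in the same way.

```python
def take_digits(text: str, pos: int):
    """Hypothetical helper: read digits starting at pos, return (value, new_pos)."""
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    return int(text[start:pos]), pos


text = "12,345"
first, pos = take_digits(text, 0)     # reads "12" and stops at the comma
pos += 1                              # step over the comma
second, pos = take_digits(text, pos)  # continues where the first call stopped

print(first, second, pos)  # 12 345 6
```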
Extracting Value
JSON can be parsed recursively. We start with making an entrypoint function that matches the first character and calls the appropriate extraction function.
def extract_value(data: Data, pos: int):
    char = data[pos]
    if char == EOF:
        return "", pos
    if char == JSON_START:
        val, pos = extract_json(data, pos)
    elif char == LIST_START:
        val, pos = extract_list(data, pos)
    elif NULL.startswith(char):
        val, pos = extract_null(data, pos)
    elif char.isdigit() or char in (MINUS, PLUS):
        val, pos = extract_number(data, pos)
    elif char == STRING_START:
        val, pos = extract_string(data, pos)
    else:
        raise ValueError(f"unexpected character: {repr(char)}")
    return val, pos
The first `if` checks whether the data has any character left. If we get `EOF`, it means either the data is empty or it has already been read completely. In that case, we return an empty string and an unmodified position.
The subsequent `if`s match the character with a known data type and, on success, call the respective extract function. Finally, the `else` is there as a fallback. It raises a `ValueError`, because that character doesn't match our syntax guide.
The `repr` function is used to print the character's representation. We use `repr` instead of directly printing the character to help in case of non-printable characters.
>>> char = "\t"
>>> print(f"Oops what character is this {char}")
Oops what character is this
>>> print(f"Oops what character is this {repr(char)}")
Oops what character is this '\t'
See how the first `print` prints a tab, which is quite confusing as it is not visible. The second `print`, however, clearly prints the escape sequence.
Skipping Space
As said earlier, we are going to make space insignificant in our syntax. So any space character (unless it is within quotes) is gently ignored. We now create a function that does exactly that.
def skip_space(data: Data, pos: int):
    char = data[pos]
    while char != EOF and char.isspace():
        pos += 1
        char = data[pos]
    return pos
This function reads the data character by character and stops only when it encounters `EOF` or a non-space character. It then returns the position where it stopped.
>>> skip_space(StringData("    1"), 0)
4
>>> skip_space(StringData("    1    "), 0)
4
>>> skip_space(StringData("\t\n X"), 0)
3
Our function cares only about the leading space.
Extracting Number
The logic for extracting a number is as follows:
- If the number starts with plus, we ignore it. For minus, we add it to our string buffer.
- We then skip the space (continue reading to know why).
- We use `is_float` as a flag to know if the number is a float.
- Characters are read from the data as long as they are not a terminal.
  - If the character is a digit, we add it to our buffer.
  - If the character is a period (`.`), we check if `is_float` is `True`.
    - If it is `True`, it means we already found a period in the number. It is an error, so we raise an exception.
    - In case `is_float` is `False`, we set it to `True` and append the period to our buffer.
  - If the character is a space, we skip all of them. After skipping, we check if the non-space character is a terminal; if not, we raise an exception.
  - The last `else` raises an exception as the character is neither a digit nor a period nor a space.
- Finally, we convert the buffer to a float or an integer depending upon the value of `is_float`.

So the function looks like this:
def extract_number(data: Data, pos: int):
    if data[pos] in (MINUS, PLUS):
        num = "-" if data[pos] == MINUS else ""
        pos += 1
        pos = skip_space(data, pos)
    else:
        num = ""
    is_float = False
    char = data[pos]
    while char not in TERMINALS:
        if char.isdigit():
            num += char
            pos += 1
            char = data[pos]
        elif char == PERIOD:
            if is_float:
                raise ValueError("extra '.' found in floating point number")
            is_float = True
            num += PERIOD
            pos += 1
            char = data[pos]
        elif char.isspace():
            # only a terminal may follow the trailing space
            pos = skip_space(data, pos)
            char = data[pos]
            if char not in TERMINALS:
                raise ValueError("invalid syntax for number")
        else:
            raise ValueError(f"unexpected character in number: {repr(char)}")
    if is_float:
        return float(num), pos
    else:
        return int(num), pos
First of all, we allow space after the signs just for the visual sake. This allows us to create lists like this:
[
    + 10,
    - 15,
    +  3,
    +  5,
]
Not really that great, but yea, it exists.
We don't allow multiple periods, to prevent numbers like `12.34.56`, which is not valid at all. Similarly, after skipping space we check that the next character is a terminal. For example, consider the data `12.34 13`. What does that even mean? If you think it should be parsed as `12.3413`, then what about data like `12.34 "hello"` or `12.34 [10, 20]`? To avoid any such ambiguity, we don't allow any non-terminal character to follow a number.
Now let us check our function with some values.
>>> extract_number(StringData('12'), 0)
(12, 2)
>>> extract_number(StringData('10.0'), 0)
(10.0, 4)
>>> extract_number(StringData('-2.71'), 0)
(-2.71, 5)
>>> extract_number(StringData('-   3.14  '), 0)
(-3.14, 10)
>>> extract_number(StringData('10.20.30'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 172, in extract_number
raise ValueError("extra '.' found in floating point number")
ValueError: extra '.' found in floating point number
>>> extract_number(StringData('10.0 20'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 181, in extract_number
raise ValueError("invalid syntax for number")
ValueError: invalid syntax for number
>>> extract_number(StringData('Not a number'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 183, in extract_number
raise ValueError(f"unexpected character in number: {repr(char)}")
ValueError: unexpected character in number: 'N'
Extracting String
Compared to numbers, strings are easy. We start at the opening double quote and append every character we get to the buffer until we meet either a double quote or `EOF`.
def extract_string(data: Data, pos: int):
    string = ""
    pos += 1
    char = data[pos]
    while char not in (STRING_END, EOF):
        string += char
        pos += 1
        char = data[pos]
    if char == EOF:
        raise ValueError(f"string terminated without: {STRING_END}")
    pos += 1
    pos = skip_space(data, pos)
    return string, pos
The first `pos += 1` skips the opening double quote. From there we read until we get `EOF` or a double quote. Once our while loop completes, we check if the last character is `EOF`. If yes, it means that our string has not been properly closed with a double quote, so we raise an exception.
Otherwise, we do another `pos += 1` to go past the closing quote. Then we skip the space following the string and return the extracted value.
Let’s test our function to make sure it works well.
>>> extract_string(StringData('"Hello, World!"'), 0)
('Hello, World!', 15)
>>> extract_string(StringData('"Hello, World!"         '), 0)
('Hello, World!', 24)
>>> extract_string(StringData('"Hello, World! "           '), 0)
('Hello, World! ', 27)
>>> extract_string(StringData('"New \n Line"'), 0)
('New \n Line', 12)
>>> extract_string(StringData('"Not closed'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 202, in extract_string
raise ValueError(f"string terminated without: {STRING_END}")
ValueError: string terminated without: "
Extracting List
Similar to strings, lists are enclosed by delimiters, this time square brackets. But in addition, they contain commas inside. Since our functions are already made with terminals in mind, we just call `extract_value` whenever we find a character that is not a comma or space.
def extract_list(data: Data, pos: int):
    list = []
    pos += 1
    char = data[pos]
    found = False
    while char not in (LIST_END, EOF):
        if char == COMMA:
            if not found:
                raise ValueError("empty comma in array")
            found = False
            pos += 1
        elif char.isspace():
            pos = skip_space(data, pos)
        else:
            val, pos = extract_value(data, pos)
            list.append(val)
            found = True
        char = data[pos]
    if char == EOF:
        raise ValueError(f"list terminated without {LIST_END}")
    pos += 1
    pos = skip_space(data, pos)
    return list, pos
The flag `found` is used to avoid empty commas like `[1, ,, 20]`. We set it to `True` when we append some value, so that the function knows that the comma was indeed used with a value before it. When we meet a comma, we reset it to `False`.
We call `skip_space` when we find a space, as spaces can be ignored inside a list.
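The `found` flag can be studied on its own. The sketch below applies the same rule to a list of already separated tokens; `check_commas` is a made-up helper for illustration and is not part of Jason.

```python
def check_commas(tokens):
    """Hypothetical checker: True if every comma has a value before it."""
    found = False
    for token in tokens:
        if token == ",":
            if not found:
                return False  # a comma with no value before it
            found = False
        else:
            found = True
    return True


print(check_commas(["1", ",", "2"]))        # True
print(check_commas(["1", ",", ",", "20"]))  # False
print(check_commas([",", "1"]))             # False
print(check_commas(["1", ","]))             # True, a trailing comma passes
```

Note that, with this rule, a single trailing comma after the last value is accepted.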
Time to check our function with some values.
>>> extract_list(StringData('[1, 2, 3]'), 0)
([1, 2, 3], 9)
>>> extract_list(StringData('[]'), 0)
([], 2)
>>> extract_list(StringData('[  ]'), 0)
([], 4)
>>> extract_list(StringData('["Hello", 12.3, -3, [1, 2 ]]'), 0)
(['Hello', 12.3, -3, [1, 2]], 28)
>>> extract_list(StringData('[1, 2,, 4]'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 115, in extract_list
raise ValueError("empty comma in array")
ValueError: empty comma in array
>>> extract_list(StringData('[1, 2'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 128, in extract_list
raise ValueError(f"list terminated without {LIST_END}")
ValueError: list terminated without ]
Extracting JSON
Luckily, extracting JSON is much the same as extracting lists, with the addition of extracting both a key and a value. A key and its value are separated by a colon, and there can be any amount of space between them. As we mentioned in our spec, keys can only be strings. So anything else as a key raises an exception.
def extract_json(data: Data, pos: int):
    json = {}
    pos += 1
    char = data[pos]
    found = False
    while char not in (JSON_END, EOF):
        if char == COMMA:
            if not found:
                raise ValueError("empty comma in JSON")
            found = False
            pos += 1
        elif char == STRING_START:
            key, pos = extract_key(data, pos)
            val, pos = extract_value(data, pos)
            json[key] = val
            found = True
        elif char.isspace():
            pos = skip_space(data, pos)
        else:
            raise ValueError("expected key")
        char = data[pos]
    if char == EOF:
        raise ValueError(f"JSON terminated without {JSON_END}")
    pos += 1
    pos = skip_space(data, pos)
    return json, pos
Here we check for a double quote and, if found, we extract the key from it. Then we extract the value. The `extract_key` function is given below.
def extract_key(data: Data, pos: int):
    key, pos = extract_string(data, pos)
    char = data[pos]
    if char != JSON_SEP:
        raise ValueError(f"expected {JSON_SEP} but found: {repr(char)}")
    pos += 1
    pos = skip_space(data, pos)
    return key, pos
The logic to extract a key is to first extract the string. Then we check if the string is followed by a colon. If not, we raise an exception. Remember that `extract_string` already skips the spaces following the quotes, so data like `{"hello"    : "space before me"}` is extracted properly.
Let’s now test these functions.
>>> extract_json(StringData('{"key": 10,}'), 0)
({'key': 10}, 12)
>>> extract_json(StringData('{   "key"   :   10    ,}'), 0)
({'key': 10}, 24)
>>> extract_json(StringData('{}'), 0)
({}, 2)
>>> extract_json(StringData('{"key 1": 10, "key 2": { "key 3": 20}}'), 0)
({'key 1': 10, 'key 2': {'key 3': 20}}, 38)
>>> extract_json(StringData('{10: "key not string"}'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 80, in extract_json
raise ValueError("expected key")
ValueError: expected key
>>> extract_json(StringData('{"key" = "not colon"}'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 73, in extract_json
key, pos = extract_key(data, pos)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/arun-mani-j/Scratch/jason.py", line 98, in extract_key
raise ValueError(f"expected {JSON_SEP} but found: {repr(char)}")
ValueError: expected : but found: '='
Extracting null
Finally, let's make the billion dollar mistake by extracting `null`. To do so, we read four characters and check that they equal `null`. We also stop as soon as we read a terminal, so we can quit early. For example, if our data is `nu,2`, then once we find the comma, there is no point in reading the next character (`2`).
def extract_null(data: Data, pos: int):
    word = ""
    char = data[pos]
    count = 0
    while char not in TERMINALS and count != len(NULL):
        word += char
        count += 1
        pos += 1
        char = data[pos]
    if count == len(NULL) and word == NULL:
        pos = skip_space(data, pos)
        return None, pos
    raise ValueError(f"unexpected literal: {repr(word)}")
We used `len(NULL)` instead of four because we don't want to hardcode the value. Otherwise, changing `NULL` would mean the length has to be updated as well.
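If the data were always a plain `str`, the same check could be written with a slice, again using `len(NULL)` rather than a hardcoded four. This is only a sketch under that assumption, with a made-up helper name; our `Data` readers support single indices only, which is why Jason reads character by character instead.

```python
NULL = "null"


def starts_with_null(text: str, pos: int) -> bool:
    """Hypothetical shortcut for plain strings: compare a slice against NULL."""
    return text[pos:pos + len(NULL)] == NULL


print(starts_with_null("null", 0))    # True
print(starts_with_null("[null]", 1))  # True
print(starts_with_null("nu,2", 0))    # False
```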
Now some testing of the function.
>>> extract_null(StringData('null'), 0)
(None, 4)
>>> extract_null(StringData('null  '), 0)
(None, 6)
>>> extract_null(StringData('nu,0'), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 151, in extract_null
raise ValueError(f"unexpected literal: {repr(word)}")
ValueError: unexpected literal: 'nu'
Are We Done Yet?
This is getting a bit lengthier than it should be... Okay, so now that our function `extract_value` is ready, we should be able to extract JSON pretty fine... right? Well yes, but:
- In the tests we conducted so far, we manually passed a `StringData` and a position. To be a good library, we should provide some shortcuts.
- `extract_value` just extracts a value. Yes, only one value. It means, if your data is `"hello"  "they do not know me"`, then our function parses only `"hello"`.

      >>> extract_value(StringData('"hello"  "they do not know me"'), 0)
      ('hello', 9)

  But the given data is invalid.

Let us introduce three more functions to tackle the above problems.
def extract(data: Data):
    val, pos = extract_value(data, 0)
    char = data[pos]
    if char != EOF:
        raise ValueError(f"unexpected character: {repr(char)}")
    return val


def load(fp: IO):
    return extract(FileData(fp))


def loads(txt: str):
    return extract(StringData(txt))
The first function, `extract`, takes a `Data` as argument and extracts the value in it, AND it checks if the character left after the value is `EOF`. If it is not `EOF`, it means some other characters are still there, which is invalid syntax. So, we raise an exception. Since our `extract_*` functions already consume trailing space, the character left must be either `EOF` or a non-space character.
The last two functions, `load` and `loads`, are similar to `json.load` and `json.loads` respectively. `load` takes a file pointer and parses the data from it. `loads` parses the data from a Python `str`.
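For reference, this is how the two standard library functions we are imitating behave. The sketch uses Python's `json` module (not Jason), with `io.StringIO` standing in for an opened file so that nothing has to exist on disk.

```python
import io
import json

text = '{"name": "Joe", "age": 25}'

from_str = json.loads(text)               # parse from a str
from_file = json.load(io.StringIO(text))  # parse from a file-like object

print(from_str)               # {'name': 'Joe', 'age': 25}
print(from_str == from_file)  # True
```

Our `loads` and `load` take their arguments in the same way.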
Are We Done Yet??!!
Yes yes, we are! Now let us test our parser with a sample JSON data.
{
    "hello": 1,
    "another": +2,
    "extra": -3,
    "then": 2.17,
    "but": [1, 2, 3],
    "also": { "1": null, "2": [] },
    "end": "Bye"
}
Let's say this file is saved as `test.json`. Now let us call our parser.
~/Scratch ❯ python3 -i jason.py
>>> fp = open("test.json")
>>> json = load(fp)
>>> fp.close()
>>> from pprint import pprint
>>> pprint(json)
{'also': {'1': None, '2': []},
'another': 2,
'but': [1, 2, 3],
'end': 'Bye',
'extra': -3,
'hello': 1,
'then': 2.17}
Oh yea, it works. We used `pprint` to pretty print the returned dictionary. Let us now check what happens if we pass more than one value as data.
>>> loads('"parse me" "not me"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 247, in loads
return extract(StringData(txt))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arun-mani-j/Scratch/jason.py", line 237, in extract
raise ValueError(f"unexpected character: {repr(char)}")
ValueError: unexpected character: '"'
>>> loads('"parse me" 3.14')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/arun-mani-j/Scratch/jason.py", line 247, in loads
return extract(StringData(txt))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arun-mani-j/Scratch/jason.py", line 237, in extract
raise ValueError(f"unexpected character: {repr(char)}")
ValueError: unexpected character: '3'
Yes! It fails successfully.
Conclusion
We managed to write a parser for our JSON-like syntax, and it works well on everything we tested. Of course, it can be improved a lot. Maybe the entire logic is flawed and we should rewrite it with a different algorithm. However, that is beyond our scope and objective. We wanted a silly parser and now we have got one.
The full source code of `jason.py` is available in my snippets.
Thanks for reading! Please feel free to share your thoughts and suggestions. You can contact me via the links in footer.
Good day!