Arun Mani J

Let's Make A Silly JSON-like Parser

1 March 2023

Hey there, everybody! The tl;dr: we are going to make a silly parser for a JSON-like syntax. I call it silly because the main aim is to write one purely for fun, without worrying about blazing efficiency. And it is JSON-like because the syntax closely resembles JSON but isn’t exactly JSON. For the sake of brevity, though, we will drop the "-like" suffix from here on.

Understanding Our Data

Before writing our parser, we must know what our data is and what it looks like. If you know JSON, there won’t be many surprises.

  1. Our notation supports numbers, strings, lists, JSONs and null.
  2. The numbers can be positive or negative. Both integers and floats are supported but only in base 10 representation. Examples include 123, 2.17, -3.14, +20.2.
  3. Strings must start and end with double quotes, like "Alice", "Hello, World".
  4. A list consists of comma separated elements within square brackets [ ]. The elements can be heterogeneous, so [1, 2.3, "hello", [0, 10], null] is valid.
  5. JSON is enclosed inside curly braces { }. It contains key-value pairs with the syntax key: value, where the colon : separates a key from its value. Keys must be strings; a value can be of any valid type. The key-value pairs are separated by commas. For example, {"name": "Joe", "age": 25, "numbers": [10, 20, 30]}.
  6. null is included as it is. Fully lowercase and no quotes.
  7. Space can be used liberally. For numbers, there can be multiple spaces between the sign and the value, for example + 19.27 (we will see why later). Here space means any whitespace character, including newline, tab etc.

Phew! With the above seven rules, an example data could be as follows.

{
  "key": "value",
  "number": -3.14,
  "list": [10, 20, 30],
  "json": {
    "key": null
  }
}

Jason - The Parser

Meet Jason! It is the name of our JSON parser. We will use Python to create it. So let’s begin with creating a file called jason.py and call it a day! Please don’t name it json.py as it conflicts with Python’s standard module json.

Our parser will consist of tiny functions, each meant to extract some value. After parsing, we get a Python object that matches the data we parsed.

#   JSON Type   Python Type
1.  List        list
2.  JSON        dict
3.  Number      float or int
4.  String      str
5.  null        None
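
As a rough illustration of this mapping: the sample data from earlier also happens to be valid standard JSON, so Python’s built-in json module can show the kind of object Jason should eventually hand back. Note that this uses the standard library as a reference point, not our parser.

```python
import json

sample = """
{
  "key": "value",
  "number": -3.14,
  "list": [10, 20, 30],
  "json": {
    "key": null
  }
}
"""

# The standard-library parser, used here only as a reference point.
expected = json.loads(sample)

print(type(expected).__name__)            # dict
print(type(expected["list"]).__name__)    # list
print(type(expected["number"]).__name__)  # float
print(type(expected["key"]).__name__)     # str
print(expected["json"]["key"])            # None
```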

The Preparation

Our data is a string, that is, text. It can come from two places: a file or a Python str. We won’t read the data in one go; we will read it character by character. This presents some problems:

  1. We need a short and standard API to read both strings and files. Python strings are sequences, which means we can access individual characters by their indices.
    name = "Joe"
    print(name[1]) # o
    
    This syntax is short, just two square brackets and a number inside it.
  2. But, when we try to read an index beyond the string’s length, we get an IndexError.
    name = "Joe"
    print(name[100]) # IndexError: string index out of range
    
    Wrapping every index access inside a try except block is too verbose. So we need a workaround for this noise.

For the first problem, we will create a reader that takes an index and returns the character at that index. For the second problem, we will create a special class EOF (end of file), which is returned whenever we try to read beyond the length.

This is what our EOF looks like:

class EOF:
    pass

The abstract class (or template) for our reader is:

class Data:
    def __getitem__(self, i: int):
        raise NotImplementedError

Here __getitem__ is a magic method. It allows us to access the data at index i via the data[i] syntax. Our Data class is abstract, which is why it raises NotImplementedError.
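
To see the magic method in isolation, here is a tiny sketch. The class Squares is made up purely for this illustration and is not part of jason.py.

```python
class Squares:
    """Hypothetical example class: indexing returns the square of the index."""

    def __getitem__(self, i: int):
        return i * i


squares = Squares()
print(squares[4])              # 16
print(squares.__getitem__(4))  # 16, the same call written out explicitly
```

Python translates squares[4] into squares.__getitem__(4), which is exactly the hook our reader classes rely on.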

Let’s write the concrete classes now. For a string data:

class StringData(Data):
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            return EOF

We store the data in a private variable _txt. See how our __getitem__ has been overridden with new code. We wrap the index access inside a try except and return EOF if the index is out of range.

For data from file:

from typing import IO


class FileData(Data):
    def __init__(self, fp: IO):
        self._fp = fp

    def __getitem__(self, i: int):
        self._fp.seek(i, 0)
        char = self._fp.read(1)
        if char == "":
            return EOF
        return char

On the first line, we import IO from typing to add a type hint for our file object. Our class takes a file pointer and stores it in _fp. To get a character at an index, we first seek to that index via fp.seek. The first argument i is the position, and the second argument 0 means that the position i is counted from the beginning of the file.

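We have to seek because Python automatically moves the file cursor when we read data, so two consecutive read calls return consecutive pieces of the file. Also, unlike str, reading a file beyond its length returns an empty string instead of raising an error. To be consistent with our reader API, we check for an empty string and return EOF in its place. One caveat: for files opened in text mode, Python’s documentation only guarantees seek to work with offsets returned by tell() (or zero), so indexing like this is best kept to simple, plain ASCII files.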
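
The cursor behaviour is easy to observe without touching the disk. The sketch below uses io.StringIO from the standard library as a stand-in for a real file; the variable names are only for illustration.

```python
import io

# io.StringIO is an in-memory text file, used here so no real file is needed.
fp = io.StringIO("Joe")

first = fp.read(1)     # "J", and the cursor moves to index 1
second = fp.read(1)    # "o", because the cursor moved

fp.seek(0, 0)          # jump back to the beginning
again = fp.read(1)     # "J" once more

fp.seek(100, 0)        # far beyond the end of the data
past_end = fp.read(1)  # "" rather than an exception

print(repr(first), repr(second), repr(again), repr(past_end))
```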

Putting this all together, our jason.py should look like this:

from typing import IO


class EOF:
    pass


class Data:
    def __getitem__(self, i: int):
        raise NotImplementedError


class StringData(Data):
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            return EOF


class FileData(Data):
    def __init__(self, fp: IO):
        self._fp = fp

    def __getitem__(self, i: int):
        self._fp.seek(i, 0)
        char = self._fp.read(1)
        if char == "":
            return EOF
        return char

Let’s do a little testing of our API. Launch an interpreter and load the module in it. Assuming your terminal is in the same directory as the module, you can simply run python -i jason.py. (On some systems the command may be python3, py or py3.)

~/Scratch  ❯ python3 -i jason.py
>>> string = "Hello World!"
>>> data = StringData(string)
>>> data[0]
'H'
>>> data[5]
' '
>>> data[7]
'o'
>>> data[2]
'l'
>>> data[100] # beyond length
<class '__main__.EOF'>
>>> data[100] == EOF
True
>>> fp = open("test.json", "w") # prepare a sample
>>> fp.write('"Hello"') # write data in it
7
>>> fp.close() # save the sample
>>> fp = open("test.json") # open it back for reading
>>> data = FileData(fp)
>>> data[0]
'"'
>>> data[1]
'H'
>>> data[2]
'e'
>>> data[10]
<class '__main__.EOF'>
>>> data[10] == EOF
True

Yay! Our reader is ready. Now we are moving to the core, where we extract the values.

Defining Constants

Our syntax has a few literals like comma, colon, quotes etc. Let us define them as constants.

COMMA = ","

JSON_START = "{"
JSON_END = "}"
JSON_SEP = ":"

LIST_START = "["
LIST_END = "]"

PERIOD = "."
MINUS = "-"
PLUS = "+"

NULL = "null"

STRING_START = STRING_END = '"'

TERMINALS = (EOF, COMMA, LIST_END, JSON_END)

Most of the constants should be self-explanatory. TERMINALS is a tuple of literals that indicate the end of a value. For example, when parsing data like [123, 456], we know the first number stops at the comma, hence the comma is a terminal.
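
As a small sketch of how the tuple is meant to be used, the snippet below repeats the relevant constants so it runs on its own. The helper is_terminal is hypothetical and exists only for this illustration.

```python
class EOF:
    pass


COMMA = ","
LIST_END = "]"
JSON_END = "}"
TERMINALS = (EOF, COMMA, LIST_END, JSON_END)


def is_terminal(char) -> bool:
    # Hypothetical helper, only for this illustration.
    return char in TERMINALS


print(is_terminal(","))  # True
print(is_terminal(EOF))  # True
print(is_terminal("7"))  # False
```

Because EOF is a class and the other members are strings, a single membership test covers both the end of the data and the closing punctuation.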

Our Functions

We will write separate functions to extract different values. Each of these functions takes two arguments: a Data object data and an integer pos. The function reads data starting at position pos and extracts a value from it.

These functions read until they meet a terminal character. Then they convert the string they have read to the matching Python type and return it along with the position they have reached. This position tells the next function where the next value should be read from.

It could be a little confusing, but once you see it in practice, I hope it will be clear.
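
Here is a minimal sketch of that convention, using a plain str instead of our Data reader to keep it short. The function take_digits is invented for this example and is not part of Jason.

```python
def take_digits(text: str, pos: int):
    """Hypothetical helper: read digits from `pos`, return (value, new position)."""
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    return int(text[start:pos]), pos


text = "12,345"

first, pos = take_digits(text, 0)     # reads "12" and stops at the comma
print(first, pos)                     # 12 2

pos += 1                              # step over the comma
second, pos = take_digits(text, pos)  # continues where the first call stopped
print(second, pos)                    # 345 6
```

Each call hands back where it stopped, and the caller feeds that position into the next call. Our extract functions are chained in the same way.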

Extracting Value

JSON can be parsed recursively. We start with making an entrypoint function that matches the first character and calls the appropriate extraction function.

def extract_value(data: Data, pos: int):
    char = data[pos]

    if char == EOF:
        return "", pos

    if char == JSON_START:
        val, pos = extract_json(data, pos)
    elif char == LIST_START:
        val, pos = extract_list(data, pos)
    elif NULL.startswith(char):
        val, pos = extract_null(data, pos)
    elif char.isdigit() or char in (MINUS, PLUS):
        val, pos = extract_number(data, pos)
    elif char == STRING_START:
        val, pos = extract_string(data, pos)
    else:
        raise ValueError(f"unexpected character: {repr(char)}")

    return val, pos

The first if checks whether the data has any character left. If we get EOF, it means either the data is empty or all of it has already been read. In that case, we return an empty string and the unmodified position.

The subsequent ifs match the character with a known data type and on success, they call the respective extract function. At last the else is there as fallback, which raises a ValueError, because that character doesn’t match our syntax guide.

The repr function gives us the character’s representation. We use it instead of printing the character directly, which helps when the character is non-printable.

>>> char = "\t"
>>> print(f"Oops what character is this {char}")
Oops what character is this
>>> print(f"Oops what character is this {repr(char)}")
Oops what character is this '\t'

See how the first print prints a tab, which is quite confusing as it is not visible. The second print, however, clearly shows the escape sequence.
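
The dispatch on the first character in extract_value can also be looked at on its own. The sketch below mirrors the order of the checks, but only names the kind of value instead of extracting it; classify is a hypothetical helper for this illustration.

```python
NULL = "null"


def classify(char: str) -> str:
    """Hypothetical helper: name the kind of value that starts with `char`."""
    if char == "{":
        return "json"
    if char == "[":
        return "list"
    if NULL.startswith(char):
        return "null"
    if char.isdigit() or char in ("-", "+"):
        return "number"
    if char == '"':
        return "string"
    raise ValueError(f"unexpected character: {char!r}")


for text in ['{"a": 1}', "[1, 2]", "null", "-3.14", '"hi"']:
    print(text, "->", classify(text[0]))
```

Looking at a single character is enough here because no two value types in our syntax start with the same character.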

Skipping Space

As said earlier, we are going to make space insignificant in our syntax. So any space character (unless it is within quotes) is gently ignored. We now create a function that does exactly that.

def skip_space(data: Data, pos: int):
    char = data[pos]
    while char != EOF and char.isspace():
        pos += 1
        char = data[pos]

    return pos

This function reads the data character by character and stops only when it encounters EOF or a non-space character. After the encounter, it returns the position of the non-space character.

>>> skip_space(StringData("    1"), 0)
4
>>> skip_space(StringData("    1   "), 0)
4
>>> skip_space(StringData("\t\n X"), 0)
3

Our function cares only about the leading space.
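
The tests above always start at position 0, but pos can be anywhere in the data, and "leading" is relative to it. The sketch below repeats EOF, StringData and skip_space so that it runs on its own; the sample string is made up.

```python
class EOF:
    pass


class StringData:
    def __init__(self, txt: str):
        self._txt = txt

    def __getitem__(self, i: int):
        try:
            return self._txt[i]
        except IndexError:
            return EOF


def skip_space(data, pos: int):
    char = data[pos]
    while char != EOF and char.isspace():
        pos += 1
        char = data[pos]
    return pos


data = StringData("ab   cd  ")
print(skip_space(data, 2))  # 5, the index of "c"
print(skip_space(data, 0))  # 0, "a" is not a space so nothing is skipped
print(skip_space(data, 7))  # 9, only spaces remain so we stop at EOF
```

The returned value is always an absolute index into the data, never a count of skipped characters.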

Extracting Number

The logic for extracting a number is as follows:

  1. If the number starts with plus, we ignore it. For minus, we add it to our string buffer.
  2. We then skip the space (continue reading to know why).
  3. We use is_float as flag to know if the number is a float.
  4. Characters are read from the data as long as they are not a terminal.
  5. If the character is a digit, we add it to our buffer.
  6. If the character is a period (.), we check if is_float is True.
  7. If it is True, it means we already found a period in the number. It is an error, so we raise an exception.
  8. In case, is_float is False, we set it to True and append the period to our buffer.
  9. If the character is a space, we skip all of them. After skipping, we check if the non-space character is a terminal, if not we raise an exception.
  10. The last else raises an exception as the character is neither a digit nor period nor space.
  11. Finally, we convert the buffer to a float or an integer depending upon the value of is_float.

So the function looks like this:

def extract_number(data: Data, pos: int):
    if data[pos] in (MINUS, PLUS):
        num = "-" if data[pos] == MINUS else ""
        pos += 1
        pos = skip_space(data, pos)
    else:
        num = ""

    is_float = False
    char = data[pos]

    while char not in TERMINALS:
        if char.isdigit():
            num += char
            pos += 1
            char = data[pos]
        elif char == PERIOD:
            if is_float:
                raise ValueError("extra '.' found in floating point number")
            is_float = True
            num += PERIOD
            pos += 1
            char = data[pos]
        elif char.isspace():
            pos = skip_space(data, pos)
            char = data[pos]
            if char not in TERMINALS:
                raise ValueError("invalid syntax for number")
        else:
            raise ValueError(f"unexpected character in number: {repr(char)}")

    if is_float:
        return float(num), pos
    else:
        return int(num), pos

First of all, we allow space after the signs purely for visual alignment. This allows us to create lists like this:

[
+ 10,
- 15,
+  3,
+  5,
]

Not really that great, but yea, it exists.

We don’t allow multiple periods, to prevent numbers like 12.34.56, which are not valid at all. Similarly, after skipping space we check that the next character is a terminal. For example, consider the data 12.34 13. What does that even mean? If you think it should be parsed as 12.3413, then what about data like 12.34 "hello" or 12.34 [10, 20]? To avoid any such ambiguity, we don’t allow a non-terminal character to follow a number.

Now let us check our function with some values.

>>> extract_number(StringData('12'), 0)
(12, 2)
>>> extract_number(StringData('10.0'), 0)
(10.0, 4)
>>> extract_number(StringData('-2.71'), 0)
(-2.71, 5)
>>> extract_number(StringData('- 3.14    '), 0)
(-3.14, 10)
>>> extract_number(StringData('10.20.30'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 172, in extract_number
    raise ValueError("extra '.' found in floating point number")
ValueError: extra '.' found in floating point number
>>> extract_number(StringData('10.0    20'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 181, in extract_number
    raise ValueError("invalid syntax for number")
ValueError: invalid syntax for number
>>> extract_number(StringData('Not a number'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 183, in extract_number
    raise ValueError(f"unexpected character in number: {repr(char)}")
ValueError: unexpected character in number: 'N'

Extracting String

Compared to numbers, strings are easy. We start at the opening double quote and append every character we get to the buffer until we meet either a closing double quote or EOF.

def extract_string(data: Data, pos: int):
    string = ""
    pos += 1
    char = data[pos]

    while char not in (STRING_END, EOF):
        string += char
        pos += 1
        char = data[pos]

    if char == EOF:
        raise ValueError(f"string terminated without: {STRING_END}")

    pos += 1
    pos = skip_space(data, pos)

    return string, pos

The first pos += 1 skips the double quotes character. From there we read till we get EOF or double quotes. Once our while loop completes, we check if the last character is EOF. If yes, it means that our string has not been properly closed with a double quote, so we raise an exception.

Otherwise, we do another pos += 1 to go past the quotes. Then we skip the space following the string and then return the extracted value.

Let’s test our function to make sure it works well.

>>> extract_string(StringData('"Hello, World!"'), 0)
('Hello, World!', 15)
>>> extract_string(StringData('"Hello, World!"         '), 0)
('Hello, World!', 24)
>>> extract_string(StringData('"Hello, World!   "         '), 0)
('Hello, World!   ', 27)
>>> extract_string(StringData('"New \n Line"'), 0)
('New \n Line', 12)
>>> extract_string(StringData('"Not closed'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 202, in extract_string
    raise ValueError(f"string terminated without: {STRING_END}")
ValueError: string terminated without: "

Extracting List

Just as strings are enclosed in double quotes, lists are enclosed in square brackets. In addition, they contain commas inside. Since our functions are already made with terminals in mind, we just call extract_value whenever we find a character that is not a comma or space.

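def extract_list(data: Data, pos: int):
    items = []  # named items to avoid shadowing the built-in list
    pos += 1  # skip the opening bracket
    char = data[pos]
    found = False  # True once a value has been read since the last comma

    while char not in (LIST_END, EOF):
        if char == COMMA:
            if not found:
                raise ValueError("empty comma in array")
            found = False
            pos += 1
        elif char.isspace():
            pos = skip_space(data, pos)
        else:
            # Two values in a row, like ["a" "b"], are missing a comma.
            if found:
                raise ValueError("expected comma in array")
            val, pos = extract_value(data, pos)
            items.append(val)
            found = True

        char = data[pos]

    if char == EOF:
        raise ValueError(f"list terminated without {LIST_END}")

    pos += 1  # skip the closing bracket
    pos = skip_space(data, pos)

    return items, pos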

The flag found is used to avoid empty commas like [1, ,, 20]. We set it to True when we append some value, so that the function knows that the comma was indeed used with a value before it. When we meet a comma, we again reset it to False.

We call skip_space when we find a space as spaces can be ignored inside a list.

Time to check our function with some values.

>>> extract_list(StringData('[1, 2, 3]'), 0)
([1, 2, 3], 9)
>>> extract_list(StringData('[]'), 0)
([], 2)
>>> extract_list(StringData('[  ]'), 0)
([], 4)
>>> extract_list(StringData('["Hello", 12.3, -3, [1, 2 ]]'), 0)
(['Hello', 12.3, -3, [1, 2]], 28)
>>> extract_list(StringData('[1, 2,, 4]'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 115, in extract_list
    raise ValueError("empty comma in array")
ValueError: empty comma in array
>>> extract_list(StringData('[1, 2'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 128, in extract_list
    raise ValueError(f"list terminated without {LIST_END}")
ValueError: list terminated without ]

Extracting JSON

Luckily, extracting JSON is much the same as extracting a list, with the addition of extracting both a key and a value. A key and its value are separated by a colon, and there can be any amount of space between them. As mentioned in our spec, keys can only be strings, so anything else as a key raises an exception.

def extract_json(data: Data, pos: int):
    json = {}
    pos += 1  # skip the opening brace
    char = data[pos]
    found = False  # True once a pair has been read since the last comma

    while char not in (JSON_END, EOF):
        if char == COMMA:
            if not found:
                raise ValueError("empty comma in JSON")
            found = False
            pos += 1
        elif char == STRING_START:
            # Two pairs in a row, like {"a": "x" "b": "y"}, are missing a comma.
            if found:
                raise ValueError("expected comma in JSON")
            key, pos = extract_key(data, pos)
            val, pos = extract_value(data, pos)
            json[key] = val
            found = True
        elif char.isspace():
            pos = skip_space(data, pos)
        else:
            raise ValueError("expected key")

        char = data[pos]

    if char == EOF:
        raise ValueError(f"JSON terminated without {JSON_END}")

    pos += 1  # skip the closing brace
    pos = skip_space(data, pos)

    return json, pos

Here we check for double quotes and if found, we extract the key from it. And then we extract the value. The extract_key function is given below.

def extract_key(data: Data, pos: int):
    key, pos = extract_string(data, pos)
    char = data[pos]

    if char != JSON_SEP:
        raise ValueError(f"expected {JSON_SEP} but found: {repr(char)}")

    pos += 1
    pos = skip_space(data, pos)

    return key, pos

The logic to extract a key is to first extract the string. Then we check whether the string is followed by a colon. If not, we raise an exception. Remember that extract_string already skips the space following the closing quote, so data like {"hello"    : "space before me"} is extracted properly.

Let’s now test these functions.

>>> extract_json(StringData('{"key": 10,}'), 0)
({'key': 10}, 12)
>>> extract_json(StringData('{   "key"    :     10 ,}'), 0)
({'key': 10}, 24)
>>> extract_json(StringData('{}'), 0)
({}, 2)
>>> extract_json(StringData('{"key 1": 10, "key 2": { "key 3": 20}}'), 0)
({'key 1': 10, 'key 2': {'key 3': 20}}, 38)
>>> extract_json(StringData('{10: "key not string"}'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 80, in extract_json
    raise ValueError("expected key")
ValueError: expected key
>>> extract_json(StringData('{"key" = "not colon"}'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 73, in extract_json
    key, pos = extract_key(data, pos)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arun-mani-j/Scratch/jason.py", line 98, in extract_key
    raise ValueError(f"expected {JSON_SEP} but found: {repr(char)}")
ValueError: expected : but found: '='

Extracting null

Finally, let’s make the billion dollar mistake by extracting null. To do so, we read four characters and check that they spell null. We also stop as soon as we read a terminal, so we can quit early. For example, if our data is nu,2, then once we find the comma there is no point in reading the next character (2).

def extract_null(data: Data, pos: int):
    word = ""
    char = data[pos]
    count = 0

    while char not in TERMINALS and count != len(NULL):
        word += char
        count += 1
        pos += 1
        char = data[pos]

    if count == len(NULL) and word == NULL:
        pos = skip_space(data, pos)
        return None, pos

    raise ValueError(f"unexpected literal: {repr(word)}")

We use len(NULL) instead of a literal four because we don’t want to hardcode the length: if NULL ever changes, a hardcoded length would have to be updated too.

Now some testing of the function.

>>> extract_null(StringData('null'), 0)
(None, 4)
>>> extract_null(StringData('null  '), 0)
(None, 6)
>>> extract_null(StringData('nu,0'), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 151, in extract_null
    raise ValueError(f"unexpected literal: {repr(word)}")
ValueError: unexpected literal: 'nu'

Are We Done Yet?

This is getting a bit lengthier than it should be… Okay, so now that our function extract_value is ready, we should be able to extract JSON just fine… right? Well yes, but:

  1. In the tests we conducted so far, we manually passed a StringData and position. To be a good library, we should provide some shortcuts.

  2. extract_value just extracts a value. Yes, only one value. It means, if your data is "hello" "they do not know me", then our function parses only "hello".

    >>> extract_value(StringData('"hello"  "they do not know me"'), 0)
    ('hello', 9)
    

    But the data as a whole is invalid, and nothing reports that.

Let us introduce three more functions to tackle the above problems.

def extract(data: Data):
    pos = skip_space(data, 0)  # tolerate whitespace before the value
    val, pos = extract_value(data, pos)
    char = data[pos]

    if char != EOF:
        raise ValueError(f"unexpected character: {repr(char)}")

    return val


def load(fp: IO):
    return extract(FileData(fp))


def loads(txt: str):
    return extract(StringData(txt))

The first function extract takes a Data as argument and extracts the value in it, AND it checks whether the character left after the value is EOF. If it is not EOF, some other characters are still there, which is invalid syntax, so we raise an exception. Since our extract_* functions already consume trailing space, the character left must be either EOF or a non-space character.

The last two functions load and loads are similar to json.load and json.loads respectively.

load takes a file pointer and parses the data from it. loads parses the data from a Python str.
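
For comparison, this is how the standard-library counterparts are called; Jason’s load and loads are meant to be used the same way. The snippet uses only Python’s own json module, with io.StringIO standing in for a real file.

```python
import io
import json

text = '{"name": "Joe", "age": 25}'

# json.loads parses from a str ...
from_str = json.loads(text)

# ... while json.load parses from a file-like object.
from_file = json.load(io.StringIO(text))

print(from_str == from_file)  # True

# Strict JSON rejects a leading plus sign, one of the places our syntax differs.
try:
    json.loads("+2")
except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
    print("rejected:", err)
```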

Are We Done Yet??!!

Yes yes, we are! Now let us test our parser with a sample JSON data.

{
  "hello": 1,
  "another": +2,
  "extra": -3,
  "then": 2.17,
  "but": [1, 2, 3],
  "also": { "1": null, "2": [] },
  "end": "Bye"
}

Let’s say this file is saved as test.json. Now let us call our parser.

~/Scratch  ❯ python3 -i jason.py
>>> fp = open("test.json")
>>> json = load(fp)
>>> fp.close()
>>> from pprint import pprint
>>> pprint(json)
{'also': {'1': None, '2': []},
 'another': 2,
 'but': [1, 2, 3],
 'end': 'Bye',
 'extra': -3,
 'hello': 1,
 'then': 2.17}

Oh yea, it works 🙂. We used pprint to pretty print the returned dictionary. Let us now check what happens if we pass more than one value as data.

>>> loads('"parse me"   "not me"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 247, in loads
    return extract(StringData(txt))
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arun-mani-j/Scratch/jason.py", line 237, in extract
    raise ValueError(f"unexpected character: {repr(char)}")
ValueError: unexpected character: '"'
>>> loads('"parse me"   3.14')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun-mani-j/Scratch/jason.py", line 247, in loads
    return extract(StringData(txt))
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arun-mani-j/Scratch/jason.py", line 237, in extract
    raise ValueError(f"unexpected character: {repr(char)}")
ValueError: unexpected character: '3'

Yes! It fails successfully 🎉.

Conclusion

We managed to write a parser for our JSON-like syntax, and it works well enough. Of course, it can be improved a lot. Maybe the entire logic is flawed and we should rewrite it with a different algorithm. However, that is beyond our scope and objective. We wanted a silly parser and now we have one.

The full source code of jason.py is available in my snippets.

Thanks for reading! Please feel free to share your thoughts and suggestions. You can contact me via the links in footer.

Good day!