When a browser starts to render a page, it first transforms the HTML code into a DOM tree. This process includes two main activities:
- HTML tokenization: Transforming input text characters into HTML “tokens”.
- DOM tree building: Transforming HTML tokens from the previous step into a DOM tree.
Because there’re only two main activities, implementing an HTML parser should take you no more than 6 hours agonizing over the HTML5 specification, three weeks implementing half of the parsing algorithm and 2 hours questioning the meaning of existence…every single day. The implementer is then expected to experience several after-effects that may include: confusion, regret for not using Servo’s HTML parser, and Friedrich Nietzsche starts to sound fascinating:
To live is to suffer, to survive is to find some meaning in the suffering.
But since I have foolishly made a promise with the first three words in the name of this series, let’s build an HTML parser from scratch
When the browser receives an HTML document from the server, everything is transfered as raw bytes. Thus, to decode those bytes into readable text characters, the browser will first run the encoding sniffing algorithm to detect the document’s encoding. This includes trying out various methods from BOM sniffing to
BOM or Byte Order Mark, is like a magic number in files. When opening a file in a hex editor like bless, if the file starts with
46, we know that it’s a JPEG file;
46, it’s a PDF file. BOM serves the same purpose but for text streams. Therefore, to determine the encoding of the text stream, the browser will compare the first 3 bytes with the table below:
|Byte order mark||Encoding|
Back in 2012, when emmet is not yet a thing, and developers still typing HTML manually from start to finish, I often find myself missing a crucial tag that I have no idea how it works back then:
<meta charset="utf-8" />
This result in my browser displaying Vietnamese characters as “�” character, which for those who don’t know, is called “replacement character.” This issue was so popular back then that people started to paste replacement character into text inputs intentionally to troll webmasters and make them think that the database has a text encoding issue
Anyway, now you know that if the browser can’t find the BOM, it will try to detect the document encoding via the
meta tag. But you probably won’t have to worry about this since HTML autocomplete is quite powerful these days, and they usually generate that
meta tag by default.
Note: If you are not familiar with tokenization, be sure to read a bit about it here.
After the stream of bytes is decoded into a stream of characters, it’s then fed into an HTML tokenizer. The tokenizer is responsible for transforming input text characters into HTML tokens. There are fives types HTML tokens:
- DOCTYPE: Represent and contain information about the document doctype. Yes, that useless
<!DOCTYPE html>isn’t as useless as you think.
- Tag: Represent both start tag (e.g
<html>) and end tag (e.g
- Comment: Represent a comment in the HTML document.
- Character: Represent a character that is not part of any other tokens.
- EOF: Represent the end of the HTML document.
The HTML tokenizer is a state machine, which first starts at an initial state called the
Data state. From that, the tokenizer will process a character according to the instruction of that state. The tokenization ends when it encounters an
EOF character in the text stream.
The instruction for data state tokenization
But don’t be fooled by the small number of tokens and think that this is easy to implement. What gives me PTSD after implementing the tokenizer is the sheer number of tokenizer states. 80, to be exact.
A small section of the states from moon source code
A complete list of states can be found here.
The way the tree-building stage works is similar to the tokenize stage. It also switches between different states to create the DOM tree. What special about this stage is it have a stack of open elements to keep track of the parent-child relationship, similar to the balance parentheses problem.
Therefore, to build the DOM tree, the tree-building state machine will process the tokens emitted by the tokenizer one by one. If it encounters any script tag, it will pause the current work and let the JS engine does its job. After that, the tree-building process will continue until the EOF token is received.
One thing to notice when implementing an HTML parser is the tree-building stage doesn’t happen after the tokenize stage. As stated in the specification:
When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream.
Consider this piece of HTML code:
<p>this is a paragraph</p> <script> document.write("<p>this is a new one</p>"); </script> <p>this is another paragraph</p> <!-- very long html below... -->
document.write, the code starting from the end of the
</script> to the rest of the file will be cleared. Thus, if your parser attempts to tokenize the whole file before performing tree construction, it will end up wasting its time tokenizing redundant code.
Therefore, to tackle that problem, the browser has the ability to pause the HTML parsing process to execute the JS script first. Therefore, if the script modifies the page, the browser will resume parsing at a new location instead of where it left off before executing the script.
How CSS block rendering. Source
Here are some extra cool things that I learnt after implementing this HTML parser:
After hours of implementing dry HTML parsing rules, this one really makes me question my own sanity
At first, I thought this tag has something to do with the language Ruby. But turn out, it’s a tag to specify a small piece of on top of another text to show the pronunciation, otherwise known as ruby. For example:
<ruby> 河 內<rp>(</rp><rt>Hà Nội</rt><rp>)</rp> </ruby> <ruby> 東 京<rp>(</rp><rt>Tō kyō</rt><rp>)</rp> </ruby>
That’s all I can share on my journey implementing the HTML parser. It’s not satisfying to read, I know. But even though it’s complicated to implement the parser, summing up how it works turns out to be quite a simple task; hence, the abrupt ending . But I hope that I inspired you to use Servo’s HTML parser instead of implementing it from scratch like me . If you somehow deranged enough to do what I did, I wish you the best of luck.