1. Characters and lines
Any sequence of characters is a valid CommonMark document.
A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.
This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.
A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.
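The line-ending definition can be sketched as a small splitter. This is a minimal illustration, not part of the spec: the name `split_lines` is hypothetical, and the choice to treat a trailing line ending as terminating the last line (rather than opening an empty one) is an assumption.

```python
import re

# "\r\n" is listed first so a carriage return followed by a newline
# is consumed as a single line ending rather than two.
LINE_ENDING = re.compile(r"\r\n|\n|\r")


def split_lines(text: str) -> list[str]:
    """Split text into lines, dropping the line endings themselves."""
    lines = LINE_ENDING.split(text)
    # Assumption: a final line ending closes the last line instead of
    # starting a new, empty one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

For instance, `split_lines("a\nb\r\nc\rd")` yields four lines, one per style of line ending plus the final unterminated line.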
A line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009), is called a blank line.
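A blank-line test following this definition might look like the sketch below. The function name `is_blank_line` is hypothetical, and the input is assumed to be a single line with its line ending already removed.

```python
def is_blank_line(line: str) -> bool:
    """True if the line is empty or holds only spaces (U+0020) and tabs (U+0009)."""
    # all() over an empty string is True, which covers the
    # "no characters" case of the definition.
    return all(ch in (" ", "\t") for ch in line)
```

Note that other whitespace, such as a no-break space (U+00A0), does not make a line blank under this definition.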
The following definitions of character classes will be used in this spec:
A whitespace character is a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).
A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).
A space is U+0020.
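An ASCII punctuation character is any character in the ranges U+0021 to U+002F, U+003A to U+0040, U+005B to U+0060, or U+007B to U+007E.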
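The character classes above can be expressed as simple predicates. The following is a sketch under stated assumptions: the function names are hypothetical, and the ASCII punctuation set is taken to be Python's `string.punctuation` (the 32 non-alphanumeric printable ASCII characters other than space).

```python
import string
import unicodedata

# Space, tab, newline, line tabulation, form feed, carriage return.
WHITESPACE = frozenset(" \t\n\x0b\x0c\r")

# Control characters that count as Unicode whitespace in addition to Zs.
# Line tabulation (U+000B) is deliberately absent here.
UNICODE_WHITESPACE_EXTRA = frozenset("\t\r\n\x0c")

# Assumption: string.punctuation matches the intended ASCII punctuation set.
ASCII_PUNCTUATION = frozenset(string.punctuation)


def is_whitespace(ch: str) -> bool:
    return ch in WHITESPACE


def is_unicode_whitespace(ch: str) -> bool:
    return unicodedata.category(ch) == "Zs" or ch in UNICODE_WHITESPACE_EXTRA


def is_ascii_punctuation(ch: str) -> bool:
    return ch in ASCII_PUNCTUATION
```

The two whitespace classes overlap but neither contains the other: line tabulation is a whitespace character only, while a no-break space (U+00A0, category Zs) is a Unicode whitespace character only.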
2. Tabs
Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
Thus, for example, a tab can be used instead of four spaces in an indented code block. (Note, however, that internal tabs are passed through as literal tabs, not expanded to spaces.)
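The tab-stop rule can be made concrete by measuring leading indentation in columns without rewriting the line. This is an illustrative sketch only; `indent_width` and its `start_column` parameter are hypothetical names.

```python
TAB_STOP = 4


def indent_width(line: str, start_column: int = 0) -> int:
    """Count columns of leading whitespace; a tab advances to the next tab stop."""
    column = start_column
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # Advance to the next multiple of TAB_STOP.
            column += TAB_STOP - (column % TAB_STOP)
        else:
            break
    return column - start_column
```

Under this rule a single tab, four spaces, and two spaces followed by a tab all measure 4 columns, which is why any of them can introduce an indented code block. The line itself is left untouched, so tabs after the indentation remain literal tabs.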
In the following example, a continuation paragraph of a list item is indented with a tab; this has exactly the same effect as indentation with four spaces would:
Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case, > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces.
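The arithmetic in this paragraph can be checked with a short sketch. The input line used here, a > followed by two tabs and then foo, is an assumption chosen to be consistent with the figures quoted in the text; the helper name `columns_after_marker` is hypothetical.

```python
TAB_STOP = 4
CODE_INDENT = 4  # columns of indentation that open an indented code block


def columns_after_marker(line: str) -> int:
    """Whitespace columns between a leading '>' and the first content character."""
    if not line.startswith(">"):
        raise ValueError("expected a block quote marker")
    column = 1  # the '>' itself occupies column 0
    for ch in line[1:]:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column += TAB_STOP - (column % TAB_STOP)
        else:
            break
    return column - 1


line = ">\t\tfoo"
total = columns_after_marker(line)  # first tab spans 3 columns, second spans 4
inside_quote = total - 1            # one column belongs to the delimiter
leading_spaces_in_code = inside_quote - CODE_INDENT
```

With this input the first tab covers columns 1 through 3 (three spaces), the second covers columns 4 through 7, and after giving one column to the delimiter and four to the code-block indentation, two spaces remain in the code content.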
3. Insecure characters
For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).
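As a minimal sketch, the replacement can be done in a single pass over the input before any parsing. The function name `replace_insecure` is hypothetical; U+FFFD is the Unicode code point named REPLACEMENT CHARACTER.

```python
def replace_insecure(text: str) -> str:
    """Replace every U+0000 with U+FFFD (REPLACEMENT CHARACTER)."""
    return text.replace("\x00", "\ufffd")
```

Text that contains no U+0000 comes back unchanged.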