Markwon/docs/docs/html.md
Dimitry e0563dca43
V2.0.0 (#66)
* Add `html-parser-api` and `html-parser-impl` modules
* Add `HtmlEmptyTagReplacement`
* Implement Appendable and CharSequence in SpannableBuilder
* Renamed library modules to reflect maven artifact names
* Rename `markwon-syntax` to `markwon-syntax-highlight`
* Add HtmlRenderer abstraction
* Add CssInlineStyleParser
* Fix Theme#listItemColor and OL
* Fix task list block parser to revert parsing state when line is not matching
* Defined test format files
* image-loader add datauri parser
* image-loader add support for inline data uri image references
* Add travis configuration
* Fix image with width greater than canvas scaled
* Fix blockquote span
* Dealing with white spaces at the end of a document
* image-loader add SchemeHandler abstraction
* Add sample-latex-math module
2018-09-17 13:15:58 +03:00


# HTML <Badge text="2.0.0" />
Starting with version `2.0.0`, `Markwon` brings the whole HTML parsing/rendering
stack _on-site_. The main reason for this is the _special_ definition of HTML nodes
by <Link name="commonmark-spec" />, more specifically <Link name="commonmark-spec#inline" displayName="inline" />
and <Link name="commonmark-spec#block" displayName="block" />.
These two are _a bit_ different from the _native_ HTML understanding.
Well, they are _completely_ different and share only their names with
<Link name="html-inlines" displayName="HTML-inline"/> and <Link name="html-blocks" displayName="HTML-block"/>
elements. This leads to situations where, for example, an `<i>` tag is considered
a block when it's used like this:
```markdown
<i>
Hello from italics tag
</i>
```
:::tip A bit of background
<br>
<GithubIssue id="52" displayName="This issue" /> brought attention to the differences between HTML &amp; commonmark implementations. <br><br>
:::
Let's modify the code snippet above _a bit_:
```markdown{3}
<i>
Hello from italics tag

</i>
```
We have just added a `new-line` before the closing `</i>` tag. And this
changes everything, as now, according to the <Link name="commonmark-dingus" />,
we have 2 HtmlBlocks: one before the `new-line` (containing the opening `<i>` tag and the text content)
and one after (containing nothing but the closing `</i>` tag).

If we modify the code snippet _a bit_ again:
```markdown{4}
<i>
Hello from italics tag

</i><b>bold</b>
```
We will have 1 HtmlBlock (from the previous snippet) and a bunch of HtmlInlines:
* HtmlInline (`</i>`)
* HtmlInline (`<b>`)
* Text (`bold`)
* HtmlInline (`</b>`)

Those _little_ differences render `Html.fromHtml` (which was used in the `1.x.x` versions)
useless. Actually, they render most HTML parser implementations useless,
as most of them do not allow processing HTML fragments in a raw fashion
without _fixing_ the content on-the-fly.

Both the `TagSoup` and `Jsoup` HTML parsers (which were considered for this project) are built to deal with
_malformed_ HTML code (*all HTML code*? :no_mouth:). So, when supplied
with an `<i>italic` fragment, they will turn it into `<i>italic</i>`.
And it's a good thing, but consider these fragments for the sake of markdown:
* `<i>italic `
* `<b>bold italic`
* `</b><i>`

We will get:
* `<i>italic </i>`
* `<b>bold italic</b>`

_<sup>*</sup> Or to be precise: `<html><head></head><body><i>italic </i></body></html>` &amp;
`<html><head></head><body><b>bold italic</b></body></html>`_

This will be rendered in the final document as:

|expected|actual|
|---|---|
|<i>italic <b>bold italic</b></i>|<i>italic </i><b>bold italic</b>|

This might seem like a minor problem, but add more tags to a document,
introduce some deeply nested structures, spice up the opening and closing tags
by adding markdown markup between them and finally write some _malformed_ HTML code :laughing:!
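
The nesting loss shown in the table above can be reproduced with a small, self-contained sketch. `FragmentFixer` and `autoClose` are hypothetical names, and the regex-based logic is only a stand-in for what a _fixing_ parser does to a single fragment; it is not how `TagSoup` or `Jsoup` are implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class FragmentFixer {

    // Matches an opening or a closing tag; group(1) is "/" for closing tags.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // "Fixes" ONE fragment in isolation: stray closing tags are dropped and
    // every tag that is still open at the end of the fragment gets closed.
    static String autoClose(String fragment) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(fragment);
        int last = 0;
        while (m.find()) {
            out.append(fragment, last, m.start()); // plain text before the tag
            last = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);
                out.append(m.group());
            } else if (name.equals(open.peek())) {
                open.pop();
                out.append(m.group());
            }
            // otherwise: a closing tag without a matching open one, dropped
        }
        out.append(fragment, last, fragment.length());
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        StringBuilder document = new StringBuilder();
        for (String fragment : new String[] {"<i>italic ", "<b>bold italic", "</b><i>"}) {
            document.append(autoClose(fragment));
        }
        System.out.println(document);
    }
}
```

Because every fragment is repaired without knowledge of its neighbours, the italic tag is closed before the bold text starts, and the closing `</b>` of the last fragment has nothing left to match.
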
There is no such problem on the _frontend_, at which the commonmark specification is mostly
aimed, as the _frontend_ runs in a web-browser environment. After all, _parsed_ markdown
will become HTML tags (the most common usage), and the web-browser will know how to render the final result.

We, on the other hand, do not possess HTML heritage (*thank :robot:!*), but still
want to display some HTML to style the resulting markdown a bit. That's why `Markwon`
incorporates its own HTML parsing logic. It is based on the <Link name="jsoup" /> project
and makes use of the `Tokeniser` class, which allows it to _tokenise_ input HTML.
All other code that doesn't serve this purpose was removed. It's safe to use
in projects that already have a `jsoup` dependency, as `Markwon` repackaged the **jsoup** source classes
(which can be found <Link name="markwon-jsoup" displayName="here"/>).
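
To illustrate what _tokenising_ without _fixing_ means, here is a minimal sketch. It is not the jsoup `Tokeniser` and not `Markwon` code: `RawTokens` and `tokenise` are hypothetical names, and a simple regular expression stands in for a real HTML tokeniser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class RawTokens {

    // Matches an opening or a closing tag; group(1) is "/" for closing tags.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>");

    // Emits tokens exactly as they appear in the input: nothing is
    // auto-closed and unmatched closing tags are kept.
    static List<String> tokenise(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                tokens.add("TEXT(" + html.substring(last, m.start()) + ")");
            }
            tokens.add((m.group(1).isEmpty() ? "OPEN(" : "CLOSE(") + m.group(2) + ")");
            last = m.end();
        }
        if (last < html.length()) {
            tokens.add("TEXT(" + html.substring(last) + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("<i>italic "));
        System.out.println(tokenise("</b><i>"));
    }
}
```

With raw tokens, the decision about which opening tag a closing tag belongs to can be made later, once tokens from all fragments of the document are available.
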
## Parser
There are no additional steps to configure HTML parsing. It's enabled by default.
If you wish to _exclude_ it, please follow the [exclude](#exclude-html-parsing) section below.
The key class here is `MarkwonHtmlParser`, which is defined in the `markwon-html-parser-api` module.
`markwon-html-parser-api` is a simple module that defines the HTML parsing contract and
does not provide an implementation.

To change which implementation `Markwon` uses, `SpannableConfiguration` can be used:
```java{2}
SpannableConfiguration.builder(context)
        .htmlParser(MarkwonHtmlParser)
        .build();
```
`markwon-html-parser-impl`, on the other hand, provides a `MarkwonHtmlParser` implementation
called `MarkwonHtmlParserImpl`. It can be created like this:
```java
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create();
// or
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create(HtmlEmptyTagReplacement);
```
### Empty tag replacement
In order to append text content for self-closing, void or just _empty_ HTML tags,
`HtmlEmptyTagReplacement` can be used. As we cannot set a Span on empty content,
we must represent an empty tag with text during the parsing stage (if we want it to be represented at all).
Consider these:
* `<img src="me-sad.JPG">`
* `<br />`
* `<who-am-i></who-am-i>`

The default implementation (`HtmlEmptyTagReplacement.create()`) handles the `img` and `br` tags.
`img` is replaced with the value of its `alt` attribute if it is present, and with `\uFFFC` if it is not.
`br` inserts a new line.
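
The default rules described above can be sketched as a small standalone function. This is not the `HtmlEmptyTagReplacement` source: the class name, the `replace` signature and the treatment of an empty `alt` as absent are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the default empty tag replacement rules.
final class EmptyTagReplacementSketch {

    // Object replacement character, used when an image has no `alt`.
    static final String IMG_REPLACEMENT = "\uFFFC";

    // Returns the text that should represent an empty tag,
    // or null if the tag should not be represented at all.
    static String replace(String tagName, Map<String, String> attributes) {
        switch (tagName) {
            case "br":
                return "\n";
            case "img": {
                String alt = attributes.get("alt");
                // assumption: an empty `alt` is treated the same as a missing one
                return (alt == null || alt.isEmpty()) ? IMG_REPLACEMENT : alt;
            }
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        System.out.println(replace("img", Collections.singletonMap("alt", "sad me")));
        System.out.println(replace("who-am-i", none));
    }
}
```

Under these assumptions `<who-am-i></who-am-i>` yields no text, so no Span can be attached to it.
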
### Non-closed tags
It's possible that your HTML contains non-closed tags. By default `Markwon` will ignore them,
but if you wish to get a bit closer to a web-browser experience, you can allow them:
```java{2}
SpannableConfiguration.builder(context)
        .htmlAllowNonClosedTags(true)
        .build();
```
:::warning Note
If there is (for example) an `<i>` tag at the start of a document that is never closed,
and `Markwon` is configured to **not** ignore non-closed tags (`.htmlAllowNonClosedTags(true)`),
the whole document will be rendered in italics.
:::
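
The effect of this flag can be sketched as follows. The code is a simplified model, not the `Markwon` implementation: `NonClosedTagsSketch` and `flush` are hypothetical names, and tags that are still open at the end of the document are passed in as a map from tag name to start offset.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of what happens to tags that were never closed.
final class NonClosedTagsSketch {

    // stillOpen: tag name -> offset in the plain text where the tag was opened.
    // Returns the resulting spans formatted as "name[start,end)".
    static List<String> flush(Map<String, Integer> stillOpen,
                              int documentLength,
                              boolean allowNonClosedTags) {
        List<String> spans = new ArrayList<>();
        if (!allowNonClosedTags) {
            return spans; // default behaviour: non-closed tags are ignored
        }
        for (Map.Entry<String, Integer> tag : stillOpen.entrySet()) {
            // browser-like behaviour: the tag stays in effect until the end
            spans.add(tag.getKey() + "[" + tag.getValue() + "," + documentLength + ")");
        }
        return spans;
    }

    public static void main(String[] args) {
        Map<String, Integer> stillOpen = new LinkedHashMap<>();
        stillOpen.put("i", 0); // `<i>` opened at the very start, never closed
        System.out.println(flush(stillOpen, 18, false));
        System.out.println(flush(stillOpen, 18, true));
    }
}
```
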
### Implementation note
`MarkwonHtmlParserImpl` does not create a unified HTML node tree. Instead, it creates
2 collections: inline tags and block tags. Inline tags are represented as a flat `List`
(<Link name="html-inlines" displayName="reference" />), and
block tags are structured in a tree. This helps to achieve _browser_-like behaviour,
where an open inline tag is applied to all content (even inside blocks) until its closing tag.
All tags that are not _inline_ are considered to be _block_ ones.
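
The last rule can be sketched as a simple classifier. `TagKindSketch` is a hypothetical name, and the set of inline tag names below is only a small illustrative subset, not the list `MarkwonHtmlParserImpl` actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical classifier: whatever is not a known inline tag is a block tag.
final class TagKindSketch {

    // Illustrative subset only; the real list of inline tags is longer.
    private static final Set<String> INLINE = new HashSet<>(Arrays.asList(
            "a", "b", "i", "em", "strong", "span", "sub", "sup", "code", "img", "br"));

    static boolean isInline(String tagName) {
        return INLINE.contains(tagName.toLowerCase(Locale.ROOT));
    }

    static boolean isBlock(String tagName) {
        return !isInline(tagName);
    }

    public static void main(String[] args) {
        System.out.println(isInline("i"));
        System.out.println(isBlock("ul"));
        System.out.println(isBlock("who-am-i")); // unknown tags fall into the block group
    }
}
```
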
## Renderer
Unlike `MarkwonHtmlParser`, `Markwon` comes with a `MarkwonHtmlRenderer` by default.
The default implementation can be obtained like this:
```java
MarkwonHtmlRenderer.create();
```
The default instance _handles_ these tags:
* emphasis
  * `i`
  * `em`
  * `cite`
  * `dfn`
* strong emphasis
  * `b`
  * `strong`
* `sup` (super script)
* `sub` (sub script)
* underline
  * `u`
  * `ins`
* strike through
  * `del`
  * `s`
  * `strike`
* `a` (link)
* `ul` (unordered list)
* `ol` (ordered list)
* `img` (image)
* `blockquote` (block quote)
* `h{1-6}` (heading)

If you wish to _extend_ the default handling (or override existing handlers),
the `#builderWithDefaults` factory method can be used:
```java
MarkwonHtmlRenderer.builderWithDefaults();
```
For a completely _clean_ configurable instance, the `#builder` method can be used:
```java
MarkwonHtmlRenderer.builder();
```
### Custom tag handler
To configure `MarkwonHtmlRenderer` to handle tags differently, or to
create a new tag handler, `TagHandler` can be used:
```java
public abstract class TagHandler {
    public abstract void handle(
            @NonNull SpannableConfiguration configuration,
            @NonNull SpannableBuilder builder,
            @NonNull HtmlTag tag
    );
}
```
For the simplest _inline_ tag handler, `SimpleTagHandler` can be used:
```java
public abstract class SimpleTagHandler extends TagHandler {
    @Nullable
    public abstract Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag);
}
```
For example, `EmphasisHandler`:
```java
public class EmphasisHandler extends SimpleTagHandler {
    @Nullable
    @Override
    public Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag) {
        return configuration.factory().emphasis();
    }
}
```
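
To make the idea behind `SimpleTagHandler` more tangible, here is a standalone sketch with no Android or `Markwon` dependencies. All names in it (`SimpleHandlerSketch`, `Tag`, `SpanProvider`) are hypothetical, and it only assumes that a simple handler applies whatever `getSpans` returns to the text range covered by the tag and skips the tag when nothing is returned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free model of a "simple" inline tag handler.
final class SimpleHandlerSketch {

    // Minimal stand-in for an inline tag: a name and the text range it covers.
    static final class Tag {
        final String name;
        final int start;
        final int end;

        Tag(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    // Mirrors the shape of `getSpans`: may return null when there is nothing to apply.
    interface SpanProvider {
        Object getSpans(Tag tag);
    }

    // Applies the provided span to the whole range of the tag, if there is one.
    static void handle(SpanProvider provider, Tag tag, List<String> applied) {
        Object span = provider.getSpans(tag);
        if (span != null) {
            applied.add(span + "@" + tag.start + ".." + tag.end);
        }
    }

    public static void main(String[] args) {
        List<String> applied = new ArrayList<>();
        handle(tag -> "emphasis", new Tag("i", 0, 22), applied);
        handle(tag -> null, new Tag("who-am-i", 22, 30), applied);
        System.out.println(applied);
    }
}
```
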
If you wish to handle a _block_ HTML node (for example `<ul><li>First<li>Second</ul>`), refer
to the `ListHandler` source code.
:::warning
The most important thing when implementing a custom `TagHandler` is to know
what type of `HtmlTag` we are dealing with. There are 2: inline &amp; block.
An inline tag cannot contain children. A block tag _can_ contain children, and they
_most likely_ should also be visited and _handled_ by a registered `TagHandler` (if any)
accordingly. See `TagHandler#visitChildren(configuration, builder, child);`
:::
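
The recursion described in the warning above can be sketched without any `Markwon` types. Everything in this example is hypothetical (`BlockVisitSketch`, `Block`, `Handler`), and it assumes that a child with a registered handler is passed to that handler, while a child without one is still descended into.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of visiting the children of a block tag.
final class BlockVisitSketch {

    // Minimal stand-in for a block tag: a name plus child blocks.
    static final class Block {
        final String name;
        final List<Block> children;

        Block(String name, Block... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // A handler decides on its own whether to descend into children.
    interface Handler {
        void handle(Block tag, Map<String, Handler> handlers, List<String> out);
    }

    static void visitChildren(Block parent, Map<String, Handler> handlers, List<String> out) {
        for (Block child : parent.children) {
            Handler handler = handlers.get(child.name);
            if (handler != null) {
                handler.handle(child, handlers, out);
            } else {
                // no handler registered: keep looking for handled descendants
                visitChildren(child, handlers, out);
            }
        }
    }

    // root > div > ul > (li, li); only `ul` and `li` have handlers.
    static List<String> demo() {
        Map<String, Handler> handlers = new HashMap<>();
        handlers.put("ul", (tag, all, sink) -> {
            sink.add("ul");
            visitChildren(tag, all, sink); // a block handler visits its own children
        });
        handlers.put("li", (tag, all, sink) -> sink.add("li"));

        Block root = new Block("root",
                new Block("div", new Block("ul", new Block("li"), new Block("li"))));
        List<String> visited = new ArrayList<>();
        visitChildren(root, handlers, visited);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the `ul` handler in this sketch forgot to call `visitChildren`, none of the `li` handlers would ever run, which is the pitfall the warning points at.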
#### CSS inline style parser
When implementing your own `TagHandler`, you might want to inspect the inline CSS styles
of an HTML element. `Markwon` provides a utility parser for that purpose:
```java
final CssInlineStyleParser inlineStyleParser = CssInlineStyleParser.create();
for (CssProperty property: inlineStyleParser.parse("width: 100%; height: 100%;")) {
    // [0] = CssProperty({width=100%}),
    // [1] = CssProperty({height=100%})
}
```
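
For readers curious about what such a parser has to do, here is a deliberately naive standalone sketch. It is not the `CssInlineStyleParser` source: `InlineCssSketch` is a hypothetical name, and the approach (splitting on `;` and `:`) ignores edge cases such as semicolons inside `url(...)` or quoted values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified inline style parser.
final class InlineCssSketch {

    // Parses "key: value; key: value;" into an ordered map.
    // Declarations without a colon, a key or a value are skipped.
    static Map<String, String> parse(String style) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = declaration.substring(0, colon).trim();
            String value = declaration.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                properties.put(key, value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(parse("width: 100%; height: 100%;"));
    }
}
```
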
## Exclude HTML parsing
If you wish to exclude HTML parsing altogether, you can manually
exclude the `markwon-html-parser-impl` artifact from your project's compile classpath.
This can be beneficial if you know that the markdown input won't contain
HTML and/or you wish to ignore it. Excluding HTML parsing
can speed up `Markwon` parsing and will decrease the final size of the
`Markwon` dependency by around `100kb`.
<MavenBadge :artifact="'markwon'" />
```groovy
dependencies {
    implementation("ru.noties:markwon:${markwonVersion}") {
        exclude module: 'markwon-html-parser-impl'
    }
}
```
Excluding `markwon-html-parser-impl` this way will result in the
`MarkwonHtmlParser#noOp` implementation being used. No further steps are
required.
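
How `Markwon` detects the missing artifact is not described here. One common way to implement such a fallback is to probe the classpath for the implementation class, as in the sketch below. The class name `OptionalParserSketch`, the `resolve` method and the returned labels are hypothetical and only stand in for "use the implementation" versus "use `noOp`".

```java
// Hypothetical sketch of an "optional dependency" fallback.
final class OptionalParserSketch {

    // Returns "impl" when the given class is on the classpath, "noOp" otherwise.
    static String resolve(String implementationClassName) {
        try {
            Class.forName(implementationClassName);
            return "impl";
        } catch (ClassNotFoundException | LinkageError e) {
            // the artifact was excluded: fall back to a parser that does nothing
            return "noOp";
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("java.lang.String"));
        System.out.println(resolve("no.such.ParserImpl"));
    }
}
```
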
:::warning Note
Excluding `markwon-html-parser-impl` won't remove *all* the content between
HTML tags. It will if `commonmark` decides that a specific fragment is an
`HtmlBlock`, but it won't if the fragment is considered an `HtmlInline`, as an `HtmlInline`
does not contain content (just a tag definition).
:::