
# HTML <Badge text="2.0.0" />

Starting with version `2.0.0`, `Markwon` brings the whole HTML parsing/rendering
stack _on-site_. The main reason for this is the _special_ definition of HTML nodes
by <Link name="commonmark-spec" />. More specifically: <Link name="commonmark-spec#inline" displayName="inline" />
and <Link name="commonmark-spec#block" displayName="block" />.
These two are _a bit_ different from the _native_ HTML understanding.
Well, they are _completely_ different and share only the same names as
<Link name="html-inlines" displayName="HTML-inline"/> and <Link name="html-blocks" displayName="HTML-block"/>
elements. This leads to situations where, for example, an `<i>` tag is considered
a block when it's used like this:

```markdown
<i>
Hello from italics tag
</i>
```

:::tip A bit of background
<br>
<GithubIssue id="52" displayName="This issue" /> brought attention to the differences between the HTML & commonmark implementations. <br><br>
:::

Let's modify the code snippet above _a bit_:

```markdown{3}
<i>
Hello from italics tag

</i>
```

We have just added a `new-line` before the closing `</i>` tag. This
changes everything, as now, according to the <Link name="commonmark-dingus" />,
we have 2 HtmlBlocks: one before the `new-line` (containing the open `<i>` tag and text content)
and one after (containing as little as the closing `</i>` tag).

If we modify the code snippet _a bit_ again:

```markdown{4}
<i>
Hello from italics tag

</i><b>bold</b>
```

We will have 1 HtmlBlock (from the previous snippet) and a bunch of HtmlInlines:
* HtmlInline (`</i>`)
* HtmlInline (`<b>`)
* Text (`bold`)
* HtmlInline (`</b>`)

Those _little_ differences render `Html.fromHtml` (which was used in `1.x.x` versions)
useless. Actually, they render most HTML parser implementations useless,
as most of them do not allow processing HTML fragments in a raw fashion
without _fixing_ the content on-the-fly.

Both the `TagSoup` and `Jsoup` HTML parsers (which were considered for this project) are built to deal with
_malicious_ HTML code (*all HTML code*? :no_mouth:). So, when supplied
with an `<i>italic` fragment, they will make it `<i>italic</i>`.
And it's a good thing, but consider these fragments for the sake of markdown:

* `<i>italic `
* `<b>bold italic`
* `</b><i>`

We will get<sup>*</sup>:

* `<i>italic </i>`
* `<b>bold italic</b>`

_<sup>*</sup> Or to be precise: `<html><head></head><body><i>italic </i></body></html>` &
`<html><head></head><body><b>bold italic</b></body></html>`_

Which will be rendered in a final document:

|expected|actual|
|---|---|
|<i>italic <b>bold italic</b></i>|<i>italic </i><b>bold italic</b>|

This might seem like a minor problem, but add more tags to a document,
introduce some deeply nested structures, spice opening and closing tags up
by adding markdown markup between them and finally write _malicious_ HTML code :laughing:!

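The mismatch between _expected_ and _actual_ can be reproduced with a small standalone sketch. It is not `Markwon` code: the class `FragmentSpansSketch`, its `spans` method and the closing fragment `</b></i>` are assumptions made for illustration. It compares _fixing_ every fragment on its own with keeping the open tags across fragments, and reports the text range each tag ends up covering:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Markwon
final class FragmentSpansSketch {

    // Matches attribute-less tags only (e.g. <i> or </b>), enough for this illustration
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    // An open tag waiting for its closing counterpart
    private static final class Open {
        final String name;
        final int start;

        Open(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    /**
     * Returns spans formatted as "name[start,end)" over the tag-free text.
     * When closeAtFragmentEnd is true, open tags are closed at the end of each
     * fragment, which mimics a parser that repairs every fragment on its own.
     */
    static List<String> spans(List<String> fragments, boolean closeAtFragmentEnd) {
        final List<String> out = new ArrayList<>();
        final Deque<Open> open = new ArrayDeque<>();
        final StringBuilder text = new StringBuilder();

        for (String fragment : fragments) {
            final Matcher matcher = TAG.matcher(fragment);
            int last = 0;
            while (matcher.find()) {
                text.append(fragment, last, matcher.start());
                last = matcher.end();
                final String name = matcher.group(2);
                if (matcher.group(1).isEmpty()) {
                    open.push(new Open(name, text.length()));
                } else if (!open.isEmpty() && open.peek().name.equals(name)) {
                    final Open o = open.pop();
                    out.add(o.name + "[" + o.start + "," + text.length() + ")");
                }
                // a closing tag without a matching open tag is ignored
            }
            text.append(fragment, last, fragment.length());

            if (closeAtFragmentEnd) {
                while (!open.isEmpty()) {
                    final Open o = open.pop();
                    out.add(o.name + "[" + o.start + "," + text.length() + ")");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        final List<String> fragments = List.of("<i>italic ", "<b>bold italic", "</b></i>");
        System.out.println(spans(fragments, true));  // [i[0,7), b[7,18)]
        System.out.println(spans(fragments, false)); // [b[7,18), i[0,18)]
    }
}
```

With per-fragment fixing the italic range stops at the end of the first fragment, so `bold italic` is never italic. When open tags are carried over, the `i` range covers the whole text.
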
There is no such problem on the _frontend_, at which the commonmark specification is mostly
aimed, as the _frontend_ runs in a web-browser environment. After all, _parsed_ markdown
will become HTML tags (the most common usage), and the web-browser will know how to render the final result.

We, on the other hand, do not possess an HTML heritage (*thank :robot:!*), but still
want to display some HTML to style the resulting markdown a bit. That's why `Markwon`
incorporated its own HTML parsing logic. It is based on the <Link name="jsoup" /> project
and makes use of the `Tokeniser` class, which allows us to _tokenise_ the input HTML.
All other code that doesn't serve this purpose was removed. It's safe to use
in projects that already have a `jsoup` dependency, as `Markwon` repackaged the **jsoup** source classes
(which can be found <Link name="markwon-jsoup" displayName="here"/>).

## Parser

There are no additional steps to configure HTML parsing. It's enabled by default.
If you wish to _exclude_ it, please follow the [exclude](#exclude-html-parsing) section below.

The key class here is `MarkwonHtmlParser`, which is defined in the `markwon-html-parser-api` module.
`markwon-html-parser-api` is a simple module that defines the HTML parsing contract and
does not provide an implementation.

To change which implementation `Markwon` should use, `SpannableConfiguration` can be used:

```java{2}
SpannableConfiguration.builder(context)
        .htmlParser(MarkwonHtmlParser)
        .build();
```

`markwon-html-parser-impl`, on the other hand, provides a `MarkwonHtmlParser` implementation.
It's called `MarkwonHtmlParserImpl`. It can be created like this:

```java
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create();
// or
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create(HtmlEmptyTagReplacement);
```

### Empty tag replacement

In order to append text content for self-closing, void or just _empty_ HTML tags,
`HtmlEmptyTagReplacement` can be used. As we cannot set a Span for empty content,
we must represent an empty tag with text during the parsing stage (if we want it to be represented).

Consider this:
* `<img src="me-sad.JPG">`
* `<br />`
* `<who-am-i></who-am-i>`

The default implementation (`HtmlEmptyTagReplacement.create()`) handles the `img` and `br` tags.
`img` will be replaced with its `alt` property if it is present and with `\uFFFC` if it is not.
And `br` will insert a new line.

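The rule described above can be written down as a small standalone sketch. It does not use the `Markwon` classes: `EmptyTagReplacementSketch` and its `replace` method are hypothetical names. Treating an empty `alt` as missing and returning `null` for any other tag are assumptions of this sketch, not documented behaviour:

```java
import java.util.Map;

// Hypothetical illustration of the default replacement rule, not Markwon's class
final class EmptyTagReplacementSketch {

    // Object replacement character, stands in for an image without `alt`
    private static final String IMG_PLACEHOLDER = "\uFFFC";

    /**
     * Returns the text that should represent an empty tag, or null when
     * the tag gets no textual representation (assumption of this sketch).
     */
    static String replace(String tagName, Map<String, String> attributes) {
        switch (tagName) {
            case "br":
                return "\n";
            case "img": {
                final String alt = attributes.get("alt");
                // assumption: an empty `alt` is treated like a missing one
                return (alt == null || alt.isEmpty()) ? IMG_PLACEHOLDER : alt;
            }
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(replace("img", Map.of("alt", "sad me"))); // sad me
        System.out.println(replace("who-am-i", Map.of()));           // null
    }
}
```
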
### Non-closed tags

It's possible that your HTML contains non-closed tags. By default `Markwon` will ignore them,
but if you wish to get a bit closer to a web-browser experience, you can allow this behaviour:

```java{2}
SpannableConfiguration.builder(context)
        .htmlAllowNonClosedTags(true)
        .build();
```

:::warning Note
If there is (for example) an `<i>` tag at the start of a document that is not closed,
and `Markwon` is configured to **not** ignore non-closed tags (`.htmlAllowNonClosedTags(true)`),
the whole document will be rendered in italics.
:::

### Implementation note

`MarkwonHtmlParserImpl` does not create a unified HTML node. Instead, it creates
2 collections: inline tags and block tags. Inline tags are represented as a `List`
of inline tags (<Link name="html-inlines" displayName="reference" />), and
block tags are structured in a tree. This helps to achieve _browser_-like behaviour,
where an open inline tag is applied to all content (even inside blocks) until its closing tag.
All tags that are not _inline_ are considered to be _block_ ones.

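To make the two collections more tangible, here is a minimal data-structure sketch. It is not how `MarkwonHtmlParserImpl` is written: the `Inline` and `Block` classes and the `inlinesAt` and `pathTo` helpers are hypothetical. It shows a flat list of inline ranges next to a tree of block ranges over the same text:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Markwon's internal representation
final class TagCollectionsSketch {

    // An inline tag only records its name and the range of text it covers
    static final class Inline {
        final String name;
        final int start;
        final int end;

        Inline(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    // A block tag additionally keeps its children, so blocks form a tree
    static final class Block {
        final String name;
        final int start;
        final int end;
        final List<Block> children = new ArrayList<>();

        Block(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        Block add(Block child) {
            children.add(child);
            return this;
        }

        // Block names from this node down to the deepest block containing `position`
        String pathTo(int position) {
            for (Block child : children) {
                if (position >= child.start && position < child.end) {
                    return name + ">" + child.pathTo(position);
                }
            }
            return name;
        }
    }

    // Names of all inline tags whose range contains `position`, in list order
    static List<String> inlinesAt(List<Inline> inlines, int position) {
        final List<String> names = new ArrayList<>();
        for (Inline inline : inlines) {
            if (position >= inline.start && position < inline.end) {
                names.add(inline.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical result for the text "FirstSecond" wrapped in an <i> tag and a list
        final List<Inline> inlines = List.of(new Inline("i", 0, 11));
        final Block ul = new Block("ul", 0, 11)
                .add(new Block("li", 0, 5))
                .add(new Block("li", 5, 11));

        System.out.println(inlinesAt(inlines, 7)); // [i]
        System.out.println(ul.pathTo(7));          // ul>li
    }
}
```

Because the inline range is stored independently of the block tree, it can span several blocks without being split at block boundaries.
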
## Renderer

Unlike `MarkwonHtmlParser`, `Markwon` comes with a `MarkwonHtmlRenderer` by default.

The default implementation can be obtained like this:

```java
MarkwonHtmlRenderer.create();
```

The default instance has these tags _handled_:
* emphasis
  * `i`
  * `em`
  * `cite`
  * `dfn`
* strong emphasis
  * `b`
  * `strong`
* `sup` (super script)
* `sub` (sub script)
* underline
  * `u`
  * `ins`
* strike through
  * `del`
  * `s`
  * `strike`
* `a` (link)
* `ul` (unordered list)
* `ol` (ordered list)
* `img` (image)
* `blockquote` (block quote)
* `h{1-6}` (heading)

If you wish to _extend_ the default handling (or override an existing handler),
the `#builderWithDefaults` factory method can be used:

```java
MarkwonHtmlRenderer.builderWithDefaults();
```

For a completely _clean_ configurable instance, the `#builder` method can be used:

```java
MarkwonHtmlRenderer.builder();
```

### Custom tag handler

To configure `MarkwonHtmlRenderer` to handle tags differently or to
create a new tag handler, `TagHandler` can be used:

```java
public abstract class TagHandler {

    public abstract void handle(
            @NonNull SpannableConfiguration configuration,
            @NonNull SpannableBuilder builder,
            @NonNull HtmlTag tag
    );
}
```

For the simplest _inline_ tag handler, a `SimpleTagHandler` can be used:

```java
public abstract class SimpleTagHandler extends TagHandler {

    @Nullable
    public abstract Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag);
}
```

For example, `EmphasisHandler`:

```java
public class EmphasisHandler extends SimpleTagHandler {
    @Nullable
    @Override
    public Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag) {
        return configuration.factory().emphasis();
    }
}
```

If you wish to handle a _block_ HTML node (for example `<ul><li>First<li>Second</ul>`), refer
to the `ListHandler` source code.

:::warning
The most important thing when implementing a custom `TagHandler` is to know
what type of `HtmlTag` we are dealing with. There are 2: inline & block.
An inline tag cannot contain children. A block tag _can_ contain children, and they
_most likely_ should also be visited and _handled_ by a registered `TagHandler` (if any)
accordingly. See `TagHandler#visitChildren(configuration, builder, child);`
:::

#### Css inline style parser

When implementing your own `TagHandler`, you might want to inspect the inline CSS styles
of an HTML element. `Markwon` provides a utility parser for that purpose:

```java
final CssInlineStyleParser inlineStyleParser = CssInlineStyleParser.create();
for (CssProperty property: inlineStyleParser.parse("width: 100%; height: 100%;")) {
    // [0] = CssProperty({width=100%}),
    // [1] = CssProperty({height=100%})
}
```

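To show what kind of result to expect from such parsing, here is a standalone sketch that splits an inline style into ordered key/value pairs. It is not the `CssInlineStyleParser` implementation: `InlineStyleSketch` and its `parse` method are hypothetical. The naive split on `;` is an assumption that would break on values which themselves contain a semicolon (for example a data URI):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Markwon's CssInlineStyleParser
final class InlineStyleSketch {

    /** Splits an inline style such as "width: 100%; height: 100%;" into ordered key/value pairs. */
    static Map<String, String> parse(String style) {
        final Map<String, String> properties = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            final int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skips empty or malformed declarations
            }
            final String key = declaration.substring(0, colon).trim();
            final String value = declaration.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                properties.put(key, value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(parse("width: 100%; height: 100%;")); // {width=100%, height=100%}
    }
}
```
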
## Exclude HTML parsing

If you wish to exclude HTML parsing altogether, you can manually
exclude the `markwon-html-parser-impl` artifact from your project's compile classpath.
This can be beneficial if you know that the markdown input won't contain
HTML and/or you wish to ignore it. Excluding HTML parsing
can speed up `Markwon` parsing and will decrease the final size of the
`Markwon` dependency by around `100kb`.

<MavenBadge :artifact="'markwon'" />

```groovy
dependencies {
    implementation("ru.noties:markwon:${markwonVersion}") {
        exclude module: 'markwon-html-parser-impl'
    }
}
```

Excluding `markwon-html-parser-impl` this way will result in the
`MarkwonHtmlParser#noOp` implementation being used. No further steps are
required.

:::warning Note
Excluding `markwon-html-parser-impl` won't remove *all* the content between
HTML tags. It will if `commonmark` decides that a specific fragment is an
`HtmlBlock`, but it won't if the fragment is considered an `HtmlInline`, as `HtmlInline`
does not contain content (just a tag definition).
:::