Unexpected error when parsing Markdown: ANTLR preview shows a different result from antlr-runtime #4846

kurt-steiner · 2025-06-07T14:27:59Z

kurt-steiner
Jun 7, 2025

Hey, guys.
I am trying to make a Markdown parser, and this is my lexer and parser:

lexer grammar MarkdownLexer;

Letter: [a-zA-Z];
Digit: [0-9];

Space: ' ';
Tab: '\t';
Newline: '\r'? '\n';
WhiteSpace: [\r\n]+;

grammar MarkdownRule;

import MarkdownLexer;
// TODO add url link
inline: plainText | inlineCode | emphasis;

// indent: Space{2,} | Tab;
indent: Space+ | Tab;

plainText: (Letter | Digit)+;
inlineCode: '`' plainText '`';

emphasis: bold | boldItalic | italic | strikeThrough | underline;

// TODO add urllink
boldTag: '**';
boldElement: (plainText | inlineCode | strikeThrough | italic);
bold: boldTag boldElement boldTag # boldSingle
    | boldTag boldElement (indent boldElement)+ boldTag # boldMulti
    ;

// TODO add urllink
italicTag: '*';
italicElement: (bold | inlineCode | strikeThrough | plainText);
italic: italicTag italicElement italicTag # italicSingle
    |   italicTag italicElement (indent italicElement)+ italicTag # italicMutli
    ;

// TODO add urllink
strikeTag: '~~';
strikeThroughElement: (italic | bold | inlineCode | plainText);
strikeThrough: strikeTag strikeThroughElement strikeTag # strikeThroughSingle
    |          strikeTag strikeThroughElement (indent strikeThroughElement)+ strikeTag # strikeThroughMulti
    ;

// TODO adjust here
boldItalicTag: '***';
boldItalicElement: plainText;
boldItalic: boldItalicTag boldItalicElement boldItalicTag # boldItalicSingle
    |       boldItalicTag boldItalicElement (indent boldItalicElement)+ boldItalicTag # boldItalicMulti
    ;

// TODO add urllink
underlineTag: '__';
underlineElement: (plainText | bold | italic | inlineCode);
underline: underlineTag underlineElement underlineTag # underlineSingle
    |      underlineTag underlineElement (indent underlineElement)+ underlineTag # underlineMutli
    ;

When I test with the ANTLR preview, for example with the input **hello world**:

Nothing wrong, right?

You can check my visitor here.

Now I am going to test it with code:

private fun visitMarkdownElement(charStream: CharStream): MarkdownElement {
    val lexer = MarkdownLexer(charStream)
    val tokens = CommonTokenStream(lexer)
    val parser = MarkdownRuleParser(tokens)

    val tree = parser.inline()
    val visitor = MarkdownVisitor()
    return visitor.visit(tree)
}

@Test
fun testBold() {
    val input = CharStreams.fromString("**hello**")
    val markdown = visitMarkdownElement(input)

    println(markdown.toHTML())
}

The result is:

line 1:0 token recognition error at: '*'
line 1:1 token recognition error at: '*'
line 1:3 mismatched input 'e' expecting {Letter, Digit}
<code></code>

Testing again with:

@Test
fun testBold() {
    val input = CharStreams.fromString("__hello__")
    val markdown = visitMarkdownElement(input)

    println(markdown.toHTML())
}

The result is:

line 1:0 token recognition error at: '_'
line 1:1 token recognition error at: '_'
line 1:3 mismatched input 'e' expecting {Letter, Digit}
<code></code>

You can check the full log:

> Task :checkKotlinGradlePluginConfigurationErrors SKIPPED
> Task :generateGrammarSource UP-TO-DATE
> Task :compileKotlin UP-TO-DATE
> Task :compileJava UP-TO-DATE
> Task :processResources NO-SOURCE
> Task :classes UP-TO-DATE
> Task :jar UP-TO-DATE
> Task :generateTestGrammarSource NO-SOURCE
> Task :compileTestKotlin UP-TO-DATE
> Task :compileTestJava NO-SOURCE
> Task :processTestResources UP-TO-DATE
> Task :testClasses UP-TO-DATE
line 1:0 token recognition error at: '_'
line 1:1 token recognition error at: '_'
line 1:3 mismatched input 'e' expecting {Letter, Digit}
<code></code>
> Task :test
BUILD SUCCESSFUL in 1s
7 actionable tasks: 1 executed, 6 up-to-date
22:26:01: Execution finished ':test --tests "com.aurora.markdown.TestInline.testBold"'.

Very strange: every input ends up being parsed as inlineCode.

Answered by kaby76

Jun 7, 2025


kurt-steiner · 2025-06-07T14:45:46Z

kurt-steiner
Jun 7, 2025
Author

plugins {
    kotlin("jvm") version "2.1.20"
    antlr
}

group = "com.aurora.markdown"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

dependencies {
    // https://mvnrepository.com/artifact/org.antlr/antlr4
    antlr("org.antlr:antlr4:4.13.2")
    // https://mvnrepository.com/artifact/org.antlr/antlr4-runtime
    implementation("org.antlr:antlr4-runtime:4.13.2")
    testImplementation(kotlin("test"))
}

sourceSets {
    main {
        antlr {
            setSrcDirs(listOf("grammar"))
        }
    }
}

tasks.test {
    useJUnitPlatform()
}

tasks.generateGrammarSource {
    outputDirectory = file("src/main/java/com/aurora/markdown/grammar")
    arguments = arguments + listOf("-visitor", "-package", "com.aurora.markdown.grammar")
}

tasks.generateTestGrammarSource {
    outputDirectory = file("src/main/java/com/aurora/markdown/grammar")
    arguments = arguments + listOf("-visitor", "-package", "com.aurora.markdown.grammar")
}

tasks.compileKotlin {
    dependsOn(tasks.generateGrammarSource)
}

tasks.compileTestKotlin {
    dependsOn(tasks.generateGrammarSource)
    dependsOn(tasks.generateTestGrammarSource)
}

tasks.compileJava {
    dependsOn(tasks.generateGrammarSource)
}

tasks.compileTestJava {
    dependsOn(tasks.generateGrammarSource)
    dependsOn(tasks.generateTestGrammarSource)
}

kotlin {
    jvmToolchain(17)
}

I modified build.gradle.kts as shown above, but it fails with the same error.

0 replies

kaby76 · 2025-06-07T15:40:44Z

kaby76
Jun 7, 2025

Your grammar does not follow the standard conventions that help avoid all sorts of beginner issues. Here's what you should do.

1. You define a split grammar, but don't declare 'parser grammar' in the parser grammar. You then get two .tokens files that are different, assuming you run the Antlr tool on both .g4 files. It should be parser grammar MarkdownRuleParser;.
2. You should be using the tokenVocab option, not import. Follow the standard pattern for split grammars. Replace import MarkdownLexer; with options { tokenVocab = MarkdownLexer; }.
3. If you are using a split grammar, you have to define lexer rules for every string literal used in the parser grammar. Add to the lexer grammar before the first lexer rule: BQ : '`'; SS : '**'; S : '*'; SQSQ : '~~'; SSS : '***'; UU : '__';
4. If your intent is to read the entire file and parse it in its entirety, as opposed to having the parser give up on the first occurrence of an error, then you need to use an EOF-terminated start rule: start: inline EOF;.
5. Eliminate all the code after the parse in your driver. It's unnecessary and distracts from the problem, which is that you are getting the error line 1:0 token recognition error at: '*'. Follow the standard way to write a grammar.
6. It also sounds like you have a build problem. You need a build environment that ensures you rebuild the application and don't get skew between the .g4 files, the generated files, and the executable. Make sure you remove all .tokens files and generated parser files, then run antlr4 MarkdownLexer.g4 first, then antlr4 MarkdownRuleParser.g4, then compile and link.
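
The points above can be put together in a rough sketch of the split grammar. This is a trimmed illustration rather than the poster's final grammar: only bold and italic are kept, and the token names are the ones suggested in this answer. A likely reason the preview and the runtime disagree, assuming the driver shown in the question, is that the combined grammar MarkdownRule generates its own lexer (with implicit tokens for '**', '__', and so on), while the driver instantiates MarkdownLexer, which has no rules for those literals and numbers its tokens differently.

```antlr
// MarkdownLexer.g4
lexer grammar MarkdownLexer;

// One token per literal used by the parser.
// The lexer prefers the longest match, so "***" becomes SSS,
// not SS followed by S, regardless of the order of these rules.
BQ   : '`';
SS   : '**';
S    : '*';
SQSQ : '~~';
SSS  : '***';
UU   : '__';

Letter     : [a-zA-Z];
Digit      : [0-9];
Space      : ' ';
Tab        : '\t';
Newline    : '\r'? '\n';
WhiteSpace : [\r\n]+;
```

```antlr
// MarkdownRuleParser.g4 (the file name must match the grammar name)
parser grammar MarkdownRuleParser;

// Take token types from the lexer grammar instead of importing it.
options { tokenVocab = MarkdownLexer; }

// EOF-terminated start rule: the parser has to consume the whole input.
start: inline EOF;

inline: plainText | inlineCode | emphasis;
indent: Space+ | Tab;

plainText: (Letter | Digit)+;
inlineCode: BQ plainText BQ;

// Trimmed for the sketch: strikeThrough, boldItalic and underline are omitted.
emphasis: bold | italic;

boldElement: plainText | inlineCode | italic;
bold: SS boldElement SS                           # boldSingle
    | SS boldElement (indent boldElement)+ SS     # boldMulti
    ;

italicElement: plainText | inlineCode | bold;
italic: S italicElement S                         # italicSingle
    | S italicElement (indent italicElement)+ S   # italicMulti
    ;
```

With this layout the generated classes are still MarkdownLexer and MarkdownRuleParser, so the driver from the question should only need to call parser.start() instead of parser.inline().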

2 replies

kurt-steiner Jun 7, 2025
Author

Eh, for some reason, I don't think a token like * should always be recognized immediately.
For example, inside a blockquote or a code block, the * could be a multiplication operator.

I am going to try your other suggestions 🙂

kurt-steiner Jun 7, 2025
Author

OK, I finally fixed it with your method.
But how should I handle this situation?

For example, inside a blockquote or a code block, the * could be a multiplication operator.

jimidle · 2025-06-10T16:08:45Z

jimidle
Jun 10, 2025

Markdown is not a good candidate for an ANTLR grammar as it is very ambiguous within the language. It wasn't meant to be parsed like this. It definitely isn't a candidate for your first grammar - as you are seeing with the ambiguity. But don't think about the operators and context - think about the lexemes; '*' is always a '*' - in Markdown, its meaning depends on where you see it.

Unless you have a specific reason to write another Markdown parser, I would pick some other language to start with.
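
One way to read the "think about the lexemes" advice is that the lexer always emits the same token for '*', and the parser rule decides what it means. Here is a minimal sketch, assuming a MarkdownLexer that defines the tokens BQ, S, SS, SSS, SQSQ and UU suggested earlier in this thread; the grammar name CodeSpanParser and the rule codeText are made up for illustration.

```antlr
// CodeSpanParser.g4 (hypothetical name, for illustration only)
parser grammar CodeSpanParser;

options { tokenVocab = MarkdownLexer; }

start: inlineCode EOF;

// Between the backticks the emphasis markers are accepted as ordinary text,
// so the S in `a*b` is not treated as the start of an italic span.
inlineCode: BQ codeText BQ;

codeText: (Letter | Digit | Space | Tab | S | SS | SSS | SQSQ | UU)+;
```

The lexer stays context-free here: it tokenizes `a*b` as BQ Letter S Letter BQ, and only the inlineCode rule treats the S as plain text.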

1 reply

kurt-steiner Jun 10, 2025
Author

How about org-mode or any other markup language?
This Markdown parser is for learning; it will not be complex.
