Convert markdown and its elements (tables, lists, code, etc.) into structured, easily processable data formats like lists and hierarchical dictionaries (or JSON), with support for parsing back to markdown.
- Detect, extract and convert markdown building blocks into Python data structures
- Provide two formats for parsed markdown:
  - List format: Each building block as a separate dictionary in a list
  - Dictionary format: Nested structure using headers as keys
- Convert parsed markdown to JSON
- Parse markdown data back to markdown formatted string
- Add options to select which data gets parsed back to markdown
- Extract specific building blocks (e.g., only tables or lists)
- Support for task lists (checkboxes)
- Enhanced code block handling with language detection
- Comprehensive blockquote support with nesting
- Consistent handling of definition lists
- Provide comprehensive documentation
- Add more test coverage (215 test cases)
- Publish on PyPI
- Add line numbers (`start_line` and `end_line`) to parsed markdown elements
- Align with edge cases of the CommonMark specification
pip install markdown-to-data
from markdown_to_data import Markdown
markdown = """
---
title: Example text
author: John Doe
---

# Main Header

- [ ] Pending task
    - [x] Completed subtask
- [x] Completed task

## Table Example
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |

´´´python
def hello():
    print("Hello World!")
´´´
"""
md = Markdown(markdown)
# Get parsed markdown as list
print(md.md_list)
# Each building block is a separate dictionary in the list
# Get parsed markdown as nested dictionary
print(md.md_dict)
# Headers are used as keys for nesting content
# Get information about markdown elements
print(md.md_elements)
[
{
'metadata': {'title': 'Example text', 'author': 'John Doe'},
'start_line': 2,
'end_line': 5
},
{
'header': {'level': 1, 'content': 'Main Header'},
'start_line': 7,
'end_line': 7
},
{
'list': {
'type': 'ul',
'items': [
{
'content': 'Pending task',
'items': [
{
'content': 'Completed subtask',
'items': [],
'task': 'checked'
}
],
'task': 'unchecked'
},
{'content': 'Completed task', 'items': [], 'task': 'checked'}
]
},
'start_line': 9,
'end_line': 11
},
{
'header': {'level': 2, 'content': 'Table Example'},
'start_line': 13,
'end_line': 13
},
{
'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
'start_line': 14,
'end_line': 16
},
{
'code': {
'language': 'python',
'content': 'def hello():\n    print("Hello World!")'
},
'start_line': 18,
'end_line': 21
}
]
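Because every block in the list format is a dictionary keyed by its type, extracting one kind of building block can be sketched with a plain list comprehension. The helper name `blocks_of_type` is hypothetical and not part of markdown-to-data; the sample data is a shortened copy of the output above.

```python
# Hypothetical helper, not part of the markdown-to-data API.
def blocks_of_type(md_list, block_type):
    """Return all blocks whose payload key equals block_type."""
    return [block for block in md_list if block_type in block]


# Shortened copy of the md_list output shown above.
sample = [
    {'header': {'level': 1, 'content': 'Main Header'}, 'start_line': 7, 'end_line': 7},
    {'header': {'level': 2, 'content': 'Table Example'}, 'start_line': 13, 'end_line': 13},
    {'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']}, 'start_line': 14, 'end_line': 16},
]

headers = blocks_of_type(sample, 'header')
header_titles = [block['header']['content'] for block in headers]
print(header_titles)
```

The same pattern works for any key that appears in the list, such as `'table'` or `'code'`.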
{
'metadata': {'title': 'Example text', 'author': 'John Doe'},
'Main Header': {
'list_1': {
'type': 'ul',
'items': [
{
'content': 'Pending task',
'items': [
{
'content': 'Completed subtask',
'items': [],
'task': 'checked'
}
],
'task': 'unchecked'
},
{'content': 'Completed task', 'items': [], 'task': 'checked'}
]
},
'Table Example': {
'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
'code_1': {
'language': 'python',
'content': 'def hello():\n    print("Hello World!")'
}
}
}
}
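Since the dictionary format nests content under header titles, a value can be reached by following the headers as a path, and the whole structure can be serialized with the standard `json` module. This is a minimal sketch on a shortened copy of the output above; `get_by_path` is a hypothetical helper, not something the package provides.

```python
import json

# Shortened copy of the md_dict output shown above.
md_dict_sample = {
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'Table Example': {
            'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
        },
    },
}


def get_by_path(nested, *keys):
    """Follow a sequence of keys (header titles) into a nested dictionary."""
    node = nested
    for key in keys:
        node = node[key]
    return node


table = get_by_path(md_dict_sample, 'Main Header', 'Table Example', 'table_1')
as_json = json.dumps(md_dict_sample, indent=2)
print(table)
print(as_json)
```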
{
'metadata': {
'count': 1,
'positions': [0],
'variants': ['2_fields'],
'summary': {}
},
'header': {
'count': 2,
'positions': [1, 3],
'variants': ['h1', 'h2'],
'summary': {'levels': {1: 1, 2: 1}}
},
'list': {
'count': 1,
'positions': [2],
'variants': ['task', 'ul'],
'summary': {'task_stats': {'checked': 2, 'unchecked': 1, 'total_tasks': 3}}
},
'table': {
'count': 1,
'positions': [4],
'variants': ['2_columns'],
'summary': {'column_counts': [2], 'total_cells': 2}
},
'paragraph': {
'count': 4,
'positions': [5, 6, 7, 8],
'variants': [],
'summary': {}
}
}
The enhanced `md_elements` property now provides:
- Extended variant tracking: Headers show level variants (h1, h2, etc.), tables show column counts, lists identify task lists
- Summary statistics: Detailed analytics for each element type including task list statistics, language distribution for code blocks, header level distribution, table cell counts, and blockquote nesting depth
- Better performance: Fixed O(n²) performance issue with efficient indexing
- Consistent output: Variants are sorted lists instead of sets for predictable results
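To illustrate what count, positions, and sorted variants mean, the following sketch builds a similar summary in a single pass over a list shaped like `md_list`. It is not the library's implementation; the function name `summarize` is hypothetical and only header variants are tracked.

```python
from collections import defaultdict

LINE_KEYS = ('start_line', 'end_line')


def summarize(md_list):
    """Single pass over the blocks: count, positions, and sorted variants."""
    summary = defaultdict(lambda: {'count': 0, 'positions': [], 'variants': set()})
    for position, block in enumerate(md_list):
        # The block type is the one key that is not a line-number field.
        block_type = next(key for key in block if key not in LINE_KEYS)
        entry = summary[block_type]
        entry['count'] += 1
        entry['positions'].append(position)
        if block_type == 'header':
            entry['variants'].add(f"h{block['header']['level']}")
    # Convert the variant sets to sorted lists for predictable output.
    return {
        name: {**entry, 'variants': sorted(entry['variants'])}
        for name, entry in summary.items()
    }


blocks = [
    {'metadata': {'title': 'Example text'}, 'start_line': 2, 'end_line': 5},
    {'header': {'level': 1, 'content': 'Main Header'}, 'start_line': 7, 'end_line': 7},
    {'list': {'type': 'ul', 'items': []}, 'start_line': 9, 'end_line': 11},
    {'header': {'level': 2, 'content': 'Table Example'}, 'start_line': 13, 'end_line': 13},
]
result = summarize(blocks)
print(result['header'])
```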
The `Markdown` class provides a method to parse markdown data back to markdown-formatted strings. The `to_md` method comes with options to customize the output:
from markdown_to_data import Markdown
markdown = """
---
title: Example
---

# Main Header

- [x] Task 1
    - [ ] Subtask
- [ ] Task 2

## Code Example

´´´python
print("Hello")
´´´
"""
md = Markdown(markdown)
Example 1: Include specific elements
print(md.to_md(
include=['header', 'list'], # Include all headers and lists
spacer=1 # One empty line between elements
))
Output:
# Main Header

- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
Example 2: Include by position and exclude specific types
print(md.to_md(
include=[0, 1, 2], # Include first three elements
exclude=['code'], # But exclude any code blocks
spacer=2 # Two empty lines between elements
))
Output:
---
title: Example
---


# Main Header


- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
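One way to picture how `include` and `exclude` might interact is the selection sketch below. It assumes that both options accept positions and type names and that `exclude` wins over `include`; these semantics are an assumption for illustration, and `select_blocks` is a hypothetical function, not the package's code.

```python
LINE_KEYS = ('start_line', 'end_line')


def select_blocks(md_list, include=None, exclude=None):
    """Keep blocks matched by include (position or type) unless excluded."""
    exclude = exclude or []
    selected = []
    for position, block in enumerate(md_list):
        block_type = next(key for key in block if key not in LINE_KEYS)
        # Assumption: include=None means "everything".
        wanted = include is None or position in include or block_type in include
        dropped = position in exclude or block_type in exclude
        if wanted and not dropped:
            selected.append(block)
    return selected


blocks = [
    {'metadata': {'title': 'Example'}},
    {'header': {'level': 1, 'content': 'Main Header'}},
    {'list': {'type': 'ul', 'items': []}},
    {'header': {'level': 2, 'content': 'Code Example'}},
    {'code': {'language': 'python', 'content': 'print("Hello")'}},
]

first_three = select_blocks(blocks, include=[0, 1, 2], exclude=['code'])
only_lists = select_blocks(blocks, include=['list'])
print([next(iter(block)) for block in first_three])
```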
The `to_md_parser` function can be used directly to convert markdown data structures to markdown text:
from markdown_to_data import to_md_parser
data = [
{
'metadata': {
'title': 'Document'
}
},
{
'header': {
'level': 1,
'content': 'Title'
}
},
{
'list': {
'type': 'ul',
'items': [
{
'content': 'Task 1',
'items': [],
'task': 'checked'
}
]
}
}
]
print(to_md_parser(data=data, spacer=1))
Output:
---
title: Document
---

# Title

- [x] Task 1
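The mapping from a list item dictionary back to a markdown line can be sketched as a small recursive renderer. This is an illustration only: `render_items` is a hypothetical name, it handles unordered lists only, and the four-space indent per nesting level is an assumption rather than documented behavior of `to_md_parser`.

```python
def render_items(items, depth=0, indent=4):
    """Render nested list items as unordered markdown lines."""
    checkbox = {'checked': '[x] ', 'unchecked': '[ ] '}
    lines = []
    for item in items:
        # Items without a task state (task is None or missing) get no checkbox.
        box = checkbox.get(item.get('task'), '')
        lines.append(' ' * (indent * depth) + '- ' + box + item['content'])
        lines.extend(render_items(item['items'], depth + 1, indent))
    return lines


items = [
    {
        'content': 'Task 1',
        'items': [{'content': 'Subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
    {'content': 'Task 2', 'items': [], 'task': 'unchecked'},
]
rendered = render_items(items)
print('\n'.join(rendered))
```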
metadata = '''
---
title: Document
author: John Doe
tags: markdown, documentation
---
'''
md = Markdown(metadata)
print(md.md_list)
Output:
[
{
'metadata': {
'title': 'Document',
'author': 'John Doe',
'tags': ['markdown', 'documentation']
},
'start_line': 2,
'end_line': 6
}
]
headers = '''
# Main Title
## Section
### Subsection
'''
md = Markdown(headers)
print(md.md_list)
Output:
[
{
'header': {'level': 1, 'content': 'Main Title'},
'start_line': 2,
'end_line': 2
},
{
'header': {
'level': 2,
'content': 'Section'
},
'start_line': 3,
'end_line': 3
},
{
'header': {'level': 3, 'content': 'Subsection'},
'start_line': 4,
'end_line': 4
}
]
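The `level` field makes it straightforward to derive an outline from the parsed headers. The sketch below indents each title by its level; `outline` is a hypothetical helper and the two-space indent is an arbitrary choice.

```python
def outline(md_list, indent='  '):
    """Return header titles indented according to their level."""
    return [
        indent * (block['header']['level'] - 1) + block['header']['content']
        for block in md_list
        if 'header' in block
    ]


# Copy of the header output shown above.
parsed = [
    {'header': {'level': 1, 'content': 'Main Title'}, 'start_line': 2, 'end_line': 2},
    {'header': {'level': 2, 'content': 'Section'}, 'start_line': 3, 'end_line': 3},
    {'header': {'level': 3, 'content': 'Subsection'}, 'start_line': 4, 'end_line': 4},
]
toc = outline(parsed)
print('\n'.join(toc))
```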
lists = '''
- Regular item
    - Nested item
- [x] Completed task
    - [ ] Pending subtask
1. Ordered item
    1. Nested ordered
'''
md = Markdown(lists)
print(md.md_list)
Output:
[
{
'list': {
'type': 'ul',
'items': [
{
'content': 'Regular item',
'items': [
{'content': 'Nested item', 'items': [], 'task': None}
],
'task': None
},
{
'content': 'Completed task',
'items': [
{
'content': 'Pending subtask',
'items': [],
'task': 'unchecked'
}
],
'task': 'checked'
}
]
},
'start_line': 2,
'end_line': 5
},
{
'list': {
'type': 'ol',
'items': [
{
'content': 'Ordered item',
'items': [
{'content': 'Nested ordered', 'items': [], 'task': None}
],
'task': None
}
]
},
'start_line': 6,
'end_line': 7
}
]
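Because nested items repeat the same `content` / `items` / `task` shape, statistics such as the number of checked and unchecked tasks can be gathered with a short recursion. The function `count_tasks` below is a hypothetical helper written against the structure shown above, not a function exported by the package.

```python
def count_tasks(items):
    """Return (checked, unchecked) counts over all nesting levels."""
    checked = unchecked = 0
    for item in items:
        if item['task'] == 'checked':
            checked += 1
        elif item['task'] == 'unchecked':
            unchecked += 1
        # Recurse into nested items and add their counts.
        nested_checked, nested_unchecked = count_tasks(item['items'])
        checked += nested_checked
        unchecked += nested_unchecked
    return checked, unchecked


# Items of the unordered list from the output above.
ul_items = [
    {
        'content': 'Regular item',
        'items': [{'content': 'Nested item', 'items': [], 'task': None}],
        'task': None,
    },
    {
        'content': 'Completed task',
        'items': [{'content': 'Pending subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
]
task_counts = count_tasks(ul_items)
print(task_counts)
```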
tables = '''
| Header 1 | Header 2 |
|----------|----------|
| Value 1 | Value 2 |
| Value 3 | Value 4 |
'''
md = Markdown(tables)
print(md.md_list)
Output:
[
{
'table': {
'Header 1': ['Value 1', 'Value 3'],
'Header 2': ['Value 2', 'Value 4']
},
'start_line': 2,
'end_line': 5
}
]
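Tables are stored column by column, with each header mapping to a list of cell values. When row-wise records are more convenient, the columns can be transposed with `zip`. This is a minimal sketch; `table_to_rows` is a hypothetical helper and it assumes all columns have the same length.

```python
def table_to_rows(table):
    """Turn {'header': [values...]} columns into a list of row dictionaries."""
    headers = list(table)
    return [dict(zip(headers, values)) for values in zip(*table.values())]


# Table payload from the output above.
table = {
    'Header 1': ['Value 1', 'Value 3'],
    'Header 2': ['Value 2', 'Value 4'],
}
rows = table_to_rows(table)
print(rows)
```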
code = '''
´´´python
def example():
    return "Hello"
´´´

´´´javascript
console.log("Hello");
´´´
'''
md = Markdown(code)
print(md.md_list)
Output:
[
{
'code': {
'language': 'python',
'content': 'def example():\n    return "Hello"'
},
'start_line': 2,
'end_line': 5
},
{
'code': {'language': 'javascript', 'content': 'console.log("Hello");'},
'start_line': 7,
'end_line': 9
}
]
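With the language stored next to each code block, a language distribution can be computed with `collections.Counter`. The sketch works on a copy of the output above and does not rely on any package function.

```python
from collections import Counter

# Copy of the code block output shown above.
parsed = [
    {
        'code': {'language': 'python', 'content': 'def example():\n    return "Hello"'},
        'start_line': 2,
        'end_line': 5,
    },
    {
        'code': {'language': 'javascript', 'content': 'console.log("Hello");'},
        'start_line': 7,
        'end_line': 9,
    },
]

# Count how often each language occurs among the code blocks.
languages = Counter(block['code']['language'] for block in parsed if 'code' in block)
print(dict(languages))
```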
blockquotes = '''
> Simple quote
> Multiple lines

> Nested quote
>> Inner quote
> Back to outer
'''
md = Markdown(blockquotes)
print(md.md_list)
Output:
[
{
'blockquote': [
{'content': 'Simple quote', 'items': []},
{'content': 'Multiple lines', 'items': []}
],
'start_line': 2,
'end_line': 3
},
{
'blockquote': [
{
'content': 'Nested quote',
'items': [{'content': 'Inner quote', 'items': []}]
},
{'content': 'Back to outer', 'items': []}
],
'start_line': 5,
'end_line': 7
}
]
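Nested quotes appear as entries in the `items` list of their parent, so the nesting depth of a blockquote can be measured recursively. `quote_depth` below is a hypothetical helper written against the structure shown above.

```python
def quote_depth(items):
    """Return the maximum nesting depth of a blockquote item list."""
    if not items:
        return 0
    return 1 + max(quote_depth(item['items']) for item in items)


# Blockquote payloads from the output above.
simple = [
    {'content': 'Simple quote', 'items': []},
    {'content': 'Multiple lines', 'items': []},
]
nested = [
    {'content': 'Nested quote', 'items': [{'content': 'Inner quote', 'items': []}]},
    {'content': 'Back to outer', 'items': []},
]
depths = (quote_depth(simple), quote_depth(nested))
print(depths)
```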
def_lists = '''
Term
: Definition 1
: Definition 2
'''
md = Markdown(def_lists)
print(md.md_list)
Output:
[
{
'def_list': {'term': 'Term', 'list': ['Definition 1', 'Definition 2']},
'start_line': 2,
'end_line': 4
}
]
- Some extended markdown flavors might not be supported
- Inline formatting (bold, italic, links) is currently not parsed
- Table alignment specifications are not preserved
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.