[Feature] Override Table Conversion To HTML For Merged Cells & Complex Tables
Problem
The current DOCX-to-Markdown conversion process has a significant limitation when it comes to handling complex tables and merged cells. The native Markdown table syntax (| --- |) lacks support for these features, resulting in broken or oversimplified output. This can lead to a loss of critical formatting, including:
- Merged cells (rowspan/colspan)
- Complex tables (nested structures, multi-level headers)
- Styling (borders, alignment)
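To make the limitation concrete, here is a minimal sketch of what happens when a merged cell is forced into pipe syntax. The `Cell` model and the `to_pipe_table` helper are hypothetical and exist only for illustration; they are not part of markitdown or markdownify.

```python
from typing import List, Tuple

# Hypothetical minimal cell model: (text, colspan).
Cell = Tuple[str, int]


def to_pipe_table(rows: List[List[Cell]]) -> str:
    """Flatten rows into a Markdown pipe table, padding merged cells with blanks."""
    lines: List[str] = []
    for index, row in enumerate(rows):
        cells: List[str] = []
        for text, colspan in row:
            # Pipe syntax has no span attribute, so a merged cell can only be
            # approximated by its text followed by empty filler cells.
            cells.append(text)
            cells.extend([""] * (colspan - 1))
        lines.append("| " + " | ".join(cells) + " |")
        if index == 0:
            # The separator row is required after the header row.
            lines.append("|" + " --- |" * len(cells))
    return "\n".join(lines)


table = [
    [("Region", 1), ("Sales", 2)],  # "Sales" spans two columns
    [("EMEA", 1), ("Q1", 1), ("Q2", 1)],
]
print(to_pipe_table(table))
```

The "Sales" header comes out as the text plus an empty filler cell, so a reader of the Markdown can no longer tell that it originally covered both the Q1 and Q2 columns.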
Solution
To address this issue, we have implemented a non-invasive override to output tables as HTML instead of Markdown. This approach preserves the structure and merged cells of the original table, ensuring that the converted Markdown accurately reflects the original DOCX document.
Key Changes
- CustomMarkdownify class: Extends the _CustomMarkdownify class and overrides the convert_table(), convert_td(), convert_tr(), and convert_th() methods to return raw HTML elements. It also wraps each table in <html><body> to ensure valid HTML5 output.
- CustomHtmlConverter & CustomDocxConverter: These classes propagate the modified table handling while leaving other conversions (e.g., text, headings) unchanged.
- CustomMarkitdown class: Swaps the default DocxConverter with CustomDocxConverter at runtime.
HTML Result Table Example
[Image: rendered HTML table produced by the conversion]
Code
import logging
from typing import Any, BinaryIO

from bs4 import BeautifulSoup
from markitdown import DocumentConverterResult, MarkItDown, StreamInfo
from markitdown._markitdown import PRIORITY_SPECIFIC_FILE_FORMAT, ConverterRegistration
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify

# Standard-library logger used in place of the project-local common.log module.
logger = logging.getLogger(__name__)


class CustomMarkdownify(_CustomMarkdownify):
    """Emit tables as raw HTML instead of Markdown pipe tables."""

    def convert_table(self, el, text, parent_tags):
        # Unwrap headings (h1-h6) inside the table so only their content remains.
        headers = [f"h{i}" for i in range(1, 7)]
        for h in headers:
            for h_element in el.find_all(h):
                h_element.unwrap()
        return f"<html><body>{el}</body></html>"

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    """Same as HtmlConverter, but routed through CustomMarkdownify."""

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Convert only the main content; fall back to the whole document
        body_elm = soup.find("body")
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)
        assert isinstance(webpage_text, str)

        # Remove leading and trailing whitespace, including newlines
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        # DOCX is converted to HTML first, so swap in the custom HTML converter.
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        # Replace the first registered DocxConverter with the custom one.
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(
                    converter=CustomDocxConverter(),
                    priority=PRIORITY_SPECIFIC_FILE_FORMAT,
                )
                logger.info(f"Replaced markitdown DOCX converter with custom converter: {CustomDocxConverter}")
                break


if __name__ == "__main__":
    markdown = CustomMarkitdown()
    md = markdown.convert("test.docx")
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)
Benefits
The proposed solution offers several benefits:
- Preserves rowspan/colspan and table structure for merged and complex tables
- No upstream breaks (override-based, doesn’t modify core logic)
- Works with renderers supporting HTML (GitHub, Typora, etc.)
Request
Consider merging this as an opt-in feature (e.g., via a table_format="html" flag) or as the default behavior for complex tables.
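One possible way to decide when a table counts as "complex" is sketched below, using only the standard library. The `is_complex_table` and `resolve_table_format` helpers, and the "auto" and "markdown" values, are assumptions made for illustration; only table_format="html" is proposed in this request, and none of this exists in markitdown today.

```python
from html.parser import HTMLParser


class _TableComplexityProbe(HTMLParser):
    """Flags merged cells and nested tables while scanning an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.is_complex = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                # A table opened inside another table is a nested structure.
                self.is_complex = True
        elif tag in ("td", "th"):
            for name, value in attrs:
                # A span other than 1 means the cell is merged.
                if name in ("rowspan", "colspan") and (value or "1").strip() not in ("", "1"):
                    self.is_complex = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def is_complex_table(table_html: str) -> bool:
    """Return True if the table has merged cells or contains a nested table."""
    probe = _TableComplexityProbe()
    probe.feed(table_html)
    probe.close()
    return probe.is_complex


def resolve_table_format(table_html: str, table_format: str = "auto") -> str:
    """Pick "html" or "markdown"; "auto" (hypothetical) uses HTML only when needed."""
    if table_format in ("html", "markdown"):
        return table_format
    return "html" if is_complex_table(table_html) else "markdown"
```

With a check like this, simple tables would keep their pipe-table output and only tables that Markdown cannot express would fall back to HTML.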
Why This Matters
Q: What is the problem with the current DOCX-to-Markdown conversion process?
A: The current process loses critical formatting for merged cells (rowspan/colspan), complex tables (nested structures, multi-level headers), and styling (borders, alignment) due to the limitations of Markdown's native table syntax.
Q: How does the proposed solution address this issue?
A: The solution implements a non-invasive override to output tables as HTML instead of Markdown, preserving the structure and merged cells of the original table.
Q: What are the key changes made to the conversion process?
A: The key changes include:
- Creating a CustomMarkdownify class that extends the _CustomMarkdownify class and overrides the convert_table(), convert_td(), convert_tr(), and convert_th() methods to return raw HTML elements.
- Propagating the modified table handling through the CustomHtmlConverter and CustomDocxConverter classes.
- Swapping the default DocxConverter with CustomDocxConverter at runtime through the CustomMarkitdown class.
Q: What are the benefits of this solution?
A: The proposed solution offers several benefits, including:
- Preserves rowspan/colspan and table structure for merged and complex tables
- No upstream breaks (override-based, doesn’t modify core logic)
- Works with renderers supporting HTML (GitHub, Typora, etc.)
Q: How can this solution be implemented?
A: The solution can be implemented by merging the CustomMarkdownify, CustomHtmlConverter, CustomDocxConverter, and CustomMarkitdown classes into the existing codebase.
Q: What are the potential use cases for this solution?
A: The proposed solution can be used in various scenarios, such as:
- Converting complex tables from DOCX to Markdown for use in Markdown viewers.
- Preserving merged cells and styling in Markdown documents.
- Enhancing the accuracy of Markdown conversions for users who require precise table representation.
Q: Are there any potential drawbacks or limitations to this solution?
A: While the proposed solution addresses the issue of complex tables and merged cells, it adds complexity to the conversion process. It also subclasses the private _CustomMarkdownify class and imports from the private _markitdown module, so it may need updating whenever those internals change.
Q: How can this solution be further improved or optimized?
A: To further improve or optimize the solution, consider the following:
- Refine the CustomMarkdownify class to handle edge cases and improve performance.
- Explore alternative solutions, such as using a dedicated table conversion library.
- Conduct thorough testing and validation to ensure the solution works as expected in various scenarios.
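As a starting point for such testing, the sketch below compares the merged-cell layout of a source table with the converted output. The `cell_spans` helper and the sample strings are hypothetical; in a real test the converted text would come from CustomMarkitdown rather than being built by hand.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _SpanCollector(HTMLParser):
    """Records (tag, rowspan, colspan) for every table cell encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.spans: List[Tuple[str, int, int]] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            attr_map = dict(attrs)
            # Missing or empty span attributes default to 1.
            rowspan = int(attr_map.get("rowspan") or 1)
            colspan = int(attr_map.get("colspan") or 1)
            self.spans.append((tag, rowspan, colspan))


def cell_spans(markup: str) -> List[Tuple[str, int, int]]:
    """Return the span signature of all table cells found in the markup."""
    collector = _SpanCollector()
    collector.feed(markup)
    collector.close()
    return collector.spans


# Hand-written stand-ins for a source table and the converter's output.
source = (
    '<table><tr><th colspan="2">Sales</th></tr>'
    '<tr><td rowspan="2">Q1</td><td>10</td></tr></table>'
)
converted = "# Report\n\n<html><body>" + source + "</body></html>\n\nTrailing text"

# The conversion preserved the layout if both signatures match.
assert cell_spans(converted) == cell_spans(source)
```

Because the comparison looks only at cell tags and span values, it tolerates surrounding Markdown text and whitespace changes while still catching a dropped rowspan or colspan.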