[Feature] Override Table Conversion To HTML For Merged Cells & Complex Tables


Problem


The current DOCX-to-Markdown conversion handles complex tables and merged cells poorly. Markdown's native pipe-table syntax (| --- |) cannot express these features, so the output is either broken or oversimplified. The formatting that gets lost includes:

  • Merged cells (rowspan/colspan)
  • Complex tables (nested structures, multi-level headers)
  • Styling (borders, alignment)
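To make the limitation concrete, here is a minimal sketch of how one might detect tables that a pipe table cannot represent. It uses only the standard library; the names MergedCellDetector and has_merged_cells are hypothetical helpers for illustration, not part of markitdown.

```python
from html.parser import HTMLParser


class MergedCellDetector(HTMLParser):
    """Counts <td>/<th> cells whose rowspan or colspan is greater than 1."""

    def __init__(self):
        super().__init__()
        self.merged_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            is_span = name in ("rowspan", "colspan")
            if is_span and (value or "").strip().isdigit() and int(value) > 1:
                self.merged_cells += 1
                break  # count each cell at most once


def has_merged_cells(html: str) -> bool:
    """Return True if any table cell in the markup spans rows or columns."""
    detector = MergedCellDetector()
    detector.feed(html)
    return detector.merged_cells > 0
```

A table for which has_merged_cells() returns True has no lossless pipe-table form, because pipe tables assume exactly one cell per row and column position.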

Solution


To address this issue, we have implemented a non-invasive override to output tables as HTML instead of Markdown. This approach preserves the structure and merged cells of the original table, ensuring that the converted Markdown accurately reflects the original DOCX document.

Key Changes

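  1. CustomMarkdownify Class: This class extends markitdown's private _CustomMarkdownify class and overrides the convert_table(), convert_td(), convert_tr(), and convert_th() methods to return the raw HTML elements unchanged. It also unwraps heading tags (h1 to h6) found inside a table, keeping their text, and wraps each table in <html><body> so that it is emitted as a standalone HTML fragment.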
  2. CustomHtmlConverter & CustomDocxConverter: These classes propagate the modified table handling while maintaining other conversions (e.g., text, headings).
  3. CustomMarkitdown Class: This class swaps the default DocxConverter with CustomDocxConverter at runtime.
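The override pattern behind these changes can be illustrated with a toy example. The sketch below assumes a markdownify-style design in which each tag is dispatched to a convert_<tag> method; ToyConverter and HtmlTableConverter are hypothetical stand-ins, not the real markitdown or markdownify classes.

```python
class ToyConverter:
    """Toy stand-in for a markdownify-style converter (not the real class)."""

    def convert(self, tag: str, text: str) -> str:
        # Look up convert_<tag>; fall back to returning the text unchanged.
        handler = getattr(self, f"convert_{tag}", None)
        return handler(text) if handler else text

    def convert_b(self, text: str) -> str:
        return f"**{text}**"

    def convert_td(self, text: str) -> str:
        return f" {text} |"


class HtmlTableConverter(ToyConverter):
    def convert_td(self, text: str) -> str:
        # Override only the table-cell handler; everything else is inherited.
        return f"<td>{text}</td>"
```

Because only the table handlers are replaced, other conversions such as bold text or headings keep their inherited behavior, which is what makes the approach non-invasive.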

HTML Result Table Example


[Image: example of the resulting HTML table]

Code


from typing import BinaryIO, Any

from bs4 import BeautifulSoup
from markitdown._markitdown import ConverterRegistration, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify
from markitdown import MarkItDown, DocumentConverterResult, StreamInfo

import logging

# Standard-library logger in place of the project-specific common.log import
logger = logging.getLogger(__name__)


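class CustomMarkdownify(_CustomMarkdownify):
    def convert_table(self, el, text, parent_tags):
        # Strip heading tags (h1-h6) inside the table but keep their text
        headers = [f"h{i}" for i in range(1, 7)]
        for h in headers:
            for h_element in el.find_all(h):
                h_element.unwrap()
        # Emit the table as raw HTML instead of a Markdown pipe table
        return f"<html><body>{el}</body></html>"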

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
            self,
            file_stream: BinaryIO,
            stream_info: StreamInfo,
            **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

        # remove leading and trailing \n
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(converter=CustomDocxConverter(),
                                                             priority=PRIORITY_SPECIFIC_FILE_FORMAT)
                logger.info(f"Replaced markitdown DOCX converter with custom converter: {CustomDocxConverter}")
                break


if __name__ == '__main__':
    markdown = CustomMarkitdown()
    md = markdown.convert('test.docx')
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)

Benefits


The proposed solution offers several benefits:

  • Merged cells (rowspan/colspan) and nested table structure survive the conversion
  • No upstream breaks (override-based, doesn’t modify core logic)
  • Works with renderers supporting HTML (GitHub, Typora, etc.)

Request


Consider merging this as an opt-in feature (e.g., via table_format="html" flag) or as the default behavior for complex tables.
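One possible shape for such a flag is sketched below. This is only an illustration under stated assumptions: the Cell model, render_table(), and the "auto" mode (fall back to HTML only when a table has merged cells) are hypothetical and do not exist in markitdown; only the table_format="html" value comes from the request above.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


def to_markdown(rows):
    """Render rows as a pipe table; span information is silently dropped."""
    header, *body = rows
    lines = ["| " + " | ".join(c.text for c in header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(c.text for c in row) + " |" for row in body]
    return "\n".join(lines)


def to_html(rows):
    """Render rows as an HTML table, keeping rowspan/colspan attributes."""
    out = ["<table>"]
    for row in rows:
        cells = []
        for c in row:
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            cells.append(f"<td{attrs}>{escape(c.text)}</td>")
        out.append("<tr>" + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


def render_table(rows, table_format="auto"):
    """Pick the output format; "auto" uses HTML only for merged cells."""
    if table_format not in ("markdown", "html", "auto"):
        raise ValueError(f"unknown table_format: {table_format!r}")
    merged = any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row)
    if table_format == "html" or (table_format == "auto" and merged):
        return to_html(rows)
    return to_markdown(rows)
```

With this design, simple tables stay readable as pipe tables and only tables that would otherwise lose structure are emitted as HTML.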

Why This Matters


Q: What is the problem with the current DOCX-to-Markdown conversion process?


A: The current process loses critical formatting for merged cells (rowspan/colspan), complex tables (nested structures, multi-level headers), and styling (borders, alignment) due to the limitations of Markdown's native table syntax.

Q: How does the proposed solution address this issue?


A: The solution implements a non-invasive override to output tables as HTML instead of Markdown, preserving the structure and merged cells of the original table.

Q: What are the key changes made to the conversion process?


A: The key changes include:

  • Creating a CustomMarkdownify class that extends the _CustomMarkdownify class and overrides the convert_table(), convert_td(), convert_tr(), and convert_th() methods to return raw HTML elements.
  • Propagating the modified table handling through CustomHtmlConverter and CustomDocxConverter classes.
  • Swapping the default DocxConverter with CustomDocxConverter at runtime through the CustomMarkitdown class.

Q: How can this solution be implemented?


A: The solution can be implemented by merging the CustomMarkdownify class, CustomHtmlConverter class, CustomDocxConverter class, and CustomMarkitdown class into the existing codebase.

Q: What are the potential use cases for this solution?


A: The proposed solution can be used in various scenarios, such as:

  • Converting complex tables from DOCX to Markdown for use in Markdown viewers.
  • Preserving merged cells and styling in Markdown documents.
  • Enhancing the accuracy of Markdown conversions for users who require precise table representation.

Q: Are there any potential drawbacks or limitations to this solution?


A: The override adds moving parts to the conversion path: three subclasses plus a runtime swap of the registered converter. It also depends on markitdown internals (the private _CustomMarkdownify class and the markitdown._markitdown module), which are not part of the public API and may change between releases.
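Since the code imports from underscore-prefixed modules, one defensive option is to guard those imports and fall back to the default behavior when they are missing. The sketch below is an assumption about how that could look; HTML_TABLES_AVAILABLE and table_mode() are hypothetical names, while the import paths are the ones used in the code above.

```python
try:
    # Private modules: names and locations may change between releases.
    from markitdown.converters._markdownify import _CustomMarkdownify
    from markitdown._markitdown import ConverterRegistration
    HTML_TABLES_AVAILABLE = True
except ImportError:
    # markitdown is missing or its internals have moved.
    _CustomMarkdownify = None
    ConverterRegistration = None
    HTML_TABLES_AVAILABLE = False


def table_mode() -> str:
    """Report which table output the caller can expect."""
    return "html" if HTML_TABLES_AVAILABLE else "markdown"
```

A caller could check table_mode() once at startup and only install the custom converter when it returns "html".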

Q: How can this solution be further improved or optimized?


A: To further improve or optimize the solution, consider the following:

  • Refine the CustomMarkdownify class to handle edge cases and improve performance.
  • Explore alternative solutions, such as using a dedicated table conversion library.
  • Conduct thorough testing and validation to ensure the solution works as expected in various scenarios.
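As a starting point for such testing, the following sketch checks that every rowspan/colspan in the intermediate HTML is still present in the converted output. It uses only the standard library; SpanCollector, collect_spans, and spans_preserved are hypothetical helper names.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Records (tag, rowspan, colspan) for every table cell, in order."""

    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            a = dict(attrs)
            # Missing or empty attributes default to a span of 1.
            self.spans.append(
                (tag, a.get("rowspan") or "1", a.get("colspan") or "1")
            )


def collect_spans(markup: str):
    collector = SpanCollector()
    collector.feed(markup)
    return collector.spans


def spans_preserved(source_html: str, converted_markdown: str) -> bool:
    """True if the converted text contains the same cells and spans."""
    return collect_spans(source_html) == collect_spans(converted_markdown)
```

In a real test, source_html would be the HTML produced from a DOCX fixture and converted_markdown the final output; a pipe-table result fails the check because it contains no td or th elements at all.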