Convert a docx document to HTML

I have a docx document already stored as a byte[] and I need to convert it to HTML so that I can display it on a page.

I'm using Visual Studio with .NET to develop it in C#.

This already works for PDF, which is easy to transform to HTML, but not for docx or any other Microsoft format, because I cannot use the native Office interop library: there is no guarantee that it is installed on the server.

The end result is:

strFinalDoc = strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");

Here documentInfoHtml is the result of converting the byte[] to HTML, and strFinalDoc is simply the content that replaces the body of a page.

I have found some solutions, but practically all of them use interop or paid libraries.

Do you know any way to do it with free software or open projects?

I also have to do the same for xls and xlsx files.

The current answer is very good, but it only covers doc files and not docx.

It is also important to preserve the existing styles as much as possible, so answers that simply extract the content for me to rebuild in HTML are not enough, because I would lose all the formatting.

 12
Author: Miquel Coll, 2016-06-30

3 answers

This is relatively easy to do with Apache POI, so we can use NPOI, its .NET port, to do the conversion in C#.

This is based on the answer to "Convert word to HTML with Apache POI".

Java version

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
        new FileInputStream("D:\\temp\\seo\\1.doc"));

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);

TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();

// decode with the same encoding the serializer wrote
String result = new String(out.toByteArray(), "UTF-8");
System.out.println(result);

Translated to C#:

HWPFDocumentCore wordDocument = WordToHtmlUtils.LoadDoc(@"D:\Hola.doc"); 

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
    new XmlDocument());

wordToHtmlConverter.ProcessDocument(wordDocument);

XmlDocument htmlDocument = wordToHtmlConverter.Document;

htmlDocument.Save(@"D:\Hola.html");
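Since the question needs the HTML as a string to splice into a page rather than a file on disk, the converted XmlDocument can be serialized in memory instead of saved. A minimal sketch using only System.Xml; the names HtmlEmbedding, ToHtmlString and InsertIntoBody are hypothetical, and the htmlDocument argument stands in for wordToHtmlConverter.Document:

```csharp
using System;
using System.Xml;

public static class HtmlEmbedding
{
    // Returns the markup of the converted document as a string
    // (the root element and everything below it, no XML declaration).
    public static string ToHtmlString(XmlDocument htmlDocument)
    {
        return htmlDocument.DocumentElement.OuterXml;
    }

    // Same replacement as in the question: the converted markup goes
    // right after the opening <body> tag of the page template.
    public static string InsertIntoBody(string strFinalDoc, string documentInfoHtml)
    {
        return strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");
    }
}
```

Usage would be along the lines of strFinalDoc = HtmlEmbedding.InsertIntoBody(strFinalDoc, HtmlEmbedding.ToHtmlString(htmlDocument));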

I recommend not installing NPOI from NuGet (current version 2.2.1) and instead using version 2.1.3.1 from the official page, because two additional files are needed that do not come with the NuGet package: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll. Both are compiled with .NET Framework 2.x, and they need the other libraries to be version 2.x too. These two files can be downloaded from the NPOI GitHub repository.


Testing suggests that the NPOI version has a bug in the resulting HTML. To reproduce the formatting, the converter creates CSS classes named with the first letter of the tag type plus an incremental number:

<!-- POI (Java) example -->
span.s1{color:red;}
...
<span class="s1">Hola</span>

But for some reason the .NET version does not generate them correctly; the class attribute is missing from the elements:

<!-- NPOI (C#) example -->
span.s1{color:red;}
...
<span>Hola</span>

Maybe it has to do with the Transformer, but I don't know what the equivalent would be in C#.


By keeping a manual count and assigning the classes yourself, you can get the output to look right:

    ....
    XmlNode node = htmlDocument.FirstChild.LastChild; // locate the body
    EditNode(node); // recursive editing method
    htmlDocument.Save(@"D:\tmp18\Hola.html");
}

// running count for each element name
Dictionary<string, int> cuenta = new Dictionary<string, int>();

private void EditNode(XmlNode node) {
    XmlElement xe = node as XmlElement;
    if (xe == null) return; // skip text nodes, comments, etc.

    int n;
    cuenta.TryGetValue(xe.LocalName, out n); // n stays 0 for a new name
    n += 1;
    cuenta[xe.LocalName] = n;

    // LocalName would be span or p, for example; the class is built as
    // first letter + counter ("s1", "p1", ...), assuming the generated
    // styles are numbered in document order starting at 1
    xe.SetAttribute("class", xe.LocalName.Substring(0, 1) + n);

    foreach (XmlNode x in node.ChildNodes) {
        EditNode(x);
    }
}
 10
Author: jasilva, 2017-05-23 12:39:20

Since a Word document is made up of XML, why not start from there and simply convert that XML to HTML? MSDN shows the XML structure of a Word document; here is one part of it (the core properties):

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
  <CoreProperties xmlns="http://schemas.microsoft.com/package/2005/06/md/core-properties"> 
   <Title>Word Document Sample</Title> 
   <Subject>Microsoft Office Word 2007</Subject> 
   <Creator>2007 Microsoft Office System User</Creator> 
   <Keywords/> 
   <Description>2007 Microsoft Office system .docx file</Description> 
   <LastModifiedBy>2007 Microsoft Office System User</LastModifiedBy> 
   <Revision>2</Revision> 
   <DateCreated>2005-05-05T20:01:00Z</DateCreated> 
   <DateModified>2005-05-05T20:02:00Z</DateModified> 
  </CoreProperties>

MSDN also gives an example of how to use the XmlDocument class:

 using System;
 using System.IO;
 using System.Xml;

 public class Sample
 {
   public static void Main()
   {
     //Create the XmlDocument.
     XmlDocument doc = new XmlDocument();
     doc.LoadXml("<?xml version='1.0' ?>" +
            "<book genre='novel' ISBN='1-861001-57-5'>" +
            "<title>Pride And Prejudice</title>" +
            "</book>");

     //Display the document element.
     Console.WriteLine(doc.DocumentElement.OuterXml);
  }
 }

Now, to access the nodes you can do it like this:

  public XmlNode GetBook(string uniqueAttribute, XmlDocument doc)
  {
      XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
      nsmgr.AddNamespace("bk", "http://www.contoso.com/books");
      string xPathString = "//bk:books/bk:book[@ISBN='" + uniqueAttribute + "']";
      XmlNode xmlNode = doc.DocumentElement.SelectSingleNode(xPathString, nsmgr);
      return xmlNode;
  }

That is the point where you build up your HTML by concatenation. The code samples above come from the MSDN documentation for the XmlDocument class.
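As a rough illustration of that idea applied to Word markup, the sketch below walks the paragraphs of a WordprocessingML document and concatenates HTML. It is a deliberately small subset, not a full converter: it only handles paragraphs (w:p), runs (w:r), text (w:t) and the bold flag (w:b), and it ignores styles, tables, images and lists, so most formatting would still be lost. The class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

public static class WordXmlToHtml
{
    // Namespace of the main WordprocessingML vocabulary used in .docx files.
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ParagraphsToHtml(XmlDocument doc)
    {
        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("w", W);

        var html = new StringBuilder();
        foreach (XmlNode p in doc.SelectNodes("//w:body/w:p", nsmgr))
        {
            html.Append("<p>");
            foreach (XmlNode r in p.SelectNodes("w:r", nsmgr))
            {
                // Simplification: concatenate every w:t in the run.
                var text = new StringBuilder();
                foreach (XmlNode t in r.SelectNodes("w:t", nsmgr))
                    text.Append(t.InnerText);
                string encoded = WebUtility.HtmlEncode(text.ToString());

                // Simplification: any w:b element counts as bold,
                // even though it can be switched off with w:val.
                bool bold = r.SelectSingleNode("w:rPr/w:b", nsmgr) != null;
                html.Append(bold ? "<b>" + encoded + "</b>" : encoded);
            }
            html.Append("</p>");
        }
        return html.ToString();
    }
}
```

Each additional feature (italics, headings, tables) would need its own mapping from the Word element to an HTML tag or inline style.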

 4
Author: Alfonso Carrasco, 2016-06-30 18:29:32

Conversion

As you have already realized, a docx is nothing more than zipped XML, and is therefore easily convertible to HTML.

Sending to the client
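To make the "zipped XML" point concrete, here is a minimal sketch that opens the byte[] as a ZIP archive and loads the main document part into an XmlDocument. It assumes .NET Framework 4.5 or later (for System.IO.Compression.ZipArchive) and assumes the main part is stored at the conventional path word/document.xml; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class DocxPackage
{
    // Opens the .docx bytes as a ZIP archive and parses the main part.
    public static XmlDocument LoadMainPart(byte[] docxBytes)
    {
        using (var ms = new MemoryStream(docxBytes))
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Read))
        {
            // Assumption: the body lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in the package.");

            var doc = new XmlDocument();
            using (Stream s = entry.Open())
            {
                doc.Load(s);
            }
            return doc;
        }
    }
}
```

The other parts of the package (styles, images, relationships) are separate entries in the same archive and would be read the same way.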

To send the result to the client (and make sure it is not displayed as plain text), remember to send the headers first:

Content-Type: text/html; charset=utf-8
Content-Length: 12345

Set the charset to whatever encoding you actually use, and the content length to the real size in bytes, not in characters (remember that a UTF-8 character can take more than one byte). The length lets the browser know how many bytes to expect, so it can show a progress bar when the document is long.

Compression
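The difference between characters and bytes can be checked directly. A small sketch, assuming the response is encoded as UTF-8; the class and method names are hypothetical:

```csharp
using System;
using System.Text;

public static class ContentLengthDemo
{
    // Number of bytes the string occupies once encoded as UTF-8,
    // which is the value that belongs in Content-Length.
    public static int Utf8ByteLength(string html)
    {
        return Encoding.UTF8.GetByteCount(html);
    }
}
```

For example, the string "<p>año</p>" has 10 characters but Utf8ByteLength returns 11, because ñ takes two bytes in UTF-8.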

Once it works uncompressed, you could consider using a middleware or module to send the response compressed (with gzip, for example).
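If no ready-made module is available, the compression itself can be done with the GZipStream class from the framework. A minimal sketch, assuming UTF-8 output; the class and method names are hypothetical, and the response would also need a Content-Encoding: gzip header and should only be compressed when the client announces gzip support in Accept-Encoding:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Encodes the HTML as UTF-8 and compresses it with gzip.
    public static byte[] Compress(string html)
    {
        byte[] raw = Encoding.UTF8.GetBytes(html);
        using (var output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final gzip block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Reverses Compress; useful for checking the round trip.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

When the body is compressed, Content-Length must be the length of the compressed byte array, not of the original HTML.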

 1
Author: Emilio Platzer, 2016-07-09 13:59:58