Convert a docx document to HTML
I have a docx document already stored as a byte[], and I need to convert it to HTML so that I can display it on a page.
I'm developing in C# on .NET with Visual Studio.
This already works for PDF, which is easy to transform to HTML, but not for docx or any other Microsoft format, because I cannot use the native Interop library: there is no guarantee that the server has it installed.
The end result is:
strFinalDoc = strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");
Here documentInfoHtml is the result of converting the byte[] to HTML, and strFinalDoc is simply the content that replaces the body of a page.
I have found some solutions, but practically all of them use Interop or paid libraries.
Do you know of any way to do this with free software or open-source projects?
I also have to do the same for xls and xlsx files.
The current answer is very good, but it only covers doc files, not docx.
It is also important to keep the existing CSS styles as much as possible, so answers that simply extract the content for me to rebuild in HTML are not enough, because I would lose all the formatting.
3 answers
This is relatively easy to do with Apache POI, so we can use NPOI, its .NET port, to do the conversion in C#.
Taking as a basis this answer to "Convert word to HTML with Apache POI":
Java version
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:\\temp\\seo\\1.doc"));
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
String result = new String(out.toByteArray());
System.out.println(result);
Translated to C#:
HWPFDocumentCore wordDocument = WordToHtmlUtils.LoadDoc(@"D:\Hola.doc");
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
new XmlDocument());
wordToHtmlConverter.ProcessDocument(wordDocument);
XmlDocument htmlDocument = wordToHtmlConverter.Document;
htmlDocument.Save(@"D:\Hola.html");
I recommend not installing NPOI from NuGet (current version 2.2.1). Instead, use version 2.1.3.1 from the official page, because two additional files are needed that do not come with the NuGet package: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll. Both are compiled against .NET Framework 2.x and need the other libraries to be version 2.x too. These two files can be downloaded from the NPOI GitHub repository.
In testing, the NPOI version seems to have a bug in the final HTML. To reproduce the formatting, the converter creates each style with the first letter of the tag type plus an incremental number:
<!-- POI (Java) example -->
span.s1{color:red;}
...
<span class="s1">Hola</span>
But for some reason the .NET version does not generate them correctly; the class attribute is missing from the elements:
<!-- NPOI (C#) example -->
span.s1{color:red;}
...
<span>Hola</span>
Maybe it has to do with the Transformer, but I don't know what the equivalent would be in C#.
By keeping the count manually, the Transformer may no longer be needed for the output to look right:
....
    XmlNode node = htmlDocument.FirstChild.LastChild; // find the body
    EditNode(node); // recursive editing method
    htmlDocument.Save(@"D:\tmp18\Hola.html");
}

// Keeps a running count for each element name
Dictionary<string, int> cuenta = new Dictionary<string, int>();

private void EditNode(XmlNode node) {
    // Text nodes, comments, etc. are not elements, so there is nothing to tag
    XmlElement xe = node as XmlElement;
    if (xe == null) return;

    // LocalName would be "span" or "p", for example
    int n;
    if (!cuenta.TryGetValue(xe.LocalName, out n)) n = 1; // assumption: numbering starts at 1
    // Class name is the first letter of the tag plus the counter, e.g. "s1"
    xe.SetAttribute("class", xe.LocalName.Substring(0, 1) + n);
    cuenta[xe.LocalName] = n + 1;

    foreach (XmlNode x in node.ChildNodes) {
        EditNode(x);
    }
}
Since a Word document is made up of XML, why not start from there? You would only need to convert that XML to HTML.
The MSDN page shows the XML structure used inside a Word document; here it is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CoreProperties xmlns="http://schemas.microsoft.com/package/2005/06/md/core-properties">
<Title>Word Document Sample</Title>
<Subject>Microsoft Office Word 2007</Subject>
<Creator>2007 Microsoft Office System User</Creator>
<Keywords/>
<Description>2007 Microsoft Office system .docx file</Description>
<LastModifiedBy>2007 Microsoft Office System User</LastModifiedBy>
<Revision>2</Revision>
<DateCreated>2005-05-05T20:01:00Z</DateCreated>
<DateModified>2005-05-05T20:02:00Z</DateModified>
</CoreProperties>
MSDN also gives an example of how to use the XmlDocument class; here it is:
using System;
using System.IO;
using System.Xml;

public class Sample
{
    public static void Main()
    {
        // Create the XmlDocument.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<?xml version='1.0' ?>" +
                    "<book genre='novel' ISBN='1-861001-57-5'>" +
                    "<title>Pride And Prejudice</title>" +
                    "</book>");
        // Display the document element.
        Console.WriteLine(doc.DocumentElement.OuterXml);
    }
}
Now, to access the nodes you can do it like this:
public XmlNode GetBook(string uniqueAttribute, XmlDocument doc)
{
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
    nsmgr.AddNamespace("bk", "http://www.contoso.com/books");
    string xPathString = "//bk:books/bk:book[@ISBN='" + uniqueAttribute + "']";
    XmlNode xmlNode = doc.DocumentElement.SelectSingleNode(xPathString, nsmgr);
    return xmlNode;
}
From there, you build up your HTML by concatenating the content of the nodes you read. The code samples above come from the MSDN documentation for the XmlDocument class.
Conversion
As you have already realized, a docx is nothing more than zipped XML, and is therefore easily convertible to HTML.
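As a rough illustration of the "zipped XML" point, the sketch below opens a docx held in a byte[] with System.IO.Compression and turns each paragraph of word/document.xml into a <p> element. The names DocxSketch and ToPlainHtml are made up for this example. It deliberately ignores styles, tables, images, and lists, so it does not meet the requirement of preserving the formatting; it only shows where the content lives.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public static class DocxSketch
{
    // WordprocessingML main namespace used by word/document.xml
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns one <p> per Word paragraph, text only.
    public static string ToPlainHtml(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            // The main document body is stored in this part of the package
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            XDocument xml;
            using (Stream part = entry.Open())
                xml = XDocument.Load(part);

            var html = new StringBuilder();
            foreach (XElement p in xml.Descendants(W + "p"))
            {
                // Concatenate the text runs (w:t) of the paragraph
                string text = string.Concat(p.Descendants(W + "t").Select(t => t.Value));
                html.Append("<p>").Append(WebUtility.HtmlEncode(text)).Append("</p>\n");
            }
            return html.ToString();
        }
    }
}
```

The returned string could then play the role of documentInfoHtml from the question. Keeping the formatting would additionally require reading word/styles.xml and the run and paragraph properties, which this sketch does not attempt.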
Sending to the client
To send the information to the client (and make sure the browser does not treat it as plain text), remember to send the headers first:
Content-Type: text/html; charset=utf-8
Content-Length: 12345
Set charset to the encoding you actually use, and Content-Length to the real size of the body in bytes, not characters (remember that a UTF-8 character can take more than one byte). The length tells the browser how many bytes to expect, so it can show a progress bar when the document is long.
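To make the bytes-versus-characters point concrete, here is a small sketch that derives the Content-Length value from the encoded bytes rather than from the string length. The class and method names are illustrative only, and how the headers are actually set depends on the web framework in use.

```csharp
using System.Text;

public static class ResponseSketch
{
    // Encodes the body as UTF-8; the Length of the returned array
    // is the value to send in Content-Length.
    public static byte[] EncodeBody(string html)
    {
        return Encoding.UTF8.GetBytes(html);
    }
}
```

For example, "<p>ñ</p>" is 8 characters long but encodes to 9 bytes in UTF-8, so Content-Length would be 9.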
Compression
Once it works uncompressed, you could consider using middleware or a server module to send the response compressed (gzip, for example).
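If no middleware or server module is available, something similar can be done by hand with GZipStream from System.IO.Compression. This is only a sketch with hypothetical names. A response compressed this way should only be sent to clients that announce gzip in Accept-Encoding, it needs the Content-Encoding: gzip header, and Content-Length must then be the compressed size.

```csharp
using System.IO;
using System.IO.Compression;

public static class GzipSketch
{
    // Returns the gzip-compressed form of an already encoded response body.
    public static byte[] Compress(byte[] body)
    {
        using (var output = new MemoryStream())
        {
            // The GZipStream must be disposed before the buffer is read,
            // otherwise the gzip trailer is not written.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(body, 0, body.Length);
            }
            return output.ToArray();
        }
    }
}
```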