Runar Ovesen Hjerpbakk

Software Philosopher

Converting an HTML table to Markdown using dotnet script

Working with software integrations can be interesting, fun and mind-bogglingly frustrating. Today I needed to parse a table in a website (first mistake), clean the text using regular expressions (second mistake?) and generate a corresponding table in Markdown using dotnet script (definitely not a mistake).

Parse a website using dotnet script

Parsing a website using dotnet script is as easy as parsing HTML can be, using the excellent HtmlAgilityPack library.

#! "netcoreapp2.1"
#r "nuget: HtmlAgilityPack, 1.8.4"

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("website.html");

var table = doc.DocumentNode
	.SelectNodes("//table")
	.Single(t => t.Id == "wanted_id");
// MyRow is my own type holding the cells of a single row
var rows = table.Descendants("tr")
	.Skip(1) // Skip headers in my case
	.Select(row => new MyRow(row.SelectNodes("th|td").ToArray()));

I’ve downloaded the website ahead of time, and doc.Load with a file as input just parses the content. Then I search for nodes of the type table and choose the one with the Id I’m interested in. Next, I go through every row, except the row containing headers, and create a custom type MyRow with the content of each row.
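The MyRow type itself is not shown here. As a minimal sketch of what such a type could look like, assuming a table with at least two cells per row (the property names Name and Description are hypothetical and only there for illustration):

```csharp
#r "nuget: HtmlAgilityPack, 1.8.4"

using HtmlAgilityPack;

// Hypothetical row type: assumes every row has at least two cells.
// The property names are made up for this example.
public class MyRow
{
    public MyRow(HtmlNode[] cells)
    {
        // InnerText drops the tags, DeEntitize turns entities such as &amp; back into characters
        Name = HtmlEntity.DeEntitize(cells[0].InnerText);
        Description = HtmlEntity.DeEntitize(cells[1].InnerText);
    }

    public string Name { get; }
    public string Description { get; }
}
```

Keeping the cell handling inside the constructor means the query that builds the rows stays a one-liner, and any cleanup of the cell text has a single obvious home.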

I’d never thought parsing HTML could be so easy!

Clean text using regex

In my particular case, the content of the table cells was an amalgamation of newlines (\n) and hundreds of spaces. Thank you so very much Sharepoint. These were wisely ignored by the browser, but I needed to strip them all except for one space between words.

using System.Text.RegularExpressions;
Regex.Replace("Text to clean", @"\s+", " ").Trim();

As explained by regex101.com, \s+ matches any whitespace character (equal to [\r\n\t\f\v ]) between one and unlimited times, as many times as possible, giving back as needed. Thus, the Replace will leave one space between words, and a leading and trailing space if the text started or ended with whitespace. Trim will then take care of those.
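To see the two steps working together, here is a small sketch that wraps them in a helper. The name Clean and the sample text are my own inventions for this example:

```csharp
using System;
using System.Text.RegularExpressions;

// Collapse every run of whitespace into a single space, then trim both ends
string Clean(string text) => Regex.Replace(text, @"\s+", " ").Trim();

var cleaned = Clean("  Some\n      text   from \t SharePoint \n");
Console.WriteLine(cleaned); // prints "Some text from SharePoint"
```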

Markdown tables

I never remember the HTML table syntax, but the Markdown equivalent is simple to remember:

| First Header  | Second Header |
| ------------- | ------------- |
| Content Cell  | Content Cell  |
| Content Cell  | Content Cell  |

They are a little harder to read when tables get large, but nevertheless an improvement on Sharepoint tables!
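Generating this syntax from the parsed rows takes only a few lines. The following is a rough sketch rather than the exact script I used. ToMarkdownTable is a hypothetical helper, and it assumes the cell text has already been cleaned of newlines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: builds a Markdown table from a header row and data rows
string ToMarkdownTable(IEnumerable<string> headers, IEnumerable<IEnumerable<string>> rows)
{
    // A literal | inside a cell would end the cell early, so escape it
    string Line(IEnumerable<string> cells) =>
        "| " + string.Join(" | ", cells.Select(c => c.Replace("|", "\\|"))) + " |";

    var headerCells = headers.ToList();
    var builder = new StringBuilder();
    builder.AppendLine(Line(headerCells));
    // The separator row needs one cell of dashes per column
    builder.AppendLine(Line(headerCells.Select(_ => "---")));
    foreach (var row in rows)
    {
        builder.AppendLine(Line(row));
    }
    return builder.ToString();
}

var result = ToMarkdownTable(
    new[] { "First Header", "Second Header" },
    new[]
    {
        new[] { "Content Cell", "Content Cell" },
        new[] { "Content Cell", "Content Cell" },
    });
Console.Write(result);
```

The sketch does not pad the columns to equal width. Markdown renderers do not need the padding, it only makes the raw text nicer to look at.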

Sharepoint HTML is only machine readable