Converting an HTML table to Markdown using dotnet script
Working with software integrations can be interesting, fun and mind-bogglingly frustrating, all at once. Today I needed to parse a table in a website (first mistake), clean the text using regular expressions (second mistake?) and generate a corresponding table in Markdown using dotnet script (definitely not a mistake).
Parse a website using dotnet script
Parsing a website using dotnet script is as easy as parsing HTML can be, using the excellent HtmlAgilityPack library.
#! "netcoreapp2.1"
#r "nuget: HtmlAgilityPack, 1.8.4"

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("website.html"); // Parse the previously downloaded page

// Pick the one table with the wanted id
var table = doc.DocumentNode
    .SelectNodes("//table")
    .Single(t => t.Id == "wanted_id");

// MyRow is my own type (not shown) wrapping the cells of one row
var rows = table.Descendants("tr")
    .Skip(1) // Skip headers in my case
    .Select(row => new MyRow(row.SelectNodes("th|td").ToArray()));
I’ve downloaded the website ahead of time, and doc.Load with a file as input just parses the content. Then I go searching for nodes of the type table and choose the one with the Id I’m interested in. Next, I go through every row, except the row containing headers, and create a custom type MyRow with the content of each row.
I’d never thought parsing HTML could be so easy!
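MyRow itself isn't shown above. As a minimal sketch of what such a type could look like, assuming each row only needs the decoded text of its cells: the Cells property and the use of HtmlEntity.DeEntitize are my assumptions for illustration, and the tiny inline table is made-up sample data.

```csharp
#r "nuget: HtmlAgilityPack, 1.8.4"

using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical stand-in for MyRow: keeps the text of each cell in order.
public class MyRow
{
    public string[] Cells { get; }

    public MyRow(HtmlNode[] cells)
    {
        // InnerText drops the markup; DeEntitize turns "&amp;" back into "&".
        Cells = cells
            .Select(cell => HtmlEntity.DeEntitize(cell.InnerText))
            .ToArray();
    }
}

// Made-up sample input, parsed from a string instead of a file.
var doc = new HtmlDocument();
doc.LoadHtml("<table><tr><td>A &amp; B</td><td>C</td></tr></table>");

var row = new MyRow(doc.DocumentNode.SelectNodes("//td").ToArray());
Console.WriteLine(string.Join(", ", row.Cells));
```

Keeping the cells as plain strings means the later steps don't need to know anything about HtmlAgilityPack.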
Clean text using regex
In my particular case, the content of the table cells was an amalgamation of newlines (\n) and hundreds of spaces. Thank you so very much, SharePoint. These were wisely ignored by the browser, but I needed to strip them all except for one space between words.
Regex.Replace("Text to clean", @"\s+", " ").Trim();
As explained by regex101.com, \s+ matches "any whitespace character (equivalent to [\r\n\t\f\v ])" "between one and unlimited times, as many times as possible, giving back as needed". Thus, Replace leaves a single space between words, plus one leading and one trailing space if the text started or ended with whitespace. Trim then takes care of those.
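To make the two steps concrete, here is a small sketch that wraps the call in a helper and runs it on a made-up cell value. The name CleanCell and the sample text are hypothetical, not taken from the real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: collapse every run of whitespace into one space,
// then remove the single space that may remain at either end.
string CleanCell(string raw) => Regex.Replace(raw, @"\s+", " ").Trim();

// Made-up sample resembling the exported cell content.
var messy = "\n      Content \n\n        Cell    \n";
Console.WriteLine(CleanCell(messy)); // prints "Content Cell"
```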
I never remember the HTML table syntax, but the Markdown equivalent is simple to remember:
| First Header  | Second Header |
| ------------- | ------------- |
| Content Cell  | Content Cell  |
| Content Cell  | Content Cell  |
They are a little harder to read when tables get large, but nevertheless an improvement on SharePoint tables!
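Generating such a table from the cleaned cells is mostly string joining. A minimal sketch, assuming headers and rows are already plain string arrays: ToMarkdownRow and ToMarkdownTable are hypothetical names, and escaping | inside cells is an extra precaution I'm assuming is wanted, not something the original data required.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: one table line, with literal pipes in cells escaped.
string ToMarkdownRow(IEnumerable<string> cells) =>
    "| " + string.Join(" | ", cells.Select(c => c.Replace("|", "\\|"))) + " |";

// Hypothetical helper: header line, separator line, then one line per row.
string ToMarkdownTable(string[] headers, IEnumerable<string[]> rows)
{
    var lines = new List<string>
    {
        ToMarkdownRow(headers),
        ToMarkdownRow(headers.Select(_ => "---"))
    };
    lines.AddRange(rows.Select(r => ToMarkdownRow(r)));
    return string.Join("\n", lines);
}

var markdown = ToMarkdownTable(
    new[] { "First Header", "Second Header" },
    new[]
    {
        new[] { "Content Cell", "Content Cell" },
        new[] { "Content Cell", "Content Cell" }
    });
Console.WriteLine(markdown);
```

The separator cells are kept at the minimal three dashes; padding the columns to equal width only helps when reading the raw text.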