Cleaning up markdown generated from pandoc with Python

I’ve been playing around with Pelican lately, using it to build my new Nimble Autonomy, LLC site (more on that soon).

So far, I like Pelican as a static site generator. It seems to strike a reasonable balance between generality and power. I previously used Hugo to build the Unit Circle Rekkids site and found it reasonably decent, but not life-changing. That site’s content doesn’t change often, so since building it I have only had to make an occasional tweak. This new site will change more often.

To get some content on the new site, I wanted to republish some posts from this blog. Using the instructions for WordPress Export and Pelican-import, I was able to generate some markdown from my WordPress posts, but it was a bit underwhelming.

There was a lot of this kind of gunk in the markdown:

`<!-- wp:paragraph -->`{=html}

While I am an experienced video-conferencer and a reasonably experienced presenter, presenting to a remote audience is still something I am learning how to do. Having just given a talk this morning, I did want to share some things that are working well for me at the moment.

`<!-- /wp:paragraph -->`{=html}

`<!-- wp:heading -->`{=html}

The Tools
---------

`<!-- /wp:heading -->`{=html}

`<!-- wp:paragraph -->`{=html}

Only a few of my images were even referenced. I quickly realized that if I was going to try to move more than a handful of articles over, I was going to be spending a lot of time hand-editing the generated markdown.

This was an obvious problem that automation could fix. As I was using a Python-based static site generator, I decided to use Python to do my cleanup. I’m sharing the code below as it may help others who are trying to solve the same problem. At some point, I might try to create a pull request for it with Pelican, but right now I am just trying to move forward on other things.

It isn’t the best or cleanest Python I’ve written; it was done quickly, with a lot of iteration to catch all the corner cases. It could also be more Pythonic, and it is very opinionated about the Markdown that it creates.

At some point, I may clean it up, but really I’m supplying it here because I have to believe that other people have hit the same problem and I want to save those folks some time.
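For a quick taste of the approach, here is a minimal sketch that is separate from the full script. It handles only one piece of the gunk: the lines that hold nothing but a WordPress block comment wrapped as a pandoc raw-HTML span. The function name `strip_wp_comments` and the regular expression are illustrative assumptions, and it does nothing about the missing image references.

```python
import re

# Matches a line that contains only a WordPress block comment wrapped by
# pandoc as a raw-HTML span, e.g. `<!-- wp:paragraph -->`{=html}
# (assumed pattern, based on the sample output shown earlier).
WP_COMMENT = re.compile(r"^\s*`?<!--\s*/?wp:[^>]*-->`?\{=html\}\s*$")


def strip_wp_comments(markdown: str) -> str:
    """Drop WordPress block-comment lines and collapse leftover blank lines."""
    kept = [line for line in markdown.splitlines() if not WP_COMMENT.match(line)]
    text = "\n".join(kept)
    # Removing the comment lines leaves runs of blank lines behind;
    # squeeze any run of three or more newlines down to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


if __name__ == "__main__":
    sample = (
        "`<!-- wp:heading -->`{=html}\n\n"
        "The Tools\n---------\n\n"
        "`<!-- /wp:heading -->`{=html}\n"
    )
    print(strip_wp_comments(sample))
```

Note that the blank-line squeeze is applied to the whole file, so it would also touch blank lines inside code blocks; a more careful version would skip those.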

Feel free to fork and improve!