Migrating a blog (yes, this one!) from Wordpress to Hugo
February 13, 2015
I had been meaning to move this blog from Wordpress to a static site generator for a while.
Why? Mostly, I like having control and access to my blog posts in a standard raw form; Wordpress (and most other CMSes) store posts in a database, but a static site generator lets you use a bunch of Markdown files.
Also, I wanted an easier way to write posts offline. With a static site generator, I can just save everything locally and publish it all together, very easily, once I’m ready. I can also very reliably see a local copy of exactly what my site will look like, before I publish.
Hugo caught my eye. It’s written in Go, which I’ve been hearing a lot about, and it’s new, and therefore really, really simple. Installation is incredibly easy. It’s probably one of the easiest things I’ve ever installed on a computer, and it’s certainly the easiest-to-configure local environment I’ve ever run into, even for someone like me who doesn’t spend all day in the terminal. Lastly, it was written by my friend Steve Francia, who also leads the project.
Since the migration was of moderate difficulty, this post documents the problems I ran into, and how I solved them. You can also check out my blog repo, which shows some of the steps I took.
I don’t cover installation here, since the instructions are accurate and worked well for me.
Getting data out of Wordpress
The first step was to export all of my posts. This thread gives some ideas, specifically the Wordpress to Hugo exporter. I couldn’t get this to work.
The Wordpress plugin did indeed time out, and the command line version kept generating an empty .zip. (I may not have followed the instructions correctly.)
Fortunately, there is another tool, called ExitWP, that worked well. Really well. Thank you, Thomas Frössman.
I think this is because it takes advantage of the native Wordpress export, and is a script that then runs locally on those files, rather than depending on the server configuration.
This got me to a starting point, an archive of exported files with usable front matter indicating the date, slug, tags, and other metadata.
Cleaning up ExitWP output
However, the output from ExitWP had a few problems that prevented it from being publishable:
1) Images were still linked to their Wordpress URLs, e.g. http://www.justindunham.net/wp-content/uploads... Obviously these would all break once I removed Wordpress from my server, so I had to rewrite all these URLs. Not a huge deal; I used find/replace in Sublime Text to fix these. A more sophisticated user could use the shell.
2) Wordpress uses shortcodes. Helpful when you are publishing a post, extremely unhelpful when you are trying to move your writing to a standardized format. Here is an example:
[caption id="attachment_841" align="alignleft" width="300" caption="5: Coq au vin!"][![](/images/uploads/2010/07/IMG_24621-300x225.jpg)](/?attachment_id=841)[/caption]
Every single one of these had to be turned into standard HTML. This was by far the most time-consuming part of the migration. I took an approach that had 2 parts. First, I ran all of these snippets (hundreds of them) through a Markdown to HTML converter. This caused them all to have one of two consistent structures, depending on whether or not Wordpress put the caption itself within the [ caption ] tag.
That then allowed me to write Javascript that takes these shortcodes, which are still present in the Markdown, and converts them into actual proper HTML elements. It is a hack to be sure. Again, a more sophisticated user could probably fix this with sed.
3) The Markdown that ExitWP generates can sometimes be a little messy or incorrect. For example, the underscore (_) will sometimes have a space inserted before or after it, which causes problems for emphasized text. Or lots of extra line breaks. I fixed these where I saw them.
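As a rough illustration of how steps 1 and 2 could be scripted, here is a minimal Python sketch. It is not the JavaScript used for this migration. The URL prefixes, the function names, and the `<figure>` markup it emits are all assumptions, and it expects shortcode attributes in straight quotes. It handles both caption placements: as a `caption` attribute, or as text after the image inside the shortcode.

```python
import html
import re

# Assumed prefixes: old Wordpress upload URLs and the new static image path.
OLD_PREFIX = "http://www.justindunham.net/wp-content/uploads/"
NEW_PREFIX = "/images/uploads/"

CAPTION_RE = re.compile(r"\[caption\b([^\]]*)\](.*?)\[/caption\]", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# An image, optionally wrapped in a Markdown link: [![alt](src)](href)
LINKED_IMAGE_RE = re.compile(r"\[?!\[[^\]]*\]\([^)]*\)(?:\]\([^)]*\))?")


def rewrite_upload_urls(text):
    """Step 1: point old Wordpress upload URLs at the new image path."""
    return text.replace(OLD_PREFIX, NEW_PREFIX)


def caption_to_html(match):
    """Step 2: turn one [caption]...[/caption] shortcode into a <figure>."""
    attrs = dict(ATTR_RE.findall(match.group(1)))
    body = match.group(2)
    image = IMAGE_RE.search(body)
    if image is None:
        return match.group(0)  # unrecognised shape: leave it for manual cleanup
    alt, src = image.group(1), image.group(2)
    # The caption is either an attribute or the text left over after the image.
    caption = attrs.get("caption") or LINKED_IMAGE_RE.sub("", body).strip()
    img = '<img src="%s" alt="%s"' % (html.escape(src), html.escape(alt or caption))
    if attrs.get("width"):
        img += ' width="%s"' % html.escape(attrs["width"])
    img += ">"
    return '<figure class="%s">%s<figcaption>%s</figcaption></figure>' % (
        html.escape(attrs.get("align", "")),
        img,
        html.escape(caption),
    )


def clean_post(text):
    """Apply both steps to the full text of one exported post."""
    return CAPTION_RE.sub(caption_to_html, rewrite_upload_urls(text))
```

Running `clean_post` over each exported Markdown file and writing the result back would cover the bulk of the cleanup. Anything the patterns don't recognise is left untouched so it can be fixed by hand.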
Theming
Once I had all that cleaned up, it was time to start on the theme. Aaaah. I love theming in Hugo! The /indexes directory is a little confusing, but otherwise, it is really nice.
I picked the theme I liked most, Simple A, and ripped out huge pieces, rearranged things, and changed styling until I got something nice and neat. I consolidated all of the various /chrome files into three partials, header.html, footer.html, and disqus.html (since you want disqus on some pages but not others).
I also reduced the amount of metadata that’s displayed, and the number of posts on the front page. Hugo makes these changes very, very easy.
Archives and Pagination
Hugo is very young, and one feature it lacks is pagination. A recent pull request seems to have added it, but I didn’t feel comfortable trying that until it is officially part of Hugo.
That means that you can’t have, for example, a front page that gives links to the previous ten posts, and then that page gives links to the next ten posts, and so on. You can have archives by month, but only by actually setting this up as a taxonomy item, which is not an ideal solution.
Nobody clicks through to the monthly archives on my site, except for me, so that wasn’t an issue. However, I did want to make sure all the posts were linked somewhere, so that (a) an interested visitor could easily see them, and mostly (b) so Google could find them.
I followed the suggestion of someone on the forums, and solved this problem by creating a taxonomy called “archives” and giving every post the term “blog” in its front matter:
archives:
- blog
I then added an archive.html file to the indexes directory. Lastly, I added a link to the template, on all pages, to /archives/blog. This actually worked really well, and while it’s a simpler solution than many might want, a big part of my reason for migrating off Wordpress is simplicity, so this was fine.
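Adding the archives entry to hundreds of posts by hand is tedious, so here is a minimal Python sketch of how it could be automated. It assumes each post starts with YAML front matter delimited by `---` lines. The function name is hypothetical and this is not necessarily how it was done for this blog.

```python
def add_archives_term(post_text, term="blog"):
    """Insert an `archives` entry at the end of a post's YAML front matter.

    Posts without front matter, or that already have an archives entry,
    are returned unchanged, so running this twice is safe.
    """
    lines = post_text.split("\n")
    if not lines or lines[0].strip() != "---":
        return post_text  # no front matter block at the top of the file
    # Find the line that closes the front matter block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return post_text
    if any(line.startswith("archives:") for line in lines[1:end]):
        return post_text
    lines[end:end] = ["archives:", "- " + term]
    return "\n".join(lines)
```

Applied to every Markdown file in the content directory, this gives each post the same term, which is what makes them all show up at /archives/blog.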
Redirects
After all this, I had a local copy of my site that was working pretty well. Mostly good-looking posts, properly placed and captioned images, all Javascript in place, templates, archives, etc.
However, one other thing Hugo doesn’t do is automatically generate URLs in the form http://justindunham.net/2015/02/post. Instead, all posts have URLs that look like http://justindunham.net/post.
Update: This is wrong. Actually, Hugo can do this, I just didn’t understand how. More detail at http://gohugo.io/extras/permalinks/. I’m leaving this section in anyway, as it is an alternative solution to this problem if you don’t feel like changing the default permalink format.
The ExitWP script did preserve all the slugs in the post metadata, which was incredibly useful, but there’s no way to get Hugo to parse the date information into the URLs, at least not automatically. I suppose I could have re-parented all the Markdown files into the correct directories, but this again makes the system much more complex than I wanted.
“Well”, I thought to myself, “I’ll just change the URLs of the posts in Wordpress to the format that Hugo will use, wait a little while for Google to reindex, and it will be fine!”
Unfortunately, Wordpress isn’t smart enough to put in redirects from your old posts when their URLs change. So you end up breaking all Google search results, and all internal links, anyway. There are a few plugins out there that claim to fix this, but none of them seemed to work for me (or at least, I wasn’t confident enough that any would, to take the risk of trying them.)
The solution was in .htaccess. I needed something that would basically remove things like 2015/02/ from the request URL, and forward to the URL without that extra piece. So I put this in, and it seems to be working well:
Options -Indexes
RewriteEngine on
RewriteBase /
RewriteRule ^[0-9]{4}/[0-9]{2}/(.*)$ /$1 [L,R=302,QSA]
If you’re not deeply familiar with .htaccess, basically what this says is, “for a string that has 4 digits, then a slash, then 2 digits, then a bunch of other stuff, just keep the other stuff.”
I use a 302 (temporary) redirect for now, but eventually I’ll make this a 301.
You can also add this to forward Wordpress /tag/ URLs to /tags/:
RewriteRule ^tag/(.*)$ tags/$1
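To sanity-check patterns like these before touching the server, the same regular expressions can be tried out locally. The Python sketch below mirrors the two rules. The function name is hypothetical, and it assumes, as is the case for rules in .htaccess, that the path is matched without its leading slash.

```python
import re

# Same patterns as the two RewriteRule lines above.
DATE_PREFIX_RE = re.compile(r"^[0-9]{4}/[0-9]{2}/(.*)$")
TAG_RE = re.compile(r"^tag/(.*)$")


def redirect_target(path):
    """Return where a request path would be sent, or None if no rule matches.

    `path` is given without a leading slash, e.g. "2015/02/post".
    """
    match = DATE_PREFIX_RE.match(path)
    if match:
        return "/" + match.group(1)  # drop the yyyy/mm/ prefix
    match = TAG_RE.match(path)
    if match:
        return "/tags/" + match.group(1)  # /tag/... becomes /tags/...
    return None


for old in ["2015/02/post", "tag/hugo", "post", "2015/2/post"]:
    print(old, "->", redirect_target(old))
```

A path like 2015/2/post is left alone because the month part must be exactly two digits, which matches how Wordpress wrote the original URLs.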
Deploying
There are a few ways to deploy your Hugo site.
A theme of this post is that, while I’m pretty comfortable dealing with .htaccess files and Git and other stuff like that, I’m not very sophisticated. So the idea of writing a Git post-commit hook, though I’m sure I could figure it out, was not exciting. And it wouldn’t have been quick. Ditto for figuring out how to deploy to GitHub Pages.
I also wanted something that would work across a variety of host servers, since configuring the host was something I wanted to avoid. But you can always count on rsync! Here’s what I ended up with:
rsync -avzP --delete --exclude '.htaccess' /local/blog/directory user@host:remote/directory/
-a syncs files recursively and preserves file metadata, -v gives detailed output, -z compresses, and -P enables progress monitoring and partial file transfers. --delete deletes remote files that don’t exist locally, but I have to exclude .htaccess since I need it on the server but it doesn’t exist locally.
One issue is that permissions on new local images aren’t set to be world-readable when they’re transferred. This needs to be fixed since it prevents images from showing up. I think there’s an option for rsync that will do this.
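The rsync option in question is most likely --chmod, which changes permissions during the transfer. Another route, since -a preserves permissions, is to make the local files world-readable before syncing. The Python sketch below does that on a POSIX system. The function name and the choice of permission bits are assumptions, not something tested against this particular host.

```python
import os
import stat

# Read permission for everyone on files; read plus traverse on directories.
FILE_BITS = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
DIR_BITS = FILE_BITS | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def make_world_readable(root):
    """Add the missing read bits under `root` and return the paths changed.

    Existing permission bits are kept; bits are only ever added.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        entries = [(dirpath, DIR_BITS)]
        entries += [(os.path.join(dirpath, name), FILE_BITS) for name in filenames]
        for path, bits in entries:
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode | bits != mode:
                os.chmod(path, mode | bits)
                changed.append(path)
    return changed
```

Calling `make_world_readable` on the generated site directory just before running rsync means a new image saved as 0600 goes up as 0644.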
Another issue is that, depending on how your site is configured, all of your canonical URLs may end up containing localhost, since you’re not regenerating the site on your remote server. I couldn’t get Hugo to consistently pay attention to baseUrl while building pages, so I modified my header.html as follows:
{{if .IsPage}}
<link rel="canonical" href="http://justindunham.net{{ .RelPermalink }}">
{{end}}
The {{if .IsPage}} is necessary to prevent Hugo throwing a lot of errors when this header shows up on nodes, which are distinct from pages, and which do not have a .RelPermalink attribute.
Update: Hugo wasn’t paying attention to baseUrl because I was running it exclusively in live-rebuild (watch) mode. Running it so that it does one-time generation fixes this problem. I submitted a PR to update the docs.
Conclusion
Hugo’s working really well so far.
The only major quirk remaining is that the Markdown engine seems to be very aggressive about typographic substitutions. For example, it leaves (a), (b), (d) and (e) alone, but turns (c) into © (the copyright symbol), and it renders 2015/02 as 2015⁄02, with a fraction slash.
Some new features would definitely be useful, but it’s really enjoyable to have a very clean, simple static site generator that parses highly-accessible Markdown files instead of database records, and to have complete and very simple control over the way my blog looks and works.