Making static copies of WordPress sites
As I've mentioned in other posts on this blog, I was previously hosting a number of blogs running on self-hosted WordPress.
I wanted to archive the sites so that I could refer to them in future, and I wanted to retain the look of the sites, so just exporting the posts to .md files and adding them to this site with 11ty was not an option.
I looked at various WordPress plug-ins which promised to export a whole site to static HTML files, but I didn't have much luck. You might get on better, so they are worth trying:
- WP2Static - This wouldn't install with the version of PHP I was running, and I didn't want to update PHP in case it broke any of the other blogs I was running and caused more problems.
- Really Static - This was last updated 5 years ago and had a lot of negative reviews.
- Simply Static - This installed and seemed to run ok, but I couldn't find all of the posts in the output.
In the end, I decided to just spider the site with HTTrack - I ended up using the Windows version.
The settings I chose were just the defaults plus the following:
- Action: Download Website(s)
- Options > Spider > Spider: no robots.txt rules
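For reference, the Linux/macOS builds of HTTrack expose the same spider setting on the command line via the `-sN` flag (`-s0` means "never follow robots.txt rules"). The URL and output directory below are placeholders, not the ones I used:

```shell
# Mirror the site into ./mirror, ignoring robots.txt - the CLI
# equivalent of the 'no robots.txt rules' option in the Windows GUI.
httrack "https://example.com/" -O ./mirror -s0
```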
After completing the download, I browsed the local copy of the site and everything looked fine - until I right-clicked on an image and chose 'View Image'. The image that was displayed was the one hosted on my server, not the local copy.
On investigation, it seems that HTTrack does not handle `srcset` attributes inside `<img>` tags.
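Before fixing anything, it's worth seeing how widespread the problem is. As a rough check (my own sketch, not from the original post - and the regex only spots `srcset` written inside an `<img>` tag, so it's a quick estimate rather than a full HTML parse), you can count the affected tags across the mirror:

```python
import re
from pathlib import Path

# Rough pattern: an <img> tag that still carries a srcset attribute.
SRCSET_RE = re.compile(r'<img\b[^>]*\bsrcset\s*=', re.IGNORECASE)

def count_srcset(root):
    """Count <img> tags with a srcset attribute in all .html files under root."""
    return sum(len(SRCSET_RE.findall(p.read_text(encoding="utf-8")))
               for p in Path(root).rglob("*.html"))
```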
`srcset` is often used to allow alternative sizes of images to be displayed in responsive WordPress themes. This didn't really affect my local copy, but a quick fix is to strip all the `srcset` attributes from the HTML files. Happily, 'Mitch B' had written a Python script which does exactly this. The script can be found on the HTTrack Forum; I've included it here in case I need it again:
```python
# Given a PATH, go through all the HTML files and delete the srcset
# attributes from <img> tags. Files will be overwritten.
# Script from https://forum.httrack.com/readmsg/36162/33879/index.html
# Written by Mitch B (lightly adapted here for Python 3)
from bs4 import BeautifulSoup
import os
from glob import glob
import codecs

PATH = r"C:\folder_containing_static_site"  # No trailing \ or it will break!

result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.html'))]
for filename in result:
    print(filename)
    file = codecs.open(filename, 'r', 'utf-8')
    data = file.read()
    soup = BeautifulSoup(data, 'html.parser')
    for p in soup.find_all('img'):
        if 'srcset' in p.attrs:
            del p.attrs['srcset']
    file.close()
    file1 = codecs.open(filename, 'w', 'utf-8')
    file1.write(soup.prettify())
    file1.close()
```
Note: The script overwrites the files, so run it on a copy in case something goes wrong.
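As an alternative sketch of the same idea (my own, not part of Mitch B's script), the attribute can be stripped with the standard library alone. A regex substitution also avoids the `prettify()` step, which reflows the whitespace of the whole document. Note the caveats: it assumes the attribute value is quoted, and it will remove `srcset` from any tag, including `<source>`:

```python
import re
from pathlib import Path

# Matches a srcset="..." (or srcset='...') attribute inside a tag.
SRCSET_ATTR = re.compile(r'\s+srcset\s*=\s*("[^"]*"|\'[^\']*\')', re.IGNORECASE)

def strip_srcset(html):
    """Remove srcset attributes without otherwise touching the markup."""
    return SRCSET_ATTR.sub("", html)

def strip_tree(root):
    """Overwrite every .html file under root with srcset attributes removed."""
    for path in Path(root).rglob("*.html"):
        path.write_text(strip_srcset(path.read_text(encoding="utf-8")),
                        encoding="utf-8")
```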
I saved the script as `removesetsrc.py` and ran it with `python removesetsrc.py`.
If the script complains about the `bs4` module, you will need to install BeautifulSoup by doing `pip install beautifulsoup4`.
After running the script, my local sites are working as expected and are ready to be hosted should I decide to do so.