Jan 27 2016
Mirroring Web Pages

Related to the previous article, step 1 of a resilient WordPress setup is to mirror the web page somehow. A second WordPress instance with file-level synchronization of the WordPress directory and MySQL multi-master replication sounds great… not.

WordPress keeps files in its directory tree (plugins, or YAPB pictures for example), and while rsync could handle this, it gets messy quickly. Multi-master MySQL is possible, but overkill for my purpose.

The easier and more universal way is to simply grab the web page and keep a static copy available. The copy lacks the ability to log in and edit or write articles, but that’s fine, as most readers will simply read.

The initial idea was to use wget, but that fell short: by default it did not copy the JavaScript files referenced in <script> tags, and it did not download CSS files either. The plain text content plus some formatting was copied, but not enough to call it a “mirror”.

Another idea was to use PhantomJS to render the web page and save it as a picture. For a blog that would be quite a long picture, so that idea was thrown out quickly.

While revisiting the wget approach I found httrack. It was not perfect on the first try, but it made a much better copy out of the box, and with a bit of tuning I could mirror my blog quite well. There are visible differences, but they are minimal. So httrack it was.

Naturally this became a Docker container. It’s hosted on Docker Hub under hkubota/webmirror.

The Dockerfile is simple:

FROM debian:8
MAINTAINER Harald Kubota <hkubota@gmx.net>
RUN apt-get update && apt-get -y install lighttpd wget curl openssh-client && apt-get clean
# httrack in usr/local/
COPY usr/ /usr/
RUN ldconfig -v
# The script to run
COPY mirror.sh /root/
# The lighttpd configuration
COPY lighttpd.conf /root/
ENTRYPOINT ["dumb-init", "/root/mirror.sh"]
# It's a web server, so expose port 80
EXPOSE 80

It mainly uses httrack, which I compiled from source, and lighttpd, since the mirrored pages have to be served by a web server again. wget, curl and openssh-client are there more for completeness, as I was testing with both httrack and wget and ssh’ing out.

I tested it on several other web pages (www.heise.de, www.theregister.co.uk and some others) and it works quite well. Note that the default is 2 recursive levels, which is enough for every link on the first page to be clickable. Also, the copying stops after 5 minutes, as I sometimes ran into endless loops. If your network connection is very fast or very slow, you might have to adjust this time limit.

To run the hkubota/webmirror Docker image, do:

docker run -e web_source="http://www.heise.de/" -e recursive=2 -e refresh=24 \
 -e max_time=300 -e other_flags="-v" -p 80:80 -d hkubota/webmirror

If you want to watch what happens (mainly to see the httrack output), replace “-d” with “-it”, or watch the logs via “docker logs CONTAINER”.
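The mirror.sh script itself is not shown in this post. As a rough idea of how such a script could turn the settings above into an httrack call, here is a minimal sketch. It is not the script shipped in the image: the function name build_httrack_cmd and the output directory /var/www/mirror are made up for illustration, and mapping recursive to httrack's -r (depth) and max_time to -E (time limit in seconds) is my assumption.

```shell
#!/bin/sh
# Hypothetical sketch of a mirror.sh-style wrapper; NOT the script from the image.

# Print the httrack command line for the given settings.
# -O sets the output directory, -rN the mirror depth, -EN the time limit in seconds.
build_httrack_cmd() {
    src="$1"      # page to mirror (web_source)
    depth="$2"    # recursion depth (recursive)
    limit="$3"    # stop copying after this many seconds (max_time)
    extra="$4"    # additional httrack flags (other_flags)
    out="$5"      # output directory; this path is an assumption
    echo "httrack $src -O $out -r$depth -E$limit $extra"
}

# Defaults follow the article: 2 levels deep, 5 minutes (300 s) per run.
cmd=$(build_httrack_cmd "${web_source:-http://www.heise.de/}" \
    "${recursive:-2}" "${max_time:-300}" "${other_flags:-}" "/var/www/mirror")
echo "$cmd"
```

The sketch only prints the command, so it can be tried without httrack installed. A real script would run the command, repeat it at the refresh interval, and point lighttpd at the output directory.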

Next is the actual load balancer…