Pelican conversion notes

    The basic process I used for converting my WordPress data was to:

    1. Export the XML from WordPress
    2. Run the import: pelican-import --markup markdown --wpfile -o content/ --dir-page petersmithjournal.wordpress.2014-12-09.xml
    3. Use pandoc to get rid of remaining textile markup.
    4. Fix up Unicode and other matters.
    5. Change the filenames so they begin with a date.
    6. Adjust categories.

    # Remove textile markup
    # (assumes the imported posts are the *.md files in the current directory)
    for file in *.md; do
        echo "Processing $file"
        pandoc -f textile -t markdown -o "$file.tmp" "$file"
        mv "$file.tmp" "$file"
    done

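    # Fix up unicode and other problems in markdown files
    # (assumes the posts are the *.md files in the current directory)
    for file in *.md; do
        echo "Processing $file"
        # the here-document stays unindented so the closing EOF is recognised
        vim -E -s "$file" <<-EOF
    :%substitute/\\\# /#/g
    :update
    :quit
    EOF
        # BSD/macOS sed syntax; GNU sed takes -i without the empty ''
        sed -i '' 's/\\$//' "$file"
        sed -i '' 's/\\\\//g' "$file"
    done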
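For anyone who would rather do this clean-up in Python, here is a minimal sketch of the same three substitutions applied to a single line. The function name clean_markdown_line is made up for illustration; it only mirrors the vim and sed patterns shown above and is not part of the original workflow.

```python
import re


def clean_markdown_line(line):
    """Apply the vim/sed clean-ups above to one line (without its newline)."""
    # Like :%substitute/\\# /#/g  -> turn an escaped "\# " into "#"
    line = line.replace("\\# ", "#")
    # Like sed 's/\\$//'  -> drop a single trailing backslash
    line = re.sub(r"\\$", "", line)
    # Like sed 's/\\\\//g'  -> drop every doubled backslash
    line = line.replace("\\\\", "")
    return line


if __name__ == "__main__":
    for sample in ["\\# Heading", "line break\\", "a\\\\b", "plain"]:
        print(repr(sample), "->", repr(clean_markdown_line(sample)))
```

Working on strings rather than files makes it easy to try the patterns on a few sample lines before letting anything edit the posts in place.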


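    #!/bin/bash -e
    # Rename each post so its filename begins with the date from its "Date:" line.
    # Assumption: posts are *.md files and the new name is YYYY-MM-DD-<old name>.
    for i in *.md; do
        date=$(head -10 "$i" | grep "^Date:" | awk '{print $2}')
        year=$(echo "$date" | awk -F - '{print $1}')
        month=$(echo "$date" | awk -F - '{print $2}')
        day=$(echo "$date" | awk -F - '{print $3}')
        newname="$year-$month-$day-$i"
        echo "$newname"
        mv "$i" "$newname"
    done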
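The renaming step can also be sketched in Python. This is only an illustration: the helper name dated_name and the YYYY-MM-DD- prefix format are assumptions, chosen to match the goal of making each filename begin with the post's date.

```python
import re

# Matches a metadata line such as "Date: 2014-12-09 10:20"
DATE_RE = re.compile(r"^Date:\s*(\d{4})-(\d{2})-(\d{2})", re.MULTILINE)


def dated_name(filename, text):
    """Return filename prefixed with the date found in the post's header."""
    # Only look at the first ten lines, like `head -10` in the shell version
    header = "\n".join(text.splitlines()[:10])
    m = DATE_RE.search(header)
    if m is None:
        return filename  # no Date: line found, so leave the name alone
    return "%s-%s-%s-%s" % (m.group(1), m.group(2), m.group(3), filename)


if __name__ == "__main__":
    post = "Title: Hello\nDate: 2014-12-09 10:20\nCategory: PhD\n\nBody text\n"
    print(dated_name("hello.md", post))
```

The function only computes the new name; the actual rename (for example with os.rename) is left to the caller so the result can be checked first.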


    #!/usr/bin/env python
    """
    Script to fix highlighted code blocks from the WordPress wp-syntax plugin.

    The wp-syntax plugin uses a "lang" attr on pre tags to define the syntax
    highlighting, like:
        <pre lang="bash">
    When pelican-import runs this through Pandoc to produce Markdown, it comes out
    as a weird and meaningless block like:
        ~~~~ {lang="bash"}
    This script takes file path(s) as arguments, and converts this junk
    into proper Markdown notation. It CHANGES FILES IN-PLACE.
    """
    import os
    import sys
    import re
    from pygments import lexers
    files = sys.argv[1:]
    # Translation of GeSHi identifiers to Pygments identifiers,
    # for GeSHi identifiers not supported by Pygments.
    overrides = {'none': 'text', 'lisp': 'common-lisp', 'html4strict': 'html'}
    overrides['xorg'] = 'text'
    overrides['nagios'] = 'text'
    # Mapping of WP categories to new blog categories, for any that change.
    categories = {}
    categories['Strategy'] = 'Research'
    categories['Seminars'] = 'Research'
    categories['PhD'] = 'Research'
    categories['Hypotheses'] = 'Research'
    categories['Running'] = 'Fitness'
    categories['Health'] = 'Fitness'
    categories['Readings'] = 'Research'
    categories['Quick reference'] = 'Jottings'
    categories['Blogging'] = 'Jottings'
    categories['MBA'] = 'Jottings'
    categories['Out and about'] = 'Jottings'
    def translate_identifier(lexers, overrides, i, fname=None):
        """
        Translate a wp-syntax/GeSHi language identifier to
        a Pygments identifier.
        """
        if i in lexers:
            return lexers[i].lower()
        if i in overrides:
            return overrides[i]
        sys.stderr.write("Unknown lexer, leaving as-is: %s" % i)
        if fname is not None:
            sys.stderr.write(" in file %s" % fname)
        sys.stderr.write("\n")
        return i
    def get_lexers_list():
        """ get a dict mapping every pygments lexer name and alias to its name """
        d = {}
        ls = lexers.get_all_lexers()
        for l in ls:
            d[l[0]] = l[0]
            for n in l[1]:
                d[n] = l[0]
        return d
    def translate_category(i):
        """ translate a category name """
        if i in categories:
            return categories[i]
        return i
    lang_re = re.compile(r'^~~~~ {lang="([^"]+)"}$')
    cat_re = re.compile(r'^Category: (.+)$')
    lexers = get_lexers_list()
    for f in files:
        content = ""
        inpre = False
        count = 0
        with open(f, "r") as fh:
            for line in fh:
                # rewrite the Category: metadata line if the category changed
                m = cat_re.match(line)
                if m is not None:
                    line = ("Category: %s\n" % translate_category(m.group(1)))
                    content = content + line
                    continue
                # rewrite the opening fence of a wp-syntax code block
                m = lang_re.match(line)
                if m is not None:
                    line = ("~~~~{.%s}\n" % translate_identifier(lexers, overrides, m.group(1), fname=f))
                    inpre = True
                    count = count + 1
                elif inpre and line.strip() == "~~~~":
                    inpre = False
                content = content + line
        # write the converted text back over the original file
        with open(f, "w") as fh:
            fh.write(content)
        print(" fixed %d blocks in %s" % (count, f))
    # done
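To see what the fence rewriting does without touching any files, here is a small sketch that applies the same regular expression to a string. The function convert_fences and the tiny ALIASES table are hypothetical stand-ins for the Pygments lookup used in the script above, so this runs without Pygments installed.

```python
import re

# Same pattern as lang_re in the script above
LANG_RE = re.compile(r'^~~~~ {lang="([^"]+)"}$')

# Hypothetical stand-in for the Pygments lexer table plus overrides
ALIASES = {"bash": "bash", "html4strict": "html", "none": "text"}


def convert_fences(text, aliases=ALIASES):
    """Return (converted_text, number_of_fences_rewritten)."""
    out = []
    count = 0
    for line in text.splitlines(True):  # True keeps the line endings
        m = LANG_RE.match(line)
        if m is not None:
            # Unknown identifiers are left as-is, as in the script above
            lang = aliases.get(m.group(1), m.group(1))
            line = "~~~~{.%s}\n" % lang
            count += 1
        out.append(line)
    return "".join(out), count


if __name__ == "__main__":
    sample = 'Intro\n~~~~ {lang="html4strict"}\n<p>hi</p>\n~~~~\n'
    fixed, n = convert_fences(sample)
    print(fixed)
    print("fixed %d blocks" % n)
```

Running something like this on a copy of one post is a cheap way to check the output before running the in-place script across the whole content directory.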

    I should note that none of this is my original work. I found and tweaked many helpful sources from the web.



    Updated: 18 Dec '14 14:21

    Author: Peter Smith

