
The work involved in migrating a 20-year-old, 10,000-page website into a CMS ecosystem for the first time is daunting for agency and client alike. The potential for a well-planned project to go off the rails is high, and a solid planning process is necessary to build scope around the work. The key site materials gathered and generated during the research and planning phase are: a complete sitemap (with a content breakdown of current URLs, titles, meta and SEO data, and page slugs if possible) and a project Information Architecture document containing the data needed to build custom post types and custom fields. You can also choose to include fields for a Featured Image and Content Excerpt if you wish to use those WordPress features.

Whether you decide to add those fields to your content breakdown spreadsheet or allow a content scraper tool to target the elements on your current site depends on how consistently formatted your current site's HTML markup is. For example, if you want to use your site's meta description as the content excerpt, and you also have a large banner image that is always the first object in your main content container (say, a <div class="main"> wrapper in which the image is always given a class of main-banner), then you could probably automate that process, because you can consistently target that class without scraping other unwanted content. However, if you don't have that type of structure to work with, it is better to add columns to your breakdown spreadsheet and generate that content before any importing to the new site begins.
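For illustration, a markup pattern consistent enough to target automatically might look something like the sketch below. The class names come from the example above; the image path and surrounding content are hypothetical:

<div class="main">
    <img class="main-banner" src="/images/banner.jpg" alt="Page banner">
    <p>Page content starts here.</p>
</div>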

You can also include data for tags, post categories, and a description of the page template if that content is available. The more information you can provide, the better.

We’ve Got Data; Let’s Get Started

At this point, we are assuming we have a breakdown of site content as well as an information architecture document. Whatever you decide comes next, it is highly recommended to have those documents completed before anything else happens. To gather the content, we are going to use a scraper to read the HTML from a set of page URLs and save it in a JSON format that a developer can convert to XML to be imported by WordPress. After testing a few tools for this process, we settled on the free version of ParseHub. The biggest constraint of the free version versus the paid version was the number of URLs that could be scraped in a batch: the free version only allows 200 URLs at a time, while the premium version is unlimited.

After downloading the application, you will set up a loop to read your list of URLs and scrape the HTML content on each one. In your project settings (1) you will input a JSON-formatted list of URLs in the starting value field. Under commands you will create a loop of Select page, For each item in urls, Begin new entry in <name>, Go to template Results. You will be asked what template to use; we created a new one called Results. In the Results template (2) you will then target the HTML containers on the page you want to scrape. The better your HTML formatting is, the more streamlined the conversion will be to something WordPress can use.
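The starting value is simply a JSON object whose urls key holds the list the loop iterates over, along these lines (the URLs are placeholders):

{
  "urls": [
    "https://www.example.edu/about/history.html",
    "https://www.example.edu/about/leadership.html"
  ]
}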

Running the scraper should result in a JSON file that looks like this:

{
  "links": [
    {
      "title": "A descriptive title scraped from the page <h1> or the <title> attribute",
      "description": "Scraped from the SEO meta description attribute if available",
      "excerpt": "A short description of the page content.",
      "selection": "Your HTML markup"
    }
  ]
}

This will now need to be reformatted into an XML document that can be imported by the WordPress Importer plugin. You should be able to do some basic find-and-replace actions in the document to turn the JSON into XML that looks like this (note that the wp:, dc:, excerpt:, and content: prefixes are XML namespaces, which the importer expects to be declared on the root <rss> element):

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
    <wp:wxr_version>1.2</wp:wxr_version>

    <item>
        <title>A descriptive title scraped from the page &lt;h1&gt; or the &lt;title&gt; element</title>
        <dc:creator><![CDATA[utadmin]]></dc:creator>
        <wp:postmeta>
            <wp:meta_key><![CDATA[_yoast_wpseo_metadesc]]></wp:meta_key>
            <wp:meta_value><![CDATA[Scraped from the SEO meta description attribute if available]]></wp:meta_value>
        </wp:postmeta>
        <excerpt:encoded><![CDATA[A short description of the page content.]]></excerpt:encoded>
        <content:encoded><![CDATA[Your HTML markup]]></content:encoded>
        <wp:post_type><![CDATA[post]]></wp:post_type>
    </item>
</channel>
</rss>
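If find-and-replace gets tedious, a short script can do the same transformation. Below is a minimal sketch, assuming the JSON structure shown earlier and placeholder filenames (links.json in, import.xml out); adjust the item template to whatever fields you actually scraped:

# convert_links.py -- rough sketch: turn the scraped JSON into WXR-style XML.
# The filenames, author name, and post type are placeholders.
import json
from xml.sax.saxutils import escape

HEADER = """<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
    <wp:wxr_version>1.2</wp:wxr_version>
"""

ITEM = """    <item>
        <title>{title}</title>
        <dc:creator><![CDATA[utadmin]]></dc:creator>
        <wp:postmeta>
            <wp:meta_key><![CDATA[_yoast_wpseo_metadesc]]></wp:meta_key>
            <wp:meta_value><![CDATA[{description}]]></wp:meta_value>
        </wp:postmeta>
        <excerpt:encoded><![CDATA[{excerpt}]]></excerpt:encoded>
        <content:encoded><![CDATA[{content}]]></content:encoded>
        <wp:post_type><![CDATA[post]]></wp:post_type>
    </item>
"""

def cdata_safe(text):
    # A CDATA section cannot contain the literal "]]>" sequence,
    # so split it across two sections if it ever appears.
    return text.strip().replace("]]>", "]]]]><![CDATA[>")

with open("links.json", encoding="utf-8") as f:
    links = json.load(f)["links"]

with open("import.xml", "w", encoding="utf-8") as out:
    out.write(HEADER)
    for link in links:
        out.write(ITEM.format(
            title=escape(link.get("title", "").strip()),
            description=cdata_safe(link.get("description", "")),
            excerpt=cdata_safe(link.get("excerpt", "")),
            content=cdata_safe(link.get("selection", "")),
        ))
    out.write("</channel>\n</rss>\n")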

You may be able to combine your scraped JSON files so that you can run the find-and-replace commands as a larger bulk operation; however, as the file size increases, memory consumption can cause issues in an IDE or code editor.

Depending on what content was targeted, you may have a lot of HTML formatting to clean up. By default, WordPress treats soft returns (shift-return) as a <br> tag and hard returns as a <p> tag, so you should not need to include any such formatting with your content. WordPress will also treat spaces and returns around the tags as content, so it's best to remove any whitespace between the opening <![CDATA[ and the actual content, and between the content and the closing ]]> of each node.
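That cleanup can also be scripted. Here is a minimal sketch, assuming the import.xml filename from the earlier example, that strips whitespace just inside each CDATA section:

# trim_cdata.py -- strip stray whitespace just inside CDATA sections so
# WordPress does not turn leading/trailing returns into empty <p>/<br> tags.
import re

with open("import.xml", encoding="utf-8") as f:
    xml = f.read()

# Remove whitespace after the opening <![CDATA[ and before the closing ]]>.
xml = re.sub(r"<!\[CDATA\[\s+", "<![CDATA[", xml)
xml = re.sub(r"\s+\]\]>", "]]>", xml)

with open("import.xml", "w", encoding="utf-8") as f:
    f.write(xml)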

Tools for working with imported data

Assuming you have some kind of site hierarchy, your site pages will have child/parent relationships. Note that in WordPress all content is a post; by default, however, a WordPress install contains both a post type and a page type. As programming objects they are the same, but they are given unique features and used in separate ways on a site. The post type is typically used for a newsroom style of treatment and has categorization and tags enabled. The page type is given the option to set a parent page, but not categories or tags (although these can be added as a theme function if desired).

Instead of trying to recreate the child/parent relationships of the sitemap inside the XML data, we found it simpler to import everything without the relationships, allow WordPress to automatically generate the post ID for each item, and then use the CMS Page Tree View plugin to build the relationships. If you do attempt to assign each page an ID in the XML and then reference it with a <wp:post_parent>ID#</wp:post_parent> node, note that unless the child and parent are both in the same import data, setting the parent ID will fail.
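If you need to wire up an individual relationship after importing, WP-CLI (covered below) can also set the parent directly once both posts exist and you know their generated IDs; the IDs here are placeholders:

wp post update 123 --post_parent=45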

Bonus Tips for Photo Galleries

Media is treated as a post object by WordPress, the same as other content, and can be imported in a similar way. Working with the Media Library Assistant plugin gave us access to a lot of flexible shortcode display options, and the WCK Custom Field Creator allowed us to create additional categories for organizing the images. The required XML fields for import are similar; however, you will want to include the following:

<wp:post_type><![CDATA[attachment]]></wp:post_type>
<wp:attachment_url><![CDATA[https://link-to-current-photo-image.jpg]]></wp:attachment_url>
<wp:status><![CDATA[inherit]]></wp:status>

You will also want to remove the <excerpt:encoded> data and use the <content:encoded> data for an image caption. When you import, check the option Download and import file attachments, which allows WordPress to save the image to the uploads directory (make sure you have set write permissions on this folder on your server).
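Putting that together, a single attachment entry might look something like this sketch (the title, caption, and URL are placeholders, and the item sits inside the same <channel> wrapper as the earlier example):

    <item>
        <title>Photo title</title>
        <dc:creator><![CDATA[utadmin]]></dc:creator>
        <content:encoded><![CDATA[An image caption.]]></content:encoded>
        <wp:post_type><![CDATA[attachment]]></wp:post_type>
        <wp:attachment_url><![CDATA[https://link-to-current-photo-image.jpg]]></wp:attachment_url>
        <wp:status><![CDATA[inherit]]></wp:status>
    </item>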

WP-CLI for bulk operations

If you are working on a large-scale site migration, you will probably be working through a lot of trial and error, and WP-CLI can make that much easier by letting you run post import/export/update operations from a command line with a direct line to the database. For example, you can delete every page in WordPress with the command wp post delete $(wp post list --post_type='page' --format=ids). By default, WordPress only shows 20 posts in the admin list of pages, so without installing a bulk-editing plugin you would only be able to select and delete 20 posts at a time. You can also target specific page IDs, categories, or other conditions in bulk. The easiest place to work with the site data will be in the JSON or XML files and through CLI functions rather than the admin UI. That won't be possible all the time (e.g., using the CMS Page Tree View plugin to organize the sitemap), but it will make life easier in many ways.
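As a rough sketch of the kind of reset-and-retry loop this enables during test imports (the filename is a placeholder, and wp import requires the WordPress Importer plugin to be installed and active):

# List the IDs of every page currently in the site
wp post list --post_type=page --format=ids

# Wipe the previous test import (bypassing the trash)
wp post delete $(wp post list --post_type=page --format=ids) --force

# Re-run the import, creating any authors found in the file
wp import import.xml --authors=create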

Lessons learned

Bulk scraping and importing of a legacy website that does not use a database to manage the separation of view and content requires a lot of planning and manual work. However, the process can be optimized to avoid unforeseen diversions and costly delays. Be prepared to dive into a thorough planning process, and review the tools and techniques you plan to use during the research phase to make sure the migration plan is practical to execute.