Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools (#1888)

Let's officially kick off [the Data Liberation](https://wordpress.org/data-liberation/) efforts under the Playground umbrella and unlock powerful new use cases for WordPress.

## Rationale

### Why work on Data Liberation?

WordPress core _really_ needs reliable data migration tools. There's just no reliable, free, open source solution for:

- Content import and export
- Site import and export
- Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or Tumblr -> WordPress
- Site-to-site synchronization

Yes, there's the WXR content export. However, it won't help you back up a photography blog full of media files, plugins, API integrations, and custom tables. There are paid products out there, but nothing in core.

At the same time, so many Playground use-cases are **all about moving your data**. Exporting your site as a zip archive, migrating between hosts with the [Data Liberation browser extension](https://github.com/WordPress/try-wordpress/), creating interactive tutorials and showcasing beautiful sites using [the Playground block](https://wordpress.org/plugins/interactive-code-block/), previewing Pull Requests, building new themes, and [editing documentation](#1524) are just the tip of the iceberg.

### Why do the existing data migration tools fall short?

Moving data around seems easy, but it's a complex problem – consider migrating links.

Imagine you're moving a site from [https://my-old-site.com](https://playground-site-1.com) to [https://my-new-site.com/blog/](https://my-site-2.com). If you just moved the posts, all the links would still point to the old domain, so you'll need an importer that can adjust all the URLs in your entire database. However, typical tools like `preg_replace` or `wp search-replace` can only replace some URLs correctly.
They won't reliably adjust deeply encoded data, such as a URL inside JSON inside an HTML comment inside a WXR export.

The only way to perform a reliable replacement here is to carefully parse each and every data format and replace the relevant parts of the URL at the bottom of it. That requires four parsers: an XML parser, an HTML parser, a JSON parser, and a WHATWG URL parser. Most of those tools don't exist in PHP. PHP provides `json_decode()`, which isn't free of issues, and that's it. You can't even rely on DOMDocument to parse XML because of its limited availability and non-streaming nature.

### Why build this in Playground?

Playground gives us a lot for free:

- **Customer-centric environment.** The need to move data around is so natural in Playground. So many people asked for reliable WXR imports, site exports, synchronization with git, and the ability to share their Playground. Playground allows us to get active users and customer feedback every step of the way.
- **Free QA**. Anyone can share a testing link and easily report any problems they found. Playground is the perfect environment to get ample, fast-moving feedback.
- **Space to mature the API**. Playground doesn’t provide the same backward compatibility guarantees as WordPress core. It's easy to prototype a parser, find a use case where the design breaks down, and start over.
- **Control over the runtime.** Playground can lean on PHP extensions to validate our ideas, test them on simulated slow hardware, and ship them to a tablet to see how they do when the app goes into the background and the internet is flaky.

Playground enables methodically building spec-compliant software to create the solid foundation WordPress needs.

## The way there

### What needs to be built?

There's been a lot of [gathering information, ideas, and tools](https://core.trac.wordpress.org/ticket/60375).
This writeup is based on 10 years' worth of site transfer problems, WordPress synchronization plugins, chats with developers, analyzing existing codebases, past attempts at data importing, non-WordPress tools, discussions, and more.

WordPress needs parsers. Not just any parsers: they must be streaming, re-entrant, fast, standards-compliant, and tested using a large body of possible inputs. The data synchronization tools must account for data conflicts, WordPress plugins, invalid inputs, and unexpected power outages. The errors must be non-fatal, retryable, and allow manual resolution by the user. No data loss, ever. The transfer target site should be usable as early as possible and show no broken links or images during the transfer. That's the gist of it.

A number of parsers have already been prototyped. There's even [a draft of a reliable URL rewriting library](https://github.com/adamziel/site-transfer-protocol). Here's a bunch of early drafts of specific streaming use-cases:

- [A URL parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
- [A block markup parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
- [An XML parser](WordPress/wordpress-develop#6713), also explored by @dmsnell and @jonsurrell
- [A Zip archive parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
- [A multihandle HTTP client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php) without curl dependency
- [A MySQL query parser](WordPress/sqlite-database-integration#157) started by @zieladam and now explored by @JanJakes
- [A stream chaining API](adamziel/wxr-normalize#1) to connect all these pieces

On top of that, WordPress core now has an HTML parser, and @dmsnell has been exploring a [UTF-8](WordPress/wordpress-develop#6883) decoder that would enable fast and regex-less URL detection in long data streams.

There are still technical challenges to figure out, such as how to pause and resume the data streaming. As this work progresses, you'll start seeing incremental improvements in Playground. One possible roadmap is shipping a reliable content importer, then a reliable site zip importer and exporter, then cloning a site, and then extending towards full-featured site transfers and synchronization.

### How soon can it be shipped?

Three points:

* No dates.
* Let's keep building on top of prior work and ship meaningful user flows often.
* Let's not ship any stable public APIs until the design is mature.

For example, the [Try WordPress extension](https://github.com/WordPress/try-wordpress/) can already give you a Playground site, even if you cannot migrate it to another WordPress site just yet.

**Shipping matters. At the same time, taking the time required to build rigorous, reliable software is also important**. An occasional early version of this or that parser may be shipped once its architecture seems alright, but the architecture and the stable API won't be rushed. That would jeopardize the entire project. This project aims for a solid design that will serve WordPress for years.

The progress will be communicated in the open, while maintaining feedback loops and using the work to ship new Playground features.

## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single WordPress post_. Yes! Just one post. The tricky part is that all the URLs will have to be preserved.

From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool.
For example: "paste a WordPress post, paste a new site URL, get your post migrated".

There's an ample body of existing work. Let's keep the existing codebases (e.g. WXR, site migration plugins) and discussions open in a browser window during this work. Let's involve the authors of these tools, ask them questions, ask them for reviews. Let's publish the progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start, stop, resume, tolerate errors, and accept alternative data from the user, e.g. media files, posts etc.
- **WordPress-first** – let's build everything in PHP using WordPress naming conventions.
- **Compatibility** – Every WordPress version, PHP version (7.2+, CLI), and Playground runtime (web, CLI, browser extension, desktop app, CI etc.) should be supported.
- **Dependency-free** – No PHP extensions required. If this means we can't rely on cURL, then let's build an HTTP client from scratch. Only minimal Composer dependencies allowed, and only when absolutely necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is [WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/) – a **single class** that can parse nearly all HTML. There's no "Node", "Element", "Attribute" classes etc. Let's aim for the same here.
- **Extensibility** – Playground should be able to benefit from, say, a WASM Markdown parser even if core WordPress cannot.
- **Reusability** – Each library should be framework-agnostic and usable outside of WordPress. We should be able to use them in WordPress core, WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools like https://github.com/adamziel/playground-content-converters, and even in Next.js via PHP.wasm.
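
To make the fault-tolerance and simplicity goals a bit more concrete, here is a minimal sketch of what a single-class, resumable link rewriter could look like. It is not taken from any of the prototypes linked in this description: `BlockUrlRewriter`, `rebaseUrl`, and `mapStrings` are hypothetical names, TypeScript is used only so the sketch can run in this monorepo (the real tools are meant to be PHP), and a regular expression stands in for a proper block markup parser. It only looks at URLs that make up an entire JSON string inside a block comment.

```typescript
// Hypothetical sketch: none of these names exist in WordPress or Playground.

/** Rebases an absolute URL from one site root to another; other strings pass through. */
function rebaseUrl(value: string, from: URL, to: URL): string {
	let url: URL;
	try {
		url = new URL(value); // WHATWG URL parsing
	} catch {
		return value; // Not an absolute URL, leave it untouched.
	}
	if (url.origin !== from.origin || !url.pathname.startsWith(from.pathname)) {
		return value; // Belongs to another site.
	}
	const relativePath = url.pathname.slice(from.pathname.length);
	return new URL('./' + relativePath + url.search + url.hash, to).href;
}

/** Applies fn to every string nested anywhere inside a decoded JSON value. */
function mapStrings(value: unknown, fn: (s: string) => string): unknown {
	if (typeof value === 'string') return fn(value);
	if (Array.isArray(value)) return value.map((item) => mapStrings(item, fn));
	if (value !== null && typeof value === 'object') {
		const source = value as Record<string, unknown>;
		const result: Record<string, unknown> = {};
		for (const key of Object.keys(source)) {
			result[key] = mapStrings(source[key], fn);
		}
		return result;
	}
	return value;
}

/** One class, no node objects, and a numeric cursor that can be saved and restored. */
class BlockUrlRewriter {
	private markup: string;
	private from: URL;
	private to: URL;
	private cursor: number;
	private output = '';

	constructor(markup: string, from: URL, to: URL, cursor = 0) {
		this.markup = markup;
		this.from = from;
		this.to = to;
		this.cursor = cursor;
	}

	/** Offset of the first unprocessed character; safe to persist between runs. */
	getCursor(): number {
		return this.cursor;
	}

	/** Rewritten markup produced so far by this instance. */
	getOutput(): string {
		return this.output;
	}

	/** Rewrites the next block comment that has attributes. Returns false when none is left. */
	nextBlock(): boolean {
		// Simplification: a regex stands in for a real block markup parser.
		const pattern = /(<!--\s+wp:[a-z0-9\/-]+\s+)(\{[\s\S]*?\})(\s+\/?-->)/g;
		pattern.lastIndex = this.cursor;
		const match = pattern.exec(this.markup);
		if (match === null) {
			this.output += this.markup.slice(this.cursor); // Flush the tail.
			this.cursor = this.markup.length;
			return false;
		}
		// JSON.parse decodes escapes such as "\/" before the URL is inspected.
		const attributes = mapStrings(JSON.parse(match[2]), (s) =>
			rebaseUrl(s, this.from, this.to)
		);
		this.output +=
			this.markup.slice(this.cursor, match.index) +
			match[1] + JSON.stringify(attributes) + match[3];
		this.cursor = match.index + match[0].length;
		return true;
	}
}

const from = new URL('https://my-old-site.com/');
const to = new URL('https://my-new-site.com/blog/');
const markup =
	'<!-- wp:image {"url":"https:\\/\\/my-old-site.com\\/wp-content\\/a.png"} /-->\n' +
	'<!-- wp:cover {"url":"https://my-old-site.com/b.jpg?w=100","dimRatio":50} /-->';

// Process one block, "crash", then resume in a fresh instance from the saved cursor.
const firstRun = new BlockUrlRewriter(markup, from, to);
firstRun.nextBlock();
const secondRun = new BlockUrlRewriter(markup, from, to, firstRun.getCursor());
while (secondRun.nextBlock()) {}
const result = firstRun.getOutput() + secondRun.getOutput();

// A plain string replacement misses the JSON-escaped URL in the first block.
const naive = markup.split('https://my-old-site.com').join('https://my-new-site.com/blog');
console.log(result);
console.log(naive.includes('my-old-site.com'));
```

The cursor is a plain number on purpose: it can be stored anywhere and handed to a new instance after an interruption, which is the start/stop/resume behavior the fault-tolerance goal asks for. A real implementation would also need the XML and HTML layers, non-fatal handling of invalid JSON, and a streaming input instead of a single in-memory string.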

### Prior art

Here are a few codebases that need to be reviewed at minimum, and brought into this project at maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector: WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers: https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in JS ZIP parser): #1799
- Local Zip file reader in PHP (seeks to central directory, seeks back as needed): https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – Markdown <-> Block markup WordPress plugin: https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip file: https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and export data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_: https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and non-WordPress sites to Playground: https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue: https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer install` required
- Compatibility with different Playground runtimes (web, CLI) and versions of WordPress and PHP

**Logical parts**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content importers
- Third party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use Composer dependency graph to automatically resolve dependencies between libraries and WordPress plugins
- or use WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies

cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame @ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera @swissspidy @eliot-akira @sirreal @obenland @rralian @ockham @youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski @palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap @michalczaplinski @danluu
WordPress · Oct 14, 2024 · e9bb384
1 parent 1cd30bf
commit e9bb384
Showing 5 changed files with 164 additions and 0 deletions.
diff --git a/packages/playground/data-liberation/PLAN.md b/packages/playground/data-liberation/PLAN.md
@@ -0,0 +1,69 @@
+## Plan
+
+The initial plan is to build a tool to export and import a single WordPress post.
+Yes! Just one post. The tricky part is that all the links, media files, post meta,
+etc. must be preserved. This is closely related to WXR exporters, so let's keep
+these codebases open on our screens as we work on this project.
+
+### Design goals
+
+-   Build re-entrant data tools that can start, stop, resume, tolerate errors, accept alternative media files, posts etc. from the user.
+-   WordPress-first – let's build everything in PHP using WordPress naming conventions.
+-   Compatibility – Every WordPress version, PHP version (7.2+, CLI), and Playground runtime (web, CLI, browser extension, desktop app, CI etc.) should be supported.
+-   Dependency-free – No PHP extensions required. If this means we can't rely on cURL, then let's build an HTTP client from scratch. Only minimal Composer dependencies allowed, and only when absolutely necessary.
+-   Simple – no advanced OOP patterns. Our role model is [WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/) – a **single class** that can parse nearly all HTML. There's no "Node", "Element", "Attribute" classes etc. Let's aim for the same here.
+-   Extensibility – Playground should be able to benefit from, say, a WASM Markdown parser even if core WordPress cannot.
+-   Reusability – Each library should be framework-agnostic and usable outside of WordPress. We should be able to use them in non-WordPress tools like https://github.com/adamziel/playground-content-converters.
+
+### Prior art
+
+Here are a few codebases we'll need to review and bring into this project:
+
+-   URL rewriter: https://github.com/adamziel/site-transfer-protocol
+-   URL detector: https://github.com/WordPress/wordpress-develop/pull/7450
+-   WXR rewriter: https://github.com/adamziel/wxr-normalize/
+-   Stream Chain: https://github.com/adamziel/wxr-normalize/pull/1
+-   Unicode-aware comprehensive sluggify(): https://github.com/WordPress/wordpress-develop/pull/5466
+-   Doodlings on how a Core URL parser could look: https://github.com/WordPress/wordpress-develop/pull/6666
+-   XML parser: https://github.com/WordPress/wordpress-develop/pull/6713
+-   Streaming PHP parsers: https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
+-   Zip64 support (in JS ZIP parser): https://github.com/WordPress/wordpress-playground/pull/1799
+-   Local Zip file reader in PHP (seeks to central directory, seeks back as needed): https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
+-   Blocky formats – Markdown <-> Block markup WordPress plugin: https://github.com/dmsnell/blocky-formats
+-   Sandbox Site plugin that exports and imports WordPress to/from a zip file: https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
+-   WordPress + Playground CLI setup to import, convert, and export data: https://github.com/adamziel/playground-content-converters
+-   Markdown -> Playground workflow _and WordPress plugins_: https://github.com/adamziel/playground-docs-workflow
+-   _Edit Visually_ browser extension for bringing data in and out of Playground: https://github.com/WordPress/playground-tools/pull/298
+-   _Try WordPress_ browser extension that imports existing WordPress and non-WordPress sites to Playground: https://github.com/WordPress/try-wordpress/
+-   Humanmade WXR importer designed by @rmccue: https://github.com/humanmade/WordPress-Importer
+
+### Related resources
+
+-   Site transfer protocol: https://core.trac.wordpress.org/ticket/60375
+-   Solving rewriting site URLs in WordPress using the HTML API and URL parser: https://github.com/WordPress/data-liberation/discussions/74
+-   WordPress for docs (importing architecture): https://github.com/WordPress/wordpress-playground/discussions/1524
+-   Collaborative editing in Gutenberg: https://github.com/WordPress/gutenberg/discussions/65012
+
+### Repository structure
+
+The structure of this project is an open exploration and will change multiple times.
+
+It consists of the following parts:
+
+-   First-party libraries, e.g. streaming parsers
+-   WordPress plugins where those libraries are used, e.g. content importers
+-   Third party libraries installed via Composer, e.g. a URL parser
+
+**Structural goals:**
+
+-   Publish each library as a separate Composer package
+-   Publish each WordPress plugin separately (perhaps a single plugin would be the most useful?)
+-   No duplication of libraries between WordPress plugins
+-   Easy installation in Playground via Blueprints, e.g. no `composer install` required
+-   Compatibility with different Playground runtimes (web, CLI) and versions of WordPress and PHP
+
+**Ideas:**
+
+-   Use Composer dependency graph to automatically resolve dependencies between libraries and WordPress plugins
+-   or use WordPress "required plugins" feature to manage dependencies
+-   or use Blueprints to manage dependencies
diff --git a/packages/playground/data-liberation/RATIONALE.md b/packages/playground/data-liberation/RATIONALE.md
@@ -0,0 +1,73 @@
+## Data Liberation
+
+The Data Liberation project aims to unlock powerful new use cases for WordPress.
+
+### Wait, a PHP project inside a TypeScript monorepo?
+
+Is this a weird setup? Sure! But is it useful? YES!
+This way PHP code can be easily developed and tested with all the
+official WordPress Playground runtimes.
+
+## Rationale
+
+### Why work on data tools?
+
+WordPress core _really_ needs reliable data migration tools. There's just no reliable, free, open source solution for:
+
+-   Content import and export
+-   Site import and export
+-   Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or Tumblr -> WordPress
+-   Site-to-site synchronization
+
+Yes, there's the WXR content export. However, it won't help you back up a photography blog full of media files, plugins, API integrations, and custom tables. There are paid products out there, but nothing in core.
+
+At the same time, so many Playground use-cases are **all about moving your data**. Exporting your site as a zip archive, migrating between hosts with the [Data Liberation browser extension](https://github.com/WordPress/try-wordpress/), creating interactive tutorials and showcasing beautiful sites using [the Playground block](https://wordpress.org/plugins/interactive-code-block/), previewing Pull Requests, building new themes, and [editing documentation](https://github.com/WordPress/wordpress-playground/discussions/1524) are just the tip of the iceberg.
+
+### Why are there no existing data tools?
+
+Moving data around seems easy, but it's a complex problem – consider migrating links.
+
+Imagine you're moving a site from [https://my-old-site.com](https://playground-site-1.com) to [https://my-new-site.com/blog/](https://my-site-2.com). If you just moved the posts, all the links would still point to the old domain, so you'll need an importer that can adjust all the URLs in your entire database. However, typical tools like `preg_replace` or `wp search-replace` can only replace some URLs correctly. They won't reliably adjust deeply encoded data, such as a URL inside JSON inside an HTML comment inside a WXR export.
+
+The only way to perform a reliable replacement here is to carefully parse each and every data format and replace the relevant parts of the URL at the bottom of it. That requires four parsers: an XML parser, an HTML parser, a JSON parser, and a WHATWG URL parser. Most of those tools don't exist in PHP. PHP provides `json_decode()`, which isn't free of issues, and that's it. You can't even rely on DOMDocument to parse XML because of its limited availability and non-streaming nature.
+
+### Why build it in Playground?
+
+Playground gives us a lot for free:
+
+-   **Customer-centric environment.** The need to move data around is so natural in Playground. So many people asked for reliable WXR imports, site exports, synchronization with git, and the ability to share their Playground. Playground allows us to get active users and customer feedback every step of the way.
+-   **Free QA**. Anyone can share a testing link and easily report any problems they found. Playground is the perfect environment to get ample, fast-moving feedback.
+-   **Space to mature the API**. Playground doesn’t provide the same backward compatibility guarantees as WordPress core. It's easy to prototype a parser, find a use case where our design breaks down, and start over.
+-   **Control over the runtime.** Playground can lean on PHP extensions to validate our ideas, test them on simulated slow hardware, and ship them to a tablet to see how they do when the app goes into the background and the internet is flaky.
+
+Playground enables methodically building spec-compliant software to create the solid foundation WordPress needs.
+
+## The way there
+
+### What needs to be built?
+
+There's been a lot of [gathering information, ideas, and tools](https://core.trac.wordpress.org/ticket/60375). This section is based on 10 years' worth of site transfer problems, WordPress synchronization plugins, chats with developers, existing codebases, past attempts at data importing, non-WordPress tools, discussions, and more.
+
+WordPress needs parsers. Not just any parsers: they must be streaming, re-entrant, fast, standards-compliant, and tested using a large body of possible inputs. The data synchronization tools must account for data conflicts, WordPress plugins, invalid inputs, and unexpected power outages. The errors must be non-fatal, retryable, and allow manual resolution by the user. No data loss, ever. The transfer target site should be usable as early as possible and show no broken links or images during the transfer. That's the gist of it.
+
+A number of parsers have already been prototyped. There's even [a reliable URL rewriting library](https://github.com/adamziel/site-transfer-protocol). Here's a bunch of early drafts of specific streaming use-cases:
+
+-   [A URL parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
+-   [A block markup parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
+-   [An XML parser](https://github.com/WordPress/wordpress-develop/pull/6713), also explored by @dmsnell and @jonsurrell
+-   [A Zip archive parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
+-   [A multihandle HTTP client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php) without curl dependency
+-   [A MySQL query parser](https://github.com/WordPress/sqlite-database-integration/pull/157) started by @zieladam and now explored by @janjakes
+-   [A stream chaining API](https://github.com/adamziel/wxr-normalize/pull/1) to connect all these pieces
+
+On top of that, WordPress core now has an HTML parser, and @dmsnell has been exploring a [UTF-8](https://github.com/WordPress/wordpress-develop/pull/6883) decoder that would enable fast and regex-less URL detection in long data streams.
+
+There are still technical challenges to figure out, such as how to pause and resume the data streaming. As this work progresses, you'll start seeing incremental improvements in Playground. One possible roadmap is shipping a reliable content importer, then a reliable site zip importer and exporter, then cloning a site, and then extending towards full-featured site transfers and synchronization.
+
+### How soon can it be shipped?
+
+**The work is structured to ship a progression of meaningful user flows.** For example, the [Try WordPress extension](https://github.com/WordPress/try-wordpress/) can already give you a Playground site, even if you cannot migrate it to another WordPress site just yet.
+
+**At the same time, we'll take the time required to build rigorous, reliable software**. We may ship an early version of this or that parser once we're comfortable with its architecture, but we are not rushing the architecture. That would jeopardize the entire project. We're aiming for a solid design that will serve WordPress for years.
+
+The progress will be communicated in the open, while maintaining feedback loops and using the work to ship new Playground features.
diff --git a/packages/playground/data-liberation/README.md b/packages/playground/data-liberation/README.md
@@ -0,0 +1,9 @@
+## Data Liberation
+
+This project aims to help the Data Liberation project and unlock powerful new
+use cases for WordPress. See [the rationale](RATIONALE.md) and [the plan](PLAN.md)
+for more details.
+
+### Getting started
+
+No instructions yet. TBD.
diff --git a/packages/playground/data-liberation/package.json b/packages/playground/data-liberation/package.json
@@ -0,0 +1,6 @@
+{
+	"name": "@wp-playground/data-liberation",
+	"version": "0.0.1",
+	"description": "",
+	"private": true
+}
diff --git a/packages/playground/data-liberation/project.json b/packages/playground/data-liberation/project.json
@@ -0,0 +1,7 @@
+{
+	"name": "playground-data-liberation",
+	"$schema": "../../../node_modules/nx/schemas/project-schema.json",
+	"sourceRoot": "packages/playground/data-liberation",
+	"projectType": "library",
+	"targets": {}
+}