Defining the Data Flow #101
---
This is great, thanks for taking the time to write it down. The general ideas are in line with how I picture things working as well. I don't have a full view of how the later parts (3rd-party plugins, crawling, retries, testing) would work in practice, but in general I think the way you present it here makes sense. Regardless, I think that if we build a solid foundation in terms of how data is stored, these later parts will easily fit into the model.

Regarding the PHP plugin, at first sight the way you describe it makes sense; @ashfame probably has a clearer view than I do.

The way you describe the browser extension makes sense to me and is in line with the mental model I had.

About post types, I also agree with the general idea, though there are some things where it's not super clear to me how they would work in practice. Specifically navigation:

```json
"navigation": {
  "navigation": "String"
}
```

I think that navigation would probably need to be some sort of structured data (as you have also suggested recently, I believe), e.g.:

```json
"navigation": {
  "links": [
    { "text": "string", "href": "?" }
  ]
}
```

Could you provide an example of how navigation would be represented if it were a flat string instead of an array of links?

Something else that I think is related to this: navigation represents a different thing than the other types (post, page, product, etc). This leads me to believe that maybe we should treat navigation (and other "layout" things) differently, and not try to make it fit into the model we define for things that are post types.
---
The post types JSON seems like a list of fields associated with each post_type that we want to declare support for. I agree that it should be pretty easy to add support for more content types by just specifying the fields we want to collect, but in the implementation it often unpacks into more than that.

In order to add support for more content types, updating an array and updating basic code in multiple files is kinda the same thing. If one has to orchestrate changes across multiple files, that's undesirable, I agree, but quickly copying over template files and just editing them to add to the "definition" is an OK choice, IMO. With time, as we handle more and more edge cases, we will slowly converge to what you describe.

Agree! That's what we are working towards 👍

Currently we receive data in a single post type. The mental model I have is: we don't ask or require the user to install any plugins whatsoever, and lead them straight into liberating their data. We choose sane defaults for them, such as a plugin for product, a plugin for restaurant menu, etc., so they get something to show up in the preview. Later on, within the playground itself or after moving to a host, they can install any plugin which can offer to re-transform that data. So, offering a hook like the one you describe makes sense.

I believe crawling in our context is a job better suited for the frontend to handle. It's a bit different from traditional crawling: we would be instructing the browser tab to visit a URL instead of fetching it directly, and operating on the HTML as conveyed (possibly after correction) by the browser rather than the raw HTML from the page source, so that only leaves the task of parsing links out of the HTML.

Till now, we haven't used any transformation logic other than relying on Gutenberg's `paste_handler`, so it remains to be seen how this will look in practice, though I agree with the direction here.
---
This is a draft for discussion; the final version will be put in an issue.
I understand that we have a bit of misalignment on how the data would be represented, how to generically design our infrastructure, and where the connection points are. Here is what I believe is a generic and powerful way of providing this.
Post Types
These are defined in a JSON file that looks something like this:
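(The exact fields are still up for discussion; the `navigation` entry and the `post`/`page` fields match what's described below, while the `product` entry and the field types are only illustrative.)

```json
{
  "post": {
    "title": "String",
    "date": "Date",
    "author": "String",
    "content": "HTML"
  },
  "page": {
    "title": "String",
    "content": "HTML"
  },
  "navigation": {
    "navigation": "String"
  },
  "product": {
    "title": "String",
    "price": "String",
    "description": "HTML"
  }
}
```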
Browser Extension
The extension co-creates the extraction definitions with the user for each post type that the user chooses. These user choices can later be automated through a site definition file that can be provided by the community.
The interface for the user to choose this is one where they select the relevant elements on the page and map them to the fields of the chosen post type (for example via CSS selectors).
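A community-provided site definition file could then capture those choices. Purely as a sketch, with made-up selectors and keys:

```json
{
  "post": {
    "title": "h1.entry-title",
    "date": "time.entry-date",
    "author": ".author .name",
    "content": ".entry-content"
  },
  "navigation": {
    "navigation": "nav.primary-menu"
  }
}
```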
After the content types have been provided, crawling would happen (see below).
PHP Plugin
The server-side plugin offers to receive any of the post types above. This is done by registering the post types dynamically, based on the JSON above. By convention, the extension and plugin agree on a naming of the post types, for example by prefixing `liberated_data_`.
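A rough sketch of the registration (file name and arguments are only illustrative):

```php
// Register one custom post type per entry in the post types JSON.
// (This would run on the `init` hook.)
$definitions = json_decode( file_get_contents( __DIR__ . '/post-types.json' ), true );

foreach ( $definitions as $post_type => $fields ) {
	register_post_type( 'liberated_data_' . $post_type, array(
		'public'       => false,
		'show_in_rest' => true, // exposes the standard REST endpoint used below
		'supports'     => array( 'title', 'editor', 'custom-fields' ),
	) );
}
```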
The extension saves the extracted post types through the standard REST API that WordPress provides automatically for custom post types, passing along the extracted fields.
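For example, for a blog post the extension might `POST` to `/wp-json/wp/v2/liberated_data_post` with a body roughly like this (the `meta` key is an assumption; nothing here is final):

```json
{
  "status": "publish",
  "title": "Hello world",
  "date": "2024-01-15T10:30:00",
  "content": "<p>The extracted post body…</p>",
  "meta": {
    "origin_url": "https://example.com/2024/01/hello-world/"
  }
}
```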
Because we're using a standard endpoint, WordPress stores the data in the posts table and the metadata in the post_meta table. It might be better to implement our own endpoint to make it easier to provide previews, so relying on the standard endpoint is not a must.
Upon insert, the plugin triggers actions that can be received by other plugins to transform the extracted data into data structures that can be used by the plugins. For example, it would run something like:
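Something along these lines, where the hook name is just a placeholder:

```php
// Hypothetical hook name: one action per extracted post type, so that other
// plugins can turn the raw extracted fields into their own structures.
do_action( 'liberated_data_inserted_' . $post_type, $post_id, $extracted_fields );
```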
For features that are provided by WordPress out of the box, the plugin handles the received action from above itself. In particular (a rough sketch of the `post` case follows below):

- `post` (using `title`, `date`, `author`, and `content`, where authors would be created as users automatically if they don't exist yet).
- `page` (using just `title` and `content`).
- `template_part` for the current plugin.

Media should also be uploaded by the extension. This still needs to be defined, but it could work similarly, by providing the full origin URL and replacing it in the extracted post types.
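As a sketch of the `post` case (hook, field names, and author handling are placeholders; error handling omitted):

```php
// Placeholder hook name, matching the do_action sketch above.
add_action( 'liberated_data_inserted_post', function ( $post_id, $fields ) {
	// Create the author as a user if they don't exist yet.
	$author    = get_user_by( 'login', $fields['author'] );
	$author_id = $author ? $author->ID : wp_create_user( $fields['author'], wp_generate_password() );

	// Create a regular post out of the extracted fields.
	wp_insert_post( array(
		'post_type'    => 'post',
		'post_status'  => 'publish',
		'post_title'   => $fields['title'],
		'post_date'    => $fields['date'],
		'post_author'  => $author_id,
		'post_content' => $fields['content'],
	) );
}, 10, 2 );
```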
3rd party plugins
They can also register for the hook and, upon receiving it, create their own representation of the data. For example, a shopping plugin would accept products, and a restaurant plugin would read menus and contact data to provide "how to get there" functionality.
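For example, a shopping plugin could do something like this (the plugin function and field names are hypothetical):

```php
add_action( 'liberated_data_inserted_product', function ( $post_id, $fields ) {
	// The shopping plugin builds its own product out of the raw extracted data.
	my_shop_create_product( array( // hypothetical plugin function
		'name'        => $fields['title'],
		'price'       => $fields['price'],
		'description' => $fields['description'],
	) );
}, 10, 2 );
```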
Recursing / Crawling
While the extension is in charge of extracting the data (because it can rely on the browser's rendered DOM representation), the server-side plugin provides the list of pages to be crawled. It will prioritize the pages based on the available information, for example the items from the navigation menu. When possible it will provide an expected post type, but the extension should offer an autodetect mode where, through the selectors, the likelihood of a specific page type can be determined, or the user can be asked for help.
The extension would then submit back the extracted data and request the next page to be crawled.
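The exchange could be as small as a hypothetical "next page" response from the plugin (endpoint and fields are made up for illustration):

```json
{
  "url": "https://example.com/products/blue-mug/",
  "expected_post_type": "product",
  "remaining": 42
}
```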
Retrying the Import
Because the data is first stored in `liberated_data_*` post types, the user can choose to retry an import by wiping out all the other posts and re-running the hooks as if the data had just been extracted. This would be offered by the plugin through an admin UI.

This can also be used to defer the choice of a specialized plugin: the user can import products and then try various shopping plugins to see which one works best for their data.
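A sketch of what that retry could look like on the plugin side (function and hook names are placeholders):

```php
// Replay the insert action for every stored liberated_data_* post, as if the
// data had just been extracted again (same placeholder hook name as above).
function liberated_data_retry_import( array $post_types ) {
	foreach ( $post_types as $post_type ) {
		$raw_posts = get_posts( array(
			'post_type'   => 'liberated_data_' . $post_type,
			'post_status' => 'any',
			'numberposts' => -1,
		) );

		foreach ( $raw_posts as $raw_post ) {
			// Fields would be re-read from post meta; the shape is simplified here.
			do_action( 'liberated_data_inserted_' . $post_type, $raw_post->ID, get_post_meta( $raw_post->ID ) );
		}
	}
}
```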
Testing Infrastructure
The above also enables test-driven development: extracted data can be serialized to a file (a fixture) and then re-run to ensure it produces the desired output.
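For instance, a loose sketch assuming `WP_UnitTestCase` and the placeholder hook from above:

```php
class Test_Product_Import extends WP_UnitTestCase {
	public function test_products_fixture_creates_shop_products() {
		// The fixture is serialized extracted data, replayed through the hook.
		$items = json_decode( file_get_contents( __DIR__ . '/fixtures/products.json' ), true );

		foreach ( $items as $item ) {
			$raw_id = self::factory()->post->create( array( 'post_type' => 'liberated_data_product' ) );
			do_action( 'liberated_data_inserted_product', $raw_id, $item );
		}

		// The shopping plugin under test should now have created its own products.
		$this->assertNotEmpty( get_posts( array( 'post_type' => 'product', 'post_status' => 'any' ) ) );
	}
}
```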
It is also possible for plugins to extract more than the structured data provided from the HTML. This could be done by specialized plugins that augment the structured data coming from the extension, or by plugins that support more properties than are available in our default structured data.
cc @psrpinto @ashfame