Defining the Data Flow #101
---
This is great, thanks for taking the time to write it down. The general ideas are in line with how I picture things working as well. I don't have a full view of how the later parts (3rd-party plugins, crawling, retries, testing) would work in practice, but in general I think the way you present it here makes sense. Regardless, I think that if we build a solid foundation in terms of how data is stored, these later parts will easily fit into the model.

Regarding the PHP plugin, at first sight the way you describe it makes sense; @ashfame probably has a clearer view than I do.

The way you describe the browser extension makes sense to me and is in line with the mental model I had.

About post types, I also agree with the general idea, though there are some things where it's not super clear to me how they would work in practice. Specifically navigation:

```json
"navigation": {
  "navigation": "String"
}
```

I think that navigation would probably need to be some sort of structured data (as you have also suggested recently, I believe), e.g.:

```json
"navigation": {
  "links": [
    { "text": "string", "href": "?" }
  ]
}
```

Could you provide an example of how navigation would be represented if it were a flat string instead of an array of links?

Something else that I think is related to this: navigation represents a different thing than the other types (post, page, product, etc). This leads me to believe that maybe we should treat navigation (and other "layout" things) differently, and not try to make it fit into the model we define for things that are post types.
---
The post types JSON seems like a list of fields associated with each post_type that we want to declare support for. I agree that it should be pretty easy to add support for more content types by just specifying the fields we want to collect, but in the implementation it often unpacks into more than that.

In order to add support for more content types, updating an array and updating basic code in multiple files is kinda the same thing. If one has to orchestrate changes across multiple files, that's undesirable, I agree, but quickly copying over template files and just editing them to add to the "definition" is an OK choice, IMO. With time, as we handle more and more edge cases, we will slowly converge to what you describe.

Agree! That's what we are working towards 👍

Currently we receive data in a single post type. The mental model I have is: we don't ask or require the user to install any plugins whatsoever, and lead them straight into liberating their data. We choose sane defaults for them, such as a plugin for product, a plugin for restaurant menu, etc., so they get something to show up in the preview. Later on, within the playground itself or after moving to a host, they can install any plugin which can offer to re-transform that data. So, offering a hook like the one you describe makes sense.

I believe crawling in our context is a job better suited for the frontend to handle. It's a bit different from traditional crawling: we would be instructing the browser tab to visit a URL instead of fetching it directly, and operating on the HTML as conveyed (possibly after correction) by the browser rather than the raw HTML from the page source, so that only leaves the task of parsing links out of the HTML.

Till now, we haven't used any transformation logic other than relying on Gutenberg's `paste_handler`, so it remains to be seen how this will look in practice, though I agree with the direction here.
---
This is a draft for discussion; the final version will be put in an issue.
I understand that we have a bit of misalignment on how the data would be represented, how to generically design our infrastructure, and where the connection points are. Here is what I believe is a generic and powerful way of providing this.
Post Types
These are defined in a JSON file that looks something like this:
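(The exact fields are still up for discussion; the `navigation` entry and the `post`/`page` fields match what's described below, while the `product` entry and the field types are only illustrative.)

```json
{
  "post": {
    "title": "String",
    "date": "Date",
    "author": "String",
    "content": "HTML"
  },
  "page": {
    "title": "String",
    "content": "HTML"
  },
  "navigation": {
    "navigation": "String"
  },
  "product": {
    "title": "String",
    "price": "String",
    "description": "HTML"
  }
}
```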
Browser Extension
The extension co-creates the extraction definitions with the user for each post type that the user chooses. These user choices can later be automated through a site definition file that can be provided by the community.
The interface for the user to choose this is one where they select the relevant elements on the page and map them to the fields of the chosen post type (for example via CSS selectors).
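A community-provided site definition file could then capture those choices. Purely as a sketch, with made-up selectors and keys:

```json
{
  "post": {
    "title": "h1.entry-title",
    "date": "time.entry-date",
    "author": ".author .name",
    "content": ".entry-content"
  },
  "navigation": {
    "navigation": "nav.primary-menu"
  }
}
```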
After the content types have been provided, crawling would happen (see below).
PHP Plugin
The server-side plugin offers to receive any of the post types above. This is done by registering the post types dynamically, based on the JSON above. By convention, the extension and plugin agree on a naming of the post types, for example by prefixing `liberated_data_`.
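A rough sketch of the registration (file name and arguments are only illustrative):

```php
// Register one custom post type per entry in the post types JSON.
// (This would run on the `init` hook.)
$definitions = json_decode( file_get_contents( __DIR__ . '/post-types.json' ), true );

foreach ( $definitions as $post_type => $fields ) {
	register_post_type( 'liberated_data_' . $post_type, array(
		'public'       => false,
		'show_in_rest' => true, // exposes the standard REST endpoint used below
		'supports'     => array( 'title', 'editor', 'custom-fields' ),
	) );
}
```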
The extension saves the extracted post types through the standard REST API that WordPress provides automatically for custom post types, passing along the extracted fields.
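For example, for a blog post the extension might `POST` to `/wp-json/wp/v2/liberated_data_post` with a body roughly like this (the `meta` key is an assumption; nothing here is final):

```json
{
  "status": "publish",
  "title": "Hello world",
  "date": "2024-01-15T10:30:00",
  "content": "<p>The extracted post body…</p>",
  "meta": {
    "origin_url": "https://example.com/2024/01/hello-world/"
  }
}
```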
Because we're using a standard endpoint, WordPress stores the data in the posts table and the metadata in the post_meta table. It might be better to implement our own endpoint to make it easier to provide previews, so relying on the standard endpoint is not a must.
Upon insert, the plugin triggers actions that can be received by other plugins to transform the extracted data into data structures that can be used by the plugins. For example, it would run something like:
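Something along these lines, where the hook name is just a placeholder:

```php
// Hypothetical hook name: one action per extracted post type, so that other
// plugins can turn the raw extracted fields into their own structures.
do_action( 'liberated_data_inserted_' . $post_type, $post_id, $extracted_fields );
```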
For features that are provided by WordPress out of the box, the plugin handles the received action from above itself. In particular (a rough sketch of the `post` case follows below):

- `post` (using `title`, `date`, `author`, and `content`, where authors would be created as users automatically if they don't exist yet).
- `page` (using just `title` and `content`).
- `template_part` for the current plugin.

Media should also be uploaded by the extension. This still needs to be defined, but it could work similarly, by providing the full origin URL and replacing it in the extracted post types.
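As a sketch of the `post` case (hook, field names, and author handling are placeholders; error handling omitted):

```php
// Placeholder hook name, matching the do_action sketch above.
add_action( 'liberated_data_inserted_post', function ( $post_id, $fields ) {
	// Create the author as a user if they don't exist yet.
	$author    = get_user_by( 'login', $fields['author'] );
	$author_id = $author ? $author->ID : wp_create_user( $fields['author'], wp_generate_password() );

	// Create a regular post out of the extracted fields.
	wp_insert_post( array(
		'post_type'    => 'post',
		'post_status'  => 'publish',
		'post_title'   => $fields['title'],
		'post_date'    => $fields['date'],
		'post_author'  => $author_id,
		'post_content' => $fields['content'],
	) );
}, 10, 2 );
```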
3rd party plugins
They can also register for the hook and, upon receiving it, create their own representation of the data. For example, a shopping plugin would accept products, and a restaurant plugin would read menus and contact data to provide "how to get there" functionality.
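For example, a shopping plugin could do something like this (the plugin function and field names are hypothetical):

```php
add_action( 'liberated_data_inserted_product', function ( $post_id, $fields ) {
	// The shopping plugin builds its own product out of the raw extracted data.
	my_shop_create_product( array( // hypothetical plugin function
		'name'        => $fields['title'],
		'price'       => $fields['price'],
		'description' => $fields['description'],
	) );
}, 10, 2 );
```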
Recursing / Crawling
While the extension is in charge of extracting the data (because it can rely on the browser's rendered DOM representation), the server-side plugin provides the list of pages to be crawled. It will prioritize the pages based on the available information, for example the items from the navigation menu. When possible it will provide an expected post type, but the extension should offer an autodetect mode where, through the selectors, the likelihood of a specific page type can be determined, or the user can be asked for help.
The extension would then submit back the extracted data and request the next page to be crawled.
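The exchange could be as small as a hypothetical "next page" response from the plugin (endpoint and fields are made up for illustration):

```json
{
  "url": "https://example.com/products/blue-mug/",
  "expected_post_type": "product",
  "remaining": 42
}
```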
Retrying the Import
Because the data is first stored in `liberated_data_*` post types, the user can choose to retry an import by wiping out all the other posts and re-running the hooks as if the data had just been extracted. This would be offered by the plugin through an admin UI.

This can also be used to defer the choice of a specialized plugin: the user can import products and then try various shopping plugins to see which one works best for their data.
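A sketch of what that retry could look like on the plugin side (function and hook names are placeholders):

```php
// Replay the insert action for every stored liberated_data_* post, as if the
// data had just been extracted again (same placeholder hook name as above).
function liberated_data_retry_import( array $post_types ) {
	foreach ( $post_types as $post_type ) {
		$raw_posts = get_posts( array(
			'post_type'   => 'liberated_data_' . $post_type,
			'post_status' => 'any',
			'numberposts' => -1,
		) );

		foreach ( $raw_posts as $raw_post ) {
			// Fields would be re-read from post meta; the shape is simplified here.
			do_action( 'liberated_data_inserted_' . $post_type, $raw_post->ID, get_post_meta( $raw_post->ID ) );
		}
	}
}
```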
Testing Infrastructure
The above also enables test-driven development: extracted data can be serialized to a file (a fixture) and then re-run to ensure it produces the desired output.
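For instance, a loose sketch assuming `WP_UnitTestCase` and the placeholder hook from above:

```php
class Test_Product_Import extends WP_UnitTestCase {
	public function test_products_fixture_creates_shop_products() {
		// The fixture is serialized extracted data, replayed through the hook.
		$items = json_decode( file_get_contents( __DIR__ . '/fixtures/products.json' ), true );

		foreach ( $items as $item ) {
			$raw_id = self::factory()->post->create( array( 'post_type' => 'liberated_data_product' ) );
			do_action( 'liberated_data_inserted_product', $raw_id, $item );
		}

		// The shopping plugin under test should now have created its own products.
		$this->assertNotEmpty( get_posts( array( 'post_type' => 'product', 'post_status' => 'any' ) ) );
	}
}
```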
It is also possible for plugins to extract more than the structured data provided from the HTML. This could be done by specialized plugins that augment the structured data coming from the extension, or by plugins that support more properties than are available in our default structured data.
cc @psrpinto @ashfame