Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question if PR will be accepted #3

Open
MarGul opened this issue Dec 18, 2020 · 7 comments
Open

Question if PR will be accepted #3

MarGul opened this issue Dec 18, 2020 · 7 comments

Comments

@MarGul
Copy link

MarGul commented Dec 18, 2020

Hi guys.

Not sure how I can request a feature before I build it to see if it will be accepted. At ThingConverter you are doing the right thing when ignoring a type of https://schema.org./product with a lowercase first letter.

Unfortunately I'm crawling a site that do not respect the full schema.org contract and are using a lowercase first letter. If I change the preg_match in ThingsConverter to preg_match('/https?\:\/\/schema.org\/([a-zA-z0-9]+)$/', $schemaOrgId, $matches) === 1 it all works because now it doesn't have to be an uppercase first letter.

Would this be OK otherwise I will have to extend the package by creating my own SchemaReader and own ThingsConverter which would be fine but would you accept this change?

Thanks for a great package!

@BenMorel
Copy link
Member

Hi, it depends on whether the thing names are case-sensitive, in which case I'd be happy to accept a PR to change this.
Can you find a reference somewhere?

@MarGul
Copy link
Author

MarGul commented Dec 18, 2020

Thanks for your quick response @BenMorel . I should have done this research before but it looks like Schema is case-sensitive so we shouldn't change this. This is from Google Note: Schema.org markup can be used on web pages written in any language. The markup, like HTML, is in English. Schema.org values are case-sensitive..

What do you think is the best way for me to solve this? It will be hard but maybe I can parse the HTML before I send it to the ThingsConverter and look for all https://schema.org/product and change it to https://schema.org/Product. Just a little bit concerned if I have to do this for every corner-case that I find.

@BenMorel
Copy link
Member

Schema.org values are case-sensitive.

I'm still not sure what they understand by values. I'd rather try to find this information on https://schema.org/ if it exists!

Otherwise, TBH I'm not sure what value case-sensitivity brings, I mean, nobody will ever create a schema.org/product to compete with schema.org/Product, so even if the spec says that it should be case-sensitive, I wouldn't be against a boolean configuration somewhere to make it lenient and therefore case-insensitive.

But let's first be 100% sure that it's supposed to be case-sensitive, before introducing a setting.

Also please test the output of the Google Structured Data Testing Tool and Yandex Structured data validator to see how they behave with a lowercase product.

Report your findings here, and we'll decide what to do!

@MarGul
Copy link
Author

MarGul commented Dec 18, 2020

I agree. Don't see why we should completely dismiss the item if it's a little bit misspelled. Maybe go with just checking https://schema.org like you say would be sufficient. I think so.

The Yandex tool was down for me but with Google Structured Data Testing Tool it does recognize https://schema.org/product as a Thing. This is the URL that I'm using https://www.telia.se/privat/telefoni/telefoner/produkt/sony-xperia-5-ll.

Personally I don't think that this needs a setting because it's pretty clear that they have tried to put in structured data and Google are accepting it as well.

@BenMorel
Copy link
Member

I can confirm that Google handles it case-insensitively.

I could not find a better source for case sensitivity, but this link says:

While Schema.org vocabulary is case sensitive (and you can enable this option), Google is not as strict – so this isn’t required for Google rich result features and their understanding of structured data.

I think it could make sense to make case sensitivity an option, still.

@MarGul
Copy link
Author

MarGul commented Dec 19, 2020

I agree that a config wouldn't hurt but I think default should be case-insensitively. Where do you think a config like this should reside?

Do you then find it sufficient to "only" build up the preg_match string depending on this config at ThingsConverter like

$pattern = $caseInsensitively ? '[a-zA-z0-9]' : '[A-Za-z0-9]';

if (preg_match("/https?\:\/\/schema.org\/([{$pattern}]+)$/", $schemaOrgId, $matches) === 1) {
    //
}

@jhard
Copy link

jhard commented Jan 28, 2023

This would probably have to go into ObjectFactory, since only there it will be known what properties exist. The regexp in ThingsConverter already matches both, the order in the character group does not matter.
In ObjectFactory, it could be something like this to determine the types in the build function:

        $types = array_map(function(string $type) {
           if(isset($this->propertiesByType[$type])) {
              return $type;
           }
           foreach(array_keys($this->propertiesByType) as $potentialType) {
              if(strtolower($potentialType) === strtolower($type)) {
                 return $potentialType;
              }
           }
           return null;
        }, $types);
       
        $types = array_filter(array_values($types));

Which isn't super efficient, but it'll get the job done. I guess the best way to make it configurable would be to have the config live as a property on SchemaReader and then pass it as a parameter into ThingConverter, which would pass it into ObjectFactory, wouldn't it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants