
Lists shared between graphs are not correctly serialized #357

Closed
lanthaler opened this issue Jul 23, 2014 · 18 comments

@lanthaler
Member

_Reported by @afs:_

We are encountering an issue when converting RDF Datasets to JSON-LD.

The problem is with blank nodes that are shared between graphs and lists.

In TriG (yes, this is a synthetic reduced test case that captures a
smaller example that might appear for real):

# Bnode references across graph and lists
PREFIX :        <http://www.example.com/>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

:G {
   # Written in short form it would be:
   # :z :q ("cell-A" "cell-B")
   # but we want to share the tail ("cell-B")

   :z :q _:z0 .

   _:z0   rdf:first "cell-A" .
   _:z0   rdf:rest  _:z1 .

   _:z1   rdf:first "cell-B" .
   _:z1   rdf:rest rdf:nil .
}

:G1 {
    # This references the tail  ("cell-B")
    :x :p _:z1 .
}

The triple in :G1 references into the list in :G.

But as we understand the conversion algorithm, section 4 only considers
each graph in turn and so does not see the cross graph sharing.

Is this a correct reading of the spec text?

Part 4 of the conversion algorithm has
"For each name and graph object in graph map: "

so 4.3.3.* walks back up the list in one graph only.

(Conversion generated by jsonld-java : it does not matter if compaction
is applied or not):

{
   "@graph" : [ {
     "@graph" : [ {
       "@id" : ":z",
       ":q" : {
         "@list" : [ "cell-A", "cell-B" ]
       }
     } ],
     "@id" : ":G"
   }, {
     "@graph" : [ {
       "@id" : ":x",
       ":p" : {
         "@id" : "_:b1"
       }
     } ],
     "@id" : ":G1"
   } ],
   "@context" : {
     "@base" : "http://www.example.com/",
     "" : "http://www.example.com/",
     "rdf" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   }
}

There is no _:b1 in :G to refer to because the algorithm generated @list
and its implicit bNodes don't have labels.
This is a different dataset with no shared bNode.

If it is all the same graph (s/:G1/:G/), the RDF dataset structure is
correctly serialized.

Andy
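
The per-graph scoping Andy describes can be illustrated with a short sketch. This is not the spec algorithm, just a minimal model (hypothetical names throughout) showing why counting a list cell's references within one graph makes _:z1 look safe to fold into @list, while the whole dataset shows it is shared:

```python
# Minimal illustrative sketch (NOT the spec algorithm): a dataset as a
# dict of graph name -> list of (subject, predicate, object) triples.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

dataset = {
    ":G": [
        (":z", ":q", "_:z0"),
        ("_:z0", RDF + "first", '"cell-A"'),
        ("_:z0", RDF + "rest", "_:z1"),
        ("_:z1", RDF + "first", '"cell-B"'),
        ("_:z1", RDF + "rest", RDF + "nil"),
    ],
    ":G1": [
        (":x", ":p", "_:z1"),
    ],
}

def usage_count(node, graph_triples):
    """Count how many triples reference `node` as object across the
    given iterable of triple lists."""
    return sum(1 for triples in graph_triples
               for (_, _, o) in triples if o == node)

# Per-graph view (roughly what step 3.5.8.2 sees): _:z1 is referenced
# once inside :G, so it looks like an ordinary list cell ...
per_graph = usage_count("_:z1", [dataset[":G"]])
# ... but across the whole dataset it is shared, so folding it into
# @list silently drops the cross-graph link from :G1.
whole_dataset = usage_count("_:z1", dataset.values())

print(per_graph, whole_dataset)  # 1 2
```
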

@gkellogg
Member

Thanks @afs, this appears to be a bug in the Serialize RDF as JSON-LD Algorithm in JSON-LD 1.0 Processing Algorithms and API [1]. In particular, step 4.3.3 checks that a given node has exactly one rdf:first member and one rdf:rest member, that the node is a BNode, and that the value of rdf:first is a BNode, by looking in the usages member of the list node. The usages member is set up in step 3.5.8.2, but only within the context of a specific graph, not the dataset as a whole.

One way to fix this would be to insert a step between steps 3 and 4 to consolidate the usage maps for nodes across the dataset, or to extract the usage collection outside of an individual graph context.
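
The consolidation idea above can be sketched roughly as follows. This is a hypothetical illustration (the names `dataset`, `usages`, and `foldable` are mine, not the spec's): collect every usage across all graphs first, and only fold a list cell into @list if it is referenced exactly once in the entire dataset:

```python
from collections import defaultdict

# Illustrative sketch of consolidating node usages across the dataset
# before the list-folding step (hypothetical names, not spec text).
dataset = {
    ":G": [(":z", ":q", "_:z0"), ("_:z0", "rdf:rest", "_:z1")],
    ":G1": [(":x", ":p", "_:z1")],
}

# object node -> list of (graph, subject, predicate) referring to it
usages = defaultdict(list)
for graph_name, triples in dataset.items():
    for s, p, o in triples:
        usages[o].append((graph_name, s, p))

def foldable(node):
    """A list cell is only safe to fold into @list if it is referenced
    exactly once in the *entire* dataset."""
    return len(usages[node]) == 1

print(foldable("_:z0"), foldable("_:z1"))  # True False
```

With per-graph usage maps, _:z1 would also report a single usage and be folded, reproducing the bug.
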

I think what needs to happen is to add new tests to the fromRdf test suite that cover this use case, propose and develop an update to the algorithm, and get buy-in from the developer community that this is a reasonable fix, after which the algorithm update can be added to the errata for this document.

IMO, the appropriate solution will serialize to JSON-LD using rdf:first/rest/nil, rather than using the @list syntax, as that is the intention of step 4.3.3. As with Turtle lists, the JSON-LD @list structure is intended for fully conforming RDF Collections.

@dlongley
Member

+1 to what @gkellogg said.

gkellogg added a commit that referenced this issue Jul 27, 2014
gkellogg added a commit that referenced this issue Jul 27, 2014
@gkellogg
Member

I added a test, and updated spec/latest/json-ld-api/index.html; however, this currently won't show on the web site, as it redirects to the recommendation (through .htaccess, I believe).

I think we may want to remove that redirection, as the latest version does reference it; or we should come up with another location for edited specs, although the fact that it's /latest seems fairly clear to me.

The fix to the fromRdf algorithm is probably not the most elegant, but it is simple and does the job.

dlongley added a commit to digitalbazaar/jsonld.js that referenced this issue Jul 29, 2014
dlongley added a commit to digitalbazaar/pyld that referenced this issue Jul 29, 2014
dlongley added a commit to digitalbazaar/php-json-ld that referenced this issue Jul 29, 2014
@lanthaler
Member Author

This is a tricky question. It is certainly a bug in the sense that the result isn’t what we would like it to be. Nevertheless, the JSON-LD 1.0 spec algorithm is quite clear about how this is supposed to be processed, so all JSON-LD 1.0 conformant processors do exactly the same thing. Are we now changing this while the stable 1.0 spec says something else? Do we introduce a new processing mode (1.1? 2.0?)? I think we should really discuss this further on the mailing list (sending mail in a minute).

@gkellogg
Member

It's an erratum; we have to be able to fix obvious bugs, although the spec won't be revised until there's a new WG with that in its charter.

For my part, I think we should keep the "latest" versions on json-ld.org up to date with corrections. Changing these in no way invalidates the REC. AFAIK, there's still a CG, and the community can decide what to do with our version of the specs, as well as the other unreleased specs such as normalization and framing.

lanthaler added a commit that referenced this issue Sep 15, 2014
The algorithm update in 3f31c3c doesn't fully address the issue.

/cc @gkellogg @dlongley
lanthaler added a commit that referenced this issue Sep 15, 2014
lanthaler added a commit to lanthaler/JsonLD that referenced this issue Sep 15, 2014
@lanthaler
Member Author

As fromRdf-0021 in fc0e6bf shows, the algorithm updates that have been made are not sufficient to fix this bug. I addressed it slightly differently in lanthaler/JsonLD@10a03f7#diff-86a207a446f1f808e29f4a32dba82be1R2305: instead of storing usages for every node, I do that just for rdf:nil nodes. For all other nodes, the references are stored in what Gregg called the node usages map, under a key of the form graphName|subject|predicate. Please have a look and let me know if you are OK with that approach. If so, I'll go ahead and update the spec accordingly.

@lanthaler lanthaler reopened this Sep 15, 2014
@dlongley
Member

I left some comments that may simplify things in your diff. I'd also prefer the @usages map to be declared separately from the graphs var in the algorithm spec text because it isn't a graph itself and doesn't belong in the graphs var, IMO. It may also confuse some people, despite the fact that it's unlikely to cause an actual issue.

dlongley added a commit to digitalbazaar/jsonld.js that referenced this issue Sep 15, 2014
@dlongley
Member

@lanthaler, If you look at digitalbazaar/jsonld.js@6f0cc8d, you can see a simplified implementation (see the referencedOnce map usage) via what I described in comments on your diff.
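
The referencedOnce idea can be sketched in a few lines. This is an illustration of the general technique only, not the actual jsonld.js code: keep at most one usage record per node, and overwrite the entry with False the moment a second reference appears, so a simple truthiness check later decides whether @list folding is allowed:

```python
# Illustrative sketch of a `referencedOnce`-style map (NOT the actual
# jsonld.js implementation): node -> its single usage, or False once
# the node has been referenced more than once.
referenced_once = {}

def record_usage(node, usage):
    if node not in referenced_once:
        referenced_once[node] = usage   # first reference: remember it
    else:
        referenced_once[node] = False   # second reference: disqualify

record_usage("_:z0", (":G", ":z", ":q"))
record_usage("_:z1", (":G", "_:z0", "rdf:rest"))
record_usage("_:z1", (":G1", ":x", ":p"))

# Only nodes still holding a usage record are @list candidates.
print(referenced_once["_:z0"])  # (':G', ':z', ':q')
print(referenced_once["_:z1"])  # False
```

Compared with keeping a full usage list per node, this stores constant space per node while preserving the only fact the folding step needs.
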

@lanthaler
Member Author

Thanks for the feedback @dlongley. I fully agree that in the spec we shouldn’t put this information in the graphs object. I just had a quick look at your changes in jsonld.js and was wondering whether it handles duplicate triples in the input RDF document (both in the same and in different graphs) properly!?

@dlongley
Member

It passes all the tests! :)

@dlongley
Member

Yeah, it might need to check the existing entry (for a match) before just setting it to false.

@dlongley
Member

@lanthaler, hmm, can an abstract RDF dataset have duplicate triples? Because our algorithm's input is an RDF dataset. My processor automatically strips out duplicate triples before it ever gets to the fromRDF algorithm.

@lanthaler
Member Author

No, an abstract dataset can’t, but a serialized one can. In the spec we do not filter duplicate triples as far as I remember (will check tomorrow) and thus we need to take that into consideration I’d say.

@dlongley
Member

What I'm saying is that by the time the dataset reaches the fromRDF algorithm, it has already been deserialized. My processor, for instance, has already parsed the nquads and removed any duplicates.

The fromRDF algorithm doesn't say anything about the serialization format of the RDF dataset, rather it's an abstract dataset at that point, isn't it? Do you think we should say, in the lead up to the algorithm, that any duplicate triples have already been removed in the RDF dataset, as a precaution?

@dlongley
Member

In the spec we say that the input to the fromRDF algorithm is an RDF Dataset (defined by RDF Concepts) which is:

An RDF dataset is a collection of RDF graphs

And an RDF graph is:

An RDF graph is a set of RDF triples.

A set cannot contain duplicates. Now, this doesn't rule out triples that appear in different graphs, but in that case, a @list shouldn't be used as in one of the tests you just added for this issue (IOW, the object therein should be counted as being referenced more than once). So I don't think we have any issue here -- other than deciding whether or not we feel the need to explicitly reiterate that RDF datasets can't have duplicates in the prose before the fromRDF algorithm.
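
The set semantics above can be made concrete with a small sketch (illustrative only, hypothetical names): parsing quads into per-graph sets collapses duplicates within a graph automatically, while the same triple appearing in two different graphs survives as two distinct usages and must disqualify @list folding:

```python
# Illustrative sketch: an RDF graph is a *set* of triples, so a
# duplicate within one graph collapses on parse -- but the same triple
# in two different graphs is two distinct usages.
graphs = {":G": set(), ":G1": set()}

quads = [
    (":x", ":p", "_:z1", ":G"),
    (":x", ":p", "_:z1", ":G"),   # duplicate within :G -- collapses
    (":x", ":p", "_:z1", ":G1"),  # same triple, different graph -- kept
]
for s, p, o, g in quads:
    graphs[g].add((s, p, o))

total_usages = sum(1 for triples in graphs.values()
                   for (_, _, o) in triples if o == "_:z1")
print(len(graphs[":G"]), total_usages)  # 1 2
```
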

@gkellogg
Member

My implementation also passes all these tests, and I did use a global node usages map. I tried to keep changes to the algorithm minimal, but perhaps this was too minimal; still, my implementation, from which the algorithm updates were taken, does pass these tests. In fact, the amended text did say that the node usages map is initialized separately from the graphs. I'm a little unclear about exactly what the issue with the spec text is. Do you have alternate wording?

@gkellogg
Member

gkellogg commented Aug 1, 2018

Closed in favor of w3c/json-ld-api#13.

@gkellogg gkellogg closed this as completed Aug 1, 2018
@gkellogg
Member

gkellogg commented Aug 1, 2018

I agree with @dlongley that duplicates are handled by virtue of the RDF Dataset description, so the NQuads input would be normalized into a Dataset first, which would handle all triples. No need to duplicate this logic to deal with a serialization that may contain duplicates.
