
Query timeout limit reached while updating German Nouns #124

Closed · 2 tasks done
shashank-iitbhu opened this issue Mar 25, 2024 · 7 comments

Labels: -priority- High priority, bug (Something isn't working)

shashank-iitbhu (Contributor) commented Mar 25, 2024

Description

(scribedev) shashankmittal@ShashanksLaptop Scribe-Data % python3 src/scribe_data/extract_transform/wikidata/update_data.py '["German"]' '["nouns", "verbs"]' 
Data updated:   0%|                                                                                                                   | 0/2 [00:00<?, ?dirs/s]Querying and formatting German nouns
Data updated:   0%|                                                                                                                   | 0/2 [01:00<?, ?dirs/s]
Traceback (most recent call last):
  File "/Users/shashankmittal/Documents/Developer/scribe/Scribe-Data/src/scribe_data/extract_transform/wikidata/update_data.py", line 141, in <module>
    results = sparql.query().convert()
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py", line 1196, in convert
    return self._convertJSON()
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py", line 1059, in _convertJSON
    json_str = json.loads(self.response.read().decode("utf-8"))
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 320797 column 115 (char 6713171)

Query builder: Link

The query timeout limit is being reached, which is why results = sparql.query().convert() in update_data.py throws json.decoder.JSONDecodeError (Invalid control character at: line 320797 column 115 (char 6713171)): sparql.query().response contains the timeout error logs rather than valid JSON.
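A minimal sketch of how this failure manifests, assuming the standard SPARQLWrapper API and a trivial stand-in for the real German nouns query (not Scribe-Data's actual error handling): WDQS answers a timed-out query with a non-JSON error body, so convert() raises the JSONDecodeError seen above.

import json

from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
# Trivial stand-in query; the real German nouns query selects many more forms.
sparql.setQuery("SELECT ?lexeme WHERE { ?lexeme dct:language wd:Q188 . } LIMIT 5")

try:
    results = sparql.query().convert()
except json.decoder.JSONDecodeError as err:
    # The response body wasn't valid JSON; on WDQS this typically means the
    # 60-second execution deadline was hit and an error page came back instead.
    raise RuntimeError("Response was not JSON; the query likely timed out") from err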

Suggested Changes

  • Considered splitting the SPARQL query into smaller queries, such as one query for nouns and another for proper nouns, or querying for singular and plural forms separately.
  • This still hit the query timeout limit, as the total number of nouns and proper nouns for German is 165,869. Verified here.
  • Use LIMIT and OFFSET to split the query into multiple smaller ones (a rough sketch follows this list).
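A rough sketch of the LIMIT/OFFSET idea, assuming a hypothetical page size of 10,000 and a simplified noun query (the real query selects many more forms; Q188 is German, Q1084 is noun, Q131105/Q110786 are nominative/singular). Note that without an ORDER BY, WDQS doesn't guarantee stable page boundaries across requests.

from SPARQLWrapper import JSON, SPARQLWrapper

PAGE_SIZE = 10000  # hypothetical page size

query_template = """
SELECT ?lexeme ?nomSingular WHERE {{
  ?lexeme dct:language wd:Q188 ;                # German
          wikibase:lexicalCategory wd:Q1084 ;   # noun
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?nomSingular ;
        wikibase:grammaticalFeature wd:Q131105, wd:Q110786 .  # nominative, singular
}}
LIMIT {limit}
OFFSET {offset}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

results = []
offset = 0
while True:
    sparql.setQuery(query_template.format(limit=PAGE_SIZE, offset=offset))
    page = sparql.query().convert()["results"]["bindings"]
    results.extend(page)
    if len(page) < PAGE_SIZE:  # last (partial) page reached
        break
    offset += PAGE_SIZE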
shashank-iitbhu added the bug (Something isn't working) label on Mar 25, 2024
shashank-iitbhu (Contributor, Author) commented Mar 25, 2024

This can be reproduced by running:
python3 src/scribe_data/extract_transform/wikidata/update_data.py '["German"]' '["nouns", "verbs"]'
@andrewtavis Are you able to reproduce this issue?
If so, I can open a PR with the proposed changes.

andrewtavis (Member) commented Mar 25, 2024

I can confirm on my end, @shashank-iitbhu:

json.decoder.JSONDecodeError: Invalid control character at: line 320797 column 115 (char 6713171)

Two questions to decide on this:

  • Could you run a query for just nouns and one for just proper nouns and let us know what percentage of the total each is?
  • For LIMIT and OFFSET, I'm a bit worried it might not work, as the endpoint would still need to compute the full result set in order to then offset to the rows we want.
    • This would maybe be a solution if the result JSON were too large, but the issue here is query time.
    • I can check with folks at work on this 😇

All in all, it's great that you figured this out and suggested solutions! As you can see from the verbs queries, this is not the first time this has happened 😅

andrewtavis (Member) commented

Confirmed with the Wikidata team that splitting based on nouns and proper nouns would be the initial path forward, but offset could work if it continues to be problematic :)
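For reference, a hedged illustration of what that split might look like, with one query per lexical category; Q1084 (noun) and Q147276 (proper noun) are the relevant Wikidata items, and the SELECT clause is simplified relative to the real Scribe-Data queries.

categories = {
    "nouns": "wd:Q1084",
    "proper_nouns": "wd:Q147276",
}

query_template = """
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:Q188 ;              # German
          wikibase:lexicalCategory {category} ;
          wikibase:lemma ?lemma .
}}
"""

# Each category gets its own, smaller query rather than one combined one.
for name, category in categories.items():
    print(f"-- {name} --")
    print(query_template.format(category=category))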

andrewtavis (Member) commented

Note that I just tried to query only the singular forms of just the nouns, not the proper nouns, and it's still failing. At this point it might make sense to use LIMIT and OFFSET.

andrewtavis (Member) commented

CC @daveads, do you want to write in here so I can assign this issue to you?

daveads (Contributor) commented Apr 22, 2024

yup @andrewtavis

andrewtavis (Member) commented

Lots of commits above, and following a discussion with a Wikidata admin today, I was able to get it working with changes to the query itself. This issue has been great for Scribe, as it unveiled a lot of parts of the queries that weren't necessary and were slowing things down. One more note: if a query stalls, it may be worth removing the labeling service if it's been used, as there's a lot of overhead in running it over the results of a large query.
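To illustrate the labeling-service point (a simplified example, not one of Scribe-Data's actual queries): for a single fixed language, the SERVICE wikibase:label block can often be replaced with an explicit rdfs:label triple, which avoids running the label service over a large result set.

# Convenient but adds overhead on large result sets.
with_label_service = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

# Fetches the same English labels directly, with no label service.
without_label_service = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 ;
        rdfs:label ?itemLabel .
  FILTER (lang(?itemLabel) = "en")
}
LIMIT 10
"""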

With the above being said, there have been lots of improvements, and I'm super grateful to @shashank-iitbhu for opening this and to @daveads for all the conversations that led us to the solutions here! 😊 Thanks so much!
