Hyphe relies on a JSON-RPC API that can be controlled easily through the web interface or called directly from a JSON-RPC client.
Note: because it relies on the JSON-RPC protocol, the API methods are not easy to test from a browser (arguments have to be sent through POST), but you can test them directly from the command line using the dedicated tools; see the Developers' documentation.
The current JSON-RPC 1.0 implementation requires arguments to be provided as an ordered array of the method's arguments. Calls with named arguments are possible but not well handled, and are not recommended until we migrate to REST.
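As an illustration of the ordered-array convention, here is a minimal sketch that serializes a JSON-RPC 1.0 call body in Python. The helper name `build_request` and the example URL are assumptions made for this example; the endpoint to POST the body to depends on your installation and is not shown.

```python
import json


def build_request(method, params=(), request_id=1):
    """Serialize a JSON-RPC 1.0 call using positional (ordered) arguments.

    `build_request` is a hypothetical helper, not part of Hyphe.
    """
    payload = {"method": method, "params": list(params), "id": request_id}
    return json.dumps(payload)


# Arguments follow the documented order of declare_page: url, then corpus.
body = build_request("declare_page", ["https://example.org", "--hyphe--"])
print(body)
```

The resulting string is what a JSON-RPC client would send as the POST body.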
The API will always answer as such:
- Success:
{
"code": "success",
"result": "<The actual expected result, possibly an object, an array, a number, a string, ...>"
}
- Error:
{
"code": "fail",
"message": "<A string describing the possible cause of the error.>"
}
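A client will typically want to turn these two answer shapes into either a return value or an exception. The sketch below shows one way to do so; `HypheError` and `unwrap` are hypothetical names introduced for this example, and the function only handles the two shapes described above.

```python
class HypheError(Exception):
    """Raised when an answer carries the code "fail" (hypothetical name)."""


def unwrap(answer):
    """Return the "result" of a successful answer, or raise on failure."""
    if answer.get("code") == "success":
        return answer.get("result")
    # Anything else is treated as an error; "message" describes the cause.
    raise HypheError(answer.get("message", "unknown error"))


print(unwrap({"code": "success", "result": ["ready"]}))
```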
- Default API commands (no namespace)
- CORPUS HANDLING
test_corpus
list_corpus
get_corpus_options
set_corpus_options
create_corpus
start_corpus
stop_corpus
get_corpus_tlds
backup_corpus
ping
reinitialize
destroy_corpus
force_destroy_corpus
clear_all
- CORE AND CORPUS STATUS
get_status
- BASIC PAGE DECLARATION (AND WEBENTITY CREATION)
declare_page
declare_pages
- BASIC CRAWL METHODS
listjobs
propose_webentity_startpages
crawl_webentity
crawl_webentity_with_startmode
get_webentity_jobs
cancel_webentity_jobs
get_webentity_logs
- HTTP LOOKUP METHODS
lookup_httpstatus
lookup
- Commands for namespace: "crawl."
deploy_crawler
delete_crawler
cancel_all
start
cancel
get_job_logs
- Commands for namespace: "store."
- DEFINE WEBENTITIES
get_lru_definedprefixes
declare_webentity_by_lruprefix_as_url
declare_webentity_by_lru
declare_webentity_by_lrus_as_urls
declare_webentity_by_lrus
- EDIT WEBENTITIES
basic_edit_webentity
rename_webentity
set_webentity_status
set_webentities_status
set_webentity_homepage
add_webentity_lruprefixes
rm_webentity_lruprefix
add_webentity_startpages
add_webentity_startpage
rm_webentity_startpages
rm_webentity_startpage
merge_webentity_into_another
merge_webentities_into_another
delete_webentity
- RETRIEVE AND SEARCH WEBENTITIES
get_webentity
get_webentity_by_lruprefix
get_webentity_by_lruprefix_as_url
get_webentity_for_url
get_webentity_for_url_as_lru
get_webentities
search_webentities
wordsearch_webentities
get_webentities_by_status
get_webentities_by_name
get_webentities_by_tag_value
get_webentities_by_tag_category
get_webentities_mistagged
get_webentities_uncrawled
get_webentities_page
get_webentities_ranking_stats
- TAGS
rebuild_tags_dictionary
add_webentity_tag_value
add_webentities_tag_value
rm_webentity_tag_value
rm_webentities_tag_value
edit_webentity_tag_value
get_tags
get_tag_namespaces
get_tag_categories
get_tag_values
- PAGES, LINKS AND NETWORKS
get_webentity_pages
paginate_webentity_pages
get_webentity_mostlinked_pages
get_webentity_subwebentities
get_webentity_parentwebentities
get_webentity_pagelinks_network
paginate_webentity_pagelinks_network
get_webentity_referrers
get_webentity_referrals
get_webentity_ego_network
get_webentities_network
- CREATION RULES
get_default_webentity_creationrule
get_webentity_creationrules
delete_webentity_creationrule
add_webentity_creationrule
simulate_creationrules_for_urls
simulate_creationrules_for_lrus
- VARIOUS
trigger_links_build
get_webentities_stats
test_corpus
:corpus
(optional, default:"--hyphe--"
)
Returns the current status of a corpus
: "ready"/"starting"/"missing"/"stopped"/"error".
list_corpus
:light
(optional, default:true
)
Returns the list of all existing corpora with metas.
get_corpus_options
:corpus
(optional, default:"--hyphe--"
)
Returns detailed settings of a corpus
.
set_corpus_options
:corpus
(optional, default:"--hyphe--"
)options
(optional, default:null
)
Updates the settings of a corpus
according to the keys/values provided in options
as a json object respecting the settings schema visible by querying get_corpus_options
. Returns the detailed settings.
create_corpus
:name
(optional, default:"--hyphe--"
)password
(optional, default:""
)options
(optional, default:{}
)
Creates a corpus with the chosen name
and optional password
and options
(as a json object see set/get_corpus_options
). Returns the corpus generated id and status.
start_corpus
:corpus
(optional, default:"--hyphe--"
)password
(optional, default:""
)
Starts an existing corpus
possibly password
-protected. Returns the new corpus status.
stop_corpus
:corpus
(optional, default:"--hyphe--"
)
Stops an existing and running corpus
. Returns the new corpus status.
get_corpus_tlds
:corpus
(optional, default:"--hyphe--"
)
Returns the tree of TLDs rules built from Mozilla's list at the creation of corpus
.
backup_corpus
:corpus
(optional, default:"--hyphe--"
)
Saves locally on the server in the archive directory a timestamped backup of corpus
including 4 json backup files of all webentities/links/crawls and corpus options.
ping
:corpus
(optional, default:null
)timeout
(optional, default:3
)
Tests during timeout
seconds whether an existing corpus
is started. Returns "pong" on success or the corpus status otherwise.
reinitialize
:corpus
(optional, default:"--hyphe--"
)
Completely resets a corpus
by cancelling all crawls and emptying the Traph and Mongo data.
destroy_corpus
:corpus
(optional, default:"--hyphe--"
)
Backs up, resets, then permanently deletes a corpus
and anything associated with it.
force_destroy_corpus
:corpus
(optional, default:"--hyphe--"
)
Completely and permanently deletes a corpus
without restarting it (backup may be less complete).
clear_all
:except_corpus_ids
(optional, default:[]
)
Resets Hyphe completely: starts then resets and destroys all existing corpora one by one except for those whose ID is given in except_corpus_ids
.
get_status
:corpus
(optional, default:"--hyphe--"
)
Returns global metadata on Hyphe's status and specific information on a corpus
.
declare_page
:url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Indexes a url
into a corpus
. Returns the (newly created or not) associated WebEntity.
declare_pages
:list_urls
(mandatory)corpus
(optional, default:"--hyphe--"
)
Indexes a bunch of urls given as an array in list_urls
into a corpus
. Returns the (newly created or not) associated WebEntities.
listjobs
:list_ids
(optional, default:null
)from_ts
(optional, default:null
)to_ts
(optional, default:null
)light
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns the list and details of all "finished"/"running"/"pending" crawl jobs of a corpus
. Optionally returns only the jobs whose id is given in an array of list_ids
and/or that were created after timestamp from_ts
or before to_ts
. Set light
to true to get only essential metadata for heavy queries.
propose_webentity_startpages
:webentity_id
(mandatory)startmode
(optional, default:"default"
)categories
(optional, default:false
)save_startpages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns a list of suggested startpages to crawl an existing WebEntity defined by its webentity_id
using the "default" startmode
defined for the corpus
or one or an array of either the WebEntity's preset "startpages", "homepage" or "prefixes" or most seen "pages-". Returns them categorised by type of source if "categories" is set to true. Will save them into the webentity if save_startpages
is True.
crawl_webentity
:webentity_id
(mandatory)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)status
(optional, default:"IN"
)proxy
(optional, default:null
)cookies_string
(optional, default:null
)user_agent
(optional, default:null
)phantom_timeouts
(optional, default:{}
)webarchives
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Schedules a crawl for a corpus
for an existing WebEntity defined by its webentity_id
with a specific crawl depth [int]
.
Optionally use PhantomJS by setting phantom_crawl
to "true" and adjust specific phantom_timeouts
as a json object with possible keys timeout
/ajax_timeout
/idle_timeout
.
Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status
("undecided"/"out"/"discovered").
Optionally add a HTTP proxy
specified as "domain_or_IP:port".
Also optionally add known cookies_string
with auth rights to a protected website and/or specific user_agent
.
Optionally use some webarchives
by defining a json object with keys date
/days_range
/option
, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr".
Will use the WebEntity's startpages if it has any or use otherwise the corpus
' "default" startmode
heuristic as defined in propose_webentity_startpages
(use crawl_webentity_with_startmode
to apply a different heuristic, see details in propose_webentity_startpages
).
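Because arguments must be sent as an ordered array, setting a late argument such as corpus means filling every earlier position with its default. The sketch below does this for crawl_webentity; the `positional_params` helper is hypothetical, the (name, default) pairs are copied from the signature above, and the WebEntity id `12` is made up.

```python
import copy

# Ordered (name, default) pairs copied from the crawl_webentity signature.
CRAWL_WEBENTITY_ARGS = [
    ("webentity_id", None),  # mandatory, no default
    ("depth", 0),
    ("phantom_crawl", False),
    ("status", "IN"),
    ("proxy", None),
    ("cookies_string", None),
    ("user_agent", None),
    ("phantom_timeouts", {}),
    ("webarchives", {}),
    ("corpus", "--hyphe--"),
]


def positional_params(signature, mandatory, **named):
    """Turn named arguments into the ordered array expected by the API."""
    names = [name for name, _ in signature]
    unknown = sorted(set(named) - set(names))
    if unknown:
        raise TypeError("unknown arguments: %s" % ", ".join(unknown))
    missing = [name for name in mandatory if name not in named]
    if missing:
        raise TypeError("missing mandatory arguments: %s" % ", ".join(missing))
    # Copy defaults so mutable ones ({}) are never shared between calls.
    return [copy.deepcopy(named.get(name, default)) for name, default in signature]


params = positional_params(
    CRAWL_WEBENTITY_ARGS, ["webentity_id"],
    webentity_id=12, depth=2, corpus="my-corpus",
)
print(params)
# [12, 2, False, 'IN', None, None, None, {}, {}, 'my-corpus']
```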
crawl_webentity_with_startmode
:webentity_id
(mandatory)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)status
(optional, default:"IN"
)startmode
(optional, default:"default"
)proxy
(optional, default:null
)cookies_string
(optional, default:null
)user_agent
(optional, default:null
)phantom_timeouts
(optional, default:{}
)webarchives
(optional, default:{}
)save_startpages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Schedules a crawl for a corpus
for an existing WebEntity defined by its webentity_id
with a specific crawl depth [int]
.
Optionally use PhantomJS by setting phantom_crawl
to "true" and adjust specific phantom_timeouts
as a json object with possible keys timeout
/ajax_timeout
/idle_timeout
.
Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status
("undecided"/"out"/"discovered").
Optionally add a HTTP proxy
specified as "domain_or_IP:port".
Also optionally add known cookies_string
with auth rights to a protected website and/or specific user_agent
.
Optionally define the startmode
strategy differently to the corpus
"default" one (see details in propose_webentity_startpages
).
Optionally use some webarchives
by defining a json object with keys date
/days_range
/option
, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr".
get_webentity_jobs
:webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
crawl jobs that have run for a specific WebEntity defined by its webentity_id
.
cancel_webentity_jobs
:webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Cancels for a corpus
all running or pending crawl jobs that were booked for a specific WebEntity defined by its webentity_id
.
get_webentity_logs
:webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
crawl activity logs on a specific WebEntity defined by its webentity_id
.
lookup_httpstatus
:url
(mandatory)timeout
(optional, default:30
)corpus
(optional, default:"--hyphe--"
)
Tests a url
for timeout
seconds using a corpus
specific connection (possible proxy for instance). Returns the url's HTTP code.
lookup
:url
(mandatory)timeout
(optional, default:30
)corpus
(optional, default:"--hyphe--"
)
Tests a url
for timeout
seconds using a corpus
specific connection (possible proxy for instance). Returns a boolean indicating whether lookup_httpstatus
returned HTTP code 200 or a redirection code (301/302/...).
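The relationship between the two lookup methods can be sketched as a small predicate. This is only an illustration of the rule stated above, not Hyphe's implementation; reading "a redirection code (301/302/...)" as the whole 3xx range is an assumption, and `is_reachable` is a hypothetical name.

```python
def is_reachable(http_status):
    """Mimic lookup's boolean from the code given by lookup_httpstatus.

    Assumption: any 3xx code counts as a redirection.
    """
    return http_status == 200 or 300 <= http_status < 400


for code in (200, 301, 404):
    print(code, is_reachable(code))
```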
deploy_crawler
:corpus
(optional, default:"--hyphe--"
)
Prepares and deploys on the ScrapyD server a spider (crawler) for a corpus
.
delete_crawler
:corpus
(optional, default:"--hyphe--"
)
Removes from the ScrapyD server an existing spider (crawler) for a corpus
.
cancel_all
:corpus
(optional, default:"--hyphe--"
)
Stops all "running" and "pending" crawl jobs for a corpus
.
Cancels all current crawl jobs running or planned for a corpus
and empties the related Mongo data.
start
:webentity_id
(mandatory)starts
(mandatory)follow_prefixes
(mandatory)nofollow_prefixes
(mandatory)follow_redirects
(optional, default:null
)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)phantom_timeouts
(optional, default:{}
)download_delay
(optional, default:1
)proxy
(optional, default:null
)cookies_string
(optional, default:null
)user_agent
(optional, default:null
)webarchives
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Starts a crawl for a corpus
defining finely the crawl options (mainly for debug purposes):
- a webentity_id associated with the crawl
- a list of starts urls to start from
- a list of follow_prefixes to know which links to follow
- a list of nofollow_prefixes to know which links to avoid
- a depth corresponding to the maximum number of clicks done from the start pages
- phantom_crawl set to "true" to use PhantomJS for this crawl and optional phantom_timeouts as an object with keys among timeout/ajax_timeout/idle_timeout
- a download_delay corresponding to the time in seconds spent between two requests by the crawler
- an HTTP proxy specified as "domain_or_IP:port"
- a known cookies_string with auth rights to a protected website
- a specific user_agent
Optionally use some webarchives by defining a json object with keys date/days_range/option, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr".
cancel
:job_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Cancels a crawl of id job_id
for a corpus
.
get_job_logs
:job_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
activity logs of a specific crawl with id job_id
.
get_lru_definedprefixes
:lru
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all possible LRU prefixes shorter than lru
and already attached to WebEntities.
declare_webentity_by_lruprefix_as_url
:url
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startpages
(optional, default:[]
)lruVariations
(optional, default:true
)tags
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus
a WebEntity defined for the LRU prefix given as a url
and optionally for the corresponding http/https and www/no-www variations if lruVariations
is true. Optionally set the newly created WebEntity's name,
status
("in"/"out"/"undecided"/"discovered") and list of startpages
. Returns the newly created WebEntity.
declare_webentity_by_lru
:lru_prefix
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startpages
(optional, default:[]
)lruVariations
(optional, default:true
)tags
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus
a WebEntity defined for a lru_prefix
and optionally for the corresponding http/https and www/no-www variations if lruVariations
is true. Optionally set the newly created WebEntity's name,
status
("in"/"out"/"undecided"/"discovered") and list of startpages
. Returns the newly created WebEntity.
declare_webentity_by_lrus_as_urls
:list_urls
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startpages
(optional, default:[]
)lruVariations
(optional, default:true
)tags
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus
a WebEntity defined for a set of LRU prefixes given as URLs under list_urls
and optionally for the corresponding http/https and www/no-www variations if lruVariations
is true. Optionally set the newly created WebEntity's name,
status
("in"/"out"/"undecided"/"discovered") and list of startpages
. Returns the newly created WebEntity.
declare_webentity_by_lrus
:list_lrus
(mandatory)name
(optional, default:null
)status
(optional, default:""
)startpages
(optional, default:[]
)lruVariations
(optional, default:true
)tags
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus
a WebEntity defined for a set of LRU prefixes given as list_lrus
and optionally for the corresponding http/https and www/no-www variations if lruVariations
is true. Optionally set the newly created WebEntity's name,
status
("in"/"out"/"undecided"/"discovered") and list of startpages
. Returns the newly created WebEntity.
basic_edit_webentity
:webentity_id
(mandatory)name
(optional, default:null
)status
(optional, default:null
)homepage
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
at once the name
, status
and homepage
of a WebEntity defined by webentity_id
.
rename_webentity
:webentity_id
(mandatory)new_name
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the name of a WebEntity defined by webentity_id
to new_name
.
set_webentity_status
:webentity_id
(mandatory)status
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the status of a WebEntity defined by webentity_id
to status
(one of "in"/"out"/"undecided"/"discovered").
set_webentities_status
:webentity_ids
(mandatory)status
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the status of a set of WebEntities defined by a list of webentity_ids
to status
(one of "in"/"out"/"undecided"/"discovered").
set_webentity_homepage
:webentity_id
(mandatory)homepage
(optional, default:""
)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the homepage of a WebEntity defined by webentity_id
to homepage
.
add_webentity_lruprefixes
:webentity_id
(mandatory)lru_prefixes
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a list of lru_prefixes
(or a single one) to a WebEntity defined by webentity_id
.
rm_webentity_lruprefix
:webentity_id
(mandatory)lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a lru_prefix
from the list of prefixes of a WebEntity defined by webentity_id. Will delete the WebEntity if it ends up with no LRU prefix left.
add_webentity_startpages
:webentity_id
(mandatory)startpages_urls
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a list of startpages_urls
to the list of startpages to use when crawling the WebEntity defined by webentity_id
.
add_webentity_startpage
:webentity_id
(mandatory)startpage_url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a startpage_url
to the list of startpages to use when crawling the WebEntity defined by webentity_id
.
rm_webentity_startpages
:webentity_id
(mandatory)startpages_urls
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a list of startpages_urls
from the list of startpages to use when crawling the WebEntity defined by webentity_id.
rm_webentity_startpage
:webentity_id
(mandatory)startpage_url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a startpage_url
from the list of startpages to use when crawling the WebEntity defined by webentity_id.
merge_webentity_into_another
:old_webentity_id
(mandatory)good_webentity_id
(mandatory)include_tags
(optional, default:false
)include_home_and_startpages_as_startpages
(optional, default:false
)include_name_and_status
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Assembles for a corpus
2 WebEntities by deleting WebEntity defined by old_webentity_id
and adding all of its LRU prefixes to the one defined by good_webentity_id
. Optionally set include_tags
and/or include_home_and_startpages_as_startpages
and/or include_name_and_status
to "true" to also add the tags and/or startpages and/or name&status to the merged resulting WebEntity.
merge_webentities_into_another
:old_webentity_ids
(mandatory)good_webentity_id
(mandatory)include_tags
(optional, default:false
)include_home_and_startpages_as_startpages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Assembles for a corpus
a bunch of WebEntities by deleting WebEntities defined by a list of old_webentity_ids
and adding all of their LRU prefixes to the one defined by good_webentity_id
. Optionally set include_tags
and/or include_home_and_startpages_as_startpages
to "true" to also add the tags and/or startpages to the merged resulting WebEntity.
delete_webentity
:webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes from a corpus
a WebEntity defined by webentity_id
(mainly for advanced debug use).
get_webentity
:webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a WebEntity defined by its webentity_id
.
get_webentity_by_lruprefix
:lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity having lru_prefix
as one of its LRU prefixes.
get_webentity_by_lruprefix_as_url
:url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity having one of its LRU prefixes corresponding to the LRU given in the form of a url
.
get_webentity_for_url
:url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity to which a url
belongs (meaning starting with one of the WebEntity's prefix and not another).
get_webentity_for_url_as_lru
:lru
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity to which a url given under the form of a lru
belongs (meaning starting with one of the WebEntity's prefix and not another).
get_webentities
:list_ids
(optional, default:[]
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:false
)light_for_csv
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all existing WebEntities or only the WebEntities whose id is among list_ids
.
Results will be paginated with a total number of returned results of count
and page
the number of the desired page of results. Returns all results at once if list_ids
is provided or count
is -1 ; otherwise results will include metadata on the request including the total number of results and a token
to be reused to collect the other pages via get_webentities_page
.
Other possible options include:
- order the results with sort by inputting a field or list of fields as named in the WebEntities returned objects; optionally prefix a sort field with a "-" to reverse the sorting on it; for instance: ["-indegree", "name"] will order by maximum indegree first then by alphabetic order of names;
- set light or semilight or light_for_csv to "true" to collect lighter data with fewer WebEntities fields.
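To make the sort convention concrete, the sketch below reproduces the same ordering on the client side over a few made-up WebEntity objects. It only models the semantics of the "-" prefix; the actual sorting is done by the server, and `sort_webentities` is a hypothetical helper.

```python
def sort_webentities(webentities, sort_fields):
    """Order dicts following the sort convention: "-" prefix means descending."""
    if isinstance(sort_fields, str):
        sort_fields = [sort_fields]
    ordered = list(webentities)
    # Python's sort is stable, so applying the keys from the least to the
    # most significant one yields the combined ordering.
    for field in reversed(sort_fields):
        descending = field.startswith("-")
        key = field[1:] if descending else field
        ordered.sort(key=lambda we, k=key: we[k], reverse=descending)
    return ordered


sample = [
    {"name": "b", "indegree": 3},
    {"name": "a", "indegree": 3},
    {"name": "c", "indegree": 10},
]
result = sort_webentities(sample, ["-indegree", "name"])
print([we["name"] for we in result])
# ['c', 'a', 'b']
```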
search_webentities
:allFieldsKeywords
(optional, default:[]
)fieldKeywords
(optional, default:[]
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities matching a specific search using the allFieldsKeywords
and fieldKeywords
arguments.
Returns all results at once if count
is -1; otherwise results will be paginated with count
results per page, using page
as index of the desired page. Results will include metadata on the request including the total number of results and a token
to be reused to collect the other pages via get_webentities_page
.
- allFieldsKeywords should be a string or list of strings to search in all textual fields of the WebEntities ("name", "lru prefixes", "startpages" & "homepage"). For instance: ["hyphe", "www"]
- fieldKeywords should be a list of 2-element arrays giving first the field to search into then the searched value, or optionally for the field "indegree" an array of minimum and maximum values to search into (notes: this does not work with undirected_degree and outdegree; only exact values will be matched when querying on the status field). For instance: [["name", "hyphe"], ["indegree", [3, 1000]]]
- see description of sort, light and semilight in get_webentities above.
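The shape expected for fieldKeywords is easy to get wrong, so a client may want to check it before sending a query. The sketch below is a hypothetical client-side check (`check_field_keywords` is not part of Hyphe) that only encodes the rules stated above: pairs of [field, value], with a [min, max] range accepted for "indegree" only.

```python
def check_field_keywords(field_keywords):
    """Validate the fieldKeywords shape described above; raise ValueError if wrong."""
    for item in field_keywords:
        if not isinstance(item, (list, tuple)) or len(item) != 2:
            raise ValueError("each entry must be a [field, value] pair: %r" % (item,))
        field, value = item
        if isinstance(value, (list, tuple)):
            # A range is only documented for the "indegree" field.
            if field != "indegree" or len(value) != 2:
                raise ValueError("only indegree accepts a [min, max] range: %r" % (item,))
    return True


print(check_field_keywords([["name", "hyphe"], ["indegree", [3, 1000]]]))
```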
wordsearch_webentities
:allFieldsKeywords
(optional, default:[]
)fieldKeywords
(optional, default:[]
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Same as search_webentities
except that the search only matches exact full words, and that the allFieldsKeywords query also searches into tag values.
get_webentities_by_status
:status
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having their status equal to status
(one of "in"/"out"/"undecided"/"discovered").
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_by_name
:name
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having their name equal to name
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_by_tag_value
:value
(mandatory)namespace
(optional, default:null
)category
(optional, default:null
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having at least one tag in any namespace/category equal to value
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_by_tag_category
:namespace
(mandatory)category
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having at least one tag in a specific category
for a specific namespace
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_mistagged
:status
(optional, default:'IN'
)missing_a_category
(optional, default:false
)multiple_values
(optional, default:false
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities of status status
with no tag of the namespace "USER" or multiple tags for some USER categories if multiple_values
is true or no tag for at least one existing USER category if missing_a_category
is true.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_uncrawled
:sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all IN WebEntities which have no crawljob associated with it.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see search_webentities
for explanations on sort
count
and page
.
get_webentities_page
:pagination_token
(mandatory)n_page
(mandatory)idNamesOnly
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the page number n_page
of WebEntities corresponding to the results of a previous query run using any of the get_webentities
or search_webentities
methods using the returned pagination_token
. Returns only an array of [id, name] arrays if idNamesOnly
is true.
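The token-based pagination can be sketched as a generator that first queries get_webentities and then walks the following pages with get_webentities_page. Several points here are assumptions not confirmed by this document: the result field names "webentities", "token" and "total_results", the fact that both methods return the same structure, and zero-based page numbers. `call` stands for any function that sends a method name and an ordered argument array and returns the unwrapped result.

```python
def iter_webentities(call, corpus="--hyphe--", count=100):
    """Yield all WebEntities of a corpus page by page.

    `call(method, params)` is a hypothetical transport function; result
    field names below are assumptions.
    """
    # Argument order: list_ids, sort, count, page, light, semilight,
    # light_for_csv, corpus.
    first = call("store.get_webentities",
                 [[], None, count, 0, False, False, False, corpus])
    yield from first["webentities"]
    fetched = len(first["webentities"])
    token = first.get("token")
    n_page = 1
    while token and fetched < first["total_results"]:
        # Argument order: pagination_token, n_page, idNamesOnly, corpus.
        page = call("store.get_webentities_page", [token, n_page, False, corpus])
        if not page["webentities"]:
            break  # avoid looping forever on an unexpected empty page
        yield from page["webentities"]
        fetched += len(page["webentities"])
        n_page += 1
```

Passing the transport function as an argument keeps the loop independent of any particular JSON-RPC client.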
get_webentities_ranking_stats
:pagination_token
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
histogram data on the indegrees of all WebEntities matching a previous query run using any of the get_webentities
or search_webentities
methods using the returned pagination_token
.
rebuild_tags_dictionary
:corpus
(optional, default:"--hyphe--"
)
Administrative function to regenerate for a corpus
the dictionary of tag values used by autocompletion features; mostly a debug function which should not be used in most cases.
add_webentity_tag_value
:webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a tag namespace:category=value
to a WebEntity defined by webentity_id
.
add_webentities_tag_value
:webentity_ids
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a tag namespace:category=value
to a bunch of WebEntities defined by a list of webentity_ids
.
rm_webentity_tag_value
:webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a tag namespace:category=value
associated with a WebEntity defined by webentity_id
if it is set.
rm_webentities_tag_value
:webentity_ids
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a tag namespace:category=value
from a bunch of WebEntities defined by a list of webentity_ids
.
edit_webentity_tag_value
:webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)old_value
(mandatory)new_value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Replaces for a corpus
a tag namespace:category=old_value
with a tag namespace:category=new_value
for the WebEntity defined by webentity_id
if it is set.
get_tags
:namespace
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a tree of all existing tags of the webentities hierarchised by namespaces and categories. Optionally limits to a specific namespace
.
get_tag_namespaces
:corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing namespaces of the webentities tags.
get_tag_categories
:namespace
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing categories of the webentities tags. Optionally limits to a specific namespace
.
get_tag_values
:namespace
(optional, default:null
)category
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing values in the webentities tags. Optionally limits to a specific namespace
and/or category
.
get_webentity_pages
:webentity_id
(mandatory)onlyCrawled
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Warning: this method can be very slow on webentities with many pages; prefer paginate_webentity_pages whenever possible. Returns for a corpus
all indexed Pages fitting within the WebEntity defined by webentity_id
. Optionally limits the results to Pages which were actually crawled setting onlyCrawled
to "true".
paginate_webentity_pages
:webentity_id
(mandatory)count
(optional, default:5000
)pagination_token
(optional, default:null
)onlyCrawled
(optional, default:false
)include_page_metas
(optional, default:false
)include_page_body
(optional, default:false
)body_as_plain_text
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
count
indexed Pages alphabetically ordered fitting within the WebEntity defined by webentity_id
and returns a pagination_token
to reuse to collect the following pages. Optionally limits the results to Pages which were actually crawled setting onlyCrawled
to "true". Also optionally returns complete page metadata (http status, body size, content_type, encoding, crawl timestamp and crawl depth) when include_page_metas
is set to "true". Additionally returns the page's zipped body encoded in base64 when include_page_body
is "true" (only possible when Hyphe is configured with store_crawled_html_content
set to "true"); setting body_as_plain_text to "true" decodes and unzips these to return them as plain text.
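When body_as_plain_text is left to false, the client has to decode the body itself. The sketch below assumes that "zipped" means a zlib-compressed stream and that the page is UTF-8 text; both are assumptions, as this document does not state the compression format or the encoding, and `decode_page_body` is a hypothetical helper.

```python
import base64
import zlib


def decode_page_body(encoded_body, encoding="utf-8"):
    """Decode a base64 string holding a zlib-compressed page body.

    Assumptions: zlib compression and a known text encoding.
    """
    raw = zlib.decompress(base64.b64decode(encoded_body))
    return raw.decode(encoding, errors="replace")


# Round trip on made-up content, standing in for a body returned by the API.
sample = base64.b64encode(zlib.compress(b"<html>hi</html>")).decode("ascii")
print(decode_page_body(sample))
```

In practice the encoding could be taken from the page metadata returned when include_page_metas is set to "true".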
get_webentity_mostlinked_pages:
- webentity_id (mandatory)
- npages (optional, default: 20)
- max_prefix_distance (optional, default: null)
- corpus (optional, default: "--hyphe--")
Returns for a corpus the npages (defaults to 20) most linked indexed Pages that fit within the WebEntity defined by webentity_id, optionally limited to a maximum depth of max_prefix_distance.
get_webentity_subwebentities:
- webentity_id (mandatory)
- light (optional, default: false)
- corpus (optional, default: "--hyphe--")
Returns for a corpus all sub-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix starting with one of the WebEntity's prefixes).
get_webentity_parentwebentities:
- webentity_id (mandatory)
- light (optional, default: false)
- corpus (optional, default: "--hyphe--")
Returns for a corpus all parent-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix which is the beginning of one of the WebEntity's prefixes).
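The two prefix relations can be illustrated with plain string comparisons. This is a minimal sketch, not Hyphe's actual implementation: it reads the parent relation as the mirror of the sub-webentity one (an interpretation of the descriptions above), it excludes identical prefixes (an assumption), and the LRU strings are made up for the example.

```python
from typing import Iterable


def is_sub_webentity(candidate_prefixes: Iterable[str],
                     webentity_prefixes: Iterable[str]) -> bool:
    """True if at least one candidate prefix starts with one of the
    reference WebEntity's prefixes (identical prefixes are ignored)."""
    refs = list(webentity_prefixes)
    return any(c != r and c.startswith(r)
               for c in candidate_prefixes for r in refs)


def is_parent_webentity(candidate_prefixes: Iterable[str],
                        webentity_prefixes: Iterable[str]) -> bool:
    """Mirror relation: the reference WebEntity is a sub-webentity
    of the candidate."""
    return is_sub_webentity(webentity_prefixes, candidate_prefixes)


# Illustrative LRU prefixes (made up for this example).
site = ["s:http|h:com|h:example|"]
blog = ["s:http|h:com|h:example|p:blog|"]

print(is_sub_webentity(blog, site))     # blog sits under site
print(is_parent_webentity(site, blog))  # site sits above blog
```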
get_webentity_pagelinks_network:
- webentity_id (optional, default: null)
- include_external_links (optional, default: false)
- corpus (optional, default: "--hyphe--")
Warning: this method can be very slow on webentities with many pages or links; prefer paginate_webentity_pagelinks_network whenever possible. Returns for a corpus the list of all internal NodeLinks of a WebEntity defined by webentity_id. Optionally adds external NodeLinks (the frontier) by setting include_external_links to "true". Will return little or nothing if the corpus was configured with ignore_internal_links set to "true".
paginate_webentity_pagelinks_network:
- webentity_id (optional, default: null)
- count (optional, default: 10)
- pagination_token (optional, default: null)
- include_external_outlinks (optional, default: false)
- corpus (optional, default: "--hyphe--")
Returns for a corpus internal page links for count source pages of a WebEntity defined by webentity_id, and returns a pagination_token to reuse to collect the following links. Optionally adds external NodeLinks (the frontier) by setting include_external_outlinks to "true". Will return little or nothing if the corpus was configured with ignore_internal_links set to "true".
get_webentity_referrers:
- webentity_id (optional, default: null)
- count (optional, default: 100)
- page (optional, default: 0)
- light (optional, default: true)
- semilight (optional, default: false)
- corpus (optional, default: "--hyphe--")
Returns for a corpus all WebEntities with known links to webentity_id, ordered by decreasing link weight.
Results are paginated and will include a token to be reused to collect the other entities via get_webentities_page: see search_webentities for explanations on count and page.
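Since the API expects arguments as an ordered array, a call that only changes a few defaults still has to fill every position before them. The sketch below builds such a request body from named overrides; build_request and REFERRERS_PARAMS are hypothetical helper names, the "store." method prefix is assumed, and the webentity id and corpus name are made-up values.

```python
import json
from typing import Any, Dict, List, Tuple

# Parameter order and defaults copied from the get_webentity_referrers entry.
REFERRERS_PARAMS: List[Tuple[str, Any]] = [
    ("webentity_id", None),
    ("count", 100),
    ("page", 0),
    ("light", True),
    ("semilight", False),
    ("corpus", "--hyphe--"),
]


def build_request(method: str, spec: List[Tuple[str, Any]],
                  request_id: int = 1, **overrides: Any) -> Dict[str, Any]:
    """Turn named overrides into the ordered params array of a
    JSON-RPC 1.0 request (method, params, id)."""
    unknown = set(overrides) - {name for name, _ in spec}
    if unknown:
        raise TypeError(f"unknown argument(s): {sorted(unknown)}")
    params = [overrides.get(name, default) for name, default in spec]
    return {"method": method, "params": params, "id": request_id}


# Made-up values: webentity 12 in a corpus named "my-corpus".
payload = build_request("store.get_webentity_referrers", REFERRERS_PARAMS,
                        webentity_id=12, count=50, corpus="my-corpus")
print(json.dumps(payload))
```

The same helper works for get_webentity_referrals, which documents the same parameters in the same order.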
get_webentity_referrals:
- webentity_id (optional, default: null)
- count (optional, default: 100)
- page (optional, default: 0)
- light (optional, default: true)
- semilight (optional, default: false)
- corpus (optional, default: "--hyphe--")
Returns for a corpus all WebEntities with known links from webentity_id, ordered by decreasing link weight.
Results are paginated and will include a token to be reused to collect the other entities via get_webentities_page: see search_webentities for explanations on count and page.
get_webentity_ego_network:
- webentity_id (optional, default: null)
- corpus (optional, default: "--hyphe--")
Returns for a corpus a list of all weighted links between webentities linked to webentity_id.
get_webentities_network:
- include_links_from_OUT (optional, default: INCLUDE_LINKS_FROM_OUT)
- include_links_from_DISCOVERED (optional, default: INCLUDE_LINKS_FROM_DISCOVERED)
- corpus (optional, default: "--hyphe--")
Returns for a corpus the list of all aggregated weighted links between WebEntities.
get_default_webentity_creationrule:
- corpus (optional, default: "--hyphe--")
Returns for a corpus the default WebEntityCreationRule.
get_webentity_creationrules:
- lru_prefix (optional, default: null)
- corpus (optional, default: "--hyphe--")
Returns for a corpus all existing WebEntityCreationRules, or only the one set for a specific lru_prefix.
delete_webentity_creationrule:
- lru_prefix (mandatory)
- corpus (optional, default: "--hyphe--")
Removes from a corpus an existing WebEntityCreationRule set for a specific lru_prefix.
add_webentity_creationrule:
- lru_prefix (mandatory)
- regexp (mandatory)
- corpus (optional, default: "--hyphe--")
Adds to a corpus a new WebEntityCreationRule set for a lru_prefix to a specific regexp or one of "subdomain"/"subdomain-N"/"domain"/"path-N"/"prefix+N"/"page", N being an integer. It will immediately be applied to past crawls.
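A client may want to tell the shorthand names apart from a custom regexp before sending a rule. The following sketch only checks the spelling of the shorthands listed above; is_rule_shorthand is a hypothetical helper, N is assumed to be a non-negative integer written in decimal digits, and nothing here reflects how Hyphe itself expands these names.

```python
import re

# Shorthand names from the description above; N is read as a
# non-negative integer (an assumption of this sketch).
_SHORTHAND = re.compile(
    r"^(subdomain|domain|page|subdomain-\d+|path-\d+|prefix\+\d+)$"
)


def is_rule_shorthand(regexp: str) -> bool:
    """True if the value is one of the documented shorthand names
    rather than a custom regular expression."""
    return _SHORTHAND.match(regexp) is not None


for value in ("domain", "path-2", "prefix+1", "path-", "(s:[a-z]+\\|)"):
    print(value, is_rule_shorthand(value))
```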
simulate_creationrules_for_urls:
- pageURLs (mandatory)
- corpus (optional, default: "--hyphe--")
Returns an object giving for each URL of pageURLs (single string or array) the prefix of the theoretical WebEntity the URL would be attached to within a corpus following its specific WebEntityCreationRules.
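Because pageURLs accepts either a single string or an array, and every answer comes wrapped in the success/fail envelope described at the top of this document, a client typically normalizes the input and unwraps the output. The sketch below shows both steps; HypheError, as_url_list and unwrap are hypothetical names, and the sample answer is made-up data shaped after the description above, not a recorded response.

```python
from typing import Any, Dict, List, Union


class HypheError(Exception):
    """Raised when the API answers with code "fail" (hypothetical name)."""


def as_url_list(page_urls: Union[str, List[str]]) -> List[str]:
    """Accept a single URL or a list of URLs and always return a list."""
    return [page_urls] if isinstance(page_urls, str) else list(page_urls)


def unwrap(answer: Dict[str, Any]) -> Any:
    """Return the result of a successful answer, raise otherwise."""
    if answer.get("code") == "success":
        return answer["result"]
    raise HypheError(answer.get("message", "unknown error"))


# Made-up answer: one entry per requested URL, mapping it to a prefix.
sample_answer = {
    "code": "success",
    "result": {"http://example.com/blog/post": "s:http|h:com|h:example|"},
}

urls = as_url_list("http://example.com/blog/post")
prefixes = unwrap(sample_answer)
for url in urls:
    print(url, "->", prefixes[url])
```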
simulate_creationrules_for_lrus:
- pageLRUs (mandatory)
- corpus (optional, default: "--hyphe--")
Returns an object giving for each LRU of pageLRUs (single string or array) the prefix of the theoretical WebEntity the LRU would be attached to within a corpus following its specific WebEntityCreationRules.
trigger_links_build:
- corpus (optional, default: "--hyphe--")
Initiates an update of the links calculation (especially useful when a corpus crashed during the links calculation and no more crawls are scheduled).
get_webentities_stats:
- corpus (optional, default: "--hyphe--")
Returns for a corpus a set of statistics on the distribution of its WebEntities by status, at 5-minute intervals.