FAQ
Here are answers to some frequently-asked questions, updated for ConceptNet 5.7.
ConceptNet is a knowledge graph of things people know and computers should know, expressed in various natural languages. See the main page for more details.
ConceptNet is a resource. You can use it as part of making an AI that understands the meanings of words people use.
ConceptNet is not a chatbot. Some chatbot systems have used ConceptNet as a resource, but this is not a primary use case that ConceptNet is designed for.
You can browse the knowledge graph at http://www.conceptnet.io/.
We recommend starting with the Web API. If you need a greater flow of information than the Web API provides, then consider downloading the data.
One way to take advantage of all the information in ConceptNet, as well as information that can be learned from large corpora of text, is to use the ConceptNet Numberbatch word embeddings. These can be used as a more accurate replacement for word2vec or GloVe vectors.
When used together with some extra code in `conceptnet5.vectors`, ConceptNet Numberbatch provides the best word embeddings in the world in multiple languages, as tested at SemEval 2017.
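The Numberbatch data is distributed in the standard word2vec text format: a header line with the vocabulary size and dimensionality, then one term and its vector per line (the multilingual file labels terms with ConceptNet URIs such as `/c/en/cat`). Here's a minimal sketch of reading that format and comparing two terms; the three-dimensional sample vectors are made up for illustration, while the real vectors have 300 dimensions:

```python
import math

# A tiny made-up sample in the word2vec text format that Numberbatch
# uses: a "<vocab size> <dimensions>" header, then one term per line.
# These values are illustrative, not real Numberbatch data.
sample = """\
3 3
/c/en/cat 0.1 0.2 0.7
/c/en/dog 0.1 0.3 0.6
/c/en/teapot 0.9 -0.4 0.0
"""

def load_vectors(text):
    """Parse word2vec text format into a dict of term -> vector."""
    lines = text.strip().split("\n")
    vectors = {}
    for line in lines[1:]:  # skip the header line
        parts = line.split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = load_vectors(sample)
# Related terms should score higher than unrelated ones:
print(cosine(vectors["/c/en/cat"], vectors["/c/en/dog"]) >
      cosine(vectors["/c/en/cat"], vectors["/c/en/teapot"]))  # True
```

In practice you would load the downloaded Numberbatch file the same way (or with a library that reads word2vec format), and the comparison above is the basic operation behind using the embeddings for semantic similarity.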
The paper we recommend citing when you're using recent versions of ConceptNet is:
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." In proceedings of AAAI 31.
It's okay to cite this paper for versions later than 5.5. We don't get to publish a new paper for every version.
The BibTeX information is:
@paper{speer2017conceptnet,
author = {Robyn Speer and Joshua Chin and Catherine Havasi},
title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
conference = {AAAI Conference on Artificial Intelligence},
year = {2017},
pages = {4444--4451},
keywords = {ConceptNet; knowledge graph; word embeddings},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
If you want to cite a more general, older overview:
Robyn Speer and Catherine Havasi, 2012. Representing General Relational Knowledge in ConceptNet 5. In LREC (pp. 3679-3686).
ConceptNet has changed a lot over its existence. If you cite a paper by Hugo Liu (one of the original creators of ConceptNet), realize that the citation only applies to the design of ConceptNet from 2005 and earlier.
I'm Robyn Speer. The other name is my "deadname", a former name that I don't use for any purposes and don't want to propagate, because it doesn't fit my gender identity.
For me to continue in research as a trans woman, I need to be able to choose my name and keep my publication history. I've amended many of my recent papers to have my new name on them. Whether or not you're seeing the amended version, please always cite me as Robyn Speer.
See Citation complications for further details on this, including what to do if you've accidentally created a new citation of my deadname.
Yes! This is allowed by the Creative Commons Attribution-ShareAlike license, which has two conditions. Here's what they approximately mean for ConceptNet:
- Attribution: Visibly give credit to ConceptNet and its creators
- ShareAlike: If you add data to ConceptNet, modify its data, or combine its data into a larger database, the resulting dataset must have the same license terms as ConceptNet.
To give proper attribution to ConceptNet's data, we suggest this text:
This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0) from http://conceptnet.io. The included data was created by contributors to Commonsense Computing projects, contributors to Wikimedia projects, Games with a Purpose, Princeton University's WordNet, DBPedia, OpenCyc, and Umbel.
In particular, you may not add restrictions on how data built on ConceptNet is used, such as "research purposes only" or "non-commercial".
I need to use ConceptNet together with a "research purposes only" resource. I really am just using it for research purposes. What do I do?
You can't change ConceptNet's license, not even for the sake of research. I can't change it either, even if I wanted to, because I've agreed to the same license from Wikimedia. But I wouldn't want to change it. The Attribution-ShareAlike license makes sure that ConceptNet remains open data.
Some options you have are:
- Try to get a more permissive license from the creators of the other resource
- Find a different resource
- Put either ConceptNet or the other resource in a separate component, whose data is distributed separately
But I want to make something for ordinary people, not corporations! Why do I have to allow commercial use?
ConceptNet would not exist without commercial use.
Large corporations will get all the data they want anyway. When you put restrictions on data, you don't do anything to large corporations, you only harm people without connections. "Research use only" or "academic use only" is a particularly insidious form of elitism.
We went to some effort to make the API responses look nice in a Web browser. The JSON gets formatted and highlighted, and values that refer to other API URLs become links, so you can explore the graph just by following them.
Try clicking the link below and you'll be using the ConceptNet API:
http://api.conceptnet.io/c/en/example
Of course, you don't have to use a Web browser. If you have `curl` (a small command-line HTTP utility) on your computer, try running this at the command line:

curl http://api.conceptnet.io/c/en/example
Or in Python, using the `requests` library:
import requests
requests.get('http://api.conceptnet.io/c/en/example').json()
There are more things you can do that won't be quite so obvious just from looking at the responses, so once you've explored a little, go read the API documentation.
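As a sketch of what you can do with a response, here's how you might turn its edges into readable triples. This assumes the documented response shape, with a top-level `edges` list whose items carry `rel`, `start`, and `end` objects with `label` fields; the sample response below is hypothetical, standing in for what `requests.get('http://api.conceptnet.io/c/en/example').json()` would return:

```python
# A hypothetical response shaped like the API's output, standing in
# for a real requests.get(...).json() result.
response = {
    "edges": [
        {"rel": {"label": "IsA"},
         "start": {"label": "an example"},
         "end": {"label": "a thing"},
         "weight": 1.0},
        {"rel": {"label": "RelatedTo"},
         "start": {"label": "example"},
         "end": {"label": "sample"},
         "weight": 2.0},
    ]
}

def summarize_edges(response):
    """Turn each edge into a readable (start, relation, end) triple."""
    return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"])
            for e in response["edges"]]

for start, rel, end in summarize_edges(response):
    print(f"{start} --{rel}--> {end}")
```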
There are more pages of results. The default page size is set to 20 -- this speeds up the responses and keeps any one response from containing more results than you need at once.
When the API results are paginated, the response will end with a section that looks like this:
"view": {
"@id": "/c/en/example?offset=0&limit=20",
"@type": "PartialCollectionView",
"comment": "There are more results. Follow the 'nextPage' link for more.",
"firstPage": "/c/en/example?offset=0&limit=20",
"nextPage": "/c/en/example?offset=20&limit=20",
"paginatedProperty": "edges"
}
As the comment states, "nextPage" contains a link to the next page of results. If you're viewing the API response in a Web browser, you can click the link to see more results.
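If you're fetching results in code, following those links can be a short loop. Here's a sketch that collects edges across pages by following `nextPage`; the `fetch` argument stands in for a real HTTP call such as `requests.get('http://api.conceptnet.io' + path).json()`, and the stub pages below are hypothetical:

```python
def all_edges(path, fetch):
    """Collect edges from every page by following 'nextPage' links.

    `fetch` takes a path and returns the parsed JSON for that page;
    in real use it would wrap an HTTP request to api.conceptnet.io.
    """
    edges = []
    while path is not None:
        page = fetch(path)
        edges.extend(page.get("edges", []))
        # The last page has no 'nextPage' in its 'view' section.
        path = page.get("view", {}).get("nextPage")
    return edges

# Demo with hypothetical stub pages instead of live API responses:
pages = {
    "/c/en/example?offset=0&limit=20": {
        "edges": [{"@id": "edge1"}],
        "view": {"nextPage": "/c/en/example?offset=20&limit=20"},
    },
    "/c/en/example?offset=20&limit=20": {
        "edges": [{"@id": "edge2"}],
    },
}
edges = all_edges("/c/en/example?offset=0&limit=20", pages.get)
print(len(edges))  # 2
```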
We try to send you the formatted HTML only when it looks like you're using a Web browser, but we might guess wrong, or maybe you just want the plain JSON anyway. Add `?format=json` to the URL that you query. For example:
http://api.conceptnet.io/c/en/example?format=json
Try going to that URL in Firefox, which has its own built-in JSON formatter. It won't give you a way to follow the links, but other than that, it's pretty nice.
The API responses are in JSON-LD, a linked data format that on the surface is just reasonable-looking JSON, and under the hood preserves some of the good parts of RDF and the Semantic Web.
This is an interesting comparison to make, as the projects have similar goals, and by now they both make use of multilingual linked data.
ConceptNet contains more kinds of relationships than WordNet. ConceptNet's vocabulary is larger and interconnected in many more ways. In exchange, it's somewhat messier than WordNet.
ConceptNet does only the bare minimum to distinguish word senses so far -- in the built graph of ConceptNet 5.5, word senses are only distinguished by their part of speech (similar to sense2vec). WordNet has a large number of senses for every word, though some of them are difficult to distinguish in practice.
WordNet is too sparse for some applications. You can't build word vectors from WordNet alone. You can't compare nouns to verbs in WordNet, because they are mostly unconnected vocabularies.
ConceptNet does not assume that words fall into "synsets", sets of synonyms that are completely interchangeable. Synonymy in ConceptNet is a relation like any other. If you've worked with WordNet, you may have been frustrated by the implications of the synset assumption on real text, where words are not marked with specific senses, and where the word "He" cannot usually be replaced synonymously with "atomic number 2".
In ConceptNet, we incorporate as much of WordNet as we can while undoing the synset assumption, and we give it a high weight, because the information in WordNet is valuable and usually quite accurate.
ConceptNet is linked open data, and that makes it fundamentally a different thing than a proprietary knowledge base.
Google's Knowledge Graph is a brand name on top of the structured knowledge that it takes to run the Google search engine, Google Assistant, and probably other applications. It provides those sidebars of facts you get when you search for things on Google, and it provides answers to questions that you ask the Google Assistant. It seems to focus largely on things you can buy and things you can look up on Wikipedia. (In ConceptNet, we focus more on the general meanings of all words, whether they be nouns, verbs, adjectives, or adverbs, and less on named entities.)
I assume it's a very well-designed knowledge representation for a search engine. And there is only one search engine that it can power. Fundamentally, the Google Knowledge Graph supports the ability to interact with Google products on Google's terms.
Unlike the typical corporate knowledge base, ConceptNet has remained true to its crowdsourcing roots. While it's a project developed at Luminoso, it is open for anyone to use under a Creative Commons license. This is the fair thing to do, given how much of it depends on public contributions and linked data, but it's also part of Luminoso's ideals. When we let you see and use our state-of-the-art knowledge representation first-hand, it promotes understanding of why Luminoso's products are a better approach to NLP.
BabelNet is very similar in structure to ConceptNet, but very different in openness.
BabelNet uses many of the same knowledge sources as ConceptNet. It lacks the Open Mind Common Sense and Games with a Purpose data, which provide ConceptNet with a wide range of noisy but effective relational knowledge. It does, on the other hand, have a representation of WordNet-style word senses that ConceptNet doesn't have.
As of 2018, BabelNet is proprietary and not available to the public. You may find this surprising given how they've touted their openness in the past, and given that it's built on Creative Commons Share-Alike resources, but check their site. You won't find a download link.
They allow you to submit an application to use it for research purposes only, if you meet the requirements of having academic credentials and a current academic affiliation.
DBPedia is very much focused on named entities. It's messier than ConceptNet. Its vocabulary consists only of titles of Wikipedia articles.
DBPedia contains information that can be used for answering specific questions, such as "Where is the birthplace of John Adams?" or "What countries have a population of over 10 million?". It particularly knows a lot about locations, movies, and music albums. You could use DBPedia to solve Six Degrees of Kevin Bacon.
ConceptNet imports a small amount of DBPedia, and also contains external links to DBPedia and Wikidata.
DBnary is a counterpart to DBPedia that's actually quite compatible with ConceptNet. Like ConceptNet, it focuses on word definitions rather than named entities, and it gets them from parsing Wiktionary.
Right now we use our own Wiktionary parser, which covers fewer Wiktionary sites than DBnary does but extracts more detail from each entry. We would gladly use DBnary instead, if DBnary starts extracting information such as links from definitions.
Cyc was an ontology built on a predicate logic representation called CycL. CycL enabled very precise reasoning in a way that machine learning over ConceptNet doesn't. However, Cyc was intolerant of errors, and adding information to Cyc was a difficult task that kept Cycorp occupied for over 30 years.
OpenCyc provides a hierarchy of types of things, with English names, some of which are automatically generated. It seems to be intended as a preview of the full Cyc system, a proprietary system that was shut down in 2017.
ConceptNet includes a subset of OpenCyc, consisting of the IsA statements that can be reasonably represented in natural language.
The Microsoft Concept Graph is a proprietary taxonomy of English nouns, connected with the "IsA" relation, with some automatic word sense disambiguation. Its data comes from machine reading of a Web search index. It resembles an automatically-generated version of OpenCyc, and is derived from an earlier project named Probase.
The Microsoft Concept Graph was shut down in 2018.
Approximately 34 million.
No. Its representation is words and phrases of natural language, and relations between them. Natural language can be vague, illogical, and incredibly useful.
The data that ConceptNet is built from spans a lot of different languages, with a long tail of marginally-represented languages. 10 languages have core support, 77 languages have moderate support, and 304 languages are supported in total. See Languages for a complete list.
This will always be true. We use machine-learning techniques, including word embeddings, to learn generalizable things from ConceptNet despite the incompleteness of the knowledge it contains.
There will probably always be isolated mistakes or falsehoods in ConceptNet. Our data sources and our processes are not perfect. Machine learning can be relatively robust against errors, as long as the errors are not systematic.
If you've identified a systematic source of errors in ConceptNet, that is more important. It would probably improve ConceptNet to get rid of it. In that case, please go to the 'Issues' tab and describe it in an issue report.
See the table on the Relations page of this wiki.
Made-up numbers that are programmed into the reader modules that import various sources of knowledge. These weights represent a rough heuristic of which statements you should trust more than other statements.
During the golden age of crowdsourcing (the decade of the 2000s), ConceptNet accepted direct contributions of knowledge. This was a great start, but now the opportunities for improving ConceptNet have changed, and we are content to leave crowdsourcing to the organizations that are really good at it, like the Wikimedia Foundation.
If you contribute to Wiktionary and follow their guidelines, the information you contribute will eventually be represented in ConceptNet.
What I mean is, can I make my own version of ConceptNet that includes information that I need in my domain?
Well, you can reproduce ConceptNet's build process and change the code to import a new source of data. This may or may not accomplish what you want.
What ConceptNet is designed for is representing general knowledge. Making a useful domain-specific semantic model is a rather different process, in our experience. The software we built on top of ConceptNet to make this possible eventually became our company, Luminoso. Luminoso provides software as a service that creates domain-specific semantic models, which make use of ConceptNet so they can start out knowing what words mean and just have to learn what's different in your domain.
We've tried a lot of them. Currently PostgreSQL.
Probably one of the following reasons:
- It isn't as efficient as PostgreSQL
- It doesn't actually work as advertised
- It is no longer maintained
- It doesn't provide a good workflow for importing a medium-sized graph such as ConceptNet
- It takes more than a day to import a medium-sized graph such as ConceptNet
- It inflates the size of the data it stores by a factor of more than 10
- It assumes every user has access to and wants to use a distributed computing cluster
- It would be hard for people who want their own copy of ConceptNet to install it
- It's not free software
- It has a restriction on it that would prevent people from reusing ConceptNet, such as the GPL or "academic use only"
If you think you know of a database that meets all of these criteria, I'd still be interested to hear about it.
It fits on a hard disk, so no. It's enough data for many purposes. But text is small.
If you have textual knowledge that actually requires distributed computation, you work at a company that does Web search.
You're asking about a visualization like this, right?
Notice that that graph is a few thousand times smaller than ConceptNet, and it's already an incomprehensible rainbow-colored hairball. I'm not convinced that any existing technology can put all of ConceptNet in one meaningful image, although there may be an approach that involves spreading it out into local clusters using t-SNE.
It will almost certainly involve custom code -- ConceptNet makes off-the-shelf graph visualizers collapse under the insoluble problem of laying out its edges. I'm interested in making such a visualization, but the result has to be informative, not just a hairball.
No. SPARQL is computationally infeasible. Similar projects that use SPARQL have unacceptable latency and go down whenever anyone starts using them in earnest.
The way to query ConceptNet is using a rather straightforward REST API, described on the API page. If you need to make a form of query that this API doesn't support, open an issue and we'll look into supporting it.
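For example, the API page describes a `/query` endpoint that filters edges by parameters such as `node` and `rel`. A sketch of building such a request URL (check the API page for the full set of supported parameters; the parameter values here are just illustrative):

```python
from urllib.parse import urlencode

# Build a query URL for the REST API's /query endpoint. The `node`
# and `rel` parameters are described on the API page; consult it for
# the full list of supported parameters.
def query_url(**params):
    return "http://api.conceptnet.io/query?" + urlencode(params)

url = query_url(node="/c/en/example", rel="/r/IsA", limit=5)
print(url)
# In real use you would then fetch it, e.g.:
#   requests.get(url).json()
```

The `urlencode` call percent-escapes the slashes in the ConceptNet URIs, which is what the API expects in query strings.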
Blame science reporting for doing what it usually does. There's a nugget of truth in there surrounded by a big wad of meaningless AI hype. It's true that ConceptNet 4 could compete with 4-year-olds on a particular question-answering task -- and ConceptNet 5 performs much better on a similar task. This is cool. It doesn't mean that anyone's about to make robot children.
Here's the background: A much older version of ConceptNet, ConceptNet 4, was evaluated on some intelligence tests involving question-answering and sentence comprehension. The researchers who performed these tests compared ConceptNet's performance to a 4-year-old child.
We found the comparison odd but flattering. 4-year-old children are incredible beings. They have desires, goals, and imagination, and they can communicate them in their spoken language with a level of competence that second-language learners have to put tremendous effort into achieving. No real AI system can come close to emulating the range of things a child can do.
When it comes to the narrower task of answering questions, though, it's believable that ConceptNet 4 compared to a 4-year-old. We're always interested in measurably improving the general intelligence contained in ConceptNet. Excitingly, we now have a question-answering task in which ConceptNet 5 compares to a 17-year-old: that of answering SAT-style analogy questions.
The Story Cloze Test is a test of story understanding that any human can score close to 100% on in their native language. ConceptNet is used in state-of-the-art systems that solve this task. See this paper by Jiaao Chen et al.