Project Characteristics

The PULC project aims at building a computer program that can precisely understand the meaning of certain kinds of Natural Language texts. “Precise understanding” is defined as: correctly solving tasks that require inferring conclusions from the information conveyed in the text.

The PULC project has the following characteristics:

1. Precision

The PULC project sets out to achieve a high level of precision in language understanding. This level of precision is necessary for supporting real inference based on the information conveyed in texts. Inference means combining the pieces of information that are distributed throughout the text in order to reach new conclusions that do not appear explicitly in the text, rather than merely finding each fact separately. (Why does real inference from information require a high-quality level of understanding?)

The PULC project contrasts with other projects that aim at a superficial level of language “understanding” using so-called “broad coverage” text processing techniques. For example, current information retrieval, information extraction, and so-called “question answering” techniques essentially try to pattern-match the structure of a query against the structure of a text. The structure and content of the two must therefore closely resemble each other, and the only questions that can be addressed are factoid questions of the who-did-what-to-whom kind.
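The difference between looking up a single stated fact and inferring a conclusion from several facts can be illustrated with a toy sketch. Everything in it is hypothetical: the facts, the single transitivity rule, and the function names are invented for illustration and are not part of the PULC system or its representation.

```python
# Toy illustration only: the facts, the rule, and all names are hypothetical,
# not part of PULC.

# Facts as a text might state them explicitly: (subject, relation, object).
FACTS = {
    ("Ann", "is_older_than", "Bob"),
    ("Bob", "is_older_than", "Carl"),
}


def lookup(query, facts):
    """Pattern matching: succeeds only if the query is stated verbatim."""
    return query in facts


def infer(query, facts):
    """Inference: combine facts, assuming 'is_older_than' is transitive."""
    derived = set(facts)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for (a, r1, b) in list(derived):
            for (c, r2, d) in list(derived):
                if r1 == r2 == "is_older_than" and b == c:
                    new_fact = (a, r1, d)
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    return query in derived


query = ("Ann", "is_older_than", "Carl")
print(lookup(query, FACTS))  # False: the fact is never stated in one place
print(infer(query, FACTS))   # True: it follows from combining two stated facts
```

The query is answerable only by combining two separately stated facts, which is the kind of conclusion that matching a query against individual sentences cannot reach.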

2. Not Just Any Arbitrary Text

Natural language texts are designed for human consumption, and so they are written with the implicit assumption that the reader has standard background world knowledge and commonsense knowledge. This knowledge is not mentioned explicitly in the text. For example, consider the sentence:

President Bush today compared dependence on overseas oil to a “foreign tax on the American people” and proposed a series of initiatives to fight the problem, including more nuclear plants and using technology to develop new energy sources. (http://www.cnn.com, April 27, 2005)

This sentence does not explicitly say that Bush made the comparison by speaking, that Bush proposed each initiative, what “the problem” refers to, that oil is an energy source, and so on. Yet it is impossible to precisely understand the text without this assumed background knowledge.

Representing and utilizing world knowledge in a computer is an extremely difficult endeavor. Doing so for all knowledge and all topics is impossible now and in the foreseeable future. In fact, it could be argued that this is almost an AI-complete problem (if it were possible to do it, then most Artificial Intelligence tasks would be solved, since the computer would possess knowledge describing how to do any ordinary or sophisticated task, and so would be able to do it). Therefore, it is impossible to precisely understand arbitrary texts on arbitrary topics.

It is nevertheless possible to achieve precise understanding of certain kinds of texts, on particular topics, if the necessary background knowledge is fed into the computer.

3. Rigorous and Principled Study of Natural Language Phenomena

Just as the Human Genome Project aimed to map out the entire human DNA sequence, the PULC project aims to map out the entire human knowledge of the English language. The amount of linguistic knowledge about a language like English, while vast, is much smaller than the amount of general world knowledge, and it is conceivable that we can discover all of it and represent it in a computer in the not too distant future.

Of course, the entire field of Linguistics aims at mapping out the human knowledge of English and all other natural languages, and there has been decent progress in the past several decades in discovering and documenting this knowledge. However, there are still many gaps in this knowledge. Moreover, the knowledge is usually stated only semi-rigorously, and is not immediately applicable to a computational system. The PULC project aims at stating the linguistic knowledge in a comprehensive, consistent, and coherent format, so precise that it could be used by a computer.

On the other hand, there has been a lot of work in natural language processing (NLP) aiming at producing computer applications that do useful things with language input. But just as linguistic theories for the most part ignore computational issues, NLP work by and large ignores linguistic theory, and moreover emphasizes formalisms and techniques while ignoring the study of language phenomena themselves. The PULC project is founded on the belief that no amount of computational sophistication can replace a meticulous investigation of the subject matter itself (i.e., language).

In short, the PULC project aims at both computational rigor and a principled, scrupulous study of language phenomena.