Introduction to Natural Language Processing



Natural language processing (NLP) is a field of computer science that focuses on human (natural) languages. NLP has significant overlap with the field of computational linguistics, and is often considered a sub-field of artificial intelligence.

NLP covers many tasks. Natural language understanding, for example, can be considered an AI-complete problem: its difficulty is equivalent to that of solving the central artificial intelligence problem (i.e. making computers as intelligent as people), because solving it requires extensive knowledge of the outside world and the ability to manipulate it.

Most recent NLP algorithms are based on machine learning methods and usually rely on statistical inference. Current approaches to NLP require an understanding of several disparate fields, including linguistics, computer science, and statistics.

As noted above, machine learning is the basis of modern approaches to natural language processing. Many other systems, by contrast, are based on large sets of hand-produced rules. In the machine learning approach, a corpus, typically a set of documents, is used to train a system.

The most important advantages of systems that take machine-learning approaches over systems that use hand-produced rules are:

  • The learning procedure can automatically focus on the most common cases. This greatly reduces the effort needed to find such common cases. It is not an easy task for humans to manually process a very large amount of data to extract the rules.
  • Using statistical inference, it is possible to develop algorithms that are robust to unseen and erroneous input. Generally, systems with hand-written rules have very poor performance in such cases.
  • Machine learning systems can be made more accurate by supplying more input data. Systems based on hand-written rules, however, can be made more accurate only by increasing the complexity of the rules, which is a much more difficult task.
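
The contrast above can be made concrete with a minimal sketch of a learned text classifier. The choice of a Naive Bayes model with add-one smoothing, the four-document corpus, and the names train and classify are all assumptions made for illustration; none of them comes from a particular system. The smoothing step is what lets the classifier cope with a word it has never seen, such as the misspelling in the usage example.

```python
import math
from collections import Counter, defaultdict

def train(corpus):
    """Count labels and words in a corpus of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in corpus:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing: an unseen word gets a small,
            # non-zero probability instead of ruling the label out.
            numerator = word_counts[label][word] + 1
            denominator = total_words + len(vocabulary) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus; a real system would use far more data.
corpus = [
    ("book a flight to paris", "travel"),
    ("cheap flight and hotel", "travel"),
    ("read a good book", "reading"),
    ("the novel is a good read", "reading"),
]
model = train(corpus)
print(classify(model, "book a cheap flightt"))  # "flightt" is unseen input
```

Adding more labelled documents to the corpus changes the counts, and therefore the decisions, without any rule having to be rewritten by hand.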

Tasks

The following list contains some of the NLP tasks that, to a greater or lesser extent, are related to the research context of this dissertation. Some of these tasks have real-world applications, and others are used as prerequisites of larger ones.

Part-of-speech (POS) tagging

This is the task of determining the part of speech for each word. It is not a trivial task (i.e. it is harder than just having a list of words and their parts of speech) because there are many words, especially common ones, that have multiple parts of speech. For example, the word "book" can be a noun (e.g. in "the book on the table") or can be a verb (e.g. in "to book a flight"). The amount of such ambiguity differs from language to language. The less inflectional morphology a language has, the more such ambiguity arises (e.g. English is particularly prone to such ambiguity).
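
The "book" example can be sketched in a few lines. The lexicon, the tag names, and the two context rules below are hypothetical and hand-written purely for illustration; practical taggers learn such preferences from an annotated corpus. The sketch only shows why a word list alone is not enough: an ambiguous word needs its context.

```python
# Hypothetical toy lexicon: each word maps to the tags it can take.
LEXICON = {
    "the": {"DET"}, "a": {"DET"}, "to": {"TO"}, "on": {"PREP"},
    "book": {"NOUN", "VERB"}, "table": {"NOUN", "VERB"},
    "flight": {"NOUN"},
}

def tag(words):
    """Tag each word, using the previous tag to resolve ambiguity."""
    tags = []
    for word in words:
        options = LEXICON.get(word.lower(), {"NOUN"})
        previous = tags[-1] if tags else None
        if len(options) == 1:
            choice = next(iter(options))
        elif previous == "TO" and "VERB" in options:
            choice = "VERB"   # "to book" -> verb
        elif previous == "DET" and "NOUN" in options:
            choice = "NOUN"   # "the book" -> noun
        else:
            choice = sorted(options)[0]  # arbitrary but deterministic
        tags.append(choice)
    return list(zip(words, tags))

print(tag("the book on the table".split()))
print(tag("to book a flight".split()))
```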

Named entity recognition (NER)

Given a stream of text, this is the task of determining which items in the text map to predefined categories such as the names of persons, organizations, locations, quantities, etc. For NER, both linguistic grammar-based techniques and statistical models are used.
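
As a rough illustration of the rule-based end of that spectrum, the sketch below combines a name list (gazetteer) with a regular expression for quantities. The gazetteer entries, the unit list, and the function name find_entities are assumptions invented for this example; a statistical model would instead be trained on annotated text.

```python
import re

# Hypothetical gazetteer: known names for each category.
GAZETTEER = {
    "PERSON": ["Ada Lovelace"],
    "ORGANIZATION": ["Acme Corp"],
    "LOCATION": ["London"],
}
# A number followed by one of a few assumed units.
QUANTITY = re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|%)")

def find_entities(text):
    """Return (entity text, category) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    for match in QUANTITY.finditer(text):
        found.append((match.start(), match.group(), "QUANTITY"))
    return [(entity, label) for _, entity, label in sorted(found)]

sentence = "Ada Lovelace joined Acme Corp in London and walked 5 km."
print(find_entities(sentence))
```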

Text chunking

This is the task of dividing a text into syntactically correlated groups of words.

For example, the sentence "I heard a new president will be elected in October" can be chunked as: [NP I] [VP heard] [NP a new president] [VP will be elected] [PP in] [NP October]. Text chunking can be considered an intermediate step towards full parsing.
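
The bracketing in this example can be reproduced with a very crude sketch that groups consecutive words whose part-of-speech tags belong to the same chunk type. The tag-to-chunk table and the hand-assigned tags are assumptions made for illustration, and the grouping would wrongly merge two adjacent noun phrases; real chunkers are considerably more careful.

```python
from itertools import groupby

# Hypothetical mapping from POS tags to chunk types.
CHUNK_OF_TAG = {
    "PRP": "NP", "DT": "NP", "JJ": "NP", "NN": "NP", "NNP": "NP",
    "VBD": "VP", "MD": "VP", "VB": "VP", "VBN": "VP",
    "IN": "PP",
}

def chunk(tagged_words):
    """Group consecutive (word, tag) pairs that share a chunk type."""
    chunks = []
    chunk_type = lambda pair: CHUNK_OF_TAG.get(pair[1], "O")
    for label, group in groupby(tagged_words, key=chunk_type):
        words = " ".join(word for word, _ in group)
        chunks.append(f"[{label} {words}]")
    return " ".join(chunks)

# POS tags assigned by hand for this example.
tagged = [
    ("I", "PRP"), ("heard", "VBD"), ("a", "DT"), ("new", "JJ"),
    ("president", "NN"), ("will", "MD"), ("be", "VB"),
    ("elected", "VBN"), ("in", "IN"), ("October", "NNP"),
]
print(chunk(tagged))
```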

Parsing

This is the process of analyzing a text to determine its grammatical structure with respect to a given grammar. Since the grammars of natural languages are ambiguous, typical sentences have multiple possible analyses; a typical sentence can have thousands of potential parses.
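
How quickly the number of analyses grows can be seen even with a toy grammar. The sketch below counts parses with a CKY-style chart; the six-rule grammar, the lexicon, and the function name count_parses are hypothetical and chosen only to exhibit prepositional-phrase attachment ambiguity.

```python
# Toy grammar in Chomsky normal form (hypothetical, for illustration).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "park": ["N"],
    "with": ["P"], "in": ["P"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees of words under the toy grammar."""
    n = len(words)
    chart = {}  # (start, end, symbol) -> number of parse trees
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] = 1
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    ways = (chart.get((start, mid, left), 0)
                            * chart.get((mid, end, right), 0))
                    if ways:
                        key = (start, end, parent)
                        chart[key] = chart.get(key, 0) + ways
    return chart.get((0, n, root), 0)

print(count_parses("I saw the man with the telescope".split()))
print(count_parses("I saw the man with the telescope in the park".split()))
```

Under this grammar the first sentence has two parses and adding one more prepositional phrase raises the count to five; each further phrase multiplies the possibilities again.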

Machine translation (MT)

This is the task of translating text or speech from one natural language to another. At its simplest, it performs word-for-word translation, but that alone usually produces a very poor translation of a text. In order to produce good translations, different types of knowledge that humans possess (e.g. grammar and semantics) are needed.
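
The weakness of word-for-word translation is easy to demonstrate. The three-entry English-to-Spanish dictionary and the function name word_for_word below are assumptions made for this sketch only.

```python
# Hypothetical three-word English-to-Spanish dictionary.
DICTIONARY = {"the": "el", "red": "rojo", "house": "casa"}

def word_for_word(sentence):
    """Translate each word independently, keeping the source order."""
    words = sentence.lower().split()
    return " ".join(DICTIONARY.get(word, word) for word in words)

print(word_for_word("the red house"))  # el rojo casa
```

A fluent translation would be "la casa roja": the adjective has to follow the noun, and both the article and the adjective have to agree with the feminine noun. Neither piece of grammatical knowledge is available to a per-word lookup.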

Coreference resolution (CoRe)

This is the process of finding noun phrases (markables) that refer to the same real-world entity or concept. Anaphora resolution is a specific example of this type of task: it is the process of finding the noun phrases (markables) to which pronouns refer.
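
A deliberately naive sketch of pronoun resolution links each pronoun to the nearest preceding mention of matching gender. The pronoun table, the hand-annotated mention genders, and the function name resolve_pronouns are assumptions made for illustration; practical systems use far richer features.

```python
# Hypothetical pronoun table: pronoun -> gender.
PRONOUNS = {"he": "masc", "him": "masc", "she": "fem", "her": "fem"}

def resolve_pronouns(tokens, mentions):
    """Link each pronoun to the nearest earlier mention of the same gender.

    mentions maps a token index to the gender of the name at that index.
    Returns a dict mapping pronoun index -> antecedent index.
    """
    links = {}
    for i, token in enumerate(tokens):
        gender = PRONOUNS.get(token.lower())
        if gender is None:
            continue
        for j in range(i - 1, -1, -1):
            if mentions.get(j) == gender:
                links[i] = j
                break
    return links

tokens = ["Mary", "met", "John", ".", "She", "greeted", "him", "."]
mentions = {0: "fem", 2: "masc"}  # annotated by hand for this example
for pronoun, antecedent in resolve_pronouns(tokens, mentions).items():
    print(tokens[pronoun], "->", tokens[antecedent])
```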

Automatic summarization

This concerns the production of a readable summary (a shortened version) of a text. There are different types of summaries, including generic summaries and query-relevant summaries.

Question answering (QA)

Given a human-language question, this is the task of determining its answer. There is a wide range of question types, including questions with a specific right answer, such as "What is the capital of Germany?", and open-ended questions, such as "What is the meaning of beauty?". Typical approaches to this task use either a pre-structured database or a collection of natural language documents (a text corpus).
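
The database-backed approach can be sketched for a single question pattern. The two-entry table, the regular expression, and the function name answer are hypothetical; the sketch answers the factual question and, as expected, has nothing to say about the open-ended one.

```python
import re

# Hypothetical pre-structured database of country capitals.
CAPITALS = {"germany": "Berlin", "france": "Paris"}

# One hand-written question pattern; real systems handle many more.
QUESTION = re.compile(
    r"(?:what|where) is the capital of (\w+)\??$", re.IGNORECASE
)

def answer(question):
    """Return the capital asked about, or None if it cannot be answered."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None
    return CAPITALS.get(match.group(1).lower())

print(answer("What is the capital of Germany?"))  # Berlin
print(answer("What is the meaning of beauty?"))   # None
```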

Relationship extraction

This is the task of detecting semantic relationships among a set of textual entities (e.g. who is the father of whom).
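
For the family example, a minimal pattern-based sketch might look as follows. The single regular expression, the invented names in the sample text, and the function name extract_relations are assumptions for illustration; they cover only sentences of the exact form "X is the father of Y" or "X is the mother of Y".

```python
import re

# Matches only the literal pattern "<name> is the father/mother of <name>".
PATTERN = re.compile(r"(\w+) is the (father|mother) of (\w+)")

def extract_relations(text):
    """Return (relation, parent, child) triples found in the text."""
    return [
        (match.group(2), match.group(1), match.group(3))
        for match in PATTERN.finditer(text)
    ]

text = "Alice is the mother of Bob. Carl is the father of Dana."
print(extract_relations(text))
```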

Sentiment analysis

Given a set of documents, this is the task of extracting subjective information (e.g. determining the "polarity" of opinions about specific objects). The documents are usually online reviews. Sentiment analysis is used especially for identifying public opinion trends in social media.
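
Polarity detection can be approximated, very roughly, by counting words from positive and negative word lists. The two word lists and the function name polarity are assumptions made for this sketch; it ignores negation, sarcasm, and the object each opinion is about.

```python
import re

# Hypothetical polarity word lists.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(review):
    """Classify a review by counting positive and negative words."""
    words = re.findall(r"[a-z']+", review.lower())
    score = (sum(word in POSITIVE for word in words)
             - sum(word in NEGATIVE for word in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love it, excellent value."))
print(polarity("Great screen, but terrible battery and poor sound."))
```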

Natural language understanding

This task converts chunks of text into more formal representations, such as first-order logic structures. Almost all approaches to this task contain the following components: a lexicon, a syntax analyzer and a semantic model.

Information extraction (IE)

This is concerned in general with the extraction of structured information (e.g. semantic information) from text. IE includes such tasks as named entity recognition, coreference resolution and relationship extraction.