NLTK (Natural Language Toolkit)

NLTK is a powerful Python library for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet. NLTK also includes a variety of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, among other natural language processing (NLP) tasks.

Key Features and Components of NLTK:

  1. Corpora and Resources: NLTK ships with a diverse collection of corpora and lexical resources, including annotated text data for tasks such as part-of-speech tagging, named entity recognition, and sentiment analysis. The library also provides access to WordNet, a lexical database of English (see the short sketch after this list).
  2. Tokenization: NLTK provides tools for tokenizing sentences and words. Tokenization is the process of breaking text into individual words or sentences.
  3. Part-of-Speech Tagging: NLTK allows you to perform part-of-speech tagging, which involves assigning a grammatical category (such as noun, verb, adjective) to each word in a text.
  4. Stemming and Lemmatization: NLTK includes tools for stemming (reducing words to their root form) and lemmatization (reducing words to their base or dictionary form).
  5. Named Entity Recognition (NER): NLTK includes tools for identifying named entities (such as persons, organizations, locations) in text.
  6. Frequency Distribution: NLTK provides tools for analyzing the frequency distribution of words in a text.
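
Item 1's WordNet access, for instance, can be queried directly from Python. A minimal sketch, assuming the wordnet data package has already been downloaded:


from nltk.corpus import wordnet

# List the first few senses (synsets) of a word with their definitions
for synset in wordnet.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# Synonyms recorded for the first sense of 'good'
print([lemma.name() for lemma in wordnet.synsets('good')[0].lemmas()])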

Example Code:


import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import ne_chunk
from nltk import FreqDist

# Download the data packages the examples below depend on
# (package names can vary slightly between NLTK versions)
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('wordnet')                     # lemmatizer data
nltk.download('maxent_ne_chunker')           # NER chunker model
nltk.download('words')                       # word list used by the chunker

# Tokenization: split text into words and sentences
text = "NLTK is a powerful library for natural language processing."
words = word_tokenize(text)
sentences = sent_tokenize(text)

# Part-of-Speech Tagging
tagged_words = pos_tag(words)

# Stemming and Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stemmed_words = [stemmer.stem(word) for word in words]
# lemmatize() assumes each word is a noun unless a POS tag is passed
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
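
# The two differ on inflected or irregular forms, for example:
#   stemmer.stem('running')                 -> 'run'
#   lemmatizer.lemmatize('better', pos='a') -> 'good'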

# Named Entity Recognition (ne_chunk expects POS-tagged tokens)
text_ner = "Barack Obama was born in Hawaii."
words_ner = word_tokenize(text_ner)
tagged_words_ner = pos_tag(words_ner)
named_entities = ne_chunk(tagged_words_ner)

# Frequency Distribution
freq_dist = FreqDist(words)

print("Tokenization:")
print("Words:", words)
print("Sentences:", sentences)

print("Part-of-Speech Tagging:")
print("Tagged Words:", tagged_words)

print("Stemming and Lemmatization:")
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)

print("Named Entity Recognition (NER):")
print("Named Entities:", named_entities)

print("Frequency Distribution:")
print("Most Common Words:", freq_dist.most_common(5))

To use NLTK, first install it with pip:


pip install nltk

After installation, you may need to download additional resources such as corpora and models. NLTK provides a convenient way to fetch these using the nltk.download() function.
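
Called with no arguments, nltk.download() opens an interactive downloader; passing a package name fetches that package directly, as in the downloads at the top of the example script above:


import nltk

nltk.download()         # open the interactive downloader
nltk.download('punkt')  # or fetch a specific package non-interactively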