Word2Vec Example

This is a simple example of training and querying a Word2Vec model in Python using the Gensim library.

Word2Vec Overview

Word2Vec is a popular technique for learning word embeddings, which represent words as dense vectors in a continuous vector space. Word embeddings capture semantic relationships between words, making them useful for a variety of natural language processing (NLP) tasks. Word2Vec models are trained on large text corpora using one of two architectures: Skip-gram, which predicts the context (surrounding words) of a target word, and Continuous Bag of Words (CBOW), which predicts a target word given its context.
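Gensim exposes this architectural choice through the sg parameter of its Word2Vec class (sg=0 selects CBOW, the default; sg=1 selects Skip-gram). Below is a minimal sketch; the two-sentence toy corpus is made up purely for illustration:

from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences (illustrative only)
toy_corpus = [
    ["word", "embeddings", "capture", "meaning"],
    ["word2vec", "learns", "embeddings", "from", "context"],
]

# sg=0: CBOW (predict a target word from its context)
cbow_model = Word2Vec(sentences=toy_corpus, sg=0, vector_size=50, window=2, min_count=1)

# sg=1: Skip-gram (predict context words from a target word)
skipgram_model = Word2Vec(sentences=toy_corpus, sg=1, vector_size=50, window=2, min_count=1)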

Key concepts of Word2Vec:

- Word embeddings: dense, low-dimensional vectors that represent words, learned from co-occurrence patterns in text.
- Context window: the span of surrounding words the model considers when learning a word's representation.
- CBOW (Continuous Bag of Words): the architecture that predicts a target word from its surrounding context.
- Skip-gram: the architecture that predicts the surrounding context from a target word.

Word2Vec embeddings are commonly used in NLP applications such as sentiment analysis, named entity recognition, and machine translation; the sketch below illustrates one such use.
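As a concrete illustration of the sentiment-analysis case, a common baseline represents a sentence as the average of its word vectors and feeds that fixed-length vector to a classifier. The following self-contained sketch assumes a made-up toy corpus and a hypothetical helper, sentence_vector; neither is part of the Gensim API:

import numpy as np
from gensim.models import Word2Vec

# Illustrative toy corpus (an assumption for this sketch)
corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["great", "acting", "and", "a", "great", "story"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

# Hypothetical helper: average the vectors of in-vocabulary tokens
def sentence_vector(tokens, model):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)  # fall back to zeros if no token is known
    return np.mean(vectors, axis=0)

# The resulting fixed-length vector can serve as a feature for a downstream classifier
features = sentence_vector(["the", "acting", "was", "great"], model)
print(features.shape)  # (50,)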

Python Source Code:

# Import necessary libraries
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')  # tokenizer data required by word_tokenize

# Example sentences for training Word2Vec model
sentences = [
    "Word embeddings are dense vector representations of words.",
    "They capture semantic relationships between words.",
    "Word2Vec is a popular technique for learning word embeddings.",
    "Natural language processing tasks often benefit from using word embeddings."
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train the Word2Vec model
# vector_size: embedding dimensionality; window: context size;
# min_count=1 keeps every token (reasonable only for a toy corpus); workers: training threads
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Retrieve the vector for a specific word (the token must be in the model's vocabulary)
vector_for_word = word2vec_model.wv['word']

# Find similar words
similar_words = word2vec_model.wv.most_similar('word', topn=3)

# Print results
print("Vector for 'word':", vector_for_word)
print("\nSimilar words to 'word':", similar_words)

Explanation:

The script first tokenizes each sentence into lowercase words with NLTK's word_tokenize. The tokenized sentences are then passed to Gensim's Word2Vec class, which trains 100-dimensional embeddings using a context window of 5 words. After training, the model's wv attribute exposes the learned vectors: indexing it with a token returns that token's embedding, and most_similar returns the words whose vectors are closest to the query word by cosine similarity.