{"id":17427,"date":"2025-08-08T10:16:50","date_gmt":"2025-08-08T14:16:50","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=17427"},"modified":"2025-08-08T10:16:50","modified_gmt":"2025-08-08T14:16:50","slug":"vectorizing-language-with-word-and-document-embeddings","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2025\/08\/08\/vectorizing-language-with-word-and-document-embeddings\/","title":{"rendered":"Vectorizing Language with Word and Document Embeddings"},"content":{"rendered":"<h6><\/h6>\r\nNatural language is the foundation of human communication, but it\u2019s unstructured and full of nuance. Synonyms convey similar ideas, a single word can carry multiple meanings, and language interpretation shifts with context. Capturing this complexity in a format that algorithms can understand is an important challenge in <a href=\"https:\/\/www.mathworks.com\/discovery\/natural-language-processing.html\">natural language processing (NLP)<\/a>. Vectorization and techniques like word embeddings and document embeddings encode language in an efficient way that NLP algorithms can work directly with.\r\n<h6><\/h6>\r\nIn this post, we\u2019ll explore why vectorizing language matters, what word and document embeddings are, and how you can implement and visualize embeddings with MATLAB. These embeddings enable applications like semantic search, text classification, topic modeling, and recommendation systems.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Why Vectorization Matters in NLP<\/strong><\/p>\r\nVectorization is language converted into numbers. 
It transforms raw text into features that <a href=\"https:\/\/www.mathworks.com\/discovery\/machine-learning.html\">machine learning<\/a> and <a href=\"https:\/\/www.mathworks.com\/discovery\/deep-learning.html\">deep learning<\/a> models can understand, making it foundational to any NLP application, from <a href=\"https:\/\/www.mathworks.com\/videos\/what-is-sentiment-analysis-1717146570792.html\">sentiment analysis<\/a> to language translation and chatbots.\r\n<h6><\/h6>\r\nA naive way to perform vectorization is using one-hot encoding, where each word is represented as a long, sparse vector with a 1 in a single position. However, this approach has major drawbacks:\r\n<h6><\/h6>\r\n<ul>\r\n \t<li>It does not capture semantic similarity (e.g., \u201ccar\u201d and \u201cvehicle\u201d are just as unrelated as \u201ccar\u201d and \u201cbanana\u201d).<\/li>\r\n \t<li>It creates high-dimensional data that\u2019s inefficient to store or compute on.<\/li>\r\n \t<li>It ignores word order and context.<\/li>\r\n<\/ul>\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>What Are Word Embeddings?<\/strong><\/p>\r\nWord embeddings are vector representations of words that capture their meaning based on context and usage. Word embeddings, such as <a href=\"https:\/\/www.mathworks.com\/discovery\/word2vec.html\">Word2Vec<\/a>, GloVe, and fastText, map words into dense, low-dimensional vectors where semantically similar words are close in the vector space. 
This transformation enables algorithms to generalize better and capture deeper linguistic relationships.\r\n<h6><\/h6>\r\nThe key characteristics of word embeddings are:\r\n<h6><\/h6>\r\n<ul>\r\n \t<li>Each word is represented by a fixed-length vector (typically 50\u2013300 dimensions).<\/li>\r\n \t<li>Words with similar meanings (e.g., \u201ccat\u201d and \u201ckitten\u201d) end up with similar vectors.<\/li>\r\n \t<li>Arithmetic with embeddings can reveal relationships:\r\n<h6><\/h6>\r\nword2vec(\"king\") - word2vec(\"man\") + word2vec(\"woman\") \u2248 word2vec(\"queen\")<\/li>\r\n<\/ul>\r\nIn MATLAB, you can create a <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.html\">word embedding model<\/a> to map words in a vocabulary to real vectors. For example, you can use the following code to create word embeddings and then calculate their similarity.\r\n<pre>emb = fastTextWordEmbedding;\r\nvec1 = word2vec(emb,\"Greece\");\r\nvec2 = word2vec(emb,\"Athens\");\r\nsim = cosineSimilarity(vec1,vec2)\r\n<\/pre>\r\n<h6><\/h6>\r\n<pre class=\"brush: python\" style=\"background-color: white; border: white;\">sim = single\r\n0.7875\r\n<\/pre>\r\n<h6><\/h6>\r\nThese representations enable tasks such as <a href=\"https:\/\/www.mathworks.com\/discovery\/clustering.html\">clustering<\/a> and analogy completion, and serve as input features for <a href=\"https:\/\/www.mathworks.com\/discovery\/machine-learning-models.html\">machine learning models<\/a>.\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>Contextual Word Embeddings<\/strong><\/p>\r\nUnlike traditional word embeddings (often referred to as static embeddings) that assign a single vector to each word, contextual word embeddings generate dynamic representations that vary depending on surrounding words. This allows models to capture the subtle nuances of meaning. 
For example, contextual word embeddings distinguish between \u201cbank\u201d in \u201criver bank\u201d versus \u201csavings bank.\u201d\r\n<h6><\/h6>\r\nContextual embeddings are produced by large language models like BERT and GPT, which consider the entire sentence when generating a vector for each token. In MATLAB, you can integrate these models using <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/10\/31\/transformer-models-from-hype-to-implementation\/\">transformer-based architectures<\/a> or external APIs, enabling powerful applications in sentiment analysis, semantic search, and language understanding. For an example, see <a href=\"https:\/\/github.com\/matlab-deep-learning\/llms-with-matlab\/blob\/main\/examples\/InformationRetrievalUsingOpenAIDocumentEmbedding.md\">Information Retrieval Using OpenAI\u2122 Document Embedding<\/a> in the <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/163796-large-language-models-llms-with-matlab\">LLMs with MATLAB repository<\/a>.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>What Are Document Embeddings?<\/strong><\/p>\r\nWhile word embeddings focus on individual words, document embeddings represent entire phrases, sentences, or documents as single vectors. 
They aim to capture not only word meanings but also word order, structure, and context.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17430 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2025\/06\/word_document_embeddings.png\" alt=\"Visual representation and comparison between word and document embeddings\" width=\"471\" height=\"271\" \/>\r\n<h6><\/h6>\r\nIn MATLAB, you can easily create <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/documentembedding.html\">document embeddings<\/a> for tasks like <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/classify-documents-using-document-embeddings.html\">document classification<\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/information-retrieval-with-document-embeddings.html\">information retrieval<\/a>.\r\n<h6><\/h6>\r\n<pre>emb = documentEmbedding;\r\ndocuments = [\"Call me Ishmael. ...\" + ...\r\n\"Some years ago\u2014never mind how long precisely\u2014having ...\" + ...\r\n\"little or no money in my purse, and nothing particular ...\" + ...\r\n\"to interest me on shore, I thought I would sail about ...\" + ...\r\n\"a little and see the watery part of the world.\"];\r\nembeddedDocuments = embed(emb,documents)\r\n<\/pre>\r\n<h6><\/h6>\r\n<pre class=\"brush: python\" style=\"background-color: white; border: white;\">embeddedDocuments = <em>1\u00d7384<\/em><\/pre>\r\n<pre class=\"brush: python\" style=\"background-color: white; border: white;\">0.0640\u00a0\u00a0\u00a0 0.0701\u00a0\u00a0\u00a0 0.0566\u00a0\u00a0\u00a0 0.0361\u00a0\u00a0\u00a0 0.0787\u00a0\u00a0 -0.0815\u00a0\u00a0\u00a0 0.0793\u00a0\u00a0\u00a0 0.0077 ...\r\n<\/pre>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Visualizing Embeddings<\/strong><\/p>\r\nUnderstanding how embeddings capture meaning in language often starts with visualization. 
By projecting high-dimensional word or document vectors into a 2D or 3D space, you can explore how semantically similar terms cluster together. MATLAB provides built-in functions like <a href=\"https:\/\/www.mathworks.com\/help\/stats\/tsne.html\">t-SNE<\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/stats\/pca.html\">PCA<\/a> to help you make sense of these embeddings.\r\n<pre>emb = fastTextWordEmbedding;\r\nwords = emb.Vocabulary(1:5000);\r\nV = word2vec(emb,words);\r\n\u00a0\r\n% Reduce to 2D using t-SNE\r\nXY = tsne(V);\r\n\u00a0\r\n% Plot\r\nfigure\r\ntextscatter(XY,words)\r\ntitle(\"t-SNE Visualization of Word Embeddings\")\r\n<\/pre>\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17463\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2025\/06\/tsne.png\" alt=\"t-SNE Visualization of Word Embeddings\" width=\"613\" height=\"524\" \/>\r\n<h6><\/h6>\r\nYou\u2019ll see meaningful groupings, for example, animals, vehicles, or geographical regions. This indicates that the embedding space organizes words by semantic relationships. For an example, see <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/visualize-word-embedding-using-text-scatter-plot.html\">Visualize Word Embeddings Using Text Scatter Plots<\/a>.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Final Thoughts<\/strong><\/p>\r\nWord and document embeddings form the backbone of modern NLP. They allow algorithms to \u201cunderstand\u201d text by converting language into numbers that capture meaning and context.\r\n<h6><\/h6>\r\nWhere are you going to apply embeddings? Classifying documents, analyzing sentiments, or building your own language-based application? 
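As a starting point, here is a minimal semantic-search sketch that combines the document embeddings and cosine similarity shown earlier in this post; the document texts and query below are invented for illustration, and it assumes embed returns one row per input document.\r\n<h6><\/h6>\r\n<pre>emb = documentEmbedding;\r\n% Embed a small, made-up document collection (one row per document)\r\ndocs = [\"The stock market rallied after the earnings report.\" ...\r\n    \"A new species of frog was discovered in the rainforest.\" ...\r\n    \"Central banks adjusted interest rates this week.\"];\r\ndocVecs = embed(emb,docs);\r\n% Embed the query and rank documents by cosine similarity\r\nqueryVec = embed(emb,\"financial news\");\r\nscores = cosineSimilarity(queryVec,docVecs);\r\n[~,idx] = sort(scores,\"descend\");\r\ndocs(idx) % most relevant documents first\r\n<\/pre>\r\n<h6><\/h6>\r\n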
Get started with examples from the <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/index.html\">Text Analytics Toolbox documentation<\/a> and comment below to share your workflow.\r\n<h6><\/h6>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2025\/06\/word_document_embeddings.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\nNatural language is the foundation of human communication, but it\u2019s unstructured and full of nuance. Synonyms convey similar ideas, a single word can carry multiple meanings, and language... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2025\/08\/08\/vectorizing-language-with-word-and-document-embeddings\/\">read more >><\/a><\/p>","protected":false},"author":194,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/17427"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/194"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=17427"}],"version-history":[{"count":18,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/17427\/revisions"}],"predecessor-version":[{"id":17487,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/17427\/revisions\/17487"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=17427"}],"wp:term":[{"taxonomy":"category","embeddable":t
rue,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=17427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=17427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}