Deep Learning NLP Portfolio | LSTM, TF-IDF & PPMI

Mathematical Foundations

Core algorithms and formulas implemented from mathematical principles

TF-IDF Vectorization

Term Frequency-Inverse Document Frequency transforms text into numerical vectors for similarity analysis.

Term Frequency (TF) $$\text{TF}(t, d) = \log_{10}(\text{count}(t, d) + 1)$$

Inverse Document Frequency (IDF) $$\text{IDF}(t) = \log_{10}\left(\frac{N}{df_t}\right)$$

where $N$ = total number of documents and $df_t$ = number of documents containing term $t$

TF-IDF Weight $$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Cosine Similarity $$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
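The formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the portfolio's actual vectorizer: lowercasing, whitespace tokenization, and the names `tfidf_vectors` and `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with TF = log10(count + 1), IDF = log10(N / df)."""
    # Assumption: lowercase + whitespace tokenization (hypothetical preprocessing).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {term: math.log10(n_docs / df[term]) for term in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0, so TF = log10(1) = 0
        vectors.append([math.log10(counts[term] + 1) * idf[term] for term in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)
```

With this IDF definition, a term that appears in every document gets a weight of zero, so it contributes nothing to the similarity score.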

Positive Pointwise Mutual Information

PPMI measures word association strength based on co-occurrence probabilities.

Pointwise Mutual Information (PMI) $$\text{PMI}(x, y) = \log_2\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right)$$

Positive PMI $$\text{PPMI}(x, y) = \max(\text{PMI}(x, y), 0)$$

Probability Estimates $$p(x) = \frac{\text{count}(x)}{\text{total\_words}}, \quad p(x, y) = \frac{\text{count}(x, y)}{\text{total\_pairs}}$$
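A minimal sketch of these estimates in plain Python. The pairing scheme is an assumption: the formulas do not say how co-occurring pairs are collected, so this example counts adjacent, ordered word pairs. A different window size, symmetric counting, or smoothing would give different scores. The name `ppmi_scores` is hypothetical.

```python
import math
from collections import Counter


def ppmi_scores(words):
    """Map each observed adjacent pair (x, y) to PPMI(x, y)."""
    total_words = len(words)
    word_counts = Counter(words)

    # Assumption: a pair is two adjacent tokens, taken left to right.
    pairs = list(zip(words, words[1:]))
    total_pairs = len(pairs)
    pair_counts = Counter(pairs)

    scores = {}
    for (x, y), count_xy in pair_counts.items():
        p_xy = count_xy / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[(x, y)] = max(pmi, 0.0)  # clip negative associations to zero
    return scores
```

Only pairs that actually occur are scored. Unseen pairs would have $p(x, y) = 0$ and an undefined logarithm, which is one reason the positive variant is used in practice.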

LSTM Neural Network Architecture

Deep learning model for sequence labeling using Long Short-Term Memory networks.

LSTM Cell Operations

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(Forget Gate)}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(Input Gate)}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad \text{(Candidate)}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(Cell State)}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(Output Gate)}$$
$$h_t = o_t \odot \tanh(C_t) \quad \text{(Hidden State)}$$

where $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input
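The gate equations above can be traced through a single time step in plain Python. This is a sketch for reading alongside the math, not the trained model: the `params` dictionary layout and the helper names `sigmoid`, `affine`, and `lstm_step` are assumptions, and a real implementation would normally use a tensor library instead of lists.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, v, b):
    """Compute W · v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns (h_t, c_t).

    Assumed layout: params holds W_f, W_i, W_C, W_o with shape
    hidden x (hidden + input), and b_f, b_i, b_C, b_o with length hidden.
    """
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]

    f_t = [sigmoid(z) for z in affine(params["W_f"], concat, params["b_f"])]
    i_t = [sigmoid(z) for z in affine(params["W_i"], concat, params["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], concat, params["b_C"])]
    o_t = [sigmoid(z) for z in affine(params["W_o"], concat, params["b_o"])]

    # Cell state: keep part of the old memory, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: gated, squashed view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the step simply halves the previous cell state. That makes a convenient sanity check.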

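Loss Function (Categorical Cross-Entropy) $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$

where $N$ = number of labeled examples (here, tokens), $C$ = number of classes, $y_{ij}$ = one-hot true label, and $\hat{y}_{ij}$ = predicted probability of class $j$ for example $i$. In this formula $N$ and $C$ are unrelated to the document count in IDF and the cell state $C_t$.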
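The loss above can be computed directly from its definition. In this minimal sketch the function name `categorical_cross_entropy` and the `eps` clamp (which avoids `log(0)`) are assumptions added for illustration. Inputs are nested lists with one row per example and one column per class.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over examples of -sum_j y_ij * log(y_hat_ij), natural log."""
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            if y:  # zero targets contribute nothing to the sum
                total += y * math.log(max(p, eps))  # eps clamp is an assumption
    return -total / n
```

For one-hot targets only the predicted probability of the correct class matters, so a confident wrong prediction is penalized far more heavily than an uncertain one.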

TF-IDF Semantic Similarity

Custom vectorizer implementation on CoNLL2003 corpus

Documents Processed

1,000

CoNLL2003 samples

Vocabulary Size

5,847

Unique tokens

Feature Space

1000×5847

Matrix dimensions

Sentence A	Sentence B	Cosine Similarity	Analysis
"I love football"	"I do not love football"	0.7854	High lexical overlap despite semantic negation
"I follow cricket"	"I follow baseball"	0.8944	Semantically similar with shared structure

Technical Insight

While TF-IDF effectively captures lexical similarity, it struggles with semantic nuances like negation. The high similarity (0.7854) between "I love football" and "I do not love football" demonstrates this limitation: the second sentence contains every word of the first, so the vectors point in similar directions despite opposite meanings. Contextualized representations such as BERT embeddings typically capture such distinctions better.
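The overlap effect can be checked by hand with unweighted binary bag-of-words vectors. This is a simplification for illustration, not the IDF-weighted vectors behind the reported score. Over the vocabulary (I, love, football, do, not), the two sentences become $\mathbf{A} = (1, 1, 1, 0, 0)$ and $\mathbf{B} = (1, 1, 1, 1, 1)$, so

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{3}{\sqrt{3}\,\sqrt{5}} \approx 0.775$$

This is close to the reported 0.7854. The small gap presumably comes from corpus-derived term weights, which rescale each dimension without removing the shared words.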

Word Association Analysis

Discovering meaningful collocations through PPMI

Simple Sequence Example

words = ['a', 'b', 'a', 'c']

Top Word Associations:

(a, b): 0.5850
(b, a): 0.5850
(a, c): 0.5850

Natural Text Example

"the cat sat on the mat
 the dog sat on the log"

Strongest Associations:

(cat, sat): 1.2630
(dog, sat): 1.2630
(on, the): 0.8451
(sat, on): 0.8451

Applications in Production

PPMI supports several production NLP tasks. Collocation Detection identifies conventional phrases, such as "strong coffee" rather than "powerful coffee". Feature Engineering supplies word-association features for ML models. Semantic Similarity uses rows of a PPMI matrix as sparse word vectors. Information Extraction surfaces domain-specific terminology patterns in specialized corpora.

Named Entity Recognition

Deep learning model performance on CoNLL2003 benchmark

Accuracy

94.2%

Token classification

Precision

87.5%

Macro-averaged

Recall

85.8%