Deep Learning for NLP

Advanced Natural Language Processing Portfolio

LSTM Networks • TF-IDF Vectorization • PPMI Analysis

Python TensorFlow Keras Word2Vec LSTM CoNLL2003 NumPy

Mathematical Foundations

Core algorithms and formulas implemented from mathematical principles

TF-IDF Vectorization

Term Frequency-Inverse Document Frequency transforms text into numerical vectors for similarity analysis.

Term Frequency (TF) $$\text{TF}(t, d) = \log_{10}(\text{count}(t, d) + 1)$$
Inverse Document Frequency (IDF) $$\text{IDF}(t) = \log_{10}\left(\frac{N}{df_t}\right)$$

where \(N\) is the total number of documents and \(df_t\) is the number of documents containing term \(t\).

TF-IDF Weight $$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
Cosine Similarity $$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
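The four formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's vectorizer: the lowercase whitespace tokenizer and the names `tfidf_vectors` and `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with TF = log10(count + 1) and IDF = log10(N / df)."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log10(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([math.log10(counts[tok] + 1) * idf[tok] for tok in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the cat sat", "the dog sat", "a bird flew"]
vocab, vecs = tfidf_vectors(docs)
sim_cat_dog = cosine_similarity(vecs[0], vecs[1])   # shares "the" and "sat"
sim_cat_bird = cosine_similarity(vecs[0], vecs[2])  # no shared terms
```

A term that appears in every document gets an IDF of zero under this definition, so it contributes nothing to the similarity.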

Positive Pointwise Mutual Information

PPMI measures word association strength based on co-occurrence probabilities.

Pointwise Mutual Information (PMI) $$\text{PMI}(x, y) = \log_2\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right)$$
Positive PMI $$\text{PPMI}(x, y) = \max(\text{PMI}(x, y), 0)$$
Probability Estimates $$p(x) = \frac{\text{count}(x)}{\text{total\_words}}, \quad p(x, y) = \frac{\text{count}(x, y)}{\text{total\_pairs}}$$
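As a rough sketch of these definitions, the function below scores adjacent word pairs. It assumes that co-occurrence means ordered adjacent pairs (bigrams). The window actually used for the results on this page is not stated, so values from this sketch need not match them. The name `ppmi_scores` is hypothetical.

```python
import math
from collections import Counter


def ppmi_scores(words):
    """Map each adjacent ordered pair (x, y) to its PPMI score."""
    word_counts = Counter(words)
    # Assumption: a "pair" is two adjacent words, in order.
    pair_counts = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[(x, y)] = max(pmi, 0.0)  # clip negative associations to zero
    return scores


scores = ppmi_scores("new york is big and new york is busy".split())
clipped = ppmi_scores(list("aaabaa"))  # ("a", "a") has negative PMI here
```

Clipping at zero discards pairs that co-occur less often than chance would predict, which are usually unreliable estimates on small corpora.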

LSTM Neural Network Architecture

Deep learning model for sequence labeling using Long Short-Term Memory networks.

LSTM Cell Operations
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(Forget Gate)}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(Input Gate)}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad \text{(Candidate)}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(Cell State)}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(Output Gate)}$$
$$h_t = o_t \odot \tanh(C_t) \quad \text{(Hidden State)}$$
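To make the gate equations concrete, here is a minimal NumPy sketch of one LSTM time step. The parameter dictionary, the sizes, and the random initialization are assumptions for illustration. A framework such as Keras stores and fuses these weights differently.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """Apply the six cell equations once and return (h_t, C_t)."""
    concat = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])         # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])         # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])     # candidate
    c_t = f_t * c_prev + i_t * c_tilde                            # cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])         # output gate
    h_t = o_t * np.tanh(c_t)                                      # hidden state
    return h_t, c_t


# Hypothetical sizes: 3-dimensional inputs, 4 hidden units.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    params[f"b_{gate}"] = np.zeros(hidden)

# Run a 5-step random sequence through the cell, carrying state forward.
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, params)
```

Because the hidden state is a sigmoid gate times a tanh, every component of `h` stays strictly between -1 and 1.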
Loss Function (Categorical Cross-Entropy) $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$
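The loss can be checked by hand with a short NumPy sketch. It assumes one-hot targets, natural logarithms, and a small clipping constant `eps` to avoid `log(0)`. The function name is hypothetical.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over N samples of -sum_j y_ij * log(y_hat_ij)."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


# Two samples, three classes; rows of y_true are one-hot.
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)
```

With one-hot targets only the predicted probability of the correct class contributes, so here the loss reduces to the average of -log(0.7) and -log(0.8).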

TF-IDF Semantic Similarity

Custom vectorizer implementation on CoNLL2003 corpus

Documents Processed

1,000

CoNLL2003 samples

Vocabulary Size

5,847

Unique tokens

Feature Space

1000×5847

Matrix dimensions

| Sentence A | Sentence B | Cosine Similarity | Analysis |
|---|---|---|---|
| "I love football" | "I do not love football" | 0.7854 | High lexical overlap despite semantic negation |
| "I follow cricket" | "I follow baseball" | 0.8944 | Semantically similar with shared structure |

Technical Insight

TF-IDF captures lexical similarity well but misses semantic nuances such as negation. The high similarity (0.7854) between "I love football" and "I do not love football" demonstrates this limitation: the two sentences share most of their words despite having opposite meanings. Contextualized representations such as BERT embeddings are better suited to capturing this kind of difference.
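The effect is easy to reproduce even without a fitted vectorizer. The sketch below uses plain binary word-presence vectors rather than the corpus-fitted TF-IDF weights behind the 0.7854 figure, so the value differs slightly (about 0.7746), but the cause is the same: three of the five distinct words are shared. The helper name `bow_cosine` is hypothetical.

```python
import math


def bow_cosine(sentence_a, sentence_b):
    """Cosine similarity between binary bag-of-words vectors."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the overlap size
    # and each norm is the square root of the set size.
    return len(a & b) / math.sqrt(len(a) * len(b))


score = bow_cosine("I love football", "I do not love football")
```

Adding "do" and "not" lengthens the second vector without reducing the overlap, so the negation barely moves the score.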

Word Association Analysis

Discovering meaningful collocations through PPMI

Simple Sequence Example

words = ['a', 'b', 'a', 'c']

Top Word Associations:

  • (a, b): 0.5850
  • (b, a): 0.5850
  • (a, c): 0.5850

Natural Text Example

"the cat sat on the mat the dog sat on the log"

Strongest Associations:

  • (cat, sat): 1.2630
  • (dog, sat): 1.2630
  • (on, the): 0.8451
  • (sat, on): 0.8451

Applications in Production

PPMI supports several production NLP tasks:

  • Collocation Detection: identifies idiomatic phrases such as "strong coffee", as opposed to the unidiomatic "powerful coffee"
  • Feature Engineering: creates word association features for ML models
  • Semantic Similarity: builds word embedding spaces
  • Information Extraction: finds domain-specific terminology patterns in specialized corpora

Named Entity Recognition

Deep learning model performance on CoNLL2003 benchmark

Accuracy

94.2%

Token classification

Precision

87.5%

Macro-averaged

Recall

85.8%

Macro-averaged

F1-Score

86.6%

Harmonic mean
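For reference, token-level macro averaging can be sketched as follows. This assumes metrics are computed per tag over individual tokens (not over whole entity spans) and that F1 is the harmonic mean of the macro precision and macro recall, which is consistent with the figures above (2 × 87.5 × 85.8 / (87.5 + 85.8) ≈ 86.6). Averaging per-class F1 scores is another common convention and can give a different number.

```python
def macro_precision_recall_f1(y_true, y_pred, labels):
    """Token-level macro precision and recall, plus their harmonic mean."""
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: a class with no predictions (or no gold tokens) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["O", "B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["O", "B-PER", "O", "O", "B-LOC"]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred, sorted(set(y_true)))
```

Macro averaging weights every tag equally, so a rare tag such as I-PER in this toy example pulls the averages down as much as the frequent O tag lifts them.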

LSTM Model Architecture

Input Layer (Sequence Length: 100)
↓
Embedding Layer (300D Word2Vec, pre-trained Google News 300D)
↓
LSTM Layer 1 (128 units, return_sequences=True, dropout=0.2)
↓
LSTM Layer 2 (64 units, return_sequences=True, dropout=0.2)
↓
LSTM Layer 3 (32 units, return_sequences=True, dropout=0.2)
↓
Dense Layer (64 units, ReLU activation; Dropout: 0.3)
↓
Output Layer (9 units, Softmax activation)
↓
NER Predictions (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC)
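The stack above could be expressed in Keras roughly as follows. This is a sketch under stated assumptions, not the project's code: a random matrix stands in for the pre-trained Word2Vec weights, one extra vocabulary row is reserved for padding, and the listed dropout of 0.2 is assumed to map to the `dropout` argument of each LSTM layer.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 100
VOCAB_SIZE = 5847 + 1   # assumption: index 0 is reserved for padding
EMBED_DIM = 300
NUM_TAGS = 9

# Placeholder: random values stand in for the pre-trained Word2Vec vectors.
embedding_matrix = (
    np.random.default_rng(0).normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")
)


def build_model():
    inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,  # frozen embeddings
    )(inputs)
    # Three stacked LSTMs, each emitting one vector per token.
    for units in (128, 64, 32):
        x = layers.LSTM(units, return_sequences=True, dropout=0.2)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(NUM_TAGS, activation="softmax")(x)  # per-token tag distribution
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


model = build_model()
```

Keeping `return_sequences=True` on every LSTM layer is what makes this a sequence labeler: the Dense and softmax layers are applied at each of the 100 positions, giving an output of shape (batch, 100, 9).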

Training Progress (10 Epochs)

| Dataset Split | Samples | Vocabulary | Max Length | Word2Vec Coverage |
|---|---|---|---|---|
| Training (80%) | 4,000 | 5,847 words | 100 tokens | ~87% |
| Validation (10%) | 500 | | 100 tokens | |
| Testing (20%) | 1,000 | | 100 tokens | |
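A fixed maximum length of 100 tokens implies that shorter sentences are padded and longer ones truncated before training. One way this could look in NumPy is sketched below. The padding id of 0, the choice to label padded positions as O (tag index 0), and the name `pad_and_encode` are all assumptions for illustration.

```python
import numpy as np

MAX_LEN = 100
NUM_TAGS = 9
PAD_ID = 0   # assumption: token id 0 means padding
O_TAG = 0    # assumption: tag index 0 is "O", also used for padded positions


def pad_and_encode(token_ids, tag_ids):
    """Pad/truncate to MAX_LEN and one-hot encode tags for categorical cross-entropy."""
    n = len(token_ids)
    x = np.full((n, MAX_LEN), PAD_ID, dtype=np.int32)
    y = np.zeros((n, MAX_LEN, NUM_TAGS), dtype=np.float32)
    y[:, :, O_TAG] = 1.0  # every position starts as "O"
    one_hot = np.eye(NUM_TAGS, dtype=np.float32)
    for row, (toks, tags) in enumerate(zip(token_ids, tag_ids)):
        toks, tags = toks[:MAX_LEN], tags[:MAX_LEN]  # truncate long sentences
        x[row, : len(toks)] = toks
        y[row, : len(tags)] = one_hot[np.asarray(tags, dtype=int)]
    return x, y


# One 3-token sentence with tag indices 1, 2, 0.
x, y = pad_and_encode([[5, 6, 7]], [[1, 2, 0]])
```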

BIO Tagging Scheme

  • O: Outside (no entity)
  • B-PER: Beginning of Person
  • I-PER: Inside Person
  • B-ORG: Beginning of Organization
  • I-ORG: Inside Organization
  • B-LOC: Beginning of Location
  • I-LOC: Inside Location
  • B-MISC: Beginning of Miscellaneous
  • I-MISC: Inside Miscellaneous
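To illustrate how the scheme is read back, the sketch below groups a tagged token sequence into entity spans. The function name and the example sentence are made up for illustration, and an I- tag that does not continue an open entity of the same type is simply treated as outside, which is one of several possible conventions.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (text, type) entity spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()  # a B- tag always starts a new entity
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" or a stray I- tag closes any open entity
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = extract_entities(tokens, tags)
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain inside/outside scheme cannot do.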

Training Configuration

  • Loss: Categorical Cross-Entropy
  • Optimizer: Adam
  • Epochs: 10
  • Batch Size: 32
  • Embedding: Word2Vec (300D, frozen)
  • Dropout: 0.2 (LSTM), 0.3 (Dense)
  • Split: 80/10/10 (Train/Val/Test)
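As an illustration of the 80/10/10 split, the following sketch shuffles sample indices once and slices them into three disjoint sets. The seed, the sample count of 1,000, and the name `split_indices` are arbitrary choices for the example, not values taken from the project.

```python
import numpy as np


def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Return disjoint (train, val, test) index arrays covering all samples."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = order[:n_train]
    val = order[n_train : n_train + n_val]
    test = order[n_train + n_val :]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

Fixing the seed keeps the test set identical across runs, so metrics from different training configurations stay comparable.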

Future Enhancements

  • Bidirectional LSTM (BiLSTM)
  • CRF output layer for sequence constraints
  • Character-level embeddings
  • Attention mechanisms
  • Transfer learning with BERT
  • Data augmentation techniques
  • Ensemble methods

Explore the Implementation

View the complete source code, interactive Jupyter notebook with visualizations, and comprehensive documentation of the entire project.