Mathematical Foundations
Core algorithms and formulas implemented from mathematical principles
TF-IDF Vectorization
Term Frequency-Inverse Document Frequency transforms text into numerical vectors for similarity analysis.
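Several weighting variants exist. One common formulation, assumed here rather than taken from the project code, multiplies the raw count \(\text{tf}(t, d)\) of term \(t\) in document \(d\) by a log-scaled inverse document frequency:

\[
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{df_t}
\]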
where \(N\) is the total number of documents and \(df_t\) is the number of documents containing term \(t\)
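As a minimal sketch of how such a vectorizer could be written, the block below uses raw term counts and an unsmoothed \(\log(N / df_t)\). The function name `tfidf_vectors` and the toy documents are illustrative assumptions, not the project's code, and libraries such as scikit-learn apply smoothing and normalization that give different numbers.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumes tf = raw count and idf = log(N / df_t), with no smoothing.
    """
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Toy corpus (hypothetical, not CoNLL2003 data)
docs = [s.split() for s in ["the cat sat", "the dog sat", "the dog barked"]]
vocab, vectors = tfidf_vectors(docs)
```

A term that appears in every document ("the" in this toy corpus) gets weight zero, which is why TF-IDF suppresses very common words.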
Positive Pointwise Mutual Information
PPMI measures word association strength based on co-occurrence probabilities.
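A standard definition, given here under the assumption of a base-2 logarithm (the base the project used is not stated), clips negative pointwise mutual information to zero:

\[
\text{PPMI}(w, c) = \max\!\left(0,\ \log_2 \frac{P(w, c)}{P(w)\,P(c)}\right)
\]

where \(P(w, c)\) is the probability of word \(w\) co-occurring with context \(c\), and \(P(w)\), \(P(c)\) are the marginal probabilities.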
LSTM Neural Network Architecture
Deep learning model for sequence labeling using Long Short-Term Memory networks.
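For reference, the standard LSTM cell updates are shown below. This is the textbook formulation; whether the project uses exactly this variant is an assumption. Here \(x_t\) is the input at step \(t\), \(h_t\) the hidden state, \(c_t\) the cell state, \(\sigma\) the logistic sigmoid, and \(\odot\) element-wise multiplication:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]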
TF-IDF Semantic Similarity
Custom vectorizer implementation on CoNLL2003 corpus
- Documents Processed: CoNLL2003 samples
- Vocabulary Size: unique tokens
- Feature Space: matrix dimensions
| Sentence A | Sentence B | Cosine Similarity | Analysis |
|---|---|---|---|
| "I love football" | "I do not love football" | 0.7854 | High lexical overlap despite semantic negation |
| "I follow cricket" | "I follow baseball" | 0.8944 | Semantically similar with shared structure |
Technical Insight
TF-IDF captures lexical similarity well but misses semantic nuances such as negation. The high score (0.7854) between "I love football" and "I do not love football" shows the limitation: every word of the first sentence appears in the second, yet the meanings are opposite. Contextualized representations such as BERT embeddings capture this kind of semantic difference better.
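The negation effect can be illustrated with a small sketch. It computes cosine similarity on raw word counts rather than corpus-fitted TF-IDF weights, so it is a simplification and will not reproduce the scores in the table; the function name `cosine_similarity` is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both sentences contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


affirmative = "i love football".split()
negated = "i do not love football".split()
score = cosine_similarity(affirmative, negated)  # 3 / sqrt(15), about 0.77
```

The two extra tokens "do" and "not" only lengthen the second vector slightly, so the score stays high even though the meaning flips.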
Word Association Analysis
Discovering meaningful collocations through PPMI
Simple Sequence Example
Top Word Associations:
- (a, b): 0.5850
- (b, a): 0.5850
- (a, c): 0.5850
Natural Text Example
Strongest Associations:
- (cat, sat): 1.2630
- (dog, sat): 1.2630
- (on, the): 0.8451
- (sat, on): 0.8451
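A minimal PPMI sketch follows. It assumes a symmetric context window of one token, probabilities estimated from co-occurrence pair counts, and a base-2 logarithm. The project's windowing and counting choices are not stated, so this is not expected to reproduce the scores listed above, and the name `ppmi` is illustrative.

```python
import math
from collections import Counter


def ppmi(tokens, window=1):
    """PPMI for ordered (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_counts, context_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        context_counts[c] += n

    scores = {}
    for (w, c), n in pair_counts.items():
        # n * total / (count_w * count_c) equals P(w, c) / (P(w) * P(c))
        pmi = math.log2(n * total / (word_counts[w] * context_counts[c]))
        scores[(w, c)] = max(0.0, pmi)  # clip negative associations to zero
    return scores


scores = ppmi("a b a c".split())
```

Pairs that never co-occur are simply absent from the returned dictionary, which keeps the representation sparse.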
Applications in Production
PPMI supports several production NLP tasks:
- Collocation Detection: identifies meaningful phrases such as "strong coffee" vs "powerful coffee"
- Feature Engineering: creates word association features for ML models
- Semantic Similarity: builds word embedding spaces
- Information Extraction: finds domain-specific terminology patterns in specialized corpora
Named Entity Recognition
Deep learning model performance on CoNLL2003 benchmark
- Accuracy: token classification
- Precision: macro-averaged
- Recall: macro-averaged
- F1-Score: harmonic mean
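The metrics above could be computed along these lines. This sketch assumes token-level evaluation and takes macro F1 as the mean of per-class F1 scores (each a harmonic mean of that class's precision and recall). The project may compute them differently, and the function name and toy labels are illustrative.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Per-class F1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example (hypothetical tags, not project results)
y_true = ["O", "B-PER", "O", "B-LOC"]
y_pred = ["O", "B-PER", "B-PER", "O"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_scores(y_true, y_pred)
```

Macro averaging weights every tag equally, so rare entity classes affect the score as much as the dominant O tag does.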
[Figure: LSTM Model Architecture]
[Figure: Training Progress (10 Epochs)]
| Dataset Split | Samples | Vocabulary | Max Length | Word2Vec Coverage |
|---|---|---|---|---|
| Training (80%) | 4,000 | 5,847 words | 100 tokens | ~87% |
| Validation (10%) | 500 | — | 100 tokens | — |
| Testing (20%) | 1,000 | — | 100 tokens | — |
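The fixed maximum length and the embedding coverage figure in the table suggest two preprocessing helpers, sketched below. Right-padding, a padding id of 0, and both function names are assumptions for illustration and are not taken from the project.

```python
def pad_sequence(token_ids, max_len=100, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def embedding_coverage(vocabulary, embedding_vocab):
    """Fraction of vocabulary words that have a pretrained vector."""
    if not vocabulary:
        return 0.0
    known = sum(1 for word in vocabulary if word in embedding_vocab)
    return known / len(vocabulary)


short = pad_sequence([5, 8, 2])               # padded up to 100 ids
long = pad_sequence(list(range(1, 151)))      # truncated down to 100 ids
coverage = embedding_coverage(["a", "b", "c", "d"], {"a", "b", "c"})
```

Words without a pretrained vector still need some representation, commonly a zero or randomly initialized vector.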
BIO Tagging Scheme
- O: Outside (no entity)
- B-PER: Beginning of Person
- I-PER: Inside Person
- B-ORG: Beginning of Organization
- I-ORG: Inside Organization
- B-LOC: Beginning of Location
- I-LOC: Inside Location
- B-MISC: Beginning of Miscellaneous
- I-MISC: Inside Miscellaneous
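To show how these tags map back to entities, here is a small decoding sketch. It returns spans with an exclusive end index and ignores a stray I- tag that does not continue an open entity of the same type. Both choices are assumptions, as is the name `bio_to_spans`.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the open entity
        # Any other tag closes the open entity, if there is one
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # "O" and unmatched "I-" tags open nothing
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tags = ["B-PER", "I-PER", "O", "B-ORG", "B-LOC", "I-LOC"]
spans = bio_to_spans(tags)
```

The B- prefix is what lets two adjacent entities of the same type be told apart.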
Training Configuration
- Loss: Categorical Cross-Entropy
- Optimizer: Adam
- Epochs: 10
- Batch Size: 32
- Embedding: Word2Vec (300D, frozen)
- Dropout: 0.2 (LSTM), 0.3 (Dense)
- Split: 80/10/10 (Train/Val/Test)
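The configuration above could be wired into a Keras model roughly as follows. The layer order (Embedding, LSTM, Dropout, per-token softmax), the placement of the 0.3 dropout, the `lstm_units` value, and the names `CONFIG` and `build_model` are all assumptions, since the source does not show the architecture in detail. The tag count of 9 and the length of 100 come from the BIO scheme and dataset table.

```python
CONFIG = {
    "loss": "categorical_crossentropy",
    "optimizer": "adam",
    "epochs": 10,
    "batch_size": 32,
    "embedding_dim": 300,
    "lstm_dropout": 0.2,
    "dense_dropout": 0.3,
    "max_len": 100,
    "num_tags": 9,  # O plus B-/I- tags for PER, ORG, LOC, MISC
}


def build_model(embedding_matrix, lstm_units=64, config=CONFIG):
    """Assumed layout: Embedding -> LSTM -> Dropout -> per-token softmax.

    embedding_matrix is a NumPy array of shape (vocab_size, embedding_dim);
    lstm_units is a hypothetical value, not given in the source.
    """
    # Imported lazily so the configuration can be inspected without TensorFlow
    from tensorflow import keras

    vocab_size, dim = embedding_matrix.shape
    model = keras.Sequential([
        keras.Input(shape=(config["max_len"],)),
        keras.layers.Embedding(
            vocab_size,
            dim,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,  # frozen Word2Vec vectors
        ),
        keras.layers.LSTM(
            lstm_units, return_sequences=True, dropout=config["lstm_dropout"]
        ),
        keras.layers.Dropout(config["dense_dropout"]),
        keras.layers.Dense(config["num_tags"], activation="softmax"),
    ])
    model.compile(
        optimizer=config["optimizer"], loss=config["loss"], metrics=["accuracy"]
    )
    return model
```

Training would then pass `epochs=CONFIG["epochs"]` and `batch_size=CONFIG["batch_size"]` to `model.fit`, with one-hot encoded tag sequences as targets to match the categorical cross-entropy loss.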
Future Enhancements
- Bidirectional LSTM (BiLSTM)
- CRF output layer for sequence constraints
- Character-level embeddings
- Attention mechanisms
- Transfer learning with BERT
- Data augmentation techniques
- Ensemble methods
Explore the Implementation
View the complete source code, interactive Jupyter notebook with visualizations, and comprehensive documentation of the entire project.