NLP Portfolio

Natural Language Processing & Machine Learning Projects

Text Analytics & Spell Correction

Processing health tweets with custom spell correction algorithms

  • Tweets: 6K+ processed
  • Sources: 16 news outlets
  • Edit distance: Levenshtein algorithm
  • Vocab: 50K+ words

Spell Correction Engine

  • Edit Distance: Levenshtein algorithm
  • Candidate Gen: 1-2 edits away
  • Ranking: Frequency-based scoring
  • Accuracy: Real-word error detection
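
The engine described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `levenshtein` and `correct`, the toy frequency table, and the choice to rank by distance first and corpus frequency second are all assumptions.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word: str, freq: Counter, max_dist: int = 2) -> str:
    """Return the closest, then most frequent, vocabulary word within max_dist edits."""
    if word in freq:                        # known word: left unchanged in this sketch
        return word
    candidates = [(levenshtein(word, w), -count, w) for w, count in freq.items()]
    candidates = [c for c in candidates if c[0] <= max_dist]
    return min(candidates)[2] if candidates else word

# Hypothetical unigram counts standing in for the 50K+ word vocabulary.
freq = Counter({"fever": 50, "never": 5, "flu": 30})
print(correct("fevr", freq))
```

Scanning the whole vocabulary is quadratic work per word; generating the 1-2 edit neighbours of the input and intersecting them with the vocabulary is the usual faster alternative.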

Text Processing Pipeline

  • Tokenization: Custom regex patterns
  • Normalization: Case folding, stemming
  • Stopwords: Domain-specific filtering
  • N-grams: Unigram frequency analysis
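
A rough sketch of the pipeline stages listed above, under stated assumptions: the regex, the stopword set, and the function name `unigram_counts` are hypothetical, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter

# Hypothetical token pattern: alphanumeric runs with an optional apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")
# Hypothetical domain-specific stopword list for tweets.
STOPWORDS = {"the", "a", "of", "to", "in", "and", "is", "rt"}

def unigram_counts(texts):
    """Case-fold, strip URLs, tokenize, drop stopwords, and count unigrams."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text.casefold())
        tokens = TOKEN_RE.findall(text)
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

counts = unigram_counts(["Flu shots are in! http://t.co/x", "The FLU season"])
print(counts.most_common(3))
```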

From-Scratch ML Classifiers

Naive Bayes and Logistic Regression without libraries

Accuracy

85%

Sentiment

Models

2

Classifiers

Dataset

4.8K

Financial Phrases

Classes

3

Sentiments

Naive Bayes Classifier

Bayes' Theorem $$P(c|d) = \frac{P(d|c) \cdot P(c)}{P(d)}$$
  • Smoothing: Add-1 Laplace
  • Features: Bag of words
  • Training: Maximum likelihood estimation (MLE)
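
The classifier above can be sketched in plain Python. This is an illustrative version only: the function names and the four toy training phrases are assumptions, and the denominator P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Pick argmax_c of log P(c) + sum over tokens of log P(w|c), with add-1 smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            if t in vocab:                  # words never seen in training are skipped
                score += math.log((word_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical miniature stand-in for the financial phrase dataset.
docs = [(["profit", "rose"], "positive"), (["profit", "up"], "positive"),
        (["loss", "widened"], "negative"), (["sales", "flat"], "neutral")]
model = train_nb(docs)
print(predict_nb(["profit", "up"], model))
```

Working in log space avoids underflow when many small probabilities are multiplied.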

Logistic Regression

Sigmoid Function $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
  • Optimizer: Gradient descent
  • Regularization: L2 penalty
  • Multi-class: One-vs-rest
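
A minimal sketch of one binary classifier trained with batch gradient descent and an L2 penalty, as described above. The hyperparameters (`lr`, `l2`, `epochs`) and the toy data are assumptions; a one-vs-rest setup would train one such model per sentiment class and pick the class with the highest score.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, lr=0.5, l2=0.01, epochs=500):
    """Batch gradient descent on mean log loss plus (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The bias term is not regularized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data: the label equals the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
print(round(predict_proba([1, 0], w, b), 2))
```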

N-gram Models & POS Tagging

Text generation and sequence labeling with HMM

  • Perplexity: 45.2 (bigram model)
  • POS tags: 36 (Penn Treebank)
  • Corpus: Gatsby (training text)
  • Viterbi decoding: O(n) in sentence length for a fixed tag set (O(n · T²) for T tags)

Bigram Language Model

Bigram Probability $$P(w_n|w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$
  • Smoothing: Kneser-Ney
  • Generation: Random sampling
  • Evaluation: Perplexity metric
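
The bigram estimate and the perplexity metric above can be sketched as follows. Note the swap: this example uses add-1 smoothing instead of Kneser-Ney to stay short, so its numbers are not comparable to the reported perplexity. Function names and the toy sentences are assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab

def prob(prev, word, model):
    """Add-1 smoothed P(word | prev); a simpler stand-in for Kneser-Ney."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def perplexity(sentences, model):
    """exp of the negative mean log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, word, model))
            n += 1
    return math.exp(-log_sum / n)

train = [["the", "green", "light"], ["the", "light"]]
model = train_bigram(train)
print(perplexity(train, model))
```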

HMM POS Tagger

Viterbi Algorithm $$v_t(j) = \max_i [v_{t-1}(i) \cdot a_{ij} \cdot b_j(o_t)]$$
  • Training: Supervised MLE
  • Decoding: Viterbi path
  • Fallback: Unknown word handling
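
The Viterbi recurrence above, with a simple unknown-word fallback, might look like this. The three-tag model, its probabilities, and the constant `unk` emission for unseen words are illustrative assumptions; a real tagger would estimate the tables from a tagged corpus and work in log space to avoid underflow.

```python
def viterbi(obs, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for the observations."""
    # v[t][j]: probability of the best path ending in state j at time t.
    v = [{s: start_p[s] * emit_p[s].get(obs[0], unk) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = v[t - 1][best_i] * trans_p[best_i][j] * emit_p[j].get(obs[t], unk)
            back[t][j] = best_i
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy HMM.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
           "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
           "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2}}
emit_p = {"DT": {"the": 0.9}, "NN": {"cat": 0.5, "mat": 0.4}, "VB": {"sat": 0.8}}
print(viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p))
```

Each time step compares every pair of states, so decoding a sentence of n words with T tags costs on the order of n · T² operations.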

Named Entity Recognition with LSTM

Deep learning for sequence labeling with Word2Vec embeddings

Accuracy

94.2%

Token-level

F1-Score

86.6%

Macro-avg

Embeddings

300D

Word2Vec

Entities

4

PER/ORG/LOC/MISC

TF-IDF & PPMI

TF-IDF $$\text{TF-IDF} = \log(1+\text{tf}) \times \log\frac{N}{\text{df}}$$
  • Documents: 1,000 samples
  • Similarity: Cosine similarity
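
A small sketch of the TF-IDF weighting above, paired with cosine similarity between sparse vectors. The natural logarithm is an assumption (the formula does not fix a base), as are the function names and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Weight = log(1 + tf) * log(N / df)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["stock", "price", "rose"], ["stock", "price", "fell"], ["weather", "sunny"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A term that appears in every document gets weight log(N/N) = 0, which is how the IDF factor suppresses uninformative words.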

LSTM Architecture

Embedding(300D) -> LSTM(128) -> LSTM(64) -> LSTM(32) -> Dense(64) -> Softmax(9)
  • Dataset: CoNLL2003
  • Tags: BIO scheme
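
The 9-way softmax follows from the BIO scheme: a B- and an I- tag for each of the 4 entity types, plus O. The sketch below makes that count explicit and converts entity spans to tags; the helper names and the `(start, end, type)` span format with an exclusive end are assumptions.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]

def bio_tags(types):
    """O plus a B-/I- pair per entity type."""
    return ["O"] + [f"{prefix}-{t}" for t in types for prefix in ("B", "I")]

def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

print(len(bio_tags(ENTITY_TYPES)))
print(spans_to_bio(["Barack", "Obama", "visited", "Paris"],
                   [(0, 2, "PER"), (3, 4, "LOC")]))
```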

Constituency & Dependency Parsing

CKY algorithm and Stanford CoreNLP integration

  • Grammar: 13.5K CNF rules
  • Algorithm: CKY (dynamic programming)
  • Parses: 10+ ambiguous
  • Output: CoNLL format

Constituency Parse Tree: "Cat sat on the mat"

(S
  (VP
    (NP (N Cat))
    (V sat)
    (PP (P on)
        (NP (DET the) (N mat)))))

Production Rules

S -> VP
VP -> NP V PP
NP -> N | DET N
PP -> P NP

CKY Algorithm

Time Complexity $$O(n^3 \cdot |G|)$$
  • Grammar: ATIS CFG (5,517 rules)
  • CNF: Chomsky Normal Form
  • Back-pointers: Tree reconstruction
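
A minimal CKY recognizer for a grammar in Chomsky Normal Form is sketched below. It only answers whether a sentence is derivable; storing back-pointers in each cell would allow tree reconstruction. The tiny grammar is a hypothetical stand-in for the ATIS rules, and the function name is an assumption.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """lexical: word -> set of preterminals; binary: (B, C) -> set of parents A."""
    n = len(words)
    table = defaultdict(set)               # (i, j) -> nonterminals spanning words[i:j]
    for i, w in enumerate(words):
        table[(i, i + 1)] = set(lexical.get(w, ()))
    for length in range(2, n + 1):         # span length
        for i in range(n - length + 1):    # span start
            j = i + length
            for k in range(i + 1, j):      # split point
                for b in table[(i, k)]:
                    for c in table[(k, j)]:
                        table[(i, j)] |= binary.get((b, c), set())
    return start in table[(0, n)]

# Hypothetical toy CNF grammar.
lexical = {"the": {"DET"}, "cat": {"N"}, "mat": {"N"}, "sat": {"V"}, "on": {"P"}}
binary = {("DET", "N"): {"NP"}, ("P", "NP"): {"PP"},
          ("V", "PP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky_recognize("the cat sat on the mat".split(), lexical, binary))
```

The three nested loops over span length, start, and split point give the n³ factor; the lookups over rule pairs account for the grammar-size factor.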

Dependency Parsing

Word  POS  Head  Relation
The   DT   2     det
cat   NN   3     nsubj
sat   VBD  0     ROOT
  • Server: Stanford CoreNLP
  • Format: CoNLL output

Word Sense Disambiguation & Semantic Role Labeling

Lesk algorithm, BiLSTM networks, and neural SRL models

  • WSD F-Score: 0.59 (BiLSTM)
  • SRL F-Score: 0.83 (LSTM model)
  • Models: 3 WSD approaches
  • Dataset: SemCor + OntoNotes

Word Sense Disambiguation

Simplified Lesk Algorithm $$\text{Overlap}(C, D) = |C \cap D|$$
  • MFS Baseline: F-Score 0.54
  • Lesk Algorithm: F-Score 0.48
  • BiLSTM: F-Score 0.59 (best)
  • Corpus: SemCor (50 test sentences)
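
The simplified Lesk overlap above can be illustrated with a toy sense inventory. The two glosses for "bank" are invented for the example (a real system would read them from WordNet), and breaking ties in favour of the first listed sense is an assumption.

```python
def simplified_lesk(context, sense_glosses, stopwords=frozenset()):
    """Return the sense whose gloss shares the most words with the context."""
    ctx = {w.lower() for w in context} - stopwords
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(ctx & gloss_words)    # |C intersect D|
        if overlap > best_overlap:          # strict: the first sense wins ties
            best, best_overlap = sense, overlap
    return best

# Hypothetical glosses, listed with the assumed most frequent sense first.
BANK = {
    "bank_1": "financial institution that accepts deposits and lends money",
    "bank_2": "sloping land beside a river or lake",
}
print(simplified_lesk("he sat on the bank of the river".split(), BANK))
```

When no gloss overlaps the context, the sketch falls back to the first sense, which mirrors the most-frequent-sense baseline.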

BiLSTM Architecture

Embedding(100D) -> BiLSTM(128) -> BiLSTM(128) -> Dense -> Softmax(senses)
  • Learning Rate: 0.001
  • Optimizer: Adam
  • Epochs: 10

Semantic Role Labeling

Word Embed(100D) + Predicate(10D) -> LSTM(128) -> LSTM(64) -> TimeDistributed Dense -> Softmax(SRL labels)
  • Dataset: OntoNotes v5
  • Precision: 0.85
  • Recall: 0.82
  • F-Score: 0.83
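
As a quick consistency check, the reported F-score matches the harmonic mean of the precision and recall listed above, assuming it is the balanced F1. The helper name is hypothetical.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_score(0.85, 0.82), 2))
```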

NLP Toolkit: Chatbot, Slot Filling & Translation

Corpus-based chatbot, LSTM slot filling, and German-English neural translation

  • Chatbot: TF-IDF cosine similarity
  • Slot F1: 0.95 (BiLSTM)
  • BLEU: 0.18 (translation)
  • Systems: 3 complete

Corpus-Based Chatbot

TF-IDF Similarity $$\text{sim}(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{|\vec{d_1}| |\vec{d_2}|}$$
  • Corpus: NPS Chat (~10K messages)
  • Retrieval: TF-IDF + Cosine similarity
  • Filtering: Removes questions, short responses
  • Evaluation: Engagingness 3/5, Fluency 4.5/5

LSTM Slot Filling

Embedding(100D) -> BiLSTM(128) -> BiLSTM(64) -> Dense(128) -> TimeDistributed -> Softmax
  • Dataset: ATIS (~4.4K train, 900 test)
  • Slots: 127 unique labels
  • Precision: 0.95
  • F1-Score: 0.95

Neural Machine Translation

Encoder: Embedding -> LSTM
Decoder: LSTM -> Attention -> Dense -> Softmax
  • Dataset: WMT14 (de-en)
  • Architecture: Seq2Seq + Attention
  • Vocab: 10K German, 10K English
  • BLEU Score: 0.18
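
The attention step in the decoder can be sketched without a deep learning framework. The page does not say which attention variant was used, so dot-product scoring is an assumption here, as are the function names and the toy vectors.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score each encoder state by dot product, then return weights and context."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# Toy encoder outputs for a three-token source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.5], encoder_states)
print([round(w, 3) for w in weights])
```

The context vector is a weighted average of encoder states, so the decoder can focus on different source tokens at each output step.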

Text Summarization: Abstractive & Extractive

Encoder-decoder models, T5 transformers, and PageRank-based extractive summarization

  • T5 ROUGE-1: 0.40 (pre-trained)
  • PageRank: 0.35 (extractive)
  • Dataset: 300K articles
  • Methods: 3 approaches

Abstractive: Encoder-Decoder

Encoder: Embedding -> LSTM
Decoder: LSTM -> Beam Search -> Dense -> Softmax
  • Dataset: CNN/DailyMail (~300K articles)
  • Architecture: Custom LSTM seq2seq
  • Generation: Beam search (width=3)
  • ROUGE-1: ~0.25
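
Beam search with width 3 can be sketched independently of the LSTM. In this illustration `step_fn` is a hypothetical callback returning next-token probabilities for a partial sequence (in the real model it would be the decoder), and the lookup table is invented to show a case where greedy decoding misses the best sequence. No length normalization is applied.

```python
import math

def beam_search(step_fn, start, end, width=3, max_len=20):
    """Keep the `width` best partial sequences by summed log-probability."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, p in step_fn(seq).items():
                candidates.append((score + math.log(p), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:width]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)                  # hypotheses cut off at max_len
    if not finished:
        return [start]
    return max(finished, key=lambda c: c[0])[1]

# Hypothetical next-token distributions keyed on the last token.
TABLE = {"<s>": {"a": 0.6, "b": 0.4},
         "a": {"x": 0.55, "y": 0.45},
         "b": {"z": 0.9, "w": 0.1},
         "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
         "z": {"</s>": 1.0}, "w": {"</s>": 1.0}}

def step_fn(seq):
    return TABLE.get(seq[-1], {})

print(beam_search(step_fn, "<s>", "</s>", width=3))
```

With `width=1` the search is greedy and commits to "a" first, ending with probability 0.33; with `width=3` it keeps "b" alive and finds the 0.36 sequence.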

Abstractive: T5 Transformer

T5-small (60M params), pre-trained on the C4 corpus, zero-shot summarization
  • Model: T5-small (Hugging Face)
  • Training: Pre-trained, no fine-tuning
  • ROUGE-1: 0.40
  • Performance: Highest ROUGE-1 of the three approaches, with fluent output
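
For reference, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below computes precision, recall, and F1 with whitespace tokenization; the page does not state which of the three variants is reported, and the function name is an assumption.

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """Return (precision, recall, f1) over clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())    # each word counted at most min(cand, ref) times
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

print(rouge1("the cat sat", "the cat sat on the mat"))
```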

Extractive: PageRank

Sentence Similarity $$\text{sim}(S_i, S_j) = \frac{\vec{S_i} \cdot \vec{S_j}}{|\vec{S_i}| |\vec{S_j}|}$$
  • Embeddings: GloVe (Wikipedia + Gigaword)
  • Similarity: Cosine similarity matrix
  • Ranking: PageRank algorithm (NetworkX)
  • ROUGE-1: ~0.35
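
The ranking step can be sketched as power iteration over the sentence similarity matrix. This is a dependency-free stand-in for the NetworkX call used in the project, not its actual code; the damping factor of 0.85, the iteration count, and the toy matrix are assumptions.

```python
def pagerank(sim, damping=0.85, iters=50):
    """Weighted PageRank over a similarity matrix via power iteration."""
    n = len(sim)
    rows = []
    for i in range(n):
        # Ignore self-similarity, then normalize each row into transition weights.
        row = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n
                + damping * sum(rank[i] * rows[i][j] for i in range(n))
                for j in range(n)]
    return rank

# Toy cosine similarities between three sentences.
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.7],
       [0.1, 0.7, 1.0]]
scores = pagerank(sim)
print(max(range(len(scores)), key=scores.__getitem__))
```

The summary is then built from the top-ranked sentences, usually restored to their original order in the article.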