NLP Portfolio | Natural Language Processing Projects

Text Analytics & Spell Correction

Processing health tweets with custom spell correction algorithms

Tweets: 6K+ processed
Sources: News outlets
Edit Dist: Levenshtein algorithm
Vocab: 50K+ words

Spell Correction Engine

Edit Distance: Levenshtein algorithm
Candidate Gen: 1-2 edits away
Ranking: Frequency-based scoring
Accuracy: Real-word error detection
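The engine above can be sketched in a few lines. This is a minimal illustration, not the project's code: the names `levenshtein` and `correct`, the toy vocabulary, and the tie-breaking rule (smallest edit distance first, then highest frequency) are all assumptions.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word: str, vocab: Counter, max_dist: int = 2) -> str:
    """Return the best vocabulary word within max_dist edits, else the input."""
    if word in vocab:
        return word
    scored = [(levenshtein(word, w), -freq, w) for w, freq in vocab.items()]
    scored = [s for s in scored if s[0] <= max_dist]
    # Assumed ranking: closest candidate first, ties broken by frequency.
    return min(scored)[2] if scored else word

# Toy vocabulary with made-up frequencies (hypothetical data).
vocab = Counter({"fever": 40, "never": 25, "flu": 30, "health": 50})
print(correct("fevr", vocab))  # fever
```

Scanning the whole vocabulary is the simplest way to find candidates 1-2 edits away; generating the edits of the input word directly would scale better to a 50K-word vocabulary.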

Text Processing Pipeline

Tokenization: Custom regex patterns
Normalization: Case folding, stemming
Stopwords: Domain-specific filtering
N-grams: Unigram frequency analysis
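A rough sketch of such a pipeline follows. The regex, the stopword set, and the function name `unigram_counts` are illustrative assumptions rather than the project's actual patterns, and stemming is left out to keep the example short.

```python
import re
from collections import Counter

# Assumed token pattern: URLs, @mentions/#hashtags, then plain words.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|[a-z]+(?:'[a-z]+)?")
# Hypothetical stopword list with a tweet-specific entry ("rt").
STOPWORDS = {"the", "a", "of", "to", "is", "rt"}

def unigram_counts(tweets):
    """Case-fold, tokenize, drop URLs/mentions/stopwords, count unigrams."""
    counts = Counter()
    for tweet in tweets:
        for tok in TOKEN_RE.findall(tweet.lower()):
            if tok.startswith(("http://", "https://", "@")) or tok in STOPWORDS:
                continue
            counts[tok] += 1
    return counts

counts = unigram_counts([
    "RT @who: The flu season is here http://t.co/x",
    "Flu shots reduce the risk of flu",
])
print(counts.most_common(1))  # [('flu', 3)]
```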

View Project

From-Scratch ML Classifiers

Naive Bayes and Logistic Regression implemented without ML libraries

Accuracy: 85% (sentiment)
Models: Classifiers
Dataset: 4.8K financial phrases
Classes: Sentiments

Naive Bayes Classifier

Bayes' Theorem $$P(c|d) = \frac{P(d|c) \cdot P(c)}{P(d)}$$

Smoothing: Add-1 Laplace
Features: Bag of words
Training: Maximum likelihood estimation (MLE)
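As an illustration of the pieces listed above, here is a small bag-of-words Naive Bayes with add-1 smoothing, working in log space. The function names and the four training phrases are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns log priors, word counts, vocab."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    return log_prior, word_counts, vocab

def predict_nb(tokens, log_prior, word_counts, vocab):
    """argmax over classes of log P(c) + sum of log P(w|c), add-1 smoothed."""
    best, best_score = None, -math.inf
    for c in log_prior:
        total = sum(word_counts[c].values())
        score = log_prior[c]
        for w in tokens:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy phrases, not taken from the project's dataset.
docs = [
    ("profit rose sharply".split(), "positive"),
    ("profit fell".split(), "negative"),
    ("loss widened".split(), "negative"),
    ("sales rose".split(), "positive"),
]
model = train_nb(docs)
print(predict_nb("profit rose".split(), *model))  # positive
```

The denominator P(d) from Bayes' theorem is dropped because it is the same for every class and does not change the argmax.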

Logistic Regression

Sigmoid Function $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Optimizer: Gradient descent
Regularization: L2 penalty
Multi-class: One-vs-rest
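The three ingredients above fit together roughly as follows. This is a library-free sketch under assumed hyperparameters (`lr`, `l2`, `epochs`) and a toy one-hot dataset; none of these values come from the project.

```python
import math

def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_binary(X, y, lr=0.5, l2=0.01, epochs=500):
    """Batch gradient descent on log loss with an L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n  # the bias is not regularized
    return w, b

def train_ovr(X, labels):
    """One binary model per class: that class versus all others."""
    return {c: train_binary(X, [1.0 if l == c else 0.0 for l in labels])
            for c in set(labels)}

def predict_ovr(models, x):
    """Pick the class whose binary model is most confident."""
    def score(c):
        w, b = models[c]
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(models, key=score)

# Toy features (hypothetical): one indicator per sentiment cue.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
labels = ["neg", "neu", "pos", "neg", "neu", "pos"]
models = train_ovr(X, labels)
print(predict_ovr(models, [0, 0, 1]))  # pos
```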

View Project

N-gram Models & POS Tagging

Bigram text generation and HMM-based sequence labeling

Perplexity: 45.2 (bigram model)
POS Tags: Penn Treebank
Corpus: Gatsby (training text)
Viterbi: O(n·T²) decoding (n tokens, T tags)

Bigram Language Model

Bigram Probability $$P(w_n|w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$

Smoothing: Kneser-Ney
Generation: Random sampling
Evaluation: Perplexity metric
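The sketch below shows counting, sampling, and perplexity for a bigram model. To stay short it swaps in add-k smoothing instead of Kneser-Ney, so it is a simplified stand-in rather than the project's model; the two training sentences and all function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count C(w_{n-1}) and C(w_{n-1}, w_n) with sentence boundary markers."""
    context, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigrams[prev][word] += 1
    return context, bigrams

def prob(word, prev, context, bigrams, vocab_size, k=1.0):
    """Add-k estimate; k=0 gives the MLE C(prev, word) / C(prev) for seen prev."""
    return (bigrams[prev][word] + k) / (context[prev] + k * vocab_size)

def perplexity(sentences, context, bigrams, vocab_size):
    """exp of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(word, prev, context, bigrams, vocab_size))
            n += 1
    return math.exp(-log_sum / n)

def generate(bigrams, rng, max_len=20):
    """Sample each next word in proportion to its bigram count."""
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = bigrams[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return out

sentences = [["the", "green", "light"], ["the", "light", "faded"]]
context, bigrams = train_bigrams(sentences)
vocab_size = len({w for s in sentences for w in s} | {"</s>"})
print(generate(bigrams, random.Random(0)))
print(perplexity(sentences, context, bigrams, vocab_size))
```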

HMM POS Tagger

Viterbi Algorithm $$v_t(j) = \max_i [v_{t-1}(i) \cdot a_{ij} \cdot b_j(o_t)]$$

Training: Supervised MLE
Decoding: Viterbi path
Fallback: Unknown word handling
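The recurrence above translates almost directly into code. The following is a minimal sketch with a hand-written three-tag HMM; the probabilities are invented for illustration, and the constant `unk` emission probability is only one assumed way to handle unknown words.

```python
def viterbi(obs, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely tag sequence for obs under a first-order HMM."""
    # v[t][s]: probability of the best path that ends in state s at time t
    v = [{s: start_p[s] * emit_p[s].get(obs[0], unk) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = (v[t - 1][best_i] * trans_p[best_i][j]
                       * emit_p[j].get(obs[t], unk))
            back[t][j] = best_i
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy parameters, not estimated from any corpus.
states = ["DT", "NN", "VBD"]
start_p = {"DT": 0.8, "NN": 0.15, "VBD": 0.05}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBD": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBD": 0.6},
    "VBD": {"DT": 0.6, "NN": 0.3, "VBD": 0.1},
}
emit_p = {"DT": {"the": 0.9}, "NN": {"cat": 0.5, "mat": 0.4}, "VBD": {"sat": 0.8}}
print(viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VBD']
```

For long sentences the products underflow, so a practical version would sum log-probabilities instead of multiplying probabilities.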

View Project

Named Entity Recognition with LSTM

Deep learning for sequence labeling with Word2Vec embeddings

Accuracy: 94.2% (token-level)
F1-Score: 86.6% (macro-avg)
Embeddings: 300D Word2Vec
Entities: PER/ORG/LOC/MISC

TF-IDF & PPMI

TF-IDF $$\text{TF-IDF} = \log(1+\text{tf}) \times \log\frac{N}{df}$$

Documents: 1,000 samples
Similarity: Cosine similarity
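A small sketch of the weighting and comparison described above. It assumes natural logarithms (the formula does not fix a base) and sparse dict vectors; the three toy documents and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by log(1 + tf) * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: math.log(1 + c) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["stocks", "rose", "today"], ["stocks", "fell", "today"], ["rain", "fell"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A term that appears in every document gets weight log(N/N) = 0, which is how the IDF factor suppresses uninformative words.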

LSTM Architecture

Embedding(300D) -> LSTM(128) -> LSTM(64) -> LSTM(32) -> Dense(64) -> Softmax(9)

Dataset: CoNLL2003
Tags: BIO scheme
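The 9-way softmax is consistent with the BIO scheme over four entity types: a B- and an I- tag per type plus O gives 4 × 2 + 1 = 9 labels. The sketch below builds that label set and turns a predicted tag sequence back into entity spans; `bio_to_spans` and the example sentence are illustrative, not part of the project.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]

def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((etype, " ".join(tokens[start:i])))
            start, etype = None, None
        # A stray I- tag with no open span is treated as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(len(LABELS))                 # 9
print(bio_to_spans(tokens, tags))  # [('PER', 'Angela Merkel'), ('LOC', 'Paris')]
```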

View Source | Full Project Page

Constituency & Dependency Parsing

CKY algorithm and Stanford CoreNLP integration

Grammar: 13.5K CNF rules
Algorithm: CKY (dynamic programming)
Parses: 10+ (ambiguous)
Output: CoNLL format

Constituency Parse Tree: "Cat sat on the mat"

[Parse tree figure; surviving node labels: Cat, sat, DET, the, mat]

Production Rules

S -> VP; VP -> NP V PP; NP -> N; PP -> P NP; NP -> DET N

CKY Algorithm

Time Complexity $$O(n^3 \cdot |G|)$$

Grammar: ATIS CFG (5,517 rules)
CNF: Chomsky Normal Form
Back-pointers: Tree reconstruction
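A compact sketch of CKY with back-pointers over a CNF grammar. The tiny grammar here is a hypothetical stand-in for the ATIS grammar, and it keeps only the first derivation found for each cell, so it does not enumerate ambiguous parses.

```python
def cky_parse(words, lexical, binary, start="S"):
    """Return one parse tree as nested tuples, or None if no parse exists."""
    n = len(words)
    # table[i][j] maps a nonterminal to its back-pointer for span words[i:j]
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a in lexical.get(w, ()):
            table[i][i + 1][a] = w
    for length in range(2, n + 1):          # span length
        for i in range(n - length + 1):     # span start
            j = i + length
            for k in range(i + 1, j):       # split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        for a in binary.get((b, c), ()):
                            table[i][j].setdefault(a, (k, b, c))

    def build(a, i, j):
        bp = table[i][j][a]
        if isinstance(bp, str):             # lexical rule A -> word
            return (a, bp)
        k, b, c = bp
        return (a, build(b, i, k), build(c, k, j))

    return build(start, 0, n) if start in table[0][n] else None

# Hypothetical CNF grammar: every rule is A -> B C or A -> word.
lexical = {"the": ["DET"], "cat": ["N"], "mat": ["N"], "sat": ["V"], "on": ["P"]}
binary = {("NP", "VP"): ["S"], ("DET", "N"): ["NP"],
          ("V", "PP"): ["VP"], ("P", "NP"): ["PP"]}
print(cky_parse("the cat sat on the mat".split(), lexical, binary))
```

The three nested loops over span length, start, and split point give the n³ factor, and the rule lookups inside them account for the |G| factor.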

Dependency Parsing

ID  Word    POS   Head  Relation
1   The     DT    2     det
2   cat     NN    3     nsubj
3   sat     VBD   0     ROOT

Server: Stanford CoreNLP
Format: CoNLL output

View Source | Project Folder

Word Sense Disambiguation & Semantic Role Labeling

Lesk algorithm, BiLSTM networks, and neural SRL models

WSD F-Score: 0.59 (BiLSTM)
SRL F-Score: 0.83 (LSTM model)
Models: WSD approaches
Dataset: SemCor + OntoNotes

Word Sense Disambiguation

Simplified Lesk Algorithm $$\text{Overlap}(C, D) = |C \cap D|$$

MFS Baseline: F-Score 0.54
Lesk Algorithm: F-Score 0.48
BiLSTM: F-Score 0.59 (best)
Corpus: SemCor (50 test sentences)
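The overlap score can be illustrated with a few lines of code. This sketch uses two hand-written glosses rather than a real sense inventory, and it falls back to the first listed sense when nothing overlaps, mirroring a most-frequent-sense default; the stopword list and names are assumptions.

```python
STOP = {"a", "an", "the", "of", "in", "on", "to", "that", "and", "or", "at"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOP}

def simplified_lesk(context, senses):
    """senses: dict sense_id -> gloss; the first key is the fallback sense."""
    ctx = content_words(context)
    best, best_overlap = next(iter(senses)), 0
    for sense, gloss in senses.items():
        overlap = len(ctx & content_words(gloss))  # |C ∩ D|
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Hypothetical glosses written for this example.
senses = {
    "bank.finance": "a financial institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a river or body of water",
}
print(simplified_lesk("she sat on the bank of the river and watched the water", senses))
# bank.river
```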

BiLSTM Architecture

Embedding(100D) -> BiLSTM(128) -> BiLSTM(128) -> Dense -> Softmax(senses)

Learning Rate: 0.001
Optimizer: Adam
Epochs: 10

Semantic Role Labeling

Word Embed(100D) + Predicate(10D) -> LSTM(128) -> LSTM(64) -> TimeDistributed Dense -> Softmax(SRL labels)

Dataset: OntoNotes v5
Precision: 0.85
Recall: 0.82
F-Score: 0.83
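As a quick consistency check, the reported F-Score matches the harmonic mean of the reported precision and recall. The helper name `f_score` is chosen for this example.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 is the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# 2 * 0.85 * 0.82 / (0.85 + 0.82) = 1.394 / 1.67
print(round(f_score(0.85, 0.82), 2))  # 0.83
```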

View Source | Project Folder

NLP Toolkit: Chatbot, Slot Filling & Translation

Corpus-based chatbot, LSTM slot filling, and German-English neural translation

Chatbot: TF-IDF cosine similarity
Slot F1: 0.95 (BiLSTM)
BLEU: 0.18 (translation)
Systems: Complete

Corpus-Based Chatbot

TF-IDF Similarity $$\text{sim}(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{|\vec{d_1}| |\vec{d_2}|}$$

Corpus: NPS Chat (~10K messages)
Retrieval: TF-IDF + Cosine similarity
Filtering: Removes questions, short responses
Evaluation: Engagingness 3/5, Fluency 4.5/5

LSTM Slot Filling

Embedding(100D) -> BiLSTM(128) -> BiLSTM(64) -> Dense(128) -> TimeDistributed -> Softmax

Dataset: ATIS (~4.4K train, 900 test)
Slots: 127 unique labels
Precision: 0.95
F1-Score: 0.95

Neural Machine Translation

Encoder: Embedding -> LSTM
Decoder: LSTM -> Attention -> Dense -> Softmax

Dataset: WMT14 (de-en)
Architecture: Seq2Seq + Attention
Vocab: 10K German, 10K English
BLEU Score: 0.18

View Source | Documentation

Text Summarization: Abstractive & Extractive

Encoder-decoder models, T5 transformers, and PageRank-based extractive summarization

T5 ROUGE-1: 0.40 (pre-trained)
PageRank: 0.35 (extractive)
Dataset: 300K articles
Methods: Approaches

Abstractive: Encoder-Decoder

Encoder: Embedding -> LSTM
Decoder: LSTM -> Beam Search -> Dense -> Softmax

Dataset: CNN/DailyMail (~300K articles)
Architecture: Custom LSTM seq2seq
Generation: Beam search (width=3)
ROUGE-1: ~0.25

Abstractive: T5 Transformer

T5-small (60M params)
Pre-trained on C4 corpus
Zero-shot summarization

Model: T5-small (Hugging Face)
Training: Pre-trained, no fine-tuning
ROUGE-1: 0.40
Performance: Fluent output; highest ROUGE-1 of the three approaches

Extractive: PageRank

Sentence Similarity $$\text{sim}(S_i, S_j) = \frac{\vec{S_i} \cdot \vec{S_j}}{|\vec{S_i}| |\vec{S_j}|}$$

Embeddings: GloVe (Wikipedia + Gigaword)
Similarity: Cosine similarity matrix
Ranking: PageRank algorithm (NetworkX)
ROUGE-1: ~0.35
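The ranking step can be sketched without NetworkX as a plain power iteration over the similarity matrix. This is an illustrative stand-in for the library call: the damping factor of 0.85, the stopping tolerance, and the 3×3 similarity matrix are assumptions.

```python
def pagerank(sim, damping=0.85, iters=100, tol=1e-9):
    """Power-iteration PageRank over a weighted sentence-similarity matrix."""
    n = len(sim)
    # Drop self-similarity so a sentence cannot vote for itself.
    w = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per sentence
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for j in range(n):
            incoming = sum(rank[i] * w[i][j] / out[i]
                           for i in range(n) if out[i] > 0)
            new.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank

# Hypothetical cosine similarities between three sentences.
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
scores = pagerank(sim)
summary_order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(summary_order[0])  # 1
```

An extractive summary then takes the top-ranked sentences and presents them in their original document order.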

View Source | Documentation