Commit a79996f

added Pos tagging using HMM
1 parent ccd26fb commit a79996f

2 files changed

+176
-0
lines changed

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<strong><h3>Design of PoS tagger using HMM.</h3></strong>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "from collections import defaultdict\n",
    "import nltk\n",
    "import numpy as np\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "8waH4sMDWgrD",
    "outputId": "7207175d-6e0b-46bc-86ff-c1266482cafe"
   },
   "outputs": [],
   "source": [
    "# Class for PoS tagging\n",
    "class PosTagging:\n",
    "    def __init__(self, train_sent):\n",
    "        self.transition = defaultdict(int)  # (previous tag, tag) -> count\n",
    "        self.emission = defaultdict(int)    # (tag, word) -> count\n",
    "        self.tag_set = set()\n",
    "        self.word_set = set()\n",
    "\n",
    "        self.train(train_sent)\n",
    "\n",
    "    def train(self, train_sent):\n",
    "        for sent in train_sent:\n",
    "            prev_tag = None  # None marks the start of a sentence\n",
    "            for word, tag in sent:\n",
    "                self.transition[(prev_tag, tag)] += 1\n",
    "                self.emission[(tag, word)] += 1\n",
    "                self.tag_set.add(tag)\n",
    "                self.word_set.add(word)\n",
    "                prev_tag = tag\n",
    "\n",
    "    def tag(self, sentence):\n",
    "        tagged_sentence = []\n",
    "        prev_tag = None  # start of sentence; also the state after an untaggable word\n",
    "        for word in sentence:\n",
    "            max_prob = 0\n",
    "            best_tag = None\n",
    "            # Number of transitions out of the previous tag (0 if it only ever ended a sentence)\n",
    "            prev_total = sum(v for k, v in self.transition.items() if k[0] == prev_tag)\n",
    "            for tag in self.tag_set:\n",
    "                # Number of words emitted by this tag (at least 1 for every trained tag)\n",
    "                tag_total = sum(v for k, v in self.emission.items() if k[0] == tag)\n",
    "                transition_prob = self.transition.get((prev_tag, tag), 0) / prev_total if prev_total else 0.0\n",
    "                emission_prob = self.emission.get((tag, word), 0) / tag_total\n",
    "                prob = transition_prob * emission_prob\n",
    "                if prob > max_prob:\n",
    "                    max_prob = prob\n",
    "                    best_tag = tag\n",
    "            tagged_sentence.append((word, best_tag))\n",
    "            prev_tag = best_tag\n",
    "        return tagged_sentence\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('I', 'PRP'), ('love', 'VBP'), ('nautue', None)]\n"
     ]
    }
   ],
   "source": [
    "# Example to understand how this works\n",
    "train_sent = [[('I', 'PRP'), ('love', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN')]]\n",
    "# 'nautue' does not occur in the training data, so it is tagged None\n",
    "test_sents = \"I love nautue\".split()\n",
    "\n",
    "hmm_tagger = PosTagging(train_sent)\n",
    "tags = hmm_tagger.tag(test_sents)\n",
    "print(tags)\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}

Pos Tagging NLP/readme.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Part-of-Speech Tagging Using a Hidden Markov Model (HMM)

This project provides a Python class, `PosTagging`, that performs Part-of-Speech (POS) tagging with a Hidden Markov Model (HMM). The class trains on a set of tagged sentences and uses the learned counts to tag new sentences.

## Table of Contents
- [Introduction](#introduction)
- [How It Works](#how-it-works)
- [Implementation Details](#implementation-details)
- [Advantages and Limitations](#advantages-and-limitations)

## Introduction

POS tagging is a fundamental task in Natural Language Processing (NLP) in which each word in a sentence is labeled with its part of speech (e.g., noun, verb, adjective). This implementation uses HMM-style transition and emission statistics to choose a likely tag for each word in a sentence.

## How It Works

The `PosTagging` class is designed to:

1. **Train**: It counts tag-to-tag transitions and tag-to-word emissions in a provided training dataset of tagged sentences.
2. **Tag**: It assigns a POS tag to each word of an unseen sentence, one word at a time from left to right, using probabilities estimated from those counts. This is a greedy simplification of Viterbi decoding, not the full algorithm.

## Implementation Details

### Transition and Emission Probabilities

- **Transition Probability**: The probability of a tag $T_i$ given the previous tag $T_{i-1}$. It is estimated from the tag-pair counts collected during training.

$$
P(T_i \mid T_{i-1}) = \frac{\text{Count}(T_{i-1}, T_i)}{\sum_{T_j} \text{Count}(T_{i-1}, T_j)}
$$

- **Emission Probability**: The probability of a word $W_i$ given a tag $T_i$. It is estimated from how often the word is associated with the tag in the training data.

$$
P(W_i \mid T_i) = \frac{\text{Count}(T_i, W_i)}{\sum_{W_j} \text{Count}(T_i, W_j)}
$$
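
The two formulas can be checked by hand on the single training sentence from the notebook example. The following is a minimal sketch, not the code in the `PosTagging` class: the helper names `transition_prob` and `emission_prob` are illustrative assumptions, and `None` is assumed to stand for the start of a sentence.

```python
from collections import Counter

# Toy training data: the single sentence used in the notebook example.
train_sent = [[("I", "PRP"), ("love", "VBP"), ("natural", "JJ"),
               ("language", "NN"), ("processing", "NN")]]

transition = Counter()  # (previous tag, tag) -> count
emission = Counter()    # (tag, word) -> count
for sent in train_sent:
    prev_tag = None  # None marks the start of a sentence (assumption)
    for word, tag in sent:
        transition[(prev_tag, tag)] += 1
        emission[(tag, word)] += 1
        prev_tag = tag


def transition_prob(prev_tag, tag):
    """P(tag | prev_tag): pair count over all transitions out of prev_tag."""
    total = sum(c for (p, _), c in transition.items() if p == prev_tag)
    return transition[(prev_tag, tag)] / total if total else 0.0


def emission_prob(tag, word):
    """P(word | tag): pair count over all words emitted by tag."""
    total = sum(c for (t, _), c in emission.items() if t == tag)
    return emission[(tag, word)] / total if total else 0.0


print(transition_prob("JJ", "NN"))      # 1.0: JJ is followed by NN once, out of one JJ transition
print(emission_prob("NN", "language"))  # 0.5: NN emits "language" once, out of two NN tokens
```

Both helpers return `0.0` when the denominator is zero, so a tag that never has a successor in the training data does not cause a division error.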
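
### Tagging Algorithm

For each word in the input sentence, the class computes the product of the transition probability (given the tag chosen for the previous word) and the emission probability for every known tag, and assigns the tag with the highest product. A word whose product is zero for every tag, such as a word never seen in training, is tagged `None`.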
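
Because this procedure commits to one tag per word before moving on, it can miss the jointly most probable tag sequence. Full Viterbi decoding keeps, for every position and tag, the best-scoring path that ends there and only picks the sequence at the end. The sketch below illustrates that idea under stated assumptions: it is not part of the `PosTagging` class, the function name `viterbi` and the count tables are illustrative, and it uses unsmoothed probabilities, so it returns `None` when every path has probability zero.

```python
from collections import Counter

# Toy training data: the single sentence used in the notebook example.
train_sent = [[("I", "PRP"), ("love", "VBP"), ("natural", "JJ"),
               ("language", "NN"), ("processing", "NN")]]

transition, out_total = Counter(), Counter()  # (prev, tag) -> count, prev -> count
emission, tag_total = Counter(), Counter()    # (tag, word) -> count, tag -> count
for sent in train_sent:
    prev = None  # None marks the start of a sentence (assumption)
    for word, tag in sent:
        transition[(prev, tag)] += 1
        out_total[prev] += 1
        emission[(tag, word)] += 1
        tag_total[tag] += 1
        prev = tag
tags = sorted(tag_total)


def trans_p(prev, tag):
    """P(tag | prev), or 0.0 if prev never had a successor in training."""
    return transition[(prev, tag)] / out_total[prev] if out_total[prev] else 0.0


def viterbi(words):
    """Return (word, tag) pairs for the most probable path, or None if all paths score 0."""
    if not words:
        return []
    # best[i][tag] = (score of the best path ending in tag at position i, previous tag on it)
    best = []
    for i, word in enumerate(words):
        column = {}
        for tag in tags:
            emit = emission[(tag, word)] / tag_total[tag]
            if i == 0:
                column[tag] = (trans_p(None, tag) * emit, None)
            else:
                column[tag] = max(
                    (best[i - 1][p][0] * trans_p(p, tag) * emit, p) for p in tags
                )
        best.append(column)
    last = max(tags, key=lambda t: best[-1][t][0])
    if best[-1][last][0] == 0:
        return None
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return list(zip(words, path))


print(viterbi(["I", "love", "natural", "language"]))
```

On this toy data the greedy pass and Viterbi agree; they can differ on larger corpora where an early, locally best tag leads to a worse sequence overall.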

### Key Attributes

- **`transition`**: A dictionary mapping each `(previous_tag, tag)` pair to its count in the training data; `None` stands for the start of a sentence.
- **`emission`**: A dictionary mapping each `(tag, word)` pair to its count in the training data.
- **`tag_set`**: A set containing all the unique tags in the training data.
- **`word_set`**: A set containing all the unique words in the training data.

## Advantages and Limitations

### Advantages
- **Simplicity**: The HMM-based approach is straightforward and interpretable.
- **Efficiency**: Tagging makes a single left-to-right pass over the sentence and scores each known tag once per word.

### Limitations
- **Sparsity**: Words that do not appear in the training data have zero emission probability for every tag and are tagged `None`; tag pairs never seen in training likewise get zero transition probability.
- **Context**: HMMs assume that the tag of a word depends only on the previous tag, which can be limiting for capturing long-range dependencies.
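
One common way to soften the sparsity problem is add-one (Laplace) smoothing of the emission estimate, so that an unseen word gets a small non-zero probability under every tag. The sketch below is an assumption about how this could be added, not something the `PosTagging` class does; the function name `smoothed_emission_prob`, the `alpha` parameter, and the extra vocabulary slot for unknown words are all illustrative choices.

```python
from collections import Counter

# Toy training data: the single sentence used in the notebook example.
train_sent = [[("I", "PRP"), ("love", "VBP"), ("natural", "JJ"),
               ("language", "NN"), ("processing", "NN")]]

emission = Counter()   # (tag, word) -> count
tag_total = Counter()  # tag -> count
vocab = set()
for sent in train_sent:
    for word, tag in sent:
        emission[(tag, word)] += 1
        tag_total[tag] += 1
        vocab.add(word)


def smoothed_emission_prob(tag, word, alpha=1.0):
    """Add-alpha estimate of P(word | tag) with one reserved slot for unseen words."""
    vocab_size = len(vocab) + 1  # +1 for a generic unknown-word slot (assumption)
    return (emission[(tag, word)] + alpha) / (tag_total[tag] + alpha * vocab_size)


print(smoothed_emission_prob("NN", "language"))  # (1 + 1) / (2 + 6) = 0.25
print(smoothed_emission_prob("NN", "nautue"))    # (0 + 1) / (2 + 6) = 0.125
```

With smoothed emissions the tagger would fall back on the transition probabilities to choose a tag for an unseen word instead of returning `None`.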
