
Commit ede837f

Add Advanced Extractive Text Summarization model (Issue #100)
1 parent 550d3fc commit ede837f

File tree: 3 files changed, +79 −0 lines
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# Advanced Extractive Text Summarization

An advanced extractive text summarization model using NLP techniques.

## Features
- Extracts key sentences from text
- Scores sentences using TF-IDF, sentence length, position, and named entities (see the sketch below)
- Clusters sentences via K-means to highlight critical points from thematic groups
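
To give a feel for the scoring step in isolation, here is a minimal sketch of the TF-IDF part: each sentence's TF-IDF weights are summed with scikit-learn. The example sentences are illustrative only, and the full pipeline in `summarizer.py` also weights length, position, and named entities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sentences; the real pipeline gets them from nltk.sent_tokenize.
sentences = [
    "NLP studies how computers process human language.",
    "Extractive summarization keeps sentences from the original text.",
    "Clustering groups related sentences into themes.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)  # shape: (n_sentences, n_terms)
scores = tfidf.sum(axis=1).A1                       # total TF-IDF weight per sentence
for score, sentence in zip(scores, sentences):
    print(f"{score:.3f}  {sentence}")
```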

## Usage
1. Install dependencies:
   ```bash
   pip install nltk spacy scikit-learn
   python -m spacy download en_core_web_sm
   ```
2. Run `summarizer.py`.
3. The script will print a summary of the sample text; `summarize` can also be imported directly, as sketched below.
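
A minimal sketch of using the summarizer as a module, assuming `summarizer.py` is importable from the working directory (the input text below is arbitrary):

```python
from summarizer import summarize

text = (
    "Solar power capacity has grown rapidly over the last decade. "
    "Falling panel prices made rooftop installations common. "
    "Grid operators now plan around midday generation peaks. "
    "Battery storage helps shift that energy into the evening. "
    "Policy incentives continue to shape adoption rates. "
    "Analysts expect the trend to continue for years."
)

# One sentence is selected from each of the two clusters.
print(summarize(text, n_clusters=2))
```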

## Requirements
- Python 3.x
- nltk
- spacy
- scikit-learn

## License
MIT
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
nltk
spacy
scikit-learn
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
"""
Advanced Extractive Text Summarization Model
Issue #100 for king04aman/All-In-One-Python-Projects
"""
import nltk
import numpy as np
import spacy
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time resources: the punkt sentence tokenizer and the small English spaCy model.
nltk.download('punkt')
nlp = spacy.load('en_core_web_sm')


def extract_sentences(text):
    """Split the input text into sentences."""
    return nltk.sent_tokenize(text)


def score_sentences(sentences):
    """Build one feature row per sentence: TF-IDF weight, length, position, entity count."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1  # total TF-IDF weight of each sentence
    features = []
    for i, sent in enumerate(sentences):
        length = len(sent)
        position = i / len(sentences)  # relative position in the document
        doc = nlp(sent)
        entities = len(doc.ents)  # number of named entities
        features.append([scores[i], length, position, entities])
    return np.array(features)


def cluster_sentences(features, n_clusters=3):
    """Group sentences into thematic clusters with K-means."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(features)
    return labels


def summarize(text, n_clusters=3):
    """Summarize by picking the highest-scoring sentence from each cluster."""
    sentences = extract_sentences(text)
    features = score_sentences(sentences)
    # K-means cannot form more clusters than there are sentences.
    n_clusters = min(n_clusters, len(sentences))
    labels = cluster_sentences(features, n_clusters)
    summary = []
    for cluster in range(n_clusters):
        idx = np.where(labels == cluster)[0]
        if len(idx) > 0:
            best = idx[np.argmax(features[idx, 0])]  # top TF-IDF sentence in this cluster
            summary.append(sentences[best])
    return "\n".join(summary)


if __name__ == "__main__":
    sample_text = """
    Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through language. NLP techniques are used to analyze text, extract information, and generate summaries. Extractive summarization selects key sentences from the original text to create a concise summary. Advanced models use features like TF-IDF, sentence length, position, and named entities to score sentences. Clustering helps group related sentences and highlight critical points from different themes. This approach is useful for summarizing reports, research papers, and news articles.
    """
    print("Summary:\n", summarize(sample_text))
