Merge pull request #69 from himanshumahajan138/feature/text-summarizer

king04aman · web-flow · commit 79d8745acd9b · 2024-10-21T23:31:23.000+05:30
Feature: Text Summarization Tool Using NLP ; Fixed
diff --git a/Text Summarizer/README.md b/Text Summarizer/README.md
@@ -0,0 +1,167 @@
+# TextRank-based Text Summarization
+
+This project implements a **TextRank-based approach** to extractive summarization of long texts such as articles. The algorithm ranks sentences by importance, applying the PageRank algorithm to a graph of sentence similarities, and returns the top-ranked sentences as the summary.
+
+## Table of Contents
+
+1. [Features](#features)
+2. [Installation](#installation)
+3. [Usage](#usage)
+4. [How It Works](#how-it-works)
+5. [Project Structure](#project-structure)
+6. [Dependencies](#dependencies)
+7. [Short Summary](#short-summary)
+8. [License](#license)
+
+---
+
+## Features
+
+- Cleans the input text and removes stopwords during preprocessing.
+- Utilizes **GloVe word embeddings** for sentence vectorization.
+- Applies the **TextRank algorithm** to rank and select important sentences.
+- Automatically downloads GloVe embeddings if not present locally.
+- Outputs a summary of the most relevant sentences from the input text.
+
+## Installation
+
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/king04aman/All-In-One-Python-Projects.git
+   ```
+2. Install the required Python libraries:
+
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Download necessary NLTK data for tokenization and stopword removal:
+   ```python
+   import nltk
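+   nltk.download('punkt')      # sentence tokenizer models
+   nltk.download('punkt_tab')  # needed by sent_tokenize in NLTK 3.9+, the version pinned in requirements.txt
+   nltk.download('stopwords')  # English stopword list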
+   ```
+
+## Usage
+
+1. Prepare your CSV file with a column `article_text` containing the text articles you want to summarize.
+
+2. Run the script with your desired input:
+   ```bash
+   python text_summarizer.py
+   ```
+
+### Example:
+
+- Ensure the input CSV file is at the following path:
+
+  ```text
+  Text Summarizer/sample.csv
+  ```
+
+- The script will output the summary of the most important sentences from the input text.
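The expected input layout can be illustrated with a small, dependency-free sketch. It uses the standard `csv` module rather than whatever loader the script itself uses, and the in-memory rows and `example.com` URLs are made up for illustration; only the column names come from `sample.csv`.

```python
import csv
import io

# In-memory stand-in for sample.csv; the rows and URLs are hypothetical.
sample = io.StringIO(
    "article_id,article_text,source\n"
    '1,"First article. It has two sentences.",https://example.com/a\n'
    '2,"Second article.",https://example.com/b\n'
)

# Each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(sample))

# The summarizer only needs the article_text column.
articles = [row["article_text"] for row in rows]
print(articles)
```

Quoting the `article_text` field, as `sample.csv` does, keeps commas inside an article from being read as column separators.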
+
+### Script Parameters
+
+You can modify the following paths and settings inside the script:
+
+- `input_csv`: Path to your input CSV file.
+- `glove_dir`: Directory for storing GloVe embeddings.
+- `glove_file`: Path to the GloVe embeddings file.
+- `top_n_sentences`: The number of sentences you want in the summary (default is 10).
+
+## How It Works
+
+### 1. Text Preprocessing
+
+- Sentences are tokenized, and each sentence is cleaned by:
+  - Removing punctuation, numbers, and special characters.
+  - Converting text to lowercase.
+  - Removing stopwords using the NLTK library.
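The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the script's code: the regex sentence splitter and the tiny `STOPWORDS` set are assumptions standing in for NLTK's tokenizer and stopword list, so the example runs without any downloads.

```python
import re

# Tiny illustrative stopword set; the script uses NLTK's English list instead.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "has", "on"}

def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace (stand-in for NLTK)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(sentence):
    """Replace non-letters with spaces, lowercase, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

text = "Federer won 6-1, 6-4 on Saturday. He is seeking a ninth title!"
sentences = split_sentences(text)
cleaned = [clean_sentence(s) for s in sentences]
print(cleaned)  # ['federer won saturday', 'he seeking ninth title']
```

The original sentences are kept alongside the cleaned ones, since the summary is printed from the originals while the cleaned versions are only used for scoring.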
+
+### 2. Sentence Vectorization
+
+- The script uses **GloVe embeddings** to convert words in each sentence into a vector representation. Sentence vectors are the average of all word vectors in a sentence.
+- If the embeddings file is not present locally, the script downloads it automatically.
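Averaging word vectors into a sentence vector can be sketched as below. The 3-dimensional `EMBEDDINGS` table is a made-up toy (real GloVe vectors here are 100-dimensional), and skipping out-of-vocabulary words is an assumption; the script may handle unknown words differently.

```python
# Toy 3-dimensional "embeddings"; the values are invented for illustration.
EMBEDDINGS = {
    "federer": [1.0, 0.0, 2.0],
    "won": [0.0, 2.0, 0.0],
    "title": [2.0, 1.0, 1.0],
}
DIM = 3

def sentence_vector(cleaned_sentence, embeddings=EMBEDDINGS, dim=DIM):
    """Average the vectors of known words; return a zero vector if none are known."""
    vectors = [embeddings[w] for w in cleaned_sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(sentence_vector("federer won title"))  # [1.0, 1.0, 1.0]
print(sentence_vector("unknown words"))      # [0.0, 0.0, 0.0]
```

Returning a zero vector for sentences with no known words keeps every sentence the same dimensionality, which the similarity step relies on.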
+
+### 3. Building Similarity Matrix
+
+- A similarity matrix is built by calculating the **cosine similarity** between sentence vectors. This matrix forms the basis for ranking sentences.
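A plain-Python sketch of the similarity matrix follows. It is illustrative only: the script depends on scikit-learn and likely uses its cosine similarity helper, and leaving the diagonal at zero (no self-similarity) is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities, with the diagonal left at zero."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = cosine_similarity(vectors[i], vectors[j])
    return matrix

vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sim = similarity_matrix(vectors)
for row in sim:
    print([round(x, 3) for x in row])
```

In this toy input the middle vector is similar to both of the others, while the first and last are orthogonal, so only the middle row has two non-zero entries.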
+
+### 4. Sentence Ranking
+
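+- The similarity matrix is treated as a weighted graph in which sentences are nodes and similarities are edge weights, and the **PageRank algorithm** is run on that graph. Sentences are then ranked by their PageRank scores; higher-scoring sentences are deemed more important for summarization.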
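To make the ranking step concrete, here is a small power-iteration sketch of weighted PageRank in plain Python. It is not the script's implementation (the project lists `networkx` as a dependency, which provides its own PageRank); the damping factor of 0.85 is the conventional default, and the code assumes non-negative similarities.

```python
def pagerank(sim, damping=0.85, iterations=100, tol=1e-9):
    """Weighted PageRank by power iteration over a non-negative similarity matrix."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]
    for _ in range(iterations):
        # Score held by nodes with no outgoing weight is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0.0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * sim[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0.0
            )
            new_scores.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Sentence 1 is similar to both others; sentences 0 and 2 are unrelated.
sim = [
    [0.0, 0.7, 0.0],
    [0.7, 0.0, 0.7],
    [0.0, 0.7, 0.0],
]
scores = pagerank(sim)
print([round(s, 3) for s in scores])
```

The sentence that is similar to the most other sentences (index 1 here) ends up with the highest score, which is the intuition behind using PageRank for summarization.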
+
+### 5. Output Summary
+
+- Based on the rankings, the highest-scoring sentences are selected as the summary; the number selected is set by `top_n_sentences` (10 by default). These sentences are printed as the output of the script.
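Selecting the summary from the scores can be sketched as follows. The function name `top_sentences` and the example sentences and scores are hypothetical; the sketch returns sentences in rank order, which may differ from how the script orders its output.

```python
def top_sentences(sentences, scores, top_n=10):
    """Return the top_n original sentences, highest score first."""
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:top_n]]

sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
scores = [0.2, 0.5, 0.3]
summary = top_sentences(sentences, scores, top_n=2)
print(summary)  # ['Sentence B.', 'Sentence C.']
```

If `top_n` exceeds the number of sentences, the slice simply returns all of them, so short articles do not need special handling.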
+
+## Project Structure
+
+```
+.
+├── Text Summarizer/
+│   ├── sample.csv           # Example CSV input file with articles
+│   ├── text_summarizer.py   # Main script for summarization
+│   ├── glove/               # Directory for storing GloVe embeddings
+│   └── text_summarizer.log  # Log file
+```
+
+## Dependencies
+
+- **Python 3.x**
+- **Libraries**:
+  - `numpy`
+  - `pandas`
+  - `nltk`
+  - `scikit-learn` (imported as `sklearn`)
+  - `networkx`
+  - `requests`
+  - `tqdm`
+
+All dependencies can be installed via:
+
+```bash
+pip install -r requirements.txt
+```
+
+### GloVe Embeddings
+
+- The script uses **GloVe embeddings** from Stanford NLP to generate sentence vectors.
+  - By default, the **100-dimensional GloVe vectors** (`glove.6B.100d.txt`) are used.
+  - Download link: [GloVe 6B embeddings](http://nlp.uoregon.edu/download/embeddings/glove.6B.100d.txt)
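GloVe text files store one word per line, followed by its vector components separated by spaces. A minimal loader sketch is shown below; `load_glove` is a hypothetical helper name, and the two-line in-memory "file" with 3-dimensional vectors is made up so the example runs without the real download.

```python
import io

def load_glove(lines):
    """Parse GloVe-style lines ('word v1 v2 ... vN') into a dict of float lists."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Stand-in for open("glove.6B.100d.txt", encoding="utf-8"); the values are invented.
fake_file = io.StringIO("tennis 0.1 -0.2 0.3\nmatch 0.4 0.5 -0.6\n")
vectors = load_glove(fake_file)
print(vectors["tennis"])  # [0.1, -0.2, 0.3]
```

With the real file, the same function would be called on an open file handle, and each vector would have 100 components.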
+
+## Short Summary
+
+TextRank Text Summarization
+
+This script implements a TextRank-based approach for text summarization.
+The input is a CSV file containing text articles, and the output is a summary
+of the text.
+
+Steps:
+
+1. Preprocesses the text by removing punctuation, numbers, special characters, and stopwords.
+2. Generates sentence vectors using GloVe word embeddings.
+3. Builds a similarity matrix using cosine similarity between sentence vectors.
+4. Applies the PageRank algorithm to rank sentences.
+5. Outputs a summary of the most important sentences.
+
+Dependencies:
+
+- numpy
+- pandas
+- nltk
+- scikit-learn (imported as sklearn)
+- networkx
+- GloVe word embeddings (automatically downloaded if not present)
+
+Author: [Himanshu Mahajan](https://github.com/himanshumahajan138)
+
+Date: 19-10-2024
+
+
+## License
+
+This project is licensed under the MIT License.
+
+---
diff --git a/Text Summarizer/requirements.txt b/Text Summarizer/requirements.txt
@@ -0,0 +1,7 @@
+networkx==3.3
+nltk==3.9.1
+numpy==2.1.2
+pandas==2.2.3
+Requests==2.32.3
+scikit_learn==1.5.2
+tqdm==4.66.5
diff --git a/Text Summarizer/runtime.txt b/Text Summarizer/runtime.txt
@@ -0,0 +1 @@
+python-3.10.7
diff --git a/Text Summarizer/sample.csv b/Text Summarizer/sample.csv
@@ -0,0 +1,9 @@
+article_id,article_text,source
+1,"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. I think every person has different interests. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life. I think everyone just thinks because we're tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do. There are so many other things that we're interested in, that we do.'",https://www.tennisworldusa.org/tennis/news/Maria_Sharapova/62220/i-do-not-have-friends-in-tennis-says-maria-sharapova/
+2,"BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with a forehand volley winner to break Zverev for the second time in the semifinal. He came through two rounds of qualifying last weekend to reach the Basel main draw, including beating Zverev's older brother, Mischa. Federer had an easier time than in his only previous match against Medvedev, a three-setter at Shanghai two weeks ago.",http://www.tennis.com/pro-game/2018/10/copil-stuns-5th-ranked-zverev-to-reach-swiss-indoors-final/77721/
+3,"Roger Federer has revealed that organisers of the re-launched and condensed Davis Cup gave him three days to decide if he would commit to the controversial competition. Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment. ""They only left me three days to decide"", Federer said. ""I didn't to have time to consult with all the people I had to consult. ""I could not make a decision in that time, so I told them to do what they wanted."" The 20-time Grand Slam champion has voiced doubts about the wisdom of the one-week format to be introduced by organisers Kosmos, who have promised the International Tennis Federation up to $3 billion in prize money over the next quarter-century. The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace the classic home-and-away ties played four times per year for decades. Kosmos is headed by Barcelona footballer Gerard Pique, who is hoping fellow Spaniard Rafael Nadal will play in the upcoming event. Novak Djokovic has said he will give precedence to the ATP's intended re-launch of the defunct World Team Cup in January 2020, at various Australian venues. Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest. Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent. ""I highly doubt it, of course. We will see what happens,"" he said. ""I do not think this was designed for me, anyhow. This was designed for the future generation of players."" Argentina and Britain received wild cards to the new-look event, and will compete along with the four 2018 semi-finalists and the 12 teams who win qualifying rounds next February. 
""I don't like being under that kind of pressure,"" Federer said of the deadline Kosmos handed him.",https://scroll.in/field/899938/tennis-roger-federer-ignored-deadline-set-by-new-davis-cup
+4,"Kei Nishikori will try to end his long losing streak in ATP finals and Kevin Anderson will go for his second title of the year at the Erste Bank Open on Sunday. The fifth-seeded Nishikori reached his third final of 2018 after beating Mikhail Kukushkin of Kazakhstan 6-4, 6-3 in the semifinals. A winner of 11 ATP events, Nishikori hasn't triumphed since winning in Memphis in February 2016. He has lost eight straight finals since. The second-seeded Anderson defeated Fernando Verdasco 6-3, 3-6, 6-4. Anderson has a shot at a fifth career title and second of the year after winning in New York in February. Nishikori leads Anderson 4-2 on career matchups, but the South African won their only previous meeting this year. With a victory on Sunday, Anderson will qualify for the ATP Finals. Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London next month. Nishikori held serve throughout against Kukushkin, who came through qualifying. He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point. Against Verdasco, Anderson hit nine of his 19 aces in the opening set. The Spaniard broke Anderson twice in the second but didn't get another chance on the South African's serve in the final set.",http://www.tennis.com/pro-game/2018/10/nishikori-beats-kukushkin-in-vienna-for-3rd-final-of-season/77719/
+5,"Federer, 37, first broke through on tour over two decades ago and he has since gone on to enjoy a glittering career. The 20-time Grand Slam winner is chasing his 99th ATP title at the Swiss Indoors this week and he faces Jan-Lennard Struff in the second round on Thursday (6pm BST). Davenport enjoyed most of her success in the late 1990s and her third and final major tournament win came at the 2000 Australian Open. But she claims the mentality of professional tennis players slowly began to change after the new millennium. ""It seems pretty friendly right now,"" said Davenport. ""I think there is a really nice environment and a great atmosphere, especially between some of the veteran players helping some of the younger players out. ""It's a very pleasant atmosphere, I'd have to say, around the locker rooms. ""I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments. ""And even though maybe we had smaller teams, I still think we kept to ourselves quite a bit. ""Not always, but I really feel like in the mid-2000 years there was a huge shift of the attitudes of the top players and being more friendly and being more giving, and a lot of that had to do with players like Roger coming up. ""I just felt like it really kind of changed where people were a little bit, definitely in the 90s, a lot more quiet, into themselves, and then it started to become better."" Meanwhile, Federer is hoping he can improve his service game as he hunts his ninth Swiss Indoors title this week. ""I didn't serve very well [against first-round opponent Filip Kranjovic,"" Federer said. ""I think I was misfiring the corners, I was not hitting the lines enough. ""Clearly you make your life more difficult, but still I was up 6-2, 3-1, break points, so things could have ended very quickly today, even though I didn't have the best serve percentage stats. 
""But maybe that's exactly what caught up to me eventually. It's just getting used to it. This is where the first rounds can be tricky.""",https://www.express.co.uk/sport/tennis/1036101/Roger-Federer-Swiss-Indoors-Jan-Lennard-Struff-Lindsay-Davenport
+6,"Nadal has not played tennis since he was forced to retire from the US Open semi-finals against Juan Martin Del Porto with a knee injury. The world No 1 has been forced to miss Spain's Davis Cup clash with France and the Asian hard court season. But with the ATP World Tour Finals due to begin next month, Nadal is ready to prove his fitness before the season-ending event at the 02 Arena. Nadal flew to Paris on Friday and footage from the Paris Masters official Twitter account shows the Spaniard smiling as he strides onto court for practice. The Paris Masters draw has been made and Nadal will start his campaign on Tuesday or Wednesday against either Fernando Verdasco or Jeremy Chardy. Nadal could then play defending champion Jack Sock in the third round before a potential quarter-final with either Borna Coric or Dominic Thiem. Nadal's appearance in Paris is a big boost to the tournament organisers who could see Roger Federer withdraw. Federer is in action at the Swiss Indoors in Basel and if he reaches the final, he could pull out of Paris in a bid to stay fresh for London. But as it stands, Federer is in the draw and is scheduled to face either former world No 3 Milos Raonic or Jo-Wilfried Tsonga in the second round. Federer's projected route to the Paris final could also lead to matches against Kevin Anderson and Novak Djokovic. Djokovic could play Marco Cecchinato in the second round. British No 1 Kyle Edmund is the 12th seed in Paris and will get underway in round two against either Karen Khachanov or Filip Krajinovic.",https://www.express.co.uk/sport/tennis/1037119/Rafael-Nadal-World-No-1-Paris-Masters-Federer-Djokovic
+7,"Tennis giveth, and tennis taketh away. The end of the season is finally in sight, and with so many players defending,or losing,huge chunks of points in Singapore, Zhuhai and London, podcast co-hosts Nina Pantic and Irina Falconi discuss the art of defending points (02:14). It's no secret that Jack Sock has struggled on the singles court this year (his record is 7-19). He could lose 1,400 points in the next few weeks, but instead of focusing on the negative, it can all be about perspective (06:28). Let's also not forget his two Grand Slam doubles triumphs this season. Two players, Stefanos Tsitsipas and Kyle Edmund, won their first career ATP titles last week (13:26). It's a big deal because you never forget your first. Irina looks back at her WTA title win in Bogota in 2016, and tells an unforgettable story about her semifinal drama (14:04). In Singapore, one of the biggest storylines (aside from the matches, of course) has been the on-court coaching debate. Nina and Irina give their opinions on what coaching should look like in the future, on both tours (18:55).",http://www.tennis.com/pro-game/2018/10/tenniscom-podcast-irina-falconi-jack-sock-rafael-nadal-singapore/77698/
+8,"Federer won the Swiss Indoors last week by beating Romanian qualifier Marius Copil in the final. The 37-year-old claimed his 99th ATP title and is hunting the century in the French capital this week. Federer has been handed a difficult draw where could could come across Kevin Anderson, Novak Djokovic and Rafael Nadal in the latter rounds. But first the 20-time Grand Slam winner wants to train on the Paris Masters court this afternoon before deciding whether to appear for his opening match against either Milos Raonic or Jo-Wilfried Tsonga. ""On Monday, I am free and will look how I feel,"" Federer said after winning the Swiss Indoors. ""On Tuesday I will fly to Paris and train in the afternoon to be ready for my first match on Wednesday night. ""I felt good all week and better every day. ""We also had the impression that at this stage it might be better to play matches than to train. ""And as long as I fear no injury, I play."" Federer's success in Basel last week was the ninth time he has won his hometown tournament. And he was delighted to be watched on by all of his family and friends as he purchased 60 tickets for the final for those dearest to him. ""My children, my parents, my sister and my team are all there,"" Federer added. ""It is always very emotional for me to thank my team. And sometimes it tilts with the emotions, sometimes I just stumble. ""It means the world to me. It makes me incredibly happy to win my home tournament and make people happy here. ""I do not know if it's maybe my last title, so today I try a lot more to absorb that and enjoy the moments much more consciously. ""Maybe I should celebrate as if it were my last title. ""There are very touching moments: seeing the ball children, the standing ovations, all the familiar faces in the audience. Because it was not always easy in the last weeks.""",https://www.express.co.uk/sport/tennis/1038186/Roger-Federer-set-for-crunch-Paris-Masters-decision-today
diff --git a/Text Summarizer/text_summarizer.py b/Text Summarizer/text_summarizer.py