
Commit 24c3bd7 (parent 3b584ad): Update README.md

1 file changed: README.md (+37 -35 lines)
@@ -94,6 +94,41 @@ For custom prompts:
bash scripts/run_inference.sh --mode predict --texts "What do you think about politics right now?"
```

## Data Preprocessing

1. **Subreddit Selection**
   - Chosen subreddits: `r/askscience`, `r/AskHistorians`, `r/ExplainLikeImFive`, `r/AskPhysics`, `r/AskSocialScience`, `r/AskDocs`, `r/AskBiology`, `r/AskEconomics`
   - Dataset size: 4446 examples
   - Reasoning: these communities offer a high volume of general-interest Q&A, high-quality natural- and social-science topics, and good language quality. Data quality weighed more heavily in subreddit selection than any single domain focus, since high-quality data was needed to get good results from a small training run. Diverse subreddits were chosen to keep the model from adopting a restrictive tone or style.
2. **Post & Comment Filters**
   - **Post-level exclusions**
     - Skip non-self posts (link/image posts), removed content, cross-posts, locked threads, stickied or mod-distinguished posts, and “over 18” content.
     - Discard posts by `AutoModerator` or with link flairs in `{"announcement", "meta", "megathread"}`.
     - Require `score ≥ 2` and `upvote_ratio ≥ 0.90`.
     - De-duplicate by title (case-insensitive), and require the combined title+body to have ≥ 6 words.
   - **Comment-level exclusions**
     - Ignore stickied or mod-distinguished comments, comments by `AutoModerator`, and comments from the same author as the post.
     - Require comment `score ≥ 2` and ≥ 30 words.
   - **Top-comment selection**
     - Expand “more comments” placeholders, then pick the comment maximizing `quality = score * (word_count ** 0.3)` to avoid very short or joke-style replies (see the filtering sketch after this list).
   - **Duplication Prevention**
     - Drop duplicate Q–A pairs.
3. **Text Cleaning** (illustrated in the cleaning sketch after this list)
   - Strip HTML/Markdown tags using a parser (BeautifulSoup).
   - Remove fenced code blocks matching ```` ```…``` ````.
   - Eliminate URLs (`http://` or `https://`) and colon-wrapped emoji codes (e.g. `:smile:`).
   - Collapse quoted lines beginning with `>` (including multi-line quotes).
   - Remove bot signature footers matching `*I am a bot…*`.
   - Replace any sequence of whitespace (spaces, newlines, tabs) with a single space, and trim leading/trailing spaces.
4. **Train/Val/Test Split**
   - 80% train, 10% validation, 10% test.
   - Save the raw dataset, then write separate CSV files for train, validation, and test (also covered in the cleaning sketch below).
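
The repository's collection code is not shown in this diff, so the following is a minimal filtering sketch of the post filters, comment filters, and quality-based top-comment selection described above, assuming the data was pulled with PRAW; `keep_post`, `best_comment`, and `BANNED_FLAIRS` are illustrative names, not the project's actual API.

```python
# Hedged sketch: mirrors the step-2 rules, assuming PRAW submission/comment objects.
BANNED_FLAIRS = {"announcement", "meta", "megathread"}

def keep_post(post) -> bool:
    """Post-level exclusions and thresholds from step 2."""
    author = str(post.author) if post.author else ""
    flair = (post.link_flair_text or "").lower()
    title_body = f"{post.title} {post.selftext}"
    return (
        post.is_self                                         # drop link/image (non-self) posts
        and not post.over_18                                 # drop "over 18" content
        and not post.locked
        and not post.stickied
        and post.distinguished is None                       # drop mod-distinguished posts
        and getattr(post, "crosspost_parent", None) is None  # drop cross-posts
        and post.selftext not in ("[removed]", "[deleted]")  # drop removed content
        and author != "AutoModerator"
        and flair not in BANNED_FLAIRS
        and post.score >= 2
        and post.upvote_ratio >= 0.90
        and len(title_body.split()) >= 6                     # combined title+body >= 6 words
    )

def best_comment(post):
    """Comment-level filters, then quality = score * word_count ** 0.3."""
    post.comments.replace_more(limit=0)                      # expand "more comments" stubs
    candidates = [
        c for c in post.comments
        if not c.stickied
        and c.distinguished is None
        and str(c.author) != "AutoModerator"
        and str(c.author) != str(post.author)                # skip the post's own author
        and c.score >= 2
        and len(c.body.split()) >= 30                        # require >= 30 words
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.score * len(c.body.split()) ** 0.3)
```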
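
The cleaning rules in step 3 and the split in step 4 can be sketched in the same spirit with `re`, BeautifulSoup, and pandas; the regex patterns, function names, and output paths below are illustrative reconstructions, not the repository's exact code.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    """Illustrative version of the step-3 cleaning rules."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)          # fenced code blocks
    text = BeautifulSoup(text, "html.parser").get_text()             # HTML tags
    text = re.sub(r"https?://\S+", " ", text)                        # URLs
    text = re.sub(r":[A-Za-z0-9_+-]+:", " ", text)                   # emoji codes like :smile:
    text = re.sub(r"^\s*>.*$", " ", text, flags=re.MULTILINE)        # quoted lines
    text = re.sub(r"\*I am a bot.*?\*", " ", text, flags=re.DOTALL)  # bot footers
    return re.sub(r"\s+", " ", text).strip()                         # collapse whitespace

def split_and_save(df: pd.DataFrame, out_dir: str = "data") -> None:
    """80/10/10 split into train/validation/test CSVs (step 4)."""
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle
    n_train, n_val = int(0.8 * len(df)), int(0.1 * len(df))
    df.to_csv(f"{out_dir}/raw.csv", index=False)                     # raw dataset first
    df.iloc[:n_train].to_csv(f"{out_dir}/train.csv", index=False)
    df.iloc[n_train:n_train + n_val].to_csv(f"{out_dir}/val.csv", index=False)
    df.iloc[n_train + n_val:].to_csv(f"{out_dir}/test.csv", index=False)
```
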
## Results

![Train Loss curves](assets/train_loss.png)
@@ -142,42 +177,8 @@ This project intentionally focused more on the methods and pipeline than the act
- Train for more epochs.
- Use a better evaluation metric like ROUGE-L for early stopping.
## Running the Gradio Inference App

This project includes an interactive Gradio app for making predictions with the trained model.
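
The app's source is outside this diff; as a rough sketch (assuming a Hugging Face text2text checkpoint saved at a hypothetical `outputs/final_model` path, which the actual app may not use), the core of such an app could look like:

```python
import gradio as gr
from transformers import pipeline

# Hypothetical checkpoint path; adjust the task and path to match the trained model.
generator = pipeline("text2text-generation", model="outputs/final_model")

def answer(question: str) -> str:
    # Return the model's generated answer for a single question.
    return generator(question, max_new_tokens=256)[0]["generated_text"]

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="Reddit Q&A").launch()
```

The steps below cover running the app included in this repository.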
183184
1. **Obtain the Trained Model:**
@@ -202,6 +203,7 @@ pytest
```

## Hyperparameter Exploration

For systematic hyperparameter exploration, you can use W&B sweeps:

1. Enter your Weights & Biases account entity in `sweep.yaml` at the project root:
