- Reasoning: These subreddits had a high volume of general-interest Q&A, high-quality natural-science and social-science topics, and good language quality across diverse communities. Data quality weighed more heavily in subreddit selection than any specific domain focus, since quality data was needed to get better results within a small training run. Diverse subreddits were chosen to prevent the model from adopting a restrictive tone or style.
2. **Post & Comment Filters**
   - **Post-level exclusions**
     - Skip any non-self posts, link/image posts, removed content, cross-posts, locked threads, stickied or mod-distinguished posts, and “over 18” content.
     - Discard posts by `AutoModerator` or with link flairs in `{"announcement","meta","megathread"}`.
     - Require `score ≥ 2` and `upvote_ratio ≥ 0.90`.
     - De-duplicate by title (case-insensitive), and ensure the combined title+body has ≥ 6 words.
   - **Comment-level exclusions**
     - Ignore stickied or mod-distinguished comments, any by `AutoModerator`, or comments from the same author as the post.
     - Require comment `score ≥ 2` and ≥ 30 words.
   - **Top-comment selection**
     - Replace “more” comments, then pick the comment maximizing `quality = score * (word_count ** 0.3)` to avoid very short or joke-style replies.
   - **Duplication Prevention**
     - Drop duplicate Q–A pairs.
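The comment-selection rule above can be sketched as follows. The field names (`score`, `body`, `author`, `stickied`, `distinguished`) mirror Reddit's comment attributes, but the helpers themselves are illustrative, not the project's actual code:

```python
def comment_quality(score: int, word_count: int) -> float:
    """Quality heuristic: up-weights longer comments sublinearly,
    so a short joke with many upvotes loses to a substantive answer."""
    return score * (word_count ** 0.3)

def pick_top_comment(comments, post_author):
    """Apply the comment-level exclusions, then maximize quality."""
    candidates = []
    for c in comments:
        words = c["body"].split()
        if c.get("stickied") or c.get("distinguished"):
            continue  # skip stickied / mod-distinguished comments
        if c["author"] in (post_author, "AutoModerator"):
            continue  # skip the OP's own replies and the bot
        if c["score"] < 2 or len(words) < 30:
            continue  # minimum score and length thresholds
        candidates.append(c)
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: comment_quality(c["score"], len(c["body"].split())))
```

Note how the `word_count ** 0.3` term makes a 100-word comment with score 3 lose to a 40-word comment with score 5, while a heavily upvoted one-liner is filtered out entirely by the 30-word floor.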
3. **Text Cleaning**
   - Strip HTML/Markdown tags using a parser (BeautifulSoup).
   - Remove fenced code blocks matching `` ```…``` ``.
   - Eliminate URLs (`http://` or `https://`) and colon-wrapped emoji codes (e.g. `:smile:`).
   - Collapse quoted lines beginning with `>` (including multi-line quotes).
   - Remove bot signature footers matching `*I am a bot…*`.
   - Replace any sequence of whitespace (spaces, newlines, tabs) with a single space, and trim leading/trailing spaces.
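A minimal sketch of the regex-based cleaning steps (the BeautifulSoup tag stripping is omitted here, and the patterns are illustrative approximations of the rules above, not the project's exact ones):

```python
import re

def clean_text(text: str) -> str:
    """Apply the cleaning rules in order; patterns are illustrative."""
    text = re.sub(r"(?s)`{3}.*?`{3}", " ", text)        # fenced code blocks
    text = re.sub(r"https?://\S+", " ", text)           # URLs
    text = re.sub(r":[a-z_+-]+:", " ", text)            # emoji codes like :smile:
    text = re.sub(r"(?m)^>.*$", " ", text)              # quoted lines
    text = re.sub(r"(?s)\*I am a bot.*?\*", " ", text)  # bot signature footers
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace
```

Order matters: code blocks and quotes are removed as whole spans first, and the final whitespace collapse cleans up the gaps the earlier substitutions leave behind.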
4. **Train/Val/Test Split**
   - 80% train, 10% validation, 10% test.
   - Save the raw dataset, then write separate CSV files for train, validation, and test.
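The split could be sketched like this, assuming the Q–A pairs live in a pandas DataFrame (the shuffle seed and column layout are illustrative):

```python
import pandas as pd

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """Shuffle, then slice into 80% train / 10% validation / 10% test."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test
```

Each frame can then be written out with `DataFrame.to_csv`, matching the separate CSV files described above.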
## Results

This project intentionally focused more on the methods and pipeline than the actual results.

- Train for more epochs.
- Use a better evaluation metric like ROUGE-L for early stopping.
## Running the Gradio Inference App
This project includes an interactive Gradio app for making predictions with the trained model.
1. **Obtain the Trained Model:**
```
pytest
```
## Hyperparameter Exploration
For systematic hyperparameter exploration, you can use W&B sweeps:
1. Enter your Weights & Biases account entity in `sweep.yaml` at the project root:
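A minimal `sweep.yaml` might look like the following; the `entity` placeholder is the part you replace, and the program name, metric, and parameter grid shown here are illustrative, not necessarily the project's actual sweep config:

```yaml
entity: your-wandb-entity   # ← replace with your W&B account entity
program: train.py           # hypothetical training entry point
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    values: [1e-5, 3e-5, 5e-5]
  batch_size:
    values: [8, 16]
```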