Modify reduce_dimensionality to use fit_transform#2347
Modify reduce_dimensionality to use fit_transform#2347jemather wants to merge 9 commits intoMaartenGr:masterfrom
Conversation
Modifying _reduce_dimensionality to use fit_transform, per MaartenGr#2335
|
Thanks for the PR! I'm seeing some errors popping up here and there. Could you take a look? |
|
@jemather do you want to look at the failures or would it be Ok for me to "take over" this PR to address the test failures? |
|
@betatim Sorry, started working on the changes and did not commit them. Let me see if I can get this done here in the next hour or so and if not happy to hand it over. Thank you for the nudge! |
trying to diagnose what the other potential error is besides a TypeError that needs to be caught
trying another condition
ensuring fit_transform is available
Remove testing for whether topic_representations_ has been modified - not clear why this is needed here, and creates an unhandled case when using partial_fit
fixing lint whitespace issue
|
@betatim took me a few tries, but think it should be set now. |
|
Cool beans! Happy that all it took was a nudge :D :D |
|
I don't have direct experience with BERTopic, coming to this as someone who works on cuml. Hence motivated to help move this PR along. @MaartenGr the diff looks reasonable, but I don't know much about the code base. Is there something I can help with regarding this PR or does it "just" need time from one of the maintainers to take a look at it and think about (I say "just" because maintainer time and brain power is one of the rarest resources on planet earth ;) ) |
MaartenGr
left a comment
There was a problem hiding this comment.
Thanks for the changes. I left a couple of small comments. Can you take a look?
| elif self.topic_representations_ is None: | ||
| self.umap_model.fit(embeddings) | ||
| umap_embeddings = self.umap_model.transform(embeddings) | ||
| else: | ||
| if hasattr(self.umap_model, "fit_transform"): | ||
| umap_embeddings = self.umap_model.fit_transform(embeddings, y=y) | ||
| else: | ||
| self.umap_model.fit(embeddings) | ||
| umap_embeddings = self.umap_model.transform(embeddings) |
There was a problem hiding this comment.
I believe the elif statement should remain here instead of using the else statement. It has to do with some specifics regarding partial_fit not needing to be fitted again.
There was a problem hiding this comment.
This should be fixed in the latest commit. I ran into an issue with this initially because I believe a condition exists within the test suite where partial_fit was True, the umap_model did not have a partial_fit method, and there were apparently topic representations, because umap_embeddings would not be created and switching to an else there fixed this.
I was wondering if it might be better to throw an exception or otherwise give a warning if a partial_fit is requested and the umap_model doesn't support it. Right now that is being obscured, and I've retained that behavior in my commit as I don't really know how it affects things outside of here.
There was a problem hiding this comment.
What I meant was is that the else statement should only transform the data, not fit again. Especially with a partial_fit function, there might be instances where we do not want to retrain the model but instead fit it once (especially if it is a UMAP model).
There was a problem hiding this comment.
Sorry, I'm a bit confused as to what the desired behavior is. I was trying to emulate that of the original code, which to my reading runs partial_fit() if partial_fit == True AND the model has a partial_fit method, otherwise it runs fit. To my reading, there are no conditions under which either fit or partial_fit won't be run when _reduce_dimensionality() is called.
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)
# Regular fit
else:
try:
# cuml umap needs y to be an numpy array
y = np.array(y) if y is not None else None
self.umap_model.fit(embeddings, y=y)
except TypeError:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
There was a problem hiding this comment.
The original code was this:
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)which means that if the latter two if and elif statements are not satisfied, then it will only run self.umap_model.transform(embeddings).
This means that in your code, the else statement that you added should only run umap_embeddings = self.umap_model.transform(embeddings)
In other words:
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
if hasattr(self.umap_model, "fit_transform"):
umap_embeddings = self.umap_model.fit_transform(embeddings)
else:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
else:
umap_embeddings = self.umap_model.transform(embeddings)This retains the original behavior, namely that:
- If you the model has
partial_fit, use that - If it does not and has no representations, fit a model (using your suggested
fit_transformfor speedup if it exists) - If it has no
partial_fitand already has topic representations, then UMAP was already fit once and we should only runtransform
| if hasattr(self.umap_model, "fit_transform"): | ||
| umap_embeddings = self.umap_model.fit_transform(embeddings, y=y) | ||
| else: | ||
| self.umap_model.fit(embeddings) |
There was a problem hiding this comment.
I'm missing the y=y here in the .fit. I can see it's there in the .fit_transform though.
There was a problem hiding this comment.
I saw you moved the y=y to transform step but shouldn't it be used in the .fit instead?
There was a problem hiding this comment.
Yes it should. Sorry about that, I'll fix it.
There was a problem hiding this comment.
In your latest commit, you moved the y parameter to the wrong line. It shouldn't be used in .transform but .fit instead.
Addressing comments raised by @MaartenGr
|
Looks like this might be ready to merge? Is there any way we could help with this PR? |
|
It's almost there. My comments that I made on June 6th (see above) weren't addressed just yet. If those are addressed and cover the mentioned use cases, then I can go ahead and merge. |
|
Closing this as it was picked up and merged in #2416 |
Modifying _reduce_dimensionality to use fit_transform in single step, supporting the use of nearest neighbor descent when using cuML UMAP.
Fixes #2335
Before submitting