Conversation
mkaranasou
left a comment
I have many questions I think :P
Good job in general, many good ideas in here 👍
    try:
        if num_fails >= self.config.engine.banjax_num_fails_to_ban:
            self.ip_cache.ip_banned(ip)
            sql = f'update request_sets set banned = 1 where ' \
Have you tested the performance of the update? I think we could consider having a separate table for the banjax bans, since they will be far fewer rows than request_sets.
Also, do you use raw SQL strings for better performance?
- No, I did not test the performance; I just monitor the performance of the post-processing pipeline.
- Not for performance. I was a bit concerned about that mysterious 1h shift issue and thought that a SQL update with an explicit time is solid and maybe more readable.
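To illustrate the reviewer's suggestion of a separate bans table, here is a minimal sketch using sqlite3 as a stand-in for the project's actual database. The table and column names (`banjax_bans`, `banned_at`) are hypothetical, not the project's real schema; the point is that inserting one small row per ban avoids an in-place update against the much larger `request_sets` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# request_sets is large; banjax_bans holds only the (rare) banned IPs.
conn.execute("create table request_sets (id integer primary key, ip text)")
conn.execute("create table banjax_bans (ip text primary key, banned_at text)")
conn.executemany("insert into request_sets (ip) values (?)",
                 [("1.2.3.4",), ("5.6.7.8",), ("1.2.3.4",)])

def ban_ip(conn, ip, banned_at):
    # One small insert instead of an update scanning request_sets.
    conn.execute("insert or replace into banjax_bans (ip, banned_at) values (?, ?)",
                 (ip, banned_at))

ban_ip(conn, "1.2.3.4", "2021-01-01 00:00:00")

# The banned flag can still be recovered with a join when needed.
rows = conn.execute(
    "select rs.id from request_sets rs join banjax_bans b on rs.ip = b.ip"
).fetchall()
print(len(rows))  # 2: both request_sets rows sharing the banned IP
```

The explicit `banned_at` timestamp mirrors the "SQL update with explicit time" reasoning above, while sidestepping updates to the wide table entirely.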
    super().__init__(config, steps)
    self.df_chunks = []
    self.df_white_list = None
    self.ip_cache = IPCache(config, self.logger)
Will the IPCache be used in other steps? It could be a TaskWithIPCache task. Also, I think all the Banjax logic should live in separate tasks, like the steps of AttackDetection; it would be more modular. I remember you said it was a bit tricky because of the differing dataframe needs, but, whenever we have some time of course, we could figure it out.
Again, whenever we have some time :) We could revisit this when doing performance tuning after the cluster set-up, for example.
TaskWithIPCache looks like overkill. If you need an IPCache in other steps you just create an instance and use it; it's a singleton. But yes, we can rethink this later. For now, at least, using this singleton does not block us from moving in any direction.
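A minimal sketch of the singleton pattern being described, assuming IPCache works roughly like this (the real class also takes a config and a logger; this version is illustrative only):

```python
class IPCache:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Every "construction" returns the same shared object, so any
        # task can instantiate IPCache and see the same state.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._cache = {}
        return cls._instance

    def ip_banned(self, ip):
        self._cache[ip] = 'banned'

a = IPCache()
b = IPCache()
a.ip_banned('1.2.3.4')
print(b is a)                  # True: both names point at one instance
print(b._cache['1.2.3.4'])     # 'banned': state set via a is visible via b
```

This is why a dedicated TaskWithIPCache base class adds little: each step that needs the cache just constructs it and gets the shared instance.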
    records = self.ip_cache.update(records)
    num_records = len(records)
    if num_records > 0:
        challenged_ips = self.spark.createDataFrame(records).withColumn('challenged', F.lit(1))
Have you tried it with persist? If yes, does it help? This also goes for the whitelist df.
I don't remember, but this part is not a bottleneck. I tried persist for whitelisting; it did not help.
White-list hosts added.
Local cache for the challenged IPs. Every challenged IP is cached in order to match `ip_failed_challenge` or `ip_passed_challenge` banjax reports.
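The cache-and-count behaviour described above can be sketched as follows. This is a hypothetical illustration, not the project's implementation: each `ip_failed_challenge` report increments a per-IP counter, a passed challenge clears it, and the IP is banned once the threshold (standing in for `config.engine.banjax_num_fails_to_ban` from the diff) is reached.

```python
from collections import defaultdict

class ChallengeCache:
    """Illustrative local cache for challenged IPs (names are assumptions)."""

    def __init__(self, num_fails_to_ban):
        self.num_fails_to_ban = num_fails_to_ban
        self.fails = defaultdict(int)
        self.banned = set()

    def ip_failed_challenge(self, ip):
        # Count consecutive failed challenges; ban at the threshold.
        self.fails[ip] += 1
        if self.fails[ip] >= self.num_fails_to_ban:
            self.banned.add(ip)

    def ip_passed_challenge(self, ip):
        # A passed challenge clears the failure history for that IP.
        self.fails.pop(ip, None)

cache = ChallengeCache(num_fails_to_ban=3)
for _ in range(3):
    cache.ip_failed_challenge('1.2.3.4')
print('1.2.3.4' in cache.banned)  # True

cache.ip_failed_challenge('5.6.7.8')
cache.ip_passed_challenge('5.6.7.8')
print('5.6.7.8' in cache.banned)  # False: counter was reset before the threshold
```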