
Conversation

AGhafaryy
Contributor

No description provided.

@AGhafaryy AGhafaryy requested review from zprobst and ccloes as code owners June 27, 2025 23:18

codecov bot commented Jun 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.25%. Comparing base (d32b3cd) to head (f06c0ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #426      +/-   ##
==========================================
+ Coverage   98.21%   98.25%   +0.03%     
==========================================
  Files         152      154       +2     
  Lines        6111     6249     +138     
==========================================
+ Hits         6002     6140     +138     
  Misses        109      109              
| Flag                 | Coverage Δ                   |
|----------------------|------------------------------|
| 3.10-macos-latest    | 98.23% <100.00%> (+0.05%) ⬆️ |
| 3.10-ubuntu-latest   | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.10-windows-latest  | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.11-macos-latest    | 98.22% <100.00%> (+0.02%) ⬆️ |
| 3.11-ubuntu-latest   | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.11-windows-latest  | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.12-macos-latest    | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.12-ubuntu-latest   | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.12-windows-latest  | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.13-macos-latest    | 98.23% <100.00%> (+0.05%) ⬆️ |
| 3.13-ubuntu-latest   | 98.22% <100.00%> (+0.04%) ⬆️ |
| 3.13-windows-latest  | 98.22% <100.00%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

return headers

@property
def _normalized_query(self) -> str:
Contributor

@angelosantos4 angelosantos4 Jul 2, 2025

Note: for base queries the search command is implied:

index=index_a

But for subqueries the search keyword is required:

index=index_a | join [search index=index_b]
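For illustration, a minimal sketch of that normalization (this is not the PR's actual _normalized_query implementation):

```python
@property
def _normalized_query(self) -> str:
    query = self.query.strip()
    # A base query like "index=index_a" implies the leading "search" command;
    # adding it explicitly keeps the string valid when it is embedded in
    # contexts such as "| join [search ...]".
    if not query.startswith(("search ", "|")):
        query = f"search {query}"
    return query
```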

Comment on lines +236 to +249
try:
    root = ET.fromstring(response.text)
    for elem in root.iter():
        if (
            "dispatchState" in elem.tag
            or elem.get("name") == "dispatchState"
        ):
            dispatch_state = elem.text
            break
except Exception as e:
    self.logger.warning(
        "Failed to parse job status",
        extra={"error": str(e), "search_id": search_id},
    )
Contributor

I see there is a pattern here between parsing things as JSON and parsing them as XML. I would recommend creating a parser class that will take an arbitrary object of one of the two types and get the corresponding object a bit more declaratively.

class SplunkResponseParser:
    def parse(object: Json | XML):
        ...

Member

to riff on this - create a parser interface that you can call parse on and it returns to you a consistent result. Switch on the return type and create an instance of either a JsonSplunkResponseParser or XmlSplunkResponseParser. This can be done as a factory method on SplunkResponseParser.

This is a refactoring strategy called "replace conditional with polymorphism".
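A rough sketch of that shape (all class and method names below are illustrative, not from the PR):

```python
import json
from abc import ABC, abstractmethod
from typing import Any
from xml.etree import ElementTree


class SplunkResponseParser(ABC):
    @abstractmethod
    def parse(self, body: str) -> dict[str, Any]: ...

    @classmethod
    def for_content_type(cls, content_type: str) -> "SplunkResponseParser":
        # Factory method: switch once on the content type, then let
        # polymorphism do the rest.
        if "json" in content_type:
            return JsonSplunkResponseParser()
        return XmlSplunkResponseParser()


class JsonSplunkResponseParser(SplunkResponseParser):
    def parse(self, body: str) -> dict[str, Any]:
        return json.loads(body)


class XmlSplunkResponseParser(SplunkResponseParser):
    def parse(self, body: str) -> dict[str, Any]:
        root = ElementTree.fromstring(body)
        return {elem.get("name"): elem.text for elem in root.iter() if elem.get("name")}
```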

Contributor

+1 to the interface idea.

Comment on lines +226 to +231
job_status = response.json()
dispatch_state = (
    job_status.get("entry", [{}])[0]
    .get("content", {})
    .get("dispatchState")
)
Contributor

I think there are heavy assumptions here about the state of the JSON. Do we know why it is the first field in the entry list? I fear that a parallel job run from a different Splunk extractor will create a job, and it will end up at the top of the list here. Or is it the case that a search can have multiple jobs? Why is the first one the one we care about?

Contributor

Maybe:

for entry in job_status.get("entry", []):
    # check that some condition holds for every entry, or that this entry
    # is the one we are looking for
    ...

Comment on lines +266 to +267
await asyncio.sleep(2) # Wait 2 seconds before checking again
wait_count += 2
Contributor

I think it is more standard to do something akin to:

attempts = 0
while should_continue:
    should_continue = success_condition and attempts < MAX_ATTEMPTS
    attempts += 1

Member

in #425 we're discussing adding a library like tenacity to handle things like this as well.
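For reference, a minimal sketch of what that might look like if tenacity were adopted (the decorator arguments and helper below are illustrative, not a proposed design):

```python
from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed


def _still_running(dispatch_state: str) -> bool:
    return dispatch_state not in ("DONE", "FAILED")


@retry(
    retry=retry_if_result(_still_running),  # keep polling until a terminal state
    wait=wait_fixed(2),                     # seconds between checks
    stop=stop_after_attempt(150),           # roughly the current 300-second cap
)
async def check_job_status(client, search_id: str) -> str:
    # Fetch the job status and return its dispatchState (body omitted here).
    ...
```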

Contributor

Link to tenacity

elif dispatch_state == "FAILED":
    raise RuntimeError(f"Search job failed: {search_id}")

await asyncio.sleep(2)  # Wait 2 seconds before checking again
Contributor

You can also replace this with a global variable SPLUNK_STATUS_CHECK_PERIOD_SECONDS.

Contributor

I would very much prefer this. My mantra is "prefer no magic numbers/constant strings". Ideally, most constants should be extracted into config or something like it, but at the very least make them module level constants until it's clear which ones need to change often.
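For illustration, the extraction both comments describe could be as small as (the second constant name here is invented):

```python
SPLUNK_STATUS_CHECK_PERIOD_SECONDS = 2
SPLUNK_MAX_WAIT_SECONDS = 300
```

Call sites then read `await asyncio.sleep(SPLUNK_STATUS_CHECK_PERIOD_SECONDS)` instead of `await asyncio.sleep(2)`.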

Comment on lines +202 to +204
async def _wait_for_job_completion(
    self, client: AsyncClient, search_id: str, max_wait_seconds: int = 300
):
Contributor

Try typing the return types. Instead of nesting try/except blocks, you can return a completion state of True or False in order to hand the expected state off to the overarching function.
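For illustration, a standalone sketch of that idea (the URL handling, constant, and JSON shape are assumptions pieced together from the excerpts in this PR, not its actual code):

```python
import asyncio

from httpx import AsyncClient

SPLUNK_STATUS_CHECK_PERIOD_SECONDS = 2  # illustrative constant


async def wait_for_job_completion(
    client: AsyncClient, status_url: str, max_wait_seconds: int = 300
) -> bool:
    """Return True when the job reaches DONE, False on FAILED or timeout."""
    waited = 0
    while waited < max_wait_seconds:
        response = await client.get(status_url, params={"output_mode": "json"})
        dispatch_state = (
            response.json()
            .get("entry", [{}])[0]
            .get("content", {})
            .get("dispatchState")
        )
        if dispatch_state == "DONE":
            return True
        if dispatch_state == "FAILED":
            return False
        await asyncio.sleep(SPLUNK_STATUS_CHECK_PERIOD_SECONDS)
        waited += SPLUNK_STATUS_CHECK_PERIOD_SECONDS
    return False
```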

"search": self._normalized_query,
"earliest_time": self.earliest_time,
"latest_time": self.latest_time,
"max_count": str(self.max_count),
Contributor

Why turn to string here? Maybe have the interface expect a string instead.

Comment on lines +25 to +42
class SplunkExtractor(Extractor):
    @classmethod
    def from_file_data(
        cls,
        base_url: str,
        query: str,
        auth_token: Optional[str] = None,
        username: Optional[str] = None,
        password: Optional[str] = None,
        earliest_time: str = "-24h",
        latest_time: str = "now",
        verify_ssl: bool = True,
        request_timeout_seconds: int = 300,
        max_count: int = 10000,
        app: str = "search",
        user: Optional[str] = None,
        chunk_size: int = 1000,
    ) -> "SplunkExtractor":
Contributor

Try to separate the functionality into a Client and have extract_records use it.
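A very rough sketch of that split (all names are illustrative and the nodestream import path is an assumption, not taken from this PR):

```python
from nodestream.pipeline import Extractor  # assumed import path


class SplunkClient:
    """Owns the HTTP details: auth, job creation, polling, result paging."""

    def __init__(self, base_url: str, auth_token: str | None = None) -> None:
        self.base_url = base_url
        self.auth_token = auth_token

    async def run_search(self, query: str):
        # Placeholder: create the job, wait for completion, stream result chunks.
        yield {"query": query}


class SplunkExtractor(Extractor):
    """Stays thin: holds a client and yields whatever it returns."""

    def __init__(self, client: SplunkClient, query: str) -> None:
        self.client = client
        self.query = query

    async def extract_records(self):
        async for record in self.client.run_search(self.query):
            yield record
```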

Member

Agreed

Comment on lines +280 to +281
"count": str(self.chunk_size),
"offset": str(self.offset),
Contributor

Why are these strings? Do we need to change the input parameters to just take stringified integers?

Comment on lines +291 to +296
if response.status_code != 200:
    raise HTTPStatusError(
        f"Failed to get job results: {response.status_code}",
        request=response.request,
        response=response,
    )
Contributor

What if the error is a 504? It could be the case that we hit the endpoint often and they try to rate-limit us.

Contributor

IMO:

  • 401/403/404 - stop trying
  • 2xx are mostly ok. Limiting to 200 may be a problem, but it rarely is.
  • 3xx should almost never be seen, because the underlying HTTP library should handle them and do whatever redirecting is necessary
  • 4xx/5xx - log and try again with backoff
  • Connection/DNS errors - log and try again with backoff
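A sketch of that policy with httpx (the constants and helper name are illustrative; real retry handling would presumably live in the shared client idea discussed below):

```python
import asyncio

import httpx

MAX_ATTEMPTS = 5     # illustrative
BACKOFF_SECONDS = 2  # illustrative


async def get_with_backoff(client: httpx.AsyncClient, url: str) -> httpx.Response:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = await client.get(url)
        except httpx.RequestError:
            # Connection/DNS errors: retry with backoff.
            pass
        else:
            if response.status_code in (401, 403, 404):
                response.raise_for_status()  # stop trying immediately
            if 200 <= response.status_code < 300:
                return response
            # Other 4xx/5xx (429, 504, ...): fall through and retry with backoff.
        await asyncio.sleep(BACKOFF_SECONDS * attempt)
    raise RuntimeError(f"Giving up on {url} after {MAX_ATTEMPTS} attempts")
```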

Contributor

Also, I feel like this indicates there might be a way to make a more pluggable "client" idea once we figure out a unified retry logic. 🤔

Contributor

We have an attempt to do this in our internal library but have not had the time to introduce it to nodestream. It does everything from status check handling, retrying, and error handling to safe JSON loading.

@angelosantos4
Contributor

Also note that there is a Splunk client supported by Splunk itself.

Not sure if there is much distinguishing the two, but it might make a lot of the abstraction easier, like the job handling:
https://docs.splunk.com/DocumentationStatic/PythonSDK/1.1/client.html


async def resume_from_checkpoint(self, checkpoint_object):
    """Resume extraction from a checkpoint."""
    if checkpoint_object:
Member

This check is not required; nodestream does this. This should only be getting called with a non-None object.


Comment on lines +119 to +122
def get_jobs_endpoint(self) -> str:
    """Get the Splunk jobs endpoint."""
    return f"{self.base_url}/servicesNS/{self.user}/{self.app}/search/jobs"

Member

This can be computed in the constructor - I don't see a clear reason to generate this string constantly.
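For example (the holder class is purely illustrative; in the PR this would just be attributes set in the extractor's constructor):

```python
class SplunkEndpoints:
    def __init__(self, base_url: str, user: str, app: str) -> None:
        # Formatted once instead of on every call to get_jobs_endpoint().
        self.jobs = f"{base_url}/servicesNS/{user}/{app}/search/jobs"

    def results(self, search_id: str) -> str:
        return f"{self.jobs}/{search_id}/results"
```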

"""Get the results endpoint for a specific search job."""
return f"{self.base_url}/servicesNS/{self.user}/{self.app}/search/jobs/{search_id}/results"

async def _create_search_job(self, client: AsyncClient) -> str:
Member

This function has a lot going on:

  • formulates a request body
  • makes the request
  • parses the response in one of two different possible return types.

For a simpler request, that may be fine, but given the legwork required, it's easy to get lost in this function. I'd recommend breaking this down: make this function tell a "story" while other functions describe the details.
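A hedged sketch of that "story" shape; the helper names are invented for illustration and the snippet is not runnable outside the extractor class:

```python
async def _create_search_job(self, client: AsyncClient) -> str:
    payload = self._build_search_payload()    # formulate the request body
    response = await client.post(self.get_jobs_endpoint(), data=payload)
    return self._extract_search_id(response)  # parse JSON or XML for the sid
```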

Contributor

Also, shouldn't this be search/v2/jobs/{search_id}/results or am I looking at the wrong docs?

Contributor

actually actually, do we want to support a streaming splunk search result from search/v2/jobs/export?

Contributor

Sorry, more splunk admin/hacker knowledge is being unlocked:

when we say "splunk extractor" we should consider supporting (and being clear about which we're supporting):

  • ad-hoc "streaming" queries (time-bound, no job-id required)
  • running a "job" and then getting the results (requires creating the job and then getting the results, what this PR covers)
  • accessing "scheduled" search results (by schedule name)



    )
    return search_id

async def _wait_for_job_completion(
Contributor

My IDE flags this as having too much "cognitive complexity" (a.k.a. too many nested branches and loops).

It would be good to refactor chunks of this into smaller functions to reduce the amount of brainpower required to figure out what's going on.

    )
except (json.JSONDecodeError, KeyError, IndexError):
    # Try XML parsing
    import xml.etree.ElementTree as ET
Contributor

It would be more "pythonic" to say:

Suggested change
import xml.etree.ElementTree as ET
from xml.etree import ElementTree

typically the community reserves aliases for modules (and almost always lowercase only):

    import pandas as pd
    import numpy as np

Comment on lines +175 to +178
mock_response = mocker.MagicMock()
mock_response.status_code = 201
mock_response.json.return_value = {"sid": "json123"}

Contributor

We should prefer using responses or pytest-httpx to handle this mocking, they're more careful about rejecting unexpected calls and emulating an actual http call.
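For example, a sketch with pytest-httpx (the URL and payload are illustrative):

```python
import httpx


def test_create_search_job_returns_sid(httpx_mock):
    # pytest-httpx raises for any request that does not match a registered response.
    httpx_mock.add_response(
        method="POST",
        url="https://splunk.example.com/services/search/jobs",
        status_code=201,
        json={"sid": "json123"},
    )
    with httpx.Client() as client:
        response = client.post("https://splunk.example.com/services/search/jobs")
    assert response.status_code == 201
    assert response.json()["sid"] == "json123"
```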


# Helper property tests
def test_splunk_extractor_auth_property_with_token(splunk_extractor):
    assert_that(splunk_extractor._auth, equal_to(None))  # Token goes in header
Contributor

In general I would avoid testing internal/private properties and functions, to avoid making tests that rely on implementation details.

These tests can be handled by responses or pytest-httpx by having the mock library watch for the expected auth headers.
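A sketch of asserting the auth header through pytest-httpx instead of the private _auth property (the URL and token are illustrative):

```python
import httpx


def test_token_is_sent_as_authorization_header(httpx_mock):
    httpx_mock.add_response(
        url="https://splunk.example.com/services/search/jobs",
        match_headers={"Authorization": "Bearer test-token"},
        json={"sid": "json123"},
    )
    with httpx.Client(headers={"Authorization": "Bearer test-token"}) as client:
        response = client.get("https://splunk.example.com/services/search/jobs")
    assert response.json()["sid"] == "json123"
```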

Comment on lines +395 to +405
assert_that(results, has_length(3))
assert_that(
    results[0],
    has_entries(
        {
            "_time": "2023-01-01T10:00:00",
            "host": "server1",
            "message": "Login successful",
        }
    ),
)
Contributor

why not assert the whole result?

Suggested change
assert_that(results, has_length(3))
assert_that(
    results[0],
    has_entries(
        {
            "_time": "2023-01-01T10:00:00",
            "host": "server1",
            "message": "Login successful",
        }
    ),
)
assert results == [
    {
        "_time": "2023-01-01T10:00:00",
        "host": "server1",
        "message": "Login successful",
    },
    {...},
    {...},
]


# Should handle gracefully and return empty results
assert_that(results, has_length(0))
assert_that(splunk_extractor.is_done, equal_to(True))
Contributor

Hamcrest is great IMO for lists, but it just feels ugly for equalities. This is purely me being a hater of hamcrest.

Suggested change
assert_that(splunk_extractor.is_done, equal_to(True))
assert splunk_extractor.is_done
