Conversation

@AsafMah AsafMah commented Feb 18, 2025

solves #567

github-actions bot commented Feb 18, 2025

Test Results

    6 files  ±0      6 suites  ±0   25m 38s ⏱️ -34s
  314 tests ±0    279 ✅ ±0   35 💤 ±0  0 ❌ ±0 
1 884 runs  ±0  1 674 ✅ ±0  210 💤 ±0  0 ❌ ±0 

Results for commit 18d85ce. ± Comparison against base commit c2e31d3.

♻️ This comment has been updated with latest results.

yogilad previously approved these changes Feb 25, 2025
https://docs.microsoft.com/en-us/azure/data-explorer/ingest-data-overview#ingestion-methods
:param pandas.DataFrame df: input dataframe to ingest.
:param azure.kusto.ingest.IngestionProperties ingestion_properties: Ingestion properties.
:param DataFormat data_format: Format to convert the dataframe to. If not specified, it will try to infer it from the mapping, if not found, it will default to JSON.
Please specify that the valid options are None, CSV, and JSON.

# If we are given CSV mapping, or the mapping format is explicitly set to CSV, we should use CSV
if not data_format:
    if ingestion_properties is not None and (ingestion_properties.ingestion_mapping_type == DataFormat.CSV):
        is_json = False

You should check explicitly for JSON mapping and throw if the mapping type is not CSV, JSON, or None.
https://learn.microsoft.com/en-us/kusto/management/mappings?view=microsoft-fabric#supported-mapping-types
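A sketch of what that validation might look like. The enum and helper below are standalone stand-ins, not the SDK's actual code, so the idea is runnable without azure-kusto-ingest installed:

```python
from enum import Enum


# Stand-in for the SDK's DataFormat enum; AVRO here represents any
# mapping kind the dataframe path cannot serialize to.
class DataFormat(Enum):
    CSV = "csv"
    JSON = "json"
    AVRO = "avro"


def infer_dataframe_format(ingestion_properties, data_format=None):
    """Pick the serialization format for dataframe ingestion.

    An explicit data_format wins; otherwise fall back to the mapping kind,
    raising for mapping kinds that are neither CSV, JSON, nor absent.
    """
    if data_format is not None:
        return data_format
    mapping_kind = getattr(ingestion_properties, "ingestion_mapping_type", None)
    if mapping_kind is None:
        return DataFormat.JSON  # default when nothing is specified
    if mapping_kind in (DataFormat.CSV, DataFormat.JSON):
        return mapping_kind
    raise ValueError(f"Mapping kind {mapping_kind} is not supported for dataframe ingestion")
```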

@ViaFerrata

Could it be that this has created a bug, @AsafMah (tested with the latest PyPI release)?
I think the data format is no longer inferred correctly from the IngestionProperties.

My Pandas ingestion from before the patch looked like this:

ingestion_props = IngestionProperties(database=kusto_db, table=table_name, data_format=DataFormat.CSV, ignore_first_record=True, report_level=ReportLevel.FailuresAndSuccesses)
result_pd = client_qi.ingest_from_dataframe(df_csv, ingestion_properties=ingestion_props)
print(result_pd)

Output (before patch):

IngestionResult(status=IngestionStatus.QUEUED, database=log, table=ssh_log_data, source_id=xxx, obfuscated_blob_uri=https://xxx/yyy.csv.gz)

With the same code after the patch, JSON is chosen even though CSV is explicitly selected within the IngestionProperties.
Output after patch:

IngestionResult(status=IngestionStatus.QUEUED, database=log, table=ssh_log_data, source_id=xxx, obfuscated_blob_uri=https://xxxx/yyy.json.gz)

Only noticed this, because the ingestion now fails since json and ignore_first_record=True is not compatible.
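A minimal standalone sketch of the behavior being described (the helper is hypothetical, not the SDK's code): if the inference no longer consults `data_format` on the IngestionProperties, JSON wins unless the format is passed explicitly to the dataframe ingestion call:

```python
from enum import Enum


class DataFormat(Enum):
    CSV = "csv"
    JSON = "json"


# Hypothetical model of the post-patch behavior: data_format on the
# IngestionProperties is not consulted, so without an explicit override
# the dataframe is serialized as JSON and the blob ends in .json.gz.
def blob_extension(properties_data_format, explicit_data_format=None):
    chosen = explicit_data_format or DataFormat.JSON
    return ".csv.gz" if chosen is DataFormat.CSV else ".json.gz"


print(blob_extension(DataFormat.CSV))                  # JSON despite CSV on the properties
print(blob_extension(DataFormat.CSV, DataFormat.CSV))  # explicit override restores CSV
```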

yogilad commented Mar 16, 2025

@ViaFerrata ,

IIUC, our impl. of the Pandas ingestion does not generate a header record.
It sounds like you were losing the first record of each ingestion.
Can you confirm?

@AsafMah, we should set this value (ignore_first_record) to False in all ingestion paths (JSON and CSV).

ViaFerrata commented Mar 16, 2025

@ViaFerrata ,

IIUC, our impl. of the Pandas ingestion does not generate a header record. It sounds like you were losing the first record of each ingestion. Can you confirm?

Looking at the source code (don't have access to my code right now), I think so, yes (due to header=False):

df.to_csv(temp_file_path, index=False, encoding="utf-8", header=False)
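For context, a small self-contained demo of what `header=False` does (the sample dataframe is made up): no header row is written, so an ingestion with `ignore_first_record=True` would drop a real data row:

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
buf = io.StringIO()
# Mirrors the SDK call above: index=False and header=False mean the first
# line of the CSV is already data, not column names.
df.to_csv(buf, index=False, header=False)
print(buf.getvalue())  # two data rows, no "name,value" header line
```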

I remember that I was fiddling around with the various ingestion options and reused the same IngestionProperties out of laziness. I wanted to use the proper IngestionProperties for the pandas ingestion in the end, but it seems I forgot, thanks!


And regarding the data format, that is as intended, right? (I dug a bit into the source code now.)
-> data_format in IngestionProperties is not related to the JSON/CSV choice handed over to ingest_from_file; only the ingestion_mapping_kind plays a role.

I was just a bit confused since a month ago the above code would hand CSV files over to ingest_from_file instead of JSON (according to the blob_uri).

EDIT: Release notes make it kind of clear: "By default, the data will be serialized as json to avoid this issue."
