74 changes: 74 additions & 0 deletions Data_Ingestion/Read_excel_data/README.md
@@ -0,0 +1,74 @@
# Read Excel Files in Spark and Pandas

This module demonstrates two approaches to reading Excel files in Spark environments such as **OCI Data Flow**, **Databricks**, or **local Spark clusters**.

---

## 1. Using `com.crealytics.spark.excel`

This approach uses the **Spark Excel connector** developed by [Crealytics](https://github.com/crealytics/spark-excel).
It supports `.xls` and `.xlsx` files directly within Spark DataFrames.

### Requirements

You must add the following JARs to your cluster classpath:

- poi-4.1.2.jar
- poi-ooxml-4.1.2.jar
- poi-ooxml-schemas-4.1.2.jar
- xmlbeans-3.1.0.jar
- curvesapi-1.06.jar
- commons-collections4-4.4.jar
- commons-compress-1.20.jar
- spark-excel_2.12-0.13.5.jar

Download them from [Maven Central Repository](https://mvnrepository.com/).
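One way to put these JARs on the classpath is at submit time. A sketch using `spark-submit` (the `jars/` directory and script name are placeholders, not part of this module):

```shell
# Illustrative: pass the downloaded JARs via --jars
# (paths assume the JARs were saved to a local ./jars directory).
spark-submit \
  --jars "jars/poi-4.1.2.jar,jars/poi-ooxml-4.1.2.jar,jars/poi-ooxml-schemas-4.1.2.jar,jars/xmlbeans-3.1.0.jar,jars/curvesapi-1.06.jar,jars/commons-collections4-4.4.jar,jars/commons-compress-1.20.jar,jars/spark-excel_2.12-0.13.5.jar" \
  your_job.py

# Alternatively, let Spark resolve the connector and its transitive
# dependencies from Maven Central:
spark-submit --packages com.crealytics:spark-excel_2.12:0.13.5 your_job.py
```

On managed platforms (Databricks, OCI Data Flow), the equivalent is attaching the JARs or Maven coordinates through the cluster's library configuration UI.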

### Example

```python
excel_path = "/Volumes/test_data.xlsx"

df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(excel_path)

df.show()
```
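The connector can also read a specific sheet or cell range through its `dataAddress` option, which takes Excel-style addresses. A minimal sketch (the sheet name and range are hypothetical; the Spark calls are shown as comments because they need a live session with the connector on the classpath):

```python
# Hypothetical reader options for com.crealytics.spark.excel.
# dataAddress uses Excel-style addresses: "'Sheet2'!A1" means
# "start reading at cell A1 of the sheet named Sheet2".
excel_options = {
    "header": "true",
    "inferSchema": "true",
    "dataAddress": "'Sheet2'!A1",
}

# Applied in a live Spark session as:
# reader = spark.read.format("com.crealytics.spark.excel")
# for key, value in excel_options.items():
#     reader = reader.option(key, value)
# df = reader.load(excel_path)
print(sorted(excel_options))  # → ['dataAddress', 'header', 'inferSchema']
```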
---

## 2. Excel to Spark Using Pandas

This approach **reads Excel files with Pandas**, optionally converts them to **CSV**, and then **loads the result into Spark** for further processing.
It is well suited to lightweight pre-processing workflows before ingesting data into Spark.

### Requirements

Install the required dependencies via `requirements.txt`:
- `pandas`
- `openpyxl` (engine for `.xlsx` files)
- `xlrd` (engine for legacy `.xls` files)

### Example

```python
import pandas as pd

# Path to Excel file
excel_path = "/Volumes/test_data.xlsx"

# Read Excel file using Pandas
df = pd.read_excel(excel_path)

# Convert to CSV if needed
csv_path = "/Volumes/test_data.csv"
df.to_csv(csv_path, index=False)

print(df.head())

# Load the CSV back into Spark
spark_df = spark.read.csv(csv_path, header=True, inferSchema=True)
spark_df.show()

```
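Pandas can also target individual sheets. A small self-contained sketch (the file name and sample data are made up for illustration; writing `.xlsx` requires `openpyxl`, already listed in `requirements.txt`):

```python
import pandas as pd

# Create a tiny workbook so the example runs end-to-end
# (the file name is illustrative).
path = "sample_data.xlsx"
pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]}).to_excel(
    path, sheet_name="Sheet1", index=False
)

# sheet_name selects one sheet by name (or zero-based index);
# sheet_name=None returns a dict of {sheet name: DataFrame}.
df = pd.read_excel(path, sheet_name="Sheet1")
print(df.shape)  # → (3, 2)
```

The resulting DataFrame can then be converted to CSV and loaded into Spark exactly as in the example above.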
50 changes: 50 additions & 0 deletions Data_Ingestion/Read_excel_data/read_excel.ipynb
@@ -0,0 +1,50 @@
{
"cells": [
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"## Reading with the spark-excel (Crealytics) connector\n",
"excel_path = \"/Volumes/test_data.xlsx\"\n",
"\n",
"## You must add the JARs to your cluster classpath as per README.md\n",
"df = spark.read.format(\"com.crealytics.spark.excel\") \\\n",
" .option(\"header\", \"true\") \\\n",
" .option(\"inferSchema\", \"true\") \\\n",
" .load(excel_path)\n",
"\n",
"df.show()"
],
"id": "4d1c762a078b6ac2"
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"## Using Pandas to convert the Excel file to CSV, then reading it in Spark\n",
"import pandas as pd\n",
"\n",
"excel_path = \"/Volumes/test_data.xlsx\"\n",
"df = pd.read_excel(excel_path)\n",
"\n",
"# Convert to CSV if needed\n",
"csv_path = \"/Volumes/test_data.csv\"\n",
"df.to_csv(csv_path, index=False)\n",
"\n",
"print(df.head())\n",
"\n",
"# Load CSV back into Spark\n",
"spark_df = spark.read.csv(csv_path, header=True, inferSchema=True)\n",
"spark_df.show()\n"
],
"id": "3d929687c9b1c44a"
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 3 additions & 0 deletions Data_Ingestion/Read_excel_data/requirements.txt
@@ -0,0 +1,3 @@
pandas
openpyxl
xlrd