
Commit be8c987

Add fenic integration documentation

1 parent adfba8b commit be8c987

File tree

3 files changed: +190 -0 lines changed

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -235,6 +235,8 @@
       title: Perform vector similarity search
     - local: datasets-embedding-atlas
       title: Embedding Atlas
+    - local: datasets-fenic
+      title: fenic
     - local: datasets-fiftyone
       title: FiftyOne
     - local: datasets-pandas

docs/hub/datasets-fenic.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
# fenic

[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/>
</div>

## Getting Started

To get started, pip install `fenic`:

```bash
pip install fenic
```
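
The read examples in this guide assume an existing `session` object. A minimal sketch of creating one, assuming fenic's `SessionConfig` and `Session.get_or_create` API (the `app_name` value is an arbitrary placeholder):

```python
import fenic as fc

# Sketch: create (or reuse) a local fenic session.
# "hf_hub_demo" is just a label for this pipeline.
config = fc.SessionConfig(app_name="hf_hub_demo")
session = fc.Session.get_or_create(config)
```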

## Overview

fenic is an opinionated data processing framework that combines:

- **DataFrame API**: PySpark-inspired operations for familiar data manipulation
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google)
- **Query Optimization**: Automatic optimization through logical plan transformations
## Read from Hugging Face Hub

fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface.

### Supported Formats

fenic supports reading the following formats from Hugging Face:

- **Parquet files** (`.parquet`)
- **CSV files** (`.csv`)

### Reading Datasets

To read a dataset from the Hugging Face Hub:
```python
import fenic as fc

# Uses the `session` created in Getting Started.
# Read a CSV file from a public dataset
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")

# Read Parquet files using glob patterns
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Read from a specific dataset revision (here, the auto-converted Parquet branch)
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")
```

### Reading with Schema Management

When reading multiple files whose columns differ, pass `merge_schemas=True` to combine them into a single schema:

```python
# Read multiple CSV files with schema merging
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True)

# Read multiple Parquet files with schema merging
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True)
```

### Authentication

To read private datasets, set your Hugging Face token (generated in your [account settings](https://huggingface.co/settings/tokens)) as an environment variable:

```python
import os

os.environ["HF_TOKEN"] = "your_hugging_face_token_here"
```

### Path Format

The Hugging Face path format in fenic follows this structure:

```
hf://{repo_type}/{repo_id}/{path_to_file}
```

You can also pin a dataset revision or version:

```
hf://{repo_type}/{repo_id}@{revision}/{path_to_file}
```

Features (see the sketch after this list):

- Glob patterns (`*`, `**`) are supported
- Dataset revisions/versions use `@` notation:
  - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e`
  - Branch: `@refs/convert/parquet`
  - Branch alias: `@~parquet`
- Private datasets require the `HF_TOKEN` environment variable
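
A small sketch of the revision notation, reusing the example dataset from above; `@~parquet` is an alias for the auto-converted `refs/convert/parquet` branch:

```python
# These two reads address the same files: `~parquet` is shorthand
# for the `refs/convert/parquet` branch
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@refs/convert/parquet/**/*.parquet")
```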

### Mixing Data Sources

fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols:

```python
# Mix HF and local files in one read call
df = session.read.parquet([
    "hf://datasets/cais/mmlu/astronomy/*.parquet",
    "file:///local/data/*.parquet",
    "./relative/path/data.parquet"
])
```

This flexibility lets you seamlessly combine data from the Hugging Face Hub and local files in a single data processing pipeline.

## Processing Data from Hugging Face

Once loaded from Hugging Face, you can use fenic's full DataFrame API:

### Basic DataFrame Operations
```python
import fenic as fc

# Load the IMDB dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet")

# Filter and select
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label")

# Group by and aggregate
label_counts = df.group_by("label").agg(
    fc.count("*").alias("count")
)
```
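
fenic builds a logical plan lazily, so nothing executes until you request results. A short usage sketch, assuming fenic's `show()` (used elsewhere in this guide) and a `to_pandas()` conversion helper, which is an assumption here:

```python
# Trigger execution and print a sample of the result
label_counts.show()

# Assumed conversion helper for downstream use in pandas
pdf = label_counts.to_pandas()
```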

### AI-Powered Operations

To use semantic and embedding operations, configure language and embedding models in your `SessionConfig`.
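
A minimal sketch of such a configuration, assuming fenic's `SemanticConfig` with OpenAI model classes; the model aliases, model names, and rate limits below are placeholder assumptions:

```python
import fenic as fc

# Sketch: register one language model and one embedding model.
# Aliases ("mini", "small") and rpm/tpm budgets are placeholders.
config = fc.SessionConfig(
    app_name="hf_semantic_demo",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAILanguageModel(
                model_name="gpt-4o-mini", rpm=100, tpm=100_000
            ),
        },
        default_language_model="mini",
        embedding_models={
            "small": fc.OpenAIEmbeddingModel(
                model_name="text-embedding-3-small", rpm=100, tpm=100_000
            ),
        },
        default_embedding_model="small",
    ),
)
session = fc.Session.get_or_create(config)
```

Once configured, semantic functions can be applied to columns loaded from the Hub: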
```python
import fenic as fc

# Load a text dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")

# Add embeddings to a text column
df_with_embeddings = df.with_column(
    "embedding",
    fc.semantic.embed(fc.col("text"))
)

# Apply a semantic map for sentiment analysis
df_analyzed = df.with_column(
    "sentiment_score",
    fc.semantic.map("Rate the sentiment from 1-10: {text}")
)
```

## Example: Analyzing the MMLU Dataset

```python
import fenic as fc
import os

# Set the HF token if accessing private datasets
os.environ["HF_TOKEN"] = "your_token_here"

# Load the MMLU astronomy subset from Hugging Face
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Process the data
processed_df = (df
    # Filter for specific criteria
    .filter(fc.col("subject") == "astronomy")
    # Select relevant columns
    .select("question", "choices", "answer")
    # Add difficulty analysis (requires semantic configuration)
    .with_column("difficulty",
                 fc.semantic.map("Rate difficulty 1-5: {question}"))
)

# Show results
processed_df.show()
```

## Limitations

- **Writing to the Hugging Face Hub**: Not currently supported; fenic can only read from the Hub.
- **Supported read formats**: Limited to CSV and Parquet when reading from the Hub.
- **Semantic operations**: Require language/embedding models to be configured in `SessionConfig`.

## Resources

- [fenic GitHub Repository](https://github.com/typedef-ai/fenic)
- [fenic Documentation](https://docs.fenic.ai/latest/)

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ The table below summarizes the supported libraries and their level of integratio
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. |||
 | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
 | [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. |||
+| [fenic](./datasets-fenic) | PySpark-inspired DataFrame framework for building production AI and agentic applications. |||
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
