Commit 354ef12

add spark sql example (#1510)
1 parent b23ab64 commit 354ef12

1 file changed: +35 -0 lines changed

docs/hub/datasets-spark.md

Lines changed: 35 additions & 0 deletions
@@ -238,6 +238,41 @@ To filter the dataset and only keep dialogues in Chinese:
+---+----------------------------+-----+----------+----------+
```

### Run SQL queries

Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`:

```python
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("demo").getOrCreate()
>>> df = read_parquet("hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet", columns=["source"])
>>> spark.sql("SELECT source, count(*) AS total FROM {df} GROUP BY source ORDER BY total DESC", df=df).show()
+--------------------+-------+
|              source|  total|
+--------------------+-------+
|                flan|2435840|
|          Subjective|1342427|
|      OpenHermes-2.5| 855478|
|            MetaMath| 690138|
|      code_exercises| 590958|
|Orca-math-word-pr...| 398168|
|          code_bagel| 386649|
|        MathInstruct| 329254|
|python-code-datas...|  88632|
|instructional_cod...|  82920|
|        CodeFeedback|  79513|
|self-oss-instruct...|  50467|
|Evol-Instruct-Cod...|  43354|
|CodeExercise-Pyth...|  27159|
|code_instructions...|  23130|
|  Code-Instruct-700k|  10860|
|Glaive-code-assis...|   9281|
|python_code_instr...|   2581|
|Python-Code-23k-S...|   2297|
+--------------------+-------+
```
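
If the `spark.sql` in your PySpark version does not accept a DataFrame as a keyword argument, registering the DataFrame as a temporary view is a common alternative. The following is a minimal sketch, not part of the original example: the helper `count_by_source` and the view name `infinity_instruct` are hypothetical names chosen for illustration, and `spark` and `df` are assumed to be the session and DataFrame created above.

```python
def count_by_source(spark, df, view_name="infinity_instruct"):
    """Count rows per `source` by querying a temporary view.

    `view_name` is a hypothetical name used only in this sketch.
    """
    # Register the DataFrame under a name that plain SQL text can reference.
    df.createOrReplaceTempView(view_name)
    query = (
        f"SELECT source, count(*) AS total FROM {view_name} "
        "GROUP BY source ORDER BY total DESC"
    )
    # Returns a new DataFrame; call .show() on it to print the result.
    return spark.sql(query)
```

Calling `count_by_source(spark, df).show()` should print the same kind of table as the query above. The view is scoped to the current Spark session, so it does not persist beyond it.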

## Write

We also provide a helper function to write datasets in a distributed manner to a Hugging Face repository.
