@@ -238,6 +238,41 @@ To filter the dataset and only keep dialogues in Chinese:
 +---+----------------------------+-----+----------+----------+
 ```
 
+### Run SQL queries
+
+Once your PySpark DataFrame is ready, you can run SQL queries on it using `spark.sql`:
+
+```python
+>>> from pyspark.sql import SparkSession
+>>> spark = SparkSession.builder.appName("demo").getOrCreate()
+>>> df = read_parquet("hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet", columns=["source"])
+>>> spark.sql("SELECT source, count(*) AS total FROM {df} GROUP BY source ORDER BY total DESC", df=df).show()
++--------------------+-------+
+|              source|  total|
++--------------------+-------+
+|                flan|2435840|
+|          Subjective|1342427|
+|      OpenHermes-2.5| 855478|
+|            MetaMath| 690138|
+|      code_exercises| 590958|
+|Orca-math-word-pr...| 398168|
+|          code_bagel| 386649|
+|        MathInstruct| 329254|
+|python-code-datas...|  88632|
+|instructional_cod...|  82920|
+|        CodeFeedback|  79513|
+|self-oss-instruct...|  50467|
+|Evol-Instruct-Cod...|  43354|
+|CodeExercise-Pyth...|  27159|
+|code_instructions...|  23130|
+|  Code-Instruct-700k|  10860|
+|Glaive-code-assis...|   9281|
+|python_code_instr...|   2581|
+|Python-Code-23k-S...|   2297|
++--------------------+-------+
+```
+
+
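For readers who want to try the aggregation without a Spark session, the sketch below runs the same `GROUP BY` / `ORDER BY` query on a toy table. It is only an illustration under stated assumptions: it swaps in Python's built-in `sqlite3` module instead of Spark, uses a hypothetical table named `df`, and the rows are made up rather than taken from the dataset.

```python
import sqlite3

# Toy rows invented for illustration; they are NOT taken from the dataset.
rows = [
    ("flan",), ("flan",), ("flan",),
    ("MetaMath",), ("MetaMath",),
    ("code_bagel",),
]

# In-memory SQLite database standing in for the Spark DataFrame (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (source TEXT)")
conn.executemany("INSERT INTO df (source) VALUES (?)", rows)

# Same shape of query as the Spark example: count rows per source,
# then list the most frequent sources first.
query = "SELECT source, count(*) AS total FROM df GROUP BY source ORDER BY total DESC"
result = conn.execute(query).fetchall()
conn.close()

for source, total in result:
    print(f"{source}: {total}")
```

In the Spark version, the `{df}` placeholder plays the role of the table name here: the DataFrame passed as `df=df` is what the query reads from.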
 ## Write
 
 We also provide a helper function to write datasets in a distributed manner to a Hugging Face repository.