Commit 35d2c11 (1 parent: 34df1df)

Replace old docs with new ones

File tree: 2 files changed, +10 −301 lines

README.md

Lines changed: 10 additions & 298 deletions
# Parquet4S

<img align="right" width="256px" height="256px" src="site/docs/images/features-header.svg"/>

Parquet4S is a simple I/O library for [Parquet](https://parquet.apache.org/). It allows you to easily read and write Parquet files in [Scala](https://www.scala-lang.org/).

Use just a Scala case class to define the schema of your data. There is no need for Avro, Protobuf, Thrift or other data serialisation systems. You can also use generic records if you prefer not to define a case class.

Compatible with files generated by [Apache Spark](https://spark.apache.org/). However, unlike Spark, Parquet4S does not require you to start a cluster to perform I/O operations.

Based on the official [Parquet library](https://github.com/apache/parquet-mr), [Hadoop Client](https://github.com/apache/hadoop) and [Shapeless](https://github.com/milessabin/shapeless).

Integrations are available for [Akka Streams](https://doc.akka.io/docs/akka/current/stream/index.html) and [FS2](https://fs2.io/).

Released for Scala 2.11.x, 2.12.x and 2.13.x. The FS2 integration is available for 2.12.x and 2.13.x only.
## Tutorial

1. [Quick Start](#quick-start)
1. [AWS S3](#aws-s3)
1. [Akka Streams](#akka-streams)
1. [FS2](#fs2)
1. [Before-read filtering or filter pushdown](#before-read-filtering-or-filter-pushdown)
1. [Schema projection](#schema-projection)
1. [Statistics](#statistics)
1. [Supported storage types](#supported-storage-types)
1. [Supported types](#supported-types)
1. [Generic Records](#generic-records)
1. [Customisation and Extensibility](#customisation-and-extensibility)
1. [More Examples](#more-examples)
1. [Contributing](#contributing)
## Quick Start

### SBT

```scala
libraryDependencies ++= Seq(
  "com.github.mjakubowski84" %% "parquet4s-core" % "1.9.4",
  "org.apache.hadoop" % "hadoop-client" % yourHadoopVersion
)
```

### Mill

```scala
def ivyDeps = Agg(
  ivy"com.github.mjakubowski84::parquet4s-core:1.9.4",
  ivy"org.apache.hadoop:hadoop-client:$yourHadoopVersion"
)
```

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, ParquetWriter}

case class User(userId: String, name: String, created: java.sql.Timestamp)

val users: Iterable[User] = Seq(
  User("1", "parquet", new java.sql.Timestamp(1L))
)
val path = "path/to/local/parquet"

// writing
ParquetWriter.writeAndClose(path, users)

// reading
val parquetIterable = ParquetReader.read[User](path)
try {
  parquetIterable.foreach(println)
} finally {
  parquetIterable.close()
}
```
## AWS S3

In order to connect to AWS S3 you need to define one more dependency:

```scala
"org.apache.hadoop" % "hadoop-aws" % yourHadoopVersion
```

Next, the most common way is to define the following environment variables:

```bash
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
```

Please follow the [documentation of Hadoop AWS](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) for more details and troubleshooting.

### Passing Hadoop Configs Programmatically

File system configs for S3, GCS or Hadoop can also be set programmatically on `ParquetReader` and `ParquetWriter` by passing a `Configuration` object to the `ParquetReader.Options` and `ParquetWriter.Options` case classes.
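A minimal sketch of such programmatic configuration. The `fs.s3a.*` property keys are standard `hadoop-aws` settings rather than Parquet4S API, and the credential values are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import com.github.mjakubowski84.parquet4s.{ParquetReader, ParquetWriter}

val conf = new Configuration()
// Standard s3a credential properties from the hadoop-aws module:
conf.set("fs.s3a.access.key", "my.aws.key")
conf.set("fs.s3a.secret.key", "my.secret.key")

// The same Configuration can back both reads and writes.
val readOptions  = ParquetReader.Options(hadoopConf = conf)
val writeOptions = ParquetWriter.Options(hadoopConf = conf)
```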
## Akka Streams

Parquet4S has an integration module that allows you to read and write Parquet files using Akka Streams. Just import:

```scala
"com.github.mjakubowski84" %% "parquet4s-akka" % "1.9.4"
"org.apache.hadoop" % "hadoop-client" % yourHadoopVersion
```

Parquet4S has a single `Source` for reading a single file or a directory, a `Sink` for writing a single file and a sophisticated `Flow` for performing complex writes.

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, ParquetStreams, ParquetWriter}
import org.apache.parquet.hadoop.ParquetFileWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import org.apache.hadoop.conf.Configuration
import scala.concurrent.duration._

case class User(userId: String, name: String, created: java.sql.Timestamp)

implicit val system: ActorSystem = ActorSystem()

val users: Iterable[User] = ???

val conf: Configuration = ??? // Set Hadoop configuration programmatically

// Please check all the available configuration options!
val writeOptions = ParquetWriter.Options(
  writeMode = ParquetFileWriter.Mode.OVERWRITE,
  compressionCodecName = CompressionCodecName.SNAPPY,
  hadoopConf = conf // optional hadoopConf
)

// Writes a single file.
Source(users).runWith(ParquetStreams.toParquetSingleFile(
  path = "file:///data/users/user-303.parquet",
  options = writeOptions
))

// Tailored for writing indefinite streams.
// Writes a file when a chunk reaches the size limit or the defined time period elapses.
// Can also partition files!
// Check all the parameters and example usage in project sources.
Source(users).via(
  ParquetStreams
    .viaParquet[User]("file:///data/users")
    .withMaxCount(writeOptions.rowGroupSize)
    .withMaxDuration(30.seconds)
    .withWriteOptions(writeOptions)
    .build()
).runForeach(user => println(s"Just wrote user ${user.userId}..."))

// Reads a file or files from the path. Please also have a look at the rest of the parameters.
ParquetStreams.fromParquet[User]
  .withOptions(ParquetReader.Options(hadoopConf = conf))
  .read("file:///data/users")
  .runForeach(println)
```
## FS2

The FS2 integration allows you to read and write Parquet using functional streams. The functionality is exactly the same as in the Akka module. In order to use it please import:

```scala
"com.github.mjakubowski84" %% "parquet4s-fs2" % "1.9.4"
"org.apache.hadoop" % "hadoop-client" % yourHadoopVersion
```

Please check the [examples](examples/src/main/scala/com/github/mjakubowski84/parquet4s/fs2) to learn more.
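As a brief orientation, a read with the FS2 module might be sketched as below. The `fromParquet[IO, User].read(blocker, path)` shape follows the 1.x API used elsewhere in this readme, but treat the exact signatures (cats-effect 2 `Blocker`, implicit `ContextShift`) as version-dependent:

```scala
import cats.effect.{Blocker, ContextShift, IO}
import com.github.mjakubowski84.parquet4s.parquet._

case class User(userId: String, name: String)

// Streams all users from the given path; the Blocker isolates blocking I/O.
def readUsers(blocker: Blocker)(implicit cs: ContextShift[IO]): fs2.Stream[IO, User] =
  fromParquet[IO, User].read(blocker, "file:///data/users")
```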
## Before-read filtering or filter pushdown

One of the best features of Parquet is its efficient way of filtering. Parquet files contain additional metadata that can be leveraged to drop chunks of data without scanning them. Parquet4S allows you to define filter predicates in all modules in order to push filtering out of Scala collections and Akka or FS2 streams, down to a point before the file content is even read.
You define your filters using a simple algebra as follows.

In the core library:

```scala
ParquetReader.read[User](path = "file://my/path", filter = Col("email") === "user@email.com")
```

In Akka, the filter applies both to the content of files and to partitions:

```scala
ParquetStreams.fromParquet[Stats]
  .withFilter(Col("stats.score") > 0.9 && Col("stats.score") <= 1.0)
  .read("file://my/path")
```

You can construct filter predicates using the `===`, `!==`, `>`, `>=`, `<`, `<=`, `in` and `udp` operators on columns containing primitive values. You can combine and modify predicates using the `&&`, `||` and `!` operators. `in` looks for values in a list of keys, similar to SQL's `in` operator. For custom filtering by a column of type `T`, implement the `UDP[T]` trait and use the `udp` operator.

Mind that operations on `java.sql.Timestamp` and `java.time.LocalDateTime` are not supported, as Parquet still does not allow filtering by `Int96` out of the box.

Check the ScalaDoc and code for more!
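For illustration, a combined predicate using `in` might look like the following. The column names are made up, and the exact `in` call syntax may vary between versions:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader}

// Adults from a short list of countries, excluding inactive accounts.
val filter = Col("age") >= 18 && Col("country").in("PL", "US") && !(Col("active") === false)
ParquetReader.read[User](path = "file://my/path", filter = filter)
```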
## Schema projection

Schema projection is another way of optimising reads. By default, Parquet4S reads the whole content of each Parquet record, even when you provide a case class that maps only a subset of the columns. Such behaviour is expected because you may want to use [generic records](#generic-records) to process your data. However, you can explicitly tell Parquet4S to use the provided case class (or an implicit `ParquetSchemaResolver`) as an override for the original file schema. In effect, all columns not matching your schema will be skipped and not read. This functionality is available in every module of Parquet4S.

```scala
// core
ParquetReader.withProjection[User].read(path = "file://my/path")

// akka
ParquetStreams.fromParquet[User].withProjection.read("file://my/path")

// fs2
import com.github.mjakubowski84.parquet4s.parquet._
fromParquet[IO, User].projection.read(blocker, "file://my/path")
```
## Statistics

Parquet4S leverages Parquet metadata to efficiently read the record count as well as the max and min values of a column of Parquet files. It provides the correct value for both filtered and unfiltered files. The functionality is available in the core module, either by a direct call to [Stats](core/src/main/scala/com/github/mjakubowski84/parquet4s/Stats.scala) or via the API of `ParquetReader` and `ParquetIterable`.
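A hypothetical sketch of a direct `Stats` call. The member names `recordCount`, `min` and `max` are assumptions here, not confirmed API, so please verify them against the linked `Stats` source:

```scala
import com.github.mjakubowski84.parquet4s.Stats

// Assumed API: Stats built from a path, then queried per column.
val stats = Stats("file://my/path")
val count: Long = stats.recordCount            // assumed member name
val minScore: Option[Double] = stats.min[Double]("score") // assumed signature
val maxScore: Option[Double] = stats.max[Double]("score") // assumed signature
```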
## Supported storage types

As it is based on Hadoop Client, Parquet4S can read and write from a variety of file systems:

- Local files
- HDFS
- Amazon S3
- Google Storage
- Azure
- OpenStack

Please refer to the Hadoop Client documentation or your storage provider to check how to connect to your storage.
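For example, once `hadoop-aws` and credentials are configured as described in the AWS S3 section, reading from S3 is mostly a matter of the path scheme. The bucket name below is illustrative and `User` is the Quick Start case class:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader

// The s3a:// scheme is resolved by the hadoop-aws module.
val users = ParquetReader.read[User]("s3a://my-bucket/users")
try users.foreach(println)
finally users.close()
```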
## Supported types

### Primitive types

| Type                    | Reading and Writing | Filtering |
|:------------------------|:-------------------:|:---------:|
| Int                     | &#x2611;            | &#x2611;  |
| Long                    | &#x2611;            | &#x2611;  |
| Byte                    | &#x2611;            | &#x2611;  |
| Short                   | &#x2611;            | &#x2611;  |
| Boolean                 | &#x2611;            | &#x2611;  |
| Char                    | &#x2611;            | &#x2611;  |
| Float                   | &#x2611;            | &#x2611;  |
| Double                  | &#x2611;            | &#x2611;  |
| BigDecimal              | &#x2611;            | &#x2611;  |
| java.time.LocalDateTime | &#x2611;            | &#x2612;  |
| java.time.LocalDate     | &#x2611;            | &#x2611;  |
| java.sql.Timestamp      | &#x2611;            | &#x2612;  |
| java.sql.Date           | &#x2611;            | &#x2611;  |
| Array[Byte]             | &#x2611;            | &#x2611;  |
244-
### Complex Types
245-
246-
Complex types can be arbitrarily nested.
247-
248-
- Option
249-
- List
250-
- Seq
251-
- Vector
252-
- Set
253-
- Array - Array of bytes is treated as primitive binary
254-
- Map - **Key must be of primitive type**, only **immutable** version.
255-
- **Since 1.2.0**. Any Scala collection that has Scala 2.13 collection Factory (in 2.11 and 2.12 it is derived from CanBuildFrom). Refers to both mutable and immutable collections. Collection must be bounded only by one type of element - because of that Map is supported only in immutable version (for now).
256-
- *Any case class*
257-
## Generic Records

You may not want to use a strict schema and instead process your data in a generic way. Since version 1.2.0, Parquet4S has a rich API that allows you to build, transform, write and read Parquet records in an easy way. Each implementation of `ParquetRecord` is a Scala `Iterable` and a mutable collection. You can execute operations on `RowParquetRecord` and `ListParquetRecord` as on a mutable `Seq`, and you can treat `MapParquetRecord` as a mutable `Map`. Moreover, records have additional functions like `get` and `add` (and more) that take an implicit `ValueCodec` and allow you to read and modify records using regular Scala types. There are default `ParquetRecordEncoder`, `ParquetRecordDecoder` and `ParquetSchemaResolver` instances for `RowParquetRecord`, so reading Parquet in a generic way works out of the box! In order to write, you still need to provide a schema in the form of Parquet's `MessageType`.

Functionality is available in all modules. See the [examples](https://github.com/mjakubowski84/parquet4s/blob/master/examples/src/main/scala/com/github/mjakubowski84/parquet4s/core/WriteAndReadGenericApp.scala).
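A minimal generic-read sketch. The field name `userId` is illustrative, and the exact `get` signature (for example, whether it also needs a `ValueCodecConfiguration`) may differ between versions:

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, RowParquetRecord}

// Reading generically works out of the box: no case class needed.
val records = ParquetReader.read[RowParquetRecord]("file://my/path")
try records.foreach { record =>
  // `get` decodes a single field via an implicit ValueCodec.
  val userId: String = record.get[String]("userId")
  println(userId)
} finally records.close()
```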
## Customisation and Extensibility

Parquet4S is built using Scala's type class system. That allows you to extend Parquet4S by defining your own implementations of its type classes.

For example, you may define codecs for your own types so that they can be **read from or written to** Parquet. Assume that you have your own type:

```scala
case class CustomType(i: Int)
```

You want to save it as an optional `Int`. In order to achieve that, you have to define your own codec:

```scala
import com.github.mjakubowski84.parquet4s.{IntValue, OptionalValueCodec, Value, ValueCodecConfiguration}

implicit val customTypeCodec: OptionalValueCodec[CustomType] =
  new OptionalValueCodec[CustomType] {
    override protected def decodeNonNull(value: Value, configuration: ValueCodecConfiguration): CustomType = value match {
      case IntValue(i) => CustomType(i)
    }
    override protected def encodeNonNull(data: CustomType, configuration: ValueCodecConfiguration): Value =
      IntValue(data.i)
  }
```
Additionally, if you want to write your custom type, you have to define the schema for it:

```scala
import org.apache.parquet.schema.{LogicalTypeAnnotation, PrimitiveType}
import com.github.mjakubowski84.parquet4s.ParquetSchemaResolver.TypedSchemaDef
import com.github.mjakubowski84.parquet4s.SchemaDef

implicit val customTypeSchema: TypedSchemaDef[CustomType] =
  SchemaDef.primitive(
    primitiveType = PrimitiveType.PrimitiveTypeName.INT32,
    required = false,
    originalType = Option(LogicalTypeAnnotation.intType(32, true))
  ).typed[CustomType]
```
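With the codec and schema in scope, the custom type should participate in writes like any built-in type. A brief usage sketch, where the wrapping `Record` class and path are illustrative:

```scala
import com.github.mjakubowski84.parquet4s.ParquetWriter

// The implicit OptionalValueCodec and TypedSchemaDef defined above
// are resolved automatically for the `custom` field.
case class Record(id: Int, custom: CustomType)

ParquetWriter.writeAndClose("file://my/path", Seq(Record(1, CustomType(42))))
```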
In order to filter by a field of a custom type `T`, you have to implement the `FilterCodec[T]` type class. Check the existing implementations for reference.

## More Examples

Please check the [examples](./examples), where you can find simple code covering the basics of the `core`, `akka` and `fs2` modules.

Moreover, the examples contain two simple applications combining Akka Streams or FS2 with Kafka. They show how you can write partitioned Parquet files with data coming from an indefinite stream.

## Documentation

Documentation is available [here](https://mjakubowski84.github.io/parquet4s/).

## Contributing
site/docs/docs/introduction.md

Lines changed: 0 additions & 3 deletions

title: Introduction
permalink: docs/
---

This page is a work in progress. It is dedicated to the latest release candidate version of Parquet4S.
For documentation of the stable 1.x version of the library, please refer to the [Readme](https://github.com/mjakubowski84/Parquet4s).

# Introduction

Parquet4S is a simple I/O library for [Parquet](https://parquet.apache.org/). It allows you to easily read and write Parquet files in [Scala](https://www.scala-lang.org/).
