This repository was archived by the owner on Aug 31, 2021. It is now read-only.

Commit 4abf6ec

Updated README

1 parent 7dc9aef commit 4abf6ec

1 file changed: +19 −13 lines

README.md

Lines changed: 19 additions & 13 deletions
@@ -4,6 +4,12 @@ Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB
 We published a small article about the project, check it out here:
 https://www.audienceproject.com/blog/tech/sparkdynamodb-using-aws-dynamodb-data-source-apache-spark/
 
+## News
+
+UPDATE 2019-11-25: We are releasing version 1.0.0 of the Spark+DynamoDB connector, which is based on the Spark Data Source V2 API.
+Out-of-the-box throughput calculations, parallelism and partition planning should now be more reliable.
+We have also pulled out the external dependency on Guava, which was causing a lot of compatibility issues.
+
 ## Features
 
 - Distributed, parallel scan with lazy evaluation
@@ -15,11 +21,20 @@ https://www.audienceproject.com/blog/tech/sparkdynamodb-using-aws-dynamodb-data-source-apache-spark/
 - Global secondary index support
 - Write support
 
+## Getting The Dependency
+
+The library is available from [Maven Central](https://mvnrepository.com/artifact/com.audienceproject/spark-dynamodb). Add the dependency in SBT as ```"com.audienceproject" %% "spark-dynamodb" % "latest"```
+
+Spark is used in the library as a "provided" dependency, which means Spark has to be installed separately on the container where the application is running, such as is the case on AWS EMR.
+
 ## Quick Start Guide
 
 ### Scala
 ```scala
 import com.audienceproject.spark.dynamodb.implicits._
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder().getOrCreate()
 
 // Load a DataFrame from a Dynamo table. Only incurs the cost of a single scan for schema inference.
 val dynamoDf = spark.read.dynamodb("SomeTableName") // <-- DataFrame of Row objects with inferred schema.
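For reference, the commit names the SBT dependency but not a full build file. A minimal `build.sbt` sketch along the lines of the dependency note above might look as follows; the Scala and Spark version numbers are illustrative assumptions rather than values taken from this commit, and Spark is marked "provided" because the cluster (e.g. AWS EMR) supplies it.

```scala
// Hypothetical minimal build.sbt -- version numbers are assumptions, not taken from the commit.
name := "spark-dynamodb-example"

scalaVersion := "2.11.12" // assumed; use the Scala version your Spark distribution is built for

libraryDependencies ++= Seq(
  // The connector itself, as published on Maven Central.
  "com.audienceproject" %% "spark-dynamodb" % "1.0.0",
  // Spark is "provided": it is installed on the cluster, so it is not bundled with the application.
  "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"
)
```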
@@ -45,15 +60,15 @@ val avgWeightByColor = vegetableDs.agg($"color", avg($"weightKg")) // The column
 ```python
 # Load a DataFrame from a Dynamo table. Only incurs the cost of a single scan for schema inference.
 dynamoDf = spark.read.option("tableName", "SomeTableName") \
-    .format("com.audienceproject.spark.dynamodb") \
+    .format("dynamodb") \
     .load() # <-- DataFrame of Row objects with inferred schema.
 
 # Scan the table for the first 100 items (the order is arbitrary) and print them.
 dynamoDf.show(100)
 
 # write to some other table overwriting existing item with same keys
 dynamoDf.write.option("tableName", "SomeOtherTable") \
-    .format("com.audienceproject.spark.dynamodb") \
+    .format("dynamodb") \
     .save()
 ```
 
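The Python example above switches to the short `dynamodb` format name introduced with the V2 data source. Assuming that name is registered the same way for the generic reader and writer API, the equivalent calls should also work from Scala without the implicits; a sketch under that assumption:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read through the generic data source API, using the short "dynamodb" format name
// (assumed to behave like the implicits-based spark.read.dynamodb call).
val dynamoDf = spark.read
  .option("tableName", "SomeTableName")
  .format("dynamodb")
  .load()

// Write to another table; existing items with the same keys are overwritten.
dynamoDf.write
  .option("tableName", "SomeOtherTable")
  .format("dynamodb")
  .save()
```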
@@ -62,29 +77,20 @@ dynamoDf.write.option("tableName", "SomeOtherTable") \
 pyspark --packages com.audienceproject:spark-dynamodb_<spark-scala-version>:<version>
 ```
 
-
-## Getting The Dependency
-
-The library is available from [Maven Central](https://mvnrepository.com/artifact/com.audienceproject/spark-dynamodb). Add the dependency in SBT as ```"com.audienceproject" %% "spark-dynamodb" % "latest"```
-
-Spark is used in the library as a "provided" dependency, which means Spark has to be installed separately on the container where the application is running, such as is the case on AWS EMR.
-
 ## Parameters
 The following parameters can be set as options on the Spark reader and writer object before loading/saving.
 - `region` sets the region where the DynamoDB table resides. Default is environment specific.
 - `roleArn` sets an IAM role to assume. This allows for access to a DynamoDB in a different account than the Spark cluster. Defaults to the standard role configuration.
 
-
 The following parameters can be set as options on the Spark reader object before loading.
 
-- `readPartitions` number of partitions to split the initial RDD when loading the data into Spark. Corresponds 1-to-1 with total number of segments in the DynamoDB parallel scan used to load the data. Defaults to `sparkContext.defaultParallelism`
+- `readPartitions` number of partitions to split the initial RDD when loading the data into Spark. Defaults to the size of the DynamoDB table divided into chunks of `maxPartitionBytes`
+- `maxPartitionBytes` the maximum size of a single input partition. Default 128 MB
 - `targetCapacity` fraction of provisioned read capacity on the table (or index) to consume for reading. Default 1 (i.e. 100% capacity).
 - `stronglyConsistentReads` whether or not to use strongly consistent reads. Default false.
 - `bytesPerRCU` number of bytes that can be read per second with a single Read Capacity Unit. Default 4000 (4 KB). This value is multiplied by two when `stronglyConsistentReads=false`
 - `filterPushdown` whether or not to use filter pushdown to DynamoDB on scan requests. Default true.
 - `throughput` the desired read throughput to use. It overwrites any calculation used by the package. It is intended to be used with tables that are on-demand. Defaults to 100 for on-demand.
-- `itemCount` the number of items in the table. This overrides requesting it from the table itself.
-- `tableSize` the number of bytes in the table. This overrides requesting it from the table itself.
 
 The following parameters can be set as options on the Spark writer object before saving.
 
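The reader parameters above are ordinary reader options. As a sketch only, they would be set like this before calling `.load()`; the table name, region and numeric values are illustrative assumptions, not defaults or recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Illustrative values only -- tune them to your own table and capacity.
val dynamoDf = spark.read
  .option("tableName", "SomeTableName")
  .option("region", "eu-west-1")               // region of the DynamoDB table
  .option("readPartitions", "32")              // override the computed partition count
  .option("targetCapacity", "0.5")             // consume at most 50% of provisioned read capacity
  .option("stronglyConsistentReads", "true")   // trade read throughput for strong consistency
  .option("filterPushdown", "false")           // disable filter pushdown to DynamoDB
  .format("dynamodb")
  .load()
```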