This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

Spark-Solr is a Lucidworks connector library that enables seamless integration between Apache Spark and Apache Solr. It provides tools for reading data from Solr as Spark DataFrames/RDDs and for indexing objects from Spark into Solr using SolrJ.

- Current version: 4.0.4-SNAPSHOT
- Compatible with: Spark 3.1.2, Solr 8.11.0, Scala 2.12.12, Java 8
## Build Commands

```bash
mvn clean package -DskipTests
```

Produces two main artifacts:

- `target/spark-solr-4.0.4-SNAPSHOT.jar` - core library for embedding
- `target/spark-solr-4.0.4-SNAPSHOT-shaded.jar` - standalone jar for spark-submit
## Test Commands

```bash
# Full build with tests
mvn clean package

# Java tests (JUnit)
mvn surefire:test

# Scala tests (ScalaTest)
mvn scalatest:test

# All tests
mvn test

# Single Java test class
mvn test -Dtest=SolrRelationTest

# Single Scala test suite
mvn test -Dsuites=com.lucidworks.spark.TestSolrRelation

# Code coverage
mvn clean package -Pcoverage

# Release build
mvn clean package -Prelease
```

## Core Components

- Entry point for Spark SQL integration via the DataSource API v1
- Registers the `solr` format and provides `RelationProvider` interfaces
- Core implementation of Spark's BaseRelation interface
- Handles schema inference, query optimization, and push-down filters
- Supports multiple Solr query handlers: `/select`, `/export`, `/stream`, `/sql`
- `SolrRDD`: abstract base class for Solr RDDs
- `SelectSolrRDD`: standard Solr queries via the `/select` handler
- `StreamingSolrRDD`: streaming queries via the `/export`, `/stream`, and `/sql` handlers
- Centralized configuration management for all Solr connection parameters
- Supports both programmatic and environment-based configuration
## Usage Examples

Scala DataFrame API:

```scala
val df = spark.read.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "myCollection")
  .load()
```

Scala RDD API:

```scala
import com.lucidworks.spark.rdd.SelectSolrRDD
val solrRDD = new SelectSolrRDD(zkHost, collection, sc)
```

Java RDD API:

```java
import com.lucidworks.spark.rdd.SolrJavaRDD;
SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, jsc.sc());
```

## Performance Features

- **Data Locality**: Co-locates Spark partitions with Solr shards when possible
- **Intelligent Handler Selection**: Automatically chooses the optimal query handler based on query characteristics
- **Intra-shard Splitting**: Parallelizes reading within shards via the `split_field` option
- **Streaming Export**: 8-10x faster than cursor-based paging when the requested fields have docValues enabled (`request_handler="/export"`)
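These performance features are enabled purely through read options. Below is a minimal, non-authoritative sketch of an options map that turns on streaming export and intra-shard splitting; the `zkhost` value, collection name, and split count are illustrative placeholders, and the commented-out DataFrame call shows where the map would be used.

```java
import java.util.HashMap;
import java.util.Map;

public class ExportReadOptions {

    // Sketch of read options enabling streaming export and intra-shard
    // splitting. Values for zkhost, collection, and splits_per_shard are
    // placeholders, not recommendations.
    public static Map<String, String> readOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("zkhost", "localhost:9983");      // placeholder ZooKeeper address
        opts.put("collection", "myCollection");    // placeholder collection name
        opts.put("request_handler", "/export");    // streaming export; requires docValues on requested fields
        opts.put("splits", "true");                // enable intra-shard splitting
        opts.put("split_field", "_version_");      // the default split field
        opts.put("splits_per_shard", "4");         // illustrative split count
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(readOptions());
    }
}

// Usage with the DataFrame API (requires a running Solr cluster):
// Dataset<Row> df = spark.read().format("solr").options(ExportReadOptions.readOptions()).load();
```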
## Testing

- Unit tests using the JUnit framework
- Integration tests with embedded Solr cluster
- Performance and ML integration tests
- Functional tests using ScalaTest framework
- Comprehensive test suites for core functionality
- `SparkSolrFunSuite` provides common test infrastructure
- Sample datasets including NYC taxi data, MovieLens, Twitter data
- Solr configuration files and schemas
- Embedded Solr cluster configuration
## Development Guidelines

- **Schema Changes**: When modifying data structures, ensure compatibility with Solr's Schema API
- **Handler Support**: New query features should support both the `/select` and `/export` handlers where applicable
- **Performance Testing**: Use `TestShardSplits` and `TestPartitionByTimeQuerySupport` for performance validation
- **Authentication**: Test with both Kerberos and Basic Auth configurations
## Configuration Options

- `zkhost`: ZooKeeper connection string (required)
- `collection`: Solr collection name (required)
- `query`: Solr query string (default: `*:*`)
- `fields`: Comma-separated list of fields to return
- `request_handler`: `/select` (default) or `/export` for streaming
- `rows`: Page size for requests (default: 1000)
- `splits`: Enable intra-shard splitting (default: false)
- `split_field`: Field used for splitting (default: `_version_`)
- `splits_per_shard`: Number of splits per shard (default: 1)
- `batch_size`: Documents per indexing batch (default: 500)
- `sample_seed`: Random sampling with the specified seed
- `partition_by`: Time-series partitioning support
- `gen_uniq_key`: Auto-generate unique keys for documents
- `solr_field_types`: Specify Solr field types for new fields
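The indexing-side options (`batch_size`, `gen_uniq_key`) are passed the same way as read options. The sketch below is a hypothetical options map for a write path; the `zkhost` value and collection name are placeholders, and the commented-out write call is illustrative rather than a verified invocation.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexWriteOptions {

    // Sketch of write (indexing) options built from the configuration
    // list above. zkhost and collection are placeholder values.
    public static Map<String, String> writeOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("zkhost", "localhost:9983");   // placeholder ZooKeeper address
        opts.put("collection", "myCollection"); // placeholder collection name
        opts.put("batch_size", "500");          // documents per indexing batch (the default)
        opts.put("gen_uniq_key", "true");       // auto-generate unique keys for documents
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(writeOptions());
    }
}

// Usage (requires a running Solr cluster and a DataFrame `df`):
// df.write().format("solr").options(IndexWriteOptions.writeOptions()).save();
```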
## Examples

The `src/main/scala/com/lucidworks/spark/example/` directory contains comprehensive examples:
- Basic read/write operations
- ML pipeline integration
- Streaming applications (Twitter, document filtering)
- Analytics using Solr streaming expressions
- Time-series data processing
## Dependency Shading

The project uses the Maven Shade plugin to relocate common dependencies and avoid classpath conflicts:

- Jackson → `shaded.fasterxml.jackson`
- Apache HTTP Client → `shaded.apache.http`
- Guava → `shaded.google.guava`
- Joda Time → `shaded.joda.time`