Open
Description
I have a file with 8M records, and I'm trying to split it into words and do a word count. Here's my code. When I run it, I see 4 new Ruby processes start up on my machine, but only one of them goes to 100% CPU; the others just sit idle. I don't think it's parallelizing properly. Am I missing a configuration setting somewhere?
require 'ruby-spark'
Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 2048
end
Spark.start
sc = Spark.sc
tfile = sc.text_file('work/Contact.csv')
words = tfile.flat_map('lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ")}')
words.count
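
For reference, here is a minimal plain-Ruby sketch of what the tokenizing lambda does, run without Spark so it can be checked in isolation. The sample rows are hypothetical (the real contents of Contact.csv are not shown here), and the per-word tally at the end is only an illustration of a word count; it is not part of the code above, which only counts the total number of words.

```ruby
# Same tokenizer as in the flat_map call, written as a plain Ruby lambda.
tokenize = lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ") }

# Hypothetical sample rows standing in for lines of Contact.csv (assumption).
sample_lines = [
  "John,Smith,john.smith@example.com",
  "Jane,Doe,JANE@EXAMPLE.COM"
]

# Array#flat_map mirrors what the RDD flat_map does to each line:
# every line becomes a list of words, and the lists are concatenated.
words = sample_lines.flat_map(&tokenize)

# Total number of words, the local analogue of calling count on the words RDD.
total = words.size

# Per-word tally (illustration only).
word_counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

puts total
puts word_counts.inspect
```

Since the tokenizer behaves the same on every line, the uneven CPU usage is more likely related to how the input is split across workers than to the lambda itself, though that is a guess rather than something confirmed here.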