Zip RDD Workaround

Combining an RDD of labels with an RDD of split raw text using `zip()` in the MLlib adaptations fails with:

> error: Cannot deserialize RDD with different number of items in pair: (89, 86)

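Both RDDs have 2 partitions and 379 items. `zip()` assumes the same number of elements in each partition, not only the same total, so matching totals do not guarantee that it succeeds. The current workaround below pairs the two RDDs by element index instead; it is less efficient because the join requires a shuffle.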

`# Combine using zip workaround: key each element by its index, then join on that index`
`temp_labels = labels.zipWithIndex().map(lambda x: (x[1], x[0]))`
`temp_tfidf = tfidf.zipWithIndex().map(lambda x: (x[1], x[0]))`
`# sortByKey() restores the original order, which the join's shuffle does not preserve`
`training = temp_labels.leftOuterJoin(temp_tfidf).sortByKey()`
`# values() drops the index and leaves (label, tfidf) pairs`
`raw_label_and_values = training.values()`
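
The idea behind the workaround can be modelled without Spark. The sketch below is a plain-Python illustration, not PySpark code: nested lists stand in for partitions, and `zip_with_index` and `index_join` are hypothetical helper names chosen for this example. It only aims to show that pairing by a global index does not depend on how each side happens to be split up, which is the property the workaround relies on.

```python
from itertools import chain


def zip_with_index(partitions):
    """Model of zipWithIndex(): number elements globally, in partition order."""
    return [(item, i) for i, item in enumerate(chain.from_iterable(partitions))]


def index_join(left_partitions, right_partitions):
    """Pair elements by global position, however each side is partitioned."""
    # Swap to (index, item), as the map(lambda x: (x[1], x[0])) step does.
    left = {i: item for item, i in zip_with_index(left_partitions)}
    right = {i: item for item, i in zip_with_index(right_partitions)}
    # Left outer join on the index: a missing right-hand element becomes None.
    # Iterating over the sorted keys restores the original order.
    return [(left[i], right.get(i)) for i in sorted(left)]


# Same five elements in total, but split differently across two "partitions".
labels = [[0, 1, 0], [1, 1]]
tfidf = [["a", "b"], ["c", "d", "e"]]

pairs = index_join(labels, tfidf)
print(pairs)
```

In this model the partition boundaries differ between the two inputs, yet every label is still paired with the feature at the same overall position. The cost is that each side has to be indexed and matched by key rather than simply walked in lockstep.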

