Description
Symptom
When multi-threading is enabled in the geotagger module, for example by running
java -cp noah-assembly-1.0-SNAPSHOT.jar \
edu.uci.ics.cloudberry.noah.TwitterJSONTagToADM \
-state /mnt/disk/data/twitter/web/public/data/state.json \
-county /mnt/disk/data/twitter/web/public/data/county.json \
-city /mnt/disk/data/twitter/web/public/data/city.json \
-thread 32
AsterixDB reports a large number of parsing errors complaining ... expecting a rectangle type for the attribute ....
Suspected cause
PR #727 ("Add Elasticsearch ingestion pipeline") introduced a knob, var file = "ADM" // By default, generate ADM file. Inside the main function tagOneTweet(...), the code checks whether the variable file is ADM (i.e., file.equals("ADM")). My guess is that this check is not thread-safe: with multiple threads it behaves nondeterministically and sometimes falls through to the other branch, which outputs JSON-formatted tweets. That would explain why AsterixDB then fails to parse some records as ADM.
Current workaround
Before assembling the noah project, use git to check out the commit that precedes PR #727:
cd cloudberry/examples/twittermap
git checkout 2455b69d70a45f50b55492304138e16af9125e94
sbt "project noah" assembly
If you see errors about duplicate library files, the merge strategy is not appropriate. Modify examples/twittermap/project/commons.scala as follows:
...
case x => MergeStrategy.first
// val oldStrategy = (assemblyMergeStrategy in assembly).value
// oldStrategy(x)
}
Next step
Make the knob var file = "ADM" // By default, generate ADM file. thread-safe, then test the fix by ingesting a large number of tweets into a clean AsterixDB.
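One possible direction, sketched in Java rather than in noah's own Scala: replace the shared mutable string with an immutable value that is passed explicitly to every worker, so no thread can observe a different format mid-run. Everything below is hypothetical and for illustration only. OutputFormat, tagAll, and this simplified tagOneTweet signature are assumptions, not the actual noah code, and whether the knob is really the cause of the errors is still unconfirmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OutputFormatSketch {
    // Hypothetical replacement for the mutable string knob: a fixed set of values.
    enum OutputFormat { ADM, JSON }

    // Hypothetical stand-in for tagOneTweet(...). The format is an explicit,
    // immutable argument instead of a shared variable read inside the method.
    static String tagOneTweet(String tweet, OutputFormat format) {
        return format == OutputFormat.ADM ? "ADM:" + tweet : "JSON:" + tweet;
    }

    // Tags every tweet on a fixed-size pool; each task captures the same
    // immutable format value, so all threads take the same branch.
    static List<String> tagAll(List<String> tweets, OutputFormat format, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String tweet : tweets) {
                Callable<String> task = () -> tagOneTweet(tweet, format);
                futures.add(pool.submit(task));
            }
            List<String> tagged = new ArrayList<>();
            for (Future<String> future : futures) {
                tagged.add(future.get()); // preserves input order
            }
            return tagged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A check in the same spirit as the proposed ingestion test would be to run tagAll with 32 threads over many tweets and confirm that every output record is in the ADM branch's format before loading anything into AsterixDB.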