[noah] New bug introduced by PR #727 "Add Elasticsearch ingestion pipeline" #822

@baiqiushi

Symptom

When multithreading is enabled in the geotagger module, e.g.,

java -cp noah-assembly-1.0-SNAPSHOT.jar \
edu.uci.ics.cloudberry.noah.TwitterJSONTagToADM \
    -state /mnt/disk/data/twitter/web/public/data/state.json \
    -county /mnt/disk/data/twitter/web/public/data/county.json \
    -city /mnt/disk/data/twitter/web/public/data/city.json \
    -thread 32

AsterixDB reports a large number of parsing errors of the form ... expecting a rectangle type for the attribute ....

Suspected cause

PR #727 "Add Elasticsearch ingestion pipeline" introduced a knob, var file = "ADM" // By default, generate ADM file., and the main function tagOneTweet(...) checks whether this variable is ADM (i.e., file.equals("ADM")). My guess is that this check is not thread-safe: with multiple threads it may behave nondeterministically and sometimes take the other branch, which outputs JSON-formatted tweets. That would explain the parsing errors, since AsterixDB expects ADM records.

Current workaround

Before assembling the noah project, use git to check out the commit just before PR #727.

cd cloudberry/examples/twittermap
git checkout 2455b69d70a45f50b55492304138e16af9125e94
sbt "project noah" assembly

If the build fails with errors about duplicate library files, the merge strategy is not appropriate. Change the default case in examples/twittermap/project/commons.scala to the following:

...
case x => MergeStrategy.first
      //  val oldStrategy = (assemblyMergeStrategy in assembly).value
      //  oldStrategy(x)
}

Next step

Make the knob var file = "ADM" // By default, generate ADM file. thread-safe, and test the fix by ingesting a large number of tweets into a clean AsterixDB instance.
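
One possible direction, sketched under assumptions: instead of having worker threads read a shared mutable var, pass the output format into the tagging function as an immutable value. The names below (OutputFormat, render, countMismatches, FormatKnobSketch) are hypothetical stand-ins and are not taken from the noah code. How the real knob is set and read has not been verified, so this only illustrates the pattern and is not a confirmed fix.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical replacement for the string knob: a closed set of formats.
sealed trait OutputFormat
case object ADM extends OutputFormat
case object JSON extends OutputFormat

object FormatKnobSketch {
  // Hypothetical stand-in for tagOneTweet(...): the format is a parameter,
  // so the function never reads shared mutable state.
  def render(tweetId: Long, format: OutputFormat): String = format match {
    case ADM  => s"ADM:$tweetId"
    case JSON => s"JSON:$tweetId"
  }

  // Runs `tweets` render calls on a pool of `threads` workers and counts
  // how many outputs are not in the requested format.
  def countMismatches(format: OutputFormat, threads: Int, tweets: Int): Int = {
    val pool = Executors.newFixedThreadPool(threads)
    val mismatches = new AtomicInteger(0)
    val prefix = render(0L, format).takeWhile(_ != ':')
    (1 to tweets).foreach { i =>
      pool.execute(new Runnable {
        override def run(): Unit =
          if (!render(i.toLong, format).startsWith(prefix)) mismatches.incrementAndGet()
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    mismatches.get()
  }
}
```

A call such as FormatKnobSketch.countMismatches(ADM, 32, 10000) mirrors the -thread 32 setting on a toy workload. The real check remains the ingestion test described above, since this sketch does not reproduce the actual output path.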
