Non-base language events are being dropped in the pipeline #28

@c-w

It looks like we're writing a mix of both English and Spanish terms to Cassandra. For example, suppose ataque is a watchlist term for a Fortis site whose primary language is Spanish, with English translation support.
If ataque is mentioned in a Spanish tweet, we archive that term in Cassandra. We do the same if attack is mentioned in an English tweet. A data sample is listed below. This presents a problem in the Fortis interface, as the services expect content to be aggregated by terms in the base language. We need to enhance the keyword extraction analyzer to normalize this, so that a detected attack is stored as ataque.

      month | ataque |                   |                   |    11 |     Twitter |    EmbVZLA_enEsp |   11_917_648 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
        day | ataque |                   |                   |     6 |     Twitter |  RaicesPeronista |      6_28_20 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
        day | ataque |                   |                   |    13 |     Twitter |          sutpmcu | 13_3670_2592 | 2017-10-03 00:00:00.000000+0000 |                     0 |            1
      month | attack |                   |                   |    13 |     Twitter |              all | 13_3670_2581 | 2017-10-01 00:00:00.000000+0000 |                     0 |            1
       hour | attack |                   |                   |    13 |     Twitter |              all | 13_3670_2581 | 2017-10-03 01:00:00.000000+0000 |                     0 |            1
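
To make the requested normalization concrete, here is a minimal sketch of mapping every detected keyword back to its base-language watchlist term before aggregation. It is an illustration only, not the Fortis analyzer itself: the `WATCHLIST` structure and the `build_reverse_index` and `normalize_keywords` helpers are hypothetical names, and the sketch assumes translations are available as a simple lookup per watchlist term.

```python
from collections import Counter

# Hypothetical watchlist: base-language (Spanish) term -> translations by language code.
WATCHLIST = {
    "ataque": {"en": "attack"},
}


def build_reverse_index(watchlist):
    """Map every known surface form (base term or translation) to its base term."""
    index = {}
    for base_term, translations in watchlist.items():
        index[base_term.lower()] = base_term
        for translated in translations.values():
            index[translated.lower()] = base_term
    return index


def normalize_keywords(keywords, reverse_index):
    """Replace each extracted keyword with its base-language term; skip unknown ones."""
    return [
        reverse_index[keyword.lower()]
        for keyword in keywords
        if keyword.lower() in reverse_index
    ]


reverse_index = build_reverse_index(WATCHLIST)

# Keywords as they might be extracted from one Spanish and two English tweets.
extracted = ["ataque", "attack", "Attack"]
counts = Counter(normalize_keywords(extracted, reverse_index))
```

With this approach, all three mentions are counted under ataque, so the stored aggregates would contain only base-language terms.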

@erikschlegel Just to clarify, you'd expect keywords for which we have a translation to be stored as the English keyword? Specifically, in the example above, you'd want all instances of ataque to be replaced with attack?


Copied from CatalystCode/project-fortis-spark#174
