How to improve performance of parallel queries #3098

vitaliishandra · 2022-06-24T10:58:35Z


We run JanusGraph on Cassandra (CQL).
We have a Gremlin query that creates an edge between two vertices.
The query works well and is very fast (a few milliseconds). However, when we run it in parallel, performance degrades.
The query creates the edge only if it doesn't already exist.
For our test scenario we created one Content vertex and try to create edges between it and multiple User vertices.
So we run the following query in parallel:
    this.G
        .V()
        .Has('uid', userVertexId)
        .Coalesce<Edge>(
            __.OutE(Direct).Where(__.OtherV().Has('uid', contentVertexId)),
            __.AddE(Direct)
                .Property(Priority, relation.Priority)
                .To(__.V().Has('uid', assetVertexId)))
        .Iterate()

When we run it 100 times in parallel, latency degrades from milliseconds to 2-3 seconds or even more.
Query batching (adding multiple edges in one query) also works for us. However, there can be thousands of User vertices, so we will still need some parallelism.
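
One way to combine batching with a capped amount of parallelism is sketched below in Python. It is only an illustration under assumptions: `chunk`, `run_bounded` and the `submit` callback are hypothetical names, and `submit` stands for whatever sends one batched query to the server. The idea is to split the user IDs into batches and allow only a fixed number of batches in flight at once.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


def chunk(ids: Sequence[str], size: int) -> List[List[str]]:
    """Split ids into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(ids[i:i + size]) for i in range(0, len(ids), size)]


async def run_bounded(
    batches: Sequence[List[str]],
    submit: Callable[[List[str]], Awaitable[None]],
    max_in_flight: int,
) -> None:
    """Call submit(batch) for every batch, with at most max_in_flight running."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(batch: List[str]) -> None:
        # The semaphore blocks here once max_in_flight batches are active.
        async with gate:
            await submit(batch)

    await asyncio.gather(*(guarded(b) for b in batches))
```

With, say, 5000 user IDs, a batch size of 100 and `max_in_flight=8`, the server would see at most 8 concurrent requests instead of 5000. Both numbers are placeholders to be tuned against the actual server.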

We have 3 Cassandra instances and 1 JanusGraph server instance. Resource usage (CPU/memory) on both is barely affected during the test; about 90% of CPU and memory remains free.
Our consistency configuration is Quorum/Quorum with replication factor 3. We also tested Any/One and All/One. The result is the same: performance was perhaps somewhat better, but it still degraded as parallelism increased.
Maybe JanusGraph takes some internal lock on the Content vertex, since it is shared by all queries.
We tried different configurations, but nothing helps.
We tried running .profile() on the query during the test. It shows that the query takes 2 ms in total, yet it took a few seconds to complete.
Any ideas what we can try, or how we can investigate this further?

Our latest configuration is as follows:

- name: janusgraph.storage.page-size
  value: '2000'
- name: >-
    janusgraph.storage.buffer-sizestorage.cql.executor-service.core-pool-size
  value: '10240'
- name: janusgraph.storage.cql.executor-service.core-pool-size
  value: '50'
- name: janusgraph.storage.cql.batch-statement-size
  value: '100'
- name: janusgraph.storage.cql.local-max-connections-per-host
  value: '2'
- name: janusgraph.storage.cql.max-requests-per-connection
  value: '5000'
- name: gremlinserver.maxAccumulationBufferComponents
  value: '2048'
- name: gremlinserver.maxChunkSize
  value: '28192'
- name: gremlinserver.maxContentLength
  value: '128192'
- name: gremlinserver.maxWorkQueueSize
  value: '28192'
- name: gremlinserver.threadPoolWorker
  value: '8'
- name: gremlinserver.gremlinPool
  value: '16'
- name: janusgraph.storage.cql.local-datacenter
  value: dc1
- name: JANUS_PROPS_TEMPLATE
  value: cql-es
- name: janusgraph.storage.hostname
  value: jabyss-cassandra-dc1-service
- name: janusgraph.index.search.backend
  value: es
- name: janusgraph.index.search.elasticsearch.http.auth.type
  value: basic
- name: janusgraph.index.search.elasticsearch.http.auth.basic.username
  valueFrom:
    secretKeyRef:
      name: elsearch-credentials
      key: username
- name: janusgraph.index.search.elasticsearch.http.auth.basic.password
  valueFrom:
    secretKeyRef:
      name: elsearch-credentials
      key: password
- name: janusgraph.index.search.hostname
  value: jabyss-master
- name: janusgraph.storage.password
  valueFrom:
    secretKeyRef:
      name: jabyss-cassandra-superuser
      key: password
- name: janusgraph.storage.username
  valueFrom:
    secretKeyRef:
      name: jabyss-cassandra-superuser
      key: username
- name: janusgraph.ids.block-size
  value: '100000'
- name: janusgraph.storage.cql.read-consistency-level
  value: QUORUM
- name: janusgraph.storage.cql.write-consistency-level
  value: QUORUM
- name: janusgraph.storage.cql.replication-factor
  value: '3'
- name: janusgraph.cache.db-cache
  value: 'false'
- name: janusgraph.query.batch
  value: 'true'
- name: janusgraph.query.batch-property-prefetch
  value: 'true'
- name: janusgraph.query.smart-limit
  value: 'true'
- name: janusgraph.storage.batch-loading
  value: 'true'
- name: gremlinserver.writeBufferHighWaterMark
  value: '6553600'
- name: gremlinserver.writeBufferLowWaterMark
  value: '65536'

li-boxuan · 2022-06-24T15:10:54Z

Maintainer

We tried running .profile() on the query during the test. It shows that the query takes 2 ms in total, yet it took a few seconds to complete.

Are you sure the profile() output for every such query shows only a few milliseconds? Also note that the profile() output does not include the network round-trip latency between your client and the JanusGraph server. Is your client in the same datacenter as the JanusGraph server? What does your client code look like; specifically, how do you run your parallel test (just to make sure the problem is not in your client code)?
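
A rough way to check this is to time each request on the client and compare it with the figure from profile(). The Python sketch below is only an illustration: `timed_ms` and `summarize` are hypothetical helper names, and `call` stands for whatever function submits the query and waits for the result. The difference between the two numbers covers everything profile() does not see, such as network round-trips and time spent waiting in client or server queues.

```python
import time
from statistics import median
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


def timed_ms(call: Callable[[], T]) -> Tuple[T, float]:
    """Run call() and return its result with the wall-clock time in ms."""
    start = time.perf_counter()
    result = call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def summarize(samples_ms: List[float], profiled_ms: float) -> Dict[str, float]:
    """Compare client-side latencies with the time reported by profile()."""
    ordered = sorted(samples_ms)
    mid = median(ordered)
    worst = ordered[-1]
    return {
        "median_ms": mid,
        "max_ms": worst,
        # Time not accounted for by the profiled execution itself.
        "median_overhead_ms": mid - profiled_ms,
        "max_overhead_ms": worst - profiled_ms,
    }
```

If the overhead grows with the number of parallel requests while the profiled time stays flat, the slowdown is more likely in queuing or the client than in the traversal itself.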

0 replies

vitaliishandra · 2022-06-24T15:34:18Z

Author

@li-boxuan we ran it in the pod console of the JanusGraph server. However, it looks like we were mistaken. We ran it again under high load and it took 28 seconds. Do you have any idea why it takes so long? Similar queries generated by our app were running in parallel while we performed this test.

9 replies

vitaliishandra Jun 24, 2022
Author

@li-boxuan thanks, we will try it on Monday and let you know the result.

li-boxuan Jun 25, 2022
Maintainer

Sorry I was wrong; in fact, graph.openManagement().get("query.batch") leads to undefined behavior because query.batch is a maskable property. For this particular query, query.batch does not help IIRC. You could of course fetch all edges first and then filter the edges in parallel (this requires client-side multi-threading).

However, based on your use case, I think there is a much better way:
Step 1: v_content = g.V().has('uid', contentVertexId)
Step 2: g.V().has('uid', userVertexId).outE('Direct').where(otherV().is(v_content))

This should hopefully be much faster.

vitaliishandra Jul 1, 2022
Author

Thanks @li-boxuan, it was very helpful and we were able to improve performance a lot.
In most cases, all queries now execute fast (<100 ms). However, there are short periods when all queries become slow or even throw an evaluationTimeout error. During these periods we don't see high CPU/memory load on our JanusGraph/Cassandra instances.
Could you advise on the following:

We have 3 instances of our app, each processing 256 async messages in parallel. Processing each message generates some graph queries. Could this be the root cause of the timeouts?
We are going to add even more instances (at least 20, each processing 256 messages in parallel, so there will be about 5k parallel queries to Gremlin).
Which parameters can we tune to speed up such parallel processing?
We suppose we can adjust the following, but we would like to understand how changing them affects performance (we are not restricted in instance count or size):
Client side: MaxInProcessPerConnection=32, PoolSize=4
Gremlin Server: gremlinserver.threadPoolWorker=8, gremlinserver.gremlinPool=16
JanusGraph instances: 3
Cassandra instances: 3
Consistency: quorum/quorum for read/write
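
A back-of-the-envelope estimate shows why the amount of parallelism alone could explain long waits. The Python sketch below is an approximation under stated assumptions, not a model of Gremlin Server internals: it assumes each query occupies one gremlinPool thread for its whole service time, that a query takes about 100 ms (the upper figure quoted above), and that all requests arrive as a single burst. `drain_time_ms` is a hypothetical helper name.

```python
def drain_time_ms(in_flight: int, workers: int, service_ms: float) -> float:
    """Time until the last request of a burst finishes with `workers` threads."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Requests are served in "waves" of `workers` at a time (ceiling division).
    waves = -(-in_flight // workers)
    return waves * service_ms


# 3 JanusGraph instances x gremlinPool=16 = 48 execution threads (assumed).
current = drain_time_ms(3 * 256, 48, 100.0)    # 3 app instances today
planned = drain_time_ms(20 * 256, 48, 100.0)   # 20 app instances
single = drain_time_ms(20 * 256, 16, 100.0)    # same load on one server
```

Under these assumptions the last request of a burst finishes after about 1.6 s today, about 10.7 s with 20 app instances, and about 32 s if the same load hit a single server. Whether queue time counts toward the evaluation timeout depends on the server version and configuration, so the numbers only indicate the scale of the backlog.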

li-boxuan Jul 1, 2022
Maintainer

However there are small periods of time when all queries become slow or even throw evaluationTimeout error

Not sure what you mean by "small periods", but did you check if JVMs experienced full GC at that time? What garbage collector are you using? I personally recommend G1GC.

Please disable query.smart-limit. It is recommended to be turned off. In your use case, you don't need that.

To tune the throughput, I would suggest tuning storage.parallel-backend-executor-service.core-pool-size and storage.cql.executor-service.core-pool-size. If possible, you should also try scaling out JanusGraph instances and see how it helps.

vitaliishandra Jul 27, 2022
Author

Thanks @li-boxuan. It helped to improve our queries and overall performance.
