Skip to content

[Bug]: AvroCoder uses a lot of memory because it keeps instantiating the same DatumReader and DatumWriter #34749

@wollowizard

Description

@wollowizard

What happened?

In a dataflow streaming pipeline I noticed that around 28% of the memory allocations are made by AvroDatumFactory$ReflectDatumFactory.apply which keeps being called with the same schemas as input (see figure)

Image

This is part of the AvroCoder.decode call, which in theory has a mechanism to avoid exactly this situation: it has a ThreadLocal to cache the previously computed BinaryDecoder.

However, it seems that this ThreadLocal is always empty and so the value needs to be recomputed each time. It is unclear to me at the moment if this ThreadLocal is empty because the runner keeps creating new threads, or because the ThreadLocal, which is actually an EmptyOnDeserializationThreadLocal, is always empty when deserialization happens.

Recreating the BinaryDecoder's initial value means that the AvroDatumFactory is called again and again, with the same Schemas as inputs. This seems wasteful, even more considering that the DatumReader and DatumWriter should be reusable and thread-safe.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions