Make LzoTextInputFormat#listStatus thread safe for concurrent call #120
xq262144 wants to merge 1 commit into twitter:master
Thanks @xq262144 for your contribution. While I understand the desire to make these classes thread safe, I don't think there is, in general, any guarantee or expectation that an InputFormat or OutputFormat class should be thread safe. What is the case where you're running into thread-safety issues? Can you not make it work by simply instantiating a new instance for each thread?
|
@sjlee Thank you for your response. I found this thread-safety issue while trying to integrate [...] Then I analyzed the call stack and found that Hive has some sort of input format caching mechanism here: https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L250, so it requires input formats to be thread-safe. To solve this, we could either add an exception for the lzo input formats in Hive or make the lzo input formats thread-safe. I chose to make the lzo input formats thread-safe rather than modify Hive, because this is a rather small code base and simpler to upgrade in a production environment. And since [...]
|
Thanks for the explanation. It is iffy that Hive caches input format instances and lets them be used by multiple concurrent threads. Hadoop-lzo might not be the only input format that has issues with this. Have you tried opening a discussion with the Hive community? While I'm not necessarily against making this change (it seems fairly low risk), I'm more curious about what the Hive community has to say.
DeprecatedLzoTextInputFormat and LzoTextInputFormat are not thread-safe. Use ConcurrentHashMap instead of HashMap.
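
For illustration, here is a minimal sketch of the idea, not the actual patch. It assumes the shared state is a map that gets filled in while files are listed. The class `SharedInputFormatSketch`, its `indexSizes` field, and the simplified `listStatus(List<String>)` signature are hypothetical stand-ins and do not depend on Hadoop. Several threads call `listStatus` on one shared instance, roughly as a cache that hands out the same input format instance would allow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedInputFormatSketch {
    // Hypothetical per-file state filled in during listing. ConcurrentHashMap
    // tolerates concurrent writers; a plain HashMap gives no such guarantee.
    private final Map<String, Long> indexSizes = new ConcurrentHashMap<>();

    // Hypothetical, simplified analogue of listStatus: records one entry per path.
    public int listStatus(List<String> paths) {
        for (String p : paths) {
            indexSizes.put(p, (long) p.length());
        }
        return paths.size();
    }

    public int trackedFiles() {
        return indexSizes.size();
    }

    // Calls listStatus on ONE shared instance from several threads and
    // returns how many distinct paths ended up in the map.
    public static int runConcurrently(int threads, int filesPerThread) {
        SharedInputFormatSketch shared = new SharedInputFormatSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                results.add(pool.submit(() -> {
                    List<String> paths = new ArrayList<>();
                    for (int i = 0; i < filesPerThread; i++) {
                        paths.add("/data/part-" + id + "-" + i + ".lzo");
                    }
                    return shared.listStatus(paths);
                }));
            }
            for (Future<Integer> f : results) {
                f.get(); // wait for each task and surface any failure
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return shared.trackedFiles();
    }

    public static void main(String[] args) {
        // 8 threads x 1000 distinct paths each
        System.out.println(runConcurrently(8, 1000)); // prints 8000
    }
}
```

With a plain `HashMap` in place of `ConcurrentHashMap`, the same run has no defined outcome: entries can be lost or the map's internal structure corrupted. The alternative raised in the discussion, one instance per thread, avoids the shared map entirely.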