-
Notifications
You must be signed in to change notification settings - Fork 721
Use file content heuristics to decide file reader. #1962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
…sed on the magic number.
…ics detection method.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #1962 +/- ##
========================================
Coverage 83.45% 83.45%
========================================
Files 311 311
Lines 55000 55163 +163
Branches 11789 11935 +146
========================================
+ Hits 45899 46036 +137
- Misses 7822 8248 +426
+ Partials 1279 879 -400
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
enum class CaptureFileFormat | ||
{ | ||
Unknown, | ||
Pcap, | ||
PcapNG, | ||
Snoop, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: this can be an enum inside of CaptureFileFormatDetector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose. It has internal linkage so it doesn't really matter.
But then we would end up with really long case names: CaptureFileFormatDetector::FileFormat::Pcap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's fine? It's all internal anyway...
Tests/Pcap++Test/Tests/FileTests.cpp
Outdated
PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader)); | ||
PTF_ASSERT_TRUE(genericReader->open()); | ||
// ------- IFileReaderDevice::createReader() Factory | ||
// TODO: Move to a separate unit test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add the following to get more coverage:
- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3d713ab adds the following tests:
- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcanNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)
Haven't found a snoop file to add. Do we have any?
Open pcap files with different magic numbers
Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.
Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
Pending on #1962 (comment) .
Move it out if it needs to be reused somewhere.
Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.
@Dimi1010 some CI tests fail... |
@seladb I think I found the issue. According to ChatGPT the Winpcap's The NPcap sdk and drivers provide their own wpcap.dll where they do support nanosec precision. The tests we have on nanosecond precision pcap files ("nanosecs.pcap") currently are flawed. We don't use a static base base to compare to. We write a packet with the Due to the above the current nanosecond test in the master branch passes with Winpcap, even though it doesn't actually write / read a nanosecond precision pcap. The new tests use a static copy of Edit: |
Maybe we can remove pcap files that have nanoseconds precision from this PR and replace them with other pcap files to make the tests pass? |
Then, we don't have coverage for the magic number for nanosecond pcap. See #1977. It is a fix for the underlying issue. |
…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp
Nanoseconds are skipped if they aren't supported.
…td compression is unsupported, the format detector will return Unknown.
|
||
// PcapNG backend can support ZstdCompressed Pcap files, so we assume an archive is compressed PcapNG. | ||
// If Zstd is not supported, we cannot read the file anyway, so we return Unknown. | ||
if (isPcapNgFile(content) || (checkZstdSupport() && isZstdArchive(content))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to return Unknown
instead of PcapNG
, if Zst is not supported in the binary.
This will in turn, return nullptr
for reader device. I think that is better than returning a PcapNGDevice
that can't be opened.
Maybe we can also do something similar for PcapReaderDevice's nanosecond format files, if the environment can't read them?
Any thoughts @seladb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that it's the correct behavior 👍
|
||
// PcapNG backend can support ZstdCompressed Pcap files, so we assume an archive is compressed PcapNG. | ||
// If Zstd is not supported, we cannot read the file anyway, so we return Unknown. | ||
if (isPcapNgFile(content) || (checkZstdSupport() && isZstdArchive(content))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that it's the correct behavior 👍
Tests/Pcap++Test/TestDefinition.h
Outdated
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_MicroPrecision); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_NanoPrecision); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG_ZST); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG_ZST_Unsupported); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Invalid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add another test for the snoop format?
Tests/Pcap++Test/TestDefinition.h
Outdated
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_MicroPrecision); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_NanoPrecision); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG_ZST); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_PcapNG_ZST_Unsupported); | ||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Invalid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we shorten the test names to fit into 35 characters? The current output doesn't look good:
TestLoggerMultiThread : PASSED
TestIFileReaderDeviceFactory_Pcap_MicroPrecision: PASSED
TestIFileReaderDeviceFactory_Pcap_NanoPrecision: PASSED
TestIFileReaderDeviceFactory_PcapNG: PASSED
TestIFileReaderDeviceFactory_PcapNG_ZST: PASSED
TestIFileReaderDeviceFactory_Invalid: PASSED
TestPcapFileReadWrite : PASSED
TestPcapFileMicroPrecision : PASSED
TestPcapFileNanoPrecision : PASSED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a reason to stick to 35 chars output? Perhaps we can up it to something like 50~80?
I can reduce the names, but it would hurt the specificity of the test name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked ChatGPT, here is its suggestions:
TestIFileReaderDeviceFactory_Pcap_MicroPrecision
→ TestReaderFactory_Pcap_Micro
TestIFileReaderDeviceFactory_Pcap_NanoPrecision
→ TestReaderFactory_Pcap_Nano
TestIFileReaderDeviceFactory_PcapNG
→ TestReaderFactory_PcapNG
TestIFileReaderDeviceFactory_PcapNG_ZST
→ TestReaderFactory_PcapNG_ZST
TestIFileReaderDeviceFactory_Invalid
→ TestReaderFactory_Invalid
Tests/Pcap++Test/Tests/FileTests.cpp
Outdated
} | ||
}; | ||
|
||
PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_MicroPrecision) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to test a real pcap file, maybe we can add syntethic files that have a different magic number to test all options?
We don't have to put them in PcapExample/file_heuristics
, instead we can create vectors with the content std::vector<uint8_t>
and save them to temp files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what would be the purpose? Just to test that it returns nullptr
?
Doesn't TestIFileReaderDeviceFactory_Invalid
already handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the other possible magic numbers of a valid pcap file. Since it's not easy to find such pcap files, we can generate synthetic files that are not actually valid, but will look valid for the sake of the test
constexpr const char* PCAPNG_ZST_FILE_PATH = "PcapExamples/file_heuristics/pcapng-example.pcapng.zst"; | ||
constexpr const char* PCAPNG_ZSTD_FILE_PATH = "PcapExamples/file_heuristics/pcapng-example.pcapng.zstd"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extension doesn't matter to determine if it's a Zstd file, so we probably don't need both files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I will remove the zstd file. I think the .zst is the more common extension?
Edit: #1962 (comment)
// Garbage data, correct extension | ||
constexpr const char* PCAP_BOGUS_FILE_PATH = "PcapExamples/file_heuristics/bogus-content.pcap"; | ||
constexpr const char* PCAPNG_BOGUS_FILE_PATH = "PcapExamples/file_heuristics/bogus-content.pcapng"; | ||
constexpr const char* PCAPNG_ZST_BOGUS_FILE_PATH = "PcapExamples/file_heuristics/bogus-content.zst"; | ||
|
||
// Garbage data | ||
constexpr const char* BOGUS_FILE_PATH = "PcapExamples/file_heuristics/bogus-content.txt"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto: since the extension doesn't matter, do we need all 4 files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this verifies that the extension doesn't matter? I guess this can be said about #1962 (comment) too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's really needed...
The PR adds heuristics based on the file content that is more robust than deciding based on the file extension.
The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
A new function
createReader
has been added due to changes in the public API of the factory.nullptr
is returned. (previously returnedPcapFileDeviceReader
)std::runtime_error
is thrown. (previously returnedPcapFileDeviceReader
)std::unique_ptr<IFileDeviceReader>
instead ofIFileDeviceReader*
.