diff --git a/.precious.toml b/.precious.toml index 144df8f..5488656 100644 --- a/.precious.toml +++ b/.precious.toml @@ -11,6 +11,15 @@ path-args = "absolute-file" include = "**/*.json" ok_exit_codes = 0 +[commands.prettier-markdown] +type = "both" +cmd = ["prettier", "--prose-wrap", "always"] +lint-flags = ["--check"] +tidy-flags = ["--write"] +path-args = "absolute-file" +include = "**/*.md" +ok_exit_codes = 0 + [commands.golangci-lint-fmt] type = "both" cmd = ["golangci-lint", "fmt"] diff --git a/MaxMind-DB-spec.md b/MaxMind-DB-spec.md index a9f3c3a..66ea2ba 100644 --- a/MaxMind-DB-spec.md +++ b/MaxMind-DB-spec.md @@ -3,6 +3,7 @@ layout: default title: MaxMind DB File Format Specification version: v2.0 --- + # MaxMind DB File Format Specification ## Description @@ -18,40 +19,38 @@ The version number consists of separate major and minor version numbers. It should not be considered a decimal number. In other words, version 2.10 comes after version 2.9. -Code which is capable of reading a given major version of the format should -not be broken by minor version changes to the format. +Code which is capable of reading a given major version of the format should not +be broken by minor version changes to the format. ## Overview The binary database is split into three parts: -1. The binary search tree. Each level of the tree corresponds to a single bit -in the prefix of the network the IP address belongs to. -2. The data section with the values for the networks in the binary search -tree. These values may be comprised of a single data type, e.g., the string -"US" or "New York", or they may be a more complex map or array type made up -of multiple fields. +1. The binary search tree. Each level of the tree corresponds to a single bit in + the prefix of the network the IP address belongs to. +2. The data section with the values for the networks in the binary search tree. 
+ These values can be a single data type, e.g., the string "US" or "New York", + or a more complex map or array type made up of multiple fields. 3. Database metadata. Information about the database itself. ## Database Metadata -This portion of the database is stored at the end of the file. It is -documented first because understanding some of the metadata is key to -understanding how the other sections work. +This portion of the database is stored at the end of the file. It is documented +first because understanding some of the metadata is key to understanding how the +other sections work. This section can be found by looking for a binary sequence matching -"\xab\xcd\xefMaxMind.com". The *last* occurrence of this string in the file +"\xab\xcd\xefMaxMind.com". The _last_ occurrence of this string in the file marks the end of the data section and the beginning of the metadata. Since we allow for arbitrary binary data in the data section, some other piece of data -could contain these values. This is why you need to find the last occurrence -of this sequence. +could contain these values. This is why you need to find the last occurrence of +this sequence. The maximum allowable size for the metadata section, including the marker that starts the metadata, is 128KiB. -The metadata is stored as a separate data section comprised of a map data -structure starting at the beginning of that section. This structure is -described later in the spec. +The metadata is stored as a separate data section containing a map data +structure at its beginning. This structure is described later in the spec. Except where otherwise specified, each key listed is required for the database to be considered valid. @@ -61,70 +60,68 @@ change for this spec. Adding a key constitutes a minor version change. The list of known keys for the current version of the format is as follows: -### node\_count +### node_count -This is an unsigned 32-bit integer indicating the number of nodes in the -search tree. 
+This is an unsigned 32-bit integer indicating the number of nodes in the search +tree. -### record\_size +### record_size -This is an unsigned 16-bit integer. It indicates the number of bits in a -record in the search tree. Note that each node consists of *two* records. +This is an unsigned 16-bit integer. It indicates the number of bits in a record +in the search tree. Note that each node consists of _two_ records. -### ip\_version +### ip_version -This is an unsigned 16-bit integer which is always 4 or 6. It indicates -whether the database contains IPv4 or IPv6 address data. +This is an unsigned 16-bit integer which is always 4 or 6. It indicates whether +the database contains IPv4 or IPv6 address data. -### database\_type +### database_type This is a string that indicates the structure of each data record associated -with an IP address. The actual definition of these structures is left up to -the database creator. +with an IP address. The actual definition of these structures is left up to the +database creator. Names starting with "GeoIP" are reserved for use by MaxMind (and "GeoIP" is a trademark anyway). ### languages -An array of strings, each of which is a locale code. A given record may -contain data items that have been localized to some or all of these -locales. Records should not contain localized data for locales not included in -this array. +An array of strings, each of which is a locale code. A given record may contain +data items that have been localized to some or all of these locales. Records +should not contain localized data for locales not included in this array. This is an optional key, as this may not be relevant for all types of data. -### binary\_format\_major\_version +### binary_format_major_version This is an unsigned 16-bit integer indicating the major version number for the database's binary format. 
-### binary\_format\_minor\_version +### binary_format_minor_version This is an unsigned 16-bit integer indicating the minor version number for the database's binary format. -### build\_epoch +### build_epoch -This is an unsigned 64-bit integer that contains the database build timestamp -as a Unix epoch value. +This is an unsigned 64-bit integer that contains the database build timestamp as +a Unix epoch value. ### description This key will always point to a map. The keys of that map will be language -codes, and the values will be a description in that language as a UTF-8 -string. +codes, and the values will be a description in that language as a UTF-8 string. The codes may include additional information such as script or country identifiers, like "zh-TW" or "mn-Cyrl-MN". The additional identifiers will be separated by a dash character ("-"). -This key is optional. However, creators of databases are strongly -encouraged to include a description in at least one language. +This key is optional. However, creators of databases are strongly encouraged to +include a description in at least one language. ### Calculating the Search Tree Section Size -The formula for calculating the search tree section size *in bytes* is as +The formula for calculating the search tree section size _in bytes_ is as follows: ( ( $record_size * 2 ) / 8 ) * $number_of_nodes @@ -149,8 +146,7 @@ node in the search tree address space. These pointers are followed as part of the IP address search algorithm, described below. The pointer can point to a value equal to `$number_of_nodes`. If this is the -case, it means that the IP address we are searching for is not in the -database. +case, it means that the IP address we are searching for is not in the database. Finally, it may point to an address in the data section. This is the data relevant to the given netblock. @@ -159,12 +155,13 @@ relevant to the given netblock. Each node in the search tree consists of two records, each of which is a pointer. 
The record size varies by database, but inside a single database node -records are always the same size. A record may be anywhere from 24 to 128 bits -long, depending on the number of nodes in the tree. These pointers are -stored in big-endian format (most significant byte first). +records are always the same size. The record size must be a multiple of 4 (so +that nodes are an integral number of bytes) and is at least 24 bits. All +existing databases use record sizes of 24, 28, or 32 bits, but the format +supports larger sizes following the same pattern. These pointers are stored in +big-endian format (most significant byte first). -Here are some examples of how the records are laid out in a node for 24, 28, -and 32 bit records. Larger record sizes follow this same pattern. +Here are the record layouts for 24, 28, and 32 bit records. #### 24 bits (small database), one node is 6 bytes @@ -176,7 +173,7 @@ and 32 bit records. Larger record sizes follow this same pattern. | <------------- node --------------->| | 23 .. 0 | 27..24 | 27..24 | 23 .. 0 | -Note 4 bits of each pointer are combined into the middle byte. For both +Note that 4 bits of each pointer are combined into the middle byte. For both records, they are prepended and end up in the most significant position. #### 32 bits (large database), one node is 8 bytes @@ -187,52 +184,51 @@ records, they are prepended and end up in the most significant position. ### Search Lookup Algorithm The first step is to convert the IP address to its big-endian binary -representation. For an IPv4 address, this becomes 32 bits. For IPv6 you get -128 bits. +representation. For an IPv4 address, this becomes 32 bits. For IPv6 you get 128 +bits. -The leftmost bit corresponds to the first node in the search tree. For each -bit, a value of 0 means we choose the left record in a node, and a value of 1 -means we choose the right record. +The leftmost bit corresponds to the first node in the search tree. 
For each bit, +a value of 0 means we choose the left record in a node, and a value of 1 means +we choose the right record. -The record value is always interpreted as an unsigned integer. The maximum -size of the integer is dependent on the number of bits in a record (24, 28, or -32). +The record value is always interpreted as an unsigned integer. The maximum size +of the integer is dependent on the number of bits in a record. -If the record value is a number that is less than the *number of nodes* (not -in bytes, but the actual node count) in the search tree (this is stored in the -database metadata), then the value is a node number. In this case, we find -that node in the search tree and repeat the lookup algorithm from there. +If the record value is a number that is less than the _number of nodes_ (not in +bytes, but the actual node count) in the search tree (this is stored in the +database metadata), then the value is a node number. In this case, we find that +node in the search tree and repeat the lookup algorithm from there. If the record value is equal to the number of nodes, that means that we do not have any data for the IP address, and the search ends here. -If the record value is *greater* than the number of nodes in the search tree, -then it is an actual pointer value pointing into the data section. The value -of the pointer is relative to the start of the data section, *not* the -start of the file. +If the record value is _greater_ than the number of nodes in the search tree, +then it is an actual pointer value pointing into the data section. The value of +the pointer is relative to the start of the data section, _not_ the start of the +file. In order to determine where in the data section we should start looking, we use the following formula: $data_section_offset = ( $record_value - $node_count ) - 16 -The 16 is the size of the data section separator. We subtract it because we -want to permit pointing to the first byte of the data section. 
Recall that -the record value cannot equal the node count as that means there is no -data. Instead, we choose to start values that go to the data section at -`$node_count + 16`. (This has the side effect that record values -`$node_count + 1` through `$node_count + 15` inclusive are not valid). +The 16 is the size of the data section separator. We subtract it because we want +to permit pointing to the first byte of the data section. Recall that the record +value cannot equal the node count as that means there is no data. Instead, we +choose to start values that go to the data section at `$node_count + 16`. (This +has the side effect that record values `$node_count + 1` through +`$node_count + 15` inclusive are not valid). This is best demonstrated by an example: -Let's assume we have a 24-bit tree with 1,000 nodes. Each node contains 48 -bits, or 6 bytes. The size of the tree is 6,000 bytes. +Let's assume we have a 24-bit tree with 1,000 nodes. Each node contains 48 bits, +or 6 bytes. The size of the tree is 6,000 bytes. -When a record in the tree contains a number that is less than 1,000, this -is a *node number*, and we look up that node. If a record contains a value -greater than or equal to 1,016, we know that it is a data section value. We -subtract the node count (1,000) and then subtract 16 for the data section -separator, giving us the number 0, the first byte of the data section. +When a record in the tree contains a number that is less than 1,000, this is a +_node number_, and we look up that node. If a record contains a value greater +than or equal to 1,016, we know that it is a data section value. We subtract the +node count (1,000) and then subtract 16 for the data section separator, giving +us the number 0, the first byte of the data section. If a record contained the value 6,000, this formula would give us an offset of 4,984 into the data section. 
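The record-value arithmetic in the worked example above (24-bit records, 1,000 nodes) can be checked with a short Python sketch. It only illustrates the rules as stated in this spec text and is not taken from any MaxMind reader library. The names `search_tree_size`, `classify_record`, and `file_offset` are hypothetical, chosen for this example.

```python
from typing import Optional, Tuple

# Size in bytes of the run of NULLs between the search tree and the data section.
DATA_SECTION_SEPARATOR_SIZE = 16


def search_tree_size(record_size: int, node_count: int) -> int:
    """Search tree size in bytes: ((record_size * 2) / 8) * node_count."""
    return (record_size * 2 // 8) * node_count


def classify_record(record_value: int, node_count: int) -> Tuple[str, Optional[int]]:
    """Interpret a record value read from a search tree node.

    Returns ("node", n) for another node number, ("no_data", None) when the
    address is not in the database, or ("data", offset) where offset is
    relative to the start of the data section.
    """
    if record_value < node_count:
        return ("node", record_value)
    if record_value == node_count:
        return ("no_data", None)
    offset = record_value - node_count - DATA_SECTION_SEPARATOR_SIZE
    if offset < 0:
        # Values node_count + 1 through node_count + 15 are not valid.
        raise ValueError(f"invalid record value {record_value}")
    return ("data", offset)


def file_offset(record_value: int, node_count: int, record_size: int) -> int:
    """Offset in the file, using the simplified formula
    (record_value - node_count) + search_tree_size_in_bytes."""
    return (record_value - node_count) + search_tree_size(record_size, node_count)
```

With the example tree, `classify_record(6000, 1000)` yields a data section offset of 4,984, and `file_offset(1016, 1000, 24)` yields 6,016, which is the 6,000-byte tree plus the 16-byte separator.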
@@ -246,16 +242,21 @@ determining the size of the search tree in bytes and then adding an additional + $search_tree_size_in_bytes + 16 -Since we subtract and then add 16, the final formula to determine the -offset in the file can be simplified to: +Since we subtract and then add 16, the final formula to determine the offset in +the file can be simplified to: $offset_in_file = ( $record_value - $node_count ) + $search_tree_size_in_bytes ### IPv4 addresses in an IPv6 tree -When storing IPv4 addresses in an IPv6 tree, they are stored as-is, so they -occupy the first 32-bits of the address space (from 0 to 2**32 - 1). +When storing IPv4 addresses in an IPv6 tree, the four bytes of the IPv4 address +are placed in the least significant 32 bits of the 128-bit address, with the +upper 96 bits set to zero. This means IPv4 addresses occupy the lowest portion +of the 128-bit address space, from 0 to 2\*\*32 - 1 (the `::/96` network in IPv6 +notation). In the search tree, looking up an IPv4 address requires traversing 96 +zero-bit branches from the root before reaching the 32 bits that distinguish +individual IPv4 addresses. Creators of databases should decide on a strategy for handling the various mappings between IPv4 and IPv6. @@ -265,66 +266,71 @@ from the `::ffff:0:0/96` subnet to the root node of the IPv4 address space in the tree. This accounts for the [IPv4-mapped IPv6 address](http://en.wikipedia.org/wiki/IPv6#IPv4-mapped_IPv6_addresses). -MaxMind also includes a pointer from the `2002::/16` subnet to the root node -of the IPv4 address space in the tree. This accounts for the +MaxMind also includes a pointer from the `2002::/16` subnet to the root node of +the IPv4 address space in the tree. This accounts for the [6to4 mapping](http://en.wikipedia.org/wiki/6to4) subnet. Database creators are encouraged to document whether they are doing something similar for their databases. -The Teredo subnet cannot be accounted for in the tree. 
Instead, code that -searches the tree can offer to decode the IPv4 portion of a Teredo address and -look that up. +The Teredo subnet (`2001::/32`, +[RFC 4380](https://datatracker.ietf.org/doc/html/rfc4380)) cannot be accounted +for with a tree alias. In a Teredo address, the client's public IPv4 address is +in the last 32 bits (bits 96-127), XOR'd with `0xFFFFFFFF`, and is separated +from the Teredo prefix by 64 bits of server address, flags, and port data. A +tree alias can only map to the bits immediately following the aliased prefix, so +there is no way to construct an alias that reaches the client's IPv4 address. +Correct handling of Teredo addresses requires the reader library or application +to extract and decode the client's IPv4 address and look it up separately. ## Data Section Separator -There are 16 bytes of NULLs in between the search tree and the data -section. This separator exists in order to make it possible for a verification -tool to distinguish between the two sections. +There are 16 bytes of NULLs in between the search tree and the data section. +This separator exists in order to make it possible for a verification tool to +distinguish between the two sections. This separator is not considered part of the data section itself. In other -words, the data section starts at `$size_of_search_tree + 16` bytes in the -file. +words, the data section starts at `$size_of_search_tree + 16` bytes in the file. ## Output Data Section Each output data field has an associated type, and that type is encoded as a number that begins the data field. Some types are variable length. In those -cases, the type indicator is also followed by a length. The data payload -always comes at the end of the field. +cases, the type indicator is also followed by a length. The data payload always +comes at the end of the field. All binary data is stored in big-endian format. 
-Note that the *interpretation* of a given data type's meaning is decided by +Note that the _interpretation_ of a given data type's meaning is decided by higher-level APIs, not by the binary format itself. ### pointer - 1 -A pointer to another part of the data section's address space. The pointer -will point to the beginning of a field. It is illegal for a pointer to point -to another pointer. +A pointer to another part of the data section's address space. The pointer will +point to the beginning of a field. It is illegal for a pointer to point to +another pointer. -Pointer values start from the beginning of the data section, *not* the -beginning of the file. Pointers in the metadata start from the beginning of -the metadata section. +Pointer values start from the beginning of the data section, _not_ the beginning +of the file. Pointers in the metadata start from the beginning of the metadata +section. ### UTF-8 string - 2 -A variable length byte sequence that contains valid utf8. If the length is -zero then this is an empty string. +A variable length byte sequence that contains valid utf8. If the length is zero +then this is an empty string. ### double - 3 -This is stored as an IEEE-754 double (binary64) in big-endian format. The -length of a double is always 8 bytes. +This is stored as an IEEE-754 double (binary64) in big-endian format. The length +of a double is always 8 bytes. ### bytes - 4 A variable length byte sequence containing any sort of binary data. If the -length is zero then this a zero-length byte sequence. +length is zero then this is a zero-length byte sequence. -This is not currently used but may be used in the future to embed non-text -data (images, etc.). +This is not currently used but may be used in the future to embed non-text data +(images, etc.). ### integer formats @@ -333,24 +339,23 @@ Integers are stored in variable length binary fields. We support 16-bit, 32-bit, 64-bit, and 128-bit unsigned integers. 
We also support 32-bit signed integers. -A 128-bit integer can use up to 16 bytes, but may use fewer. Similarly, a -32-bit integer may use from 0-4 bytes. The number of bytes used is determined -by the length specifier in the control byte. See below for details. +A 128-bit integer can use up to 16 bytes, but may use fewer. Similarly, a 32-bit +integer may use from 0-4 bytes. The number of bytes used is determined by the +length specifier in the control byte. See below for details. A length of zero always indicates the number 0. -When storing a signed integer, fields shorter than the maximum byte length -are always positive. When the field is the maximum length, e.g., 4 bytes for -32-bit integers, the left-most bit is the sign. A 1 is negative and a 0 is -positive. +When storing a signed integer, fields shorter than the maximum byte length are +always positive. When the field is the maximum length, e.g., 4 bytes for 32-bit +integers, the left-most bit is the sign. A 1 is negative and a 0 is positive. The type numbers for our integer types are: -* unsigned 16-bit int - 5 -* unsigned 32-bit int - 6 -* signed 32-bit int - 8 -* unsigned 64-bit int - 9 -* unsigned 128-bit int - 10 +- unsigned 16-bit int - 5 +- unsigned 32-bit int - 6 +- signed 32-bit int - 8 +- unsigned 64-bit int - 9 +- unsigned 128-bit int - 10 The unsigned 32-bit and 128-bit types may be used to store IPv4 and IPv6 addresses, respectively. @@ -359,56 +364,51 @@ The signed 32-bit integers are stored using the 2's complement representation. ### map - 7 -A map data type contains a set of key/value pairs. Unlike other data types, -the length information for maps indicates how many key/value pairs it -contains, not its length in bytes. This size can be zero. +A map data type contains a set of key/value pairs. Unlike other data types, the +length information for maps indicates how many key/value pairs it contains, not +its length in bytes. This size can be zero. 
-See below for the algorithm used to determine the number of pairs in the -hash. This algorithm is also used to determine the length of a field's -payload. +See below for the algorithm used to determine the number of pairs in the map. +This algorithm is also used to determine the length of a field's payload. ### array - 11 An array type contains a set of ordered values. The length information for -arrays indicates how many values it contains, not its length in bytes. This -size can be zero. +arrays indicates how many values it contains, not its length in bytes. This size +can be zero. This type uses the same algorithm as maps for determining the length of a field's payload. -### data cache container - 12 +### data cache container - 12 (deprecated) -This is a special data type that marks a container used to cache repeated -data. For example, instead of repeating the string "United States" over and -over in the database, we store it in the cache container and use pointers -*into* this container instead. +This type is deprecated. It has never been used in any known database and +readers are not expected to support it. -Nothing in the database will ever contain a pointer to this field -itself. Instead, various fields will point into the container. +It was originally intended to mark a container of repeated data that a database +dumper tool could skip. In practice, data deduplication is handled by pointers, +making this type unnecessary. -The primary reason for making this a separate data type versus simply inlining -the cached data is so that a database dumper tool can skip this cache when -dumping the data section. The cache contents will end up being dumped as -pointers into it are followed. +### end marker - 13 (deprecated) -### end marker - 13 +This type is deprecated. It has never been used in any known database and +readers are not expected to support it. -The end marker marks the end of the data section. 
It is not strictly -necessary, but including this marker allows a data section deserializer to -process a stream of input, rather than having to find the end of the section -before beginning the deserialization. +It was originally intended to mark the end of the data section for stream-based +deserialization. In practice, readers determine section boundaries from the +metadata, making this type unnecessary. This data type is not followed by a payload, and its size is always zero. ### boolean - 14 -A true or false value. The length information for a boolean type will always -be 0 or 1, indicating the value. There is no payload for this field. +A true or false value. The length information for a boolean type will always be +0 or 1, indicating the value. There is no payload for this field. ### float - 15 -This is stored as an IEEE-754 float (binary32) in big-endian format. The -length of a float is always 4 bytes. +This is stored as an IEEE-754 float (binary32) in big-endian format. The length +of a float is always 4 bytes. This type is provided primarily for completeness. Because of the way floating point numbers are stored, this type can easily lose precision when serialized @@ -422,25 +422,25 @@ about the field's data type and payload size. The first three bits of the control byte tell you what type the field is. If these bits are all 0, then this is an "extended" type, which means that the -*next* byte contains the actual type. Otherwise, the first three bits will +_next_ byte contains the actual type. Otherwise, the first three bits will contain a number from 1 to 7, the actual type for the field. We've tried to assign the most commonly used types as numbers 1-7 as an optimization. -With an extended type, the type number in the second byte is the number -minus 7. In other words, an array (type 11) will be stored with a 0 for the -type in the first byte and a 4 in the second. +With an extended type, the type number in the second byte is the number minus 7. 
+In other words, an array (type 11) will be stored with a 0 for the type in the +first byte and a 4 in the second. Here is an example of how the control byte may combine with the next byte to tell us the type: 001XXXXX pointer 010XXXXX UTF-8 string - 110XXXXX unsigned 32-bit int (ASCII) - 000XXXXX 00000011 unsigned 128-bit int (binary) + 110XXXXX unsigned 32-bit int + 000XXXXX 00000011 unsigned 128-bit int 000XXXXX 00000100 array - 000XXXXX 00000110 end marker + 000XXXXX 00000110 end marker (deprecated) #### Payload Size @@ -456,49 +456,47 @@ bytes. For example: 11000001 unsigned 32-bit int - 1 byte long 00000011 00000011 unsigned 128-bit int - 3 bytes long -If the five bits are equal to 29, 30, or 31, then use the following algorithm -to calculate the payload size. +If the five bits are equal to 29, 30, or 31, then use the following algorithm to +calculate the payload size. -If the value is 29, then the size is 29 + *the next byte after the type -specifying bytes as an unsigned integer*. +If the value is 29, then the size is 29 + _the next byte (as an unsigned +integer)_. -If the value is 30, then the size is 285 + *the next two bytes after the type -specifying bytes as a single unsigned integer*. +If the value is 30, then the size is 285 + _the next two bytes (as a single +unsigned integer)_. -If the value is 31, then the size is 65,821 + *the next three bytes after the -type specifying bytes as a single unsigned integer*. +If the value is 31, then the size is 65,821 + _the next three bytes (as a single +unsigned integer)_. Some examples: 01011101 00110011 UTF-8 string - 80 bytes long -In this case, the last five bits of the control byte equal 29. We treat the -next byte as an unsigned integer. The next byte is 51, so the total size is -(29 + 51) = 80. +In this case, the last five bits of the control byte equal 29. We treat the next +byte as an unsigned integer. The next byte is 51, so the total size is (29 + 51) += 80. 
01011110 00110011 00110011 UTF-8 string - 13,392 bytes long -The last five bits of the control byte equal 30. We treat the next two bytes -as a single unsigned integer. The next two bytes equal 13,107, so the total -size is (285 + 13,107) = 13,392. +The last five bits of the control byte equal 30. We treat the next two bytes as +a single unsigned integer. The next two bytes equal 13,107, so the total size is +(285 + 13,107) = 13,392. 01011111 00110011 00110011 00110011 UTF-8 string - 3,421,264 bytes long The last five bits of the control byte equal 31. We treat the next three bytes -as a single unsigned integer. The next three bytes equal 3,355,443, so the -total size is (65,821 + 3,355,443) = 3,421,264. +as a single unsigned integer. The next three bytes equal 3,355,443, so the total +size is (65,821 + 3,355,443) = 3,421,264. -This means that the maximum payload size for a single field is 16,843,036 -bytes. +This means that the maximum payload size for a single field is 16,843,036 bytes. The binary number types always have a known size, but for consistency's sake, the control byte will always specify the correct size for these types. #### Maps -Maps use the size in the control byte (and any following bytes) to indicate -the number of key/value pairs in the map, not the size of the payload in -bytes. +Maps use the size in the control byte (and any following bytes) to indicate the +number of key/value pairs in the map, not the size of the payload in bytes. This means that the maximum number of pairs for a single map is 16,843,036. @@ -508,9 +506,8 @@ pair, etc. The keys are **always** UTF-8 strings. The values may be any data type, including maps or pointers. -Once we know the number of pairs, we can look at each pair in turn to -determine the size of the key and the key name, as well as the value's type -and payload. 
+Once we know the number of pairs, we can look at each pair in turn to determine +the size of the key and the key name, as well as the value's type and payload. #### Pointers @@ -537,31 +534,31 @@ Finally, if the size is 3, the pointer's value is contained in the next four bytes as a 32-bit value. In this case, the last three bits of the control byte are ignored. -This means that we are limited to 4GB of address space for pointers, so the -data section size for the database is limited to 4GB. +This means that we are limited to 4GB of address space for pointers, so the data +section size for the database is limited to 4GB. ## Reference Implementations ### Writer -* [Go](https://github.com/maxmind/mmdbwriter) +- [Go](https://github.com/maxmind/mmdbwriter) ### Reader -* [C](https://github.com/maxmind/libmaxminddb) -* [C#](https://github.com/maxmind/MaxMind-DB-Reader-dotnet) -* [Java](https://github.com/maxmind/MaxMind-DB-Reader-java) -* [PHP](https://github.com/maxmind/MaxMind-DB-Reader-php) -* [Python](https://github.com/maxmind/MaxMind-DB-Reader-python) -* [Ruby](https://github.com/maxmind/MaxMind-DB-Reader-ruby) +- [C](https://github.com/maxmind/libmaxminddb) +- [C#](https://github.com/maxmind/MaxMind-DB-Reader-dotnet) +- [Java](https://github.com/maxmind/MaxMind-DB-Reader-java) +- [PHP](https://github.com/maxmind/MaxMind-DB-Reader-php) +- [Python](https://github.com/maxmind/MaxMind-DB-Reader-python) +- [Ruby](https://github.com/maxmind/MaxMind-DB-Reader-ruby) ## Authors This specification was created by the following authors: -* Greg Oschwald \ -* Dave Rolsky \ -* Boris Zentner \ +- Greg Oschwald \ +- Dave Rolsky \ +- Boris Zentner \ ## License @@ -570,4 +567,3 @@ Unported License. 
To view a copy of this license, visit [http://creativecommons.org/licenses/by-sa/3.0/](http://creativecommons.org/licenses/by-sa/3.0/) or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA - diff --git a/README.md b/README.md index fab226f..d5d20bd 100644 --- a/README.md +++ b/README.md @@ -8,8 +8,8 @@ This repository contains the spec for that format as well as test databases. The `write-test-data` command generates the MMDB test files under `test-data/` and `bad-data/`. -When run from anywhere inside this repository, it auto-detects the repo root -and uses default paths: +When run from anywhere inside this repository, it auto-detects the repo root and +uses default paths: ```bash go run ./cmd/write-test-data @@ -28,5 +28,6 @@ go run ./cmd/write-test-data \ This software is Copyright (c) 2013 - 2026 by MaxMind, Inc. -This is free software, licensed under the [Apache License, Version -2.0](LICENSE-APACHE) or the [MIT License](LICENSE-MIT), at your option. +This is free software, licensed under the +[Apache License, Version 2.0](LICENSE-APACHE) or the [MIT License](LICENSE-MIT), +at your option. diff --git a/bad-data/README.md b/bad-data/README.md index f1d0bc0..4fe57f8 100644 --- a/bad-data/README.md +++ b/bad-data/README.md @@ -1,14 +1,14 @@ These are corrupt databases that have been known to cause problems such as -segfaults or unhandled errors on one or more MaxMind DB reader -implementations. Implementations _should_ return an appropriate error -or raise an exception on these databases. +segfaults or unhandled errors on one or more MaxMind DB reader implementations. +Implementations _should_ return an appropriate error or raise an exception on +these databases. Databases are organized into subdirectories named after the reader implementation that exposed the issue (e.g., `libmaxminddb/`). 
Note: `libmaxminddb/libmaxminddb-uint64-max-epoch.mmdb` contains a valid -database structure with `build_epoch` set to `UINT64_MAX`. It may not produce -a reader error but can cause overflow in time type conversions. +database structure with `build_epoch` set to `UINT64_MAX`. It may not produce a +reader error but can cause overflow in time type conversions. If you find a corrupt test-sized database that crashes a MMDB reader library, please feel free to add it here by creating a pull request. diff --git a/test-data/README.md b/test-data/README.md index 7931168..ef9a950 100644 --- a/test-data/README.md +++ b/test-data/README.md @@ -1,12 +1,14 @@ ## How to generate test data -Use the [write-test-data](https://github.com/maxmind/MaxMind-DB/blob/main/cmd/write-test-data) + +Use the +[write-test-data](https://github.com/maxmind/MaxMind-DB/blob/main/cmd/write-test-data) go tool to create a small set of test databases with a variety of data and record sizes. These test databases are useful for testing code that reads MaxMind DB files. -There are several ways to figure out what IP addresses are actually in the -test databases. You can take a look at the +There are several ways to figure out what IP addresses are actually in the test +databases. You can take a look at the [source-data directory](https://github.com/maxmind/MaxMind-DB/tree/main/source-data) in this repository. This directory contains JSON files which are used to generate many (but not all) of the database files. @@ -17,10 +19,11 @@ in the [MaxMind-DB-Reader-perl repository](https://github.com/maxmind/MaxMind-DB-Reader-perl). ## Static test data + Some of the test files are remnants of the [old perl test data writer](https://github.com/maxmind/MaxMind-DB/blob/f0a85c671c5b6e9c5e514bd66162724ee1dedea3/test-data/write-test-data.pl) -and cannot be generated with the go tool. 
These databases are intentionally broken, -and exploited functionality simply not available in the go mmdbwriter: +and cannot be generated with the go tool. These databases are intentionally +broken, and exploited functionality simply not available in the go mmdbwriter: - MaxMind-DB-test-broken-pointers-24.mmdb - MaxMind-DB-test-broken-search-tree-24.mmdb @@ -30,6 +33,7 @@ and exploited functionality simply not available in the go mmdbwriter: - maps-with-pointers.raw ## Usage + ``` Usage of ./write-test-data: -source string @@ -38,5 +42,4 @@ Usage of ./write-test-data: Destination directory for the generated mmdb files ``` -Example: -`./write-test-data --source ../../source-data --target ../../test-data` +Example: `./write-test-data --source ../../source-data --target ../../test-data`