The goal of this document is to make Velox developers more productive by promoting consistency within the codebase, and encouraging common practices which will make the codebase easier to read, edit, maintain and debug in the future.
We use pre-commit to manage the installation and execution of a number of code quality checks, called hooks.
The recommended way to install pre-commit is through either
pipx or the newer
uv tool. Once you have
pre-commit available in your environment, you can enable running checks on
each commit by running pre-commit install in the root of the repository.
Tip
This will take a few minutes the first time you run it, as pre-commit will
set up the environment for each hook by installing the required tool and
its dependencies in a separate environment to ensure reproducibility of the
check results.
The hooks are defined in .pre-commit-config.yaml.
After the setup is complete, each time you git commit, the hooks will be run
and potential changes applied to your staged files. Any unstaged files will
be stashed while the hooks run. If any changes occurred, the commit will not
succeed. The same happens when you git push but applies to all files that are being
pushed into the repository.
You will have to review and stage the changed files and commit again.
To manually run one specific hook, use pre-commit run <hookid>, for example
pre-commit run clang-format. You can find the hookid in .pre-commit-config.yaml.
By design, pre-commit will only be run on the files that are part of the commit
or push. If you want to run the checks on all files (including unstaged files), you can
run pre-commit run --all-files.
The clang-tidy hook will not be run automatically as it takes a long time and
requires CMake to be run first to create compile_commands.json.
It can be run explicitly via pre-commit run --hook-stage=manual.
If you need to temporarily skip the checks, you can use the git flag --no-verify.
Important
We also run the hooks as part of our CI, which will flag any issues introduced by skipping the checks.
Many aspects of C++ style will be covered by clang-format, such as spacing, line width, indentation and ordering (for includes, using directives and etc).
- Always ensure your code is clang-format compatible.
- If you’re working on a legacy file which is not clang-format compliant yet,
refrain from formatting the entire file in the same PR as it pollutes your
original PR and makes it harder to review.
- Submit a separate diff/PR with the format-only changes.
- Use PascalCase for types (classes, structs, enums, type aliases, type template parameters) and file names.
- Use camelCase for functions, member and local variables, and non-type template parameters.
- camelCase_ for private and protected members variables.
- Use snake_case for namespace names and build targets
- Use UPPER_SNAKE_CASE for macros.
- Use kPascalCase for static constants and enumerators.
- Use testing prefix for test only class methods, e.g. obj.testingFoo(). As much as possible, refrain from adding test methods, and test objects using their public APIs. Only use test methods in rare edge cases where this is not possible. For example, MemoryAllocator::testingSetFailureInjection() is used to to inject various memory allocation failures to test error handling paths.
- Use the debug prefix for query configs that are intended for debugging
purposes only. These configs may enable expensive checks or disable selective
code paths, and are not recommended for use in production environments. For
example,
debug_disable_expression_with_peelingis used to disable peeling optimization employed in expression evaluation.
Some good practices about code comments:
- Optimize code for the reader, not the writer.
- Velox is growing and starting to be used by multiple compute engines and teams; as a result, more time will be spent reading code than writing it.
- Overall goal: make the code more obvious and remove obscurity.
- Comments should capture information that was on the writer’s mind, but
couldn’t be represented as code.
- As such, refrain from adding obvious comments, e.g: simple getter/setter methods.
- However, “obviousness” is in the reader’s mind - if a reviewer says
something is not obvious, then it is not obvious.
- Consider that the audience is an experienced Software Engineer with a moderate knowledge of the codebase.
- What should be commented:
-
Every File
-
Every Class
-
Every method that's not an obvious getter/setter
-
Every member variable
- Do not simply restate the variable name. Either add a comment explaining the semantic meaning of that variable or do not add a comment at all. This is an anti-pattern:
// A simple counter. size_t count_{0}; -
For functions with large bodies, a good practice is to group blocks of related code, and precede them with a blank line and a high-level comment on what the block does.
-
About comment style:
-
Header comment for source files:
- All source files (.h and .cpp) should have a standard header in a large
comment block at the top; this should include the standard copyright
notice, the license header, and a brief file description. This comment
block should use
/* */
- All source files (.h and .cpp) should have a standard header in a large
comment block at the top; this should include the standard copyright
notice, the license header, and a brief file description. This comment
block should use
-
For single line comments, always use
//(with a space separator before the comment). -
Comments should be full english sentences, starting with a capital letter and ending with a period (.).
-
Use:
// True if this node only sorts a portion of the final result. -
Instead of:
// true if this node only sorts a portion of the final result
-
-
For multi-line comments:
- Velox will follow the doxygen comment style.
- For multi-line comments within actual code blocks (the ones which are not
to be exposed as documentation, use
//(double slashes). - For comments on headers for classes, functions, methods and member
variables (the ones to be exposed as documentation), use
///(three slashes) at the front of each line. - Don't use old-style
/* */comments inside code or on top-level header comments. It adds two additional lines and makes headers more verbose than they need to be.
-
Special comments:
-
Use this format when something needs to be fixed in the future:
// TODO: Description of what needs to be fixed. -
Include enough context in the comment itself to make clear what will be done, without requiring any references from outside the code.
-
Do not include the author’s username. If required, this can always be retrieved from git blame.
-
- For assertions and other types of validation, use
VELOX_CHECK_*macrosVELOX_CHECK_*will categorize the error to be an internal runtime error.VELOX_USER_CHECK_*will categorize the error to be a user error.
- Use
VELOX_FAIL()orVELOX_USER_FAIL()to inadvertently throw an exception:VELOX_FAIL("Illegal state");
- Use
VELOX_UNREACHABLE()when a particular branch/block should never be executed, such as in a switch statement with an invaliddefault:block. - Use
VELOX_NYI()for features or code paths that are not implemented yet. - When comparing two values/expressions, prefer to use:
VELOX_CHECK_LT(idx, children_.size());
- Rather than:
VELOX_CHECK(idx < children_.size());
- The former will evaluate and include both expressions values in the exception.
- All macro also provide the following optional features:
- Specifying error code (error code listed in
velox/common/base/VeloxException.h):VELOX_CHECK_EQ(v1[, v2[, error_code]]);VELOX_USER_CHECK_EQ(v1[, v2[, error_code]])
- Appending error message:
VELOX_CHECK_EQ(v1, v2, “Some error message”)VELOX_USER_CHECK_EQ(v1, v2, “Some error message”)
- Error message formatting via fmt:
VELOX_USER_CHECK_EQ(v1, v2, “My complex error message {} and {}”, str, i)- Note that the values of v1 and v2 are already included in the exception message by default.
- Specifying error code (error code listed in
- Do not declare more than one variable on a line (this helps accomplish the
other guidelines, promotes commenting, and is good for tagging).
- Obvious exception: for-loop initialization
- Initialize most variables at the time of declaration (unless they are class
objects without default constructors).
- Prefer using uniform-initialization to make initialization more consistent
across types, e.g., prefer
size_t size{0};oversize_t size = 0;
- Prefer using uniform-initialization to make initialization more consistent
across types, e.g., prefer
- Declare your variables in the smallest scope possible.
- Declare your variables as close to the usage point as possible within the given scope.
- Don't group all your variables at the top of the scope -- this makes the code much harder to follow.
- If the variable or function parameter is a pointer or reference type, group
the
*or&with the type -- pointer-ness or reference-ness is an attribute of the type, not the name.int* foo;const Bar& bar;NOTint *foo;const Bar &bar;- Beware that
int* foo, bar;will be parsed as declaringfooas anint*andbaras anint. Note that multiple declaration is discouraged.
- For member variables:
- Group member variable and methods based on their visibility (public,
protected and private)
- It’s ok to have multiple blocks for a given level.
- Most member variables should come with a short comment description and an empty line above that.
- Refrain from using public member variables whenever possible, in order to
promote encapsulation.
- Leverage getter/setter methods when appropriate.
- Name getter methods after the variable, e.g:
foo(), and setter methods using the "set" prefix, e.g:setFoo() - Always mark getter methods as const.
- Group member variable and methods based on their visibility (public,
protected and private)
- Prefer to use value-types,
std::optional, andstd::unique_ptrin that order.- Value-types are conceptually the simplest and cheapest.
std::optionalallows you to express “may be null” without the additional complexity of manual storage duration.std::unique_ptr<>should be used for types that are not cheaply movable but need to transfer ownership, or which are too large to store on the stack. Note that most types that perform large allocations already store their bulk memory on-heap.
- Always use
nullptrif you need a constant that represents a null pointer (T*for someT); use0otherwise for a zero value. - For large literal numbers, use ‘ to make it more readable, e.g:
1’000’000instead of1000000. - For floating point literals, never omit the initial 0 before the decimal
point (always
0.5, not.5). - File level variables and constants should be defined in an anonymous namespace.
- Always prefer const variables and enum to using preprocessor (#define) to define constant values.
- Prefer
enum classoverenumfor better type safety. - As a general rule, do not use string literals without declaring a named
constant for them.
- The best way to make a constant string literal is to use constexpr
std::string_view - NEVER use
std::string- this makes your code more prone to SIOF bugs. - Avoid
const char* constandconst char*- these are less efficient to convert tostd::stringlater on in your program if you ever need to becausestd::string_viewknows its size and can use a more efficient constructor.std::string_viewalso has richer interfaces and often works as a drop-in replacement tostd::string. - Need compile-time string concatenation? You can use
folly::FixedStringfor that. - Do not use
folly::StringPiecein new code, usestd::string_viewinstead.
- The best way to make a constant string literal is to use constexpr
Do not use them unless absolutely necessary. Whenever possible, use normal inline functions or templates instead. If you absolutely need, remember that macro names are always upper-snake-case. Also:
- Use parentheses to sanitize complex inputs.
- For example,
#define MUL(x, y) x * yis incorrect for input likeMUL(2, 1 + 1). The author probably meant#define MUL(x, y) ((x) * (y))instead.
- For example,
- When making macros that have multiple statements, use this idiom:
do { stmt1; stmt2; } while (0)- This allows this block to act the most like a single statement, usable in if/for/while even if they don't use braces and forces use of a trailing semicolon.
- All header files must have a single-inclusion guard using
#pragma once - Included files and using declarations should all be at the top of the file,
and ordered properly to make it easier to see what is included.
- Use clang-format to order your include and using directives.
- Includes should always use the full path (relative to github’s root dir).
- Whenever possible, try to forward-declare as much as possible in the .h and
only
#includethings you need the full implementation for.- For instance, if you just use a
Class*orClass&in a header file, forward-declareClassinstead of including the full header to minimize header dependencies.
- For instance, if you just use a
- Put small private classes that will only get used in one place into a .cpp, inside an anonymous namespace. A common example is a custom comparator class.
- Be aware of what goes into your .h files. If possible, try to separate them into "Public API" .h files that may be included by external modules. These public .h files should contain only those functions or classes that external users need to access.
- No large blocks of code should be in .h files that are widely included if
these blocks aren't vital to all users' understanding of the interface the .h
represents. Split these large implementation blocks into a separate
-inl.hfile.-inl.hfiles should exist only to aid readability with the expectation that a user of your library should only need to read the .h to understand the interface. Only if someone needs to understand your implementation should opening the-inl.hbe necessary (or the .cpp file, for that matter).
- Const
- All objects, whenever possible, should be passed to functions as const-refs
(
const T&). This makes APIs much clearer as to whether it’s an input or output arg, reduces the need for NULL-checks (and reduces crashing bugs), and helps enforce good encapsulation.- An obvious exception is objects that are trivially copy- or
move-constructible, like primitive types or
std::unique_ptr. - Don't pass by const reference when passing by value is both correct and cheaper.
- An obvious exception is objects that are trivially copy- or
move-constructible, like primitive types or
- All non-static member functions should be marked as const if calling them doesn't change the behavior/response of any member.
- All objects, whenever possible, should be passed to functions as const-refs
(
- References as function arguments.
- For input or read-only parameters, pass by
const& - For nullable or optional parameters, use higher order primitives like
std::optional; raw pointers are also appropriate. - For output, mutable, or in/out parameters:
- Non-const ref is the recommendation if the argument is not nullable.
- If it’s nullable, use a raw pointer.
- Always use
std::optionalinstead offolly::Optionalandboost::optional.
- For input or read-only parameters, pass by
- Prefer
std::string_viewtoconst std::string&- This avoids unnecessary construction of a std::string when passing other types. This often happens when string literals are passed into functions.
- If you need a
std::stringinside the function, however, take astd::stringto move the copy to the API boundary. This allows the caller to avoid the copy with a move.
- Almost never take an r-value reference as a function argument; take the
argument by value instead. This usually as efficient as taking an r-value
reference, but simpler and more flexible at the call site. For example:
InPredicate::InPredicate(folly::F14HashSet<T>&& rhs) : rhs_(std::move(rhs)) {}- is more complex than the version that takes expr by value:
InPredicate::InPredicate(folly::F14HashSet<T> rhs) : rhs_(std::move(rhs)) {} - However, the first requires either a std::move() or an explicit copy at the call point; the second doesn’t. The performance of the two should be almost identical.
- is more complex than the version that takes expr by value:
- Add comments when it’s not clear which argument is intended in a function
call. This is particularly a problem for longer signatures with repeated
types or constant arguments. For example,
phrobinicate(/*elements=*/{1, 2}, /*startOffset=*/0, /*length=*/2) - Use the /argName=/value format (note the lack of spaces). Clang-tidy supports checking the given argument name against the declaration when this format is used.
- All Velox code should be placed inside the
facebook::veloxnamespace - Always use nested namespace definition:
namespace facebook::velox::core {and notnamespace facebook { namespace velox { namespace core {
- Always add an inline comment at the end of the namespace definition, and surround it by empty lines:
namespace facebook::velox::exec {
myFunc();
} // namespace facebook::velox::exec
- not:
namespace facebook::velox::exec {
myFunc();
}
- Use sub namespaces (e.g: facebook::velox::core) to logically group large
chunks of related code and prevent identifier clashes.
- Namespaces should make it easier to code, not harder. Refrain from creating hierarchies that are way too long.
- Namespace don’t necessarily need to reflect the on-disk layout.
- This isn’t java.
- Guidelines for importing namespaces:
- Don't EVER put
using namespace anything;ORusing anything::anything;in a header file. This pollutes the global namespace and makes it extremely difficult to refactor code. - It’s very useful to not have to fully-qualify types as it can make code
more readable. In header files, it's unavoidable. ALWAYS fully qualify
names in a header (e.g. it's
std::string, not juststring). - In cpp files, best practice is to add a using declaration after your list
of
#includes(e.g.using tests::VectorMaker;) and to then just writeVectorMaker. using namespace std;can lead to surprising behavior, especially when other modules don't follow best practices with their using statements and what they dump into global scope. The best defensive programming practice for std it to fully qualify symbols, e.g:std::string, std::vector- Do not use
using namespace std;
- Do not use
- Use
using namespace anything_else;somewhat sparingly – the whole point of namespaces is obliterated when the code starts doing this prolifically. Frequently,using foo::bar::BazClass;is better.
- Don't EVER put
- For types widely used together with std::shared_ptr, consider introducing aliases for std::shared_ptr using naming convention XxxPtr. In some cases it makes sense to alias std::shared_ptr. Here are some examples of existing aliases: TypePtr, VectorPtr, FlatVectorPtr, ArrayVectorPtr, MapVectorPtr, RowVectorPtr, TypedExprPtr, PlanNodePtr.
using TypePtr = std::shared_ptr<const Type>;
- Similarly, widely used template types also benefit from shorter or clearer aliases.
using ContinueFuture = folly::SemiFuture<bool>;
using ContinuePromise = VeloxPromise<bool>;