Substring Explained
What is Substring?
Substring is a fluent Java API for matching simple patterns in strings.
A Substring pattern safely and efficiently extracts matches, without fiddly
arithmetic on character indexes, or the overhead of regular expressions.
To remove the http:// prefix from the url string, if it exists, use:
Substring.prefix("http://").removeFrom(url);
Instead of:
url.startsWith("http://")
    ? url.substring(7)
    : url;
or:
url.startsWith("http://")
    ? url.substring("http://".length())
    : url;

To remove any url scheme (http://, https://, chrome:// etc.), use:
import static com.google.mu.util.Substring.first;
Substring.upToIncluding(first("://")).removeFrom(url);
Instead of:
url.substring(url.indexOf("://") + 3);
// Or, if you didn't forget bounds checking:
int index = url.indexOf("://");
index == -1
    ? url
    : url.substring(index + 3);

To add home/ to a file path if it's not already there.
Use:
Substring.prefix("home/").addToIfAbsent(filePath);
Instead of:
filePath.startsWith("home/")
    ? filePath
    : "home/" + filePath;

To add a comma to the end of a line if it's missing.
Use:
Substring.suffix(',').addToIfAbsent(line);
Instead of:
line.endsWith(",")
    ? line
    : line + ",";

To extract the directory path home/foo from home/foo/Bar.java.
Use:
import static com.google.mu.util.Substring.last;
String path = ...;
Optional<String> directory = Substring.before(last('/')).from(path);
Instead of:
int index = path.lastIndexOf('/');
Optional<String> directory =
    index == -1
        ? Optional.empty()
        : Optional.of(path.substring(0, index));

To extract the shelf name id-1 from the resource name bookstores/barnes.noble/shelves/id-1/books/foo, use:
Optional<String> shelfName =
    Substring.between(first("/shelves/"), first('/'))
        .from(resourceName);

To extract the gaia id from the string id:123.
Use:
import static com.google.mu.util.Substring.prefix;
String str = "id:123";
Optional<Long> gaiaId = Substring.after(prefix("id:"))
    .from(str)
    .map(Long::parseLong);
Instead of:
Optional<Long> gaiaId =
    str.startsWith("id:")
        ? Optional.of(Long.parseLong(
            str.substring("id:".length())))
        : Optional.empty();

To extract the user id from an email address.
Use:
Optional<String> userId = Substring.before(first('@')).from(email);
Instead of:
int index = email.indexOf('@');
Optional<String> userId =
    index == -1
        ? Optional.empty()
        : Optional.of(email.substring(0, index));

To extract both the user id and the domain from an email address.
Use:
Optional<UserWithDomain> userWithDomain =
    first('@')
        .split(email)
        .map(UserWithDomain::new);
Instead of error-prone index arithmetic:
int index = email.indexOf('@');
Optional<UserWithDomain> userWithDomain =
    index == -1
        ? Optional.empty()
        : Optional.of(
            new UserWithDomain(
                email.substring(0, index),
                email.substring(index + 1)));

Substring or Guava Splitter?
Both Substring.Pattern and Guava Splitter support splitting strings. Substring.Pattern doesn't have methods like limit() or omitEmptyResults() because, in Java 8 and later, Stream already provides them. For example:
first(',').repeatedly()
    .split(input)
    .filter(m -> m.length() > 0)
    .limit(10); // First 10 non-empty

Substring.Pattern also supports two-way split. For example, if you are parsing a flag that looks like --loggingLevels=file1=1,file2=3, you can do:
first('=')
    .split(arg)
    .map((name, value) -> ...);

On the other hand, Splitter only splits to a List or Stream, so you'll need to check the length yourself and extract the name and value parts using list.get(0) and list.get(1).
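To make the length-checking burden concrete, here is a minimal sketch of the manual two-way split. It uses only the JDK (String.split with a limit) as a stand-in for a list-based splitter, so it is not Guava's or Mug's API; the class FlagSplitSketch and the method splitFlag are hypothetical names, and Map.entry assumes Java 9 or later.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Mug or Guava. */
final class FlagSplitSketch {

  /** Splits "name=value" at the first '='; returns empty if there is no '='. */
  static Optional<Map.Entry<String, String>> splitFlag(String arg) {
    // A limit of 2 keeps any later '=' characters inside the value part.
    String[] parts = arg.split("=", 2);
    // This is the length check that a list-based split leaves to the caller.
    if (parts.length < 2) {
      return Optional.empty();
    }
    return Optional.of(Map.entry(parts[0], parts[1]));
  }

  public static void main(String[] args) {
    System.out.println(splitFlag("--loggingLevels=file1=1,file2=3"));
    System.out.println(splitFlag("--verbose"));
  }
}
```

The caller has to remember both the limit and the length check; a two-way split that returns an optional pair folds both into the return type.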
Lastly, repeatedly().split() and repeatedly().splitThenTrim() return Stream<Match>, where Match is a CharSequence view over the original string, so no characters are copied until you explicitly ask for it. That is, if you decide that only certain matches are worth keeping, you can save the allocation and copying cost for items that aren't of interest:
List<String> names =
    first(',').repeatedly()
        .splitThenTrim("name=foo,age=10,name=bar,location=SFO")
        .filter(prefix("name=")::isIn)
        .map(Match::toString)
        .map(prefix("name=")::removeFrom)
        .collect(toList());

To parse key-value pairs
The Substring.Pattern.split() method returns a BiOptional<String, String> object that optionally contains a pair of substrings before and after the delimiter pattern.
But if called on the RepeatingPattern object returned by the repeatedly() method, the input string will be split into a stream of substrings.
Combined, the two can parse key-value pairs and then collect them into a Map, a Multimap or whatever.
For example:
import static com.google.mu.util.stream.GuavaCollectors.toImmutableSetMultimap;
Substring.RepeatingPattern delimiter = first(',').repeatedly();
ImmutableSetMultimap<String, String> multimap =
    delimiter
        .split("k1=v1,k2=v2,k2=v3") // => ["k1=v1", "k2=v2", "k2=v3"]
        .collect(
            toImmutableSetMultimap(
                // "k1=v1" => (k1, v1), "k2=v2" => (k2, v2) etc.
                s -> first('=').split(s).orElseThrow(...)));

Substring patterns are best used on strings that are known to be in the expected format. That is, they are either internal (flag values, internal file paths etc.), or are already guaranteed to be in the expected format by a stricter parser or validator (URLs, emails, Cloud resource names, ...).
For example, the following code returns a nonsensical result:
String unexpected = "Surprise! This is not a url with http:// or https://!";
upToIncluding(first("://")).removeFrom(unexpected);
// => " or https://!".

If you need to parse a string with complex syntax rules or context-sensitive grammar, use a proper parser or regex instead.
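One way to get the "already validated" guarantee is to run a stricter parser first and only then strip the scheme. The sketch below assumes the JDK's java.net.URI is an acceptable validator for the input at hand. The class ValidateThenStripSketch and the method stripScheme are hypothetical names, and a plain substring call stands in for the Substring pattern so that the example runs without extra dependencies.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Mug. */
final class ValidateThenStripSketch {

  /** Returns the input without its "scheme://" prefix, or empty if it is not such a URI. */
  static Optional<String> stripScheme(String input) {
    try {
      // Stricter check first: this rejects free text such as strings containing spaces.
      URI uri = new URI(input);
      if (uri.getScheme() == null || uri.getRawAuthority() == null) {
        return Optional.empty(); // relative or opaque URI, so there is no "scheme://" prefix
      }
      // With a scheme and an authority present, the input starts with "<scheme>://".
      String prefix = uri.getScheme() + "://";
      return Optional.of(input.substring(prefix.length()));
    } catch (URISyntaxException e) {
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(stripScheme("https://example.com/a"));
    System.out.println(stripScheme("Surprise! This is not a url with http:// or https://!"));
  }
}
```

With the validation in front, the "Surprise!" string yields an empty result instead of a nonsensical remainder.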
Substring.Pattern can also be created from a CharMatcher or a regex:
import static com.google.common.base.CharMatcher.digit;
import static com.google.common.base.CharMatcher.whitespace;
import static com.google.mu.util.Substring.last;
before(first(whitespace())) // first(CharMatcher)
    .from("foo bar") // => "foo"
upToIncluding(last(digit())) // last(CharMatcher)
    .from("314s") // => "314"
first(Pattern.compile("\\(\\d{3}\\)\\s")) // regex (java.util.regex.Pattern)
    .removeFrom("(312) 987-6543") // => "987-6543"

Substring.Pattern is immutable, and a pattern object can be reused to save object allocation cost, especially when used in a Stream chain.
For example, prefer:
charSource.readLines().stream()
    .map(first('=')::split)
over:
charSource.readLines().stream()
    .map(line -> first('=').split(line))
because the latter will allocate a new, equivalent Substring.Pattern object over and over, once for every line.
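The difference comes from how Java evaluates the two forms: the receiver of a bound method reference is evaluated once, when the method reference expression itself is evaluated, while the body of a lambda runs again for every element. The sketch below demonstrates this with a counting factory. ReceiverEvaluationSketch and keyExtractor are hypothetical stand-ins for a factory call such as first('='), not Mug code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical demo; keyExtractor() stands in for a factory call like first('='). */
final class ReceiverEvaluationSketch {
  static final AtomicInteger created = new AtomicInteger();

  /** Counts how many times a new extractor object is requested. */
  static Function<String, String> keyExtractor() {
    created.incrementAndGet();
    return line -> line.substring(0, line.indexOf('='));
  }

  /** The receiver keyExtractor() is evaluated once, before the stream runs. */
  static int creationsWithMethodReference(List<String> lines) {
    created.set(0);
    lines.stream().map(keyExtractor()::apply).collect(Collectors.toList());
    return created.get();
  }

  /** The lambda body calls keyExtractor() again for every element. */
  static int creationsWithLambda(List<String> lines) {
    created.set(0);
    lines.stream().map(line -> keyExtractor().apply(line)).collect(Collectors.toList());
    return created.get();
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("a=1", "b=2", "c=3");
    System.out.println(creationsWithMethodReference(lines)); // prints 1
    System.out.println(creationsWithLambda(lines)); // prints 3
  }
}
```

Storing the pattern in a constant or a local variable has the same effect as the method reference and works inside a lambda too.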