/build/cargo-vendor-dir/regex-1.11.1/src/lib.rs
1 | | /*! |
2 | | This crate provides routines for searching strings for matches of a [regular |
3 | | expression] (aka "regex"). The regex syntax supported by this crate is similar |
4 | | to other regex engines, but it lacks several features that are not known to be
5 | | efficiently implementable. This includes, but is not limited to, look-around and
6 | | backreferences. In exchange, all regex searches in this crate have worst case |
7 | | `O(m * n)` time complexity, where `m` is proportional to the size of the regex |
8 | | and `n` is proportional to the size of the string being searched. |
9 | | |
10 | | [regular expression]: https://en.wikipedia.org/wiki/Regular_expression |
11 | | |
12 | | If you just want API documentation, then skip to the [`Regex`] type. Otherwise, |
13 | | here's a quick example showing one way of parsing the output of a grep-like |
14 | | program: |
15 | | |
16 | | ```rust |
17 | | use regex::Regex; |
18 | | |
19 | | let re = Regex::new(r"(?m)^([^:]+):([0-9]+):(.+)$").unwrap(); |
20 | | let hay = "\ |
21 | | path/to/foo:54:Blue Harvest |
22 | | path/to/bar:90:Something, Something, Something, Dark Side |
23 | | path/to/baz:3:It's a Trap! |
24 | | "; |
25 | | |
26 | | let mut results = vec![]; |
27 | | for (_, [path, lineno, line]) in re.captures_iter(hay).map(|c| c.extract()) { |
28 | | results.push((path, lineno.parse::<u64>()?, line)); |
29 | | } |
30 | | assert_eq!(results, vec![ |
31 | | ("path/to/foo", 54, "Blue Harvest"), |
32 | | ("path/to/bar", 90, "Something, Something, Something, Dark Side"), |
33 | | ("path/to/baz", 3, "It's a Trap!"), |
34 | | ]); |
35 | | # Ok::<(), Box<dyn std::error::Error>>(()) |
36 | | ``` |
37 | | |
38 | | # Overview |
39 | | |
40 | | The primary type in this crate is a [`Regex`]. Its most important methods are |
41 | | as follows: |
42 | | |
43 | | * [`Regex::new`] compiles a regex using the default configuration. A |
44 | | [`RegexBuilder`] permits setting a non-default configuration. (For example, |
45 | | case insensitive matching, verbose mode and others.) |
46 | | * [`Regex::is_match`] reports whether a match exists in a particular haystack. |
47 | | * [`Regex::find`] reports the byte offsets of a match in a haystack, if one |
48 | | exists. [`Regex::find_iter`] returns an iterator over all such matches. |
49 | | * [`Regex::captures`] returns a [`Captures`], which reports both the byte |
50 | | offsets of a match in a haystack and the byte offsets of each matching capture |
51 | | group from the regex in the haystack. |
52 | | [`Regex::captures_iter`] returns an iterator over all such matches. |
53 | | |
54 | | There is also a [`RegexSet`], which permits searching for multiple regex |
55 | | patterns simultaneously in a single search. However, it currently only reports |
56 | | which patterns match and *not* the byte offsets of a match. |
57 | | |
58 | | Otherwise, this top-level crate documentation is organized as follows: |
59 | | |
60 | | * [Usage](#usage) shows how to add the `regex` crate to your Rust project. |
61 | | * [Examples](#examples) provides a limited selection of regex search examples. |
62 | | * [Performance](#performance) provides a brief summary of how to optimize regex |
63 | | searching speed. |
64 | | * [Unicode](#unicode) discusses support for non-ASCII patterns. |
65 | | * [Syntax](#syntax) enumerates the specific regex syntax supported by this |
66 | | crate. |
67 | | * [Untrusted input](#untrusted-input) discusses how this crate deals with regex |
68 | | patterns or haystacks that are untrusted. |
69 | | * [Crate features](#crate-features) documents the Cargo features that can be |
70 | | enabled or disabled for this crate. |
71 | | * [Other crates](#other-crates) links to other crates in the `regex` family. |
72 | | |
73 | | # Usage |
74 | | |
75 | | The `regex` crate is [on crates.io](https://crates.io/crates/regex) and can be |
76 | | used by adding `regex` to your dependencies in your project's `Cargo.toml`. |
77 | | Or more simply, just run `cargo add regex`. |
78 | | |
79 | | Here is a complete example that creates a new Rust project, adds a dependency |
80 | | on `regex`, creates the source code for a regex search and then runs the |
81 | | program. |
82 | | |
83 | | First, create the project in a new directory: |
84 | | |
85 | | ```text |
86 | | $ mkdir regex-example |
87 | | $ cd regex-example |
88 | | $ cargo init |
89 | | ``` |
90 | | |
91 | | Second, add a dependency on `regex`: |
92 | | |
93 | | ```text |
94 | | $ cargo add regex |
95 | | ``` |
96 | | |
97 | | Third, edit `src/main.rs`. Delete what's there and replace it with this: |
98 | | |
99 | | ``` |
100 | | use regex::Regex; |
101 | | |
102 | | fn main() { |
103 | | let re = Regex::new(r"Hello (?<name>\w+)!").unwrap(); |
104 | | let Some(caps) = re.captures("Hello Murphy!") else { |
105 | | println!("no match!"); |
106 | | return; |
107 | | }; |
108 | | println!("The name is: {}", &caps["name"]); |
109 | | } |
110 | | ``` |
111 | | |
112 | | Fourth, run it with `cargo run`: |
113 | | |
114 | | ```text |
115 | | $ cargo run |
116 | | Compiling memchr v2.5.0 |
117 | | Compiling regex-syntax v0.7.1 |
118 | | Compiling aho-corasick v1.0.1 |
119 | | Compiling regex v1.8.1 |
120 | | Compiling regex-example v0.1.0 (/tmp/regex-example) |
121 | | Finished dev [unoptimized + debuginfo] target(s) in 4.22s |
122 | | Running `target/debug/regex-example` |
123 | | The name is: Murphy |
124 | | ``` |
125 | | |
126 | | Running the program for the first time will show compiler output like the
127 | | above, but subsequent runs shouldn't have to re-compile the dependencies.
128 | | |
129 | | # Examples |
130 | | |
131 | | This section provides a few examples, in tutorial style, showing how to |
132 | | search a haystack with a regex. There are more examples throughout the API |
133 | | documentation. |
134 | | |
135 | | Before starting though, it's worth defining a few terms: |
136 | | |
137 | | * A **regex** is a Rust value whose type is `Regex`. We use `re` as a |
138 | | variable name for a regex. |
139 | | * A **pattern** is the string that is used to build a regex. We use `pat` as |
140 | | a variable name for a pattern. |
141 | | * A **haystack** is the string that is searched by a regex. We use `hay` as a |
142 | | variable name for a haystack. |
143 | | |
144 | | Sometimes the words "regex" and "pattern" are used interchangeably. |
145 | | |
146 | | General use of regular expressions in this crate proceeds by compiling a |
147 | | **pattern** into a **regex**, and then using that regex to search, split or |
148 | | replace parts of a **haystack**. |
149 | | |
150 | | ### Example: find a middle initial |
151 | | |
152 | | We'll start off with a very simple example: a regex that looks for a specific |
153 | | name but uses a wildcard to match a middle initial. Our pattern serves as |
154 | | something like a template that will match a particular name with *any* middle |
155 | | initial. |
156 | | |
157 | | ```rust |
158 | | use regex::Regex; |
159 | | |
160 | | // We use 'unwrap()' here because it would be a bug in our program if the |
161 | | // pattern failed to compile to a regex. Panicking in the presence of a bug |
162 | | // is okay. |
163 | | let re = Regex::new(r"Homer (.)\. Simpson").unwrap(); |
164 | | let hay = "Homer J. Simpson"; |
165 | | let Some(caps) = re.captures(hay) else { return }; |
166 | | assert_eq!("J", &caps[1]); |
167 | | ``` |
168 | | |
169 | | There are a few things worth noticing here in our first example: |
170 | | |
171 | | * The `.` is a special pattern meta character that means "match any single |
172 | | character except for new lines." (More precisely, in this crate, it means |
173 | | "match any UTF-8 encoding of any Unicode scalar value other than `\n`.") |
174 | | * We can match an actual `.` literally by escaping it, i.e., `\.`. |
175 | | * We use Rust's [raw strings] to avoid needing to deal with escape sequences in |
176 | | both the regex pattern syntax and in Rust's string literal syntax. If we didn't |
177 | | use raw strings here, we would have had to use `\\.` to match a literal `.` |
178 | | character. That is, `r"\."` and `"\\."` are equivalent patterns. |
179 | | * We put our wildcard `.` instruction in parentheses. These parentheses have a |
180 | | special meaning that says, "make whatever part of the haystack matches within |
181 | | these parentheses available as a capturing group." After finding a match, we |
182 | | access this capture group with `&caps[1]`. |
183 | | |
184 | | [raw strings]: https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals |
185 | | |
186 | | Otherwise, we execute a search using `re.captures(hay)` and return from our |
187 | | function if no match occurred. We then reference the middle initial by asking |
188 | | for the part of the haystack that matched the capture group indexed at `1`. |
189 | | (The capture group at index 0 is implicit and always corresponds to the entire |
190 | | match. In this case, that's `Homer J. Simpson`.) |
191 | | |
192 | | ### Example: named capture groups |
193 | | |
194 | | Continuing from our middle initial example above, we can tweak the pattern |
195 | | slightly to give a name to the group that matches the middle initial: |
196 | | |
197 | | ```rust |
198 | | use regex::Regex; |
199 | | |
200 | | // Note that (?P<middle>.) is a different way to spell the same thing. |
201 | | let re = Regex::new(r"Homer (?<middle>.)\. Simpson").unwrap(); |
202 | | let hay = "Homer J. Simpson"; |
203 | | let Some(caps) = re.captures(hay) else { return }; |
204 | | assert_eq!("J", &caps["middle"]); |
205 | | ``` |
206 | | |
207 | | Giving a name to a group can be useful when there are multiple groups in |
208 | | a pattern. It makes the code referring to those groups a bit easier to |
209 | | understand. |
210 | | |
211 | | ### Example: validating a particular date format |
212 | | |
213 | | This example shows how to confirm whether a haystack, in its entirety, matches
214 | | a particular date format: |
215 | | |
216 | | ```rust |
217 | | use regex::Regex; |
218 | | |
219 | | let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); |
220 | | assert!(re.is_match("2010-03-14")); |
221 | | ``` |
222 | | |
223 | | Notice the use of the `^` and `$` anchors. In this crate, every regex search is |
224 | | run with an implicit `(?s:.)*?` at the beginning of its pattern, which allows |
225 | | the regex to match anywhere in a haystack. Anchors, as above, can be used to |
226 | | ensure that the full haystack matches a pattern. |
227 | | |
228 | | This crate is also Unicode aware by default, which means that `\d` might match |
229 | | more than you might expect it to. For example: |
230 | | |
231 | | ```rust |
232 | | use regex::Regex; |
233 | | |
234 | | let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); |
235 | | assert!(re.is_match("𝟚𝟘𝟙𝟘-𝟘𝟛-𝟙𝟜")); |
236 | | ``` |
237 | | |
238 | | To only match an ASCII decimal digit, all of the following are equivalent: |
239 | | |
240 | | * `[0-9]` |
241 | | * `(?-u:\d)` |
242 | | * `[[:digit:]]` |
243 | | * `[\d&&\p{ascii}]` |
244 | | |
245 | | ### Example: finding dates in a haystack |
246 | | |
247 | | In the previous example, we showed how one might validate that a haystack, |
248 | | in its entirety, corresponded to a particular date format. But what if we wanted |
249 | | to extract all things that look like dates in a specific format from a haystack? |
250 | | To do this, we can use an iterator API to find all matches (notice that we've |
251 | | removed the anchors and switched to looking for ASCII-only digits): |
252 | | |
253 | | ```rust |
254 | | use regex::Regex; |
255 | | |
256 | | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); |
257 | | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; |
258 | | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
259 | | let dates: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect(); |
260 | | assert_eq!(dates, vec![ |
261 | | "1865-04-14", |
262 | | "1881-07-02", |
263 | | "1901-09-06", |
264 | | "1963-11-22", |
265 | | ]); |
266 | | ``` |
267 | | |
268 | | We can also iterate over [`Captures`] values instead of [`Match`] values, and |
269 | | that in turn permits accessing each component of the date via capturing groups: |
270 | | |
271 | | ```rust |
272 | | use regex::Regex; |
273 | | |
274 | | let re = Regex::new(r"(?<y>[0-9]{4})-(?<m>[0-9]{2})-(?<d>[0-9]{2})").unwrap(); |
275 | | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; |
276 | | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
277 | | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
278 | | // The unwraps are okay because every capture group must match if the whole |
279 | | // regex matches, and in this context, we know we have a match. |
280 | | // |
281 | | // Note that we use `caps.name("y").unwrap().as_str()` instead of |
282 | | // `&caps["y"]` because the lifetime of the former is the same as the |
283 | | // lifetime of `hay` above, but the lifetime of the latter is tied to the |
284 | | // lifetime of `caps` due to how the `Index` trait is defined. |
285 | | let year = caps.name("y").unwrap().as_str(); |
286 | | let month = caps.name("m").unwrap().as_str(); |
287 | | let day = caps.name("d").unwrap().as_str(); |
288 | | (year, month, day) |
289 | | }).collect(); |
290 | | assert_eq!(dates, vec![ |
291 | | ("1865", "04", "14"), |
292 | | ("1881", "07", "02"), |
293 | | ("1901", "09", "06"), |
294 | | ("1963", "11", "22"), |
295 | | ]); |
296 | | ``` |
297 | | |
298 | | ### Example: simpler capture group extraction |
299 | | |
300 | | One can use [`Captures::extract`] to make the code from the previous example a |
301 | | bit simpler in this case: |
302 | | |
303 | | ```rust |
304 | | use regex::Regex; |
305 | | |
306 | | let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); |
307 | | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; |
308 | | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
309 | | let (_, [year, month, day]) = caps.extract(); |
310 | | (year, month, day) |
311 | | }).collect(); |
312 | | assert_eq!(dates, vec![ |
313 | | ("1865", "04", "14"), |
314 | | ("1881", "07", "02"), |
315 | | ("1901", "09", "06"), |
316 | | ("1963", "11", "22"), |
317 | | ]); |
318 | | ``` |
319 | | |
320 | | `Captures::extract` works by ensuring that the number of matching groups
321 | | matches the number requested via the `[year, month, day]` syntax. If they do,
322 | | then the substrings for each corresponding capture group are automatically |
323 | | returned in an appropriately sized array. Rust's syntax for pattern matching |
324 | | arrays does the rest. |
325 | | |
326 | | ### Example: replacement with named capture groups |
327 | | |
328 | | Building on the previous example, perhaps we'd like to rearrange the date |
329 | | formats. This can be done by finding each match and replacing it with |
330 | | something different. The [`Regex::replace_all`] routine provides a convenient |
331 | | way to do this, including by supporting references to named groups in the |
332 | | replacement string: |
333 | | |
334 | | ```rust |
335 | | use regex::Regex; |
336 | | |
337 | | let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})").unwrap(); |
338 | | let before = "1973-01-05, 1975-08-25 and 1980-10-18"; |
339 | | let after = re.replace_all(before, "$m/$d/$y"); |
340 | | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980"); |
341 | | ``` |
342 | | |
343 | | The replace methods are actually polymorphic in the replacement, which |
344 | | provides more flexibility than is seen here. (See the documentation for |
345 | | [`Regex::replace`] for more details.) |
346 | | |
347 | | ### Example: verbose mode |
348 | | |
349 | | When your regex gets complicated, you might consider using something other |
350 | | than regex. But if you stick with regex, you can use the `x` flag to enable |
351 | | insignificant whitespace mode or "verbose mode." In this mode, whitespace |
352 | | is treated as insignificant and one may write comments. This may make your |
353 | | patterns easier to comprehend. |
354 | | |
355 | | ```rust |
356 | | use regex::Regex; |
357 | | |
358 | | let re = Regex::new(r"(?x) |
359 | | (?P<y>\d{4}) # the year, including all Unicode digits |
360 | | - |
361 | | (?P<m>\d{2}) # the month, including all Unicode digits |
362 | | - |
363 | | (?P<d>\d{2}) # the day, including all Unicode digits |
364 | | ").unwrap(); |
365 | | |
366 | | let before = "1973-01-05, 1975-08-25 and 1980-10-18"; |
367 | | let after = re.replace_all(before, "$m/$d/$y"); |
368 | | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980"); |
369 | | ``` |
370 | | |
371 | | If you wish to match against whitespace in this mode, you can still use `\s`, |
372 | | `\n`, `\t`, etc. For escaping a single space character, you can escape it |
373 | | directly with `\ `, use its hex character code `\x20` or temporarily disable |
374 | | the `x` flag, e.g., `(?-x: )`. |
375 | | |
376 | | ### Example: match multiple regular expressions simultaneously |
377 | | |
378 | | This demonstrates how to use a [`RegexSet`] to match multiple (possibly |
379 | | overlapping) regexes in a single scan of a haystack: |
380 | | |
381 | | ```rust |
382 | | use regex::RegexSet; |
383 | | |
384 | | let set = RegexSet::new(&[ |
385 | | r"\w+", |
386 | | r"\d+", |
387 | | r"\pL+", |
388 | | r"foo", |
389 | | r"bar", |
390 | | r"barfoo", |
391 | | r"foobar", |
392 | | ]).unwrap(); |
393 | | |
394 | | // Iterate over and collect all of the matches. Each match corresponds to the |
395 | | // ID of the matching pattern. |
396 | | let matches: Vec<_> = set.matches("foobar").into_iter().collect(); |
397 | | assert_eq!(matches, vec![0, 2, 3, 4, 6]); |
398 | | |
399 | | // You can also test whether a particular regex matched: |
400 | | let matches = set.matches("foobar"); |
401 | | assert!(!matches.matched(5)); |
402 | | assert!(matches.matched(6)); |
403 | | ``` |
404 | | |
405 | | # Performance |
406 | | |
407 | | This section briefly discusses a few concerns regarding the speed and resource |
408 | | usage of regexes. |
409 | | |
410 | | ### Only ask for what you need |
411 | | |
412 | | When running a search with a regex, there are generally three different types |
413 | | of information one can ask for: |
414 | | |
415 | | 1. Does a regex match in a haystack? |
416 | | 2. Where does a regex match in a haystack? |
417 | | 3. Where does each of the capturing groups match in a haystack?
418 | | |
419 | | Generally speaking, this crate could provide a function to answer only #3, |
420 | | which would subsume #1 and #2 automatically. However, it can be significantly |
421 | | more expensive to compute the location of capturing group matches, so it's best |
422 | | not to do it if you don't need to. |
423 | | |
424 | | Therefore, only ask for what you need. For example, don't use [`Regex::find`] |
425 | | if you only need to test if a regex matches a haystack. Use [`Regex::is_match`] |
426 | | instead. |
427 | | |
428 | | ### Unicode can impact memory usage and search speed |
429 | | |
430 | | This crate has first class support for Unicode and it is **enabled by default**. |
431 | | In many cases, the extra memory required to support it will be negligible and |
432 | | it typically won't impact search speed. But it can in some cases. |
433 | | |
434 | | With respect to memory usage, the impact of Unicode principally manifests |
435 | | through the use of Unicode character classes. Unicode character classes |
436 | | tend to be quite large. For example, `\w` by default matches around 140,000 |
437 | | distinct codepoints. This requires additional memory, and tends to slow down |
438 | | regex compilation. While a `\w` here and there is unlikely to be noticed, |
439 | | writing `\w{100}`, for example, will result in quite a large regex by default.
440 | | Indeed, `\w` is considerably larger than its ASCII-only version, so if your |
441 | | requirements are satisfied by ASCII, it's probably a good idea to stick to |
442 | | ASCII classes. The ASCII-only version of `\w` can be spelled in a number of |
443 | | ways. All of the following are equivalent: |
444 | | |
445 | | * `[0-9A-Za-z_]` |
446 | | * `(?-u:\w)` |
447 | | * `[[:word:]]` |
448 | | * `[\w&&\p{ascii}]` |
449 | | |
450 | | With respect to search speed, Unicode tends to be handled pretty well, even when |
451 | | using large Unicode character classes. However, some of the faster internal |
452 | | regex engines cannot handle a Unicode aware word boundary assertion. So if you |
453 | | don't need Unicode-aware word boundary assertions, you might consider using |
454 | | `(?-u:\b)` instead of `\b`, where the former uses an ASCII-only definition of |
455 | | a word character. |
456 | | |
457 | | ### Literals might accelerate searches |
458 | | |
459 | | This crate tends to be quite good at recognizing literals in a regex pattern |
460 | | and using them to accelerate a search. If it is at all possible to include |
461 | | some kind of literal in your pattern, then it might make searches substantially
462 | | faster. For example, in the regex `\w+@\w+`, the engine will look for |
463 | | occurrences of `@` and then try a reverse match for `\w+` to find the start |
464 | | position. |
465 | | |
466 | | ### Avoid re-compiling regexes, especially in a loop |
467 | | |
468 | | It is an anti-pattern to compile the same pattern in a loop since regex |
469 | | compilation is typically expensive. (It takes anywhere from a few microseconds |
470 | | to a few **milliseconds** depending on the size of the pattern.) Not only is |
471 | | compilation itself expensive, but this also prevents optimizations that reuse |
472 | | allocations internally to the regex engine. |
473 | | |
474 | | In Rust, it can sometimes be a pain to pass regexes around if they're used from |
475 | | inside a helper function. Instead, we recommend using crates like [`once_cell`] |
476 | | and [`lazy_static`] to ensure that patterns are compiled exactly once. |
477 | | |
478 | | [`once_cell`]: https://crates.io/crates/once_cell |
479 | | [`lazy_static`]: https://crates.io/crates/lazy_static |
480 | | |
481 | | This example shows how to use `once_cell`: |
482 | | |
483 | | ```rust |
484 | | use { |
485 | | once_cell::sync::Lazy, |
486 | | regex::Regex, |
487 | | }; |
488 | | |
489 | | fn some_helper_function(haystack: &str) -> bool { |
490 | | static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"...").unwrap()); |
491 | | RE.is_match(haystack) |
492 | | } |
493 | | |
494 | | fn main() { |
495 | | assert!(some_helper_function("abc")); |
496 | | assert!(!some_helper_function("ac")); |
497 | | } |
498 | | ``` |
499 | | |
500 | | Specifically, in this example, the regex will be compiled when it is used for |
501 | | the first time. On subsequent uses, it will reuse the previously built `Regex`. |
502 | | Notice how one can define the `Regex` locally to a specific function. |
503 | | |
504 | | ### Sharing a regex across threads can result in contention |
505 | | |
506 | | While a single `Regex` can be freely used from multiple threads simultaneously, |
507 | | there is a small synchronization cost that must be paid. Generally speaking, |
508 | | one shouldn't expect to observe this unless the principal task in each thread |
509 | | is searching with the regex *and* most searches are on short haystacks. In this |
510 | | case, internal contention on shared resources can spike and increase latency, |
511 | | which in turn may slow down each individual search. |
512 | | |
513 | | One can work around this by cloning each `Regex` before sending it to another |
514 | | thread. The cloned regexes will still share the same internal read-only portion |
515 | | of their compiled state (it's reference counted), but each thread will get
516 | | optimized access to the mutable space that is used to run a search. In general, |
517 | | there is no additional cost in memory to doing this. The only cost is the added |
518 | | code complexity required to explicitly clone the regex. (If you share the same |
519 | | `Regex` across multiple threads, each thread still gets its own mutable space, |
520 | | but accessing that space is slower.) |
521 | | |
522 | | # Unicode |
523 | | |
524 | | This section discusses what kind of Unicode support this regex library has. |
525 | | Before showing some examples, we'll summarize the relevant points: |
526 | | |
527 | | * This crate almost fully implements "Basic Unicode Support" (Level 1) as |
528 | | specified by the [Unicode Technical Standard #18][UTS18]. The full details |
529 | | of what is supported are documented in [UNICODE.md] in the root of the regex |
530 | | crate repository. There is virtually no support for "Extended Unicode Support" |
531 | | (Level 2) from UTS#18. |
532 | | * The top-level [`Regex`] runs searches *as if* iterating over each of the |
533 | | codepoints in the haystack. That is, the fundamental atom of matching is a |
534 | | single codepoint. |
535 | | * [`bytes::Regex`], in contrast, permits disabling Unicode mode for part or all
536 | | of your pattern in all cases. When Unicode mode is disabled, then a search is |
537 | | run *as if* iterating over each byte in the haystack. That is, the fundamental |
538 | | atom of matching is a single byte. (A top-level `Regex` also permits disabling |
539 | | Unicode and thus matching *as if* it were one byte at a time, but only when |
540 | | doing so wouldn't permit matching invalid UTF-8.) |
541 | | * When Unicode mode is enabled (the default), `.` will match an entire Unicode |
542 | | scalar value, even when it is encoded using multiple bytes. When Unicode mode |
543 | | is disabled (e.g., `(?-u:.)`), then `.` will match a single byte in all cases. |
544 | | * The character classes `\w`, `\d` and `\s` are all Unicode-aware by default. |
545 | | Use `(?-u:\w)`, `(?-u:\d)` and `(?-u:\s)` to get their ASCII-only definitions. |
546 | | * Similarly, `\b` and `\B` use a Unicode definition of a "word" character. |
547 | | To get ASCII-only word boundaries, use `(?-u:\b)` and `(?-u:\B)`. This also |
548 | | applies to the special word boundary assertions. (That is, `\b{start}`, |
549 | | `\b{end}`, `\b{start-half}`, `\b{end-half}`.) |
550 | | * `^` and `$` are **not** Unicode-aware in multi-line mode. Namely, they only |
551 | | recognize `\n` (assuming CRLF mode is not enabled) and not any of the other |
552 | | forms of line terminators defined by Unicode. |
553 | | * Case insensitive searching is Unicode-aware and uses simple case folding. |
554 | | * Unicode general categories, scripts and many boolean properties are available |
555 | | by default via the `\p{property name}` syntax. |
556 | | * In all cases, matches are reported using byte offsets. Or more precisely, |
557 | | UTF-8 code unit offsets. This permits constant time indexing and slicing of the |
558 | | haystack. |
559 | | |
560 | | [UTS18]: https://unicode.org/reports/tr18/ |
561 | | [UNICODE.md]: https://github.com/rust-lang/regex/blob/master/UNICODE.md |
562 | | |
563 | | Patterns themselves are **only** interpreted as a sequence of Unicode scalar |
564 | | values. This means you can use Unicode characters directly in your pattern: |
565 | | |
566 | | ```rust |
567 | | use regex::Regex; |
568 | | |
569 | | let re = Regex::new(r"(?i)Δ+").unwrap(); |
570 | | let m = re.find("ΔδΔ").unwrap(); |
571 | | assert_eq!((0, 6), (m.start(), m.end())); |
572 | | // alternatively: |
573 | | assert_eq!(0..6, m.range()); |
574 | | ``` |
575 | | |
576 | | As noted above, Unicode general categories, scripts, script extensions, ages |
577 | | and a smattering of boolean properties are available as character classes. For |
578 | | example, you can match a sequence of numerals, Greek or Cherokee letters: |
579 | | |
580 | | ```rust |
581 | | use regex::Regex; |
582 | | |
583 | | let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap(); |
584 | | let m = re.find("abcΔᎠβⅠᏴγδⅡxyz").unwrap(); |
585 | | assert_eq!(3..23, m.range()); |
586 | | ``` |
587 | | |
588 | | While not specific to Unicode, this library also supports character class set |
589 | | operations. Namely, one can nest character classes arbitrarily and perform set |
590 | | operations on them. Those set operations are union (the default), intersection, |
591 | | difference and symmetric difference. These set operations tend to be most |
592 | | useful with Unicode character classes. For example, to match any codepoint |
593 | | that is both in the `Greek` script and in the `Letter` general category: |
594 | | |
595 | | ```rust |
596 | | use regex::Regex; |
597 | | |
598 | | let re = Regex::new(r"[\p{Greek}&&\pL]+").unwrap(); |
599 | | let subs: Vec<&str> = re.find_iter("ΔδΔ𐅌ΔδΔ").map(|m| m.as_str()).collect(); |
600 | | assert_eq!(subs, vec!["ΔδΔ", "ΔδΔ"]); |
601 | | |
602 | | // If we just matched on Greek, then all of the codepoints would match!
603 | | let re = Regex::new(r"\p{Greek}+").unwrap(); |
604 | | let subs: Vec<&str> = re.find_iter("ΔδΔ𐅌ΔδΔ").map(|m| m.as_str()).collect(); |
605 | | assert_eq!(subs, vec!["ΔδΔ𐅌ΔδΔ"]); |
606 | | ``` |
607 | | |
608 | | ### Opt out of Unicode support |
609 | | |
610 | | The [`bytes::Regex`] type can be used to search `&[u8]` haystacks. By
611 | | default, haystacks are conventionally treated as UTF-8 just as they are with the
612 | | main `Regex` type. However, this behavior can be disabled by turning off the |
613 | | `u` flag, even if doing so could result in matching invalid UTF-8. For example, |
614 | | when the `u` flag is disabled, `.` will match any byte instead of any Unicode |
615 | | scalar value. |
616 | | |
617 | | Disabling the `u` flag is also possible with the standard `&str`-based `Regex` |
618 | | type, but it is only allowed where the UTF-8 invariant is maintained. For |
619 | | example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an |
620 | | `&str`-based `Regex`, but `(?-u:\W)` will attempt to match *any byte* that |
621 | | isn't in `(?-u:\w)`, which in turn includes bytes that are invalid UTF-8. |
622 | | Similarly, `(?-u:\xFF)` will attempt to match the raw byte `\xFF` (instead of |
623 | | `U+00FF`), which is invalid UTF-8 and therefore is illegal in `&str`-based |
624 | | regexes. |
625 | | |
626 | | Finally, since Unicode support requires bundling large Unicode data |
627 | | tables, this crate exposes knobs to disable the compilation of those |
628 | | data tables, which can be useful for shrinking binary size and reducing |
629 | | compilation times. For details on how to do that, see the section on [crate |
630 | | features](#crate-features). |
631 | | |
632 | | # Syntax |
633 | | |
634 | | The syntax supported in this crate is documented below. |
635 | | |
636 | | Note that the regular expression parser and abstract syntax are exposed in |
637 | | a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax). |
638 | | |
639 | | ### Matching one character |
640 | | |
641 | | <pre class="rust"> |
642 | | . any character except new line (includes new line with s flag) |
643 | | [0-9] any ASCII digit |
644 | | \d digit (\p{Nd}) |
645 | | \D not digit |
646 | | \pX Unicode character class identified by a one-letter name |
647 | | \p{Greek} Unicode character class (general category or script) |
648 | | \PX Negated Unicode character class identified by a one-letter name |
649 | | \P{Greek} negated Unicode character class (general category or script) |
650 | | </pre> |
651 | | |
652 | | ### Character classes |
653 | | |
654 | | <pre class="rust"> |
655 | | [xyz] A character class matching either x, y or z (union). |
656 | | [^xyz] A character class matching any character except x, y and z. |
657 | | [a-z] A character class matching any character in range a-z. |
658 | | [[:alpha:]] ASCII character class ([A-Za-z]) |
659 | | [[:^alpha:]] Negated ASCII character class ([^A-Za-z]) |
660 | | [x[^xyz]] Nested/grouping character class (matching any character except y and z) |
661 | | [a-y&&xyz] Intersection (matching x or y) |
662 | | [0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4) |
663 | | [0-9--4] Direct subtraction (matching 0-9 except 4) |
664 | | [a-g~~b-h] Symmetric difference (matching `a` and `h` only) |
665 | | [\[\]] Escaping in character classes (matching [ or ]) |
666 | | [a&&b] An empty character class matching nothing |
667 | | </pre> |
668 | | |
669 | | Any named character class may appear inside a bracketed `[...]` character |
670 | | class. For example, `[\p{Greek}[:digit:]]` matches any ASCII digit or any |
671 | | codepoint in the `Greek` script. `[\p{Greek}&&\pL]` matches Greek letters. |
672 | | |
673 | | Precedence in character classes, from most binding to least: |
674 | | |
675 | | 1. Ranges: `[a-cd]` == `[[a-c]d]` |
676 | | 2. Union: `[ab&&bc]` == `[[ab]&&[bc]]` |
677 | | 3. Intersection, difference, symmetric difference. All three have equivalent |
678 | | precedence, and are evaluated in left-to-right order. For example, |
679 | | `[\pL--\p{Greek}&&\p{Uppercase}]` == `[[\pL--\p{Greek}]&&\p{Uppercase}]`. |
680 | | 4. Negation: `[^a-z&&b]` == `[^[a-z&&b]]`. |
681 | | |
682 | | ### Composites |
683 | | |
684 | | <pre class="rust"> |
685 | | xy concatenation (x followed by y) |
686 | | x|y alternation (x or y, prefer x) |
687 | | </pre> |
688 | | |
689 | | This example shows how an alternation works, and what it means to prefer a |
690 | | branch in the alternation over subsequent branches. |
691 | | |
692 | | ``` |
693 | | use regex::Regex; |
694 | | |
695 | | let haystack = "samwise"; |
696 | | // If 'samwise' comes first in our alternation, then it is |
697 | | // preferred as a match, even if the regex engine could |
698 | | // technically detect that 'sam' led to a match earlier. |
699 | | let re = Regex::new(r"samwise|sam").unwrap(); |
700 | | assert_eq!("samwise", re.find(haystack).unwrap().as_str()); |
701 | | // But if 'sam' comes first, then it will match instead. |
702 | | // In this case, it is impossible for 'samwise' to match |
703 | | // because 'sam' is a prefix of it. |
704 | | let re = Regex::new(r"sam|samwise").unwrap(); |
705 | | assert_eq!("sam", re.find(haystack).unwrap().as_str()); |
706 | | ``` |
707 | | |
708 | | ### Repetitions |
709 | | |
710 | | <pre class="rust"> |
711 | | x* zero or more of x (greedy) |
712 | | x+ one or more of x (greedy) |
713 | | x? zero or one of x (greedy) |
714 | | x*? zero or more of x (ungreedy/lazy) |
715 | | x+? one or more of x (ungreedy/lazy) |
716 | | x?? zero or one of x (ungreedy/lazy) |
717 | | x{n,m} at least n x and at most m x (greedy) |
718 | | x{n,} at least n x (greedy) |
719 | | x{n} exactly n x |
720 | | x{n,m}? at least n x and at most m x (ungreedy/lazy) |
721 | | x{n,}? at least n x (ungreedy/lazy) |
722 | | x{n}? exactly n x |
723 | | </pre> |
724 | | |
725 | | ### Empty matches |
726 | | |
727 | | <pre class="rust"> |
728 | | ^ the beginning of a haystack (or start-of-line with multi-line mode) |
729 | | $ the end of a haystack (or end-of-line with multi-line mode) |
730 | | \A only the beginning of a haystack (even with multi-line mode enabled) |
731 | | \z only the end of a haystack (even with multi-line mode enabled) |
732 | | \b a Unicode word boundary (\w on one side and \W, \A, or \z on other) |
733 | | \B not a Unicode word boundary |
734 | | \b{start}, \< a Unicode start-of-word boundary (\W|\A on the left, \w on the right) |
735 | | \b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
736 | | \b{start-half} half of a Unicode start-of-word boundary (\W|\A on the left) |
737 | | \b{end-half} half of a Unicode end-of-word boundary (\W|\z on the right) |
738 | | </pre> |
739 | | |
740 | | The empty regex is valid and matches the empty string. For example, the |
741 | | empty regex matches `abc` at positions `0`, `1`, `2` and `3`. When using the |
742 | | top-level [`Regex`] on `&str` haystacks, an empty match that splits a codepoint |
743 | | is guaranteed to never be returned. However, such matches are permitted when |
744 | | using a [`bytes::Regex`]. For example: |
745 | | |
746 | | ```rust |
747 | | let re = regex::Regex::new(r"").unwrap(); |
748 | | let ranges: Vec<_> = re.find_iter("💩").map(|m| m.range()).collect(); |
749 | | assert_eq!(ranges, vec![0..0, 4..4]); |
750 | | |
751 | | let re = regex::bytes::Regex::new(r"").unwrap(); |
752 | | let ranges: Vec<_> = re.find_iter("💩".as_bytes()).map(|m| m.range()).collect(); |
753 | | assert_eq!(ranges, vec![0..0, 1..1, 2..2, 3..3, 4..4]); |
754 | | ``` |
755 | | |
756 | | Note that an empty regex is distinct from a regex that can never match. |
757 | | For example, the regex `[a&&b]` is a character class that represents the |
758 | | intersection of `a` and `b`. That intersection is empty, which means the |
759 | | character class is empty. Since nothing is in the empty set, `[a&&b]` matches |
760 | | nothing, not even the empty string. |
761 | | |
762 | | ### Grouping and flags |
763 | | |
764 | | <pre class="rust"> |
765 | | (exp) numbered capture group (indexed by opening parenthesis) |
766 | | (?P<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
767 | | (?<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
768 | | (?:exp) non-capturing group |
769 | | (?flags) set flags within current group |
770 | | (?flags:exp) set flags for exp (non-capturing) |
771 | | </pre> |
772 | | |
773 | | Capture group names must be any sequence of alpha-numeric Unicode codepoints, |
774 | | in addition to `.`, `_`, `[` and `]`. Names must start with either an `_` or |
775 | | an alphabetic codepoint. Alphabetic codepoints correspond to the `Alphabetic` |
776 | | Unicode property, while numeric codepoints correspond to the union of the |
777 | | `Decimal_Number`, `Letter_Number` and `Other_Number` general categories. |
778 | | |
779 | | Flags are each a single character. For example, `(?x)` sets the flag `x` |
780 | | and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at |
781 | | the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets |
782 | | the `x` flag and clears the `y` flag. |
783 | | |
784 | | All flags are by default disabled unless stated otherwise. They are: |
785 | | |
786 | | <pre class="rust"> |
787 | | i case-insensitive: letters match both upper and lower case |
788 | | m multi-line mode: ^ and $ match begin/end of line |
789 | | s allow . to match \n |
790 | | R     enables CRLF mode: when multi-line mode is enabled, \r\n is treated as a line terminator
791 | | U swap the meaning of x* and x*? |
792 | | u Unicode support (enabled by default) |
793 | | x     verbose mode, ignores whitespace and allows line comments (starting with `#`)
794 | | </pre> |
795 | | |
796 | | Note that in verbose mode, whitespace is ignored everywhere, including within |
797 | | character classes. To insert whitespace, use its escaped form or a hex literal. |
798 | | For example, `\ ` or `\x20` for an ASCII space. |
799 | | |
800 | | Flags can be toggled within a pattern. Here's an example that matches |
801 | | case-insensitively for the first part but case-sensitively for the second part: |
802 | | |
803 | | ```rust |
804 | | use regex::Regex; |
805 | | |
806 | | let re = Regex::new(r"(?i)a+(?-i)b+").unwrap(); |
807 | | let m = re.find("AaAaAbbBBBb").unwrap(); |
808 | | assert_eq!(m.as_str(), "AaAaAbb"); |
809 | | ``` |
810 | | |
811 | | Notice that the `a+` matches either `a` or `A`, but the `b+` only matches |
812 | | `b`. |
813 | | |
814 | | Multi-line mode means `^` and `$` no longer match just at the beginning/end of |
815 | | the input, but also at the beginning/end of lines: |
816 | | |
817 | | ``` |
818 | | use regex::Regex; |
819 | | |
820 | | let re = Regex::new(r"(?m)^line \d+").unwrap(); |
821 | | let m = re.find("line one\nline 2\n").unwrap(); |
822 | | assert_eq!(m.as_str(), "line 2"); |
823 | | ``` |
824 | | |
825 | | Note that `^` matches after new lines, even at the end of input: |
826 | | |
827 | | ``` |
828 | | use regex::Regex; |
829 | | |
830 | | let re = Regex::new(r"(?m)^").unwrap(); |
831 | | let m = re.find_iter("test\n").last().unwrap(); |
832 | | assert_eq!((m.start(), m.end()), (5, 5)); |
833 | | ``` |
834 | | |
835 | | When both CRLF mode and multi-line mode are enabled, then `^` and `$` will |
836 | | match at either `\r` or `\n`, but never in the middle of a `\r\n`:
837 | | |
838 | | ``` |
839 | | use regex::Regex; |
840 | | |
841 | | let re = Regex::new(r"(?mR)^foo$").unwrap(); |
842 | | let m = re.find("\r\nfoo\r\n").unwrap(); |
843 | | assert_eq!(m.as_str(), "foo"); |
844 | | ``` |
845 | | |
846 | | Unicode mode can also be selectively disabled, although only when the result |
847 | | *would not* match invalid UTF-8. One good example of this is using an ASCII |
848 | | word boundary instead of a Unicode word boundary, which might make some regex |
849 | | searches run faster: |
850 | | |
851 | | ```rust |
852 | | use regex::Regex; |
853 | | |
854 | | let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap(); |
855 | | let m = re.find("$$abc$$").unwrap(); |
856 | | assert_eq!(m.as_str(), "abc"); |
857 | | ``` |
858 | | |
859 | | ### Escape sequences |
860 | | |
861 | | Note that this includes all possible escape sequences, even ones that are |
862 | | documented elsewhere. |
863 | | |
864 | | <pre class="rust"> |
865 | | \* literal *, applies to all ASCII except [0-9A-Za-z<>] |
866 | | \a bell (\x07) |
867 | | \f form feed (\x0C) |
868 | | \t horizontal tab |
869 | | \n new line |
870 | | \r carriage return |
871 | | \v vertical tab (\x0B) |
872 | | \A matches at the beginning of a haystack |
873 | | \z matches at the end of a haystack |
874 | | \b word boundary assertion |
875 | | \B negated word boundary assertion |
876 | | \b{start}, \< start-of-word boundary assertion |
877 | | \b{end}, \> end-of-word boundary assertion |
878 | | \b{start-half} half of a start-of-word boundary assertion |
879 | | \b{end-half}   half of an end-of-word boundary assertion
880 | | \123 octal character code, up to three digits (when enabled) |
881 | | \x7F hex character code (exactly two digits) |
882 | | \x{10FFFF} any hex character code corresponding to a Unicode code point |
883 | | \u007F hex character code (exactly four digits) |
884 | | \u{7F} any hex character code corresponding to a Unicode code point |
885 | | \U0000007F hex character code (exactly eight digits) |
886 | | \U{7F} any hex character code corresponding to a Unicode code point |
887 | | \p{Letter} Unicode character class |
888 | | \P{Letter} negated Unicode character class |
889 | | \d, \s, \w Perl character class |
890 | | \D, \S, \W negated Perl character class |
891 | | </pre> |
892 | | |
893 | | ### Perl character classes (Unicode friendly) |
894 | | |
895 | | These classes are based on the definitions provided in |
896 | | [UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties): |
897 | | |
898 | | <pre class="rust"> |
899 | | \d digit (\p{Nd}) |
900 | | \D not digit |
901 | | \s whitespace (\p{White_Space}) |
902 | | \S not whitespace |
903 | | \w word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control}) |
904 | | \W not word character |
905 | | </pre> |
906 | | |
907 | | ### ASCII character classes |
908 | | |
909 | | These classes are based on the definitions provided in |
910 | | [UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties): |
911 | | |
912 | | <pre class="rust"> |
913 | | [[:alnum:]] alphanumeric ([0-9A-Za-z]) |
914 | | [[:alpha:]] alphabetic ([A-Za-z]) |
915 | | [[:ascii:]] ASCII ([\x00-\x7F]) |
916 | | [[:blank:]] blank ([\t ]) |
917 | | [[:cntrl:]] control ([\x00-\x1F\x7F]) |
918 | | [[:digit:]] digits ([0-9]) |
919 | | [[:graph:]] graphical ([!-~]) |
920 | | [[:lower:]] lower case ([a-z]) |
921 | | [[:print:]] printable ([ -~]) |
922 | | [[:punct:]] punctuation ([!-/:-@\[-`{-~]) |
923 | | [[:space:]] whitespace ([\t\n\v\f\r ]) |
924 | | [[:upper:]] upper case ([A-Z]) |
925 | | [[:word:]] word characters ([0-9A-Za-z_]) |
926 | | [[:xdigit:]] hex digit ([0-9A-Fa-f]) |
927 | | </pre> |
928 | | |
929 | | # Untrusted input |
930 | | |
931 | | This crate is meant to be able to run regex searches on untrusted haystacks |
932 | | without fear of [ReDoS]. This crate also, to a certain extent, supports |
933 | | untrusted patterns. |
934 | | |
935 | | [ReDoS]: https://en.wikipedia.org/wiki/ReDoS |
936 | | |
937 | | This crate differs from most (but not all) other regex engines in that it |
938 | | doesn't use unbounded backtracking to run a regex search. In those cases, |
939 | | one generally cannot use untrusted patterns *or* untrusted haystacks because |
940 | | it can be very difficult to know whether a particular pattern will result in |
941 | | catastrophic backtracking or not. |
942 | | |
943 | | We'll first discuss how this crate deals with untrusted inputs and then wrap |
944 | | it up with a realistic discussion about what practice really looks like. |
945 | | |
946 | | ### Panics |
947 | | |
948 | | Outside of clearly documented cases, most APIs in this crate are intended to |
949 | | never panic regardless of the inputs given to them. For example, `Regex::new`, |
950 | | `Regex::is_match`, `Regex::find` and `Regex::captures` should never panic. That |
951 | | is, it is an API promise that those APIs will never panic no matter what inputs |
952 | | are given to them. With that said, regex engines are complicated beasts, and |
953 | | providing a rock solid guarantee that these APIs literally never panic is |
954 | | essentially equivalent to saying, "there are no bugs in this library." That is |
955 | | a bold claim, and not really one that can be feasibly made with a straight |
956 | | face. |
957 | | |
958 | | Don't get the wrong impression here. This crate is extensively tested, not just |
959 | | with unit and integration tests, but also via fuzz testing. For example, this |
960 | | crate is part of the [OSS-fuzz project]. Panics should be incredibly rare, but |
961 | | it is possible for bugs to exist, and thus possible for a panic to occur. If |
962 | | you need a rock solid guarantee against panics, then you should wrap calls into |
963 | | this library with [`std::panic::catch_unwind`]. |
964 | | |
965 | | It's also worth pointing out that this library will *generally* panic when |
966 | | other regex engines would commit undefined behavior. When undefined behavior |
967 | | occurs, your program might continue as if nothing bad has happened, but it also |
968 | | might mean your program is open to the worst kinds of exploits. In contrast, |
969 | | the worst thing a panic can do is a denial of service. |
970 | | |
971 | | [OSS-fuzz project]: https://android.googlesource.com/platform/external/oss-fuzz/+/refs/tags/android-t-preview-1/projects/rust-regex/ |
972 | | [`std::panic::catch_unwind`]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html |
973 | | |
974 | | ### Untrusted patterns |
975 | | |
976 | | The principal way this crate deals with untrusted patterns is by limiting their size by
977 | | default. The size limit can be configured via [`RegexBuilder::size_limit`]. The |
978 | | idea of a size limit is that compiling a pattern into a `Regex` will fail if it |
979 | | becomes "too big." Namely, while *most* resources consumed by compiling a regex |
980 | | are approximately proportional (albeit with some high constant factors in some |
981 | | cases, such as with Unicode character classes) to the length of the pattern |
982 | | itself, there is one particular exception to this: counted repetitions. Namely, |
983 | | this pattern: |
984 | | |
985 | | ```text |
986 | | a{5}{5}{5}{5}{5}{5} |
987 | | ``` |
988 | | |
989 | | Is equivalent to this pattern: |
990 | | |
991 | | ```text |
992 | | a{15625} |
993 | | ``` |
994 | | |
995 | | In both of these cases, the actual pattern string is quite small, but the |
996 | | resulting `Regex` value is quite large. Indeed, as the first pattern shows, |
997 | | it isn't enough to locally limit the size of each repetition because they can |
998 | | be stacked in a way that results in exponential growth. |
999 | | |
1000 | | To provide a bit more context, a simplified view of regex compilation looks |
1001 | | like this: |
1002 | | |
1003 | | * The pattern string is parsed into a structured representation called an AST. |
1004 | | Counted repetitions are not expanded and Unicode character classes are not |
1005 | | looked up in this stage. That is, the size of the AST is proportional to the |
1006 | | size of the pattern with "reasonable" constant factors. In other words, one |
1007 | | can reasonably limit the memory used by an AST by limiting the length of the |
1008 | | pattern string. |
1009 | | * The AST is translated into an HIR. Counted repetitions are still *not* |
1010 | | expanded at this stage, but Unicode character classes are embedded into the |
1011 | | HIR. The memory usage of an HIR is still proportional to the length of the
1012 | | original pattern string, but the constant factors---mostly as a result of |
1013 | | Unicode character classes---can be quite high. Still though, the memory used by |
1014 | | an HIR can be reasonably limited by limiting the length of the pattern string. |
1015 | | * The HIR is compiled into a [Thompson NFA]. This is the stage at which |
1016 | | something like `\w{5}` is rewritten to `\w\w\w\w\w`. Thus, this is the stage |
1017 | | at which [`RegexBuilder::size_limit`] is enforced. If the NFA exceeds the |
1018 | | configured size, then this stage will fail. |
1019 | | |
1020 | | [Thompson NFA]: https://en.wikipedia.org/wiki/Thompson%27s_construction |
1021 | | |
1022 | | The size limit helps avoid two different kinds of exorbitant resource usage: |
1023 | | |
1024 | | * It avoids permitting exponential memory usage based on the size of the |
1025 | | pattern string. |
1026 | | * It avoids long search times. This will be discussed in more detail in the |
1027 | | next section, but worst case search time *is* dependent on the size of the |
1028 | | regex. So keeping regexes limited to a reasonable size is also a way of keeping |
1029 | | search times reasonable. |
1030 | | |
1031 | | Finally, it's worth pointing out that regex compilation is guaranteed to take |
1032 | | worst case `O(m)` time, where `m` is proportional to the size of the regex. The
1033 | | size of the regex here is *after* the counted repetitions have been expanded. |
1034 | | |
1035 | | **Advice for those using untrusted regexes**: limit the pattern length to |
1036 | | something small and expand it as needed. Configure [`RegexBuilder::size_limit`] |
1037 | | to something small and then expand it as needed. |
1038 | | |
1039 | | ### Untrusted haystacks |
1040 | | |
1041 | | The main way this crate guards against searches from taking a long time is by |
1042 | | using algorithms that guarantee a `O(m * n)` worst case time and space bound. |
1043 | | Namely: |
1044 | | |
1045 | | * `m` is proportional to the size of the regex, where the size of the regex |
1046 | | includes the expansion of all counted repetitions. (See the previous section on |
1047 | | untrusted patterns.) |
1048 | | * `n` is proportional to the length, in bytes, of the haystack. |
1049 | | |
1050 | | In other words, if you consider `m` to be a constant (for example, the regex |
1051 | | pattern is a literal in the source code), then the search can be said to run |
1052 | | in "linear time." Or equivalently, "linear time with respect to the size of the |
1053 | | haystack." |
1054 | | |
1055 | | But the `m` factor here is important not to ignore. If a regex is |
1056 | | particularly big, the search times can get quite slow. This is why, in part, |
1057 | | [`RegexBuilder::size_limit`] exists. |
1058 | | |
1059 | | **Advice for those searching untrusted haystacks**: As long as your regexes |
1060 | | are not enormous, you should expect to be able to search untrusted haystacks |
1061 | | without fear. If you aren't sure, you should benchmark it. Unlike backtracking |
1062 | | engines, if your regex is so big that it's likely to result in slow searches, |
1063 | | this is probably something you'll be able to observe regardless of what the |
1064 | | haystack is made up of. |
1065 | | |
1066 | | ### Iterating over matches |
1067 | | |
1068 | | One thing that is perhaps easy to miss is that the worst case time |
1069 | | complexity bound of `O(m * n)` applies to methods like [`Regex::is_match`], |
1070 | | [`Regex::find`] and [`Regex::captures`]. It does **not** apply to |
1071 | | [`Regex::find_iter`] or [`Regex::captures_iter`]. Namely, since iterating over |
1072 | | all matches can execute many searches, and each search can scan the entire |
1073 | | haystack, the worst case time complexity for iterators is `O(m * n^2)`. |
1074 | | |
1075 | | One example of where this occurs is when a pattern consists of an alternation, |
1076 | | where an earlier branch of the alternation requires scanning the entire |
1077 | | haystack only to discover that there is no match, while a later branch of
1078 | | the alternation matches at the beginning of the haystack. For
1079 | | example, consider the pattern `.*[^A-Z]|[A-Z]` and the haystack `AAAAA`. The |
1080 | | first search will scan to the end looking for matches of `.*[^A-Z]` even though |
1081 | | a finite automata engine (as in this crate) knows that `[A-Z]` has already |
1082 | | matched the first character of the haystack. This is due to the greedy nature |
1083 | | of regex searching. That first search will report a match at the first `A` only |
1084 | | after scanning to the end to discover that no other match exists. The next |
1085 | | search then begins at the second `A` and the behavior repeats. |
1086 | | |
1087 | | There is no way to avoid this. This means that if both patterns and haystacks |
1088 | | are untrusted and you're iterating over all matches, you're susceptible to |
1089 | | worst case quadratic time complexity. One possible way to mitigate this |
1090 | | is to drop down to the lower level `regex-automata` crate and use its |
1091 | | `meta::Regex` iterator APIs. There, you can configure the search to operate |
1092 | | in "earliest" mode by passing a `Input::new(haystack).earliest(true)` to |
1093 | | `meta::Regex::find_iter` (for example). By enabling this mode, you give up |
1094 | | the normal greedy match semantics of regex searches and instead ask the regex |
1095 | | engine to immediately stop as soon as a match has been found. Enabling this |
1096 | | mode will thus restore the worst case `O(m * n)` time complexity bound, but at |
1097 | | the cost of different semantics. |
1098 | | |
1099 | | ### Untrusted inputs in practice |
1100 | | |
1101 | | While providing a `O(m * n)` worst case time bound on all searches goes a long |
1102 | | way toward preventing [ReDoS], that doesn't mean every search you can possibly |
1103 | | run will complete without burning CPU time. In general, there are a few ways |
1104 | | for the `m * n` time bound to still bite you: |
1105 | | |
1106 | | * You are searching an exceptionally long haystack. No matter how you slice |
1107 | | it, a longer haystack will take more time to search. This crate may often make |
1108 | | very quick work of even long haystacks because of its literal optimizations, |
1109 | | but those aren't available for all regexes. |
1110 | | * Unicode character classes can cause searches to be quite slow in some cases. |
1111 | | This is especially true when they are combined with counted repetitions. While |
1112 | | the regex size limit above will protect you from the most egregious cases, |
1113 | | the default size limit still permits pretty big regexes that can execute more |
1114 | | slowly than one might expect. |
1115 | | * While routines like [`Regex::find`] and [`Regex::captures`] guarantee |
1116 | | worst case `O(m * n)` search time, routines like [`Regex::find_iter`] and |
1117 | | [`Regex::captures_iter`] actually have worst case `O(m * n^2)` search time. |
1118 | | This is because `find_iter` runs many searches, and each search takes worst |
1119 | | case `O(m * n)` time. Thus, iteration of all matches in a haystack has |
1120 | | worst case `O(m * n^2)`. A good example of a pattern that exhibits this is |
1121 | | `(?:A+){1000}|` or even `.*[^A-Z]|[A-Z]`. |
1122 | | |
1123 | | In general, untrusted haystacks are easier to stomach than untrusted patterns.
1124 | | Untrusted patterns give a lot more control to the caller to impact the |
1125 | | performance of a search. In many cases, a regex search will actually execute in |
1126 | | average case `O(n)` time (i.e., not dependent on the size of the regex), but |
1127 | | this can't be guaranteed in general. Therefore, permitting untrusted patterns |
1128 | | means that your only line of defense is to put a limit on how big `m` (and |
1129 | | perhaps also `n`) can be in `O(m * n)`. `n` is limited by simply inspecting |
1130 | | the length of the haystack while `m` is limited by *both* applying a limit to |
1131 | | the length of the pattern *and* a limit on the compiled size of the regex via |
1132 | | [`RegexBuilder::size_limit`]. |
1133 | | |
1134 | | It bears repeating: if you're accepting untrusted patterns, it would be a good |
1135 | | idea to start with conservative limits on `m` and `n`, and then carefully |
1136 | | increase them as needed. |
1137 | | |
1138 | | # Crate features |
1139 | | |
1140 | | By default, this crate tries pretty hard to make regex matching both as fast |
1141 | | as possible and as correct as it can be. This means that there is a lot of |
1142 | | code dedicated to performance, the handling of Unicode data and the Unicode |
1143 | | data itself. Overall, this leads to more dependencies, larger binaries and |
1144 | | longer compile times. This trade off may not be appropriate in all cases, and |
1145 | | indeed, even when all Unicode and performance features are disabled, one is |
1146 | | still left with a perfectly serviceable regex engine that will work well in |
1147 | | many cases. (Note that code is not arbitrarily reducible, and for this reason, |
1148 | | the [`regex-lite`](https://docs.rs/regex-lite) crate exists to provide an even |
1149 | | more minimal experience by cutting out Unicode and performance, but still |
1150 | | maintaining the linear search time bound.) |
1151 | | |
1152 | | This crate exposes a number of features for controlling that trade off. Some |
1153 | | of these features are strictly performance oriented, such that disabling them |
1154 | | won't result in a loss of functionality, but may result in worse performance. |
1155 | | Other features, such as the ones controlling the presence or absence of Unicode |
1156 | | data, can result in a loss of functionality. For example, if one disables the |
1157 | | `unicode-case` feature (described below), then compiling the regex `(?i)a` |
1158 | | will fail since Unicode case insensitivity is enabled by default. Instead, |
1159 | | callers must use `(?i-u)a` to disable Unicode case folding. Stated differently, |
1160 | | enabling or disabling any of the features below can only add or subtract from |
1161 | | the total set of valid regular expressions. Enabling or disabling a feature |
1162 | | will never modify the match semantics of a regular expression. |
1163 | | |
1164 | | Most features below are enabled by default. Features that aren't enabled by |
1165 | | default are noted. |
1166 | | |
1167 | | ### Ecosystem features |
1168 | | |
1169 | | * **std** - |
1170 | | When enabled, this will cause `regex` to use the standard library. In terms |
1171 | | of APIs, `std` causes error types to implement the `std::error::Error` |
1172 | | trait. Enabling `std` will also result in performance optimizations, |
1173 | | including SIMD and faster synchronization primitives. Notably, **disabling |
1174 | | the `std` feature will result in the use of spin locks**. To use a regex |
1175 | | engine without `std` and without spin locks, you'll need to drop down to |
1176 | | the [`regex-automata`](https://docs.rs/regex-automata) crate. |
1177 | | * **logging** - |
1178 | | When enabled, the `log` crate is used to emit messages about regex |
1179 | | compilation and search strategies. This is **disabled by default**. This is |
1180 | | typically only useful to someone working on this crate's internals, but might |
1181 | | be useful if you're doing some rabbit hole performance hacking. Or if you're |
1182 | | just interested in the kinds of decisions being made by the regex engine. |
1183 | | |
1184 | | ### Performance features |
1185 | | |
1186 | | * **perf** - |
1187 | | Enables all performance related features except for `perf-dfa-full`. This |
1188 | | feature is enabled by default and is intended to cover all reasonable features
1189 | | that improve performance, even if more are added in the future. |
1190 | | * **perf-dfa** - |
1191 | | Enables the use of a lazy DFA for matching. The lazy DFA is used to compile |
1192 | | portions of a regex to a very fast DFA on an as-needed basis. This can |
1193 | | result in substantial speedups, usually by an order of magnitude on large |
1194 | | haystacks. The lazy DFA does not bring in any new dependencies, but it can |
1195 | | make compile times longer. |
1196 | | * **perf-dfa-full** - |
1197 | | Enables the use of a full DFA for matching. Full DFAs are problematic because |
1198 | | they have worst case `O(2^n)` construction time. For this reason, when this |
1199 | | feature is enabled, full DFAs are only used for very small regexes and a |
1200 | | very small space bound is used during determinization to prevent the DFA
1201 | | from blowing up. This feature is not enabled by default, even as part of |
1202 | | `perf`, because it results in fairly sizeable increases in binary size and |
1203 | | compilation time. It can result in faster search times, but they tend to be |
1204 | | more modest and limited to non-Unicode regexes. |
1205 | | * **perf-onepass** - |
1206 | | Enables the use of a one-pass DFA for extracting the positions of capture |
1207 | | groups. This optimization applies to a subset of certain types of NFAs and |
1208 | | represents the fastest engine in this crate for dealing with capture groups. |
1209 | | * **perf-backtrack** - |
1210 | | Enables the use of a bounded backtracking algorithm for extracting the |
1211 | | positions of capture groups. This usually sits between the slowest engine |
1212 | | (the PikeVM) and the fastest engine (one-pass DFA) for extracting capture |
1213 | | groups. It's used whenever the regex is not one-pass and is small enough. |
1214 | | * **perf-inline** - |
1215 | | Enables the use of aggressive inlining inside match routines. This reduces |
1216 | | the overhead of each match. The aggressive inlining, however, increases |
1217 | | compile times and binary size. |
1218 | | * **perf-literal** - |
1219 | | Enables the use of literal optimizations for speeding up matches. In some |
1220 | | cases, literal optimizations can result in speedups of _several_ orders of |
1221 | | magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies. |
1222 | | * **perf-cache** - |
1223 | | This feature used to enable a faster internal cache at the cost of using |
1224 | | additional dependencies, but this is no longer an option. A fast internal |
1225 | | cache is now used unconditionally with no additional dependencies. This may |
1226 | | change in the future. |
1227 | | |
1228 | | ### Unicode features |
1229 | | |
1230 | | * **unicode** - |
1231 | | Enables all Unicode features. This feature is enabled by default, and will |
1232 | | always cover all Unicode features, even if more are added in the future. |
1233 | | * **unicode-age** - |
1234 | | Provide the data for the |
1235 | | [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). |
1236 | | This makes it possible to use classes like `\p{Age:6.0}` to refer to all |
1237 | | codepoints first introduced in Unicode 6.0.
1238 | | * **unicode-bool** - |
1239 | | Provide the data for numerous Unicode boolean properties. The full list |
1240 | | is not included here, but contains properties like `Alphabetic`, `Emoji`, |
1241 | | `Lowercase`, `Math`, `Uppercase` and `White_Space`. |
1242 | | * **unicode-case** - |
1243 | | Provide the data for case insensitive matching using |
1244 | | [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches). |
1245 | | * **unicode-gencat** - |
1246 | | Provide the data for |
1247 | | [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values). |
1248 | | This includes, but is not limited to, `Decimal_Number`, `Letter`, |
1249 | | `Math_Symbol`, `Number` and `Punctuation`. |
1250 | | * **unicode-perl** - |
1251 | | Provide the data for supporting the Unicode-aware Perl character classes, |
1252 | | corresponding to `\w`, `\s` and `\d`. This is also necessary for using |
1253 | | Unicode-aware word boundary assertions. Note that if this feature is |
1254 | | disabled, the `\s` and `\d` character classes are still available if the |
1255 | | `unicode-bool` and `unicode-gencat` features are enabled, respectively. |
1256 | | * **unicode-script** - |
1257 | | Provide the data for |
1258 | | [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). |
1259 | | This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, |
1260 | | `Latin` and `Thai`. |
1261 | | * **unicode-segment** - |
1262 | | Provide the data for the properties used to implement the
1263 | | [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). |
1264 | | This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and |
1265 | | `\p{sb=ATerm}`. |
1266 | | |
1267 | | # Other crates |
1268 | | |
1269 | | This crate has two required dependencies and several optional dependencies. |
1270 | | This section briefly describes them with the goal of raising awareness of how |
1271 | | different components of this crate may be used independently. |
1272 | | |
1273 | | It is somewhat unusual for a regex engine to have dependencies, as most regex |
1274 | | libraries are self-contained units with no dependencies other than a particular
1275 | | environment's standard library. Indeed, for other similarly optimized regex |
1276 | | engines, most or all of the code in the dependencies of this crate would |
1277 | | normally just be inseparable or tightly coupled parts of the crate itself.
1278 | | But since Rust and its tooling ecosystem make the use of dependencies so
1279 | | easy, it made sense to spend some effort decoupling parts of this crate and
1280 | | making them independently useful.
1281 | | |
1282 | | We only briefly describe each crate here. |
1283 | | |
1284 | | * [`regex-lite`](https://docs.rs/regex-lite) is not a dependency of `regex`, |
1285 | | but rather, a standalone zero-dependency simpler version of `regex` that |
1286 | | prioritizes compile times and binary size. In exchange, it eschews Unicode |
1287 | | support and performance. Its match semantics are as close as possible to
1288 | | those of the `regex` crate, and for the things it supports, its APIs are
1289 | | identical to this crate's. In other words, for many use cases it is a
1290 | | drop-in replacement.
1291 | | * [`regex-syntax`](https://docs.rs/regex-syntax) provides a regular expression |
1292 | | parser via `Ast` and `Hir` types. It also provides routines for extracting |
1293 | | literals from a pattern. Folks can use this crate to do analysis, or even to |
1294 | | build their own regex engine without having to worry about writing a parser. |
1295 | | * [`regex-automata`](https://docs.rs/regex-automata) provides the regex engines |
1296 | | themselves. One of the downsides of finite automata based regex engines is that |
1297 | | they often need multiple internal engines in order to have similar or better |
1298 | | performance than an unbounded backtracking engine in practice. `regex-automata` |
1299 | | in particular provides public APIs for a PikeVM, a bounded backtracker, a |
1300 | | one-pass DFA, a lazy DFA, a fully compiled DFA and a meta regex engine that |
1301 | | combines them all. It also has native multi-pattern support and
1302 | | provides a way to compile and serialize full DFAs such that they can be loaded |
1303 | | and searched in a no-std no-alloc environment. `regex-automata` itself doesn't |
1304 | | even have a required dependency on `regex-syntax`! |
1305 | | * [`memchr`](https://docs.rs/memchr) provides low level SIMD vectorized |
1306 | | routines for quickly finding the location of single bytes or even substrings |
1307 | | in a haystack. In other words, it provides fast `memchr` and `memmem` routines. |
1308 | | These are used by this crate in literal optimizations. |
1309 | | * [`aho-corasick`](https://docs.rs/aho-corasick) provides multi-substring |
1310 | | search. It also provides SIMD vectorized routines in the case where the number |
1311 | | of substrings to search for is relatively small. The `regex` crate also uses |
1312 | | this for literal optimizations. |
1313 | | */ |
1314 | | |
1315 | | #![no_std] |
1316 | | #![deny(missing_docs)] |
1317 | | #![cfg_attr(feature = "pattern", feature(pattern))] |
1318 | | #![warn(missing_debug_implementations)] |
1319 | | |
1320 | | #[cfg(doctest)] |
1321 | | doc_comment::doctest!("../README.md"); |
1322 | | |
1323 | | extern crate alloc; |
1324 | | #[cfg(any(test, feature = "std"))] |
1325 | | extern crate std; |
1326 | | |
1327 | | pub use crate::error::Error; |
1328 | | |
1329 | | pub use crate::{builders::string::*, regex::string::*, regexset::string::*}; |
1330 | | |
1331 | | mod builders; |
1332 | | pub mod bytes; |
1333 | | mod error; |
1334 | | mod find_byte; |
1335 | | #[cfg(feature = "pattern")] |
1336 | | mod pattern; |
1337 | | mod regex; |
1338 | | mod regexset; |
1339 | | |
1340 | | /// Escapes all regular expression meta characters in `pattern`. |
1341 | | /// |
1342 | | /// The string returned may be safely used as a literal in a regular |
1343 | | /// expression. |
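///
/// # Example
///
/// A small sketch of the round trip: the escaped pattern matches the
/// original input literally.
///
/// ```rust
/// let s = regex::escape("1.5+2");
/// assert_eq!(s, r"1\.5\+2");
/// assert!(regex::Regex::new(&s).unwrap().is_match("1.5+2"));
/// ```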
1344 | 0 | pub fn escape(pattern: &str) -> alloc::string::String { |
1345 | 0 | regex_syntax::escape(pattern) |
1346 | 0 | } |