/build/cargo-vendor-dir/regex-automata-0.4.5/src/util/empty.rs
Line | Count | Source (jump to first uncovered line) |
1 | | /*! |
2 | | This module provides helper routines for dealing with zero-width matches. |
3 | | |
4 | | The main problem being solved here is this: |
5 | | |
6 | | 1. The caller wants to search something that they know is valid UTF-8, such |
7 | | as a Rust `&str`. |
8 | | 2. The regex used by the caller can match the empty string. For example, `a*`. |
9 | | 3. The caller should never get match offsets returned that occur within the |
10 | | encoding of a UTF-8 codepoint. It is logically incorrect, and also means that, |
11 | | e.g., slicing the `&str` at those offsets will lead to a panic. |
12 | | |
13 | | So the question here is, how do we prevent the caller from getting match |
14 | | offsets that split a codepoint? For example, strictly speaking, the regex `a*` |
15 | | matches `☃` at the positions `[0, 0]`, `[1, 1]`, `[2, 2]` and `[3, 3]` since |
16 | | the UTF-8 encoding of `☃` is `\xE2\x98\x83`. In particular, the `NFA` that |
17 | | underlies all of the matching engines in this crate doesn't have anything in |
18 | | its state graph that prevents matching between UTF-8 code units. Indeed, any |
19 | | engine derived from the `NFA` will match at those positions by virtue of the |
20 | | fact that the `NFA` is byte oriented. That is, its transitions are defined over |
21 | | bytes and the matching engines work by proceeding one byte at a time. |
22 | | |
23 | | (An alternative architecture would be to define the transitions in an `NFA` |
24 | | over codepoints, or `char`. And then make the matching engines proceed by |
25 | | decoding one codepoint at a time. This is a viable strategy, but it doesn't |
26 | | work for DFA matching engines because designing a fast and memory efficient |
27 | | transition table for an alphabet as large as Unicode is quite difficult. More |
28 | | to the point, the top-level `regex` crate supports matching on arbitrary bytes |
29 | | when Unicode mode is disabled and one is searching a `&[u8]`. So in that case, |
30 | | you can't just limit yourself to decoding codepoints and matching those. You |
31 | | really do need to be able to follow byte oriented transitions on the `NFA`.) |
32 | | |
33 | | In an older version of the regex crate, we handled this case not in the regex |
34 | | engine, but in the iterators over matches. Namely, since this case only arises |
35 | | when the match is empty, we "just" incremented the next starting position |
36 | | of the search by `N`, where `N` is the length of the codepoint encoded at |
37 | | the current position. The alternative or more "natural" solution of just |
38 | | incrementing by `1` would result in executing a search of `a*` on `☃` like |
39 | | this: |
40 | | |
41 | | * Start search at `0`. |
42 | | * Found match at `[0, 0]`. |
43 | | * Next start position is `0`. |
44 | | * To avoid an infinite loop, since it's an empty match, increment by `1`. |
45 | | * Start search at `1`. |
46 | | * Found match at `[1, 1]`. Oops. |
47 | | |
48 | | But if we instead incremented by `3` (the length in bytes of `☃`), then we get |
49 | | the following: |
50 | | |
51 | | * Start search at `0`. |
52 | | * Found match at `[0, 0]`. |
53 | | * Next start position is `0`. |
54 | | * To avoid an infinite loop, since it's an empty match, increment by `3`. |
55 | | * Start search at `3`. |
56 | | * Found match at `[3, 3]`. |
57 | | |
58 | | And we get the correct result. But does this technique work in all cases? |
59 | | Crucially, it requires that a zero-width match that splits a codepoint never |
60 | | occurs beyond the starting position of the search. Because if it did, merely |
61 | | incrementing the start position by the number of bytes in the codepoint at |
62 | | the current position wouldn't be enough. A zero-width match could just occur |
63 | | anywhere. It turns out that it is _almost_ true. We can convince ourselves by |
64 | | looking at all possible patterns that can match the empty string: |
65 | | |
66 | | * Patterns like `a*`, `a{0}`, `(?:)`, `a|` and `|a` all unconditionally match |
67 | | the empty string. That is, assuming there isn't an `a` at the current position, |
68 | | they will all match the empty string at the start of a search. There is no way |
69 | | to move past it because any other match would not be "leftmost." |
70 | | * `^` only matches at the beginning of the haystack, where the start position |
71 | | is `0`. Since we know we're searching valid UTF-8 (if it isn't valid UTF-8, |
72 | | then this entire problem goes away because it implies your string type supports |
73 | | invalid UTF-8 and thus must deal with offsets that not only split a codepoint |
74 | | but occur in entirely invalid UTF-8 somehow), it follows that `^` never matches |
75 | | between the code units of a codepoint because the start of a valid UTF-8 string |
76 | | is never within the encoding of a codepoint. |
77 | | * `$` basically the same logic as `^`, but for the end of a string. A valid |
78 | | UTF-8 string can't have an incomplete codepoint at the end of it. |
79 | | * `(?m:^)` follows similarly to `^`, but it can match immediately following |
80 | | a `\n`. However, since a `\n` is always a codepoint itself and can never |
81 | | appear within a codepoint, it follows that the position immediately following |
82 | | a `\n` in a string that is valid UTF-8 is guaranteed to not be between the |
83 | | code units of another codepoint. (One caveat here is that the line terminator |
84 | | for multi-line anchors can now be changed to any arbitrary byte, including |
85 | | things like `\x98` which might occur within a codepoint. However, this wasn't |
86 | | supported by the old regex crate. If it was, it pose the same problems as |
87 | | `(?-u:\B)`, as we'll discuss below.) |
88 | | * `(?m:$)` a similar argument as for `(?m:^)`. The only difference is that a |
89 | | `(?m:$)` matches just before a `\n`. But the same argument applies. |
90 | | * `(?Rm:^)` and `(?Rm:$)` weren't supported by the old regex crate, but the |
91 | | CRLF aware line anchors follow a similar argument as for `(?m:^)` and `(?m:$)`. |
92 | | Namely, since they only ever match at a boundary where one side is either a |
93 | | `\r` or a `\n`, neither of which can occur within a codepoint. |
94 | | * `\b` only matches at positions where both sides are valid codepoints, so |
95 | | this cannot split a codepoint. |
96 | | * `\B`, like `\b`, also only matches at positions where both sides are valid |
97 | | codepoints. So this cannot split a codepoint either. |
98 | | * `(?-u:\b)` matches only at positions where at least one side of it is an ASCII |
99 | | word byte. Since ASCII bytes cannot appear as code units in non-ASCII codepoints |
100 | | (one of the many amazing qualities of UTF-8), it follows that this too cannot |
101 | | split a codepoint. |
102 | | * `(?-u:\B)` finally represents a problem. It can matches between *any* two |
103 | | bytes that are either both word bytes or non-word bytes. Since code units like |
104 | | `\xE2` and `\x98` (from the UTF-8 encoding of `☃`) are both non-word bytes, |
105 | | `(?-u:\B)` will match at the position between them. |
106 | | |
107 | | Thus, our approach of incrementing one codepoint at a time after seeing an |
108 | | empty match is flawed because `(?-u:\B)` can result in an empty match that |
109 | | splits a codepoint at a position past the starting point of a search. For |
110 | | example, searching `(?-u:\B)` on `a☃` would produce the following matches: `[2, |
111 | | 2]`, `[3, 3]` and `[4, 4]`. The positions at `0` and `1` don't match because |
112 | | they correspond to word boundaries since `a` is an ASCII word byte. |
113 | | |
114 | | So what did the old regex crate do to avoid this? It banned `(?-u:\B)` from |
115 | | regexes that could match `&str`. That might sound extreme, but a lot of other |
116 | | things were banned too. For example, all of `(?-u:.)`, `(?-u:[^a])` and |
117 | | `(?-u:\W)` can match invalid UTF-8 too, including individual code units with a |
118 | | codepoint. The key difference is that those expressions could never produce an |
119 | | empty match. That ban happens when translating an `Ast` to an `Hir`, because |
120 | | that process that reason about whether an `Hir` can produce *non-empty* matches |
121 | | at invalid UTF-8 boundaries. Bottom line though is that we side-stepped the |
122 | | `(?-u:\B)` issue by banning it. |
123 | | |
124 | | If banning `(?-u:\B)` were the only issue with the old regex crate's approach, |
125 | | then I probably would have kept it. `\B` is rarely used, so it's not such a big |
126 | | deal to have to work-around it. However, the problem with the above approach |
127 | | is that it doesn't compose. The logic for avoiding splitting a codepoint only |
128 | | lived in the iterator, which means if anyone wants to implement their own |
129 | | iterator over regex matches, they have to deal with this extremely subtle edge |
130 | | case to get full correctness. |
131 | | |
132 | | Instead, in this crate, we take the approach of pushing this complexity down |
133 | | to the lowest layers of each regex engine. The approach is pretty simple: |
134 | | |
135 | | * If this corner case doesn't apply, don't do anything. (For example, if UTF-8 |
136 | | mode isn't enabled or if the regex cannot match the empty string.) |
137 | | * If an empty match is reported, explicitly check if it splits a codepoint. |
138 | | * If it doesn't, we're done, return the match. |
139 | | * If it does, then ignore the match and re-run the search. |
140 | | * Repeat the above process until the end of the haystack is reached or a match |
141 | | is found that doesn't split a codepoint or isn't zero width. |
142 | | |
143 | | And that's pretty much what this module provides. Every regex engine uses these |
144 | | methods in their lowest level public APIs, but just above the layer where |
145 | | their internal engine is used. That way, all regex engines can be arbitrarily |
146 | | composed without worrying about handling this case, and iterators don't need to |
147 | | handle it explicitly. |
148 | | |
149 | | (It turns out that a new feature I added, support for changing the line |
150 | | terminator in a regex to any arbitrary byte, also provokes the above problem. |
151 | | Namely, the byte could be invalid UTF-8 or a UTF-8 continuation byte. So that |
152 | | support would need to be limited or banned when UTF-8 mode is enabled, just |
153 | | like we did for `(?-u:\B)`. But thankfully our more robust approach in this |
154 | | crate handles that case just fine too.) |
155 | | */ |
156 | | |
157 | | use crate::util::search::{Input, MatchError}; |
158 | | |
159 | | #[cold] |
160 | | #[inline(never)] |
161 | 0 | pub(crate) fn skip_splits_fwd<T, F>( |
162 | 0 | input: &Input<'_>, |
163 | 0 | init_value: T, |
164 | 0 | match_offset: usize, |
165 | 0 | find: F, |
166 | 0 | ) -> Result<Option<T>, MatchError> |
167 | 0 | where |
168 | 0 | F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>, |
169 | 0 | { |
170 | 0 | skip_splits(true, input, init_value, match_offset, find) |
171 | 0 | } |
172 | | |
173 | | #[cold] |
174 | | #[inline(never)] |
175 | 0 | pub(crate) fn skip_splits_rev<T, F>( |
176 | 0 | input: &Input<'_>, |
177 | 0 | init_value: T, |
178 | 0 | match_offset: usize, |
179 | 0 | find: F, |
180 | 0 | ) -> Result<Option<T>, MatchError> |
181 | 0 | where |
182 | 0 | F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>, |
183 | 0 | { |
184 | 0 | skip_splits(false, input, init_value, match_offset, find) |
185 | 0 | } |
186 | | |
187 | 0 | fn skip_splits<T, F>( |
188 | 0 | forward: bool, |
189 | 0 | input: &Input<'_>, |
190 | 0 | init_value: T, |
191 | 0 | mut match_offset: usize, |
192 | 0 | mut find: F, |
193 | 0 | ) -> Result<Option<T>, MatchError> |
194 | 0 | where |
195 | 0 | F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>, |
196 | 0 | { |
197 | 0 | // If our config says to do an anchored search, then we're definitely |
198 | 0 | // done. We just need to determine whether we have a valid match or |
199 | 0 | // not. If we don't, then we're not allowed to continue, so we report |
200 | 0 | // no match. |
201 | 0 | // |
202 | 0 | // This is actually quite a subtle correctness thing. The key here is |
203 | 0 | // that if we got an empty match that splits a codepoint after doing an |
204 | 0 | // anchored search in UTF-8 mode, then that implies that we must have |
205 | 0 | // *started* the search at a location that splits a codepoint. This |
206 | 0 | // follows from the fact that if a match is reported from an anchored |
207 | 0 | // search, then the start offset of the match *must* match the start |
208 | 0 | // offset of the search. |
209 | 0 | // |
210 | 0 | // It also follows that no other non-empty match is possible. For |
211 | 0 | // example, you might write a regex like '(?:)|SOMETHING' and start its |
212 | 0 | // search in the middle of a codepoint. The first branch is an empty |
213 | 0 | // regex that will bubble up a match at the first position, and then |
214 | 0 | // get rejected here and report no match. But what if 'SOMETHING' could |
215 | 0 | // have matched? We reason that such a thing is impossible, because |
216 | 0 | // if it does, it must report a match that starts in the middle of a |
217 | 0 | // codepoint. This in turn implies that a match is reported whose span |
218 | 0 | // does not correspond to valid UTF-8, and this breaks the promise |
219 | 0 | // made when UTF-8 mode is enabled. (That promise *can* be broken, for |
220 | 0 | // example, by enabling UTF-8 mode but building an by hand NFA that |
221 | 0 | // produces non-empty matches that span invalid UTF-8. This is an unchecked |
222 | 0 | // but documented precondition violation of UTF-8 mode, and is documented |
223 | 0 | // to have unspecified behavior.) |
224 | 0 | // |
225 | 0 | // I believe this actually means that if an anchored search is run, and |
226 | 0 | // UTF-8 mode is enabled and the start position splits a codepoint, |
227 | 0 | // then it is correct to immediately report no match without even |
228 | 0 | // executing the regex engine. But it doesn't really seem worth writing |
229 | 0 | // out that case in every regex engine to save a tiny bit of work in an |
230 | 0 | // extremely pathological case, so we just handle it here. |
231 | 0 | if input.get_anchored().is_anchored() { |
232 | 0 | return Ok(if input.is_char_boundary(match_offset) { |
233 | 0 | Some(init_value) |
234 | | } else { |
235 | 0 | None |
236 | | }); |
237 | 0 | } |
238 | 0 | // Otherwise, we have an unanchored search, so just keep looking for |
239 | 0 | // matches until we have one that does not split a codepoint or we hit |
240 | 0 | // EOI. |
241 | 0 | let mut value = init_value; |
242 | 0 | let mut input = input.clone(); |
243 | 0 | while !input.is_char_boundary(match_offset) { |
244 | 0 | if forward { |
245 | 0 | // The unwrap is OK here because overflowing usize while |
246 | 0 | // iterating over a slice is impossible, at it would require |
247 | 0 | // a slice of length greater than isize::MAX, which is itself |
248 | 0 | // impossible. |
249 | 0 | input.set_start(input.start().checked_add(1).unwrap()); |
250 | 0 | } else { |
251 | 0 | input.set_end(match input.end().checked_sub(1) { |
252 | 0 | None => return Ok(None), |
253 | 0 | Some(end) => end, |
254 | | }); |
255 | | } |
256 | 0 | match find(&input)? { |
257 | 0 | None => return Ok(None), |
258 | 0 | Some((new_value, new_match_end)) => { |
259 | 0 | value = new_value; |
260 | 0 | match_offset = new_match_end; |
261 | 0 | } |
262 | | } |
263 | | } |
264 | 0 | Ok(Some(value)) |
265 | 0 | } |