in ,

Fix boyer_moore_searcher with the Rytter correction by StephanTLavavejPump Request # 724microsoft / STL, Hacker News

Fix boyer_moore_searcher with the Rytter correction by StephanTLavavejPump Request # 724microsoft / STL, Hacker News
          

This fixes # 724 , silent bad codegen in an important C (feature.)

The Boyer-Moore algorithm (published 599223835 relies on a “delta2” table, but that paper did explain how to generate the table. Another paper by Knuth, Morris, and Pratt (also 1977) provided the algorithm for the delta2 table. Actually, it provided two algorithms producing tables compatible with the usage of delta2: a basic algorithm dd and an improved algorithm dd '. We implemented the latter, and properly translated it from 1-based indexing to 0-based indexing.

However, the published algorithm for dd ' was incorrect! This was discovered and fixed by Rytter in , which we weren't aware of until we received a bug report. While the "Rytter correction" was known in the computer science literature, I find it very curious that it isn't constantly mentioned when explaining Boyer-Moore (eg Wikipedia's page currently makes no mention of this).

This PR applies this 32 – year-old bugfix and significantly expands our test coverage.

      Add top-level const to _ Shifts . (This is the dd ' output array; I kept the name.) Rename _ Pat_size to __x to follow the naming in the published algorithms. Fix comment typo: _ RanIt

        ) does not exist, it's _ RanItPat . Add comments about the history for future programmer-archaeologists. Rename _ Suffix_fn to __ Fx , again following the papers. Note that in the usage below, there is a semantic change: _ Suffix_fn stored 0-based values, while _ Fx stores 1-based values. This undoes a micro-optimization ( _ Suffix_fn avoided unnecessarily translating between the 0-based and 1-based domains ), but with the increased usage of f in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.

          _ Fx can be top -level const (we don't reassign / reset the unique_ptr

            .

          • Obscure bugfix: unique_ptr uses default_delete which uses (delete []
              , so we should use new []

                to match, not :: new [] . (This makes a difference only for pathologically fancy _ Diff types.)

          Change 0-based _ Idx

            (to 1-based _ Kx .

              Change 0-based _ Suffix (to 1-based _Tx Rename 1-based _ Idx to 1-based _ Jx . (While the code was correct, I found it confusing that _ Idx was 0-based in other loops but 1-based in this loop.)

                Note that after these changes, the code closely corresponds to the published algorithms, except that subscripting needs to adjust from 1-based to 0-based indexing.

              • Implement the Rytter correction, which replaces the final loop of the KMP 713 algorithm.
              • For For clarity and debug codegen, I extracted a repeated array access into _ Temp after verifying that this does not disturb the algorithm.
                      R1_searchers / test.cpp

                          Add test cases to . ) test_boyer_moore_table2_construction ()

                            from Rytter's paper, and other repetitive patterns where the Rytter correction produces different results. Those patterns were "made from scratch" but for the results, I just used the output of the implementation and manually verified selected answers for the AB and ABC categories against my understanding of delta2's meaning (including the unused last entry, see (#) ; I was able to understand why it's for aaaaaaaaaa , 2 for "abaabaabaa" , and why it should be 1 for ("ababababab" and "abcabcabc") ). Add the test case from # , plus hand-selected test cases from the randomized testing below (selected for interesting -looking patterns). Add randomized test coverage for both Boyer-Moore and Boyer-Moore-Horspool (the latter is not known to have any bugs). This uses a fully-seeded mt 599223835 (for speed, instead of directly using random_device ). It prints out the needle / haystack for any failures, so we don't need the seed printing / reload machinery from P (R5_charconv / test.cpp) . (I'm using mt 599223835

                              for 40 - bit performance, since we don't need - bit values.) The randomized coverage uses alphabets from [a-b] to ; the former finds more examples of the bug being fixed here (as it's more likely to create highly repetitive patterns). Expanding the alphabet makes repetition unlikely, which is why I stop at [a-f]

                                . The test does a few things to improve non-optimized debug performance (reusing needle and haystack

                                  to avoid repeated allocations, using const char to avoid iterator overhead), although I did not pursue this to the ultimate limit (eg uniform_int_distribution is somewhat costly and could be avoided at the expense of some bias). Finally, as with the other randomized coverage, I print out timing statistics if it takes a long time; on my i7 - it's tuned to take ~ ms which seems reasonable (given the number of configurations, the performance of VMs, and the time needed for compilation).

                                      While this changes the behavior of a function, it is ABI-safe. This does not require coordinated changes across functions - the rest of the Boyer-Moore machinery is unchanged, and the layout of the delta2 table is unchanged. We're simply filling it with different values. In the event of mismatch, the linker will either pick the correct or incorrect algorithm, so this can't make things any worse.

                                            
                                      Read More

What do you think?

Leave a Reply

Your email address will not be published. Required fields are marked *

GIPHY App Key not set. Please check settings

Coronavirus: China's Wuhan city revises up death toll by 1,290 – a 50% increase – Sky News, Sky.com

Coronavirus: China's Wuhan city revises up death toll by 1,290 – a 50% increase – Sky News, Sky.com

Google and Apple Plan to Turn Phones into COVID-19 Contact-Tracking Devices