
Discussing the pros and cons of these text filtering strategies

In this third part of a four-part series, we’ll cover benchmarking in Go, using our text filtering tool as the test subject. You can find the other parts below:
First, we created a Go program to remove unwanted text, and then we enhanced it with regular expression support. Now, we will learn about benchmarking in Go by comparing the initial substring search approach against the regular expression pattern matching. Note that the methods used here are simple and intended to facilitate a discussion of benchmarking, not to objectively prove that one method is strictly faster than the other.
Go’s standard library testing package provides comprehensive testing support (you may have seen functions of the form TestXxx(t *testing.T)), as well as benchmarking support. Its B type provides conveniences for managing benchmark timing and iteration counts, as well as utility functions for cleanup, logging, or failing tests.
Using these tools, we can gain insight into whether substring search or pattern matching runs faster (under these scenarios).
Go benchmarks are functions that live in a <filename>_test.go file, where <filename> is usually the name of the file containing the functions under test. Both files are usually in the same package, and benchmark functions take the form:
func BenchmarkXxx(b *testing.B) { // benchmark code }
When writing benchmarks, we usually need some code-specific initial setup, and then we move into a loop that runs the code under test multiple times (a count determined by the testing package). Finally, we are told how many iterations ran and how long each took on average. From there, we can analyze the code’s performance.
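As a minimal sketch of that structure (the strings.Contains call here is just a stand-in for whatever code is under test):

package main

import (
	"strings"
	"testing"
)

// BenchmarkContains times a simple substring check. The testing
// package picks b.N so the loop runs long enough to measure reliably.
func BenchmarkContains(b *testing.B) {
	line := "hello, world" // setup runs once, outside the timed loop
	for i := 0; i < b.N; i++ {
		strings.Contains(line, "hello") // code under test, run b.N times
	}
}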
We will start by initializing a []string with a few lines of text – typical input for substring and pattern matching searches. Then we’ll build a config containing either a set of keys or a pattern, run lineMatches() against the input text (in a benchmark loop), and see how long each iteration takes. We’ll do this with varying amounts of keyphrase and pattern complexity (each made to match the same text) and compare the results.
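For orientation, here is one possible shape for that setup. The config struct, its field names, and this version of lineMatches() are assumptions based on the earlier parts of the series, not necessarily the exact code:

package main

import (
	"regexp"
	"strings"
)

// inputText stands in for the lines read from example/input.txt.
var inputText = []string{
	"hello there",
	"hello, this line is quite large",
	// ...more lines...
}

// config carries either plain substrings or a compiled pattern
// (a guess at the shape used in parts one and two).
type config struct {
	keys    []string
	pattern *regexp.Regexp
}

// lineMatches reports whether a line matches the config, via
// substring search or pattern matching (again, a sketch).
func lineMatches(c config, line string) bool {
	if c.pattern != nil {
		return c.pattern.MatchString(line)
	}
	for _, k := range c.keys {
		if strings.Contains(line, k) {
			return true
		}
	}
	return false
}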
Our benchmarks will be very simple: they will all use the same input text and build a config with either keys or a pattern. We will create congruent substring search and pattern matching benchmarks for each query type. Then, we just need to run the tests and compare the results.
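Building on the sketch above (and adding testing to the imports), a congruent pair for the “hello” query might look like this:

func BenchmarkSubstringHello(b *testing.B) {
	cfg := config{keys: []string{"hello"}}
	for i := 0; i < b.N; i++ {
		for _, line := range inputText {
			lineMatches(cfg, line) // substring path
		}
	}
}

func BenchmarkPatternHello(b *testing.B) {
	cfg := config{pattern: regexp.MustCompile("hello")}
	for i := 0; i < b.N; i++ {
		for _, line := range inputText {
			lineMatches(cfg, line) // pattern matching path
		}
	}
}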
We will use a short file of lines as our input text (e.g. in example/input.txt), and define three sets of benchmarks:
One pair of benchmarks matches “hello” – this will match every line in our input.
Another pair matches “large” – this will match two input lines.
The last pair matches “big,” “brig,” “byte,” and “bright” – this will match three input lines.
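To make the pairings explicit, here is one hypothetical table of key/pattern pairs – the exact pattern strings are my guesses, crafted so that each pair matches the same lines:

// Hypothetical key/pattern pairs; each pair is built to match the
// same input lines, so the two approaches do equivalent work.
var queries = []struct {
	keys    []string
	pattern string
}{
	{keys: []string{"hello"}, pattern: "hello"},
	{keys: []string{"large"}, pattern: "large"},
	{keys: []string{"big", "brig", "byte", "bright"}, pattern: "b(ig|rig|yte|right)"},
}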
Finally, we can run all of our benchmarks with this command:
go test -bench=.
in the root directory of the project repo.
In the resulting output, the first column is the name of the benchmark. The second is how many loop iterations were run (think: the value of b.N), and the third column is the average time taken per iteration. In our case, each iteration searched through all of inputText using either substring or pattern matching.
Note that the -10 suffix appended to each test name indicates the number of CPU cores used for the test (10 in my case).
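For readers who haven’t run Go benchmarks before, the output has this general shape (the numbers here are placeholders for illustration, not my actual results):

BenchmarkSubstringHello-10    12345678      95.0 ns/op
BenchmarkPatternHello-10       3456789     350.0 ns/op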
Looking at the above tests, run on an Apple Silicon Mac, we can see that substring search is anywhere between 3.55x and 83.74x faster than the pattern matching implementation.
Does this mean that we should always choose substring search? Not necessarily.
In software, we should choose the right tool for the job. Can you drive a screw with a hammer? Technically yes, but it’s far from ideal. If you can match your desired text with one or a few substrings, that tool will probably be faster.
But there are a lot of benefits to using regular expressions. They generally translate well between tools and can answer more complex questions than substrings can (without significant repetition). One should also consider how much of an impact the performance difference has over time. As we can see, that will depend on query complexity and input size, and in many cases the difference may not be significant (fractions of a second to less than a minute).
Also, note that the above tests are not comprehensive, but they do represent different general kinds of queries. For example, it would be nice to see how performance changes with input size – that can be left as an exercise for the reader. We also have the option to call b.ResetTimer() just before the b.N loop if the initial setup takes a significant amount of time, but there’s no need to do that here. And as always, there’s more of the benchmarking side of the testing package to dive into!
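For reference, excluding expensive setup from the measurement would look like this (loadLargeInput() is a hypothetical helper, and the config shape is the sketch from earlier):

func BenchmarkWithSetup(b *testing.B) {
	lines := loadLargeInput() // hypothetical: expensive setup we don't want to time
	cfg := config{keys: []string{"hello"}}
	b.ResetTimer() // discard the time spent on setup above
	for i := 0; i < b.N; i++ {
		for _, line := range lines {
			lineMatches(cfg, line)
		}
	}
}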
If there are additional use cases you’d like to see benchmarked, please let me know in the comments! In the next article (the last of the four-part series), we’ll do some command-line benchmarking of this tool against the likes of grep (thanks for the idea, Aaron!).