Warm tip: This article is reproduced from serverfault.com, please click

Why is my brute force substring search returning extra counts?

发布于 2020-11-29 16:11:22

Doing some work with timing different algorithms, however my brute force implementation which I have found numerous times on different sites is sometimes returning more results than, say, Notepad++ search or VSCode search. Not sure what I am doing wrong.

The program opens a txt file with a DNA strand string of length 10000000 and searches and counts the number of occurrences of the string passed in via command line.

Algorithm:

int main(int argc, char *argv[]) {
    // read in dna strand
    ifstream file("dna.txt");
    string dna((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
    dna.c_str();
    int dnaLength = dna.length();
    cout << "DNA Strand Length: " << dnaLength << endl;
    string pat = argv[1];
    cout << "Pattern: " << pat << endl;

    // algorithm
    int M = pat.length();
    int N = dnaLength;
    int localCount = 0;
    for (int i = 0; i <= N - M; i++) {
        int j;
        for (j = 0; j < M; j++) {
            if (dna.at(i + j) != pat.at(j)) {
                break;
            }
        }
        if (j == M) {
            localCount++;
        }
    }
Questioner
cyberguy
Viewed
0
P. Saladin 2020-11-30 00:50:58

The difference might be because your algorithm also counts overlapping results, while a quick check with Notepad++ shows that it does not. Example:

Let dna be "FooFooFooFoo"

And your pattern "FooFoo"

What result do you expect? Notepad++ shows 2 (one starts at position 1, the second at position 7 (after the first).

Your algorithm will find 3 (position 1, 4 and 7)