Home All Groups Group Topic Archive Search About

Search a file and find all occurences

Author
13 Dec 2006 12:47 AM
Dameon
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important. I don't want to take the route of loading the text of
the file into a giant string and searching it, but would rather focus on a
performance-minded solution. Any sugesstions for a performance - oriented
"find all occurences" type of search against a file?

Thanks in advance!
--
Dameon

Author
13 Dec 2006 2:15 AM
Bruce Wood
Dameon wrote:
> Hi All,
>
> I have a process where I'd like to search the contents of a file(in a dir)
> for all occurences (or the count of) of a given string. My goal is to focus
> more on performance, as some of the files could be upwards of 25mb in size
> and time is important.

> I don't want to take the route of loading the text of the file into a giant string and
> searching it, but would rather focus on a performance-minded solution.

First of all you have to get clear in your mind what your requirements
are. First you state that you want a performant (for real time)
solution, then you state that you don't want to load the text of the
file into a string and search it. Well, if the file is up to 25Mb then
that wouldn't be feasible... nonetheless, are you concerned with time
or memory? Often you have to trade one to get the other. Where along
this scale are you:

1. Care about speed and memory consumption be damned.
2. Want speed without using unnecessarily large amounts of memory.
3. Want a compact footprint and you're willing to sacrifice some speed.
4. Want a compact footprint and speed be damned.

Obviously you're not near the bottom of the list, but are you willing
to pay for speed with memory? That may help limit possible solutions.

> Any sugesstions for a performance - oriented "find all occurences" type of search against a file?

I don't have time to search for the code or the precise algorithm, but
I believe that what you want is a state-machine driven model. For a
search string you can build a state machine that indicates what to do
at each mismatch point. Then you can read the file as a sequence of
bytes and apply the state machine to the stream.

As I said, I don't have time to search for the code, but here is an
idea. Let's say you're searching for "abracadabra" in the stream
"abracabracadabra":

1. a matches a. Move along.
2. b matches b. Move along.
3. r matches r. Move along.
4. a matches a. Move along.
5. c matches c. Move along.
6. a matches a. Move along.
7. b does not match d. At this point, the state machine knows that you
don't have to back up all the way to the preceding "b". You can back up
to the 4th check, the "a" and start comparing again.
....and so on...

The idea is that you pre-process the search string into a state
machine. The states and transitions are unique to the search string and
allow you to avoid re-checking characters you don't have to. The
technique becomes more effective the longer the search string.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.
Are all your drivers up to date? click for free checkup

Author
13 Dec 2006 4:35 AM
Bruce Wood
Bruce Wood wrote:
> Dameon wrote:
> > Hi All,
> >
> > I have a process where I'd like to search the contents of a file(in a dir)
> > for all occurences (or the count of) of a given string. My goal is to focus
> > more on performance, as some of the files could be upwards of 25mb in size
> > and time is important.
>
> Can anyone point the OP to a link for this algorithm? It's an oldie...
> I'm sure there are examples out there.

I found it. It's called the Boyer-Moore Algorithm, and I had it
backward... it tests from the end of the search string to the start.
It's the Knuth-Pratt-Morris Algorithm that searches in the way I
described.

You can just Google those names. I found this page with some examples
of the algorithms in action:

http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
Author
13 Dec 2006 5:13 AM
Lucian Wischik
"Bruce Wood" <brucew***@canada.com> wrote:
>You can just Google those names. I found this page with some examples
>of the algorithms in action:
>http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html

Thank you! The Byers-Moore algorithm on that page is very clever, and
the web page is a very nice way to present it!

--
Lucian
Author
13 Dec 2006 6:08 PM
Dameon
Hi Bruce -

Thanks for your reply. In all of my searches, the Boyer-Moore algorithm
seemed to be the one most recommended for this type of operation. And to your
point about requirements - your point #2 is where I am: "2. Want speed
without using unnecessarily large amounts of memory.".

Any advice on how to implement Boyer-Moore in C# for the following scenario:

1. I have an automated process which needs to add a closing tag to a file
2. The close tag cannot be added until the file has X amount of occurences
of a specified string
3. The automated process will call methodX to check for X amount of
occurences in the specified file
4. When the correct # of occurences are found, the closing tag will be added.

The difficulty comes in deciding how best to implement an algorithm like
Boyer-Moore for the above scenario and the most efficient way to load a
string or parts of a string for application of the algorithm. Via System.IO,
Streaming, StreamReader, etc.

I will plug away and post my results here. I appreciate the advice.

--
Dameon


Show quoteHide quote
"Bruce Wood" wrote:

>
> Dameon wrote:
> > Hi All,
> >
> > I have a process where I'd like to search the contents of a file(in a dir)
> > for all occurences (or the count of) of a given string. My goal is to focus
> > more on performance, as some of the files could be upwards of 25mb in size
> > and time is important.
>
> > I don't want to take the route of loading the text of the file into a giant string and
> > searching it, but would rather focus on a performance-minded solution.
>
> First of all you have to get clear in your mind what your requirements
> are. First you state that you want a performant (for real time)
> solution, then you state that you don't want to load the text of the
> file into a string and search it. Well, if the file is up to 25Mb then
> that wouldn't be feasible... nonetheless, are you concerned with time
> or memory? Often you have to trade one to get the other. Where along
> this scale are you:
>
> 1. Care about speed and memory consumption be damned.
> 2. Want speed without using unnecessarily large amounts of memory.
> 3. Want a compact footprint and you're willing to sacrifice some speed.
> 4. Want a compact footprint and speed be damned.
>
> Obviously you're not near the bottom of the list, but are you willing
> to pay for speed with memory? That may help limit possible solutions.
>
> > Any sugesstions for a performance - oriented "find all occurences" type of search against a file?
>
> I don't have time to search for the code or the precise algorithm, but
> I believe that what you want is a state-machine driven model. For a
> search string you can build a state machine that indicates what to do
> at each mismatch point. Then you can read the file as a sequence of
> bytes and apply the state machine to the stream.
>
> As I said, I don't have time to search for the code, but here is an
> idea. Let's say you're searching for "abracadabra" in the stream
> "abracabracadabra":
>
> 1. a matches a. Move along.
> 2. b matches b. Move along.
> 3. r matches r. Move along.
> 4. a matches a. Move along.
> 5. c matches c. Move along.
> 6. a matches a. Move along.
> 7. b does not match d. At this point, the state machine knows that you
> don't have to back up all the way to the preceding "b". You can back up
> to the 4th check, the "a" and start comparing again.
> ....and so on...
>
> The idea is that you pre-process the search string into a state
> machine. The states and transitions are unique to the search string and
> allow you to avoid re-checking characters you don't have to. The
> technique becomes more effective the longer the search string.
>
> Can anyone point the OP to a link for this algorithm? It's an oldie...
> I'm sure there are examples out there.
>
>
--
Dameon


"Bruce Wood" wrote:

>
> Dameon wrote:
> > Hi All,
> >
> > I have a process where I'd like to search the contents of a file(in a dir)
> > for all occurences (or the count of) of a given string. My goal is to focus
> > more on performance, as some of the files could be upwards of 25mb in size
> > and time is important.
>
> > I don't want to take the route of loading the text of the file into a giant string and
> > searching it, but would rather focus on a performance-minded solution.
>
> First of all you have to get clear in your mind what your requirements
> are. First you state that you want a performant (for real time)
> solution, then you state that you don't want to load the text of the
> file into a string and search it. Well, if the file is up to 25Mb then
> that wouldn't be feasible... nonetheless, are you concerned with time
> or memory? Often you have to trade one to get the other. Where along
> this scale are you:
>
> 1. Care about speed and memory consumption be damned.
> 2. Want speed without using unnecessarily large amounts of memory.
> 3. Want a compact footprint and you're willing to sacrifice some speed.
> 4. Want a compact footprint and speed be damned.
>
> Obviously you're not near the bottom of the list, but are you willing
> to pay for speed with memory? That may help limit possible solutions.
>
> > Any sugesstions for a performance - oriented "find all occurences" type of search against a file?
>
> I don't have time to search for the code or the precise algorithm, but
> I believe that what you want is a state-machine driven model. For a
> search string you can build a state machine that indicates what to do
> at each mismatch point. Then you can read the file as a sequence of
> bytes and apply the state machine to the stream.
>
> As I said, I don't have time to search for the code, but here is an
> idea. Let's say you're searching for "abracadabra" in the stream
> "abracabracadabra":
>
> 1. a matches a. Move along.
> 2. b matches b. Move along.
> 3. r matches r. Move along.
> 4. a matches a. Move along.
> 5. c matches c. Move along.
> 6. a matches a. Move along.
> 7. b does not match d. At this point, the state machine knows that you
> don't have to back up all the way to the preceding "b". You can back up
> to the 4th check, the "a" and start comparing again.
> ....and so on...
>
> The idea is that you pre-process the search string into a state
> machine. The states and transitions are unique to the search string and
> allow you to avoid re-checking characters you don't have to. The
> technique becomes more effective the longer the search string.
>
> Can anyone point the OP to a link for this algorithm? It's an oldie...
> I'm sure there are examples out there.
>
>
Author
14 Dec 2006 1:13 AM
Bruce Wood
Dameon wrote:
>
> Any advice on how to implement Boyer-Moore in C# for the following scenario:
>
> 1. I have an automated process which needs to add a closing tag to a file
> 2. The close tag cannot be added until the file has X amount of occurences
> of a specified string
> 3. The automated process will call methodX to check for X amount of
> occurences in the specified file
> 4. When the correct # of occurences are found, the closing tag will be added.

The first question is, how many of the files are likely to need the
closing tag added?
1. Most of them
2. A few of them

This governs your solution. If the answer is "a few of them" then you
can tolerate an algorithm that first has to read the file to see if it
is a match, and then has to read it again to actually append the tag.
This is easier. If the answer is "most of them" then you have to search
and append the tag all in one operation, which is harder.
Unfortunately, there's no free lunch, either: if you choose the second,
harder algorithm, then its performance will suck if you have to append
to only a few files. If you choose the first, easier algorithm, then
its performance will suck if you have to append to most files.

In either case, in order to do the search, I would create an array of
characters in memory that is the same size as or larger than the search
string. I would then wrap it in a class that gives you "circular
buffer" semantics with indices that work backward (where index 0 is the
last character you just read, 1 is the one before that, 2 is the one
before that, etc):

public class CircularBuffer
{
    private char[] _buffer;
    private int _length;
    private int _zeroIndex;

    public CircularBuffer(int capacity)
    {
        this._buffer = new char[capacity];
        this._length = 0;
        this._zeroIndex = 0;
    }

    public void Add(char newChar)
    {
        this._zeroIndex -= 1;
        if (this._zeroIndex < 0)
        {
            this._zeroIndex += this._buffer.Length;
        }
        this._buffer[this._zeroIndex] = newChar;
        if (this._length < this._buffer.Length)
        {
            this._length += 1;
        }
    }

    public char this[int index]
    {
        if (index < 0 || index >= this._length)
        {
            throw new IndexOutOfRangeException(...);
        }
        int trueIndex = this._zeroIndex + index;

        // Yes, one could use a modulus operator to do this, but I
would expect
        // a test followed by a subtraction to be slightly faster than
a modulus,
        // which requires an integer division operation.
        if (trueIndex > 0)
        {
            trueIndex -= this._buffer.Length;
        }
        return this._buffer[trueIndex];
    }
}

Basically what this gives you is a buffer that holds the last
"capacity" characters that you have read from a stream. As you read
each character you can "Add" it to the buffer, which makes one
character "fall off" the "start" of the array to make room for the new
character.

With this, it should be easy enough to program Boyer-Moore: you have a
rolling record of enough characters from the stream that you can move
far enough backward through the stream to make the required
comparisons.

Now the only problem is adding the closing tag.

Here the only question is whether you want to be writing an output file
as you are doing Boyer-Moore, and throw away the copied file if it
turns out that the file fails the test, or whether you want to do
Boyer-Moore first and then, if the file passes the test, read the file
again, copying it into a new file and appending the closing tag.

In the first case, you will waste time writing files you don't need if
the Boyer-Moore test eventually reveals that the file isn't a match. In
the second case, you have to read the file again if the test reveals
that the file is a match. So, if few files will be a match, then the
second algorithm (read to test / read again to copy) is obviously more
performant. On the other hand, if most files will be a match, then the
first algorithm (always write new file / delete if not needed) will be
more performant. This first algorithm is, by the way, slightly easier
to write, as you can write two completely separate bits of code, each
of which has one job to do. However, it really depends upon your mix of
files which is the best way to do it.
Author
14 Dec 2006 6:59 PM
Dameon
Hi Bruce -

To answer a couple fo your questions:
1. All of the files checked will eventually each need a close tag when they
have the correct # of occurences of the search string.

2. Adding the tag isn't an issue, as it will always be the last set of chars
in the file(like an XML root tag being closed). In my search method, I will
return a true\false to the caller letting him know that the file searched
either has all occurences or not.

3. The code below I assume doesn't consume the entire string, but more so I
am 'feeding' the CircularBuffer characters as I read them from the stream and
the capacity of the buffer is set via the constructor of the class.

Once again, thanks for your assistance. I will reply with some results!

Dameon

--
Dameon


Show quoteHide quote
"Bruce Wood" wrote:

>
> Dameon wrote:
> >
> > Any advice on how to implement Boyer-Moore in C# for the following scenario:
> >
> > 1. I have an automated process which needs to add a closing tag to a file
> > 2. The close tag cannot be added until the file has X amount of occurences
> > of a specified string
> > 3. The automated process will call methodX to check for X amount of
> > occurences in the specified file
> > 4. When the correct # of occurences are found, the closing tag will be added.
>
> The first question is, how many of the files are likely to need the
> closing tag added?
> 1. Most of them
> 2. A few of them
>
> This governs your solution. If the answer is "a few of them" then you
> can tolerate an algorithm that first has to read the file to see if it
> is a match, and then has to read it again to actually append the tag.
> This is easier. If the answer is "most of them" then you have to search
> and append the tag all in one operation, which is harder.
> Unfortunately, there's no free lunch, either: if you choose the second,
> harder algorithm, then its performance will suck if you have to append
> to only a few files. If you choose the first, easier algorithm, then
> its performance will suck if you have to append to most files.
>
> In either case, in order to do the search, I would create an array of
> characters in memory that is the same size as or larger than the search
> string. I would then wrap it in a class that gives you "circular
> buffer" semantics with indices that work backward (where index 0 is the
> last character you just read, 1 is the one before that, 2 is the one
> before that, etc):
>
> public class CircularBuffer
> {
>     private char[] _buffer;
>     private int _length;
>     private int _zeroIndex;
>
>     public CircularBuffer(int capacity)
>     {
>         this._buffer = new char[capacity];
>         this._length = 0;
>         this._zeroIndex = 0;
>     }
>
>     public void Add(char newChar)
>     {
>         this._zeroIndex -= 1;
>         if (this._zeroIndex < 0)
>         {
>             this._zeroIndex += this._buffer.Length;
>         }
>         this._buffer[this._zeroIndex] = newChar;
>         if (this._length < this._buffer.Length)
>         {
>             this._length += 1;
>         }
>     }
>
>     public char this[int index]
>     {
>         if (index < 0 || index >= this._length)
>         {
>             throw new IndexOutOfRangeException(...);
>         }
>         int trueIndex = this._zeroIndex + index;
>
>         // Yes, one could use a modulus operator to do this, but I
> would expect
>         // a test followed by a subtraction to be slightly faster than
> a modulus,
>         // which requires an integer division operation.
>         if (trueIndex > 0)
>         {
>             trueIndex -= this._buffer.Length;
>         }
>         return this._buffer[trueIndex];
>     }
> }
>
> Basically what this gives you is a buffer that holds the last
> "capacity" characters that you have read from a stream. As you read
> each character you can "Add" it to the buffer, which makes one
> character "fall off" the "start" of the array to make room for the new
> character.
>
> With this, it should be easy enough to program Boyer-Moore: you have a
> rolling record of enough characters from the stream that you can move
> far enough backward through the stream to make the required
> comparisons.
>
> Now the only problem is adding the closing tag.
>
> Here the only question is whether you want to be writing an output file
> as you are doing Boyer-Moore, and throw away the copied file if it
> turns out that the file fails the test, or whether you want to do
> Boyer-Moore first and then, if the file passes the test, read the file
> again, copying it into a new file and appending the closing tag.
>
> In the first case, you will waste time writing files you don't need if
> the Boyer-Moore test eventually reveals that the file isn't a match. In
> the second case, you have to read the file again if the test reveals
> that the file is a match. So, if few files will be a match, then the
> second algorithm (read to test / read again to copy) is obviously more
> performant. On the other hand, if most files will be a match, then the
> first algorithm (always write new file / delete if not needed) will be
> more performant. This first algorithm is, by the way, slightly easier
> to write, as you can write two completely separate bits of code, each
> of which has one job to do. However, it really depends upon your mix of
> files which is the best way to do it.
>
>

Bookmark and Share