|
ms
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Search a file and find all occurencesHi All,
I have a process where I'd like to search the contents of a file(in a dir) for all occurences (or the count of) of a given string. My goal is to focus more on performance, as some of the files could be upwards of 25mb in size and time is important. I don't want to take the route of loading the text of the file into a giant string and searching it, but would rather focus on a performance-minded solution. Any sugesstions for a performance - oriented "find all occurences" type of search against a file? Thanks in advance! -- Dameon Dameon wrote:
> Hi All, First of all you have to get clear in your mind what your requirements> > I have a process where I'd like to search the contents of a file(in a dir) > for all occurences (or the count of) of a given string. My goal is to focus > more on performance, as some of the files could be upwards of 25mb in size > and time is important. > I don't want to take the route of loading the text of the file into a giant string and > searching it, but would rather focus on a performance-minded solution. are. First you state that you want a performant (for real time) solution, then you state that you don't want to load the text of the file into a string and search it. Well, if the file is up to 25Mb then that wouldn't be feasible... nonetheless, are you concerned with time or memory? Often you have to trade one to get the other. Where along this scale are you: 1. Care about speed and memory consumption be damned. 2. Want speed without using unnecessarily large amounts of memory. 3. Want a compact footprint and you're willing to sacrifice some speed. 4. Want a compact footprint and speed be damned. Obviously you're not near the bottom of the list, but are you willing to pay for speed with memory? That may help limit possible solutions. > Any sugesstions for a performance - oriented "find all occurences" type of search against a file? I don't have time to search for the code or the precise algorithm, butI believe that what you want is a state-machine driven model. For a search string you can build a state machine that indicates what to do at each mismatch point. Then you can read the file as a sequence of bytes and apply the state machine to the stream. As I said, I don't have time to search for the code, but here is an idea. Let's say you're searching for "abracadabra" in the stream "abracabracadabra": 1. a matches a. Move along. 2. b matches b. Move along. 3. r matches r. Move along. 4. a matches a. Move along. 5. c matches c. Move along. 6. a matches a. Move along. 7. b does not match d. At this point, the state machine knows that you don't have to back up all the way to the preceding "b". You can back up to the 4th check, the "a" and start comparing again. ....and so on... The idea is that you pre-process the search string into a state machine. The states and transitions are unique to the search string and allow you to avoid re-checking characters you don't have to. The technique becomes more effective the longer the search string. Can anyone point the OP to a link for this algorithm? It's an oldie... I'm sure there are examples out there. Bruce Wood wrote:
> Dameon wrote: I found it. It's called the Boyer-Moore Algorithm, and I had it> > Hi All, > > > > I have a process where I'd like to search the contents of a file(in a dir) > > for all occurences (or the count of) of a given string. My goal is to focus > > more on performance, as some of the files could be upwards of 25mb in size > > and time is important. > > Can anyone point the OP to a link for this algorithm? It's an oldie... > I'm sure there are examples out there. backward... it tests from the end of the search string to the start. It's the Knuth-Pratt-Morris Algorithm that searches in the way I described. You can just Google those names. I found this page with some examples of the algorithms in action: http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html "Bruce Wood" <brucew***@canada.com> wrote: Thank you! The Byers-Moore algorithm on that page is very clever, and>You can just Google those names. I found this page with some examples >of the algorithms in action: >http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html the web page is a very nice way to present it! -- Lucian Hi Bruce -
Thanks for your reply. In all of my searches, the Boyer-Moore algorithm seemed to be the one most recommended for this type of operation. And to your point about requirements - your point #2 is where I am: "2. Want speed without using unnecessarily large amounts of memory.". Any advice on how to implement Boyer-Moore in C# for the following scenario: 1. I have an automated process which needs to add a closing tag to a file 2. The close tag cannot be added until the file has X amount of occurences of a specified string 3. The automated process will call methodX to check for X amount of occurences in the specified file 4. When the correct # of occurences are found, the closing tag will be added. The difficulty comes in deciding how best to implement an algorithm like Boyer-Moore for the above scenario and the most efficient way to load a string or parts of a string for application of the algorithm. Via System.IO, Streaming, StreamReader, etc. I will plug away and post my results here. I appreciate the advice. -- Show quoteHide quoteDameon "Bruce Wood" wrote: > > Dameon wrote: > > Hi All, > > > > I have a process where I'd like to search the contents of a file(in a dir) > > for all occurences (or the count of) of a given string. My goal is to focus > > more on performance, as some of the files could be upwards of 25mb in size > > and time is important. > > > I don't want to take the route of loading the text of the file into a giant string and > > searching it, but would rather focus on a performance-minded solution. > > First of all you have to get clear in your mind what your requirements > are. First you state that you want a performant (for real time) > solution, then you state that you don't want to load the text of the > file into a string and search it. Well, if the file is up to 25Mb then > that wouldn't be feasible... nonetheless, are you concerned with time > or memory? Often you have to trade one to get the other. Where along > this scale are you: > > 1. Care about speed and memory consumption be damned. > 2. Want speed without using unnecessarily large amounts of memory. > 3. Want a compact footprint and you're willing to sacrifice some speed. > 4. Want a compact footprint and speed be damned. > > Obviously you're not near the bottom of the list, but are you willing > to pay for speed with memory? That may help limit possible solutions. > > > Any sugesstions for a performance - oriented "find all occurences" type of search against a file? > > I don't have time to search for the code or the precise algorithm, but > I believe that what you want is a state-machine driven model. For a > search string you can build a state machine that indicates what to do > at each mismatch point. Then you can read the file as a sequence of > bytes and apply the state machine to the stream. > > As I said, I don't have time to search for the code, but here is an > idea. Let's say you're searching for "abracadabra" in the stream > "abracabracadabra": > > 1. a matches a. Move along. > 2. b matches b. Move along. > 3. r matches r. Move along. > 4. a matches a. Move along. > 5. c matches c. Move along. > 6. a matches a. Move along. > 7. b does not match d. At this point, the state machine knows that you > don't have to back up all the way to the preceding "b". You can back up > to the 4th check, the "a" and start comparing again. > ....and so on... > > The idea is that you pre-process the search string into a state > machine. The states and transitions are unique to the search string and > allow you to avoid re-checking characters you don't have to. The > technique becomes more effective the longer the search string. > > Can anyone point the OP to a link for this algorithm? It's an oldie... > I'm sure there are examples out there. > > -- Dameon "Bruce Wood" wrote: > > Dameon wrote: > > Hi All, > > > > I have a process where I'd like to search the contents of a file(in a dir) > > for all occurences (or the count of) of a given string. My goal is to focus > > more on performance, as some of the files could be upwards of 25mb in size > > and time is important. > > > I don't want to take the route of loading the text of the file into a giant string and > > searching it, but would rather focus on a performance-minded solution. > > First of all you have to get clear in your mind what your requirements > are. First you state that you want a performant (for real time) > solution, then you state that you don't want to load the text of the > file into a string and search it. Well, if the file is up to 25Mb then > that wouldn't be feasible... nonetheless, are you concerned with time > or memory? Often you have to trade one to get the other. Where along > this scale are you: > > 1. Care about speed and memory consumption be damned. > 2. Want speed without using unnecessarily large amounts of memory. > 3. Want a compact footprint and you're willing to sacrifice some speed. > 4. Want a compact footprint and speed be damned. > > Obviously you're not near the bottom of the list, but are you willing > to pay for speed with memory? That may help limit possible solutions. > > > Any sugesstions for a performance - oriented "find all occurences" type of search against a file? > > I don't have time to search for the code or the precise algorithm, but > I believe that what you want is a state-machine driven model. For a > search string you can build a state machine that indicates what to do > at each mismatch point. Then you can read the file as a sequence of > bytes and apply the state machine to the stream. > > As I said, I don't have time to search for the code, but here is an > idea. Let's say you're searching for "abracadabra" in the stream > "abracabracadabra": > > 1. a matches a. Move along. > 2. b matches b. Move along. > 3. r matches r. Move along. > 4. a matches a. Move along. > 5. c matches c. Move along. > 6. a matches a. Move along. > 7. b does not match d. At this point, the state machine knows that you > don't have to back up all the way to the preceding "b". You can back up > to the 4th check, the "a" and start comparing again. > ....and so on... > > The idea is that you pre-process the search string into a state > machine. The states and transitions are unique to the search string and > allow you to avoid re-checking characters you don't have to. The > technique becomes more effective the longer the search string. > > Can anyone point the OP to a link for this algorithm? It's an oldie... > I'm sure there are examples out there. > > Dameon wrote:
> The first question is, how many of the files are likely to need the> Any advice on how to implement Boyer-Moore in C# for the following scenario: > > 1. I have an automated process which needs to add a closing tag to a file > 2. The close tag cannot be added until the file has X amount of occurences > of a specified string > 3. The automated process will call methodX to check for X amount of > occurences in the specified file > 4. When the correct # of occurences are found, the closing tag will be added. closing tag added? 1. Most of them 2. A few of them This governs your solution. If the answer is "a few of them" then you can tolerate an algorithm that first has to read the file to see if it is a match, and then has to read it again to actually append the tag. This is easier. If the answer is "most of them" then you have to search and append the tag all in one operation, which is harder. Unfortunately, there's no free lunch, either: if you choose the second, harder algorithm, then its performance will suck if you have to append to only a few files. If you choose the first, easier algorithm, then its performance will suck if you have to append to most files. In either case, in order to do the search, I would create an array of characters in memory that is the same size as or larger than the search string. I would then wrap it in a class that gives you "circular buffer" semantics with indices that work backward (where index 0 is the last character you just read, 1 is the one before that, 2 is the one before that, etc): public class CircularBuffer { private char[] _buffer; private int _length; private int _zeroIndex; public CircularBuffer(int capacity) { this._buffer = new char[capacity]; this._length = 0; this._zeroIndex = 0; } public void Add(char newChar) { this._zeroIndex -= 1; if (this._zeroIndex < 0) { this._zeroIndex += this._buffer.Length; } this._buffer[this._zeroIndex] = newChar; if (this._length < this._buffer.Length) { this._length += 1; } } public char this[int index] { if (index < 0 || index >= this._length) { throw new IndexOutOfRangeException(...); } int trueIndex = this._zeroIndex + index; // Yes, one could use a modulus operator to do this, but I would expect // a test followed by a subtraction to be slightly faster than a modulus, // which requires an integer division operation. if (trueIndex > 0) { trueIndex -= this._buffer.Length; } return this._buffer[trueIndex]; } } Basically what this gives you is a buffer that holds the last "capacity" characters that you have read from a stream. As you read each character you can "Add" it to the buffer, which makes one character "fall off" the "start" of the array to make room for the new character. With this, it should be easy enough to program Boyer-Moore: you have a rolling record of enough characters from the stream that you can move far enough backward through the stream to make the required comparisons. Now the only problem is adding the closing tag. Here the only question is whether you want to be writing an output file as you are doing Boyer-Moore, and throw away the copied file if it turns out that the file fails the test, or whether you want to do Boyer-Moore first and then, if the file passes the test, read the file again, copying it into a new file and appending the closing tag. In the first case, you will waste time writing files you don't need if the Boyer-Moore test eventually reveals that the file isn't a match. In the second case, you have to read the file again if the test reveals that the file is a match. So, if few files will be a match, then the second algorithm (read to test / read again to copy) is obviously more performant. On the other hand, if most files will be a match, then the first algorithm (always write new file / delete if not needed) will be more performant. This first algorithm is, by the way, slightly easier to write, as you can write two completely separate bits of code, each of which has one job to do. However, it really depends upon your mix of files which is the best way to do it. Hi Bruce -
To answer a couple fo your questions: 1. All of the files checked will eventually each need a close tag when they have the correct # of occurences of the search string. 2. Adding the tag isn't an issue, as it will always be the last set of chars in the file(like an XML root tag being closed). In my search method, I will return a true\false to the caller letting him know that the file searched either has all occurences or not. 3. The code below I assume doesn't consume the entire string, but more so I am 'feeding' the CircularBuffer characters as I read them from the stream and the capacity of the buffer is set via the constructor of the class. Once again, thanks for your assistance. I will reply with some results! Dameon -- Show quoteHide quoteDameon "Bruce Wood" wrote: > > Dameon wrote: > > > > Any advice on how to implement Boyer-Moore in C# for the following scenario: > > > > 1. I have an automated process which needs to add a closing tag to a file > > 2. The close tag cannot be added until the file has X amount of occurences > > of a specified string > > 3. The automated process will call methodX to check for X amount of > > occurences in the specified file > > 4. When the correct # of occurences are found, the closing tag will be added. > > The first question is, how many of the files are likely to need the > closing tag added? > 1. Most of them > 2. A few of them > > This governs your solution. If the answer is "a few of them" then you > can tolerate an algorithm that first has to read the file to see if it > is a match, and then has to read it again to actually append the tag. > This is easier. If the answer is "most of them" then you have to search > and append the tag all in one operation, which is harder. > Unfortunately, there's no free lunch, either: if you choose the second, > harder algorithm, then its performance will suck if you have to append > to only a few files. If you choose the first, easier algorithm, then > its performance will suck if you have to append to most files. > > In either case, in order to do the search, I would create an array of > characters in memory that is the same size as or larger than the search > string. I would then wrap it in a class that gives you "circular > buffer" semantics with indices that work backward (where index 0 is the > last character you just read, 1 is the one before that, 2 is the one > before that, etc): > > public class CircularBuffer > { > private char[] _buffer; > private int _length; > private int _zeroIndex; > > public CircularBuffer(int capacity) > { > this._buffer = new char[capacity]; > this._length = 0; > this._zeroIndex = 0; > } > > public void Add(char newChar) > { > this._zeroIndex -= 1; > if (this._zeroIndex < 0) > { > this._zeroIndex += this._buffer.Length; > } > this._buffer[this._zeroIndex] = newChar; > if (this._length < this._buffer.Length) > { > this._length += 1; > } > } > > public char this[int index] > { > if (index < 0 || index >= this._length) > { > throw new IndexOutOfRangeException(...); > } > int trueIndex = this._zeroIndex + index; > > // Yes, one could use a modulus operator to do this, but I > would expect > // a test followed by a subtraction to be slightly faster than > a modulus, > // which requires an integer division operation. > if (trueIndex > 0) > { > trueIndex -= this._buffer.Length; > } > return this._buffer[trueIndex]; > } > } > > Basically what this gives you is a buffer that holds the last > "capacity" characters that you have read from a stream. As you read > each character you can "Add" it to the buffer, which makes one > character "fall off" the "start" of the array to make room for the new > character. > > With this, it should be easy enough to program Boyer-Moore: you have a > rolling record of enough characters from the stream that you can move > far enough backward through the stream to make the required > comparisons. > > Now the only problem is adding the closing tag. > > Here the only question is whether you want to be writing an output file > as you are doing Boyer-Moore, and throw away the copied file if it > turns out that the file fails the test, or whether you want to do > Boyer-Moore first and then, if the file passes the test, read the file > again, copying it into a new file and appending the closing tag. > > In the first case, you will waste time writing files you don't need if > the Boyer-Moore test eventually reveals that the file isn't a match. In > the second case, you have to read the file again if the test reveals > that the file is a match. So, if few files will be a match, then the > second algorithm (read to test / read again to copy) is obviously more > performant. On the other hand, if most files will be a match, then the > first algorithm (always write new file / delete if not needed) will be > more performant. This first algorithm is, by the way, slightly easier > to write, as you can write two completely separate bits of code, each > of which has one job to do. However, it really depends upon your mix of > files which is the best way to do it. > >
Other interesting topics
Fast Case Insensitive String Comparisons
Launch Application as Different User in C# Windows Applicatoin How to properly check for NULL! Visual Studio Plugin? Need Simple Example of Class Event Raising and Handling in Form Appropriate pattern KeyedCollection in use for GUI editor? Chicken in Egg problem... GetHicon memory leak InvalidOperations excetion running wsh with Process class XML Edit/Save |
|||||||||||||||||||||||