
Replace strings in a text file and get the number of replacements made

Author
7 Jul 2009 4:46 PM
Smith
Hi

Usually I only replace strings with text = text.Replace("old text", "new
text");

Now I need to display the number of replacements made. Is there an easy way
or do I need some custom replacement method?

Author
7 Jul 2009 5:05 PM
vanderghast
String a= ... ;
    int n = (a.Length-a.Replace("old text", "").Length)/("old text".Length);



should do.


Vanderghast, Access MVP



Author
7 Jul 2009 6:26 PM
Peter Duniho
On Tue, 07 Jul 2009 09:46:48 -0700, Smith <n*@thank.you> wrote:

> Hi
>
> Usually I only replace strings with text = text.Replace("old text", "new
> text");
>
> Now I need to display the number of replacements made. Is there an easy 
> way
> or do I need some custom replacement method?

As Vanderghast suggests, as long as the new text is always a different 
length than the original, you can simply look at the difference in length 
of the modified string, which will be an exact multiple of the difference 
in length between the searched-for text and the replacement text.

If you don't have that guarantee -- that is, the new text could be the 
same length as the original -- then an alternative approach would be to 
use the Regex class.  Normally I'd say for simple replacement it's 
overkill, but one thing it provides is an API that returns a list of 
matched sites in your original string; the length of that list is exactly 
the number you're looking for.
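
A minimal sketch of the Regex idea, assuming the search text is meant literally (hence the Regex.Escape call). The class and method names are made up for illustration; Regex.Escape, Regex.Matches and Regex.Replace are the standard System.Text.RegularExpressions calls. The example deliberately uses a replacement of the same length as the search text, which is the case where comparing lengths cannot help.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexCountSketch
{
    // Hypothetical helper: replaces every literal occurrence of 'search'
    // and reports how many matches Regex found in the original string.
    public static string ReplaceAndCount(string input, string search,
                                         string replacement, out int count)
    {
        Regex regex = new Regex(Regex.Escape(search));
        count = regex.Matches(input).Count;
        // The MatchEvaluator result is inserted verbatim, so a "$" in the
        // replacement is not treated as a substitution token.
        return regex.Replace(input, m => replacement);
    }

    static void Main()
    {
        int count;
        string result = ReplaceAndCount("old text, old text",
                                        "old text", "new text", out count);
        Console.WriteLine(result); // new text, new text
        Console.WriteLine(count);  // 2
    }
}
```

This scans the input twice (once to count, once to replace); whether that matters depends on the data and would need measuring.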

Custom code to do the replacement would be most efficient, but Regex won't 
be awful, it's already there written for you, and if this isn't a 
bottleneck in your code it won't matter anyway.

Pete
Author
7 Jul 2009 6:42 PM
vanderghast
I should have specified that you use String.Empty, NOT the new replacement
string, in the Replace call, so the result of

        myString.Replace(fragment, String.Empty)

will be the original myString shortened by fragment.Length each time
fragment can be ***consumed*** from myString.

Sure, that assumes you use the convention that "ohoh" occurs only two times
in "ohohohohoh", not four times (***consuming*** assumption, rather than
***matching*** assumption).
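
A small sketch of that length-difference count, wrapped in a helper so the arithmetic is visible. The class and method names are hypothetical; the empty-fragment check is an addition, since dividing by fragment.Length would otherwise fail.

```csharp
using System;

static class LengthDifferenceSketch
{
    // Hypothetical helper: counts how many times 'fragment' would be
    // consumed by text.Replace(fragment, anything).
    public static int CountReplacements(string text, string fragment)
    {
        if (string.IsNullOrEmpty(fragment))
            throw new ArgumentException("fragment must not be empty", "fragment");

        // Removing every occurrence shortens the text by fragment.Length each time.
        int removedChars = text.Length - text.Replace(fragment, string.Empty).Length;
        return removedChars / fragment.Length;
    }

    static void Main()
    {
        Console.WriteLine(CountReplacements("ohohohohoh", "ohoh"));               // 2
        Console.WriteLine(CountReplacements("old text and old text", "old text")); // 2
    }
}
```

The count is independent of the replacement string, so the real replacement can still be done with a second, ordinary Replace call.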


Vanderghast, Access MVP


Author
7 Jul 2009 9:45 PM
Jesse Houwing
Hello Peter,

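Looping through the string in a StringBuilder is probably the safest way
to do this:

    using System;
    using System.Text;

    static class Program
    {
        static void Main()
        {
            string input = "bla bla bla bla bla bla blabla";
            string search = "bla";
            string replacement = "bli";
            StringBuilder sb = new StringBuilder(input);

            int count = 0;
            // i is advanced inside the loop: past the replacement after a
            // match, otherwise by one character.
            for (int i = 0; i < sb.Length; )
            {
                if (AreEqual(sb, search, i))
                {
                    sb.Remove(i, search.Length);
                    sb.Insert(i, replacement);
                    i += replacement.Length;
                    count++;
                }
                else
                {
                    i++;
                }
            }
            Console.WriteLine(count);
            Console.WriteLine(sb.ToString());
        }

        // True if val occurs in sb starting at position pos.
        static bool AreEqual(StringBuilder sb, string val, int pos)
        {
            // Too close to the end for val to fit; without this check
            // sb[pos + i] could index past the end of the buffer.
            if (pos + val.Length > sb.Length)
            {
                return false;
            }
            for (int i = 0; i < val.Length; i++)
            {
                if (sb[pos + i] != val[i])
                {
                    return false;
                }
            }
            return true;
        }
    }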

It might be faster to use

sb.Replace(search, replacement, i, search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't differ that
much.

It is probably a lot faster than using a regex, though I haven't done any
measurements.

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
Author
7 Jul 2009 10:48 PM
Peter Duniho
On Tue, 07 Jul 2009 14:45:02 -0700, Jesse Houwing 
<jesse.houwing@newsgroup.nospam> wrote:

> Looping through the string in a stringbuilder is probably the safest way 
> to do this:
>
> [...]
>         int count = 0;
>         for (int i = 0; i < sb.Length; )
>         {
>             if (AreEqual(sb, search, i))
>             {
>                 sb.Remove(i, search.Length);
>                 sb.Insert(i, replacement);
>                 i += replacement.Length;
>                 count++;
>             }
>             else
>             {
>                 i++;
>             }
>         }
> [...]
>
> It might be faster to use sb.Replace(search, replacement, i, 
> search.Length)
>
> instead of a sb.Remove, sb.Insert, I'm not sure, but they won't differ 
> that much.

If you simply call StringBuilder.Replace(string, string, int, int) instead 
of having your own AreEqual() method followed by a call to Remove() and 
Insert(), the performance should be practically identical, but you 
wouldn't get any information about how many replacements occurred.

Alternatively, if you still call AreEqual() and then call 
StringBuilder.Replace(string, string, int, int), you're duplicating effort 
(which costs performance), because StringBuilder.Replace(string, string, 
int, int) has to actually do the string comparison again.  That would 
actually be _slower_ than your original code.

You could do a little hack by searching for the first character that 
differs between the search and replacement strings (as an initialization, 
not as part of the loop), and then bumping a counter after each call to 
StringBuilder.Replace() based on whether the character at the same offset 
within the current StringBuilder has changed.  That would be only slightly 
slower than just calling StringBuilder.Replace(string, string, int, int), 
but would include the count.

That said, I would hope that any Replace() method in .NET, including 
Regex.Replace(), String.Replace(), or StringBuilder.Replace() would be 
faster than the code you posted.  The main reason being that all of those 
methods have the opportunity to optimize the construction of the new 
string, whereas your example doesn't optimize at all.

At the very least, I would not use the Remove()/Insert() pattern you've 
shown.  Instead, I would use a String as input, and a StringBuilder as 
output, appending text segments to the output StringBuilder as I scan the 
input String.  That way the code avoids having to repeatedly shift your 
character buffer in the StringBuilder (which happens _twice_ for each 
replacement in your code).  That's exactly the kind of optimization I'd 
expect to find inside the .NET classes.

It might even be worthwhile to defer creation of the output StringBuilder 
until you detect the first match that needs to be replaced, if there's an 
expectation that for a significant frequency of input, no replacements 
would be needed.
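
A rough sketch of that "scan and build" idea, including the deferred StringBuilder. It assumes ordinal (character-by-character) matching, and the class and method names are invented for illustration; it is one possible shape, not a measured or tuned implementation.

```csharp
using System;
using System.Text;

static class ScanAndBuildSketch
{
    // Hypothetical helper: scans 'input' with IndexOf, appends segments to an
    // output StringBuilder, and counts the replacements along the way.
    public static string ReplaceCounting(string input, string search,
                                         string replacement, out int count)
    {
        if (string.IsNullOrEmpty(search))
            throw new ArgumentException("search must not be empty", "search");

        count = 0;
        StringBuilder output = null;  // created only once a match is found
        int index = 0;

        while (true)
        {
            int match = input.IndexOf(search, index, StringComparison.Ordinal);
            if (match < 0) break;

            if (output == null) output = new StringBuilder(input.Length);
            output.Append(input, index, match - index); // text before the match
            output.Append(replacement);
            index = match + search.Length;              // resume after the match
            count++;
        }

        if (output == null) return input;               // no match: reuse the input
        output.Append(input, index, input.Length - index); // trailing text
        return output.ToString();
    }

    static void Main()
    {
        int count;
        string result = ReplaceCounting("bla bla bla bla bla bla blabla",
                                        "bla", "bli", out count);
        Console.WriteLine(result); // bli bli bli bli bli bli blibli
        Console.WriteLine(count);  // 8
    }
}
```

Each character of the input is copied at most once, and nothing is allocated when there is no match.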

> It is probably a lot faster than using a regex, though I haven't done 
> any measurements.

I would expect Regex to be on par with other explicit mechanisms like 
that, especially given the need to count the replacements (which for 
non-Regex solutions requires replacing the search text twice).  If 
performance is an issue, then a "scan and build" approach as I suggest 
above is probably slightly faster than using built-in Replace() methods 
simply because you can incorporate a count into the replacement logic.

All that said, if performance is an issue (and there's nothing in the OP 
to suggest it is), the only way to know for sure what the best solution is 
would be to try the different alternatives and measure them.  Even 
theoretical advantages and disadvantages may be irrelevant for a typical 
data set, and intuition is a terrible way to measure performance.  :)

For best performance, it may be that none of the suggestions offered so 
far is suitable.  There's an optimized text search algorithm, 
the name of which I can't recall at the moment, that can probably be 
adapted, but if not then a degenerate state-graph implementation (since 
there's only one string to search for) would probably work too.  Either 
approach would avoid having to keep performing full string comparisons at 
each character index in the original string (consider an original string 
"aaaaaaaaaaaaaaaaaaaaaaaaaa" where you want to replace all occurrences of 
"aaaaaab" with something :) ).

But even there, as I said, there's no way to know for sure without 
measuring.  Performance of the various choices is to some extent going to 
be data dependent; liabilities that exist in the general case might not 
really be that much of a problem.  For example, if dealing with 
essentially random data, it's not too terrible to just keep comparing over 
and over at an incremented index, because those comparisons will normally 
terminate quickly when there's no match.

In other words, even the theoretically worst-case implementation might not 
turn out to be much different than the more optimized ones.
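
The optimized search algorithm is not named above. Knuth-Morris-Pratt (KMP) is one well-known algorithm that avoids restarting a full comparison at every index, so the sketch below uses it as a stand-in; that choice, and all the names in the code, are assumptions. It only counts non-overlapping ("consuming") occurrences and does not build the replaced string.

```csharp
using System;

static class KmpCountSketch
{
    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    static int[] BuildFailureTable(string pattern)
    {
        int[] failure = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
        return failure;
    }

    // Hypothetical helper: counts non-overlapping occurrences of 'pattern'.
    public static int CountNonOverlapping(string text, string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("pattern must not be empty", "pattern");

        int[] failure = BuildFailureTable(pattern);
        int count = 0;
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.Length; i++)
        {
            // On a mismatch, fall back in the pattern instead of moving
            // backwards in the text.
            while (k > 0 && text[i] != pattern[k]) k = failure[k - 1];
            if (text[i] == pattern[k]) k++;
            if (k == pattern.Length)
            {
                count++;
                k = 0; // start fresh after the match: occurrences are consumed
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountNonOverlapping("aaaaaaaaaaaaaaaaaaaaaaaaaa", "aaaaaab")); // 0
        Console.WriteLine(CountNonOverlapping("ohohohohoh", "ohoh"));                    // 2
    }
}
```

Each character of the text is read once, which is the property that matters for inputs like the "aaaa..." example; for typical data the simpler IndexOf-based loop may well be just as fast.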

Until there's a performance problem shown, the OP should stick with 
whatever solution _reads_ the best, and is the most maintainable.  And if 
there is a performance problem shown, measuring each viable alternative is 
the only way to know for sure which will be fastest.

Pete
Author
8 Jul 2009 11:03 AM
Jesse Houwing
Hello Peter,

Agreed on the readability part, but using Regex.Replace opens up a new can
of worms, which people aren't usually prepared for. Say this search/replace
action can be entered from the UI; then adding . or * or { to your search
pattern can lead to unexpected behaviour, or worse, a regex parse error. The
regex will also be expensive, because it has to be parsed/compiled
every time a new pattern is used (and if it is a user-defined replacement,
that would be more often than not).

So this would have to be extended with a Regex.Escape call first. The same
applies to the replacement pattern. Say I want to search for $2 and replace
it with $0.1: you'd get funny things ($2.1, actually). So it isn't just a
matter of using a different call to get the same results.
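
To make the $2 / $0.1 case concrete, here is a small sketch. The helper name is hypothetical. Regex.Escape handles the search pattern; for the replacement string the sketch doubles each "$", since "$$" is the .NET substitution for a literal dollar sign.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexEscapeSketch
{
    // Hypothetical helper: literal search and literal replacement via Regex.
    public static string ReplaceLiteral(string input, string search, string replacement)
    {
        string pattern = Regex.Escape(search);                       // "$2" becomes "\$2"
        string literalReplacement = replacement.Replace("$", "$$");  // "$$" is a literal "$"
        return Regex.Replace(input, pattern, literalReplacement);
    }

    static void Main()
    {
        string input = "costs $2";

        // Only the pattern is escaped: "$0" in the replacement still means
        // "the whole match", so "$0.1" expands to "$2" + ".1".
        Console.WriteLine(Regex.Replace(input, Regex.Escape("$2"), "$0.1")); // costs $2.1

        // Both sides treated as literal text.
        Console.WriteLine(ReplaceLiteral(input, "$2", "$0.1"));              // costs $0.1
    }
}
```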

That said, I'd opt for an extension method on string and write an efficient
version (could use mine as an example) of a Replace method that returns the
number of matches. And from that moment on, use that. Just as readable (or
even better) and no crazy unexpected regex problems due to not exactly
understanding what is involved.

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
Author
8 Jul 2009 11:59 AM
vanderghast
There is a difference between matching and replacing.

Someone can say "ohoh" is matched twice in "ohohoh", once starting at
position 0 and once starting at position 2,

but if you speak of replacing (consuming) it, you have only one possible
'action'.

I haven't tried, but I assume Regex would find 2 matches, while Replace will
replace the pattern just once.

And again, (InitialString.Length - InitialString.Replace(pattern,
String.Empty).Length) / pattern.Length is 'safe', as far as I know, for all
cases, and uses no external loop, to count the number of replacements that
will be made of pattern in InitialString (by whatever newPattern, which is
irrelevant).
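
A quick way to check the matching-versus-consuming question. Regex.Matches returns successive non-overlapping matches, so for a plain pattern it reports one match in "ohohoh", the same as the length-difference count; getting the overlapping count of 2 takes something like a lookahead. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchVersusConsumeSketch
{
    // Non-overlapping ("consuming") matches, as Regex.Matches reports them.
    public static int CountConsuming(string text, string pattern)
    {
        return Regex.Matches(text, Regex.Escape(pattern)).Count;
    }

    // Overlapping ("matching") positions: a zero-width lookahead matches at
    // every index where the pattern starts, without consuming it.
    public static int CountOverlapping(string text, string pattern)
    {
        return Regex.Matches(text, "(?=" + Regex.Escape(pattern) + ")").Count;
    }

    // The length-difference count from the post above.
    public static int CountByLength(string text, string pattern)
    {
        return (text.Length - text.Replace(pattern, String.Empty).Length) / pattern.Length;
    }

    static void Main()
    {
        Console.WriteLine(CountConsuming("ohohoh", "ohoh"));   // 1
        Console.WriteLine(CountOverlapping("ohohoh", "ohoh")); // 2
        Console.WriteLine(CountByLength("ohohoh", "ohoh"));    // 1
    }
}
```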



Vanderghast, Access MVP


Author
8 Jul 2009 2:04 PM
Göran_Andersson
Smith wrote:
> Hi
>
> Usually I only replace strings with text = text.Replace("old text", "new
> text");
>
> Now I need to display the number of replacements made. Is there an easy way
> or do I need some custom replacement method?

You can do the replacing yourself using IndexOf and a StringBuilder, so
that you can count them:


string original = "1234567890123456789012345678901234567890";
string find = "23";
string replacement = "twentythree";

StringBuilder result = new StringBuilder();
int replacements = 0;
int index = 0;
do {
    int newIndex = original.IndexOf(find, index);
    if (newIndex != -1) {
        result.Append(original, index, newIndex - index);
        result.Append(replacement);
        replacements++;
        index = newIndex + find.Length;
    } else {
        result.Append(original, index, original.Length - index);
        index = original.Length;
    }
} while (index < original.Length);

Console.WriteLine(result.ToString());
Console.WriteLine(replacements);


--
Göran Andersson
_____
http://www.guffa.com
