
replacing characters in a string

Author
17 Dec 2008 9:10 AM
Peter
Hi

in my application I get a lot of strings which I have to "clean up"
before I pass them to a third-party library. The strings I have contain
characters which are invalid for the third-party library, so I have to
either remove them or replace them with reasonable alternatives.

What is a good method of doing this?

At the moment I have the following:

string Clean(string element)
{
  element = element.Replace(",", "");
  element = element.Replace("-", "");
  element = element.Replace("!", "");
  element = element.Replace("/", "");
  element = element.Replace("\\", "");

  element = element.Replace("æ", "ae");
  element = element.Replace("Æ", "AE");
  element = element.Replace("ä", "ae");
  element = element.Replace("Ä", "AE");
  element = element.Replace("ø", "oe");
  element = element.Replace("Ø", "OE");
  element = element.Replace("ö", "oe");
  element = element.Replace("Ö", "OE");
  element = element.Replace("å", "aa");
  element = element.Replace("Å", "AA");

  element = element.Trim(' ', '.');

  return element;
}


Thanks,
Peter

Author
17 Dec 2008 9:34 AM
Peter Morris
It will work but you are scanning the whole string for each replace.  I have
no experience of it but Regex.Replace is likely to only scan the string once
and call a delegate each time it finds a match against one of many patterns
you specify...


http://msdn.microsoft.com/en-us/library/ms149475.aspx
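
A minimal sketch of the single-pass idea described above, assuming the Regex.Replace overload that takes a MatchEvaluator delegate. The class name RegexCleaner and the Map/Pattern fields are made up for illustration; the replacement table is copied from the original Clean method. It has not been benchmarked against the chained Replace calls, so treat any speed benefit as unverified.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original post.
static class RegexCleaner
{
    // Replacement table taken from the original Clean method.
    static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { ",", "" }, { "-", "" }, { "!", "" }, { "/", "" }, { "\\", "" },
        { "æ", "ae" }, { "Æ", "AE" }, { "ä", "ae" }, { "Ä", "AE" },
        { "ø", "oe" }, { "Ø", "OE" }, { "ö", "oe" }, { "Ö", "OE" },
        { "å", "aa" }, { "Å", "AA" }
    };

    // Character class matching every key in the table. The backslash is
    // escaped and the hyphen is placed last so it is taken literally.
    static readonly Regex Pattern = new Regex(@"[,!/\\æÆäÄøØöÖåÅ-]");

    public static string Clean(string element)
    {
        // The delegate runs once per match, so the input is scanned once.
        string replaced = Pattern.Replace(element, m => Map[m.Value]);
        return replaced.Trim(' ', '.');
    }
}
```

With this sketch, RegexCleaner.Clean("Yahoo!") gives "Yahoo". The character class and the dictionary have to be kept in sync by hand, which is the main drawback of writing it this way.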





Author
17 Dec 2008 10:21 AM
rossum
On Wed, 17 Dec 2008 01:10:24 -0800, "Peter" <xdz***@hotmail.com>
wrote:

>Hi
>
>in my application I get a lot of strings which I have to "clean up"
>before I pass them to a third-party library. The strings I have contain
>characters which are invalid for the third-party library, so I have to
>either remove them or replace them with reasonable alternatives.
>
>What is a good method of doing this?
>
>At the moment I have the following:
>
>string Clean(string element)
>{
>  element = element.Replace(",", "");
>  element = element.Replace("-", "");
>  element = element.Replace("!", "");
>  element = element.Replace("/", "");
>  element = element.Replace("\\", "");
>
>  element = element.Replace("æ", "ae");
>  element = element.Replace("Æ", "AE");
>  element = element.Replace("ä", "ae");
>  element = element.Replace("Ä", "AE");
>  element = element.Replace("ø", "oe");
>  element = element.Replace("Ø", "OE");
>  element = element.Replace("ö", "oe");
>  element = element.Replace("Ö", "OE");
>  element = element.Replace("å", "aa");
>  element = element.Replace("Å", "AA");
>
>  element = element.Trim(' ', '.');
>
>  return element;
>}
>
>
>Thanks,
>Peter
You are creating a new String for every replace.  Using a
StringBuilder instead avoids this and may run faster:

  string Clean(string element) {
    StringBuilder sb = new StringBuilder(element);
    sb.Replace(",", "");
    // Other replaces

    return sb.ToString();
  }

rossum
Author
18 Dec 2008 2:41 AM
Mihai N.
>   element = element.Replace("æ", "ae");
>   element = element.Replace("Æ", "AE");
>   element = element.Replace("ä", "ae");
>   element = element.Replace("Ä", "AE");
>   element = element.Replace("ø", "oe");
>   element = element.Replace("Ø", "OE");
>   element = element.Replace("ö", "oe");
>   element = element.Replace("Ö", "OE");
>   element = element.Replace("å", "aa");
>   element = element.Replace("Å", "AA");

Any chance of getting a new version of the 3rd-party library?
Some of these replacements are locale-sensitive.
And even for the locales where they are valid, they degrade
the quality of the text.
So that is not "clean up", that is "crap".

Imagine someone doing this to English strings:
   element = element.Replace("w", "vv");
because some stupid library does not support 'w'.



--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Author
18 Dec 2008 2:48 AM
Jeff Johnson
"Peter" <xdz***@hotmail.com> wrote in message
news:%236MjMdCYJHA.4596@TK2MSFTNGP06.phx.gbl...

> in my application I get a lot of strings which I have to "clean up"
> before I pass them to a third-party library. The strings I have contain
> characters which are invalid for the third-party library, so I have to
> either remove them or replace them with reasonable alternatives.
>
> What is a good method of doing this?
>
> At the moment I have the following:
>
> string Clean(string element)
> {
>  element = element.Replace(",", "");
>  element = element.Replace("-", "");
>  element = element.Replace("!", "");
>  element = element.Replace("/", "");
>  element = element.Replace("\\", "");
>
>  element = element.Replace("æ", "ae");
>  element = element.Replace("Æ", "AE");
>  element = element.Replace("ä", "ae");
>  element = element.Replace("Ä", "AE");
>  element = element.Replace("ø", "oe");
>  element = element.Replace("Ø", "OE");
>  element = element.Replace("ö", "oe");
>  element = element.Replace("Ö", "OE");
>  element = element.Replace("å", "aa");
>  element = element.Replace("Å", "AA");
>
>  element = element.Trim(' ', '.');
>
>  return element;
> }

My take: build a "conversion matrix" and then run every character in that
string through the matrix, outputting a clean string in the end. Something
like this (air code!):

private readonly Dictionary<char, string> _conversions =
    new Dictionary<char, string>();

// Constructor (StringCleaner is a placeholder; use your own class name)
public StringCleaner()
{
    // Ideally you would read these from a database or settings file so
    // that you wouldn't have to recompile if you find new things to replace
    _conversions.Add(',', "");
    _conversions.Add('-', "");
    _conversions.Add('!', "");
    _conversions.Add('/', "");
    _conversions.Add('\\', "");
    _conversions.Add('æ', "ae");
    _conversions.Add('Æ', "AE");
    _conversions.Add('ä', "ae");
    _conversions.Add('Ä', "AE");
    _conversions.Add('ø', "oe");
    _conversions.Add('Ø', "OE");
    _conversions.Add('ö', "oe");
    _conversions.Add('Ö', "OE");
    _conversions.Add('å', "aa");
    _conversions.Add('Å', "AA");
}

private string Clean(string element)
{
    StringBuilder sb = new StringBuilder(element.Length);

    foreach (char c in element)
    {
        // A ?: expression mixing string and char does not compile, so use
        // if/else. TryGetValue looks the character up only once.
        string replacement;
        if (_conversions.TryGetValue(c, out replacement))
            sb.Append(replacement);
        else
            sb.Append(c);
    }

    return sb.ToString().Trim(' ', '.');
}

Oh, and for what it's worth, it sounds like your third-party library
sucks....
Author
18 Dec 2008 7:54 AM
Peter
Thanks for all the comments.

With regards to the 3rd-party library, it is a content management
system, and it imposes rules on the names that can be used for path
elements and the "items" or "nodes" which make up the hierarchical
content structure. Some restrictions I can accept, like no / or \ in a
name (much the same as in Windows), but I don't really know why one
can't use [ or ) or "international" letters like æ or ø. I don't have an
exhaustive list of all the invalid characters.
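
When the full list of invalid characters is unknown, one option is to turn the check around and keep only characters known to be accepted. The following is a rough sketch of that idea; the allowed set (ASCII letters, digits and spaces) is purely an assumption about the CMS, and AllowListCleaner and IsAllowed are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the original post.
static class AllowListCleaner
{
    // Assumption: the CMS accepts ASCII letters, digits and spaces.
    // Adjust this to whatever the CMS is actually known to allow.
    static bool IsAllowed(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == ' ';
    }

    public static string Clean(string element)
    {
        StringBuilder sb = new StringBuilder(element.Length);
        foreach (char c in element)
        {
            // Anything outside the allow list is dropped.
            if (IsAllowed(c)) sb.Append(c);
        }
        return sb.ToString().Trim();
    }
}
```

Note that this drops letters such as æ or ø outright, so a replacement table like the ones earlier in the thread would need to run first if those should become "ae" or "oe".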

The data I receive comes from a database, and I then have to insert it
into the CMS, which causes problems if I read "invalid" strings from the
database, so I have to do some sort of "conversion".


/Peter
Author
19 Dec 2008 4:17 AM
Mihai N.
> The data I receive comes from a database, and I have to then insert it
> in the CMS - which gives problems if I read "invalid" strings from the
> database, so I have to make some sort of "conversion".

Is the result visible somewhere "as is", or will it always go through some
"conversion layer"?

Maybe you can come up with some kind of escaping system?

For instance, encode the string as UTF-8, then escape all bytes > 127.
When you get them back, you unescape and get the original UTF-8 strings,
with no characters damaged.
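
A rough sketch of such a reversible escaping scheme. It goes slightly further than "bytes > 127": every byte that is not an ASCII letter or digit is written as a marker followed by two hex digits, so that characters like ! and the marker itself also survive the round trip. The choice of '_' as the marker is an assumption that the CMS accepts it; NameEscaper, Escape and Unescape are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper class, not part of the original post.
static class NameEscaper
{
    // Assumption: '_' and ASCII letters/digits are valid in CMS names.
    const char Marker = '_';

    public static string Escape(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            bool plain = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                      || (b >= '0' && b <= '9');
            if (plain)
                sb.Append((char)b);
            else
                sb.Append(Marker).Append(b.ToString("X2")); // e.g. '!' becomes _21
        }
        return sb.ToString();
    }

    public static string Unescape(string escaped)
    {
        List<byte> bytes = new List<byte>();
        for (int i = 0; i < escaped.Length; i++)
        {
            if (escaped[i] == Marker)
            {
                // The two characters after the marker are one byte in hex.
                bytes.Add(byte.Parse(escaped.Substring(i + 1, 2),
                                     NumberStyles.HexNumber));
                i += 2;
            }
            else
            {
                bytes.Add((byte)escaped[i]);
            }
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

For example, Escape("Yahoo!") gives "Yahoo_21", and Unescape("Yahoo_21") gives back "Yahoo!". The escaped names are not pleasant to read, so this only makes sense if the names always pass through a layer that unescapes them before display.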


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Author
20 Dec 2008 8:25 PM
Peter
Mihai N. wrote:

> > The data I receive comes from a database, and I have to then insert
> > it in the CMS - which gives problems if I read "invalid" strings
> > from the database, so I have to make some sort of "conversion".
>
> Is the result visible somewhere "as is", or it will always go thru
> some "conversion layer"?
>
> Maybe you can come with some kind of escaping system?
>
> For instance have the string as utf-8, then escape all bytes > 127
> When you get them back, you unescape and get the original utf-8
> strings, not characters damaged.

Hi - I'm not sure I completely follow you. What I am doing is reading
company data from a database, and putting them into the hierarchical
structure of the CMS (as items/nodes in the CMS) - as well as some
accompanying data (like contact info, address, images etc).

This is to make it easy for site editors to access and change
information which is shown on some of the website's pages.

Eg.

IT companies
  microsoft
  yahoo

And some of the companies might have "illegal" characters in their
names (e.g. the ! in Yahoo!).


/Peter
