Hi,

I created an application which transforms huge XML files (~150 MB) to CSV files, and I am facing a strange problem. The first 1,000 rows are parsed in 1 sec; after 20,000 rows the speed drops to 100 rows per sec, and after 70,000 rows it drops to 20 rows per sec (I need to parse ~2,500,000 rows).

To me it looks like a GC problem, but I have no idea how to fix it :(

Any ideas are welcome.

--
Thanks,
Maxim

Maxim Kazitov wrote:
> I created an application which transforms huge XML files (~150 MB) to CSV files,
> and I am facing a strange problem. The first 1,000 rows are parsed in 1 sec;
> after 20,000 rows the speed drops to 100 rows per sec, and after 70,000 rows
> it drops to 20 rows per sec (I need to parse ~2,500,000 rows).
>
> To me it looks like a GC problem, but I have no idea how to fix it :(

If you do this transform by reading the whole file into a representation of the XML file and then generating CSV, you are imposing serious memory pressure. If you can read an XML element and write a CSV element, without each iteration adding (much, if at all) to your working set, you might go much faster.

If you do need to build a representation of the whole file, and each XML attribute name and value is a distinct string, you can often save a lot by "interning" string values, eliminating duplicate string values.

It's also entirely possible that this has nothing to do with the GC. What you describe is compatible with some code that's walking a linked list that keeps growing ....

Hi Jon,
I use XmlTextReader, so I don't read all the XML at once. During the parsing I build small XML documents (one XmlDocument per row) and apply a set of XPath expressions to each document. I have a couple of Hashtables in my code, but they are pretty small.

Thanks,
Max

"Jon Shemitz" <j**@midnightbeach.com> wrote in message news:42478916.B18598FF@midnightbeach.com...
> If you do this transform by reading the whole file into a
> representation of the XML file and then generating CSV, you are
> imposing serious memory pressure. If you can read an XML element and
> write a CSV element, without each iteration adding (much, if at all)
> to your working set, you might go much faster.
>
> If you do need to build a representation of the whole file, and each
> XML attribute name and value is a distinct string, you can often save
> a lot by "interning" string values, eliminating duplicate string
> values.
>
> It's also entirely possible that this has nothing to do with the GC.
> What you describe is compatible with some code that's walking a linked
> list that keeps growing ....
>
> --
> www.midnightbeach.com

Are you creating a new XmlDocument each time or reusing the same one? You should ensure that you are simply using the same one and loading the XML string into the same one. I ran into memory issues when I used a lot of XmlDocument instances.

On Mon, 28 Mar 2005 00:01:40 -0500, "Maxim Kazitov" <mvka***@tut.by> wrote:
> I use XmlTextReader, so I don't read all the XML at once. During the parsing
> I build small XML documents (one XmlDocument per row) and apply a set of
> XPath expressions to each document. I have a couple of Hashtables in my code,
> but they are pretty small.

1. Make sure that you "let go" of each XmlDocument when you no longer use it. All references must have gone out of scope, been set to null, or been reassigned to the new XmlDocument. The old documents must not stay around in memory.

2. Call System.GC.Collect() immediately before you create a new XmlDocument. Microsoft pretends this can't happen, but I've seen it myself that the garbage collector's performance can completely break down if you repeatedly allocate large pools of objects without manual Collect calls in between.

Maxim,
Do you need the XmlDocument? Have you considered using the XPathDocument class instead? I don't know if it's more memory friendly than XmlDocument; I do know it is faster than XmlDocument...

Have you used PerfMon or CLR Profiler to see what the lifetime of your objects is? I would use PerfMon first as Willy suggests, and if it suggests a memory problem, then use CLR Profiler to identify specific problems...

Info on the CLR Profiler:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag/html/scalenethowto13.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/highperfmanagedapps.asp

Hope this helps
Jay

You can feed the reader itself into the XSLT processor; is that an option in your app design? This essentially means XSLT itself iterates through the reader (and outputs to a stream). Untested, but perhaps this gives you better results, unless you explicitly need to do something specific in between each intermediate step. (In which case, you could also have XSLT invoke a specified callback, BTW.)

Maxim,
The cause is probably the way you build your CSV files. If you first create them as long strings in memory, then the problem is clear. Can you show that code?

Cor
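
Two of the suggestions so far (XPathDocument instead of XmlDocument, and writing each CSV row as soon as it is read rather than collecting output in memory) can be combined. The following is only a minimal sketch under assumptions: the class name `RowXPathSketch`, the `rowElement` parameter and the sample layout `<rows><row>...</row></rows>` are invented for illustration, `XmlReader.ReadSubtree` needs .NET 2.0 or later, and CSV escaping is left out, so it assumes values contain no commas or quotes. Whether it is actually faster for Maxim's data has not been measured.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class RowXPathSketch
{
    // Streams the input, builds one small XPathDocument per row element,
    // evaluates each XPath against it and writes the CSV line immediately.
    // Returns the number of rows written.
    public static int Convert(TextReader xml, TextWriter csv,
                              string rowElement, string[] xpaths)
    {
        // Compile every expression once, not once per row.
        XPathExpression[] compiled = new XPathExpression[xpaths.Length];
        for (int i = 0; i < xpaths.Length; i++)
            compiled[i] = XPathExpression.Compile(xpaths[i]);

        int rows = 0;
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != rowElement)
                    continue;

                // The subtree reader covers only the current row. When it is
                // closed, the outer reader sits on the row's end tag.
                using (XmlReader sub = reader.ReadSubtree())
                {
                    XPathNavigator nav = new XPathDocument(sub).CreateNavigator();
                    for (int i = 0; i < compiled.Length; i++)
                    {
                        if (i > 0) csv.Write(',');
                        XPathNavigator hit = nav.SelectSingleNode(compiled[i]);
                        if (hit != null) csv.Write(hit.Value); // no escaping in this sketch
                    }
                }
                csv.WriteLine();
                rows++;
                // Nothing from this row is referenced after this point,
                // so the working set should not grow with the row count.
            }
        }
        return rows;
    }
}
```

With a `StreamWriter` opened on the output file as `csv`, each row is written as soon as it has been read, and no per-row object outlives its loop iteration. The XPath expressions are evaluated against a document whose root element is the row itself, for example `/row/id`.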
"Maxim Kazitov" <mvka***@tut.by> wrote in message news:u0GAOw0MFHA.580@TK2MSFTNGP15.phx.gbl...
> Hi,
>
> I created an application which transforms huge XML files (~150 MB) to CSV
> files, and I am facing a strange problem. The first 1,000 rows are parsed in
> 1 sec; after 20,000 rows the speed drops to 100 rows per sec, and after
> 70,000 rows it drops to 20 rows per sec (I need to parse ~2,500,000 rows).
>
> To me it looks like a GC problem, but I have no idea how to fix it :(
>
> Any ideas are welcome.
>
> --
> Thanks,
> Maxim

I could be wrong, but it looks like you are using more memory than is physically available, and as a result the system starts paging and finally starts thrashing. That would mean you are holding references to objects that could otherwise be collected by the GC, so it's not a GC problem, it's a design problem.

I suggest you start by looking at the memory consumption using Perfmon (GC Gen 0, 1 and 2 memory counters) and the paging activity. If it looks like I'm right, you should check your object allocation pattern and check whether you are holding references that could otherwise be released; for instance, references stored in arrays/collections that are no longer needed should be set to null.

Willy.

You should also use StringBuilder to build your output string. If you are using string concatenation, you are creating many string instances, and that is very inefficient. If you are concatenating a string in a loop, always use StringBuilder.

I already use StringBuilder
"Pat A" <pwales***@gmail.com> wrote in message news:1112021073.480415.41820@o13g2000cwo.googlegroups.com...
> You should also use StringBuilder to build your output string. If
> you are using string concatenation, you are creating many string
> instances, and that is very inefficient. If you are concatenating a
> string in a loop, always use StringBuilder.

Maxim Kazitov wrote:
> I already use StringBuilder

Well, that's probably the chief source of your slowdown. Appending to a StringBuilder is a lot faster than repeatedly doing BigString += SmallString, but it still periodically reallocates its internal buffer and copies the data from the old buffer to the new buffer. That's slow in several ways:

* Copying a big buffer runs at bus speeds AND purges cache
* Allocating a large object forces a garbage collection
* Large objects are allocated on the Large Object Heap, a traditional linked-list heap, not a compacted heap.

Can you write your CSV to a file, line by line? That would keep your working set from growing, and would tend to make your algorithm cost linear with the number of rows read.

Have you tried running a profiling tool such as ANTS?
Also, could you comment out some of the code to try to isolate the problem? Look at perfmon / % Time in GC; this should be around 20%.

Steve

"Steve Drake" <Steve@_NOSPAM_.Drakey.co.uk> wrote in message news:OvgZg1GNFHA.1040@TK2MSFTNGP12.phx.gbl...
> look at perfmon / % time in GC, this should be around 20%.

What makes you think that?

Willy.

How are you parsing the XML? Using DOM or XmlReaders?
"Willy Denoyette [MVP]" <willy.denoye***@telenet.be> wrote in message news:%23nfS3LJNFHA.1884@TK2MSFTNGP15.phx.gbl...
> "Steve Drake" <Steve@_NOSPAM_.Drakey.co.uk> wrote in message
> news:OvgZg1GNFHA.1040@TK2MSFTNGP12.phx.gbl...
>> look at perfmon / % time in GC, this should be around 20%.
>
> What makes you think that?
>
> Willy.

Hello Maxim,
I looked at each of your responses. Here is what you appear to be doing:

You read a very large XML document using XmlTextReader.
You apply XPath queries... but you want to use XmlDocument (for some reason) because you want to change the nodes.

Some folks feel that you are using XSLT, but, reading the messages in the microsoft.public.dotnet.framework newsgroup, I don't see you saying anything about XSLT. Perhaps you posted a response to only one NG? Or did others read that in?

If your input is XML and your output is CSV, and you are using XmlTextReader, there is no reason to ever use XmlDocument. You can load data from the XML into a class, manipulate the data through methods and properties, and write it out as CSV, without ever using XmlDocument.

In fact, I'm wondering about something. How complex is the node structure that is used to generate a single CSV record? Are we talking about hundreds of attributes and tightly wound rules (like with a HIPAA XML transaction), or are we talking about a sales invoice (with a few dozen fields and some repeated columns)?

If the latter, then use the XmlTextReader to get the text for each CSV record, extract the InnerXml, and parse it by hand. You are very likely to get a performance ratio that is useful and that you can understand and optimize.

I hope this helps,

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not representative of my employer. I do not answer questions on behalf of my employer. I'm just a programmer helping programmers.
--
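
The approach Nick describes (no XmlDocument at all, each record loaded into a plain class and written out straight away) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the `InvoiceRow` class, the element names `row`, `id` and `customer`, and the quoting rule in `Escape`. It also reads the nodes directly from the reader rather than extracting InnerXml first, which is a simplification of what Nick suggests. `XmlReader.Create` is used here; XmlTextReader derives from XmlReader, so the same loop applies to it.

```csharp
using System.IO;
using System.Xml;

// Hypothetical record type; the field names are invented for this sketch.
public sealed class InvoiceRow
{
    public string Id = "";
    public string Customer = "";
}

public static class HandParsedCsv
{
    // Quote a field only if it contains a comma, a quote or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new char[] { ',', '"', '\r', '\n' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Reads <row> elements one at a time and writes one CSV line per row.
    // Returns the number of rows written. Empty <row/> elements are skipped.
    public static int Convert(TextReader xml, TextWriter csv)
    {
        int rows = 0;
        InvoiceRow row = null;  // row being filled, or null outside a <row>
        string field = null;    // name of the element whose text is being read

        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == "row")
                            row = reader.IsEmptyElement ? null : new InvoiceRow();
                        field = reader.IsEmptyElement ? null : reader.Name;
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (row == null) break;
                        // Append, because one element's text can arrive
                        // as several nodes (for example text next to CDATA).
                        if (field == "id") row.Id += reader.Value;
                        else if (field == "customer") row.Customer += reader.Value;
                        break;

                    case XmlNodeType.EndElement:
                        field = null;
                        if (reader.Name == "row" && row != null)
                        {
                            csv.WriteLine(Escape(row.Id) + "," + Escape(row.Customer));
                            rows++;
                            row = null; // the finished row is no longer referenced
                        }
                        break;
                }
            }
        }
        return rows;
    }
}
```

Because each line goes to the writer as soon as its row ends, nothing accumulates in a StringBuilder, and the only per-row allocations are one small object and its strings.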