foreach, IEnumerable and modifying contents

Author
28 Nov 2007 4:24 AM
jehugaleahsa@gmail.com
I have a rather complex need.

I have a class that parses web pages and extracts all relevant file
addresses. It allows me to download every pdf on a web page, for
instance. I would like to incorporate threads so that I can download N
files separately.

The obvious solution is a thread pool. However, I need to make sure
that I download the files Async - so I can get percentage and status
information to my interface.

I have decided that the best way to do this is to have my Download (a
class representing the file to download) have events raised when
they are finished. I was hoping to have my threads rejoin with the
thread pool when the downloads are finished.

However, I have my Download instances coming out of an
IEnumerable<Download> that is received from the WebExtractor class
(which parses the HTML) on-the-fly using "yield return".

I think I am lacking some basics about Thread Pools. How can I use a
thread pool and have the events fired by the Downloads still reach the
interface? Is there a way to add an event handler to an instance
while in a foreach or IEnumerator code block?

Any help would put me one step closer to being done with my second
release of the software.  Thanks in advance!

~Travis

Author
28 Nov 2007 5:04 AM
Peter Duniho
On 2007-11-27 20:24:28 -0800, "jehugalea***@gmail.com"
<jehugalea***@gmail.com> said:

> I have a rather complex need.

Perhaps.  Though, I suspect it's more that you've created a complex
need, where it wasn't really necessary to do so.

> I have a class that parses web pages and extracts all relevant file
> addresses. It allows me to download every pdf on a web page, for
> instance. I would like to incorporate threads so that I can download N
> files separately.

A reasonably common operation.

> The obvious solution is a thread pool. However, I need to make sure
> that I download the files Async - so I can get percentage and status
> information to my interface.

It seems to me that a different "obvious" solution would be to just use
the async methods on the HttpWebRequest class, or even just a plain
TcpClient or Socket instance, along with a queue.  The producer of the
queue would add URLs to be downloaded, while the consumer would keep
track of how many active downloads are going on (via HttpWebRequest,
TcpClient, or Socket).

Every time the producer adds something to the queue, it would signal
the consumer.  The consumer in response would remove items from the
queue, stopping when either the queue is empty or your maximum number
of concurrent operations has been reached, whichever comes first.

Upon completion of an item, the consumer would also be signaled,
allowing it to pull a new item from the queue.

In the above, I'm thinking of the consumer and producer as individual
threads.  But you could easily implement it without a thread dedicated
to either, with the consumer and producer classes simply being called
by whatever thread happens to be managing them at the time.  In that
case, "signaling" the consumer would be more a matter of just executing
the method that attempts to dequeue more download operations.
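
A minimal sketch of the thread-less variant described above, under stated assumptions: ThrottledQueue, Pump and the startAsync callback are hypothetical names, not anything from the original code. The producer calls Enqueue, and "signaling" the consumer is just a call to Pump, which starts items until the queue is empty or the concurrency limit is reached. Each completion calls back into the same method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; a sketch, not code from this thread.
public class ThrottledQueue<T>
{
    private readonly Queue<T> pending = new Queue<T>();
    private readonly object sync = new object();
    private readonly int maxActive;
    // Begins one asynchronous operation and invokes the callback when it completes.
    private readonly Action<T, Action> startAsync;
    private int active;

    public ThrottledQueue(int maxActive, Action<T, Action> startAsync)
    {
        this.maxActive = maxActive;
        this.startAsync = startAsync;
    }

    // Producer side: add an item, then "signal" the consumer by calling Pump.
    public void Enqueue(T item)
    {
        lock (sync) { pending.Enqueue(item); }
        Pump();
    }

    // Consumer side: start items until the queue is empty or the limit is hit.
    private void Pump()
    {
        while (true)
        {
            T next;
            lock (sync)
            {
                if (pending.Count == 0 || active >= maxActive) return;
                next = pending.Dequeue();
                active++;
            }
            // Started outside the lock so user code never runs while it is held.
            startAsync(next, OnCompleted);
        }
    }

    // Completion side: free a slot, then try to start more work.
    private void OnCompleted()
    {
        lock (sync) { active--; }
        Pump();
    }

    public int ActiveCount { get { lock (sync) { return active; } } }
    public int PendingCount { get { lock (sync) { return pending.Count; } } }
}
```

In a real downloader, startAsync would presumably wrap one of the async network calls mentioned above and invoke the callback from its completion handler. The lock only guards the queue and the counter, so it is never held while a download is starting or finishing.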

> I have decided that the best way to do this is to have my Download (a
> class representing the file to download) have events raised when
> they are finished. I was hoping to have my threads rejoin with the
> thread pool when the downloads are finished.

If you use the async methods on the above-mentioned classes, you get
the thread pooling behavior for free.

> However, I have my Download instances coming out of an
> IEnumerable<Download> that is received from the WebExtractor class
> (which parses the HTML) on-the-fly using "yield return".

This is another reason I think a queue would be better.  There's no
technical reason you can't implement an asynchronous enumerator, but
having done so in this case seems to have overcomplicated the issue.  A
queue seems like a much more natural fit to me, and wouldn't have the
same complicating factors you seem to be running into.

> I think I am lacking some basics about Thread Pools. How can I use a
> thread pool and have the events fired by the Downloads still reach the
> interface?

I think you can avoid the question altogether, but the basic answer is
that the idea of a thread pool and having "the events...reach the
interface" are orthogonal ideas.  Because of the thread pool, you may
have thread synchronization issues to deal with.  But the basic
question of raising an event in a way that some implementer of some
interface receives them isn't affected by whether there are multiple
threads involved.

> Is there a way to add an event handler to an instance
> while in a foreach or IEnumerator code block?

You can subscribe to an event at any time you find convenient.
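
As a rough illustration of subscribing inside a foreach over a "yield return" sequence: the Download class, its Finished event and the Extract method below are hypothetical stand-ins, since the real classes were not posted. Adding a handler changes the yielded object, not the sequence being enumerated, so the enumerator is unaffected.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Download class discussed in this thread.
public class Download
{
    public readonly string Url;
    public event EventHandler Finished;

    public Download(string url) { Url = url; }

    // Pretends to download, then raises Finished on the calling thread.
    public void Start()
    {
        EventHandler handler = Finished;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

public static class SubscribeExample
{
    // Lazily yields downloads, in the spirit of the WebExtractor described above.
    public static IEnumerable<Download> Extract()
    {
        yield return new Download("a.pdf");
        yield return new Download("b.pdf");
    }

    public static List<string> Run()
    {
        List<string> finished = new List<string>();
        foreach (Download download in Extract())
        {
            // Subscribing here modifies the Download instance only;
            // the collection being enumerated is not touched.
            download.Finished += delegate(object sender, EventArgs e)
            {
                finished.Add(((Download)sender).Url);
            };
            download.Start();
        }
        return finished;
    }
}
```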

> Any help would put me one step closer to being done with my second
> release of the software.  Thanks in advance!

See above.  I recommend abandoning this asynchronous enumerator idea
and going with a nice, simple queue.

Pete
Author
28 Nov 2007 5:23 AM
jehugaleahsa@gmail.com

My first implementation actually had a Queue<Download> that was
consumed whenever I was notified that a download had finished. However,
it was difficult for my code to say, "Hey, stop trying to consume!" I
ended up with very rigid code and I was hoping to get away from it.
I was having BIG issues with the events from one download finishing
interrupting while another thread was in the middle of a locked block.
I kept getting the occasional deadlock.

My hope in my new design was to get away from the need for so much
concurrency management. I did that by using the yield return statement
and making that my Queue, in a sense. It also makes the termination
point a lot easier to see. However, without a way of saying, "Hey,
we're not ready to start downloading you yet - wait for a moment", I
was downloading as many files at once as my computer could handle. So
my hope was to find a way to say, "Hey wait" while not needing to
necessarily manage the number of threads/concurrent downloads
manually.

I could try to manage the downloads manually again. I did move a lot
of code around to separate the interface from the downloading, so it
might be easier now than before. ThreadPools seemed more intuitive to
me the second time around. Perhaps my first approach is the better
one.

Thanks for your thoughts,
Travis
Author
28 Nov 2007 5:53 AM
Peter Duniho
On 2007-11-27 21:23:40 -0800, "jehugalea***@gmail.com"
<jehugalea***@gmail.com> said:

> My first implementation actually had a Queue<Download> that was
> consumed whenever I was notified that a download had finished. However,
> it was difficult for my code to say, "Hey, stop trying to consume!"

Typically with a queue, that point is when the queue is empty.  It's
not usually difficult.

> I ended
> up with very rigid code and I was hoping to get away from it.
> I was having BIG issues with the events from one download finishing
> interrupting while another thread was in the middle of a locked block.
> I kept getting the occasional deadlock.

Well, for what it's worth you seem to be dealing with threading issues
anyway.  Deadlock is a consequence of a buggy implementation.  If you
had trouble dealing with thread synchronization in the previous design,
you're likely to have trouble with any other design that also involves
threads.

> My hope in my new design was to get away from the need for so much
> concurrency management.

How you intended to do that by introducing your own thread pool, I'm
not really clear on.  :)

> I did that by using the yield return statement
> and making that my Queue, in a sense. It also makes the termination
> point a lot easier to see. However, without a way of saying, "Hey,
> we're not ready to start downloading you yet - wait for a moment", I
> was downloading as many files at once as my computer could handle. So
> my hope was to find a way to say, "Hey wait" while not needing to
> necessarily manage the number of threads/concurrent downloads
> manually.

Managing that with the queue/async paradigm I mentioned would be
simple.  Especially given the efficiency advantages of using the async
i/o methods on the network classes, it seems to me that managing the
concurrent consumer count by creating your own thread pool is much more
complicated and error-prone.

I'd say the puzzlement you appear to have put yourself into here is a
good indication of that.  :)

> I could try to manage the downloads manually again. I did move a lot
> of code around to separate the interface from the downloading, so it
> might be easier now than before. ThreadPools seemed more intuitive to
> me the second time around. Perhaps my first approach is the better
> one.

If it's like what I suggested, obviously I'd agree.  :)

Pete
Author
28 Nov 2007 3:01 PM
jehugaleahsa@gmail.com
On Nov 27, 10:53 pm, Peter Duniho <NpOeStPe***@NnOwSlPiAnMk.com>
wrote:
> On 2007-11-27 21:23:40 -0800, "jehugalea***@gmail.com"
> <jehugalea***@gmail.com> said:
>
> > My first implementation actually had a Queue<Download> that was
> > consumed whenever I was notified that a download had finished. However,
> > it was difficult for my code to say, "Hey, stop trying to consume!"
>
> Typically with a queue, that point is when the queue is empty.  It's
> not usually difficult.
>

Well, the first go around, the queue being empty didn't mean I was
done. It occurred quite often that I would finish downloading all my
files before more files were added to the list. I should have
mentioned that the application pulls all web pages off of a page and
descends into those as well. It happened often that a web page was
slow to download or that one would have many links, but not much
media. I ended up having an empty queue regularly toward the beginning
of a run.

Since I had code for extracting html pages and another for specific
file types, I had to keep them in sync so that the application would
finish when and only when both were done. Again, this was a bit of a
concurrency issue. Before I used the yield return method, my biggest
indication that the program was being cancelled was a class-wide
variable that needed to be checked regularly (requiring lots of locks).
However, I can just stop the web extractor now and the downloads will
stop being yielded, which stops the downloader. The downloader can
then cancel all running downloads and break out of the consuming loop.
It did make concurrency simpler in this case.

However, now I just have Downloads coming in as fast as they are
found. I will try your approach of starting the next download when I
have time. What I will have to do is write the Download consumer
without a loop, and just call MoveNext on the enumerator when I am
notified that a download has finished.

Here is a scenario: One download finishes and my code begins pulling
the next Download. However, the web extractor is not ready. While
waiting, another download finishes and now a second piece of code
begins pulling the next Download. Now I have two pieces of code trying
to access the same enumerator. Can I be sure that this won't corrupt
my enumerator? If I were to lock the IEnumerator<Download>, would this
cause a deadlock since they are different event handlers?

Concurrency isn't that simple for someone who hasn't had to deal with
it. I had plenty of theory in school, including producer/consumer
algorithms. Dealing with events seems similar to threads, but they
take complete control. Threads at least switch context when they hit a
lock.

Thanks again,
Travis
Author
28 Nov 2007 7:00 PM
Peter Duniho
On 2007-11-28 07:01:58 -0800, "jehugalea***@gmail.com"
<jehugalea***@gmail.com> said:

> Well, the first go around, the queue being empty didn't mean I was
> done. It occurred quite often that I would finish downloading all my
> files before more files were added to the list.

The queue being empty did in fact mean you were done, at least for the moment.

In a typical queue design, you would gracefully deal with an empty
queue.  A queue that's empty just means there's no work to do.  The
consumer sits idle (either as an actual thread blocked on a wait
event, or just a class that doesn't do anything until some code calls
something that adds something new to the queue) until there's more work
to do.  The logic is the same for the case of starting up some
processing as it is for the case of temporarily running out of work to
do and then being presented with some more.

If your design didn't support that, then you probably did not separate
the logic of the producer, consumer, and client of the queue well
enough.
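
One way to picture the "thread blocked on a wait event" case is the sketch below. BlockingWorkQueue and its Close method are hypothetical names introduced for illustration. An empty queue only makes the consumer wait; "no more work will ever arrive" is a separate, explicit signal, which keeps "empty for the moment" and "finished" from being confused.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper; a sketch, not code from this thread.
public class BlockingWorkQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly object sync = new object();
    private bool closed;

    // Producer: add work and wake one idle consumer.
    public void Enqueue(T item)
    {
        lock (sync)
        {
            items.Enqueue(item);
            Monitor.Pulse(sync);
        }
    }

    // Producer: announce that no more work will ever arrive.
    public void Close()
    {
        lock (sync)
        {
            closed = true;
            Monitor.PulseAll(sync);
        }
    }

    // Consumer: blocks while the queue is empty. Returns false only
    // once the queue is both closed and fully drained.
    public bool TryDequeue(out T item)
    {
        lock (sync)
        {
            while (items.Count == 0 && !closed)
            {
                Monitor.Wait(sync); // releases the lock while idle
            }
            if (items.Count == 0)
            {
                item = default(T);
                return false;
            }
            item = items.Dequeue();
            return true;
        }
    }
}
```

A consumer thread would then simply loop on TryDequeue until it returns false. While it is inside Monitor.Wait it uses no CPU, and any Enqueue wakes it up again.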

> [...]
> Here is a scenario: One download finishes and my code begins pulling
> the next Download. However, the web extractor is not ready. While
> waiting, another download finishes and now a second piece of code
> begins pulling the next Download. Now I have two pieces of code trying
> to access the same enumerator. Can I be sure that this won't corrupt
> my enumerator? If I were to lock the IEnumerator<Download>, would this
> cause a deadlock since they are different event handlers?

I can't really comment on an enumerator that you haven't posted code
for.  Also, I haven't used any custom enumerators in real-world code,
so I don't have much experience with them.  However, I would say that
if you have two pieces of code trying to access the same enumerator,
you've got a bug.  I would think that each call to GetEnumerator()
should return a brand new one, so that different parts of the code
don't interfere with each other.

If you do decide to return the same enumerator to different parts of
the code, or different instances of the same code, I'd say that at a
bare minimum you will need to be VERY careful about how you use the
enumerator (and for sure it will need to be written in a thread-safe
way to account for this multiple access usage), and it's very likely
there's a better way to design the code (like using a queue :) ).
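
If one enumerator really must be shared, a sketch of the "very careful" version might look like the following. SharedSource and TryGetNext are hypothetical names. The point is that MoveNext and Current are taken under a single lock, so two callers can never see the same item or skip one. A caller that arrives while another is waiting on a slow producer simply waits its turn.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: lets several callers pull from ONE enumerator safely.
public class SharedSource<T>
{
    private readonly IEnumerator<T> enumerator;
    private readonly object sync = new object();

    public SharedSource(IEnumerable<T> sequence)
    {
        enumerator = sequence.GetEnumerator();
    }

    // MoveNext and Current are performed as one atomic step.
    public bool TryGetNext(out T item)
    {
        lock (sync)
        {
            if (enumerator.MoveNext())
            {
                item = enumerator.Current;
                return true;
            }
            item = default(T);
            return false;
        }
    }
}

public static class EnumeratorDemo
{
    // Each enumeration of this sequence gets its own independent state.
    public static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 3; i++) yield return i;
    }
}
```

The lock by itself does not deadlock. The risk would come from the iterator body waiting on something that can only happen in code that also needs this lock, which is one more argument for keeping the producer and consumer decoupled with a queue.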

> Concurrency isn't that simple for someone who hasn't had to deal with
> it. I had plenty of theory in school, including producer/consumer
> algorithms. Dealing with events seems similar to threads, but they
> take complete control. Threads at least switch context when they hit a
> lock.

Events and threads are, as I mentioned, orthogonal to each other.  An
event is really just a nice syntax for a multi-subscriber callback
mechanism.  When an event is raised, the handler always executes in the
same thread in which it was raised.  Multiple threads impose
synchronization requirements on your code, and these requirements are
the same whether you are using events or not.
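
A small sketch that shows this, with hypothetical names (Worker, Done, EventThreadDemo): the handler is subscribed on one thread, the event is raised on another, and the handler runs on the thread that raised it.

```csharp
using System;
using System.Threading;

// Hypothetical class used only for this demonstration.
public class Worker
{
    public event EventHandler Done;

    public void RaiseDone()
    {
        EventHandler handler = Done;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

public static class EventThreadDemo
{
    // Returns { id of the raising thread, id of the thread the handler ran on }.
    public static int[] Run()
    {
        int raisingThreadId = 0;
        int handlerThreadId = 0;
        Worker worker = new Worker();

        // Subscribed on the calling thread...
        worker.Done += delegate
        {
            handlerThreadId = Thread.CurrentThread.ManagedThreadId;
        };

        Thread background = new Thread(delegate()
        {
            raisingThreadId = Thread.CurrentThread.ManagedThreadId;
            worker.RaiseDone(); // ...but invoked here, on the background thread.
        });
        background.Start();
        background.Join();

        return new int[] { raisingThreadId, handlerThreadId };
    }
}
```

This is why a handler that updates a user interface typically has to hand the work over to the UI thread itself, for example with Control.Invoke or Control.BeginInvoke in Windows Forms.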

That said, I never meant to imply that concurrency was simple.  It's
not.  If anything, my intent is to point out that concurrency is _not_
simple, and that your second design appears to have just made it more
complicated than it otherwise needed to be.

If you want to have multiple threads processing things, you _are_ going
to have to deal with concurrency.  So the question is not whether you
can get away from concurrency issues or not; you can't.  The question
is how complicated are you going to make those issues.

So far, it seems that you've made them very complicated.  :)

For fun, I'm thinking about working on a simple download simulation
that uses a queue to manage the downloads.  If and when it's finished,
I'll post the code here in case you or anyone else is interested. 
Might not be done today, as I've got a busy day, but maybe tomorrow.

Pete
Author
28 Nov 2007 9:45 PM
jehugaleahsa@gmail.com

Your extended effort to help me is commendable. Thank you very much.

> In a typical queue design, you would gracefully deal with an empty
> queue.  A queue that's empty just means there's no work to do.  The
> consumer sits idle (either as an actual thread blocked on a wait
> event, or just a class that doesn't do anything until some code calls
> something that adds something new to the queue) until there's more work
> to do.  The logic is the same for the case of starting up some
> processing as it is for the case of temporarily running out of work to
> do and then being presented with some more.

I grasp what you are saying, but I'm not sure what the thread does
while it is idle. That or I'm not sure how to wake it up.

When you use "yield return", it actually is very much like a thread.
It returns one thing and goes away until the next is needed. The class
processing the downloads does idle before the next Download is
yielded. This is just how "yield return" works and it did make my code
*seem* cleaner. All methods with "yield return" return IEnumerable.
You access the yielded data using an IEnumerator. So, I'm just using a
foreach loop. It looks like this:

public class DownloadManager
{
    WebExtractor extractor = new WebExtractor(/* Arguments */);
    bool cancelled = false;
    object cancelSync = new object();

    public void DownloadFiles()
    {
        // BEGIN THREAD
        // WebExtractor.Start yield returns Downloads as they are found.
        foreach (Download download in extractor.Start())
        {
            // add event handlers
            download.Start();
            lock (cancelSync)
            {
                if (cancelled)
                {
                    break;
                }
            }
        }
        // END THREAD
    }

    public void Cancel()
    {
        // BEGIN THREAD
        lock (cancelSync)
        {
            cancelled = true;
        }
        // END THREAD
    }
}
Author
28 Nov 2007 10:07 PM
jehugaleahsa@gmail.com
With Semaphores for example:

public class DownloadManager
{
    WebExtractor extractor = new WebExtractor(/* Arguments */);
    bool cancelled = false;
    object cancelSync = new object();
    Semaphore semaphore = new Semaphore(5, 5);


    public void DownloadFiles()
    {
        // BEGIN THREAD
        // WebExtractor.Start yield returns Downloads as they are found.
        foreach (Download download in extractor.Start())
        {
            // wait for a free download slot
            semaphore.WaitOne();
            // add event handlers
            download.StatusChanged += status_Changed;
            download.Start();
            lock (cancelSync)
            {
                if (cancelled)
                {
                    break;
                }
            }
        }
        // END THREAD
    }


    private void status_Changed(object sender, StatusChangedEventArgs e)
    {
         if (e.Status == DownloadStatus.Complete)
         {
             semaphore.Release();
         }
    }

    public void Cancel()
    {
        // BEGIN THREAD
        lock (cancelSync)
        {
            cancelled = true;
        }
        // END THREAD
    }
}
Author
29 Nov 2007 3:14 AM
jehugaleahsa@gmail.com

Actually, it appears that using Semaphores with WebClient is a no-no.
Author
29 Nov 2007 2:47 PM
Ben Voigt [C++ MVP]
<jehugalea***@gmail.com> wrote in message
news:15b074ac-3777-4b3d-bbfb-afeca6dd9784@o42g2000hsc.googlegroups.com...
> Actually, it appears that using Semaphores with WebClient is a no-no.

That surprises me.

I thought you might have some issues with the spidering/page parsing not
running until there is a download slot available, and the code you posted
clearly won't cancel the spider until one of the downloads completes
(perhaps you can cancel each download somehow).

What exactly is going wrong?  Does it help to use BeginInvoke to perform the
download from a thread other than the one holding the semaphore?
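
On the cancellation point, one possible approach (a sketch with hypothetical names, not a diagnosis of the WebClient problem) is to wait on the semaphore and a cancel event at the same time with WaitHandle.WaitAny, so that the loop can stop even when no download slot is free.

```csharp
using System;
using System.Threading;

// Hypothetical helper; a sketch, not code from this thread.
public class CancellableLimiter
{
    private readonly Semaphore slots;
    private readonly ManualResetEvent cancelled = new ManualResetEvent(false);

    public CancellableLimiter(int maxConcurrent)
    {
        slots = new Semaphore(maxConcurrent, maxConcurrent);
    }

    // Blocks until a slot is free or Cancel is called.
    // Returns true if a slot was acquired, false if cancelled.
    public bool TryAcquire()
    {
        // The cancel event is listed first, so it wins if both are signaled.
        int signaled = WaitHandle.WaitAny(new WaitHandle[] { cancelled, slots });
        return signaled == 1;
    }

    // Called when a download completes.
    public void Release() { slots.Release(); }

    // Wakes every waiter and makes later TryAcquire calls return false.
    public void Cancel() { cancelled.Set(); }
}
```

In the posted loop, semaphore.WaitOne() would become something like "if (!limiter.TryAcquire()) break;", and Cancel() would call limiter.Cancel() instead of setting a flag under a lock.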
Author
28 Nov 2007 7:26 PM
Ben Voigt [C++ MVP]
<jehugalea***@gmail.com> wrote in message
news:71a6cde4-2d04-46f8-aa41-e3ec39226702@e23g2000prf.googlegroups.com...
> On Nov 27, 10:53 pm, Peter Duniho <NpOeStPe***@NnOwSlPiAnMk.com>
> wrote:
>> On 2007-11-27 21:23:40 -0800, "jehugalea***@gmail.com"
>> <jehugalea***@gmail.com> said:
>>
>> > My first implementation actually had a Queue<Download> that was
>> > consumed when I recieved that a download had finished. However, it was
>> > difficult for my code to say, "Hey, stop trying to consume!"
>>
>> Typically with a queue, that point is when the queue is empty.  It's
>> not usually difficult.
>>
>
> Well, the first go around, the queue being empty didn't mean I was
> done. It occurred quite often that I would finish downloading all my
> files before more files were added to the list. I should have
> mentioned that the application pulls all web pages off of a page and
> descends into those as well. It happened often that a web page was
> slow to download or that one would have many links, but not much
> media. I ended up having an empty queue regularly toward the beginning
> of a run.
>
> Since I had code for extracting html pages and another for specific
> file types, I had to keep them in sync so that the application would
> finish when and only when both were done. Again, this was a bit of a
> concurrency issue. Before I used the yield return method, my biggest
> indication that the program was being cancelled was a class-wide
> variable that needed to be checked regularly (requiring lots of locks).
> However, I can just stop the web extractor now and the downloads will
> stop being yielded, which stops the downloader. The downloader can
> then cancel all running downloads and break out of the consuming loop.
> It did make concurrency simpler in this case.
>
> However, now I just have Downloads coming in as fast as they are
> found. I will try your approach of starting the next download when I
> have time. What I will have to do is write the Download consumer
> without a loop, and just call MoveNext on the enumerator when I am
> notified that a download has finished.
>
> Here is a scenario: One download finishes and my code begins pulling
> the next Download. However, the web extractor is not ready. While
> waiting, another download finishes and now a second piece of code
> begins pulling the next Download. Now I have two pieces of code trying
> to access the same enumerator. Can I be sure that this won't corrupt
> my enumerator? If I were to lock the IEnumerator<Download>, would this
> cause a deadlock since they are different event handlers?

something like:

delegate ... DownloadProcessor(...);
Semaphore limit = new Semaphore(N, N); // N free slots, at most N

foreach (Download down in GetDownloads()) {
  limit.WaitOne(); // block until a download slot is free
  DownloadProcessor dp = down.Process;
  // the AsyncCallback releases one slot after each download completes
  dp.BeginInvoke(..., delegate { limit.Release(); }, null);
}
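
A fuller, runnable sketch of the same idea, under some assumptions: `Download` and `GetDownloads` here are hypothetical stand-ins for your classes, `Process` just sleeps instead of doing network I/O, and I have swapped delegate BeginInvoke for ThreadPool.QueueUserWorkItem, since BeginInvoke on delegates is not supported on newer .NET runtimes. Only the enumerating thread ever touches the enumerator, so it needs no lock.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the thread's Download class.
class Download
{
    public readonly string Name;
    public Download(string name) { Name = name; }

    // Simulates a blocking transfer; real code would do network I/O here.
    public void Process() { Thread.Sleep(50); }
}

static class ThrottledDownloader
{
    // Hypothetical stand-in for the spider's "yield return" sequence.
    public static IEnumerable<Download> GetDownloads(int count)
    {
        for (int i = 0; i < count; i++)
            yield return new Download("file" + i);
    }

    // Runs every download with at most maxParallel in flight at once.
    // Returns the highest number of simultaneous downloads observed.
    public static int RunAll(IEnumerable<Download> downloads, int maxParallel)
    {
        int running = 0, peak = 0;
        object gate = new object();

        using (Semaphore limit = new Semaphore(maxParallel, maxParallel))
        {
            foreach (Download down in downloads)
            {
                limit.WaitOne();          // block until a slot is free
                Download current = down;  // per-iteration copy for the closure
                ThreadPool.QueueUserWorkItem(delegate
                {
                    lock (gate) { running++; if (running > peak) peak = running; }
                    try { current.Process(); }
                    finally
                    {
                        lock (gate) { running--; }
                        limit.Release();  // hand the slot back, even on failure
                    }
                });
            }

            // Drain: once every slot is reacquired, every worker has finished.
            for (int i = 0; i < maxParallel; i++)
                limit.WaitOne();
        }

        lock (gate) { return peak; }
    }

    static void Main()
    {
        int peak = RunAll(GetDownloads(10), 3);
        Console.WriteLine("Peak concurrency: " + peak);
    }
}
```

The loop never pulls the next Download until a slot is free, so the spider is throttled by the downloads rather than racing ahead of them.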

>
> Concurrency isn't that simple for someone who hasn't had to deal with
> it. I had plenty of theory in school, including producer/consumer
> algorithms. Dealing with events seems similar to threads, but they
> take complete control. Threads at least switch context when they hit a
> lock.
>
> Thanks again,
> Travis
Author
28 Nov 2007 9:30 PM
jehugaleahsa@gmail.com
On Nov 28, 12:26 pm, "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote:
> <jehugalea***@gmail.com> wrote in message
>
> news:71a6cde4-2d04-46f8-aa41-e3ec39226702@e23g2000prf.googlegroups.com...
>
> > On Nov 27, 10:53 pm, Peter Duniho <NpOeStPe***@NnOwSlPiAnMk.com>
> > wrote:
> >> On 2007-11-27 21:23:40 -0800, "jehugalea***@gmail.com"
> >> <jehugalea***@gmail.com> said:
>
> >> > My first implementation actually had a Queue<Download> that was
> >> > consumed when I received word that a download had finished. However, it was
> >> > difficult for my code to say, "Hey, stop trying to consume!"
>
> >> Typically with a queue, that point is when the queue is empty.  It's
> >> not usually difficult.
>
> > Well, the first go around, the queue being empty didn't mean I was
> > done. It occurred quite often that I would finish downloading all my
> > files before more files were added to the list. I should have
> > mentioned that the application pulls all web pages off of a page and
> > descends into those as well. It happened often that a web page was
> > slow to download or that one would have many links, but not much
> > media. I ended up having an empty queue regularly toward the beginning
> > of a run.
>
> > Since I had code for extracting html pages and another for specific
> > file types, I had to keep them in sync so that the application would
> > finish when and only when both were done. Again, this was a bit of a
> > concurrency issue. Before I used the yield return method, my biggest
> > indication that the program was being cancelled was a class-wide
> > variable that needed to be checked regularly (requiring lots of locks).
> > However, I can just stop the web extractor now and the downloads will
> > stop being yielded, which stops the downloader. The downloader can
> > then cancel all running downloads and break out of the consuming loop.
> > It did make concurrency simpler in this case.
>
> > However, now I just have Downloads coming in as fast as they are
> > found. I will try your approach of starting the next download when I
> > have time. What I will have to do is write the Download consumer
> > without a loop, and just call MoveNext on the enumerator when I am
> > notified that a download has finished.
>
> > Here is a scenario: One download finishes and my code begins pulling
> > the next Download. However, the web extractor is not ready. While
> > waiting, another download finishes and now a second piece of code
> > begins pulling the next Download. Now I have two pieces of code trying
> > to access the same enumerator. Can I be sure that this won't corrupt
> > my enumerator? If I were to lock the IEnumerator<Download>, would this
> > cause a deadlock since they are different event handlers?
>
> something like:
>
> delegate ... DownloadProcessor(...);
> Semaphore limit = new Semaphore(N, N); // N free slots, at most N
>
> foreach (Download down in GetDownloads()) {
>   limit.WaitOne(); // block until a download slot is free
>   DownloadProcessor dp = down.Process;
>   // the AsyncCallback releases one slot after each download completes
>   dp.BeginInvoke(..., delegate { limit.Release(); }, null);
> }
>
> > Concurrency isn't that simple for someone who hasn't had to deal with
> > it. I had plenty of theory in school, including producer/consumer
> > algorithms. Dealing with events seems similar to threads, but they
> > take complete control. Threads at least switch context when they hit a
> > lock.
>
> > Thanks again,
> > Travis

So the semaphore tells me when to wait. I will try that as well when I
get a chance. I will need to read up on Semaphores too.
