Fetch-on-demand iterators (was Re: [CDBI] Make CDBI go fast)

Michael G Schwern schwern at gmail.com
Thu Feb 15 16:40:02 GMT 2007


Here's a lump reply to the concerns about having iterators fetch on demand.

* Make iterators fetch rows on demand. (Done: see rt.cpan.org 24959)

> This sounds cool, but the change makes me nervous. Some DBMSs and/or DBDs
> (Sybase, for example) don't like unfinished statement handles hanging around.
> Could be problematic in a mod_perl environment?

Yes, this is a concern, but it's no worse than a regular active statement handle.  It depends on how long your iterators live; you now have to treat them like you would statement handles.  Further discussion and enhancements on that can be found in the ticket notes on rt.cpan.org.

One possibility is to move executing the statement handle into the iterator so it doesn't even execute the query until you've asked for the first object.  This would make it safer to hand around iterator objects.
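Here's a rough sketch of the idea (a hypothetical Lazy::Iterator class, not the actual Class::DBI::Iterator API):

    package Lazy::Iterator;
    use strict;
    use warnings;

    # Hold the prepared-but-unexecuted handle; the query doesn't run
    # until someone actually asks for a row.
    sub new {
        my ($class, $sth, $args) = @_;
        return bless { sth => $sth, args => $args, executed => 0 }, $class;
    }

    sub next {
        my $self = shift;

        # Defer execution until the first call to next().
        unless ($self->{executed}) {
            $self->{sth}->execute(@{ $self->{args} });
            $self->{executed} = 1;
        }

        return $self->{sth}->fetch;
    }

    # Finish the handle as soon as the iterator goes away so no
    # half-read statement handle outlives it.
    sub DESTROY {
        my $self = shift;
        $self->{sth}->finish if $self->{executed} && $self->{sth}{Active};
    }

    1;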

I'm using this in a mod_perl environment with no problems yet.


> What about CDBI->search() in list context? I hope there would be no change in
> implementation there. I can certainly see real-world scenarios where
> pre-fetching would beat fetching/inflating on demand.

It has remained the same.

I've also optimized void context to not fetch or return anything... should you call search() in void context for some reason.  It should probably be a warning.
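In case it's not obvious how the three cases get told apart: wantarray() is true in list context, false but defined in scalar context, and undef in void context.  A simplified sketch of the dispatch (not the actual Class::DBI source; _fetch_all is a hypothetical helper):

    sub search_results {
        my ($class, $sth) = @_;

        if (wantarray) {
            # List context: fetch everything up front, same as always.
            return $class->_fetch_all($sth);
        }
        elsif (defined wantarray) {
            # Scalar context: hand back a fetch-on-demand iterator.
            return Lazy::Iterator->new($sth);
        }
        else {
            # Void context: nothing to fetch, nothing to return.
            return;
        }
    }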


> Keep in mind that the behavior of the DBD classes when fetching rows
> from a handle varies.  With MySQL, it will fetch all the rows into
> memory as soon as you execute the statement, unless you explicitly
> tell it not to.  With PostgreSQL, I think you have to use cursors to
> prevent it from loading all the rows when you execute.

Whatever the backend implementation, it sure sped things up.  The performance of search()->first when the search returns a million rows went from about 2 seconds to almost negligible.  Remember, even if the database is doing the same amount of work, it still has to transfer the rows to your process, run them through DBI, and bind them all into a list of hashes.

Before

$ time perl -wle 'package Foo;  use base qw(Class::DBI);  Foo->connection("DBI:mysql:database=test;host=localhost", "", "");  Foo->table("foo");  Foo->columns( All => "id" );  print Foo->retrieve_all->first'
1

real    0m1.946s
user    0m1.412s
sys     0m0.519s


After

0 windhund ~/devel/Class-DBI$ time perl -Ilib -wle 'package Foo;  use base qw(Class::DBI);  Foo->connection("DBI:mysql:database=test;host=localhost", "", "");  Foo->table("foo");  Foo->columns( All => "id" );  print Foo->retrieve_all->first'
1

real    0m0.778s
user    0m0.284s
sys     0m0.476s

That's for a simple table with 100,000 rows.


> I guess the bulk of my searches are displaying a list of results a
> page at a time so I'm using LIMIT.

LIMIT is available via CDBI::AbstractSearch->search_where().
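Something like this, if memory serves (check the Class::DBI::AbstractSearch docs for the exact attribute names; limit/offset support comes from SQL::Abstract::Limit, and My::Widget is a made-up class):

    # Page 3 of blue widgets, 20 per page.
    my @widgets = My::Widget->search_where(
        { color => "blue" },
        {
            order_by => "name",
            limit    => 20,
            offset   => 40,
        },
    );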


> I'd rather see some of CDBI::Sweet's features added into CDBI so using
> LIMIT is the way to reduce fetching un-wanted rows.

While using a LIMIT is preferable, it requires that you know ahead of time how many rows you will need.  This is not always possible.


> I'd rather use the memory then do extra trips to the database

Do not be fooled!  There really is no "give me everything" bulk row loading optimization in DBI.  selectall/fetchall is just calling $sth->fetch in a loop!  This new iterator code does not make any extra trips to the database.  It should result in pretty much the same performance as before even when you do run through the whole iterator (it's currently a bit slower; I'm working on that).
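To make that concrete, here's roughly the loop hiding inside $dbh->selectall_arrayref() (simplified; the real DBI code also handles slices and errors):

    use DBI;

    my $dbh = DBI->connect("DBI:mysql:database=test;host=localhost", "", "");

    my $sth = $dbh->prepare("SELECT id FROM foo");
    $sth->execute;

    my @rows;
    while (my $row = $sth->fetch) {
        push @rows, [ @$row ];    # copy; fetch() reuses its row buffer
    }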

In fact, CDBI->sth_to_objects just did this:

                # Run the query (unless the handle is already active),
                # bind each column directly into %data, then copy %data
                # once per row -- every row is pulled into memory up front.
                $sth->execute(@$args) unless $sth->{Active};
                $sth->bind_columns(\(@data{ @{ $sth->{NAME_lc} } }));
                push @rows, {%data} while $sth->fetch;

I'm using fetchrow_hashref() instead.  It's open to tweaking.
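The per-row fetch boils down to something like this (a simplified sketch, not the exact patch):

    # Pull and inflate exactly one row per call instead of slurping
    # everything at construction time.
    sub next {
        my $self = shift;

        my $row = $self->{sth}->fetchrow_hashref or do {
            $self->{sth}->finish;
            return;
        };

        return $self->{class}->construct($row);
    }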


