slow realloc: alternate method?

slow realloc: alternate method?

Jacob Yocom-Piatt
i've got some C code that reads from an 800 MB CSV file and allocates
memory for an array to store the data in. the method used is to read the
CSV file line-by-line and realloc additional space with each line read.
having timed this and found the realloc speed to be low when the array
is large, i am aiming to make this faster but am not sure about the best
way to proceed.

the current code uses realloc in the manner suggested by the manpage:

    newsize = size + 1;
    time(&t1);                  // start timing realloc
    if ((newap = (int *)realloc(ap, newsize*sizeof(int))) == NULL) {
        free(ap);
        ap = NULL;
        size = 0;
        return (NULL);
    }
    time(&t2);                  // stop timing realloc; start timing fscanf

as the size of ap grows, so does the time it takes to realloc the space.

an alternative to this procedure would be to scan through the CSV file to
determine how many array entries i would need, realloc it all at once, then go
back through the CSV file again to read the data into the array. i'm not
confident this is the only way to do this and would appreciate any suggestions
for speeding up this procedure.

cheers,
jake


Re: slow realloc: alternate method?

Dimitry Andric-2
Jacob Yocom-Piatt wrote:
> an alternative to this procedure would be to scan through the CSV file to
> determine how many array entries i would need, realloc it all at once, then go
> back through the CSV file again to read the data into the array.

Try starting with a reasonable number of lines, e.g. the average number
of lines in most of your CSV files, and doubling the buffer each time
you fill it up.
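For instance, the doubling strategy might be sketched like this (the
names `append`, `len`, and `cap` are illustrative, not from the original
code):

```c
#include <stdlib.h>

/* Append one value to a growable array, doubling the capacity
 * whenever it fills up.  Returns 0 on success, -1 on allocation
 * failure (the caller still owns the old block on failure). */
static int append(int **ap, size_t *len, size_t *cap, int value)
{
    if (*len == *cap) {
        size_t newcap = *cap ? *cap * 2 : 1024;   /* initial guess */
        int *tmp = realloc(*ap, newcap * sizeof(int));
        if (tmp == NULL)
            return -1;
        *ap = tmp;
        *cap = newcap;
    }
    (*ap)[(*len)++] = value;
    return 0;
}
```

With this, each element is copied only a constant number of times on
average, instead of potentially once per line read.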


Re: slow realloc: alternate method?

Otto Moerbeek
In reply to this post by Jacob Yocom-Piatt
On Fri, 16 Jun 2006, Jacob Yocom-Piatt wrote:

> [initial description and timing code trimmed]
>
> as the size of ap grows, so does the time it takes to realloc the space.
>
> an alternative to this procedure would be to scan through the CSV file to
> determine how many array entries i would need, realloc it all at once, then go
> back through the CSV file again to read the data into the array. i'm not
> confident this is the only way to do this and would appreciate any suggestions
> for speeding up this procedure.

You'll have to think about how this works: realloc has to copy the
data to the newly allocated region if it is not able to extend the
current region.

If you do that each time you increase the allocation, and your size
increment is small, you'll do a lot of work. The way you are doing it
now has quadratic time complexity.

So make your increment larger and start with a larger size. Maybe you
can estimate the initial size based on your file size. If you end up
allocating too large an area, just use realloc to decrease the size
after you're done.

Also think about other data structures: you might be better off with a
linked list or tree here, depending on what you are doing with your
data.
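A sketch of that estimate-then-trim approach (the 16-bytes-per-line
average is a made-up figure here; a real program would pick it from its
own data):

```c
#include <stdlib.h>
#include <sys/stat.h>

/* Assumed average CSV line length -- an illustrative guess, not a
 * measured value. */
#define AVG_LINE_BYTES 16

/* Guess an element count from the file size and over-allocate once,
 * instead of growing one element at a time. */
static int *alloc_for_file(const char *path, size_t *cap)
{
    struct stat st;
    if (stat(path, &st) == -1)
        return NULL;
    *cap = st.st_size / AVG_LINE_BYTES + 1;
    return malloc(*cap * sizeof(int));
}

/* After reading, give the unused tail back with a shrinking realloc. */
static int *trim(int *ap, size_t len)
{
    int *tmp = realloc(ap, len * sizeof(int));
    return tmp != NULL ? tmp : ap;   /* keep the old block if it fails */
}
```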

        -Otto


Re: slow realloc: alternate method?

Gilles Chehade
On Fri, Jun 16, 2006 at 10:14:07PM +0200, Otto Moerbeek wrote:

> On Fri, 16 Jun 2006, Jacob Yocom-Piatt wrote:
> > [original message trimmed]
>
> You'll have to think about how this works: realloc has to copy the
> data to the newly allocated region if it is not able to extend the
> current region.
>
> If you do that each time you increase the allocation, and your size
> increment is small, you'll do a lot of work. The way you are doing it
> now has quadratic time complexity.
>
> So make your increment larger and start with a larger size. Maybe you
> can estimate the initial size based on your file size. If you end up
> allocating too large an area, just use realloc to decrease the size
> after you're done.
>
> Also think about other data structures: you might be better off with
> a linked list or tree here, depending on what you are doing with your
> data.
>
> -Otto

Also, if the purpose of this is to load the file in memory, why not
just mmap() it in the first place?


Re: slow realloc: alternate method?

Gilles Chehade
On Fri, Jun 16, 2006 at 10:40:29PM +0200, [hidden email] wrote:

> > [earlier messages trimmed]
>
> Also, if the purpose of this is to load the file in memory, why not
> just mmap() it in the first place?
>

Oops ... I was doing two things at the same time and did not finish the
mail before sending it.

So basically, why not mmap() the file, go through the map counting
'\n' characters and replacing each with '\0' until you reach the end
of the map, then allocate an array the size of the count and have each
array entry point to the start of its line in the memory map?
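That scheme might look something like this (a sketch only: it assumes a
non-empty file whose last line ends in '\n', and the function name
`maplines` is illustrative):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and split it into lines in place: count the '\n's,
 * allocate one pointer per line, then overwrite each '\n' with '\0'.
 * MAP_PRIVATE keeps the '\0' writes out of the file on disk. */
static char **maplines(const char *path, size_t *nlines)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }
    char *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return NULL;

    /* First pass: count the lines. */
    size_t n = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (map[i] == '\n')
            n++;

    char **lines = malloc(n * sizeof(char *));
    if (lines == NULL) {
        munmap(map, st.st_size);
        return NULL;
    }

    /* Second pass: terminate each line and record its start. */
    size_t j = 0;
    char *start = map;
    for (off_t i = 0; i < st.st_size; i++) {
        if (map[i] == '\n') {
            map[i] = '\0';
            lines[j++] = start;
            start = map + i + 1;
        }
    }
    *nlines = n;
    return lines;
}
```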


Re: slow realloc: alternate method?

Otto Moerbeek
On Fri, 16 Jun 2006, [hidden email] wrote:

> So basically, why not mmap() the file, go through the map counting
> '\n' characters and replacing each with '\0' until you reach the end
> of the map, then allocate an array the size of the count and have
> each array entry point to the start of its line in the memory map?

I implemented a similar technique in patch(1), minus the '\n' -> '\0'
replacing; patch does not use NUL terminated strings internally.

But it all depends on which data is being stored: just the lines from
the file, or data based on the chars found in the file.

        -Otto


Re: slow realloc: alternate method?

Gilles Chehade
On Fri, Jun 16, 2006 at 11:17:39PM +0200, Otto Moerbeek wrote:

> On Fri, 16 Jun 2006, [hidden email] wrote:
>
> > [mmap scheme trimmed]
>
> I implemented a similar technique in patch(1), minus the '\n' -> '\0'
> replacing; patch does not use NUL terminated strings internally.
>
> But it all depends on which data is being stored: just the lines from
> the file, or data based on the chars found in the file.
>
> -Otto
>

Yup,
I used this (in the function splitfields) where the delimiter was
chosen with getopt:

        http://etudiant.epitech.net/~veins/sort/sort.c


Re: slow realloc: alternate method?

Jacob Yocom-Piatt
In reply to this post by Jacob Yocom-Piatt
---- Original message ----

>Date: Fri, 16 Jun 2006 22:14:07 +0200 (CEST)
>From: Otto Moerbeek <[hidden email]>  
>Subject: Re: slow realloc: alternate method?  
>To: Jacob Yocom-Piatt <[hidden email]>
>Cc: [hidden email]
>
>
>So make your increment larger and start with a larger size. Maybe you
>can estimate the initial size based on your file size. If you end up
>allocating too large an area, just use realloc to decrease the size
>after you're done.
>
>Also think about other data structures: you might be better off with a
>linked list or tree here, depending on what you are doing with your
>data.

thx for all the suggestions guys. i like the idea of allocating too much space
and trimming it at the end.

the data is tick price data and once it's read into an array, it is searched for
patterns by accessing elements in sequence. linked lists were mentioned by
someone offlist.


Re: slow realloc: alternate method?

Matthew R. Dempsky
In reply to this post by Jacob Yocom-Piatt
On Fri, Jun 16, 2006 at 10:55:05AM -0500, Jacob Yocom-Piatt wrote:

> the current code uses realloc in the manner suggested by the manpage:
>
>     newsize = size + 1;
>     time(&t1);                  // start timing realloc
>     if ((newap = (int *)realloc(ap, newsize*sizeof(int))) == NULL) {
>         free(ap);
>         ap = NULL;
>         size = 0;
>         return (NULL);
>     }
>     time(&t2);                  // stop timing realloc; start timing fscanf
>
> as the size of ap grows, so does the time it takes to realloc the space.

Growing your array by only a constant amount each iteration takes
quadratic time.  By instead doubling the array size each time as
necessary, you can reduce this to (amortized) linear time.  (I believe
the man page's intention was to show how to avoid leaking memory, not
how to write an efficient program.)
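The difference is easy to see by counting element copies under each
growth policy (illustrative arithmetic only; it assumes every realloc
has to move the block, which is the worst case):

```c
#include <stddef.h>

/* Total elements copied while growing to n slots, one slot at a
 * time, assuming every realloc moves the block. */
static size_t copies_grow_by_one(size_t n)
{
    size_t total = 0;
    for (size_t sz = 1; sz < n; sz++)
        total += sz;            /* copy sz elements on each grow */
    return total;               /* about n*n/2: quadratic */
}

/* The same count when the capacity doubles instead. */
static size_t copies_doubling(size_t n)
{
    size_t total = 0;
    for (size_t sz = 1; sz < n; sz *= 2)
        total += sz;            /* copy sz elements on each grow */
    return total;               /* less than 2n: amortized linear */
}
```

For a million elements that is roughly 5*10^11 copies versus fewer
than 2*10^6.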

Alternatively, just do as others have suggested and mmap() the file and
make an extra preliminary pass.


Re: slow realloc: alternate method?

Graham Toal
In reply to this post by Gilles Chehade
> Yup,
> I used this in (function splitfields) where the delimiter was chosen
> with getopt:
>
> http://etudiant.epitech.net/~veins/sort/sort.c

Oh yes, sort... that reminds me...

  http://www.gtoal.com/wordgames/sort/sort.[ch]

- see the above for the epitome of managing store yourself...

It's not pretty and I'm not proud of the actual code but I was very
pleased with the way I did the memory management.  It allocates a
single block of RAM at startup and partitions it rather sneakily
for all objects ever needed by the program, I believe as efficiently
as is possible.

I wrote this because we were sorting text files of hundreds of MB on
a machine that had 4 MB of physical RAM; GNU sort had a serious bug
in that you told it how much RAM you wanted to use but it then went
over your request quite significantly.  This was semi-OK on Unix,
where the use of VM would at least let the program run (albeit
thrashing VM a lot), but on our hardware with physical memory only,
it *always* failed a malloc at some point.  As has been noted in
this thread, realloc is evil!

The code is well commented. Some might say to excess :-)  It's
probably one of the most complex pieces of memory management I've
had to do in an application, and I didn't want to take the risk
that I'd have to do maintenance on it 20 years later, and take
one look at it and say "WTF???!"  (So I guess that means it'll
be due for maintenance in 2008 ;-) )

G


Re: slow realloc: alternate method?

Graham Toal
In reply to this post by Matthew R. Dempsky
> Growing your array by only a constant amount each iteration takes
> quadratic time.  By instead doubling the array size each time as
> necessary, you can reduce this to (amortized) linear time.  (I believe
> the man page's intention was to show how to avoid leaking memory, not
> how to write an efficient program.)

See 'makespace' in
http://www.gtoal.com/compilers101/tinc/02_Graham_Toal/teeny.c
for an example of a generic way to do this.

You call makespace each time you access your flex array with an index,
and it will make sure the array is large enough such that the
element with that index is present.  (You don't need to call it
if you know that the index is smaller than the largest index
you've already accessed, but for safety it might be good practice
to call it before every array access - the overhead isn't
too high)

You could probably use the preprocessor to hide it entirely,
and just access your array with a macro, eg

  i = tickdata(n);
rather than
  i = tickdata[n];

and have the macro first call makespace...
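A hypothetical re-creation of that pattern (this is a sketch in the
spirit of the teeny.c code linked above, not a copy of it; `tickdata`
is borrowed from the tick-data example earlier in the thread):

```c
#include <stdlib.h>
#include <string.h>

/* A makespace-style "flex array": guarantee that element idx exists
 * before it is used, growing by doubling and zero-filling new space. */
static int *store = NULL;
static size_t store_cap = 0;

static int *makespace(size_t idx)
{
    if (idx >= store_cap) {
        size_t newcap = store_cap ? store_cap : 64;
        while (newcap <= idx)
            newcap *= 2;
        int *tmp = realloc(store, newcap * sizeof(int));
        if (tmp == NULL)
            abort();            /* sketch: no graceful recovery */
        /* Zero the newly added region so fresh slots read as 0. */
        memset(tmp + store_cap, 0,
               (newcap - store_cap) * sizeof(int));
        store = tmp;
        store_cap = newcap;
    }
    return &store[idx];
}

/* Access the array through a macro so growth is invisible at the
 * call site, as suggested above. */
#define tickdata(n) (*makespace(n))
```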

G