Fwd: Re: I found a sort bug! - How to sort big files?


Fwd: Re: I found a sort bug! - How to sort big files?

sort problem
Whoops. At least I thought it helped. The default sort with "-H" ran for 132 minutes, then said: no space left on /home (which had 111 GB free before the sort command). And btw, df reported free space as "-18 GB", 104%.. what? Some kind of reserved space for root?


Why does it take more than 111 GB to "sort -u" ~600 MB of files? This is nonsense.


So the default "sort" command is a big pile of shit when it comes to files bigger than 60 MB? .. lol

I can send the ~600 MB txt files compressed if needed...

I was surprised... sort is a very old command..


-------- Original Message --------
From: "sort problem" <[hidden email]>
To: [hidden email]
Cc: [hidden email]
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 08:39:55 -0400

o.m.g. It works.

Why doesn't sort use this by default on files larger than 60 MB?

Thanks!

-------- Original Message --------
From: Andreas Zeilmeier <[hidden email]>
Apparently from: [hidden email]
To: [hidden email]
Subject: Re: I found a sort bug! - How to sort big files?
Date: Sat, 14 Mar 2015 13:16:05 +0100

> On 03/14/15 12:49, sort problem wrote:
> > Hello,
> >
> > ----------
> > # uname -a
> > OpenBSD notebook.lan 5.6 GENERIC.MP#333 amd64
> > #
> > # du -sh small/
> > 663M    small/
> > # ls -lah small/*.txt | wc -l
> >       43
> > #
> > # cd small
> > # ulimit -n
> > 10000000
> > # sysctl | grep -i maxfiles
> > kern.maxfiles=1000000000
> > #
> > # grep open /etc/login.conf
> >         :openfiles-cur=100000:\
> >         :openfiles-cur=1280000:\
> >         :openfiles-cur=512:\
> > #
> > # sort -u *.txt -o out
> > Segmentation fault (core dumped)
> > #
> > ----------
> >
> > This happens after about a minute of running.. The txt files contain UTF-8 characters too. Lines are at most a few tens of characters long. All the txt files have UNIX line endings. There is enough storage, enough RAM, enough CPU. I'm even trying this as the root user. The txt files total about ~60 000 000 lines.. not a big number... a reboot didn't help.
> >
> >
> >
> > Any ideas on how I can use the "sort" command to actually sort? Please help!
> >
> >
> >
> > Thanks,
> >
> > btw, this happens on other UNIX OSes too, lol... why do we have the sort command if it doesn't work?
> >
>
> Hi,
>
> Have you tried the option '-H'?
> The manpage suggests it for files > 60MB.
>
>
> Regards,
>
> Andi


Re: I found a sort bug! - How to sort big files?

Steve Litt
On Sun, 15 Mar 2015 09:53:34 -0400
"sort problem" <[hidden email]> wrote:

> Whoops. At least I thought it helped. The default sort with "-H"
> ran for 132 minutes, then said: no space left on /home (which had
> 111 GB free before the sort command).

That's not surprising. -H implements a merge sort, meaning the input is
split into lots and lots of files, each of which is again split into lots
of files, etc. It wouldn't surprise me to see a 60-megaline file consume
a huge multiple of its own size during a merge sort.

And of course, the algorithm might be swapping.

> And btw, df reported free
> space as "-18 GB", 104%.. what? Some kind of reserved space
> for root?
>
>
> Why does it take more than 111 GB to "sort -u" ~600 MB of
> files? This is nonsense.
>
>
> So the default "sort" command is a big pile of shit when it comes to
> files bigger than 60 MB? .. lol

That doesn't surprise me. You originally said you have 60 million
lines. Sorting 60 million items is a difficult task for any algorithm.
You don't say how long each line is, or what they contain, or whether
they're all the same line length.

How would *you* sort so many items, and sort them in a fast yet generic
way? I mean, if RAM and disk space are at a premium, you could always
use a bubble sort, and in-place sort your array in a year or two.

If I were in your shoes, I'd write my own sort routine for the task,
perhaps using qsort() (see
http://calmerthanyouare.org/2013/05/31/qsort-shootout.html). If there's
a way to convert each line's contents into a number reflecting its
alphabetical order, you could even qsort() in RAM, given enough of it,
and then, as a last step, run through the sorted list of numbers and
their line numbers and write out the original lines in that order.
There are probably a thousand other ways to do it.

But IMHO, sorting 60 megalines isn't something I would expect a generic
sort command to handle quickly out of the box.

SteveT

Steve Litt                *  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance


Re: I found a sort bug! - How to sort big files?

Kenneth Gober
In reply to this post by sort problem
I don't know why sort is giving you such problems. There may be something
unusual about your specific input that it wasn't designed to handle (or it
might simply be a latent bug that has never been identified and fixed).

When I need to sort large files, I split(1) them into smaller pieces, then
sort(1) the pieces individually, then use sort(1) (with the -m option) to
merge the sorted pieces into a single large result file. This has always
worked reliably for me (and because I was raised using 8-bit and 16-bit
computers I don't have any special expectations that programs should "just
work" when given very large inputs).

Even if you think doing all this is too much bother, try doing it just
once. You might be able to identify a specific chunk of your input that's
causing the problem, which will help move us all towards a proper solution
(or at least a caveat in the man page).

-ken

On Sun, Mar 15, 2015 at 9:53 AM, sort problem <[hidden email]>
wrote:

> Whoops. At least I thought it helped. The default sort with "-H" ran
> for 132 minutes, then said: no space left on /home (which had 111 GB
> free before the sort command). And btw, df reported free space as
> "-18 GB", 104%.. what? Some kind of reserved space for root?
>
> [rest of the quoted thread trimmed; it repeats the messages above]


Re: Fwd: Re: I found a sort bug! - How to sort big files?

Ted Unangst
In reply to this post by sort problem
sort problem wrote:
> So the default "sort" command is a big pile of shit when it comes to files bigger than 60 MB? .. lol
>
> I can send the ~600 MB txt files compressed if needed...
>
> I was surprised... sort is a very old command..

I think you have discovered the answer. :(


Re: Fwd: Re: I found a sort bug! - How to sort big files?

Stuart Henderson
In reply to this post by sort problem
On 2015-03-15, sort problem <[hidden email]> wrote:
> So the default "sort" command is a big pile of shit when it comes to files bigger than 60 MB? .. lol

It's probably not the size, but rather the contents of the files.


Re: I found a sort bug! - How to sort big files?

Worik Stanton
In reply to this post by Steve Litt
On 16/03/15 06:43, Steve Litt wrote:
> But IMHO, sorting 60 megalines isn't something I would expect a
> generic sort command to handle quickly out of the box.

I would.  These days such files are getting more and more common.

But there is a warning in the man page for sort under "BUGS":

     "To sort files larger than 60MB, use sort -H; files larger than
704MB must be sorted in smaller pieces, then merged."

So it seems there is a bug in... "files larger than 60MB, use sort -H"
since that did not work for the OP.

Worik
--
Why is the legal status of chardonnay different to that of cannabis?
       [hidden email] 021-1680650, (03) 4821804
                          Aotearoa (New Zealand)
                             I voted for love


Re: I found a sort bug! - How to sort big files?

Steve Litt
On Tue, 17 Mar 2015 08:58:56 +1300
worik <[hidden email]> wrote:

> On 16/03/15 06:43, Steve Litt wrote:
> > But IMHO, sorting 60 megalines isn't something I would expect a
> > generic sort command to handle quickly out of the box.
>
> I would.  These days such files are getting more and more common.
>
> But there is a warning in the man page for sort under "BUGS":
>
>      "To sort files larger than 60MB, use sort -H; files larger than
> 704MB must be sorted in smaller pieces, then merged."
>
> So it seems there is a bug in... "files larger than 60MB, use sort -H"
> since that did not work for the OP.
>
> Worik

Oh, jeez, you put your finger *right* on the problem, Worik. Both I and
the OP read the manpage wrong: sort -H won't work for extremely big
files (more than 704MB). But there's a fairly easy solution...

The average line length can be found with wc and a little division. Then
figure out how many lines make roughly a 10MB file, and use split -l to
break the input into smaller files of that many lines each. Then sort
each of those files, with no arguments, and finally use sort -m to merge
them all back together again into one sorted file.
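
Something like this untested sketch, where the file names and the 10MB
target are just placeholders:

  #!/bin/sh
  # average line length = total bytes / total lines
  lines=$(wc -l < big.txt)
  bytes=$(wc -c < big.txt)
  avg=$((bytes / lines))
  # how many lines add up to roughly 10MB per piece
  per=$((10485760 / avg))
  # split into pieces, sort each piece in place, then merge the sorted pieces
  split -l "$per" big.txt piece.
  for f in piece.*; do sort -o "$f" "$f"; done
  sort -m -o big.sorted piece.*
  rm piece.*

(sort -o is documented to allow the output file to be the same as the
input file, which is why the in-place sort of each piece is safe.)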

According to the man page, the preceding should work just fine, and it
can pretty much be automated with a simple shellscript, so you can set
it to run and have it work while you do other things.

SteveT

Steve Litt                *  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance