pkg_add vs python

Marc Espie
sthen@ finally figured out most of the problem.

Turns out python is stupid enough to store path+timestamp in its compiled
*.pyc files  to know when to recompile.

Our package system doesn't look too closely at timestamps... so this explains
how *.pyc files sometimes get rebuilt, and then throw errors during pkg_delete.
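
To make the "path+timestamp" point concrete, here is a minimal sketch that
compiles a throwaway module and reads back what ends up in the .pyc. It
assumes a current CPython (3.7 or later), where a timestamp-based .pyc starts
with a 16-byte header: magic, flags, source mtime, source size. The
interpreters of 2012 used a shorter header (magic plus mtime only), so the
offsets below do not apply to them. The names mod.py and read_pyc are made up
for the illustration.

```python
import marshal
import os
import py_compile
import struct
import tempfile


def read_pyc(pyc_path):
    """Return (flags, mtime, size, co_filename) stored in a CPython 3.7+ .pyc."""
    with open(pyc_path, "rb") as f:
        # Skip the 4-byte magic, then read three little-endian 32-bit fields.
        flags, mtime, size = struct.unpack("<4xIII", f.read(16))
        # The rest of the file is a marshalled code object; it carries the path.
        code = marshal.load(f)
    return flags, mtime, size, code.co_filename


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")   # hypothetical module
    pyc = os.path.join(tmp, "mod.pyc")
    with open(src, "wb") as f:
        f.write(b"x = 1\n")
    # Ask explicitly for the classic timestamp-based invalidation scheme.
    py_compile.compile(
        src, cfile=pyc, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    source_mtime = int(os.stat(src).st_mtime) & 0xFFFFFFFF
    flags, recorded_mtime, recorded_size, embedded_path = read_pyc(pyc)

print("flags:", flags)
print("mtime recorded in .pyc:", recorded_mtime, "source mtime:", source_mtime)
print("size recorded in .pyc:", recorded_size)
print("path recorded in .pyc:", embedded_path)
```

If anything later changes the mtime of mod.py without rewriting mod.pyc, the
recorded value stops matching and the interpreter treats the cache as stale.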

The following patch is a work-around for the issue. It stops pkg_add from
tying python files (tying means recognizing identical files and not
re-extracting them), which preserves their timestamps better, at the cost of
making python packages more painful to extract.

Note that this is actually a large problem.  There is work being done on
speeding up updates, and if we have to store/restore timestamps along with
other stuff, this has a cost (more code and bigger packages)! So it would
be WAY better if python stopped doing that, which probably requires working
with upstream to make them aware of the issue.


In the meantime, people experiencing python-package deleting issues should
use the following patch. It's a large sledge-hammer, but it should make
the problem disappear, and it will probably be committed soon...

Index: OpenBSD/PkgAdd.pm
===================================================================
RCS file: /home/openbsd/cvs/src/usr.sbin/pkg_add/OpenBSD/PkgAdd.pm,v
retrieving revision 1.34
diff -u -p -r1.34 PkgAdd.pm
--- OpenBSD/PkgAdd.pm 28 Apr 2012 12:00:10 -0000 1.34
+++ OpenBSD/PkgAdd.pm 2 Nov 2012 12:12:56 -0000
@@ -84,6 +84,8 @@ sub tie_files
 {
  my ($self, $sha, $state) = @_;
  return if $self->{link} or $self->{symlink} or $self->{nochecksum};
+ # XXX python doesn't like this, overreliance on timestamps
+ return if $self->{name} =~ m/\.py$/;
  if (defined $sha->{$self->{d}->key}) {
  my $tied = $sha->{$self->{d}->key};
  # don't tie if there's a problem with the file

Re: pkg_add vs python

Laurence Tratt
On Mon, Nov 05, 2012 at 02:15:53PM +0100, Marc Espie wrote:

> Turns out python is stupid enough to store path+timestamp in its compiled
> *.pyc files  to know when to recompile.

The "auto-recompile everything which is out of date" feature is ingenious but
there are at least two different ways of implementing it. One way is to check
the timestamps on X.py and X.pyc; if the latter is older than the former,
recompile, and (try to) cache the output to X.pyc. [This is how Converge
works, if anyone cares.] Judging by your comment, Python stores the timestamp
in the file itself (probably for portability/performance reasons) rather than
checking the timestamp from the filesystem. [I can't remember off-hand, nor
can I remember how PyPy does it either.]
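
The difference between the two schemes can be sketched as follows. This is
only an illustration under assumptions: it targets CPython 3.7 or later
(where the recorded mtime sits at byte offset 8 of the .pyc), and the helper
names stale_by_filesystem and stale_by_embedded_mtime are invented here, not
taken from any real implementation. The example compiles a module, then moves
the source's mtime one hour into the past, roughly what happens when a
package tool re-extracts or skips a file without preserving its timestamp.

```python
import os
import py_compile
import struct
import tempfile


def stale_by_filesystem(src, cache):
    """Scheme 1: the cache is stale when it is older than the source."""
    return os.stat(cache).st_mtime < os.stat(src).st_mtime


def stale_by_embedded_mtime(src, pyc):
    """Scheme 2: stale when the mtime recorded in the .pyc differs from the source's."""
    with open(pyc, "rb") as f:
        # 4-byte magic (skipped), 4-byte flags, 4-byte recorded source mtime.
        _flags, recorded = struct.unpack("<4xII", f.read(12))
    return recorded != (int(os.stat(src).st_mtime) & 0xFFFFFFFF)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")   # hypothetical module
    pyc = os.path.join(tmp, "mod.pyc")
    with open(src, "wb") as f:
        f.write(b"x = 1\n")
    # Pin the source mtime so the example is deterministic.
    os.utime(src, (1_000_000_000, 1_000_000_000))
    py_compile.compile(
        src, cfile=pyc, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    before = (stale_by_filesystem(src, pyc), stale_by_embedded_mtime(src, pyc))

    # Same content, different (older) timestamp on the source.
    os.utime(src, (1_000_000_000 - 3600, 1_000_000_000 - 3600))
    after = (stale_by_filesystem(src, pyc), stale_by_embedded_mtime(src, pyc))

print("before timestamp change:", before)
print("after timestamp change: ", after)
```

The filesystem comparison only cares that the cache is not older than the
source, so it tolerates the change. The embedded-timestamp check wants an
exact match, so any change to the source's mtime forces a recompile.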

I assume the only people who are having pkg_delete problems are running a
Python program as root? If so, in one sense, they're the lucky ones. The
unlucky ones would then be those running as non-root where the "is X.py newer
than X.pyc" check fails, forcing a recompilation, but which can't then save
out a cached file. So every time they run Python they'll be paying a
recompilation cost for such files.

The question this raises in my mind is the following. During an update, can
the timestamp of X.py be updated, but not X.pyc?

If so, then I think that would be a problem that pkg_update needs to fix:
timestamps (particularly the relative timestamps of files i.e. "X is older
than Y") are an important piece of meta-data.

If not, then it might be possible to make a case to the Python developers
that rather than storing timestamps in the pyc file, they should simply read
it from the filesystem. That would mean that both X.py and X.pyc could be
updated, but providing X.pyc is newer than X.py, Python would not try to
write out a cached file.


Laurie
--
Personal                                             http://tratt.net/laurie/
The Converge programming language                      http://convergepl.org/
   https://github.com/ltratt              http://twitter.com/laurencetratt

Re: pkg_add vs python

Marc Espie
On Wed, Nov 07, 2012 at 11:05:23AM +0000, Laurence Tratt wrote:

> On Mon, Nov 05, 2012 at 02:15:53PM +0100, Marc Espie wrote:
>
> > Turns out python is stupid enough to store path+timestamp in its compiled
> > *.pyc files  to know when to recompile.
>
> The "auto-recompile everything which is out of date" feature is ingenious but
> there are at least two different ways of implementing it. One way is to check
> the timestamps on X.py and X.pyc; if the latter is older than the former,
> recompile, and (try to) cache the output to X.pyc. [This is how Converge
> works, if anyone cares.] Judging by your comment, Python stores the timestamp
> in the file itself (probably for portability/performance reasons) rather than
> checking the timestamp from the filesystem. [I can't remember off-hand, nor
> can I remember how PyPy does it either.]
>
> I assume the only people who are having pkg_delete problems are running a
> Python program as root? If so, in one sense, they're the lucky ones. The
> unlucky ones would then be those running as non-root where the "is X.py newer
> than X.pyc" check fails, forcing a recompilation, but which can't then save
> out a cached file. So every time they run Python they'll be paying a
> recompilation cost for such files.
>
> The question this raises in my mind is the following. During an update, can
> the timestamp of X.py be updated, but not X.pyc?
>
> If so, then I think that would be a problem that pkg_update needs to fix:
> timestamps (particularly the relative timestamps of files i.e. "X is older
> than Y") are an important piece of meta-data.
>
> If not, then it might be possible to make a case to the Python developers
> that rather than storing timestamps in the pyc file, they should simply read
> it from the filesystem. That would mean that both X.py and X.pyc could be
> updated, but providing X.pyc is newer than X.py, Python would not try to
> write out a cached file.

Actually, we discussed a possible approach. It looks reasonable to have
a "compile as package" mode (say through an env variable for instance) that
would disable the check and leave a mark in the compiled file that says the
check is not relevant.

It's as simple as 'store timestamp as 0000 in the .pyc file'...
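
For comparison, CPython releases from 3.7 on (PEP 552) offer something close
to this idea: a .pyc can be written in "unchecked hash" mode, where a flags
field in the header marks the file as not to be validated against its source.
The sketch below assumes such an interpreter, which did not exist when this
thread was written; mod.py is a made-up file name.

```python
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")   # hypothetical module
    pyc = os.path.join(tmp, "mod.pyc")
    with open(src, "wb") as f:
        f.write(b"x = 1\n")
    # Write a .pyc that tells the importer not to validate it against the source.
    py_compile.compile(
        src, cfile=pyc, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )
    with open(pyc, "rb") as f:
        # 4-byte magic (skipped), then the 32-bit flags field.
        (flags,) = struct.unpack("<4xI", f.read(8))

hash_based = bool(flags & 0b01)    # bit 0: header holds a source hash, not an mtime
check_source = bool(flags & 0b10)  # bit 1: validate the hash against the source

print("hash-based:", hash_based, "check source:", check_source)
```

A file written this way contains no source mtime at all, so a package tool is
free to fudge timestamps on the .py without triggering a rebuild.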

Apparently, there are other tendrils elsewhere that make this a bit more
complicated, but for a package system, requiring timestamps to be consistent
makes things MORE brittle, in the presence of NFS, clock-skew, and various
other issues. For instance, one prominent member of the project does not
believe in clock synchronization and does builds over NFS on machines with
wildly inaccurate clocks...

So I'm rather happy the pkg* system fudges some timestamps already (linked
to the concept of tie()ing files, which allows some speed/space optimization
at the expense of some irrelevant metadata)...

Re: pkg_add vs python

Laurence Tratt
On Wed, Nov 07, 2012 at 01:27:33PM +0100, Marc Espie wrote:

> Actually, we discussed a possible approach. It looks reasonable to have a
> "compile as package" mode (say through an env variable for instance) that
> would disable the check and leave a mark in the compiled file that says the
> check is not relevant.
>
> It's as simple as 'store timestamp as 0000 in the .pyc file'...

I guess something like this could fix lang/python, but my worry is that other
packages (e.g. possibly some of the other dynamically typed languages in ports)
might be relying on timestamps. We probably don't notice pkg_delete problems
unless they run as root; there may still be performance problems (as it would
seem there currently are with lang/python). For example, if we ever get
lang/pypy into ports (it languishes in openbsd-wip...), we'd need to do a
similar hack as above for it.

> Apparently, there are other tendrils elsewhere that make this a bit more
> complicated, but for a package system, requiring timestamps to be
> consistent makes things MORE brittle, in the presence of NFS, clock-skew,
> and various other issues. For instance, one prominent member of the project
> does not believe in clock synchronization and does builds over NFS on
> machines with wildly inaccurate clocks...

The problems I'm thinking about only care about relative timestamps within a
package (i.e. comparing the two timestamps within a package for "which file
is newer?"). That's a much less stringent requirement, and may even cope well
enough with builds from clock-sync deniers?


Laurie