fsync(2) and I/O errors

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

fsync(2) and I/O errors

Maximilian Lorlacks
Good day,

It seems that fsync(2) may data after returning EIO once[1].  This
behaviour seems to cause problems with databases such as PostgreSQL
and goes contrary to the man page's description, which says that
"fsync() and fdatasync() should be used by programs that require a
file to be in a known state, for example, in building a simple
transaction facility."

Are there any plans to change fsync(2) to allow for retries and/or
persistent error reporting?

[1] https://wiki.postgresql.org/wiki/Fsync_Errors

Thank you for your time.

Regards,

Maximilian Lorlacks

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Ted Unangst-6
Maximilian Lorlacks wrote:

> Good day,
>
> It seems that fsync(2) may data after returning EIO once[1].  This
> behaviour seems to cause problems with databases such as PostgreSQL
> and goes contrary to the man page's description, which says that
> "fsync() and fdatasync() should be used by programs that require a
> file to be in a known state, for example, in building a simple
> transaction facility."
>
> Are there any plans to change fsync(2) to allow for retries and/or
> persistent error reporting?

No, this looks like a bug, but I'm not sure it's getting fixed soon.

Keeping a buf with an error in the delayed write list would probably have some
serious consequences. When would we ever remove it?

For completeness, I'll also add that if the syncer gets to the bad buf before
fsync, you might miss the error, but that's only after 30 seconds or so.

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Ted Unangst-6
Ted Unangst wrote:

> Maximilian Lorlacks wrote:
> > Good day,
> >
> > It seems that fsync(2) may data after returning EIO once[1].  This
> > behaviour seems to cause problems with databases such as PostgreSQL
> > and goes contrary to the man page's description, which says that
> > "fsync() and fdatasync() should be used by programs that require a
> > file to be in a known state, for example, in building a simple
> > transaction facility."
> >
> > Are there any plans to change fsync(2) to allow for retries and/or
> > persistent error reporting?
>
> No, this looks like a bug, but I'm not sure it's getting fixed soon.
>
> Keeping a buf with an error in the delayed write list would probably have some
> serious consequences. When would we ever remove it?

Thought about this some more. The best approach may be to set a flag in
the vnode that there was an IO error, and return that for any following fsync.

Userland can retry by closing the file and opening again, which will clear the
flag. Annoying, but I don't what else to do. The kernel can't know what the
application wants to do. I believe that will be posix compliant as well.

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Ted Unangst-6
Ted Unangst wrote:
> > Keeping a buf with an error in the delayed write list would probably have some
> > serious consequences. When would we ever remove it?
>
> Thought about this some more. The best approach may be to set a flag in
> the vnode that there was an IO error, and return that for any following fsync.
>
> Userland can retry by closing the file and opening again, which will clear the
> flag. Annoying, but I don't what else to do. The kernel can't know what the
> application wants to do. I believe that will be posix compliant as well.

I think this works, but I don't have an easy way to test it.

FreeBSD redirties the buf in this case, but I don't want to go down that path.
There's a lot of code that seems to assume a buf that makes it here will be
clean. Trying to fix up the buf to be dirty again seems very likely to triger
some assertion or panic elsewhere.

Instead, we note that the write failed and mark a flag in the vnode. Future
calls to fsync will then return EIO when this flag is set. We clear the flag
when the vnode is released.


Index: kern/vfs_bio.c
===================================================================
RCS file: /home/cvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.187
diff -u -p -r1.187 vfs_bio.c
--- kern/vfs_bio.c 21 Nov 2018 16:14:43 -0000 1.187
+++ kern/vfs_bio.c 31 Jan 2019 21:19:36 -0000
@@ -867,6 +867,11 @@ brelse(struct buf *bp)
  /* If it's not cacheable, or an error, mark it invalid. */
  if (ISSET(bp->b_flags, (B_NOCACHE|B_ERROR)))
  SET(bp->b_flags, B_INVAL);
+ /* If it's a write error, also mark the vnode as damaged. */
+ if (ISSET(bp->b_flags, B_ERROR) && !ISSET(bp->b_flags, B_READ)) {
+ if (bp->b_vp && bp->b_vp->v_type == VREG)
+ SET(bp->b_vp->v_bioflag, VBIOERROR);
+ }
 
  if (ISSET(bp->b_flags, B_INVAL)) {
  /*
Index: kern/vfs_subr.c
===================================================================
RCS file: /home/cvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.285
diff -u -p -r1.285 vfs_subr.c
--- kern/vfs_subr.c 21 Jan 2019 18:09:21 -0000 1.285
+++ kern/vfs_subr.c 31 Jan 2019 21:09:22 -0000
@@ -712,6 +712,7 @@ vputonfreelist(struct vnode *vp)
 #endif
 
  vp->v_bioflag |= VBIOONFREELIST;
+ vp->v_bioflag &= ~VBIOERROR;
 
  if (vp->v_holdcnt > 0)
  lst = &vnode_hold_list;
Index: kern/vfs_vops.c
===================================================================
RCS file: /home/cvs/src/sys/kern/vfs_vops.c,v
retrieving revision 1.19
diff -u -p -r1.19 vfs_vops.c
--- kern/vfs_vops.c 21 Jun 2018 14:17:23 -0000 1.19
+++ kern/vfs_vops.c 31 Jan 2019 21:04:43 -0000
@@ -336,7 +336,7 @@ int
 VOP_FSYNC(struct vnode *vp, struct ucred *cred, int waitfor,
     struct proc *p)
 {
- int r;
+ int r, s;
  struct vop_fsync_args a;
  a.a_vp = vp;
  a.a_cred = cred;
@@ -351,6 +351,10 @@ VOP_FSYNC(struct vnode *vp, struct ucred
  vp->v_inflight++;
  r = (vp->v_op->vop_fsync)(&a);
  vp->v_inflight--;
+ s = splbio();
+ if (r == 0 && vp->v_bioflag & VBIOERROR)
+ r = EIO;
+ splx(s);
  return r;
 }
 
Index: sys/vnode.h
===================================================================
RCS file: /home/cvs/src/sys/sys/vnode.h,v
retrieving revision 1.149
diff -u -p -r1.149 vnode.h
--- sys/vnode.h 23 Dec 2018 10:46:51 -0000 1.149
+++ sys/vnode.h 31 Jan 2019 20:52:37 -0000
@@ -149,6 +149,7 @@ struct vnode {
 #define VBIOWAIT 0x0001 /* waiting for output to complete */
 #define VBIOONSYNCLIST 0x0002 /* Vnode is on syncer worklist */
 #define VBIOONFREELIST  0x0004  /* Vnode is on a free list */
+#define VBIOERROR 0x0008  /* A write failed */
 
 /*
  * Vnode attributes.  A field value of VNOVAL represents a field whose value

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Alexander Bluhm
On Thu, Jan 31, 2019 at 04:26:45PM -0500, Ted Unangst wrote:
> Instead, we note that the write failed and mark a flag in the vnode. Future
> calls to fsync will then return EIO when this flag is set. We clear the flag
> when the vnode is released.

Sounds reasonable.

OK bluhm@

> Index: kern/vfs_bio.c
> ===================================================================
> RCS file: /home/cvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 vfs_bio.c
> --- kern/vfs_bio.c 21 Nov 2018 16:14:43 -0000 1.187
> +++ kern/vfs_bio.c 31 Jan 2019 21:19:36 -0000
> @@ -867,6 +867,11 @@ brelse(struct buf *bp)
>   /* If it's not cacheable, or an error, mark it invalid. */
>   if (ISSET(bp->b_flags, (B_NOCACHE|B_ERROR)))
>   SET(bp->b_flags, B_INVAL);
> + /* If it's a write error, also mark the vnode as damaged. */
> + if (ISSET(bp->b_flags, B_ERROR) && !ISSET(bp->b_flags, B_READ)) {
> + if (bp->b_vp && bp->b_vp->v_type == VREG)
> + SET(bp->b_vp->v_bioflag, VBIOERROR);
> + }
>  
>   if (ISSET(bp->b_flags, B_INVAL)) {
>   /*
> Index: kern/vfs_subr.c
> ===================================================================
> RCS file: /home/cvs/src/sys/kern/vfs_subr.c,v
> retrieving revision 1.285
> diff -u -p -r1.285 vfs_subr.c
> --- kern/vfs_subr.c 21 Jan 2019 18:09:21 -0000 1.285
> +++ kern/vfs_subr.c 31 Jan 2019 21:09:22 -0000
> @@ -712,6 +712,7 @@ vputonfreelist(struct vnode *vp)
>  #endif
>  
>   vp->v_bioflag |= VBIOONFREELIST;
> + vp->v_bioflag &= ~VBIOERROR;
>  
>   if (vp->v_holdcnt > 0)
>   lst = &vnode_hold_list;
> Index: kern/vfs_vops.c
> ===================================================================
> RCS file: /home/cvs/src/sys/kern/vfs_vops.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 vfs_vops.c
> --- kern/vfs_vops.c 21 Jun 2018 14:17:23 -0000 1.19
> +++ kern/vfs_vops.c 31 Jan 2019 21:04:43 -0000
> @@ -336,7 +336,7 @@ int
>  VOP_FSYNC(struct vnode *vp, struct ucred *cred, int waitfor,
>      struct proc *p)
>  {
> - int r;
> + int r, s;
>   struct vop_fsync_args a;
>   a.a_vp = vp;
>   a.a_cred = cred;
> @@ -351,6 +351,10 @@ VOP_FSYNC(struct vnode *vp, struct ucred
>   vp->v_inflight++;
>   r = (vp->v_op->vop_fsync)(&a);
>   vp->v_inflight--;
> + s = splbio();
> + if (r == 0 && vp->v_bioflag & VBIOERROR)
> + r = EIO;
> + splx(s);
>   return r;
>  }
>  
> Index: sys/vnode.h
> ===================================================================
> RCS file: /home/cvs/src/sys/sys/vnode.h,v
> retrieving revision 1.149
> diff -u -p -r1.149 vnode.h
> --- sys/vnode.h 23 Dec 2018 10:46:51 -0000 1.149
> +++ sys/vnode.h 31 Jan 2019 20:52:37 -0000
> @@ -149,6 +149,7 @@ struct vnode {
>  #define VBIOWAIT 0x0001 /* waiting for output to complete */
>  #define VBIOONSYNCLIST 0x0002 /* Vnode is on syncer worklist */
>  #define VBIOONFREELIST  0x0004  /* Vnode is on a free list */
> +#define VBIOERROR 0x0008  /* A write failed */
>  
>  /*
>   * Vnode attributes.  A field value of VNOVAL represents a field whose value

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Maximilian Lorlacks

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, January 31, 2019 11:31 PM, Alexander Bluhm <[hidden email]> wrote:

> On Thu, Jan 31, 2019 at 04:26:45PM -0500, Ted Unangst wrote:
>
> > Instead, we note that the write failed and mark a flag in the vnode. Future
> > calls to fsync will then return EIO when this flag is set. We clear the flag
> > when the vnode is released.
>
> Sounds reasonable.
>
> OK bluhm@

People may object to errors being lost when the vnode is released,
as that would lose errors in a scenario like write -> close -> open
-> fsync.  I do not claim to know if anyone actually does that in the
wild, however.

If the above diff is accepted, it may be worth to also add the
following diff to fsync.2 to document the behavior:

diff --git lib/libc/sys/fsync.2 lib/libc/sys/fsync.2
index c9831ca09..5ee765986 100644
--- lib/libc/sys/fsync.2
+++ lib/libc/sys/fsync.2
@@ -66,6 +66,19 @@ and
 .Fn fdatasync
 should be used by programs that require a file to be in a known state,
 for example, in building a simple transaction facility.
+.Pp
+If
+.Fn fsync
+or
+.Fn fdatasync
+fails with
+.Er EIO ,
+the state of the on-disk data may only have been partially written.
+Future attempts to call these functions will continue failing with
+.Er EIO
+until the all copies of the underlying
+.Fa fd
+have been closed.
 .Sh RETURN VALUES
 .Rv -std fsync fdatasync
 .Sh ERRORS

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Abel Abraham Camarillo Ojeda-2
On Saturday, February 16, 2019, Maximilian Lorlacks <
[hidden email]> wrote:

>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Thursday, January 31, 2019 11:31 PM, Alexander Bluhm <
> [hidden email]> wrote:
>
> > On Thu, Jan 31, 2019 at 04:26:45PM -0500, Ted Unangst wrote:
> >
> > > Instead, we note that the write failed and mark a flag in the vnode.
> Future
> > > calls to fsync will then return EIO when this flag is set. We clear
> the flag
> > > when the vnode is released.
> >
> > Sounds reasonable.
> >
> > OK bluhm@
>
> People may object to errors being lost when the vnode is released,
> as that would lose errors in a scenario like write -> close -> open
> -> fsync.  I do not claim to know if anyone actually does that in the
> wild, however.


Sorry for my incomplete comment, but I remember a lengthy discussion in the
postgres mailing list about fsync, retries of it, loss of state and such,
they were very concerned about openbsd behavior I think...

Will try to find that thread ...


>
> If the above diff is accepted, it may be worth to also add the
> following diff to fsync.2 to document the behavior:
>
> diff --git lib/libc/sys/fsync.2 lib/libc/sys/fsync.2
> index c9831ca09..5ee765986 100644
> --- lib/libc/sys/fsync.2
> +++ lib/libc/sys/fsync.2
> @@ -66,6 +66,19 @@ and
>  .Fn fdatasync
>  should be used by programs that require a file to be in a known state,
>  for example, in building a simple transaction facility.
> +.Pp
> +If
> +.Fn fsync
> +or
> +.Fn fdatasync
> +fails with
> +.Er EIO ,
> +the state of the on-disk data may only have been partially written.
> +Future attempts to call these functions will continue failing with
> +.Er EIO
> +until the all copies of the underlying
> +.Fa fd
> +have been closed.
>  .Sh RETURN VALUES
>  .Rv -std fsync fdatasync
>  .Sh ERRORS
>
>
Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Abel Abraham Camarillo Ojeda-2
On Saturday, February 16, 2019, Abel Abraham Camarillo Ojeda <
[hidden email]> wrote:

>
>
> On Saturday, February 16, 2019, Maximilian Lorlacks <
> [hidden email]> wrote:
>
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Thursday, January 31, 2019 11:31 PM, Alexander Bluhm <
>> [hidden email]> wrote:
>>
>> > On Thu, Jan 31, 2019 at 04:26:45PM -0500, Ted Unangst wrote:
>> >
>> > > Instead, we note that the write failed and mark a flag in the vnode.
>> Future
>> > > calls to fsync will then return EIO when this flag is set. We clear
>> the flag
>> > > when the vnode is released.
>> >
>> > Sounds reasonable.
>> >
>> > OK bluhm@
>>
>> People may object to errors being lost when the vnode is released,
>> as that would lose errors in a scenario like write -> close -> open
>> -> fsync.  I do not claim to know if anyone actually does that in the
>> wild, however.
>
>
> Sorry for my incomplete comment, but I remember a lengthy discussion in
> the postgres mailing list about fsync, retries of it, loss of state and
> such, they were very concerned about openbsd behavior I think...
>
> Will try to find that thread ...
>

Is this related?

https://lwn.net/Articles/752063/

Again, sorry if it's noise



>
>
>>
>> If the above diff is accepted, it may be worth to also add the
>> following diff to fsync.2 to document the behavior:
>>
>> diff --git lib/libc/sys/fsync.2 lib/libc/sys/fsync.2
>> index c9831ca09..5ee765986 100644
>> --- lib/libc/sys/fsync.2
>> +++ lib/libc/sys/fsync.2
>> @@ -66,6 +66,19 @@ and
>>  .Fn fdatasync
>>  should be used by programs that require a file to be in a known state,
>>  for example, in building a simple transaction facility.
>> +.Pp
>> +If
>> +.Fn fsync
>> +or
>> +.Fn fdatasync
>> +fails with
>> +.Er EIO ,
>> +the state of the on-disk data may only have been partially written.
>> +Future attempts to call these functions will continue failing with
>> +.Er EIO
>> +until the all copies of the underlying
>> +.Fa fd
>> +have been closed.
>>  .Sh RETURN VALUES
>>  .Rv -std fsync fdatasync
>>  .Sh ERRORS
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Stefan Sperling-5
In reply to this post by Maximilian Lorlacks
On Sat, Feb 16, 2019 at 07:40:42AM +0000, Maximilian Lorlacks wrote:
> People may object to errors being lost when the vnode is released,
> as that would lose errors in a scenario like write -> close -> open
> -> fsync.

Is there any guarantee an application will be operating on the same
underlying file, or filesystem, or even physical disk, after close()
followed by open()? I don't see how that could be a valid assumption.

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Maximilian Lorlacks
In reply to this post by Maximilian Lorlacks



Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, February 16, 2019 7:40 AM, Maximilian Lorlacks <[hidden email]> wrote:

> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Thursday, January 31, 2019 11:31 PM, Alexander Bluhm [hidden email] wrote:
>
> > On Thu, Jan 31, 2019 at 04:26:45PM -0500, Ted Unangst wrote:
> >
> > > Instead, we note that the write failed and mark a flag in the vnode. Future
> > > calls to fsync will then return EIO when this flag is set. We clear the flag
> > > when the vnode is released.
> >
> > Sounds reasonable.
> > OK bluhm@
>
> People may object to errors being lost when the vnode is released,
> as that would lose errors in a scenario like write -> close -> open
> -> fsync. I do not claim to know if anyone actually does that in the
> wild, however.
>
> If the above diff is accepted, it may be worth to also add the
> following diff to fsync.2 to document the behavior:
>
> diff --git lib/libc/sys/fsync.2 lib/libc/sys/fsync.2
> index c9831ca09..5ee765986 100644
> --- lib/libc/sys/fsync.2
> +++ lib/libc/sys/fsync.2
> @@ -66,6 +66,19 @@ and
> .Fn fdatasync
> should be used by programs that require a file to be in a known state,
> for example, in building a simple transaction facility.
> +.Pp
> +If
> +.Fn fsync
> +or
> +.Fn fdatasync
> +fails with
> +.Er EIO ,
> +the state of the on-disk data may only have been partially written.
> +Future attempts to call these functions will continue failing with
> +.Er EIO
> +until the all copies of the underlying
> +.Fa fd
> +have been closed.
> .Sh RETURN VALUES
> .Rv -std fsync fdatasync
> .Sh ERRORS

ping for man diff
(Sorry, Alexander Bluhm, I accidentally forgot to send to list)

Reply | Threaded
Open this post in threaded view
|

Re: fsync(2) and I/O errors

Ted Unangst-6
Oh, right, I reworded it slightly, but I think this is something we should
note.


Index: fsync.2
===================================================================
RCS file: /home/cvs/src/lib/libc/sys/fsync.2,v
retrieving revision 1.14
diff -u -p -r1.14 fsync.2
--- fsync.2 10 Sep 2015 17:55:21 -0000 1.14
+++ fsync.2 16 Apr 2019 20:18:03 -0000
@@ -66,6 +66,16 @@ and
 .Fn fdatasync
 should be used by programs that require a file to be in a known state,
 for example, in building a simple transaction facility.
+.Pp
+If
+.Fn fsync
+or
+.Fn fdatasync
+fails with
+.Er EIO ,
+the state of the on-disk data may have been only partially written.
+To guard against potential inconsistency, future calls will continue failing
+until all references to the file are closed.
 .Sh RETURN VALUES
 .Rv -std fsync fdatasync
 .Sh ERRORS