New scheduler for OpenBSD


Michal Mazurek-2
Gregor Best attempted to improve the scheduler in 2011:
http://comments.gmane.org/gmane.os.openbsd.tech/27059
Here is another attempt; it picks up where the previous one left off.

This is also mostly based on the main idea behind Linux CFS and
BFS; I found BFS to be described more clearly:
http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch
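
To illustrate the core idea (a simplified sketch with made-up names, not
code from the diff): every runnable thread gets stamped with a virtual
deadline when it is queued, and the scheduler always runs the runnable
thread whose deadline is earliest.

/* Sketch only: earliest-virtual-deadline-first selection. */
struct thread {
	unsigned long long	deadline;	/* stamped when the thread is queued */
	struct thread		*next;
};

static struct thread *
pick_next(struct thread *runq)
{
	struct thread *t, *best = NULL;

	/* the runnable thread with the earliest deadline runs next */
	for (t = runq; t != NULL; t = t->next)
		if (best == NULL || t->deadline < best->deadline)
			best = t;
	return (best);
}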


Some notes:

Chrome is still not very usable.

Much more work is needed, e.g. there is some MD code on sparc64 and
alpha that depends on spc_schedticks, which needs to be understood and
rewritten.

Maybe using RB trees to queue what is usually no more than 5 elements
is overkill.
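
For comparison, a run queue kept as a plain list sorted by deadline would
also do at this scale; here is a minimal sketch with simplified, made-up
types (insertion scans a handful of entries, taking the head is O(1)):

#include <sys/queue.h>

struct qproc {
	unsigned long long	deadline;
	TAILQ_ENTRY(qproc)	runq;
};
TAILQ_HEAD(runqueue, qproc);

/* Insert p before the first entry with a later deadline. */
static void
runq_insert(struct runqueue *rq, struct qproc *p)
{
	struct qproc *q;

	TAILQ_FOREACH(q, rq, runq)
		if (p->deadline < q->deadline)
			break;
	if (q != NULL)
		TAILQ_INSERT_BEFORE(q, p, runq);
	else
		TAILQ_INSERT_TAIL(rq, p, runq);
}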

p_usrpri and p_priority will go away, so userland utilities like 'ps'
will need to be changed.

I also want to try and see whether implementing one shared queue,
instead of keeping one queue per CPU, will improve performance even
further. Right now there are some heuristics to determine whether a
process should switch CPUs (sketched below). This doesn't work very
well yet: in my tests with the attached code, one queue was sometimes a
second behind another. From what I understand, a single shared queue is
the idea behind BFS and the reason why it doesn't scale to 4096 CPUs. I
see that OpenBSD supports at most 256 CPUs on sparc64:
./arch/sparc64/include/cpu.h:#define MAXCPUS    256
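
The switching heuristic boils down to a per-CPU cost estimate; here is a
simplified sketch of the deadline part of sched_proc_to_cpu_cost() from
the diff below (the weights are passed in only to keep the sketch
self-contained):

/*
 * Sketch: penalize running p on a busy CPU in proportion to the gap
 * between p's deadline and the deadline of what that CPU is running,
 * roughly one priority-cost unit per 100ms of difference.  As in the
 * diff, the subtraction assumes p's deadline is the later one.
 */
static int
deadline_cost(unsigned long long p_deadline, unsigned long long cpu_deadline,
    int cost_priority, int cost_runnable)
{
	int cost = 0;

	cost += ((p_deadline - cpu_deadline) / 100000000ULL) * cost_priority;
	cost += cost_runnable;

	return (cost);
}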

Nice is not yet supported, but it's easy to add (uncomment and modify
'offset' in sched_deadline()). I didn't add it yet so I could test just
how fair this method is: currently all processes are equal.
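
A rough sketch of one possibility, using the nice_mul[] table from the
diff below (the exact scaling is only an illustration, not something the
diff already commits to):

extern int nice_mul[];	/* 10% per nice level, scaled by 1024; see sched_bsd.c below */

/* Sketch: a possible per-process offset for sched_deadline(). */
static unsigned long long
nice_offset(int ps_nice)	/* kernel nice value, assumed 0..39 */
{
	const unsigned long long base = 100ULL * 1000 * 1000;	/* 100ms in ns */

	return (base * nice_mul[ps_nice]) / 1024;
}

sched_deadline() would then add this offset to 'niffies', so nicer
processes get later deadlines and run after less nice ones.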

Other operating systems provide a couple of different schedulers to
choose from; this diff replaces the 4bsd scheduler with the new code
and leaves no choice.

With the new code I got some visible improvements when compiling the
kernel:
old: 1m46.74s real     5m08.48s user     1m27.37s system
old: 1m46.61s real     5m06.93s user     1m28.13s system
old: 1m49.91s real     5m19.71s user     1m28.25s system
new: 1m38.05s real     5m12.16s user     0m55.86s system
new: 1m38.70s real     5m12.36s user     0m57.82s system
new: 1m39.10s real     5m16.06s user     0m56.91s system

I would be interested in hearing if the new code broke something for
anyone.

Is there a chance for a scheduler rewrite like this to be committed?


Index: sys/kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.248
diff -u -p -r1.248 init_main.c
--- sys/kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
+++ sys/kern/init_main.c 12 Mar 2016 15:48:36 -0000
@@ -265,6 +265,8 @@ main(void *framep)
  */
  pr = &process0;
  process_initialize(pr, p);
+ p->p_deadline = 0;
+ p->p_rrticks = 10;
 
  LIST_INSERT_HEAD(&allprocess, pr, ps_list);
  atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
@@ -554,6 +556,7 @@ main(void *framep)
         /*
          * proc0: nothing to do, back to sleep
          */
+ printf("*** modified scheduler ***\n");
         while (1)
                 tsleep(&proc0, PVM, "scheduler", 0);
  /* NOTREACHED */
Index: sys/kern/kern_clock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.88
diff -u -p -r1.88 kern_clock.c
--- sys/kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
+++ sys/kern/kern_clock.c 12 Mar 2016 15:48:36 -0000
@@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
  if (stathz == 0)
  statclock(frame);
 
- if (--ci->ci_schedstate.spc_rrticks <= 0)
+ if (p && (--(p->p_rrticks) <= 0))
  roundrobin(ci);
 
  /*
@@ -398,15 +398,7 @@ statclock(struct clockframe *frame)
 
  if (p != NULL) {
  p->p_cpticks++;
- /*
- * If no schedclock is provided, call it here at ~~12-25 Hz;
- * ~~16 Hz is best
- */
- if (schedhz == 0) {
- if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
-    0)
- schedclock(p);
- }
+ ++curcpu()->ci_schedstate.spc_schedticks;
  }
 }
 
Index: sys/kern/kern_fork.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.184
diff -u -p -r1.184 kern_fork.c
--- sys/kern/kern_fork.c 9 Oct 2015 01:10:27 -0000 1.184
+++ sys/kern/kern_fork.c 12 Mar 2016 15:48:36 -0000
@@ -498,6 +498,7 @@ fork1(struct proc *curp, int flags, void
  /*
  * Make child runnable and add to run queue.
  */
+ sched_deadline(p);
  if ((flags & FORK_IDLE) == 0) {
  SCHED_LOCK(s);
  p->p_stat = SRUN;
Index: sys/kern/kern_resource.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_resource.c,v
retrieving revision 1.55
diff -u -p -r1.55 kern_resource.c
--- sys/kern/kern_resource.c 5 Dec 2015 10:11:53 -0000 1.55
+++ sys/kern/kern_resource.c 12 Mar 2016 15:48:36 -0000
@@ -178,8 +178,6 @@ int
 donice(struct proc *curp, struct process *chgpr, int n)
 {
  struct ucred *ucred = curp->p_ucred;
- struct proc *p;
- int s;
 
  if (ucred->cr_uid != 0 && ucred->cr_ruid != 0 &&
     ucred->cr_uid != chgpr->ps_ucred->cr_uid &&
@@ -193,10 +191,13 @@ donice(struct proc *curp, struct process
  if (n < chgpr->ps_nice && suser(curp, 0))
  return (EACCES);
  chgpr->ps_nice = n;
+ /* XXXNICE */
+ /*
  SCHED_LOCK(s);
  TAILQ_FOREACH(p, &chgpr->ps_threads, p_thr_link)
  (void)resetpriority(p);
  SCHED_UNLOCK(s);
+ */
  return (0);
 }
 
Index: sys/kern/kern_sched.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.41
diff -u -p -r1.41 kern_sched.c
--- sys/kern/kern_sched.c 23 Dec 2015 14:51:17 -0000 1.41
+++ sys/kern/kern_sched.c 12 Mar 2016 15:48:36 -0000
@@ -33,6 +33,21 @@ void sched_kthreads_create(void *);
 int sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p);
 struct proc *sched_steal_proc(struct cpu_info *);
 
+int
+sched_cmp_proc(struct proc *e1, struct proc *e2)
+{
+ if (e1->p_deadline == e2->p_deadline) {
+ /*
+ * this implementation of rb trees needs the keys to be unique
+ * XXX compare pointers for now, UB
+ */
+ return (e1 < e2 ? -1 : e1 > e2);
+ }
+ return (e1->p_deadline < e2->p_deadline ? -1 : e1->p_deadline > e2->p_deadline);
+}
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
+RB_GENERATE(prochead2, proc, p_runq2, sched_cmp_proc);
+
 /*
  * To help choosing which cpu should run which process we keep track
  * of cpus which are currently idle and which cpus have processes
@@ -81,10 +96,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
  struct schedstate_percpu *spc = &ci->ci_schedstate;
- int i;
 
- for (i = 0; i < SCHED_NQS; i++)
- TAILQ_INIT(&spc->spc_qs[i]);
+ RB_INIT(&spc->spc_rq);
 
  spc->spc_idleproc = NULL;
 
@@ -238,14 +251,13 @@ void
 setrunqueue(struct proc *p)
 {
  struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;
 
  SCHED_ASSERT_LOCKED();
  spc = &p->p_cpu->ci_schedstate;
  spc->spc_nrun++;
 
- TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
- spc->spc_whichqs |= (1 << queue);
+ RB_INSERT(prochead2, &spc->spc_rq, p);
+ spc->spc_whichqs = 1;
  cpuset_add(&sched_queued_cpus, p->p_cpu);
 
  if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -256,17 +268,15 @@ void
 remrunqueue(struct proc *p)
 {
  struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;
 
  SCHED_ASSERT_LOCKED();
  spc = &p->p_cpu->ci_schedstate;
  spc->spc_nrun--;
 
- TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
- if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
- spc->spc_whichqs &= ~(1 << queue);
- if (spc->spc_whichqs == 0)
- cpuset_del(&sched_queued_cpus, p->p_cpu);
+ RB_REMOVE(prochead2, &spc->spc_rq, p);
+ if (RB_EMPTY(&spc->spc_rq)) {
+ spc->spc_whichqs = 0;
+ cpuset_del(&sched_queued_cpus, p->p_cpu);
  }
 }
 
@@ -282,13 +292,11 @@ sched_chooseproc(void)
 #ifdef MULTIPROCESSOR
  if (spc->spc_schedflags & SPCF_SHOULDHALT) {
  if (spc->spc_whichqs) {
- for (queue = 0; queue < SCHED_NQS; queue++) {
- while ((p = TAILQ_FIRST(&spc->spc_qs[queue]))) {
- remrunqueue(p);
- p->p_cpu = sched_choosecpu(p);
- KASSERT(p->p_cpu != curcpu());
- setrunqueue(p);
- }
+ while ((p = RB_MIN(prochead2, &spc->spc_rq)) != NULL) {
+ remrunqueue(p);
+ p->p_cpu = sched_choosecpu(p);
+ KASSERT(p->p_cpu != curcpu());
+ setrunqueue(p);
  }
  }
  p = spc->spc_idleproc;
@@ -302,7 +310,7 @@ sched_chooseproc(void)
 again:
  if (spc->spc_whichqs) {
  queue = ffs(spc->spc_whichqs) - 1;
- p = TAILQ_FIRST(&spc->spc_qs[queue]);
+ p = RB_MIN(prochead2, &spc->spc_rq);
  remrunqueue(p);
  sched_noidle++;
  KASSERT(p->p_stat == SRUN);
@@ -478,7 +486,7 @@ sched_steal_proc(struct cpu_info *self)
  spc = &ci->ci_schedstate;
 
  queue = ffs(spc->spc_whichqs) - 1;
- TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
+ RB_FOREACH(p, prochead2, &spc->spc_rq) {
  if (p->p_flag & P_CPUPEG)
  continue;
 
@@ -550,8 +558,9 @@ sched_proc_to_cpu_cost(struct cpu_info *
  * and the higher the priority of the proc.
  */
  if (!cpuset_isset(&sched_idle_cpus, ci)) {
- cost += (p->p_priority - spc->spc_curpriority) *
-    sched_cost_priority;
+ /* 100ms difference? */
+ cost += ((p->p_deadline - spc->spc_curdeadline) / 100000000)
+    * sched_cost_priority;
  cost += sched_cost_runnable;
  }
  if (cpuset_isset(&sched_queued_cpus, ci))
Index: sys/kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.193
diff -u -p -r1.193 kern_sig.c
--- sys/kern/kern_sig.c 9 Mar 2016 13:38:50 -0000 1.193
+++ sys/kern/kern_sig.c 12 Mar 2016 15:48:36 -0000
@@ -1894,6 +1894,7 @@ userret(struct proc *p)
  }
 
  p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
 }
 
 int
Index: sys/kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.129
diff -u -p -r1.129 kern_synch.c
--- sys/kern/kern_synch.c 9 Mar 2016 13:38:50 -0000 1.129
+++ sys/kern/kern_synch.c 12 Mar 2016 15:48:36 -0000
@@ -274,6 +274,7 @@ sleep_finish(struct sleep_state *sls, in
 #endif
 
  p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
  SCHED_UNLOCK(sls->sls_s);
 
  /*
Index: sys/kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.43
diff -u -p -r1.43 sched_bsd.c
--- sys/kern/sched_bsd.c 9 Mar 2016 13:38:50 -0000 1.43
+++ sys/kern/sched_bsd.c 12 Mar 2016 15:48:36 -0000
@@ -61,7 +61,6 @@ struct __mp_lock sched_lock;
 #endif
 
 void schedcpu(void *);
-void updatepri(struct proc *);
 
 void
 scheduler_start(void)
@@ -91,6 +90,7 @@ roundrobin(struct cpu_info *ci)
  spc->spc_rrticks = rrticks_init;
 
  if (ci->ci_curproc != NULL) {
+ sched_deadline(ci->ci_curproc);
  if (spc->spc_schedflags & SPCF_SEENRR) {
  /*
  * The process has already been through a roundrobin
@@ -109,74 +109,6 @@ roundrobin(struct cpu_info *ci)
  need_resched(ci);
 }
 
-/*
- * Constants for digital decay and forget:
- * 90% of (p_estcpu) usage in 5 * loadav time
- * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
- *          Note that, as ps(1) mentions, this can let percentages
- *          total over 100% (I've seen 137.9% for 3 processes).
- *
- * Note that hardclock updates p_estcpu and p_cpticks independently.
- *
- * We wish to decay away 90% of p_estcpu in (5 * loadavg) seconds.
- * That is, the system wants to compute a value of decay such
- * that the following for loop:
- * for (i = 0; i < (5 * loadavg); i++)
- * p_estcpu *= decay;
- * will compute
- * p_estcpu *= 0.1;
- * for all values of loadavg:
- *
- * Mathematically this loop can be expressed by saying:
- * decay ** (5 * loadavg) ~= .1
- *
- * The system computes decay as:
- * decay = (2 * loadavg) / (2 * loadavg + 1)
- *
- * We wish to prove that the system's computation of decay
- * will always fulfill the equation:
- * decay ** (5 * loadavg) ~= .1
- *
- * If we compute b as:
- * b = 2 * loadavg
- * then
- * decay = b / (b + 1)
- *
- * We now need to prove two things:
- * 1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
- * 2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
- *
- * Facts:
- *         For x close to zero, exp(x) =~ 1 + x, since
- *              exp(x) = 0! + x**1/1! + x**2/2! + ... .
- *              therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
- *         For x close to zero, ln(1+x) =~ x, since
- *              ln(1+x) = x - x**2/2 + x**3/3 - ...     -1 < x < 1
- *              therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
- *         ln(.1) =~ -2.30
- *
- * Proof of (1):
- *    Solve (factor)**(power) =~ .1 given power (5*loadav):
- * solving for factor,
- *      ln(factor) =~ (-2.30/5*loadav), or
- *      factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
- *          exp(-1/b) =~ (b-1)/b =~ b/(b+1).                    QED
- *
- * Proof of (2):
- *    Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
- * solving for power,
- *      power*ln(b/(b+1)) =~ -2.30, or
- *      power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav.  QED
- *
- * Actual power values for the implemented algorithm are as follows:
- *      loadav: 1       2       3       4
- *      power:  5.68    10.32   14.94   19.55
- */
-
-/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
-#define loadfactor(loadav) (2 * (loadav))
-#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
-
 /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
 fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
 
@@ -197,17 +129,40 @@ fixpt_t ccpu = 0.95122942450071400909 *
 /*
  * Recompute process priorities, every second.
  */
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
 void
 schedcpu(void *arg)
 {
  struct timeout *to = (struct timeout *)arg;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
  struct proc *p;
  int s;
- unsigned int newcpu;
  int phz;
 
  /*
+ CPU_INFO_ITERATOR cii;
+ struct cpu_info *ci;
+ int i = 0;
+ CPU_INFO_FOREACH(cii, ci) {
+ i++;
+ printf("%d: ", i);
+ if (ci->ci_curproc != NULL) {
+ printf("[%s %llu %d] ",
+ ci->ci_curproc->p_comm,
+ ci->ci_curproc->p_deadline,
+ ci->ci_curproc->p_rrticks);
+
+ }
+ RB_FOREACH(p, prochead2, &ci->ci_schedstate.spc_rq) {
+ printf("<%s %llu %d>",
+ p->p_comm,
+ p->p_deadline,
+ p->p_rrticks);
+ }
+ printf("\n");
+ }
+ */
+
+ /*
  * If we have a statistics clock, use that to calculate CPU
  * time, otherwise revert to using the profiling clock (which,
  * in turn, defaults to hz if there is no separate profiling
@@ -243,19 +198,6 @@ schedcpu(void *arg)
  (p->p_cpticks * FSCALE / phz)) >> FSHIFT;
 #endif
  p->p_cpticks = 0;
- newcpu = (u_int) decay_cpu(loadfac, p->p_estcpu);
- p->p_estcpu = newcpu;
- resetpriority(p);
- if (p->p_priority >= PUSER) {
- if (p->p_stat == SRUN &&
-    (p->p_priority / SCHED_PPQ) !=
-    (p->p_usrpri / SCHED_PPQ)) {
- remrunqueue(p);
- p->p_priority = p->p_usrpri;
- setrunqueue(p);
- } else
- p->p_priority = p->p_usrpri;
- }
  SCHED_UNLOCK(s);
  }
  uvm_meter();
@@ -264,27 +206,32 @@ schedcpu(void *arg)
 }
 
 /*
- * Recalculate the priority of a process after it has slept for a while.
- * For all load averages >= 1 and max p_estcpu of 255, sleeping for at
- * least six times the loadfactor will decay p_estcpu to zero.
+ * 10% more for every nice level. scaled by a power of 2, these are
+ * later divided
  */
+int nice_mul[] = {
+ 1024, 1126, 1239, 1362, 1499,
+ 1649, 1814, 1995, 2195, 2414,
+ 2655, 2921, 3213, 3535, 3888,
+ 4277, 4705, 5175, 5693, 6262,
+ 6888, 7577, 8335, 9169, 10086,
+ 11094, 12204, 13424, 14767, 16243,
+ 17868, 19655, 21620, 23782, 26160,
+ 28776, 31654, 34820, 38302, 42132
+};
+
 void
-updatepri(struct proc *p)
+sched_deadline(struct proc *p)
 {
- unsigned int newcpu = p->p_estcpu;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
+ u_int64_t niffies;
+ /* u_int64_t offset = 100 * 1000 * 1000; */
+ struct timespec now;
 
- SCHED_ASSERT_LOCKED();
+ nanouptime(&now);
+ niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;
 
- if (p->p_slptime > 5 * loadfac)
- p->p_estcpu = 0;
- else {
- p->p_slptime--; /* the first time was done in schedcpu */
- while (newcpu && --p->p_slptime)
- newcpu = (int) decay_cpu(loadfac, newcpu);
- p->p_estcpu = newcpu;
- }
- resetpriority(p);
+ p->p_deadline = niffies; // + offset;
+ p->p_rrticks = rrticks_init;
 }
 
 /*
@@ -326,6 +273,7 @@ preempt(struct proc *newp)
 
  SCHED_LOCK(s);
  p->p_priority = p->p_usrpri;
+ sched_deadline(p);
  p->p_stat = SRUN;
  p->p_cpu = sched_choosecpu(p);
  setrunqueue(p);
@@ -473,7 +421,7 @@ resched_proc(struct proc *p, u_char pri)
  * sched state, which we currently do not do.
  */
  ci = (p->p_cpu != NULL) ? p->p_cpu : curcpu();
- if (pri < ci->ci_schedstate.spc_curpriority)
+ if (ci->ci_curproc && p->p_deadline < ci->ci_curproc->p_deadline)
  need_resched(ci);
 }
 
@@ -509,56 +457,8 @@ setrunnable(struct proc *p)
  p->p_stat = SRUN;
  p->p_cpu = sched_choosecpu(p);
  setrunqueue(p);
- if (p->p_slptime > 1)
- updatepri(p);
  p->p_slptime = 0;
  resched_proc(p, p->p_priority);
-}
-
-/*
- * Compute the priority of a process when running in user mode.
- * Arrange to reschedule if the resulting priority is better
- * than that of the current process.
- */
-void
-resetpriority(struct proc *p)
-{
- unsigned int newpriority;
-
- SCHED_ASSERT_LOCKED();
-
- newpriority = PUSER + p->p_estcpu +
-    NICE_WEIGHT * (p->p_p->ps_nice - NZERO);
- newpriority = min(newpriority, MAXPRI);
- p->p_usrpri = newpriority;
- resched_proc(p, p->p_usrpri);
-}
-
-/*
- * We adjust the priority of the current process.  The priority of a process
- * gets worse as it accumulates CPU time.  The cpu usage estimator (p_estcpu)
- * is increased here.  The formula for computing priorities (in kern_synch.c)
- * will compute a different value each time p_estcpu increases. This can
- * cause a switch, but unless the priority crosses a PPQ boundary the actual
- * queue will not change.  The cpu usage estimator ramps up quite quickly
- * when the process is running (linearly), and decays away exponentially, at
- * a rate which is proportionally slower when the system is busy.  The basic
- * principle is that the system will 90% forget that the process used a lot
- * of CPU time in 5 * loadav seconds.  This causes the system to favor
- * processes which haven't run much recently, and to round-robin among other
- * processes.
- */
-void
-schedclock(struct proc *p)
-{
- int s;
-
- SCHED_LOCK(s);
- p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);
- resetpriority(p);
- if (p->p_priority >= PUSER)
- p->p_priority = p->p_usrpri;
- SCHED_UNLOCK(s);
 }
 
 void (*cpu_setperf)(int);
Index: sys/sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.217
diff -u -p -r1.217 proc.h
--- sys/sys/proc.h 9 Mar 2016 13:38:50 -0000 1.217
+++ sys/sys/proc.h 12 Mar 2016 15:48:37 -0000
@@ -44,6 +44,7 @@
 #include <sys/selinfo.h> /* For struct selinfo */
 #include <sys/syslimits.h> /* For LOGIN_NAME_MAX */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h> /* For struct timeout */
 #include <sys/event.h> /* For struct klist */
 #include <sys/mutex.h> /* For struct mutex */
@@ -267,6 +268,7 @@ struct process {
 
 struct proc {
  TAILQ_ENTRY(proc) p_runq;
+ RB_ENTRY(proc) p_runq2;
  LIST_ENTRY(proc) p_list; /* List of all threads. */
 
  struct process *p_p; /* The process of this thread. */
@@ -320,6 +322,8 @@ struct proc {
 
 /* The following fields are all copied upon creation in fork. */
 #define p_startcopy p_sigmask
+ u_int64_t p_deadline;
+ int p_rrticks;
  sigset_t p_sigmask; /* Current signal mask. */
 
  u_char p_priority; /* Process priority. */
@@ -488,6 +492,7 @@ void fixjobc(struct process *, struct pg
 int inferior(struct process *, struct process *);
 void leavepgrp(struct process *);
 void preempt(struct proc *);
+void sched_deadline(struct proc *);
 void pgdelete(struct pgrp *);
 void procinit(void);
 void resetpriority(struct proc *);
@@ -570,4 +575,3 @@ struct cpu_info *cpuset_first(struct cpu
 
 #endif /* _KERNEL */
 #endif /* !_SYS_PROC_H_ */
-
Index: sys/sys/sched.h
===================================================================
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.40
diff -u -p -r1.40 sched.h
--- sys/sys/sched.h 9 Mar 2016 13:38:50 -0000 1.40
+++ sys/sys/sched.h 12 Mar 2016 15:48:37 -0000
@@ -70,6 +70,7 @@
 #define _SYS_SCHED_H_
 
 #include <sys/queue.h>
+#include <sys/tree.h>
 
 /*
  * Posix defines a <sched.h> which may want to include <sys/sched.h>
@@ -99,6 +100,7 @@ struct schedstate_percpu {
  u_int spc_schedticks; /* ticks for schedclock() */
  u_int64_t spc_cp_time[CPUSTATES]; /* CPU state statistics */
  u_char spc_curpriority; /* usrpri of curproc */
+ u_int64_t spc_curdeadline;
  int spc_rrticks; /* ticks until roundrobin() */
  int spc_pscnt; /* prof/stat counter */
  int spc_psdiv; /* prof/stat divisor */
@@ -109,6 +111,8 @@ struct schedstate_percpu {
 
  TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
  volatile uint32_t spc_whichqs;
+
+ RB_HEAD(prochead2, proc) spc_rq;
 
 #ifdef notyet
  struct proc *spc_reaper; /* dead proc reaper */
--
Michal Mazurek


Re: New scheduler for OpenBSD

Juan Francisco Cantero Hurtado
On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:

> Gregor Best attempted to improve the scheduler in 2011:
> http://comments.gmane.org/gmane.os.openbsd.tech/27059
> Here is another attempt, it takes up where the previous one left off.
>
> This is also mostly based on the main idea behind Linux CFS or
> BFS. I found BFS to be described more clearly:
> http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch
>
>
> Some notes:
>
> Chrome is still not very usable.

Chrome is not the best benchmark for the scheduler ATM.

Until the work in malloc to enhance the support for multithreaded
programs is finished, forget Chrome and Firefox.

>
> Much more work is needed, e.g. there is some MD code on sparc64 and
> alpha that depends on spc_schedticks that needs to be understood and
> rewritten.
>
> Maybe using RB trees to queue what is usually no more than 5 elements
> is overkill.
>
> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.
>
> I also want to try and see if implementing one shared queue, instead of
> keeping one queue per cpu will improve performance even further. Right
> now there are some heuristics to determine whether a process should
> switch cpus. This doesn't work very well yet, in my tests with the
> attached code sometimes one queue was a second behind another. From
> what I understand that's the idea behind BFS and the reason why it
> doesn't scale to 4096 CPUs. I see that OpenBSD supports 256 CPUs on
> sparc64:
> ./arch/sparc64/include/cpu.h:#define MAXCPUS    256
>
> Nice is not yet supported, but it's easy to add (uncomment and modify
> 'offset' in sched_deadline(). I didn't want to add it yet to test just
> how fair this method is. Currently all processes are equal.
>
> Other operating systems provide a couple of different schedulers to
> choose from, this diff overwrites the 4bsd scheduler with the new code
> and leaves no choice.
>
> With the new code I got some visible improvements when compiling the
> kernel:
> old: 1m46.74s real     5m08.48s user     1m27.37s system
> old: 1m46.61s real     5m06.93s user     1m28.13s system
> old: 1m49.91s real     5m19.71s user     1m28.25s system
> new: 1m38.05s real     5m12.16s user     0m55.86s system
> new: 1m38.70s real     5m12.36s user     0m57.82s system
> new: 1m39.10s real     5m16.06s user     0m56.91s system

That is going to make people running bulk builds happy :)

>
> I would be interested in hearing if the new code broke something for
> anyone.
>
> Is there a chance for a scheduler rewrite like this to be commited?
> [...]

--
Juan Francisco Cantero Hurtado http://juanfra.info


Re: New scheduler for OpenBSD

Juan Francisco Cantero Hurtado
On Sat, Mar 12, 2016 at 08:35:31PM +0100, Juan Francisco Cantero Hurtado wrote:

> On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:
> > Gregor Best attempted to improve the scheduler in 2011:
> [...]
> > I would be interested in hearing if the new code broke something for
> > anyone.
> >
> > Is there a chance for a scheduler rewrite like this to be commited?

From time to time, I run a batch video conversion. I tried the same
commands with and without your patch. Your patch slows down the
frames per second by about 30%.

Here are the commands:
$ download some 20-30 minute videos in mp4 format (youtube is fine for
this)
$ install the packages ffmpeg and libx264
$ mkdir output
$ for i in *.mp4; do ffmpeg -y -i "$i" -vf 'scale=320:-1' -map 0:v:0 -map 0:a:0 -sws_flags spline -map_metadata -1 -c:v libx264 -profile:v baseline -b:v 360k -c:a aac -strict -2 -ar 44100 -ac 2 -b:a 128k -movflags faststart "output/$i"; done



--
Juan Francisco Cantero Hurtado http://juanfra.info


Re: New scheduler for OpenBSD

Martin Pieuchot
In reply to this post by Michal Mazurek-2
On 12/03/16(Sat) 17:36, Michal Mazurek wrote:
> [...]
> Some notes:
>
> Chrome is still not very usable.

Are you trying to improve the browser experience on OpenBSD?  If that's
your goal then I'd suggest you start by analysing how the browsers
behave.  My personal analysis makes me believe that librthread is what
needs some love.

> Much more work is needed, e.g. there is some MD code on sparc64 and
> alpha that depends on spc_schedticks that needs to be understood and
> rewritten.
>
>
> Maybe using RB trees to queue what is usually no more than 5 elements
> is overkill.

I think it's overkill and makes your diff harder to read.

> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.

So what's your plan for process priorities?

> I also want to try and see if implementing one shared queue, instead of
> keeping one queue per cpu will improve performance even further. Right
> now there are some heuristics to determine whether a process should
> switch cpus. This doesn't work very well yet, in my tests with the
> attached code sometimes one queue was a second behind another. From
> what I understand that's the idea behind BFS and the reason why it
> doesn't scale to 4096 CPUs. I see that OpenBSD supports 256 CPUs on
> sparc64:
> ./arch/sparc64/include/cpu.h:#define MAXCPUS    256

"improve performance" does not say much.  For which workload on which
machine?  There's maybe room for improvement in this area, but you
really need more than "building a kernel" as test bed.

> Is there a chance for a scheduler rewrite like this to be commited?

Maybe, but not like this.  It would be nice if you could analyse various
workloads with the existing scheduler in order to argue that some of the
decisions it makes could be improved, and then explain how the changes
you're proposing improve things.

It's nice to see you experimenting with this, and I think experimenting
is important, but if your goal is to improve the existing software, you
also need to do measurements ;)

More comments on the diff below.

> Index: sys/kern/init_main.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/init_main.c,v
> retrieving revision 1.248
> diff -u -p -r1.248 init_main.c
> --- sys/kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
> +++ sys/kern/init_main.c 12 Mar 2016 15:48:36 -0000
> @@ -265,6 +265,8 @@ main(void *framep)
>   */
>   pr = &process0;
>   process_initialize(pr, p);
> + p->p_deadline = 0;
> + p->p_rrticks = 10;

Here you could use rrticks_init ;)
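
For illustration, the hunk with that suggestion applied could look like
this (a sketch only, assuming rrticks_init is visible from init_main.c,
e.g. via an extern declaration):

	pr = &process0;
	process_initialize(pr, p);
	p->p_deadline = 0;
	p->p_rrticks = rrticks_init;	/* instead of the hard-coded 10 */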

>  
>   LIST_INSERT_HEAD(&allprocess, pr, ps_list);
>   atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
> @@ -554,6 +556,7 @@ main(void *framep)
>          /*
>           * proc0: nothing to do, back to sleep
>           */
> + printf("*** modified scheduler ***\n");
>          while (1)
>                  tsleep(&proc0, PVM, "scheduler", 0);
>   /* NOTREACHED */
> Index: sys/kern/kern_clock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_clock.c,v
> retrieving revision 1.88
> diff -u -p -r1.88 kern_clock.c
> --- sys/kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
> +++ sys/kern/kern_clock.c 12 Mar 2016 15:48:36 -0000
> @@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
>   if (stathz == 0)
>   statclock(frame);
>  
> - if (--ci->ci_schedstate.spc_rrticks <= 0)
> + if (p && (--(p->p_rrticks) <= 0))
>   roundrobin(ci);

That's an interesting change.  Why did you decide to move a per-CPU
counter to a per-process one?  Can you explain the effect it will have?
Or even better, can you measure its effect on different workloads?

>   /*
> @@ -398,15 +398,7 @@ statclock(struct clockframe *frame)
>  
>   if (p != NULL) {
>   p->p_cpticks++;
> - /*
> - * If no schedclock is provided, call it here at ~~12-25 Hz;
> - * ~~16 Hz is best
> - */
> - if (schedhz == 0) {
> - if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
> -    0)
> - schedclock(p);
> - }
> + ++curcpu()->ci_schedstate.spc_schedticks;
>   }
>  }
>  
> Index: sys/kern/kern_fork.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_fork.c,v
> retrieving revision 1.184
> diff -u -p -r1.184 kern_fork.c
> --- sys/kern/kern_fork.c 9 Oct 2015 01:10:27 -0000 1.184
> +++ sys/kern/kern_fork.c 12 Mar 2016 15:48:36 -0000
> @@ -498,6 +498,7 @@ fork1(struct proc *curp, int flags, void
>   /*
>   * Make child runnable and add to run queue.
>   */
> + sched_deadline(p);
>   if ((flags & FORK_IDLE) == 0) {
>   SCHED_LOCK(s);
>   p->p_stat = SRUN;
> Index: sys/kern/kern_resource.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_resource.c,v
> retrieving revision 1.55
> diff -u -p -r1.55 kern_resource.c
> --- sys/kern/kern_resource.c 5 Dec 2015 10:11:53 -0000 1.55
> +++ sys/kern/kern_resource.c 12 Mar 2016 15:48:36 -0000
> @@ -178,8 +178,6 @@ int
>  donice(struct proc *curp, struct process *chgpr, int n)
>  {
>   struct ucred *ucred = curp->p_ucred;
> - struct proc *p;
> - int s;
>  
>   if (ucred->cr_uid != 0 && ucred->cr_ruid != 0 &&
>      ucred->cr_uid != chgpr->ps_ucred->cr_uid &&
> @@ -193,10 +191,13 @@ donice(struct proc *curp, struct process
>   if (n < chgpr->ps_nice && suser(curp, 0))
>   return (EACCES);
>   chgpr->ps_nice = n;
> + /* XXXNICE */
> + /*
>   SCHED_LOCK(s);
>   TAILQ_FOREACH(p, &chgpr->ps_threads, p_thr_link)
>   (void)resetpriority(p);
>   SCHED_UNLOCK(s);
> + */
>   return (0);
>  }
>  
> Index: sys/kern/kern_sched.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.41
> diff -u -p -r1.41 kern_sched.c
> --- sys/kern/kern_sched.c 23 Dec 2015 14:51:17 -0000 1.41
> +++ sys/kern/kern_sched.c 12 Mar 2016 15:48:36 -0000
> @@ -33,6 +33,21 @@ void sched_kthreads_create(void *);
>  int sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p);
>  struct proc *sched_steal_proc(struct cpu_info *);
>  
> +int
> +sched_cmp_proc(struct proc *e1, struct proc *e2)
> +{
> + if (e1->p_deadline == e2->p_deadline) {
> + /*
> + * this implementation of rb trees needs the keys to be unique
> + * XXX compare pointers for now, UB
> + */
> + return (e1 < e2 ? -1 : e1 > e2);
> + }
> + return (e1->p_deadline < e2->p_deadline ? -1 : e1->p_deadline > e2->p_deadline);
> +}
> +RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
> +RB_GENERATE(prochead2, proc, p_runq2, sched_cmp_proc);
> +
>  /*
>   * To help choosing which cpu should run which process we keep track
>   * of cpus which are currently idle and which cpus have processes
> @@ -81,10 +96,8 @@ void
>  sched_init_cpu(struct cpu_info *ci)
>  {
>   struct schedstate_percpu *spc = &ci->ci_schedstate;
> - int i;
>  
> - for (i = 0; i < SCHED_NQS; i++)
> - TAILQ_INIT(&spc->spc_qs[i]);
> + RB_INIT(&spc->spc_rq);
>  
>   spc->spc_idleproc = NULL;
>  
> @@ -238,14 +251,13 @@ void
>  setrunqueue(struct proc *p)
>  {
>   struct schedstate_percpu *spc;
> - int queue = p->p_priority >> 2;
>  
>   SCHED_ASSERT_LOCKED();
>   spc = &p->p_cpu->ci_schedstate;
>   spc->spc_nrun++;
>  
> - TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
> - spc->spc_whichqs |= (1 << queue);
> + RB_INSERT(prochead2, &spc->spc_rq, p);
> + spc->spc_whichqs = 1;
>   cpuset_add(&sched_queued_cpus, p->p_cpu);
>  
>   if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
> @@ -256,17 +268,15 @@ void
>  remrunqueue(struct proc *p)
>  {
>   struct schedstate_percpu *spc;
> - int queue = p->p_priority >> 2;
>  
>   SCHED_ASSERT_LOCKED();
>   spc = &p->p_cpu->ci_schedstate;
>   spc->spc_nrun--;
>  
> - TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
> - if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
> - spc->spc_whichqs &= ~(1 << queue);
> - if (spc->spc_whichqs == 0)
> - cpuset_del(&sched_queued_cpus, p->p_cpu);
> + RB_REMOVE(prochead2, &spc->spc_rq, p);
> + if (RB_EMPTY(&spc->spc_rq)) {
> + spc->spc_whichqs = 0;
> + cpuset_del(&sched_queued_cpus, p->p_cpu);
>   }
>  }
>  
> @@ -282,13 +292,11 @@ sched_chooseproc(void)
>  #ifdef MULTIPROCESSOR
>   if (spc->spc_schedflags & SPCF_SHOULDHALT) {
>   if (spc->spc_whichqs) {
> - for (queue = 0; queue < SCHED_NQS; queue++) {
> - while ((p = TAILQ_FIRST(&spc->spc_qs[queue]))) {
> - remrunqueue(p);
> - p->p_cpu = sched_choosecpu(p);
> - KASSERT(p->p_cpu != curcpu());
> - setrunqueue(p);
> - }
> + while ((p = RB_MIN(prochead2, &spc->spc_rq)) != NULL) {
> + remrunqueue(p);
> + p->p_cpu = sched_choosecpu(p);
> + KASSERT(p->p_cpu != curcpu());
> + setrunqueue(p);
>   }
>   }
>   p = spc->spc_idleproc;
> @@ -302,7 +310,7 @@ sched_chooseproc(void)
>  again:
>   if (spc->spc_whichqs) {
>   queue = ffs(spc->spc_whichqs) - 1;
> - p = TAILQ_FIRST(&spc->spc_qs[queue]);
> + p = RB_MIN(prochead2, &spc->spc_rq);
>   remrunqueue(p);
>   sched_noidle++;
>   KASSERT(p->p_stat == SRUN);
> @@ -478,7 +486,7 @@ sched_steal_proc(struct cpu_info *self)
>   spc = &ci->ci_schedstate;
>  
>   queue = ffs(spc->spc_whichqs) - 1;
> - TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
> + RB_FOREACH(p, prochead2, &spc->spc_rq) {
>   if (p->p_flag & P_CPUPEG)
>   continue;
>  
> @@ -550,8 +558,9 @@ sched_proc_to_cpu_cost(struct cpu_info *
>   * and the higher the priority of the proc.
>   */
>   if (!cpuset_isset(&sched_idle_cpus, ci)) {
> - cost += (p->p_priority - spc->spc_curpriority) *
> -    sched_cost_priority;
> + /* 100ms difference? */
> + cost += ((p->p_deadline - spc->spc_curdeadline) / 100000000)
> +    * sched_cost_priority;
>   cost += sched_cost_runnable;
>   }
>   if (cpuset_isset(&sched_queued_cpus, ci))
> Index: sys/kern/kern_sig.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sig.c,v
> retrieving revision 1.193
> diff -u -p -r1.193 kern_sig.c
> --- sys/kern/kern_sig.c 9 Mar 2016 13:38:50 -0000 1.193
> +++ sys/kern/kern_sig.c 12 Mar 2016 15:48:36 -0000
> @@ -1894,6 +1894,7 @@ userret(struct proc *p)
>   }
>  
>   p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri;
> + p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
>  }
>  
>  int
> Index: sys/kern/kern_synch.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.129
> diff -u -p -r1.129 kern_synch.c
> --- sys/kern/kern_synch.c 9 Mar 2016 13:38:50 -0000 1.129
> +++ sys/kern/kern_synch.c 12 Mar 2016 15:48:36 -0000
> @@ -274,6 +274,7 @@ sleep_finish(struct sleep_state *sls, in
>  #endif
>  
>   p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
> + p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
>   SCHED_UNLOCK(sls->sls_s);
>  
>   /*
> Index: sys/kern/sched_bsd.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 sched_bsd.c
> --- sys/kern/sched_bsd.c 9 Mar 2016 13:38:50 -0000 1.43
> +++ sys/kern/sched_bsd.c 12 Mar 2016 15:48:36 -0000
> @@ -61,7 +61,6 @@ struct __mp_lock sched_lock;
>  #endif
>  
>  void schedcpu(void *);
> -void updatepri(struct proc *);
>  
>  void
>  scheduler_start(void)
> @@ -91,6 +90,7 @@ roundrobin(struct cpu_info *ci)
>   spc->spc_rrticks = rrticks_init;
>  
>   if (ci->ci_curproc != NULL) {
> + sched_deadline(ci->ci_curproc);
>   if (spc->spc_schedflags & SPCF_SEENRR) {
>   /*
>   * The process has already been through a roundrobin
> @@ -109,74 +109,6 @@ roundrobin(struct cpu_info *ci)
>   need_resched(ci);
>  }
>  
> -/*
> - * Constants for digital decay and forget:
> - * 90% of (p_estcpu) usage in 5 * loadav time
> - * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> - *          Note that, as ps(1) mentions, this can let percentages
> - *          total over 100% (I've seen 137.9% for 3 processes).
> - *
> - * Note that hardclock updates p_estcpu and p_cpticks independently.
> - *
> - * We wish to decay away 90% of p_estcpu in (5 * loadavg) seconds.
> - * That is, the system wants to compute a value of decay such
> - * that the following for loop:
> - * for (i = 0; i < (5 * loadavg); i++)
> - * p_estcpu *= decay;
> - * will compute
> - * p_estcpu *= 0.1;
> - * for all values of loadavg:
> - *
> - * Mathematically this loop can be expressed by saying:
> - * decay ** (5 * loadavg) ~= .1
> - *
> - * The system computes decay as:
> - * decay = (2 * loadavg) / (2 * loadavg + 1)
> - *
> - * We wish to prove that the system's computation of decay
> - * will always fulfill the equation:
> - * decay ** (5 * loadavg) ~= .1
> - *
> - * If we compute b as:
> - * b = 2 * loadavg
> - * then
> - * decay = b / (b + 1)
> - *
> - * We now need to prove two things:
> - * 1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
> - * 2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
> - *
> - * Facts:
> - *         For x close to zero, exp(x) =~ 1 + x, since
> - *              exp(x) = 0! + x**1/1! + x**2/2! + ... .
> - *              therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
> - *         For x close to zero, ln(1+x) =~ x, since
> - *              ln(1+x) = x - x**2/2 + x**3/3 - ...     -1 < x < 1
> - *              therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
> - *         ln(.1) =~ -2.30
> - *
> - * Proof of (1):
> - *    Solve (factor)**(power) =~ .1 given power (5*loadav):
> - * solving for factor,
> - *      ln(factor) =~ (-2.30/5*loadav), or
> - *      factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
> - *          exp(-1/b) =~ (b-1)/b =~ b/(b+1).                    QED
> - *
> - * Proof of (2):
> - *    Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
> - * solving for power,
> - *      power*ln(b/(b+1)) =~ -2.30, or
> - *      power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav.  QED
> - *
> - * Actual power values for the implemented algorithm are as follows:
> - *      loadav: 1       2       3       4
> - *      power:  5.68    10.32   14.94   19.55
> - */
> -
> -/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> -#define loadfactor(loadav) (2 * (loadav))
> -#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> -
>  /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
>  fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
>  
> @@ -197,17 +129,40 @@ fixpt_t ccpu = 0.95122942450071400909 *
>  /*
>   * Recompute process priorities, every second.
>   */
> +RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
>  void
>  schedcpu(void *arg)
>  {
>   struct timeout *to = (struct timeout *)arg;
> - fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
>   struct proc *p;
>   int s;
> - unsigned int newcpu;
>   int phz;
>  
>   /*
> + CPU_INFO_ITERATOR cii;
> + struct cpu_info *ci;
> + int i = 0;
> + CPU_INFO_FOREACH(cii, ci) {
> + i++;
> + printf("%d: ", i);
> + if (ci->ci_curproc != NULL) {
> + printf("[%s %llu %d] ",
> + ci->ci_curproc->p_comm,
> + ci->ci_curproc->p_deadline,
> + ci->ci_curproc->p_rrticks);
> +
> + }
> + RB_FOREACH(p, prochead2, &ci->ci_schedstate.spc_rq) {
> + printf("<%s %llu %d>",
> + p->p_comm,
> + p->p_deadline,
> + p->p_rrticks);
> + }
> + printf("\n");
> + }
> + */
> +
> + /*
>   * If we have a statistics clock, use that to calculate CPU
>   * time, otherwise revert to using the profiling clock (which,
>   * in turn, defaults to hz if there is no separate profiling
> @@ -243,19 +198,6 @@ schedcpu(void *arg)
>   (p->p_cpticks * FSCALE / phz)) >> FSHIFT;
>  #endif
>   p->p_cpticks = 0;
> - newcpu = (u_int) decay_cpu(loadfac, p->p_estcpu);
> - p->p_estcpu = newcpu;
> - resetpriority(p);
> - if (p->p_priority >= PUSER) {
> - if (p->p_stat == SRUN &&
> -    (p->p_priority / SCHED_PPQ) !=
> -    (p->p_usrpri / SCHED_PPQ)) {
> - remrunqueue(p);
> - p->p_priority = p->p_usrpri;
> - setrunqueue(p);
> - } else
> - p->p_priority = p->p_usrpri;
> - }
>   SCHED_UNLOCK(s);
>   }
>   uvm_meter();
> @@ -264,27 +206,32 @@ schedcpu(void *arg)
>  }
>  
>  /*
> - * Recalculate the priority of a process after it has slept for a while.
> - * For all load averages >= 1 and max p_estcpu of 255, sleeping for at
> - * least six times the loadfactor will decay p_estcpu to zero.
> + * 10% more for every nice level. scaled by a power of 2, these are
> + * later divided
>   */
> +int nice_mul[] = {
> + 1024, 1126, 1239, 1362, 1499,
> + 1649, 1814, 1995, 2195, 2414,
> + 2655, 2921, 3213, 3535, 3888,
> + 4277, 4705, 5175, 5693, 6262,
> + 6888, 7577, 8335, 9169, 10086,
> + 11094, 12204, 13424, 14767, 16243,
> + 17868, 19655, 21620, 23782, 26160,
> + 28776, 31654, 34820, 38302, 42132
> +};
> +
>  void
> -updatepri(struct proc *p)
> +sched_deadline(struct proc *p)
>  {
> - unsigned int newcpu = p->p_estcpu;
> - fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> + u_int64_t niffies;
> + /* u_int64_t offset = 100 * 1000 * 1000; */
> + struct timespec now;
>  
> - SCHED_ASSERT_LOCKED();
> + nanouptime(&now);
> + niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;

Can you explain how you chose this period of time and what it is
supposed to represent?

> - if (p->p_slptime > 5 * loadfac)
> - p->p_estcpu = 0;
> - else {
> - p->p_slptime--; /* the first time was done in schedcpu */
> - while (newcpu && --p->p_slptime)
> - newcpu = (int) decay_cpu(loadfac, newcpu);
> - p->p_estcpu = newcpu;
> - }
> - resetpriority(p);
> + p->p_deadline = niffies; // + offset;
> + p->p_rrticks = rrticks_init;
>  }
>  
>  /*
> @@ -326,6 +273,7 @@ preempt(struct proc *newp)
>  
>   SCHED_LOCK(s);
>   p->p_priority = p->p_usrpri;
> + sched_deadline(p);
>   p->p_stat = SRUN;
>   p->p_cpu = sched_choosecpu(p);
>   setrunqueue(p);
> @@ -473,7 +421,7 @@ resched_proc(struct proc *p, u_char pri)
>   * sched state, which we currently do not do.
>   */
>   ci = (p->p_cpu != NULL) ? p->p_cpu : curcpu();
> - if (pri < ci->ci_schedstate.spc_curpriority)
> + if (ci->ci_curproc && p->p_deadline < ci->ci_curproc->p_deadline)
>   need_resched(ci);
>  }
>  
> @@ -509,56 +457,8 @@ setrunnable(struct proc *p)
>   p->p_stat = SRUN;
>   p->p_cpu = sched_choosecpu(p);
>   setrunqueue(p);
> - if (p->p_slptime > 1)
> - updatepri(p);

What's happening to p_slptime?

>   p->p_slptime = 0;
>   resched_proc(p, p->p_priority);
> -}
> -
> -/*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(struct proc *p)
> -{
> - unsigned int newpriority;
> -
> - SCHED_ASSERT_LOCKED();
> -
> - newpriority = PUSER + p->p_estcpu +
> -    NICE_WEIGHT * (p->p_p->ps_nice - NZERO);
> - newpriority = min(newpriority, MAXPRI);
> - p->p_usrpri = newpriority;
> - resched_proc(p, p->p_usrpri);
> -}
> -
> -/*
> - * We adjust the priority of the current process.  The priority of a process
> - * gets worse as it accumulates CPU time.  The cpu usage estimator (p_estcpu)
> - * is increased here.  The formula for computing priorities (in kern_synch.c)
> - * will compute a different value each time p_estcpu increases. This can
> - * cause a switch, but unless the priority crosses a PPQ boundary the actual
> - * queue will not change.  The cpu usage estimator ramps up quite quickly
> - * when the process is running (linearly), and decays away exponentially, at
> - * a rate which is proportionally slower when the system is busy.  The basic
> - * principle is that the system will 90% forget that the process used a lot
> - * of CPU time in 5 * loadav seconds.  This causes the system to favor
> - * processes which haven't run much recently, and to round-robin among other
> - * processes.
> - */
> -void
> -schedclock(struct proc *p)
> -{
> - int s;
> -
> - SCHED_LOCK(s);
> - p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);
> - resetpriority(p);
> - if (p->p_priority >= PUSER)
> - p->p_priority = p->p_usrpri;
> - SCHED_UNLOCK(s);
>  }
>  
>  void (*cpu_setperf)(int);
> Index: sys/sys/proc.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/proc.h,v
> retrieving revision 1.217
> diff -u -p -r1.217 proc.h
> --- sys/sys/proc.h 9 Mar 2016 13:38:50 -0000 1.217
> +++ sys/sys/proc.h 12 Mar 2016 15:48:37 -0000
> @@ -44,6 +44,7 @@
>  #include <sys/selinfo.h> /* For struct selinfo */
>  #include <sys/syslimits.h> /* For LOGIN_NAME_MAX */
>  #include <sys/queue.h>
> +#include <sys/tree.h>
>  #include <sys/timeout.h> /* For struct timeout */
>  #include <sys/event.h> /* For struct klist */
>  #include <sys/mutex.h> /* For struct mutex */
> @@ -267,6 +268,7 @@ struct process {
>  
>  struct proc {
>   TAILQ_ENTRY(proc) p_runq;
> + RB_ENTRY(proc) p_runq2;
>   LIST_ENTRY(proc) p_list; /* List of all threads. */
>  
>   struct process *p_p; /* The process of this thread. */
> @@ -320,6 +322,8 @@ struct proc {
>  
>  /* The following fields are all copied upon creation in fork. */
>  #define p_startcopy p_sigmask
> + u_int64_t p_deadline;
> + int p_rrticks;
>   sigset_t p_sigmask; /* Current signal mask. */
>  
>   u_char p_priority; /* Process priority. */
> @@ -488,6 +492,7 @@ void fixjobc(struct process *, struct pg
>  int inferior(struct process *, struct process *);
>  void leavepgrp(struct process *);
>  void preempt(struct proc *);
> +void sched_deadline(struct proc *);
>  void pgdelete(struct pgrp *);
>  void procinit(void);
>  void resetpriority(struct proc *);
> @@ -570,4 +575,3 @@ struct cpu_info *cpuset_first(struct cpu
>  
>  #endif /* _KERNEL */
>  #endif /* !_SYS_PROC_H_ */
> -
> Index: sys/sys/sched.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/sched.h,v
> retrieving revision 1.40
> diff -u -p -r1.40 sched.h
> --- sys/sys/sched.h 9 Mar 2016 13:38:50 -0000 1.40
> +++ sys/sys/sched.h 12 Mar 2016 15:48:37 -0000
> @@ -70,6 +70,7 @@
>  #define _SYS_SCHED_H_
>  
>  #include <sys/queue.h>
> +#include <sys/tree.h>
>  
>  /*
>   * Posix defines a <sched.h> which may want to include <sys/sched.h>
> @@ -99,6 +100,7 @@ struct schedstate_percpu {
>   u_int spc_schedticks; /* ticks for schedclock() */
>   u_int64_t spc_cp_time[CPUSTATES]; /* CPU state statistics */
>   u_char spc_curpriority; /* usrpri of curproc */
> + u_int64_t spc_curdeadline;
>   int spc_rrticks; /* ticks until roundrobin() */
>   int spc_pscnt; /* prof/stat counter */
>   int spc_psdiv; /* prof/stat divisor */
> @@ -109,6 +111,8 @@ struct schedstate_percpu {
>  
>   TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
>   volatile uint32_t spc_whichqs;
> +
> + RB_HEAD(prochead2, proc) spc_rq;
>  
>  #ifdef notyet
>   struct proc *spc_reaper; /* dead proc reaper */
> --
> Michal Mazurek
>


Re: New scheduler for OpenBSD

Amit Kulkarni-5
In reply to this post by Michal Mazurek-2
On Sat, Mar 12, 2016 at 10:36 AM, Michal Mazurek <[hidden email]>
wrote:

> Gregor Best attempted to improve the scheduler in 2011:
> http://comments.gmane.org/gmane.os.openbsd.tech/27059
> Here is another attempt, it takes up where the previous one left off.
>
> This is also mostly based on the main idea behind Linux CFS or
> BFS. I found BFS to be described more clearly:
> http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch
>
>
> Some notes:
>
> Chrome is still not very usable.
>
> Much more work is needed, e.g. there is some MD code on sparc64 and
> alpha that depends on spc_schedticks that needs to be understood and
> rewritten.
>
> Maybe using RB trees to queue what is usually no more than 5 elements
> is overkill.
>
> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.
>
> I also want to try and see if implementing one shared queue, instead of
> keeping one queue per cpu will improve performance even further. Right
> now there are some heuristics to determine whether a process should
> switch cpus. This doesn't work very well yet, in my tests with the
> attached code sometimes one queue was a second behind another. From
> what I understand that's the idea behind BFS and the reason why it
> doesn't scale to 4096 CPUs. I see that OpenBSD supports 256 CPUs on
> sparc64:
> ./arch/sparc64/include/cpu.h:#define MAXCPUS    256
>
>
Hi Michal,
One shared queue scales poorly as the number of CPUs increases, because it
effectively becomes a global lock.
Thanks

Re: New scheduler for OpenBSD

Michal Mazurek-2
In reply to this post by Juan Francisco Cantero Hurtado
On 04:41:05, 13.03.16, Juan Francisco Cantero Hurtado wrote:
> Here are the commands:
> ...
> ffmpeg
> ...

Thank you for this.

ffmpeg runs differently from gcc or make - it creates a lot of threads.
I can verify that it is indeed slower. Instead of spending 2 seconds in
'system' it takes 30 or 40 seconds on the new scheduler.

After some tests I realised that ffmpeg, when running on the present BSD
scheduler, will call sys_sched_yield() 321,068 times. When running on the
new scheduler, it will call sys_sched_yield() 2,507,894 times, nearly eight
times that. The bulk of the calls are made from this function:

lib/librthread/rthread.c:
void
_spinlock(volatile struct _spinlock *lock)
{
        while (_atomic_lock(&lock->ticket))
                sched_yield();
}

To verify I replaced sched_yield() with usleep():

void
_spinlock(volatile struct _spinlock *lock)
{
        while (_atomic_lock(&lock->ticket))
                usleep(1);
}

The number of calls to yield() dropped to 4,576.



I had a look at how the Linux BFS scheduler solves this problem, and I saw
this:
/**
 * yield - yield the current processor to other threads.
 *
 * Do not ever use this function, there's a 99% chance you're doing it wrong.
 *
 * The scheduler is at all times free to pick the calling task as the most
 * eligible task to run, if removing the yield() call from your code breaks
 * it, its already broken.
 *
 * Typical broken usage is:
 *
 * while (!event)
 *     yield();
 *
 * where one assumes that yield() will let 'the other' process run that will
 * make event true. If the current task is a SCHED_FIFO task that will never
 * happen. Never use yield() as a progress guarantee!!
 *
 * If you want to use yield() to wait for something, use wait_event().
 * If you want to use yield() to be 'nice' for others, use cond_resched().
 * If you still want to use yield(), do not!
 */
void __sched yield(void)
{
        set_current_state(TASK_RUNNING);
        sys_sched_yield();
}
EXPORT_SYMBOL(yield);


This is where I get stuck, I don't know how to replace the call to
sched_yield(), or whether it is a good idea to do so. Any advice?

--
Michal Mazurek


Re: New scheduler for OpenBSD

Martin Pieuchot
On 14/03/16(Mon) 16:05, Michal Mazurek wrote:

> On 04:41:05, 13.03.16, Juan Francisco Cantero Hurtado wrote:
> > Here are the commands:
> > ...
> > ffmpeg
> > ...
>
> Thank you for this.
>
> ffmpeg runs differently from gcc or make - it creates a lot of threads.
> I can verify that it is indeed slower. Instead of spending 2 seconds in
> 'system' it takes 30 or 40 seconds on the new scheduler.
>
> After some tests I realised that ffmpeg, when running on the present BSD
> scheduler, will call sys_sched_yield() 321,068 times. When running on the
> new scheduler, it will call sys_sched_yield() 2,507,894 times - nearly ten
> times that. The bulk of the calls are made from this function:
>
> lib/librthread/rthread.c:
> void
> _spinlock(volatile struct _spinlock *lock)
> {
> while (_atomic_lock(&lock->ticket))
> sched_yield();
> }
>
> To verify I replaced sched_yield() with usleep():
>
> void
> _spinlock(volatile struct _spinlock *lock)
> {
> while (_atomic_lock(&lock->ticket))
> usleep(1);
> }
>
> The number of calls to yield() dropped to 4,576.

This is really similar to what I observed with Firefox and Chrome.

> This is where I get stuck, I don't know how to replace the call to
> sched_yield(), or whether it is a good idea to do so. Any advice?

I'd assume the problem is not in the _spinlock() implementation itself
but rather in the code calling this lock.  Do you know which code is
triggering this?

See r1.13 of lib/librthread/rthread_libc.c; it is an example where the
use of a spinlock created similar symptoms.


Re: New scheduler for OpenBSD

Michal Mazurek-2
In reply to this post by Martin Pieuchot
On 16:35:49, 13.03.16, Martin Pieuchot wrote:

> On 12/03/16(Sat) 17:36, Michal Mazurek wrote:
> > [...]
> > Some notes:
> >
> > Chrome is still not very usable.
>
> Are you wanting to improve the browser experience on OpenBSD?  If that's
> your goal then I'd suggest you to start by analysing how the browsers
> behave.  My personal analysis makes me believe that librthread is what
> needs some love.

It seems you were right, see my other email.

> > p_usrpri and p_priority will go away, so userland utilities like 'ps'
> > will need to be changed.
>
> So what's your plan for process priorities?
> ...
> "improve performance" does not say much.  For which workload on which
> machine?  There's maybe room for improvement in this area, but you
> really need more than "building a kernel" as test bed.

Replace them with "deadlines", to achieve "fair" scheduling. This should
provide superior scheduling for interactive applications. We'll see :)

> > - if (--ci->ci_schedstate.spc_rrticks <= 0)
> > + if (p && (--(p->p_rrticks) <= 0))
> >   roundrobin(ci);
>
> That's an interesting change.  Why did you decide to move a per-CPU
> counter to a per-process one?  Can you explain the effect it will have?
> Or even better can you measure its effect on different workloads?

That's the supposedly genius idea behind BFS. An interactive process
can yield and let batch processes do their work. Then when it's time to
process cursor movement, it will come back with the best priority, but
with its p_rrticks used up to prevent abuse. We'll see...

Perhaps it should be operated on as an element of the ci_schedstate
structure, similar to how spc_curpriority is a copy of p_priority now,
and copied to/from proc during context switches? I don't know yet.
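
Roughly, that variant might look like this (the helper names are made
up for illustration, nothing like this is in the attached diff):

void
sched_switchin(struct cpu_info *ci, struct proc *p)
{
        /* load the incoming thread's remaining tick budget */
        ci->ci_schedstate.spc_rrticks = p->p_rrticks;
}

void
sched_switchout(struct cpu_info *ci, struct proc *p)
{
        /* save what is left so the thread resumes with the same budget */
        p->p_rrticks = ci->ci_schedstate.spc_rrticks;
}

hardclock() would then keep decrementing only the per-CPU copy:

        if (--ci->ci_schedstate.spc_rrticks <= 0)
                roundrobin(ci);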

> > - SCHED_ASSERT_LOCKED();
> > + nanouptime(&now);
> > + niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;
>
> Can you explain how you chose this period of time and what it is
> supposed to represent?

These are niffies, like jiffies but in nanoseconds: a high-precision,
monotonically increasing counter. The value itself isn't really
important. Some multipliers for 'nice' offsets will need to be
chosen, though.
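
To illustrate where the nice_mul table from the diff would plug in,
sched_deadline() could end up looking roughly like this; the offset is
still commented out in the attached code, so treat this strictly as a
sketch:

void
sched_deadline(struct proc *p)
{
        u_int64_t niffies, offset;
        struct timespec now;

        nanouptime(&now);
        niffies = now.tv_sec * (1000ULL * 1000 * 1000) + now.tv_nsec;

        /* 100ms base offset, stretched ~10% per nice level */
        /* (nice_mul is scaled by 1024, hence the division) */
        offset = (100ULL * 1000 * 1000 * nice_mul[p->p_p->ps_nice]) / 1024;

        p->p_deadline = niffies + offset;
        p->p_rrticks = rrticks_init;
}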

> > @@ -509,56 +457,8 @@ setrunnable(struct proc *p)
> >   p->p_stat = SRUN;
> >   p->p_cpu = sched_choosecpu(p);
> >   setrunqueue(p);
> > - if (p->p_slptime > 1)
> > - updatepri(p);
>
> What's happening to p_slptime?

It's not needed for the new scheduler. The 4BSD scheduler adjusts
priorities based on how many seconds a process has been sleeping.

--
Michal Mazurek


Re: New scheduler for OpenBSD

Alexandre Ratchov-2
In reply to this post by Michal Mazurek-2
On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:
>
> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.
>

AFAIU, this would hurt interactive programs (audio, players, games,
etc).  Currently i/o bound processes wake up with increased
priority and steal the cpu from cpu-bound processes.  Removing
p_priority would break this mechanism.


Re: New scheduler for OpenBSD

Michal Mazurek-2
In reply to this post by Martin Pieuchot
On 16:28:33, 14.03.16, Martin Pieuchot wrote:

> > The number of calls to yield() dropped to 4,576.
>
> This is really similar to what I observed with Firefox and Chrome.
>
> > This is where I get stuck, I don't know how to replace the call to
> > sched_yield(), or whether it is a good idea to do so. Any advice?
>
> I'd assume the problem is not in the _spinlock() implementation itself
> but rather on the code calling this lock.  Do you know which code is
> triggering this?
>
> See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> the use of a spinlock created similar symptoms.

I don't know how to fix it, but in the meantime here is a workaround so
I can continue working on the scheduler. In yield():

+ tmp = p->p_rrticks;
+ sched_deadline(p);
+ p->p_rrticks = tmp;

So penalise a process calling yield() by giving it the worst deadline,
i.e. make it go to the end of the run queue.

Other than this, no changes.

Chrome still isn't smooth.

Please test, and let me know if the performance of something else
degrades.


Index: kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.248
diff -u -p -r1.248 init_main.c
--- kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
+++ kern/init_main.c 15 Mar 2016 13:56:58 -0000
@@ -265,6 +265,8 @@ main(void *framep)
  */
  pr = &process0;
  process_initialize(pr, p);
+ p->p_deadline = 0;
+ p->p_rrticks = 10;
 
  LIST_INSERT_HEAD(&allprocess, pr, ps_list);
  atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
@@ -554,6 +556,7 @@ main(void *framep)
         /*
          * proc0: nothing to do, back to sleep
          */
+ printf("*** modified scheduler ***\n");
         while (1)
                 tsleep(&proc0, PVM, "scheduler", 0);
  /* NOTREACHED */
Index: kern/kern_clock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.88
diff -u -p -r1.88 kern_clock.c
--- kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
+++ kern/kern_clock.c 15 Mar 2016 13:56:58 -0000
@@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
  if (stathz == 0)
  statclock(frame);
 
- if (--ci->ci_schedstate.spc_rrticks <= 0)
+ if (p && (--(p->p_rrticks) <= 0))
  roundrobin(ci);
 
  /*
@@ -398,15 +398,7 @@ statclock(struct clockframe *frame)
 
  if (p != NULL) {
  p->p_cpticks++;
- /*
- * If no schedclock is provided, call it here at ~~12-25 Hz;
- * ~~16 Hz is best
- */
- if (schedhz == 0) {
- if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
-    0)
- schedclock(p);
- }
+ ++curcpu()->ci_schedstate.spc_schedticks;
  }
 }
 
Index: kern/kern_fork.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.185
diff -u -p -r1.185 kern_fork.c
--- kern/kern_fork.c 11 Mar 2016 19:10:14 -0000 1.185
+++ kern/kern_fork.c 15 Mar 2016 13:56:58 -0000
@@ -498,6 +498,7 @@ fork1(struct proc *curp, int flags, void
  /*
  * Make child runnable and add to run queue.
  */
+ sched_deadline(p);
  if ((flags & FORK_IDLE) == 0) {
  SCHED_LOCK(s);
  p->p_stat = SRUN;
Index: kern/kern_resource.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_resource.c,v
retrieving revision 1.55
diff -u -p -r1.55 kern_resource.c
--- kern/kern_resource.c 5 Dec 2015 10:11:53 -0000 1.55
+++ kern/kern_resource.c 15 Mar 2016 13:56:58 -0000
@@ -178,8 +178,6 @@ int
 donice(struct proc *curp, struct process *chgpr, int n)
 {
  struct ucred *ucred = curp->p_ucred;
- struct proc *p;
- int s;
 
  if (ucred->cr_uid != 0 && ucred->cr_ruid != 0 &&
     ucred->cr_uid != chgpr->ps_ucred->cr_uid &&
@@ -193,10 +191,13 @@ donice(struct proc *curp, struct process
  if (n < chgpr->ps_nice && suser(curp, 0))
  return (EACCES);
  chgpr->ps_nice = n;
+ /* XXXNICE */
+ /*
  SCHED_LOCK(s);
  TAILQ_FOREACH(p, &chgpr->ps_threads, p_thr_link)
  (void)resetpriority(p);
  SCHED_UNLOCK(s);
+ */
  return (0);
 }
 
Index: kern/kern_sched.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.41
diff -u -p -r1.41 kern_sched.c
--- kern/kern_sched.c 23 Dec 2015 14:51:17 -0000 1.41
+++ kern/kern_sched.c 15 Mar 2016 13:56:58 -0000
@@ -33,6 +33,21 @@ void sched_kthreads_create(void *);
 int sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p);
 struct proc *sched_steal_proc(struct cpu_info *);
 
+int
+sched_cmp_proc(struct proc *e1, struct proc *e2)
+{
+ if (e1->p_deadline == e2->p_deadline) {
+ /*
+ * this implementation of rb trees needs the keys to be unique
+ * XXX compare pointers for now, UB
+ */
+ return (e1 < e2 ? -1 : e1 > e2);
+ }
+ return (e1->p_deadline < e2->p_deadline ? -1 : e1->p_deadline > e2->p_deadline);
+}
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
+RB_GENERATE(prochead2, proc, p_runq2, sched_cmp_proc);
+
 /*
  * To help choosing which cpu should run which process we keep track
  * of cpus which are currently idle and which cpus have processes
@@ -81,10 +96,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
  struct schedstate_percpu *spc = &ci->ci_schedstate;
- int i;
 
- for (i = 0; i < SCHED_NQS; i++)
- TAILQ_INIT(&spc->spc_qs[i]);
+ RB_INIT(&spc->spc_rq);
 
  spc->spc_idleproc = NULL;
 
@@ -238,14 +251,13 @@ void
 setrunqueue(struct proc *p)
 {
  struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;
 
  SCHED_ASSERT_LOCKED();
  spc = &p->p_cpu->ci_schedstate;
  spc->spc_nrun++;
 
- TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
- spc->spc_whichqs |= (1 << queue);
+ RB_INSERT(prochead2, &spc->spc_rq, p);
+ spc->spc_whichqs = 1;
  cpuset_add(&sched_queued_cpus, p->p_cpu);
 
  if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -256,17 +268,15 @@ void
 remrunqueue(struct proc *p)
 {
  struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;
 
  SCHED_ASSERT_LOCKED();
  spc = &p->p_cpu->ci_schedstate;
  spc->spc_nrun--;
 
- TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
- if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
- spc->spc_whichqs &= ~(1 << queue);
- if (spc->spc_whichqs == 0)
- cpuset_del(&sched_queued_cpus, p->p_cpu);
+ RB_REMOVE(prochead2, &spc->spc_rq, p);
+ if (RB_EMPTY(&spc->spc_rq)) {
+ spc->spc_whichqs = 0;
+ cpuset_del(&sched_queued_cpus, p->p_cpu);
  }
 }
 
@@ -282,13 +292,11 @@ sched_chooseproc(void)
 #ifdef MULTIPROCESSOR
  if (spc->spc_schedflags & SPCF_SHOULDHALT) {
  if (spc->spc_whichqs) {
- for (queue = 0; queue < SCHED_NQS; queue++) {
- while ((p = TAILQ_FIRST(&spc->spc_qs[queue]))) {
- remrunqueue(p);
- p->p_cpu = sched_choosecpu(p);
- KASSERT(p->p_cpu != curcpu());
- setrunqueue(p);
- }
+ while ((p = RB_MIN(prochead2, &spc->spc_rq)) != NULL) {
+ remrunqueue(p);
+ p->p_cpu = sched_choosecpu(p);
+ KASSERT(p->p_cpu != curcpu());
+ setrunqueue(p);
  }
  }
  p = spc->spc_idleproc;
@@ -302,7 +310,7 @@ sched_chooseproc(void)
 again:
  if (spc->spc_whichqs) {
  queue = ffs(spc->spc_whichqs) - 1;
- p = TAILQ_FIRST(&spc->spc_qs[queue]);
+ p = RB_MIN(prochead2, &spc->spc_rq);
  remrunqueue(p);
  sched_noidle++;
  KASSERT(p->p_stat == SRUN);
@@ -478,7 +486,7 @@ sched_steal_proc(struct cpu_info *self)
  spc = &ci->ci_schedstate;
 
  queue = ffs(spc->spc_whichqs) - 1;
- TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
+ RB_FOREACH(p, prochead2, &spc->spc_rq) {
  if (p->p_flag & P_CPUPEG)
  continue;
 
@@ -550,8 +558,6 @@ sched_proc_to_cpu_cost(struct cpu_info *
  * and the higher the priority of the proc.
  */
  if (!cpuset_isset(&sched_idle_cpus, ci)) {
- cost += (p->p_priority - spc->spc_curpriority) *
-    sched_cost_priority;
  cost += sched_cost_runnable;
  }
  if (cpuset_isset(&sched_queued_cpus, ci))
Index: kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.193
diff -u -p -r1.193 kern_sig.c
--- kern/kern_sig.c 9 Mar 2016 13:38:50 -0000 1.193
+++ kern/kern_sig.c 15 Mar 2016 13:56:58 -0000
@@ -1894,6 +1894,7 @@ userret(struct proc *p)
  }
 
  p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
 }
 
 int
Index: kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.129
diff -u -p -r1.129 kern_synch.c
--- kern/kern_synch.c 9 Mar 2016 13:38:50 -0000 1.129
+++ kern/kern_synch.c 15 Mar 2016 13:56:58 -0000
@@ -274,6 +274,7 @@ sleep_finish(struct sleep_state *sls, in
 #endif
 
  p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
  SCHED_UNLOCK(sls->sls_s);
 
  /*
Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.43
diff -u -p -r1.43 sched_bsd.c
--- kern/sched_bsd.c 9 Mar 2016 13:38:50 -0000 1.43
+++ kern/sched_bsd.c 15 Mar 2016 13:56:58 -0000
@@ -61,7 +61,6 @@ struct __mp_lock sched_lock;
 #endif
 
 void schedcpu(void *);
-void updatepri(struct proc *);
 
 void
 scheduler_start(void)
@@ -88,9 +87,8 @@ roundrobin(struct cpu_info *ci)
 {
  struct schedstate_percpu *spc = &ci->ci_schedstate;
 
- spc->spc_rrticks = rrticks_init;
-
  if (ci->ci_curproc != NULL) {
+ sched_deadline(ci->ci_curproc);
  if (spc->spc_schedflags & SPCF_SEENRR) {
  /*
  * The process has already been through a roundrobin
@@ -109,74 +107,6 @@ roundrobin(struct cpu_info *ci)
  need_resched(ci);
 }
 
-/*
- * Constants for digital decay and forget:
- * 90% of (p_estcpu) usage in 5 * loadav time
- * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
- *          Note that, as ps(1) mentions, this can let percentages
- *          total over 100% (I've seen 137.9% for 3 processes).
- *
- * Note that hardclock updates p_estcpu and p_cpticks independently.
- *
- * We wish to decay away 90% of p_estcpu in (5 * loadavg) seconds.
- * That is, the system wants to compute a value of decay such
- * that the following for loop:
- * for (i = 0; i < (5 * loadavg); i++)
- * p_estcpu *= decay;
- * will compute
- * p_estcpu *= 0.1;
- * for all values of loadavg:
- *
- * Mathematically this loop can be expressed by saying:
- * decay ** (5 * loadavg) ~= .1
- *
- * The system computes decay as:
- * decay = (2 * loadavg) / (2 * loadavg + 1)
- *
- * We wish to prove that the system's computation of decay
- * will always fulfill the equation:
- * decay ** (5 * loadavg) ~= .1
- *
- * If we compute b as:
- * b = 2 * loadavg
- * then
- * decay = b / (b + 1)
- *
- * We now need to prove two things:
- * 1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
- * 2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
- *
- * Facts:
- *         For x close to zero, exp(x) =~ 1 + x, since
- *              exp(x) = 0! + x**1/1! + x**2/2! + ... .
- *              therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
- *         For x close to zero, ln(1+x) =~ x, since
- *              ln(1+x) = x - x**2/2 + x**3/3 - ...     -1 < x < 1
- *              therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
- *         ln(.1) =~ -2.30
- *
- * Proof of (1):
- *    Solve (factor)**(power) =~ .1 given power (5*loadav):
- * solving for factor,
- *      ln(factor) =~ (-2.30/5*loadav), or
- *      factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
- *          exp(-1/b) =~ (b-1)/b =~ b/(b+1).                    QED
- *
- * Proof of (2):
- *    Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
- * solving for power,
- *      power*ln(b/(b+1)) =~ -2.30, or
- *      power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav.  QED
- *
- * Actual power values for the implemented algorithm are as follows:
- *      loadav: 1       2       3       4
- *      power:  5.68    10.32   14.94   19.55
- */
-
-/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
-#define loadfactor(loadav) (2 * (loadav))
-#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
-
 /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
 fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
 
@@ -197,14 +127,13 @@ fixpt_t ccpu = 0.95122942450071400909 *
 /*
  * Recompute process priorities, every second.
  */
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
 void
 schedcpu(void *arg)
 {
  struct timeout *to = (struct timeout *)arg;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
  struct proc *p;
  int s;
- unsigned int newcpu;
  int phz;
 
  /*
@@ -243,19 +172,6 @@ schedcpu(void *arg)
  (p->p_cpticks * FSCALE / phz)) >> FSHIFT;
 #endif
  p->p_cpticks = 0;
- newcpu = (u_int) decay_cpu(loadfac, p->p_estcpu);
- p->p_estcpu = newcpu;
- resetpriority(p);
- if (p->p_priority >= PUSER) {
- if (p->p_stat == SRUN &&
-    (p->p_priority / SCHED_PPQ) !=
-    (p->p_usrpri / SCHED_PPQ)) {
- remrunqueue(p);
- p->p_priority = p->p_usrpri;
- setrunqueue(p);
- } else
- p->p_priority = p->p_usrpri;
- }
  SCHED_UNLOCK(s);
  }
  uvm_meter();
@@ -264,27 +180,32 @@ schedcpu(void *arg)
 }
 
 /*
- * Recalculate the priority of a process after it has slept for a while.
- * For all load averages >= 1 and max p_estcpu of 255, sleeping for at
- * least six times the loadfactor will decay p_estcpu to zero.
+ * 10% more for every nice level. scaled by a power of 2, these are
+ * later divided
  */
+int nice_mul[] = {
+ 1024, 1126, 1239, 1362, 1499,
+ 1649, 1814, 1995, 2195, 2414,
+ 2655, 2921, 3213, 3535, 3888,
+ 4277, 4705, 5175, 5693, 6262,
+ 6888, 7577, 8335, 9169, 10086,
+ 11094, 12204, 13424, 14767, 16243,
+ 17868, 19655, 21620, 23782, 26160,
+ 28776, 31654, 34820, 38302, 42132
+};
+
 void
-updatepri(struct proc *p)
+sched_deadline(struct proc *p)
 {
- unsigned int newcpu = p->p_estcpu;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
+ u_int64_t niffies;
+ /* u_int64_t offset = 100 * 1000 * 1000; */
+ struct timespec now;
 
- SCHED_ASSERT_LOCKED();
+ nanouptime(&now);
+ niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;
 
- if (p->p_slptime > 5 * loadfac)
- p->p_estcpu = 0;
- else {
- p->p_slptime--; /* the first time was done in schedcpu */
- while (newcpu && --p->p_slptime)
- newcpu = (int) decay_cpu(loadfac, newcpu);
- p->p_estcpu = newcpu;
- }
- resetpriority(p);
+ p->p_deadline = niffies; // + offset;
+ p->p_rrticks = rrticks_init;
 }
 
 /*
@@ -296,8 +217,16 @@ yield(void)
 {
  struct proc *p = curproc;
  int s;
+ int tmp;
 
  SCHED_LOCK(s);
+ /*
+ * workaround for rthreads calling yield() from _spinlock
+ */
+ tmp = p->p_rrticks;
+ sched_deadline(p);
+ p->p_rrticks = tmp;
+
  p->p_priority = p->p_usrpri;
  p->p_stat = SRUN;
  setrunqueue(p);
@@ -326,6 +255,7 @@ preempt(struct proc *newp)
 
  SCHED_LOCK(s);
  p->p_priority = p->p_usrpri;
+ sched_deadline(p);
  p->p_stat = SRUN;
  p->p_cpu = sched_choosecpu(p);
  setrunqueue(p);
@@ -455,7 +385,7 @@ mi_switch(void)
 }
 
 static __inline void
-resched_proc(struct proc *p, u_char pri)
+resched_proc(struct proc *p)
 {
  struct cpu_info *ci;
 
@@ -473,7 +403,7 @@ resched_proc(struct proc *p, u_char pri)
  * sched state, which we currently do not do.
  */
  ci = (p->p_cpu != NULL) ? p->p_cpu : curcpu();
- if (pri < ci->ci_schedstate.spc_curpriority)
+ if (ci->ci_curproc && p->p_deadline < ci->ci_curproc->p_deadline)
  need_resched(ci);
 }
 
@@ -509,56 +439,8 @@ setrunnable(struct proc *p)
  p->p_stat = SRUN;
  p->p_cpu = sched_choosecpu(p);
  setrunqueue(p);
- if (p->p_slptime > 1)
- updatepri(p);
  p->p_slptime = 0;
- resched_proc(p, p->p_priority);
-}
-
-/*
- * Compute the priority of a process when running in user mode.
- * Arrange to reschedule if the resulting priority is better
- * than that of the current process.
- */
-void
-resetpriority(struct proc *p)
-{
- unsigned int newpriority;
-
- SCHED_ASSERT_LOCKED();
-
- newpriority = PUSER + p->p_estcpu +
-    NICE_WEIGHT * (p->p_p->ps_nice - NZERO);
- newpriority = min(newpriority, MAXPRI);
- p->p_usrpri = newpriority;
- resched_proc(p, p->p_usrpri);
-}
-
-/*
- * We adjust the priority of the current process.  The priority of a process
- * gets worse as it accumulates CPU time.  The cpu usage estimator (p_estcpu)
- * is increased here.  The formula for computing priorities (in kern_synch.c)
- * will compute a different value each time p_estcpu increases. This can
- * cause a switch, but unless the priority crosses a PPQ boundary the actual
- * queue will not change.  The cpu usage estimator ramps up quite quickly
- * when the process is running (linearly), and decays away exponentially, at
- * a rate which is proportionally slower when the system is busy.  The basic
- * principle is that the system will 90% forget that the process used a lot
- * of CPU time in 5 * loadav seconds.  This causes the system to favor
- * processes which haven't run much recently, and to round-robin among other
- * processes.
- */
-void
-schedclock(struct proc *p)
-{
- int s;
-
- SCHED_LOCK(s);
- p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);
- resetpriority(p);
- if (p->p_priority >= PUSER)
- p->p_priority = p->p_usrpri;
- SCHED_UNLOCK(s);
+ resched_proc(p);
 }
 
 void (*cpu_setperf)(int);
Index: sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.217
diff -u -p -r1.217 proc.h
--- sys/proc.h 9 Mar 2016 13:38:50 -0000 1.217
+++ sys/proc.h 15 Mar 2016 13:56:58 -0000
@@ -44,6 +44,7 @@
 #include <sys/selinfo.h> /* For struct selinfo */
 #include <sys/syslimits.h> /* For LOGIN_NAME_MAX */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h> /* For struct timeout */
 #include <sys/event.h> /* For struct klist */
 #include <sys/mutex.h> /* For struct mutex */
@@ -267,6 +268,7 @@ struct process {
 
 struct proc {
  TAILQ_ENTRY(proc) p_runq;
+ RB_ENTRY(proc) p_runq2;
  LIST_ENTRY(proc) p_list; /* List of all threads. */
 
  struct process *p_p; /* The process of this thread. */
@@ -320,6 +322,8 @@ struct proc {
 
 /* The following fields are all copied upon creation in fork. */
 #define p_startcopy p_sigmask
+ u_int64_t p_deadline;
+ int p_rrticks;
  sigset_t p_sigmask; /* Current signal mask. */
 
  u_char p_priority; /* Process priority. */
@@ -488,6 +492,7 @@ void fixjobc(struct process *, struct pg
 int inferior(struct process *, struct process *);
 void leavepgrp(struct process *);
 void preempt(struct proc *);
+void sched_deadline(struct proc *);
 void pgdelete(struct pgrp *);
 void procinit(void);
 void resetpriority(struct proc *);
@@ -570,4 +575,3 @@ struct cpu_info *cpuset_first(struct cpu
 
 #endif /* _KERNEL */
 #endif /* !_SYS_PROC_H_ */
-
Index: sys/sched.h
===================================================================
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.40
diff -u -p -r1.40 sched.h
--- sys/sched.h 9 Mar 2016 13:38:50 -0000 1.40
+++ sys/sched.h 15 Mar 2016 13:56:58 -0000
@@ -70,6 +70,7 @@
 #define _SYS_SCHED_H_
 
 #include <sys/queue.h>
+#include <sys/tree.h>
 
 /*
  * Posix defines a <sched.h> which may want to include <sys/sched.h>
@@ -99,6 +100,7 @@ struct schedstate_percpu {
  u_int spc_schedticks; /* ticks for schedclock() */
  u_int64_t spc_cp_time[CPUSTATES]; /* CPU state statistics */
  u_char spc_curpriority; /* usrpri of curproc */
+ u_int64_t spc_curdeadline;
  int spc_rrticks; /* ticks until roundrobin() */
  int spc_pscnt; /* prof/stat counter */
  int spc_psdiv; /* prof/stat divisor */
@@ -109,6 +111,8 @@ struct schedstate_percpu {
 
  TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
  volatile uint32_t spc_whichqs;
+
+ RB_HEAD(prochead2, proc) spc_rq;
 
 #ifdef notyet
  struct proc *spc_reaper; /* dead proc reaper */


--
Michal Mazurek


Re: New scheduler for OpenBSD

Michal Mazurek-2
In reply to this post by Alexandre Ratchov-2
On 14:57:40, 15.03.16, Alexandre Ratchov wrote:

> On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:
> >
> > p_usrpri and p_priority will go away, so userland utilities like 'ps'
> > will need to be changed.
> >
>
> AFAIU, this would hurt interactive programs (audio, players, games,
> etc).  Currently i/o bound processes wake up with increased
> priority and steal the cpu from cpu-bound processes.  Removing
> p_priority would break this mechanism.

No, with the new scheduler a process gets assigned a 'deadline' - the
value of an arbitrary monotonically increasing timer. If it runs out of
ticks it gets a new deadline. The process with the smallest deadline
gets to run.

If a process goes to sleep, and then after a while wakes up, it will
have a smaller deadline than the processes that were allowed to keep
running (and thus kept getting new deadlines). It will therefore preempt
other tasks.
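
Condensed from the diff, the two decisions look roughly like this (the
function names are only for illustration; the real code lives in
sched_chooseproc() and resched_proc()):

struct proc *
pick_next(struct schedstate_percpu *spc)
{
        /* earliest deadline first: leftmost node of the deadline-sorted tree */
        return (RB_MIN(prochead2, &spc->spc_rq));
}

void
wakeup_preempt(struct cpu_info *ci, struct proc *woken)
{
        /* the sleeper kept its old, smaller deadline, so it beats curproc */
        if (ci->ci_curproc != NULL &&
            woken->p_deadline < ci->ci_curproc->p_deadline)
                need_resched(ci);
}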

--
Michal Mazurek

Reply | Threaded
Open this post in threaded view
|

Re: New scheduler for OpenBSD

Chris Cappuccio
In reply to this post by Michal Mazurek-2
Michal Mazurek [[hidden email]] wrote:

> On 16:28:33, 14.03.16, Martin Pieuchot wrote:
> > > The number of calls to yield() dropped to 4,576.
> >
> > This is really similar to what I observed with Firefox and Chrome.
> >
> > > This is where I get stuck, I don't know how to replace the call to
> > > sched_yield(), or whether it is a good idea to do so. Any advice?
> >
> > I'd assume the problem is not in the _spinlock() implementation itself
> > but rather on the code calling this lock.  Do you know which code is
> > triggering this?
> >
> > See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> > the use of a spinlock created similar symptoms.
>
> I don't know how to fix it, but in the meanwhile here is a workaround so
> I can continue working on the scheduler. In yield():
>
> + tmp = p->p_rrticks;
> + sched_deadline(p);
> + p->p_rrticks = tmp;
>
> So penalise a process calling yield() by giving it the worst deadline,
> i.e. make it go to the end of the run queue.
>
> Other than this, no changes.
>
> Chrome still isn't smooth.
>

On a frankenstein 5.9 kernel with bob beck's latest NOWAIT vfs_biomem buffer
alloc and this, I'm getting very smooth action on chrome, with video, even on
an old x201. I don't remember the last time it could re-open 20 tabs without
constantly pausing most of itself, or the last time youtube videos would
play on this laptop in chrome without random and frequent stuttering.


Re: New scheduler for OpenBSD

Ray Lai-3
In reply to this post by Michal Mazurek-2
With this diff on my X200, YouTube used to be a stuttering mess on both
chrome and firefox, and is now buttery smooth, even at 1080p. Thanks!


Re: New scheduler for OpenBSD

Henrik Friedrichsen
In reply to this post by Michal Mazurek-2
Hey,

On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> Chrome still isn't smooth.
>
> Please test, and let me know if the performance of something else
> degrades.

While Chrome may not be 100% smooth yet, the system is a lot more
interactive. I can now play YouTube videos without stutters while doing
other things.

So for me it's a major improvement, thanks!


Re: New scheduler for OpenBSD

Juan Francisco Cantero Hurtado
In reply to this post by Michal Mazurek-2
On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:

> On 16:28:33, 14.03.16, Martin Pieuchot wrote:
> > > The number of calls to yield() dropped to 4,576.
> >
> > This is really similar to what I observed with Firefox and Chrome.
> >
> > > This is where I get stuck, I don't know how to replace the call to
> > > sched_yield(), or whether it is a good idea to do so. Any advice?
> >
> > I'd assume the problem is not in the _spinlock() implementation itself
> > but rather on the code calling this lock.  Do you know which code is
> > triggering this?
> >
> > See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> > the use of a spinlock created similar symptoms.
>
> I don't know how to fix it, but in the meanwhile here is a workaround so
> I can continue working on the scheduler. In yield():
>
> + tmp = p->p_rrticks;
> + sched_deadline(p);
> + p->p_rrticks = tmp;
>
> So penalise a process calling yield() by giving it the worst deadline,
> i.e. make it go to the end of the run queue.
>
> Other than this, no changes.
>
> Chrome still isn't smooth.
>
> Please test, and let me know if the performance of something else
> degrades.

The performance in the ffmpeg test has increased about 30-40% compared with
the default scheduler.

Please keep working on it.

--
Juan Francisco Cantero Hurtado http://juanfra.info


Re: New scheduler for OpenBSD

Edd Barrett-3
In reply to this post by Henrik Friedrichsen
On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
> On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> > Chrome still isn't smooth.
> >
> > Please test, and let me know if the performance of something else
> > degrades.
>
> While Chrome may not be 100% smooth yet, the system is a lot more
> interactive. I can now play YouTube videos without stutters while doing
> other things.

I can't vouch for the code, but this makes video playback in firefox
usable on my x240t. Before it would stutter beyond belief.

I'll run with this for a while and let you know if anything comes up.

--
Best Regards
Edd Barrett

http://www.theunixzoo.co.uk


Re: New scheduler for OpenBSD

Alexandre Ratchov-2
In reply to this post by Michal Mazurek-2
On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
>
> Please test, and let me know if the performance of something else
> degrades.
>

With your diff firefox consumes about half as much CPU (I watched the
same video with and without the diff).  This suggests firefox spins
somewhere and your diff makes it spin less.

When the audio device is using small blocks it stutters more with your
diff, though; according to device counters the stuttering is caused
by sndiod not getting the CPU fast enough.


Re: New scheduler for OpenBSD

Norman Golisz-3
In reply to this post by Edd Barrett-3
Hi Michal,

On Fri Mar 18 2016 10:03, Edd Barrett wrote:

> On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
> > On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> > > Chrome still isn't smooth.
> > >
> > > Please test, and let me know if the performance of something else
> > > degrades.
> >
> > While Chrome may not be 100% smooth yet, the system is a lot more
> > interactive. I can now play YouTube videos without stutters while doing
> > other things.
>
> I can't vouch for the code, but this makes video playback in firefox
> usable on my x240t. Before it would stutter beyond belief.
>
> I'll run with this for a while and let you know if anything comes up.

I can also confirm this patch makes a HUGE difference in video playback
performance in firefox and minitube on my T400.

This is the first time I can watch videos without stuttering (even in
HD/full screen).

And it seems to improve "GUI responsiveness" in general, too.

Thank you for working on this!


Re: New scheduler for OpenBSD

Bob Beck-3
this is cool .. but

I would be interested in someone comparing server workloads, as
opposed to interactive GUI response, using this.

I wouldn't be surprised if inspiration from BFS produced better
interactive response; my bigger concern is whether this adversely
impacts how we deal with non-interactive workloads.


On Fri, Mar 18, 2016 at 8:56 AM, Norman Golisz <[hidden email]> wrote:

> Hi Michal,
>
> On Fri Mar 18 2016 10:03, Edd Barrett wrote:
>> On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
>> > On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
>> > > Chrome still isn't smooth.
>> > >
>> > > Please test, and let me know if the performance of something else
>> > > degrades.
>> >
>> > While Chrome may not be 100% smooth yet, the system is a lot more
>> > interactive. I can now play YouTube videos without stutters while doing
>> > other things.
>>
>> I can't vouch for the code, but this makes video playback in firefox
>> usable on my x240t. Before it would stutter beyond belief.
>>
>> I'll run with this for a while and let you know if anything comes up.
>
> I can also confirm this patch makes a HUGE difference in video playback
> performance in firefox and minitube on my T400.
>
> This is the first time I can watch videos without stuttering (even in
> HD/full screen).
>
> And it seems to improve "GUI responsiveness" in general, too.
>
> Thank you for working on this!
>


Re: New scheduler for OpenBSD

Michael McConville-3
Bob Beck wrote:
> this is cool .. but
>
> I would be interested in someone comparing server workloads, as
> opposed to interactive GUI response, using this.
>
> I wouldn't be surprised that inspiriation from BFS would produce
> better interactive response, my bigger concern would be does this
> adversely impact how we deal with non-interactive workloads.

Those interested might find the benchmarks/siege port useful.
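
For instance (assuming the package name matches the port), pkg_add
siege and then something like "siege -c 50 -t 2M http://testhost/"
against the same httpd on the old and new kernels would give a rough
transactions-per-second comparison; the host name here is only a
placeholder.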
