[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments

Paul van den Bogaard Paul.VandenBogaard at Sun.COM
Fri Dec 7 08:42:07 PST 2007


In an earlier post (USDT probes from PostgreSQL) I mentioned that the pread syscall did take approx 350 msecs on a /proc/<pid>/map file where a pid is that from a process that is attached to a large shared memory segment. Same goes for fstat, and lstat64.

Since starting a dtrace process on processes that are attached to large shm segs  for mulitple processes means all their map-files are read in sequential order. This grows to rather unfriendly "compile" times (to almost an hour) for dtrace when attaching to my PostgreSQL processes.

Initially I thought it would be related to my PostgreSQL processes. 
I do not think this anymore.
BTW these tests are done on the "problem machine", a T2000 8 core @1000MHz, 32GB intern.

A 

timex ls /proc/pid/map 

already shows this amount of time being spend (where pid is the pid of a process executing the image of a self made program that create a 16G shared mem segment)

Using dtrace on this ls process show lstat64 consuming the time I am looking for.

Ok, created the following script to dive deeper (can only hope it does what I think it does :-)

Script is started like

dtrace -qs script.d  -c  "ls /prob/00000/map"


BEGIN {
        printf("Target = %d\n", $target);
        self->depth = 0;
        started = timestamp;
}
 
syscall::lstat64:entry
/pid == $target/
{
        self->filename = basename(copyinstr(arg0) );
}
 
syscall::lstat64:entry
/self->filename == "map"/
{
        self->flagIt = 1;     /* OK the show begins */
        self->depth = 0;
}
 
syscall::lstat64:return
/self->flagIt /
{
        self->flagIt = 0;     /* And the show stops */
}
 
/* Already found some things in a previous version. Want to know the caller of a function
    that takes time. This is what the self->parent is used for.
*/
fbt:::entry
/self->flagIt/
{
        self->depth++;
        self->ts[self->depth] = timestamp;
        self->parent[self->depth] = probefunc;
}
 
fbt:::return
/self->depth /
{
        @[self->parent[self->depth-1], self->depth, probefunc] = sum(timestamp - self->ts[ self->depth ] );
        self->parent[self->depth] = 0;
        self->depth--;
}
 
 
END {
        printf("WALL: %d\n", (timestamp - started) / 1000 );
        normalize(@, 1000);
        printa("%20s %3d %20s %@20d\n", @);
}


This little magic gives me the following output (stripped many lines of calls that do not consume time)
   
WALL: 113,6 seconds

               lookuppnat   5                 lookuppnvp                      5,540
          lookupnameat   4                  lookuppnat                      5,589
           cstatat_getvp   3              lookupnameat                      5,686
            cstatat64_32   2               cstatat_getvp                      5,782
           pr_pagev_fill   8        segspt_shmgetprot                      5,961
               pr_getprot   7                pr_pagev_fill                      6,081
           pr_pagev_fill   9         segspt_shmgetprot                    39,266
   pr_pagev_nextprot   8                pr_pagev_fill                     39,646
           getwatchprot   9                       avl_find               29,973,247
   pr_pagev_nextprot   8                getwatchprot              87,968,115
               pr_getprot   7        pr_pagev_nextprot            113,528,006
                    prnsegs   6                    pr_getprot            113,537,387
                 prgetattr   5                         prnsegs            113,539,929
             fop_getattr   4                       prgetattr            113,540,258
              cstat64_32   3                   fop_getattr            113,540,307
           cstatat64_32   2                    cstat64_32            113,540,480
                                 1                 cstatat64_32            113,547,874

avl_find seems not to call any children, and therefore consumes quite some time (30 secs)
getwatchprot used 88, of which 30 are used by this avl_find.

The source shows (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/proc/prsubr.c):
   3449 /*
   3450  * Return the original protections for the specified page.
   3451  */
   3452 static void
   3453 getwatchprot(struct as *as, caddr_t addr, uint_t *prot)
   3454 {
   3455 	struct watched_page *pwp;
   3456 	struct watched_page tpw;
   3457 
   3458 	ASSERT(AS_LOCK_HELD(as, &as->a_lock));
   3459 
   3460 	tpw.wp_vaddr = (caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK);
   3461 	if ((pwp = avl_find(&as->a_wpage, &tpw, NULL)) != NULL)
   3462 		*prot = pwp->wp_oprot;
   3463 }

My idea therefore is the AS_LOCK_HELD is responsible for approx 58 second of times spend waiting.
Unfortunately this URL points to nothing valid. Neither does avl_find.

Now reducing my shared memory segment to 1GB (arbitrary size  way smaller than 16 GB) the results are:

           getwatchprot   9                    avl_find              1,950,850
   pr_pagev_nextprot   8            getwatchprot              5,737,536
               pr_getprot   7    pr_pagev_nextprot              7,412,876
                    prnsegs   6                pr_getprot              7,419,244
                 prgetattr   5                     prnsegs              7,421,813
             fop_getattr   4                   prgetattr              7,422,144
              cstat64_32   3               fop_getattr              7,422,193
           cstatat64_32   2                cstat64_32              7,422,366
                                 1             cstatat64_32              7,430,020

Looks like there is a lineair relationship between segment size and time spend in getwatchprot.

Clues why this takes so long?

BTW did the same  test on a V890, 16*1350MHz. Did not take that long, but still quite some time (approx 2 times faster than T2000).

A pmap on this process shows it uses many 256MB pages and of course a couple of 4M, 64K and 8K. shared segment is shmat-ted with SHM_SHARE_MMU flag set.

Now a recompile of this program without the SHM_SHARE_MMU flag does miracles

The results now are:

            pr_getprot   7    pr_pagev_nextprot                 6,414
                 prnsegs   6           pr_getprot                 8,759
              prgetattr   5              prnsegs                11,447
          fop_getattr   4            prgetattr                11,776
           cstat64_32   3          fop_getattr                11,823
        cstatat64_32   2           cstat64_32                11,996
                             1         cstatat64_32                19,479

pmap-ping this process shows the largest segment is 4M. Could this be due to 256MB pages than?


--
This message posted from opensolaris.org


More information about the dtrace-discuss mailing list