[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
Paul van den Bogaard
Paul.VandenBogaard at Sun.COM
Fri Dec 7 08:42:07 PST 2007
In an earlier post (USDT probes from PostgreSQL) I mentioned that the pread syscall did take approx 350 msecs on a /proc/<pid>/map file where a pid is that from a process that is attached to a large shared memory segment. Same goes for fstat, and lstat64.
Since starting a dtrace process on processes that are attached to large shm segs for mulitple processes means all their map-files are read in sequential order. This grows to rather unfriendly "compile" times (to almost an hour) for dtrace when attaching to my PostgreSQL processes.
Initially I thought it would be related to my PostgreSQL processes.
I do not think this anymore.
BTW these tests are done on the "problem machine", a T2000 8 core @1000MHz, 32GB intern.
A
timex ls /proc/pid/map
already shows this amount of time being spend (where pid is the pid of a process executing the image of a self made program that create a 16G shared mem segment)
Using dtrace on this ls process show lstat64 consuming the time I am looking for.
Ok, created the following script to dive deeper (can only hope it does what I think it does :-)
Script is started like
dtrace -qs script.d -c "ls /prob/00000/map"
BEGIN {
printf("Target = %d\n", $target);
self->depth = 0;
started = timestamp;
}
syscall::lstat64:entry
/pid == $target/
{
self->filename = basename(copyinstr(arg0) );
}
syscall::lstat64:entry
/self->filename == "map"/
{
self->flagIt = 1; /* OK the show begins */
self->depth = 0;
}
syscall::lstat64:return
/self->flagIt /
{
self->flagIt = 0; /* And the show stops */
}
/* Already found some things in a previous version. Want to know the caller of a function
that takes time. This is what the self->parent is used for.
*/
fbt:::entry
/self->flagIt/
{
self->depth++;
self->ts[self->depth] = timestamp;
self->parent[self->depth] = probefunc;
}
fbt:::return
/self->depth /
{
@[self->parent[self->depth-1], self->depth, probefunc] = sum(timestamp - self->ts[ self->depth ] );
self->parent[self->depth] = 0;
self->depth--;
}
END {
printf("WALL: %d\n", (timestamp - started) / 1000 );
normalize(@, 1000);
printa("%20s %3d %20s %@20d\n", @);
}
This little magic gives me the following output (stripped many lines of calls that do not consume time)
WALL: 113,6 seconds
lookuppnat 5 lookuppnvp 5,540
lookupnameat 4 lookuppnat 5,589
cstatat_getvp 3 lookupnameat 5,686
cstatat64_32 2 cstatat_getvp 5,782
pr_pagev_fill 8 segspt_shmgetprot 5,961
pr_getprot 7 pr_pagev_fill 6,081
pr_pagev_fill 9 segspt_shmgetprot 39,266
pr_pagev_nextprot 8 pr_pagev_fill 39,646
getwatchprot 9 avl_find 29,973,247
pr_pagev_nextprot 8 getwatchprot 87,968,115
pr_getprot 7 pr_pagev_nextprot 113,528,006
prnsegs 6 pr_getprot 113,537,387
prgetattr 5 prnsegs 113,539,929
fop_getattr 4 prgetattr 113,540,258
cstat64_32 3 fop_getattr 113,540,307
cstatat64_32 2 cstat64_32 113,540,480
1 cstatat64_32 113,547,874
avl_find seems not to call any children, and therefore consumes quite some time (30 secs)
getwatchprot used 88, of which 30 are used by this avl_find.
The source shows (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/proc/prsubr.c):
3449 /*
3450 * Return the original protections for the specified page.
3451 */
3452 static void
3453 getwatchprot(struct as *as, caddr_t addr, uint_t *prot)
3454 {
3455 struct watched_page *pwp;
3456 struct watched_page tpw;
3457
3458 ASSERT(AS_LOCK_HELD(as, &as->a_lock));
3459
3460 tpw.wp_vaddr = (caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK);
3461 if ((pwp = avl_find(&as->a_wpage, &tpw, NULL)) != NULL)
3462 *prot = pwp->wp_oprot;
3463 }
My idea therefore is the AS_LOCK_HELD is responsible for approx 58 second of times spend waiting.
Unfortunately this URL points to nothing valid. Neither does avl_find.
Now reducing my shared memory segment to 1GB (arbitrary size way smaller than 16 GB) the results are:
getwatchprot 9 avl_find 1,950,850
pr_pagev_nextprot 8 getwatchprot 5,737,536
pr_getprot 7 pr_pagev_nextprot 7,412,876
prnsegs 6 pr_getprot 7,419,244
prgetattr 5 prnsegs 7,421,813
fop_getattr 4 prgetattr 7,422,144
cstat64_32 3 fop_getattr 7,422,193
cstatat64_32 2 cstat64_32 7,422,366
1 cstatat64_32 7,430,020
Looks like there is a lineair relationship between segment size and time spend in getwatchprot.
Clues why this takes so long?
BTW did the same test on a V890, 16*1350MHz. Did not take that long, but still quite some time (approx 2 times faster than T2000).
A pmap on this process shows it uses many 256MB pages and of course a couple of 4M, 64K and 8K. shared segment is shmat-ted with SHM_SHARE_MMU flag set.
Now a recompile of this program without the SHM_SHARE_MMU flag does miracles
The results now are:
pr_getprot 7 pr_pagev_nextprot 6,414
prnsegs 6 pr_getprot 8,759
prgetattr 5 prnsegs 11,447
fop_getattr 4 prgetattr 11,776
cstat64_32 3 fop_getattr 11,823
cstatat64_32 2 cstat64_32 11,996
1 cstatat64_32 19,479
pmap-ping this process shows the largest segment is 4M. Could this be due to 256MB pages than?
--
This message posted from opensolaris.org
More information about the dtrace-discuss
mailing list