Running command perf stat -vv -e cpu_cycles -C0 -- true
displays this warning:
Attempting to add event pmu 'cpum_cf' with 'cpu_cycles,'
that may result in non-fatal errors
Make the PMU cpum_cf selectable and avoid this warning.
While at it also fix this warning for PMUs pai_crypto and pai_ext.
Output before:
# ./perf stat -vv -e cpu_cycles -C0 -- true
Using CPUID IBM,3931,704,A01,3.7,002f
Attempting to add event pmu 'cpum_cf' with 'cpu_cycles,'
that may result in non-fatal errors
After aliases, add event pmu 'cpum_cf' with 'event,'
that may result in non-fatal errors
cpu_cycles -> cpum_cf/event=0/
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 10
size 128
config 0x1001
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 3
cpu_cycles: 0: 290434 2479172 2479172:
cpu_cycles: 290434 2479172 2479172
Performance counter stats for 'CPU(s) 0':
290,434 cpu_cycles
0.002465617 seconds time elapsed
#
Now the warning "Attempting to add event pmu 'cpum_cf' ..."
does not show up anymore.
Output after:
# ./perf stat -vv -e cpu_cycles -C0 -- true
Using CPUID IBM,3931,704,A01,3.7,002f
After aliases, add event pmu 'cpum_cf' with 'event,'
that may result in non-fatal errors
cpu_cycles -> cpum_cf/event=0/
Control descriptor is not initialized
....
Performance counter stats for 'CPU(s) 0':
357,023 cpu_cycles
0.002454995 seconds time elapsed
#
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20230316074946.41110-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Intel Flexible Return and Event Delivery (FRED) adds instructions ERETS
(return to supervisor) and ERETU (return to user). Intel PT instruction
decoder needs to know about these instructions because they are
branch instructions. Similar to IRET instructions, when the decoder
encounters one of these instructions it will match it to a TIP (target
instruction pointer) packet that informs what the branch destination is.
The existing "x86 instruction decoder - new instructions" test can be
used to test the result e.g.
$ perf test -v ins |& grep eret
Decoded ok: f2 0f 01 ca erets
Decoded ok: f3 0f 01 ca eretu
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20230320183517.15099-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This change is intended to be a no-op for most current cases, the
default sort order is the order the events were parsed. Where it
varies is in how groups are handled. Previously an uncore and core
event that are grouped would most often cause the group to be removed:
```
$ perf stat -e '{instructions,uncore_imc_free_running_0/data_total/}' -a sleep 1
WARNING: grouped events cpus do not match, disabling group:
anon group { instructions, uncore_imc_free_running_0/data_total/ }
...
```
However, when wildcards are used the events should be re-sorted and
re-grouped in parse_events__set_leader, but this currently fails for
simple examples:
```
$ perf stat -e '{uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/}' -a sleep 1
Performance counter stats for 'system wide':
<not counted> MiB uncore_imc_free_running/data_read/
<not counted> MiB uncore_imc_free_running/data_write/
1.000996992 seconds time elapsed
```
A futher failure mode, fixed in this patch, is to force topdown events
into a group.
This change moves sorting the evsels in the evlist after parsing. It
requires parsing to set up groups. First the evsels are sorted
respecting the existing groupings and parse order, but also reordering
to ensure evsels of the same PMU and group appear together. So that
software and aux events respect groups, their pmu_name is taken from
the group leader. The sorting is done with list_sort removing a memory
allocation.
After sorting a pass is done to correct the group leaders and for
topdown events ensuring they have a group leader.
This fixes the problems seen before:
```
$ perf stat -e '{uncore_imc_free_running/data_read/,uncore_imc_free_running/data_write/}' -a sleep 1
Performance counter stats for 'system wide':
727.42 MiB uncore_imc_free_running/data_read/
81.84 MiB uncore_imc_free_running/data_write/
1.000948615 seconds time elapsed
```
As well as making groups not fail for cases like:
```
$ perf stat -e '{imc_free_running_0/data_total/,imc_free_running_1/data_total/}' -a sleep 1
Performance counter stats for 'system wide':
256.47 MiB imc_free_running_0/data_total/
256.48 MiB imc_free_running_1/data_total/
1.001165442 seconds time elapsed
```
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Florian Fischer <florian.fischer@muhq.space>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Link: https://lore.kernel.org/r/20230312021543.3060328-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The Retire Latency field is added in the var3_w of the
PERF_SAMPLE_WEIGHT_STRUCT. The Retire Latency reports pipeline stall of
this instruction compared to the previous instruction in cycles. That's
quite useful to display the information with perf mem report.
The p_stage_cyc for Power is also from the var3_w. Union the p_stage_cyc
and retire_lat to share the code.
Implement X86 specific codes to display the X86 specific header.
Add a new sort key retire_lat for the Retire Latency.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20230104201349.1451191-8-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch isn't intended to have any effect on the compiled code. It
just removes one level of indirection: calling the *host* compiler to
build and then run a program that just printf:s the numerical entries of
the syscall-table. In other words, the generated syscalls.c changes
from:
[46] = "ftruncate",
to:
[__NR3264_ftruncate] = "ftruncate",
The latter is as good as the former to the user of perf, and this can be
done directly by the shell-script. The syscalls defined as non-literal
values (like "#define __NR_ftruncate __NR3264_ftruncate") are trivially
resolved at compile-time without namespace-leaking and/or collision for
its sole user, perf/util/syscalltbl.c, that just #includes the generated
file. A future "-mabi=32" support would probably have to handle this
differently, but that is a pre-existing problem not affected by this
simplification.
Calling the *host* compiler only complicates things and accidentally can
get a completely wrong set of files and syscall numbers, see earlier
commits. Note that the script parameter hostcc is now unused.
At the time of this patch, powerpc (the origin, see comments), and also
e.g. x86 has moved on, from filtering "gcc -dM -E" output to reading
separate specific text-file, a table of syscall numbers. IMHO should
arm64 consider adopting this.
Signed-off-by: Hans-Peter Nilsson <hp@axis.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Kim Phillips <kim.phillips@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20201228024159.2BB66203B5@pchp3.se.axis.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in these csets:
ce883a2ba3 ("powerpc/32: fix syscall wrappers with 64-bit arguments")
That doesn't cause any changes in the perf tools.
This table is used in tools perf to allow features as described in the
last update to this file.
This addresses this perf build warning:
Warning: Kernel ABI header at 'tools/perf/arch/powerpc/entry/syscalls/syscall.tbl' differs from latest version at 'arch/powerpc/kernel/syscalls/syscall.tbl'
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/Y6H0C5plZ4V4aiPm@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Remove the LIBTRACEEVENT_DYNAMIC and LIBTRACEFS_DYNAMIC make command
line variables.
If libtraceevent isn't installed or NO_LIBTRACEEVENT=1 is passed to the
build, don't compile in libtraceevent and libtracefs support.
This also disables CONFIG_TRACE that controls "perf trace".
CONFIG_LIBTRACEEVENT is used to control enablement in Build/Makefiles,
HAVE_LIBTRACEEVENT is used in C code.
Without HAVE_LIBTRACEEVENT tracepoints are disabled and as such the
commands kmem, kwork, lock, sched and timechart are removed. The
majority of commands continue to work including "perf test".
Committer notes:
Fixed up a tools/perf/util/Build reject and added:
#include <traceevent/event-parse.h>
to tools/perf/util/scripting-engines/trace-event-perl.c.
Committer testing:
$ rpm -qi libtraceevent-devel
Name : libtraceevent-devel
Version : 1.5.3
Release : 2.fc36
Architecture: x86_64
Install Date: Mon 25 Jul 2022 03:20:19 PM -03
Group : Unspecified
Size : 27728
License : LGPLv2+ and GPLv2+
Signature : RSA/SHA256, Fri 15 Apr 2022 02:11:58 PM -03, Key ID 999f7cbf38ab71f4
Source RPM : libtraceevent-1.5.3-2.fc36.src.rpm
Build Date : Fri 15 Apr 2022 10:57:01 AM -03
Build Host : buildvm-x86-05.iad2.fedoraproject.org
Packager : Fedora Project
Vendor : Fedora Project
URL : https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git/
Bug URL : https://bugz.fedoraproject.org/libtraceevent
Summary : Development headers of libtraceevent
Description :
Development headers of libtraceevent-libs
$
Default build:
$ ldd ~/bin/perf | grep tracee
libtraceevent.so.1 => /lib64/libtraceevent.so.1 (0x00007f1dcaf8f000)
$
# perf trace -e sched:* --max-events 10
0.000 migration/0/17 sched:sched_migrate_task(comm: "", pid: 1603763 (perf), prio: 120, dest_cpu: 1)
0.005 migration/0/17 sched:sched_wake_idle_without_ipi(cpu: 1)
0.011 migration/0/17 sched:sched_switch(prev_comm: "", prev_pid: 17 (migration/0), prev_state: 1, next_comm: "", next_prio: 120)
1.173 :0/0 sched:sched_wakeup(comm: "", pid: 3138 (gnome-terminal-), prio: 120)
1.180 :0/0 sched:sched_switch(prev_comm: "", prev_prio: 120, next_comm: "", next_pid: 3138 (gnome-terminal-), next_prio: 120)
0.156 migration/1/21 sched:sched_migrate_task(comm: "", pid: 1603763 (perf), prio: 120, orig_cpu: 1, dest_cpu: 2)
0.160 migration/1/21 sched:sched_wake_idle_without_ipi(cpu: 2)
0.166 migration/1/21 sched:sched_switch(prev_comm: "", prev_pid: 21 (migration/1), prev_state: 1, next_comm: "", next_prio: 120)
1.183 :0/0 sched:sched_wakeup(comm: "", pid: 1602985 (kworker/u16:0-f), prio: 120, target_cpu: 1)
1.186 :0/0 sched:sched_switch(prev_comm: "", prev_prio: 120, next_comm: "", next_pid: 1602985 (kworker/u16:0-f), next_prio: 120)
#
Had to tweak tools/perf/util/setup.py to make sure the python binding
shared object links with libtraceevent if -DHAVE_LIBTRACEEVENT is
present in CFLAGS.
Building with NO_LIBTRACEEVENT=1 uncovered some more build failures:
- Make building of data-convert-bt.c to CONFIG_LIBTRACEEVENT=y
- perf-$(CONFIG_LIBTRACEEVENT) += scripts/
- bpf_kwork.o needs also to be dependent on CONFIG_LIBTRACEEVENT=y
- The python binding needed some fixups and util/trace-event.c can't be
built and linked with the python binding shared object, so remove it
in tools/perf/util/setup.py and exclude it from the list of
dependencies in the python/perf.so Makefile.perf target.
Building without libtraceevent-devel installed uncovered more build
failures:
- The python binding tools/perf/util/python.c was assuming that
traceevent/parse-events.h was always available, which was the case
when we defaulted to using the in-kernel tools/lib/traceevent/ files,
now we need to enclose it under ifdef HAVE_LIBTRACEEVENT, just like
the other parts of it that deal with tracepoints.
- We have to ifdef the rules in the Build files with
CONFIG_LIBTRACEEVENT=y to build builtin-trace.c and
tools/perf/trace/beauty/ as we only ifdef setting CONFIG_TRACE=y when
setting NO_LIBTRACEEVENT=1 in the make command line, not when we don't
detect libtraceevent-devel installed in the system. Simplification here
to avoid these two ways of disabling builtin-trace.c and not having
CONFIG_TRACE=y when libtraceevent-devel isn't installed is the clean
way.
From Athira:
<quote>
tools/perf/arch/powerpc/util/Build
-perf-y += kvm-stat.o
+perf-$(CONFIG_LIBTRACEEVENT) += kvm-stat.o
</quote>
Then, ditto for arm64 and s390, detected by container cross build tests.
- s/390 uses test__checkevent_tracepoint() that is now only available if
HAVE_LIBTRACEEVENT is defined, enclose the callsite with ifder HAVE_LIBTRACEEVENT.
Also from Athira:
<quote>
With this change, I could successfully compile in these environment:
- Without libtraceevent-devel installed
- With libtraceevent-devel installed
- With “make NO_LIBTRACEEVENT=1”
</quote>
Then, finally rename CONFIG_TRACEEVENT to CONFIG_LIBTRACEEVENT for
consistency with other libraries detected in tools/perf/.
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: bpf@vger.kernel.org
Link: http://lore.kernel.org/lkml/20221205225940.3079667-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When using "sort -nu", arm64 syscalls were lost. That is, the io_setup
syscall (number 0) and all but one (typically ftruncate; 64) of the
syscalls that are defined symbolically (like "#define __NR_ftruncate
__NR3264_ftruncate") at the point where "sort" is applied.
This creation-of-syscalls.c-scheme is, judging from comments,
copy-pasted from powerpc, and worked there because at the time, its
tools/arch/powerpc/include/uapi/asm/unistd.h had *literals*, like
"#define __NR_ftruncate 93".
With sort being numeric and the non-numeric key effectively evaluating
to 0, the sort option "-u" means these "duplicates" are removed.
There's no need to remove syscall lines with duplicate numbers for arm64
because there are none, so let's fix that by just losing the "-u".
Having the table numerically sorted on syscall-number for the rest of
the syscalls looks nice, so keep the "-n".
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Hans-Peter Nilsson <hp@axis.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Kim Phillips <kim.phillips@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20201228023941.E0DE2203B5@pchp3.se.axis.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Some places were including event.h just to get 'struct perf_sample',
move it to a separate place so that we speed up a bit the build.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in these csets:
e237506238 ("powerpc/32: fix syscall wrappers with 64-bit arguments of unaligned register-pairs")
That doesn't cause any changes in the perf tools.
As a reminder, this table is used in tools perf to allow features such as:
[root@five ~]# perf trace -e set_mempolicy_home_node
^C[root@five ~]#
[root@five ~]# perf trace -v -e set_mempolicy_home_node
Using CPUID AuthenticAMD-25-21-0
event qualifier tracepoint filter: (common_pid != 253729 && common_pid != 3585) && (id == 450)
mmap size 528384B
^C[root@five ~]
[root@five ~]# perf trace -v -e set* --max-events 5
Using CPUID AuthenticAMD-25-21-0
event qualifier tracepoint filter: (common_pid != 253734 && common_pid != 3585) && (id == 38 || id == 54 || id == 105 || id == 106 || id == 109 || id == 112 || id == 113 || id == 114 || id == 116 || id == 117 || id == 119 || id == 122 || id == 123 || id == 141 || id == 160 || id == 164 || id == 170 || id == 171 || id == 188 || id == 205 || id == 218 || id == 238 || id == 273 || id == 308 || id == 450)
mmap size 528384B
0.000 ( 0.008 ms): bash/253735 setpgid(pid: 253735 (bash), pgid: 253735 (bash)) = 0
6849.011 ( 0.008 ms): bash/16046 setpgid(pid: 253736 (bash), pgid: 253736 (bash)) = 0
6849.080 ( 0.005 ms): bash/253736 setpgid(pid: 253736 (bash), pgid: 253736 (bash)) = 0
7437.718 ( 0.009 ms): gnome-shell/253737 set_robust_list(head: 0x7f34b527e920, len: 24) = 0
13445.986 ( 0.010 ms): bash/16046 setpgid(pid: 253738 (bash), pgid: 253738 (bash)) = 0
[root@five ~]#
That is the filter expression attached to the raw_syscalls:sys_{enter,exit}
tracepoints.
$ find tools/perf/arch/ -name "syscall*tbl" | xargs grep -w set_mempolicy_home_node
tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node
tools/perf/arch/powerpc/entry/syscalls/syscall.tbl:450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
tools/perf/arch/s390/entry/syscalls/syscall.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
tools/perf/arch/x86/entry/syscalls/syscall_64.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node
$
$ grep -w set_mempolicy_home_node /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c
[450] = "set_mempolicy_home_node",
$
This addresses these perf build warnings:
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
Warning: Kernel ABI header at 'tools/perf/arch/powerpc/entry/syscalls/syscall.tbl' differs from latest version at 'arch/powerpc/kernel/syscalls/syscall.tbl'
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
Warning: Kernel ABI header at 'tools/perf/arch/s390/entry/syscalls/syscall.tbl' differs from latest version at 'arch/s390/kernel/syscalls/syscall.tbl'
diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
Warning: Kernel ABI header at 'tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl' differs from latest version at 'arch/mips/kernel/syscalls/syscall_n64.tbl'
diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/lkml/Y01HN2DGkWz8tC%2FJ@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf tools updates from Arnaldo Carvalho de Melo:
- Add support for AMD on 'perf mem' and 'perf c2c', the kernel
enablement patches went via tip.
Example:
$ sudo perf mem record -- -c 10000
^C[ perf record: Woken up 227 times to write data ]
[ perf record: Captured and wrote 58.760 MB perf.data (836978 samples) ]
$ sudo perf mem report -F mem,sample,snoop
Samples: 836K of event 'ibs_op//', Event count (approx.): 8418762
Memory access Samples Snoop
N/A 700620 N/A
L1 hit 126675 N/A
L2 hit 424 N/A
L3 hit 664 HitM
L3 hit 10 N/A
Local RAM hit 2 N/A
Remote RAM (1 hop) hit 8558 N/A
Remote Cache (1 hop) hit 3 N/A
Remote Cache (1 hop) hit 2 HitM
Remote Cache (2 hops) hit 10 HitM
Remote Cache (2 hops) hit 6 N/A
Uncached hit 4 N/A
$
- "perf lock" improvements:
- Add -E/--entries option to limit the number of entries to
display, say to ask for just the top 5 contended locks.
- Add -q/--quiet option to suppress header and debug messages.
- Add a 'perf test' kernel lock contention entry to test 'perf
lock'.
- "perf lock contention" improvements:
- Ask BPF's bpf_get_stackid() to skip some callchain entries.
The ones closer to the tooling are bpf related and not that
interesting, the ones calling the locking function are the ones
we're interested in, example of a full, unskipped callstack:
- Allow changing the callstack depth and number of entries to skip.
1 10.74 us 10.74 us 10.74 us spinlock __bpf_trace_contention_begin+0xb
0xffffffffc03b5c47 bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
0xffffffffc03b5c47 bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
0xffffffffbb8b8e75 bpf_trace_run2+0x35
0xffffffffbb7eab9b __bpf_trace_contention_begin+0xb
0xffffffffbb7ebe75 queued_spin_lock_slowpath+0x1f5
0xffffffffbc1c26ff _raw_spin_lock+0x1f
0xffffffffbb841015 tick_do_update_jiffies64+0x25
0xffffffffbb8409ee tick_irq_enter+0x9e
- Show full callstack in verbose mode (-v option), sometimes this
is desirable instead of showing just one callstack entry.
- Allow multiple time ranges in 'perf record --delay' to help in
reducing the amount of data collected from hardware tracing (Intel
PT, etc) when there is a rough idea of periods of time where events
of interest take time.
- Add Intel PT to record only decoder debug messages when error
happens.
- Improve layout of Intel PT man page.
- Add new branch types: alignment, data and inst faults and arch
specific ones, such as fiq, debug_halt, debug_exit, debug_inst and
debug_data on arm64.
Kernel enablement went thru the tip tree.
- Fix 'perf probe' error log check in 'perf test' when no debuginfo is
available.
- Fix 'perf stat' aggregation mode logic, it should be looking at the
CPU not at the core number.
- Fix flags parsing in 'perf trace' filters.
- Introduce compact encoding of CPU range encoding on perf.data, to
avoid having a bitmap with all the CPUs.
- Improvements to the 'perf stat' metrics, including adding
"core_wide", and computing "smt" from the CPU topology.
- Add support to the new PERF_FORMAT_LOST perf_event_attr.read_format,
that allows tooling to ask for the precise number of lost samples for
a given event.
- Add 'addr' sort key to see just the address of sampled instructions:
$ perf record -o- true | perf report -i- -s addr
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]
# Samples: 12 of event 'cycles:u'
# Event count (approx.): 252512
#
# Overhead Address
# ........ ..................
42.96% 0x7f96f08443d7
29.55% 0x7f96f0859b50
14.76% 0x7f96f0852e02
8.30% 0x7f96f0855028
4.43% 0xffffffff8de01087
perf annotate: Toggle full address <-> offset display
- Add 'f' hotkey to the 'perf annotate' TUI interface when in
'disassembler output' mode ('o' hotkey) to toggle showing full
virtual address or just the offset.
- Cache DSO build-ids when synthesizing PERF_RECORD_MMAP records for
pre-existing threads, at the start of a 'perf record' session,
speeding up that record startup phase.
- Add a command line option to specify build ids in 'perf inject'.
- Update JSON event files for the Intel alderlake, broadwell,
broadwellde, broadwellx, cascadelakex, haswell, haswellx, icelake,
icelakex, ivybridge, ivytown, jaketown, sandybridge, sapphirerapids,
skylake, skylakex, and tigerlake processors.
- Update vendor JSON event files for the ARM Neoverse V1 and E1
platforms.
- Add a 'perf test' entry for 'perf mem' where a struct has false
sharing and this gets detected in the 'perf mem' output, tested with
Intel, AMD and ARM64 systems.
- Add a 'perf test' entry to test the resolution of java symbols, where
an output like this is expected:
8.18% jshell jitted-50116-29.so [.] Interpreter
0.75% Thread-1 jitted-83602-1670.so [.] jdk.internal.jimage.BasicImageReader.getString(int)
- Add tests for the ARM64 CoreSight hardware tracing feature, with
specially crafted pureloop, memcpy, thread loop and unroll tread that
then gets traced and the output compared with expected output.
Documentation explaining it is also included.
- Add per thread Intel PT 'perf test' entry to check that
PERF_RECORD_TEXT_POKE events are recorded per CPU, resulting in a
mixture of per thread and per CPU events and mmaps, verify that this
gets all recorded correctly.
- Introduce pthread mutex wrappers to allow for building with clang's
-Wthread-safety, i.e. using the "guarded_by" "pt_guarded_by"
"lockable", "exclusive_lock_function", "exclusive_trylock_function",
"exclusive_locks_required", and "no_thread_safety_analysis" compiler
function attributes.
- Fix empty version number when building outside of a git repo.
- Improve feature detection display when multiple versions of a feature
are present, such as for binutils libbfd, that has a mix of possible
ways to detect according to the Linux distribution.
Previously in some cases we had:
Auto-detecting system features
<SNIP>
... libbfd: [ on ]
... libbfd-liberty: [ on ]
... libbfd-liberty-z: [ on ]
<SNIP>
Now for this case we show just the main feature:
Auto-detecting system features
<SNIP>
... libbfd: [ on ]
<SNIP>
- Remove some unused structs, variables, macros, function prototypes
and includes from various places.
* tag 'perf-tools-for-v6.1-1-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (169 commits)
perf script: Add missing fields in usage hint
perf mem: Print "LFB/MAB" for PERF_MEM_LVLNUM_LFB
perf mem/c2c: Avoid printing empty lines for unsupported events
perf mem/c2c: Add load store event mappings for AMD
perf mem/c2c: Set PERF_SAMPLE_WEIGHT for LOAD_STORE events
perf mem: Add support for printing PERF_MEM_LVLNUM_{CXL|IO}
perf amd ibs: Sync arch/x86/include/asm/amd-ibs.h header with the kernel
tools headers UAPI: Sync include/uapi/linux/perf_event.h header with the kernel
perf stat: Fix cpu check to use id.cpu.cpu in aggr_printout()
perf test coresight: Add relevant documentation about ARM64 CoreSight testing
perf test: Add git ignore for tmp and output files of ARM CoreSight tests
perf test coresight: Add unroll thread test shell script
perf test coresight: Add unroll thread test tool
perf test coresight: Add thread loop test shell scripts
perf test coresight: Add thread loop test tool
perf test coresight: Add memcpy thread test shell script
perf test coresight: Add memcpy thread test tool
perf test: Add git ignore for perf data generated by the ARM CoreSight tests
perf test: Add arm64 asm pureloop test shell script
perf test: Add asm pureloop test tool
...