No event is printed in the "Branch Counter" column on hybrid machines.
For example,
$ perf record -e "{cpu_core/branch-instructions/pp,cpu_core/branches/}:S" -j any,counter
$ perf report --total-cycles
# Branch counter abbr list:
# cpu_core/branch-instructions/pp = A
# cpu_core/branches/ = B
# '-' No event occurs
# '+' Event occurrences may be lost due to branch counter saturated
#
# Sampled Cycles% Sampled Cycles Avg Cycles% Avg Cycles Branch Counter
# ............... .............. ........... .......... ..............
44.54% 727.1K 0.00% 1 |+ |+ |
36.31% 592.7K 0.00% 2 |+ |+ |
17.83% 291.1K 0.00% 1 |+ |+ |
The branch counter information (br_cntr_width and br_cntr_nr) in the
perf_env is retrieved from the CPU_PMU_CAPS. However, the CPU_PMU_CAPS
is not available on hybrid machines. Without the width information, the
number of occurrences of an event cannot be calculated.
For a hybrid machine, the caps information should be retrieved from the
PMU_CAPS, and stored in the perf_env->pmu_caps.
Add a perf_env__find_br_cntr_info() to return the correct branch counter
information from the corresponding fields.
Committer notes:
While testing I couldn't s ee those "Branch counter" columns enabled by
pressing 'B' on the TUI, after reporting it to the list Kan explained
the situation:
<quote Kan Liang>
For a hybrid client, the "Branch Counter" feature is only supported
starting from the just released Lunar Lake. Perf falls back to only
"ANY" on your Raptor Lake.
The "The branch counter is not available" message is expected.
Here is the 'perf evlist' result from my Lunar Lake machine,
# perf evlist -v
cpu_core/branch-instructions/pp: type: 4 (cpu_core), size: 136, config: 0xc4 (branch-instructions), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|READ|PERIOD|BRANCH_STACK|IDENTIFIER, read_format: ID|GROUP|LOST, disabled: 1, freq: 1, enable_on_exec: 1, precise_ip: 2, sample_id_all: 1, exclude_guest: 1, branch_sample_type: ANY|COUNTERS
#
</quote>
Fixes: 6f9d8d1de2 ("perf script: Add branch counters")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240909184201.553519-1-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When annotating a basic block, it's useful to display the occurrences
of other events in the block.
The branch counter feature is only available for newer Intel platforms.
So a dedicated option to display the branch counters is not introduced.
Reuse the existing --total-cycles option, which triggers the annotation
of a basic block and displays the cycle-related annotation.
When the branch counters information is available, the branch counters
are automatically appended after all the cycle-related annotation.
Accounting the branch counters as well when accounting the cycles in
hist__account_cycles().
In 'struct annotated_branch', introduce a br_cntr array to save the
accumulation of each branch counter.
In a sample, all the branch counters for a branch are saved in a u64
space.
Because the saturation of a branch counter is small, e.g., for Intel
Sierra Forest, the saturation is only 3.
Add ANNOTATION__BR_CNTR_SATURATED_FLAG to indicate if a branch counter
once saturated. That can be used to indicate a potential event lost
because of the saturation.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20240813160208.2493643-5-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
In find_data_type(), it creates and deletes a debug info whenver it
tries to find data type for a sample. This is inefficient and it most
likely accesses the same binary again and again.
Let's add a single entry cache the debug info structure for the last DSO.
Depending on sample data, it usually gives me 2~3x (and sometimes more)
speed ups.
Note that this will introduce a little difference in the output due to
the order of checking stack operations. It used to check the stack ops
before checking the availability of debug info but I moved it after the
symbol check. So it'll report stack operations in DSOs without debug
info as unknown. But I think it's ok and better to have the checking
near the caching logic.
Committer testing:
root@x1:~# perf mem record -a sleep 5s
root@x1:~# perf evlist
cpu_atom/mem-loads,ldlat=30/P
cpu_atom/mem-stores/P
dummy:u
root@x1:~# diff -u before after
--- before 2024-08-08 09:33:53.880780784 -0300
+++ after 2024-08-08 09:35:13.917325041 -0300
@@ -81,8 +81,8 @@
# Overhead Data Type
# ........ .........
#
- 55.43% (unknown)
- 11.61% (stack operation)
+ 55.56% (unknown)
+ 11.48% (stack operation)
4.93% struct pcpu_hot
3.26% unsigned int
2.48% struct
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240805234648.1453689-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since the "ins.name" is not set while using raw instruction,
'perf annotate' with insn-stat gives wrong data:
Result from "./perf annotate --data-type --insn-stat":
Annotate Instruction stats
total 615, ok 419 (68.1%), bad 196 (31.9%)
Name : Good Bad
-----------------------------------------------------------
: 419 196
This patch sets "dl->ins.name" in arch specific function
"check_ppc_insn" while initialising "struct disasm_line".
Also update "ins_find" function to pass "struct disasm_line" as a
parameter so as to set its name field in arch specific call.
With the patch changes:
Annotate Instruction stats
total 609, ok 446 (73.2%), bad 163 (26.8%)
Name/opcode : Good Bad
-----------------------------------------------------------
58 : 323 80
32 : 49 43
34 : 33 11
OP_31_XOP_LDX : 8 20
40 : 23 0
OP_31_XOP_LWARX : 5 1
OP_31_XOP_LWZX : 2 3
OP_31_XOP_LDARX : 3 0
33 : 0 2
OP_31_XOP_LBZX : 0 1
OP_31_XOP_LWAX : 0 1
OP_31_XOP_LHZX : 0 1
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Kajol Jain <kjain@linux.ibm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Akanksha J N <akanksha@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://lore.kernel.org/lkml/20240718084358.72242-16-atrajeev@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
For data type profiling, it removed non-instruction lines from the list
of annotation lines. It was to simplify the implementation dealing with
instructions like to calculate the PC-relative address and to search the
shortest path to the target instruction or basic block.
But it means that it removes all the comments and debug information in
the annotate output like source file name and line numbers. To support
both code annotation and data type annotation, it'd be better to keep
the non-instruction lines as well.
So this change is to skip those lines during the data type profiling
and to display them in the normal perf annotate output.
No function changes intended (other than having more lines).
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240405211800.1412920-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The symbol__annotate2() initializes some data structures needed by TUI.
It has a logic to prevent calling it multiple times by checking if it
has the annotated source. But data type profiling uses a different
code (symbol__annotate) to allocate the annotated lines in advance.
So TUI missed to call symbol__annotate2() when it shows the annotation
browser.
Make symbol__annotate() reentrant and handle that situation properly.
This fixes a crash in the annotation browser started by perf report in
TUI like below.
$ perf report -s type,sym --tui
# and press 'a' key and then move down
Fixes: 81e57deec3 ("perf report: Support data type profiling")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240405211800.1412920-2-namhyung@kernel.org
It should pass a proper address (i.e. suitable for objdump or addr2line)
to get_srcline() in order to work correctly. It used to pass an address
with map__rip_2objdump() as the second argument but later it's changed
to use notes->start. It's ok in normal cases but it can be changed when
annotate_opts.full_addr is set. So let's convert the address directly
instead of using the notes->start.
Also the last argument is an IP to print symbol offset if requested. So
it should pass symbol-relative address.
Fixes: 7d18a824b5 ("perf annotate: Toggle full address <-> offset display")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240404175716.1225482-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
In some older distros the build is failing due to
-Werror=maybe-uninitialized, in this case we know that this isn't the
case because 'arch' gets initialized by evsel__get_arch(), so make sure
it is initialized to NULL before returning from evsel__get_arch(), as
suggested by Ian Rogers.
E.g.:
32 17.12 opensuse:15.5 : FAIL gcc version 7.5.0 (SUSE Linux)
util/annotate.c: In function 'hist_entry__get_data_type':
util/annotate.c:2269:15: error: 'arch' may be used uninitialized in this function [-Werror=maybe-uninitialized]
struct arch *arch;
^~~~
cc1: all warnings being treated as errors
43 7.30 ubuntu:18.04-x-powerpc64el : FAIL gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
util/annotate.c: In function 'hist_entry__get_data_type':
util/annotate.c:2351:36: error: 'arch' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (map__dso(ms->map)->kernel && arch__is(arch, "x86") &&
^~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Suggested-by: Ian Rogers <irogers@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/CAP-5=fUqtjxAsmdGrnkjhUTLHs-JvV10TtxyocpYDJK_+LYTiQ@mail.gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The util/annotate.c code has both disassembly and sample annotation
related codes. Factor out the disasm part so that it can be handled
more easily.
No functional changes intended.
Committer notes:
Add missing include env.h, util.h, bpf-event.h and bpf-util.h to
disasm.c, to fix things like:
util/disasm.c: In function ‘symbol__disassemble_bpf’:
util/disasm.c:1203:9: error: implicit declaration of function ‘perf_exe’ [-Werror=implicit-function-declaration]
1203 | perf_exe(tpath, sizeof(tpath));
| ^~~~~~~~
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240329215812.537846-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
On x86, the kernel gets the current task using the current macro like
below:
#define current get_current()
static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(pcpu_hot.current_task);
}
So it returns the current_task field of struct pcpu_hot which is the
first member. On my build, it's located at 0x32940.
$ nm vmlinux | grep pcpu_hot
0000000000032940 D pcpu_hot
And the current macro generates the instructions like below:
mov %gs:0x32940, %rcx
So the %gs segment register points to the beginning of the per-cpu
region of this cpu and it points the variable with a constant.
Let's update the instruction location info to have a segment register
and handle %gs in kernel to look up a global variable. Pretend it as
a global variable by changing the register number to DWARF_REG_PC.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20240319055115.4063940-18-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now symbol histogram uses an array to save per-offset sample counts.
But it wastes a lot of memory if the symbol has a few samples only.
Add a hashmap to save values only for actual samples.
For now, it has duplicate histogram (one in the existing array and
another in the new hash map). Once it can convert to use the hash
in all places, we can get rid of the array later.
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240304230815.1440583-2-namhyung@kernel.org
Global variables are accessed using PC-relative address so it needs to
be handled separately. The PC-rel addressing is detected by using
DWARF_REG_PC. On x86, %rip register would be used.
The address can be calculated using the ip and offset in the
instruction. But it should start from the next instruction so add
calculate_pcrel_addr() to do it properly.
But global variables defined in a different file would only have a
declaration which doesn't include a location list. So it first tries
to get the type info using the address, and then looks up the variable
declarations using name. The name of global variables should be get
from the symbol table. The declaration would have the type info.
So extend find_var_type() to take both address and name for global
variables.
The stat is now looks like:
Annotate data type stats:
total 294, ok 153 (52.0%), bad 141 (48.0%)
-----------------------------------------------------------
30 : no_sym
32 : no_mem_ops
61 : no_var
10 : no_typeinfo
8 : bad_offset
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240117062657.985479-7-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
A typical function prologue and epilogue include multiple stack
operations to save and restore the current value of registers.
On x86, it looks like below:
push r15
push r14
push r13
push r12
...
pop r12
pop r13
pop r14
pop r15
ret
As these all touches the stack memory region, chances are high that they
appear in a memory profile data. But these are not used for any real
purpose yet so it'd return no types.
One of my profile type shows that non neglible portion of data came from
the stack operations. It also seems GCC generates more stack operations
than clang.
Annotate Instruction stats
total 264, ok 169 (64.0%), bad 95 (36.0%)
Name : Good Bad
-----------------------------------------------------------
movq : 49 27
movl : 24 9
popq : 0 19 <-- here
cmpl : 17 2
addq : 14 1
cmpq : 12 2
cmpxchgl : 3 7
Instead of dealing them as unknown, let's create a seperate pseudo type
to represent those stack operations separately.
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240117062657.985479-5-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>