[ Upstream commit 92974a1d00 ]
current 'sample' action doesn't push the mac header of ingress packets if
they are received by a layer 3 tunnel (like gre or sit); but it forgot to
check for gre over ipv6, so the following script:
# tc q a dev $d clsact
# tc f a dev $d ingress protocol ip flower ip_proto icmp action sample \
> group 100 rate 1
# psample -v -g 100
dumps everything, including outer header and mac, when $d is a gre tunnel
over ipv6. Fix this adding a missing label for ARPHRD_IP6GRE devices.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b88dd52c62 ]
Whenever MQ is not used on a multiqueue device, we experience
serious reordering problems. Bisection found the cited
commit.
The issue can be described this way :
- A single qdisc hierarchy is shared by all transmit queues.
(eg : tc qdisc replace dev eth0 root fq_codel)
- When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting
a different transmit queue than the one used to build a packet train,
we stop building the current list and save the 'bad' skb (P1) in a
special queue. (bad_txq)
- When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
skb (P1), it checks if the associated transmit queues is still in frozen
state. If the queue is still blocked (by BQL or NIC tx ring full),
we leave the skb in bad_txq and return NULL.
- dequeue_skb() calls q->dequeue() to get another packet (P2)
The other packet can target the problematic queue (that we found
in frozen state for the bad_txq packet), but another cpu just ran
TX completion and made room in the txq that is now ready to accept
new packets.
- Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
at next round. In practice P2 is the lead of a big packet train
(P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/
To solve this problem, we have to block the dequeue process as long
as the first packet in bad_txq can not be sent. Reordering issues
disappear and no side effects have been seen.
Fixes: a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 981471bd3a ]
The net pointer in struct xt_tgdtor_param is not explicitly
initialized therefore is still NULL when dereferencing it.
So we have to find a way to pass the correct net pointer to
ipt_destroy_target().
The best way I find is just saving the net pointer inside the per
netns struct tcf_idrinfo, which could make this patch smaller.
Fixes: 0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Reported-and-tested-by: itugrok@yahoo.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dbf47a2a09 ]
Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:
- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
s->psample_group, even though we neither setting the pointer to NULL, nor
preventing concurrent readers from accessing the pointer in some way.
Use rcu_swap_protected() instead to safely reset the pointer.
- Old value of s->psample_group is not released or deallocated in any way,
which results resource leak. Use psample_group_put() on non-NULL value
obtained with rcu_swap_protected().
- The function psample_group_put() that released reference to struct
psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
grace period when deallocating it. Extend struct psample_group with rcu
head and use kfree_rcu when freeing it.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7be8ef2cdb ]
Currently init call of all actions (except ipt) init their 'parm'
structure as a direct pointer to nla data in skb. This leads to race
condition when some of the filter actions were initialized successfully
(and were assigned with idr action index that was written directly
into nla data), but then were deleted and retried (due to following
action module missing or classifier-initiated retry), in which case
action init code tries to insert action to idr with index that was
assigned on previous iteration. During retry the index can be reused
by another action that was inserted concurrently, which causes
unintended action sharing between filters.
To fix described race condition, save action idr index to temporary
stack-allocated variable instead on nla data.
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b35475c549 ]
Add get_fill_size() routine used to calculate the action size
when building a batch of events.
Fixes: c7e2b9689 ("sched: introduce vlan action")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 051c7b39be ]
In dequeue_func(), there is an if statement on line 74 to check whether
skb is NULL:
if (skb)
When skb is NULL, it is used on line 77:
prefetch(&skb->end);
Thus, a possible null-pointer dereference may occur.
To fix this bug, skb->end is used when skb is not NULL.
This bug is found by a static analysis tool STCheck written by us.
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3f05e6886a ]
For qdisc's that support TC filters and set TCQ_F_CAN_BYPASS,
notably fq_codel, it makes no sense to let packets bypass the TC
filters we setup in any scenario, otherwise our packets steering
policy could not be enforced.
This can be reproduced easily with the following script:
ip li add dev dummy0 type dummy
ifconfig dummy0 up
tc qd add dev dummy0 root fq_codel
tc filter add dev dummy0 parent 8001: protocol arp basic action mirred egress redirect dev lo
tc filter add dev dummy0 parent 8001: protocol ip basic action mirred egress redirect dev lo
ping -I dummy0 192.168.112.1
Without this patch, packets are sent directly to dummy0 without
hitting any of the filters. With this patch, packets are redirected
to loopback as expected.
This fix is not perfect, it only unsets the flag but does not set it back
because we have to save the information somewhere in the qdisc if we
really want that. Note, both fq_codel and sfq clear this flag in their
->bind_tcf() but this is clearly not sufficient when we don't use any
class ID.
Fixes: 23624935e0 ("net_sched: TCQ_F_CAN_BYPASS generalization")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4097e9d250 ]
Function tcf_action_dump() relies on tc_action->order field when starting
nested nla to send action data to userspace. This approach breaks in
several cases:
- When multiple filters point to same shared action, tc_action->order field
is overwritten each time it is attached to filter. This causes filter
dump to output action with incorrect attribute for all filters that have
the action in different position (different order) from the last set
tc_action->order value.
- When action data is displayed using tc action API (RTM_GETACTION), action
order is overwritten by tca_action_gd() according to its position in
resulting array of nl attributes, which will break filter dump for all
filters attached to that shared action that expect it to have different
order value.
Don't rely on tc_action->order when dumping actions. Set nla according to
action position in resulting array of actions instead.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4976e3c683 ]
The logic in cake_select_tin() was getting a bit hairy, and it turns out we
can simplify it quite a bit. This also allows us to get rid of one of the
two diffserv parsing functions, which has the added benefit that
already-zeroed DSCP fields won't get re-written.
Suggested-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c87b4ecdbe ]
There is not actually any guarantee that the IP headers are valid before we
access the DSCP bits of the packets. Fix this using the same approach taken
in sch_dsmark.
Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b2100cc56f ]
We shouldn't be using skb->protocol directly as that will miss cases with
hardware-accelerated VLAN tags. Use the helper instead to get the right
protocol number.
Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0db6f8befc ]
It returned always NULL, thus it was never possible to get the filter.
Example:
$ ip link add foo type dummy
$ ip link add bar type dummy
$ tc qdisc add dev foo clsact
$ tc filter add dev foo protocol all pref 1 ingress handle 1234 \
matchall action mirred ingress mirror dev bar
Before the patch:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
Error: Specified filter handle not found.
We have an error talking to the kernel
After:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
filter ingress protocol all pref 1 matchall chain 0 handle 0x4d2
not_in_hw
action order 1: mirred (Ingress Mirror to device bar) pipe
index 1 ref 1 bind 1
CC: Yotam Gigi <yotamg@mellanox.com>
CC: Jiri Pirko <jiri@mellanox.com>
Fixes: fd62d9f5c5 ("net/sched: matchall: Fix configuration race")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 064c5d6881 ]
A new mirred action is created by the tcf_mirred_init function. This
contains a list head struct which is inserted into a global list on
successful creation of a new action. However, after a creation, it is
still possible to error out and call the tcf_idr_release function. This,
in turn, calls the act_mirr cleanup function via __tcf_idr_release and
__tcf_action_put. This cleanup function tries to delete the list entry
which is as yet uninitialised, leading to a NULL pointer exception.
Fix this by initialising the list entry on creation of a new action.
Bug report:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 8000000840c73067 P4D 8000000840c73067 PUD 858dcc067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 32 PID: 5636 Comm: handler194 Tainted: G OE 5.0.0+ #186
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.3.6 06/03/2015
RIP: 0010:tcf_mirred_release+0x42/0xa7 [act_mirred]
Code: f0 90 39 c0 e8 52 04 57 c8 48 c7 c7 b8 80 39 c0 e8 94 fa d4 c7 48 8b 93 d0 00 00 00 48 8b 83 d8 00 00 00 48 c7 c7 f0 90 39 c0 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 d0 00
RSP: 0018:ffffac4aa059f688 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff9dcd1b214d00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9dcd1fa165f8 RDI: ffffffffc03990f0
RBP: ffff9dccf9c7af80 R08: 0000000000000a3b R09: 0000000000000000
R10: ffff9dccfa11f420 R11: 0000000000000000 R12: 0000000000000001
R13: ffff9dcd16b433c0 R14: ffff9dcd1b214d80 R15: 0000000000000000
FS: 00007f441bfff700(0000) GS:ffff9dcd1fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000839e64004 CR4: 00000000001606e0
Call Trace:
tcf_action_cleanup+0x59/0xca
__tcf_action_put+0x54/0x6b
__tcf_idr_release.cold.33+0x9/0x12
tcf_mirred_init.cold.20+0x22e/0x3b0 [act_mirred]
tcf_action_init_1+0x3d0/0x4c0
tcf_action_init+0x9c/0x130
tcf_exts_validate+0xab/0xc0
fl_change+0x1ca/0x982 [cls_flower]
tc_new_tfilter+0x647/0x8d0
? load_balance+0x14b/0x9e0
rtnetlink_rcv_msg+0xe3/0x370
? __switch_to_asm+0x40/0x70
? __switch_to_asm+0x34/0x70
? _cond_resched+0x15/0x30
? __kmalloc_node_track_caller+0x1d4/0x2b0
? rtnl_calcit.isra.31+0xf0/0xf0
netlink_rcv_skb+0x49/0x110
netlink_unicast+0x16f/0x210
netlink_sendmsg+0x1df/0x390
sock_sendmsg+0x36/0x40
___sys_sendmsg+0x27b/0x2c0
? futex_wake+0x80/0x140
? do_futex+0x2b9/0xac0
? ep_scan_ready_list.constprop.22+0x1f2/0x210
? ep_poll+0x7a/0x430
__sys_sendmsg+0x47/0x80
do_syscall_64+0x55/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 4e232818bd ("net: sched: act_mirred: remove dependency on rtnl lock")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ecb3dea400 ]
When adding new filter to flower classifier, fl_change() inserts it to
handle_idr before initializing filter extensions and assigning it a mask.
Normally this ordering doesn't matter because all flower classifier ops
callbacks assume rtnl lock protection. However, when filter has an action
that doesn't have its kernel module loaded, rtnl lock is released before
call to request_module(). During this time the filter can be accessed bu
concurrent task before its initialization is completed, which can lead to a
crash.
Example case of NULL pointer dereference in concurrent dump:
Task 1 Task 2
tc_new_tfilter()
fl_change()
idr_alloc_u32(fnew)
fl_set_parms()
tcf_exts_validate()
tcf_action_init()
tcf_action_init_1()
rtnl_unlock()
request_module()
... rtnl_lock()
tc_dump_tfilter()
tcf_chain_dump()
fl_walk()
idr_get_next_ul()
tcf_node_dump()
tcf_fill_node()
fl_dump()
mask = &f->mask->key; <- NULL ptr
rtnl_lock()
Extension initialization and mask assignment don't depend on fnew->handle
that is allocated by idr_alloc_u32(). Move idr allocation code after action
creation and mask assignment in fl_change() to prevent concurrent access
to not fully initialized filter when rtnl lock is released to load action
module.
Fixes: 01683a1469 ("net: sched: refactor flower walk to iterate over idr")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a3df633a3c ]
Metadata pointer is only initialized for action TCA_TUNNEL_KEY_ACT_SET, but
it is unconditionally dereferenced in tunnel_key_init() error handler.
Verify that metadata pointer is not NULL before dereferencing it in
tunnel_key_init error handling code.
Fixes: ee28bb56ac ("net/sched: fix memory leak in act_tunnel_key_init()")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6191da9806 ]
when act_skbedit was converted to use RCU in the data plane, we added an
error path, but we forgot to drop the action refcount in case of failure
during a 'replace' operation:
# tc actions add action skbedit ptype otherhost pass index 100
# tc action show action skbedit
total acts 1
action order 0: skbedit ptype otherhost pass
index 100 ref 1 bind 0
# tc actions replace action skbedit ptype otherhost drop index 100
RTNETLINK answers: Cannot allocate memory
We have an error talking to the kernel
# tc action show action skbedit
total acts 1
action order 0: skbedit ptype otherhost pass
index 100 ref 2 bind 0
Ensure we call tcf_idr_release(), in case 'params_new' allocation failed,
also when the action is being replaced.
Fixes: c749cdda90 ("net/sched: act_skbedit: don't use spinlock in the data path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8f67c90ee9 ]
After commit 4e8ddd7f17 ("net: sched: don't release reference on action
overwrite"), the error path of all actions was converted to drop refcount
also when the action was being overwritten. But we forgot act_ipt_init(),
in case allocation of 'tname' was not successful:
# tc action add action xt -j LOG --log-prefix hello index 100
tablename: mangle hook: NF_IP_POST_ROUTING
target: LOG level warning prefix "hello" index 100
# tc action show action xt
total acts 1
action order 0: tablename: mangle hook: NF_IP_POST_ROUTING
target LOG level warning prefix "hello"
index 100 ref 1 bind 0
# tc action replace action xt -j LOG --log-prefix world index 100
tablename: mangle hook: NF_IP_POST_ROUTING
target: LOG level warning prefix "world" index 100
RTNETLINK answers: Cannot allocate memory
We have an error talking to the kernel
# tc action show action xt
total acts 1
action order 0: tablename: mangle hook: NF_IP_POST_ROUTING
target LOG level warning prefix "hello"
index 100 ref 2 bind 0
Ensure we call tcf_idr_release(), in case 'tname' allocation failed, also
when the action is being replaced.
Fixes: 4e8ddd7f17 ("net: sched: don't release reference on action overwrite")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5845f70638 ]
It can be reproduced by following steps:
1. virtio_net NIC is configured with gso/tso on
2. configure nginx as http server with an index file bigger than 1M bytes
3. use tc netem to produce duplicate packets and delay:
tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
4. continually curl the nginx http server to get index file on client
5. BUG_ON is seen quickly
[10258690.371129] kernel BUG at net/core/skbuff.c:4028!
[10258690.371748] invalid opcode: 0000 [#1] SMP PTI
[10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
[10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
[10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
[10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
[10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
[10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
[10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
[10258690.372094] Call Trace:
[10258690.372094] <IRQ>
[10258690.372094] skb_to_sgvec+0x11/0x40
[10258690.372094] start_xmit+0x38c/0x520 [virtio_net]
[10258690.372094] dev_hard_start_xmit+0x9b/0x200
[10258690.372094] sch_direct_xmit+0xff/0x260
[10258690.372094] __qdisc_run+0x15e/0x4e0
[10258690.372094] net_tx_action+0x137/0x210
[10258690.372094] __do_softirq+0xd6/0x2a9
[10258690.372094] irq_exit+0xde/0xf0
[10258690.372094] smp_apic_timer_interrupt+0x74/0x140
[10258690.372094] apic_timer_interrupt+0xf/0x20
[10258690.372094] </IRQ>
In __skb_to_sgvec(), the skb->len is not equal to the sum of the skb's
linear data size and nonlinear data size, thus BUG_ON triggered.
Because the skb is cloned and a part of nonlinear data is split off.
Duplicate packet is cloned in netem_enqueue() and may be delayed
some time in qdisc. When qdisc len reached the limit and returns
NET_XMIT_DROP, the skb will be retransmit later in write queue.
the skb will be fragmented by tso_fragment(), the limit size
that depends on cwnd and mss decrease, the skb's nonlinear
data will be split off. The length of the skb cloned by netem
will not be updated. When we use virtio_net NIC and invoke skb_to_sgvec(),
the BUG_ON trigger.
To fix it, netem returns NET_XMIT_SUCCESS to upper stack
when it clones a duplicate packet.
Fixes: 35d889d1 ("sch_netem: fix skb leak in netem_enqueue()")
Signed-off-by: Sheng Lan <lansheng@huawei.com>
Reported-by: Qin Ji <jiqin.ji@huawei.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 46b1c18f9d ]
In the series fc8b81a598 ("Merge branch 'lockless-qdisc-series'")
John made the assumption that the data path had no need to read
the qdisc qlen (number of packets in the qdisc).
It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
children.
But pfifo_fast can be used as leaf in class full qdiscs, and existing
logic needs to access the child qlen in an efficient way.
HTB breaks badly, since it uses cl->leaf.q->q.qlen in :
htb_activate() -> WARN_ON()
htb_dequeue_tree() to decide if a class can be htb_deactivated
when it has no more packets.
HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
qdisc_tree_reduce_backlog() also read q.qlen directly.
Using qdisc_qlen_sum() (which iterates over all possible cpus)
in the data path is a non starter.
It seems we have to put back qlen in a central location,
at least for stable kernels.
For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock,
so the existing q.qlen{++|--} are correct.
For 'lockless' qdisc (pfifo_fast so far), we need to use atomic_{inc|dec}()
because the spinlock might be not held (for example from
pfifo_fast_enqueue() and pfifo_fast_dequeue())
This patch adds atomic_qlen (in the same location than qlen)
and renames the following helpers, since we want to express
they can be used without qdisc lock, and that qlen is no longer percpu.
- qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
- qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()
Later (net-next) we might revert this patch by tracking all these
qlen uses and replace them by a more efficient method (not having
to access a precise qlen, but an empty/non_empty status that might
be less expensive to maintain/track).
Another possibility is to have a legacy pfifo_fast version that would
be used when used a a child qdisc, since the parent qdisc needs
a spinlock anyway. But then, future lockless qdiscs would also
have the same problem.
Fixes: 7e66016f2c ("net: sched: helpers to sum qlen and qlen for per cpu logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1db817e75f ]
struct tcindex_filter_result contains two parts:
struct tcf_exts and struct tcf_result.
For the local variable 'cr', its exts part is never used but
initialized without being released properly on success path. So
just completely remove the exts part to fix this leak.
For the local variable 'new_filter_result', it is never properly
released if not used by 'r' on success path.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 033b228e7f ]
When tcindex_destroy() destroys all the filter results in
the perfect hash table, it invokes the walker to delete
each of them. However, results with class==0 are skipped
in either tcindex_walk() or tcindex_delete(), which causes
a memory leak reported by kmemleak.
This patch fixes it by skipping the walker and directly
deleting these filter results so we don't miss any filter
result.
As a result of this change, we have to initialize exts->net
properly in tcindex_alloc_perfect_hash(). For net-next, we
need to consider whether we should initialize ->net in
tcf_exts_init() instead, before that just directly test
CONFIG_NET_CLS_ACT=y.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8015d93ebd ]
tcindex_destroy() invokes tcindex_destroy_element() via
a walker to delete each filter result in its perfect hash
table, and tcindex_destroy_element() calls tcindex_delete()
which schedules tcf RCU works to do the final deletion work.
Unfortunately this races with the RCU callback
__tcindex_destroy(), which could lead to use-after-free as
reported by Adrian.
Fix this by migrating this RCU callback to tcf RCU work too,
as that workqueue is ordered, we will not have use-after-free.
Note, we don't need to hold netns refcnt because we don't call
tcf_exts_destroy() here.
Fixes: 27ce4f05e2 ("net_sched: use tcf_queue_work() in tcindex filter")
Reported-by: Adrian <bugs@abtelecom.ro>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2cddd20147 ]
Recent changes (especially 05cd271fd6 ("cls_flower: Support multiple
masks per priority")) in the fl_flow_mask structure grow it and its
current size e.g. on x86_64 with defconfig is 760 bytes and more than
1024 bytes with some debug options enabled. Prior the mentioned commit
its size was 176 bytes (using defconfig on x86_64).
With regard to this fact it's reasonable to allocate this structure
dynamically in fl_change() to reduce its stack size.
v2:
- use kzalloc() instead of kcalloc()
Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cd0c4e70fc ]
Martin reported a set of filters don't work after changing
from reclassify to continue. Looking into the code, it
looks like skb protocol is not always fetched for each
iteration of the filters. But, as demonstrated by Martin,
TC actions could modify skb->protocol, for example act_vlan,
this means we have to refetch skb protocol in each iteration,
rather than using the one we fetch in the beginning of the loop.
This bug is _not_ introduced by commit 3b3ae88026
("net: sched: consolidate tc_classify{,_compat}"), technically,
if act_vlan is the only action that modifies skb protocol, then
it is commit c7e2b9689e ("sched: introduce vlan action") which
introduced this bug.
Reported-by: Martin Olsson <martin.olsson+netdev@sentorsecurity.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 63c82997f5 ]
TCA_FLOWER_KEY_ENC_OPTS and TCA_FLOWER_KEY_ENC_OPTS_MASK can only
currently contain further nested attributes, which are parsed by
hand, so the policy is never actually used resulting in a W=1
build warning:
net/sched/cls_flower.c:492:1: warning: ‘enc_opts_policy’ defined but not used [-Wunused-const-variable=]
enc_opts_policy[TCA_FLOWER_KEY_ENC_OPTS_MAX + 1] = {
Add the validation anyway to avoid potential bugs when other
attributes are added and to make the attribute structure slightly
more clear. Validation will also set extact to point to bad
attribute on error.
Fixes: 0a6e77784f ("net/sched: allow flower to match tunnel options")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 19ab69107d ]
tcf_idr_check_alloc() can return a negative value, on allocation failures
(-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e72bde6b66 upstream.
Marco reported an error with hfsc:
root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
Error: Attribute failed policy validation.
Apparently a few implementations pass TCA_OPTIONS as a binary instead
of nested attribute, so drop TCA_OPTIONS from the policy.
Fixes: 8b4c3cdd9d ("net: sched: Add policy validation for tc attributes")
Reported-by: Marco Berizzi <pupilla@libero.it>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 38b4f18d56 ]
gred_change_table_def() takes a pointer to TCA_GRED_DPS attribute,
and expects it will be able to interpret its contents as
struct tc_gred_sopt. Pass the correct gred attribute, instead of
TCA_OPTIONS.
This bug meant the table definition could never be changed after
Qdisc was initialized (unless whatever TCA_OPTIONS contained both
passed netlink validation and was a valid struct tc_gred_sopt...).
Old behaviour:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
RTNETLINK answers: Invalid argument
Now:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
Fixes: f62d6b936d ("[PKT_SCHED]: GRED: Use central VQ change procedure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When dumping classes by parent, kernel would return classes twice:
| # tc qdisc add dev lo root prio
| # tc class show dev lo
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| # tc class show dev lo parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
This comes from qdisc_match_from_root() potentially returning the root
qdisc itself if its handle matched. Though in that case, root's classes
were already dumped a few lines above.
Fixes: cb395b2010 ("net: sched: optimize class dumps")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to what has been done in 8b4c3cdd9d ("net: sched: Add policy
validation for tc attributes"), fix classifier code to add validation of
TCA_CHAIN and TCA_KIND netlink attributes.
tested with:
# ./tdc.py -c filter
v2: Let sch_api and cls_api share nla_policy they have in common, thanks
to David Ahern.
v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
by TC modules, thanks to Cong Wang.
While at it, restore the 'Delete / get qdisc' comment to its orginal
position, just above tc_get_qdisc() function prototype.
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David writes:
"Networking
1) RXRPC receive path fixes from David Howells.
2) Re-export __skb_recv_udp(), from Jiri Kosina.
3) Fix refcounting in u32 classificer, from Al Viro.
4) Userspace netlink ABI fixes from Eugene Syromiatnikov.
5) Don't double iounmap on rmmod in ena driver, from Arthur
Kiyanovski.
6) Fix devlink string attribute handling, we must pull a copy into a
kernel buffer if the lifetime extends past the netlink request.
From Moshe Shemesh.
7) Fix hangs in RDS, from Ka-Cheong Poon.
8) Fix recursive locking lockdep warnings in tipc, from Ying Xue.
9) Clear RX irq correctly in socionext, from Ilias Apalodimas.
10) bcm_sf2 fixes from Florian Fainelli."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: dsa: bcm_sf2: Call setup during switch resume
net: dsa: bcm_sf2: Fix unbind ordering
net: phy: sfp: remove sfp_mutex's definition
r8169: set RX_MULTI_EN bit in RxConfig for 8168F-family chips
net: socionext: clear rx irq correctly
net/mlx4_core: Fix warnings during boot on driverinit param set failures
tipc: eliminate possible recursive locking detected by LOCKDEP
selftests: udpgso_bench.sh explicitly requires bash
selftests: rtnetlink.sh explicitly requires bash.
qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface
tipc: queue socket protocol error messages into socket receive buffer
tipc: set link tolerance correctly in broadcast link
net: ipv4: don't let PMTU updates increase route MTU
net: ipv4: update fnhe_pmtu when first hop's MTU changes
net/ipv6: stop leaking percpu memory in fib6 info
rds: RDS (tcp) hangs on sendto() to unresponding address
net: make skb_partial_csum_set() more robust against overflows
devlink: Add helper function for safely copy string param
devlink: Fix param cmode driverinit for string type
devlink: Fix param set handling for string type
...
Kees writes:
"Fix open-coded multiplication arguments to allocators
- Fixes several new open-coded multiplications added in the 4.19
merge window."
* tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Replace more open-coded allocation size multiplications
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
via ->hlist and via ->tp_root together. u32_destroy() drops the former
and, in case when there had been links, leaves the sucker on the list.
As the result, there's nothing to protect it from getting freed once links
are dropped.
That also makes the "is it busy" check incapable of catching the root
hnode - it *is* busy (there's a reference from tp), but we don't see it as
something separate. "Is it our root?" check partially covers that, but
the problem exists for others' roots as well.
AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
* count tp->root and tp_c->hlist as separate references. I.e.
have u32_init() set refcount to 2, not 1.
* in u32_destroy() we always drop the former;
in u32_destroy_hnode() - the latter.
That way we have *all* references contributing to refcount. List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
an in u32_destroy() in case of tc_u_common going away, along with
everything reachable from it. IOW, that way we know that
u32_destroy_key() won't free something still on the list (or pointed to by
someone's ->root).
Reproducer:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
plus 0 eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 200
tc filter change dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 100
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As done treewide earlier, this catches several more open-coded
allocation size calculations that were added to the kernel during the
merge window. This performs the following mechanical transformations
using Coccinelle:
kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)
Signed-off-by: Kees Cook <keescook@chromium.org>
A number of TC attributes are processed without proper validation
(e.g., length checks). Add a tca policy for all input attributes and use
when invoking nlmsg_parse.
The 2 Fixes tags below cover the latest additions. The other attributes
are a string (KIND), nested attribute (OPTIONS which does seem to have
validation in most cases), for dumps only or a flag.
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Fixes: d47a6b0e7c ("net: sched: introduce ingress/egress block index attributes for qdisc")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we
return -EINVAL. But we don't check whether it's smaller than
sizeof(struct xt_entry_target) and that could lead to an out of bounds
read.
Fixes: 7ba699c604 ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we delete a chain of filters, we need to notify
user-space we are deleting each filters in this chain
too.
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>