linux

mirror of https://github.com/raspberrypi/linux.git synced 2025-12-09 03:20:05 +00:00

Author	SHA1	Message	Date
Andrea Claudi	5ca2ef674d	ipvs: fix dependency on nf_defrag_ipv6 [ Upstream commit `098e13f5b2` ] ipvs relies on the nf_defrag_ipv6 module to manage IPv6 fragmentation, but lacks proper Kconfig dependencies and does not explicitly request defrag features. As a result, if netfilter hooks are not loaded, when IPv6 fragmented packets are handled by ipvs only the first fragment makes it through. Fix it by properly declaring the dependency in Kconfig and registering netfilter hooks on ip_vs_add_service() and ip_vs_new_dest(). Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:45 +01:00
Francesco Ruggeri	e0e6b0d7e0	netfilter: compat: initialize all fields in xt_init [ Upstream commit `8d29d16d21` ] If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init time, the following panic can be caused by running % ebtables -t broute -F BROUTING from a 32-bit user level on a 64-bit kernel. This patch replaces kmalloc_array with kcalloc when allocating xt. [ 474.680846] BUG: unable to handle kernel paging request at 0000000009600920 [ 474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0 [ 474.693838] Oops: 0000 [#1] SMP [ 474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1 [ 474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013 [ 474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables] [ 474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d [ 474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207 [ 474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249 [ 474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124 [ 474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f [ 474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001 [ 474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8 [ 474.780234] FS: 0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700 [ 474.788612] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0 [ 474.802052] Call Trace: [ 474.804789] compat_do_replace+0x1fb/0x2a3 [ebtables] [ 474.810105] compat_do_ebt_set_ctl+0x69/0xe6 [ebtables] [ 474.815605] ? try_module_get+0x37/0x42 [ 474.819716] compat_nf_setsockopt+0x4f/0x6d [ 474.824172] compat_ip_setsockopt+0x7e/0x8c [ 474.828641] compat_raw_setsockopt+0x16/0x3a [ 474.833220] compat_sock_common_setsockopt+0x1d/0x24 [ 474.838458] __compat_sys_setsockopt+0x17e/0x1b1 [ 474.843343] ? __check_object_size+0x76/0x19a [ 474.847960] __ia32_compat_sys_socketcall+0x1cb/0x25b [ 474.853276] do_fast_syscall_32+0xaf/0xf6 [ 474.857548] entry_SYSENTER_compat+0x6b/0x7a Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:45 +01:00
Taehee Yoo	e6e0001791	netfilter: xt_TEE: add missing code to get interface index in checkentry. [ Upstream commit `18c0ab8736` ] checkentry(tee_tg_check) should initialize priv->oif from dev if possible. But only netdevice notifier handler can set that. Hence priv->oif is always -1 until notifier handler is called. Fixes: `9e2f6c5d78` ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:40 -07:00
Taehee Yoo	02d86085ca	netfilter: xt_TEE: fix wrong interface selection [ Upstream commit `f24d2d4f95` ] The TEE netdevice notifier handler checks only the interface name. However, each netns can have the same interface name, hence another netns's interface could be selected. Test commands: %ip netns add vm1 %iptables -I INPUT -p icmp -j TEE --gateway 192.168.1.1 --oif enp2s0 %ip link set enp2s0 netns vm1 The above rule is in the root netns, but that rule could get the enp2s0 ifindex of vm1 from the notifier handler. After this patch, the TEE rule is added to the per-netns list. Fixes: `9e2f6c5d78` ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:40 -07:00
Martynas Pumputis	5058447bf7	netfilter: nf_nat: skip nat clash resolution for same-origin entries [ Upstream commit `4e35c1cb94` ] It is possible that two concurrent packets originating from the same socket of a connection-less protocol (e.g. UDP) can end up having different IP_CT_DIR_REPLY tuples which results in one of the packets being dropped. To illustrate this, consider the following simplified scenario: 1. Packet A and B are sent at the same time from two different threads by same UDP socket. No matching conntrack entry exists yet. Both packets cause allocation of a new conntrack entry. 2. get_unique_tuple gets called for A. No clashing entry found. conntrack entry for A is added to main conntrack table. 3. get_unique_tuple is called for B and will find that the reply tuple of B is already taken by A. It will allocate a new UDP source port for B to resolve the clash. 4. conntrack entry for B cannot be added to main conntrack table because its ORIGINAL direction is clashing with A and the REPLY directions of A and B are not the same anymore due to UDP source port reallocation done in step 3. This patch modifies nf_conntrack_tuple_taken so it doesn't consider colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal. [ Florian: simplify patch to not use .allow_clash setting and always ignore identical flows ] Signed-off-by: Martynas Pumputis <martynas@weave.works> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:37 -07:00
ZhangXiaoxu	e0b03a6bad	ipvs: Fix signed integer overflow when setsockopt timeout [ Upstream commit `53ab60baa1` ] There is a UBSAN bug report as below: UBSAN: Undefined behaviour in net/netfilter/ipvs/ip_vs_ctl.c:2227:21 signed integer overflow: -2147483647 * 1000 cannot be represented in type 'int' Reproduce program: #include <stdio.h> #include <sys/types.h> #include <sys/socket.h> #define IPPROTO_IP 0 #define IPPROTO_RAW 255 #define IP_VS_BASE_CTL (64+1024+64) #define IP_VS_SO_SET_TIMEOUT (IP_VS_BASE_CTL+10) /* The argument to IP_VS_SO_GET_TIMEOUT */ struct ipvs_timeout_t { int tcp_timeout; int tcp_fin_timeout; int udp_timeout; }; int main() { int ret = -1; int sockfd = -1; struct ipvs_timeout_t to; sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); if (sockfd == -1) { printf("socket init error\n"); return -1; } to.tcp_timeout = -2147483647; to.tcp_fin_timeout = -2147483647; to.udp_timeout = -2147483647; ret = setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_TIMEOUT, (char *)(&to), sizeof(to)); printf("setsockopt return %d\n", ret); return ret; } Return -EINVAL if the timeout value is negative or greater than 'INT_MAX / HZ'. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:27 -07:00
Fernando Fernandez Mancera	0c1054e0e5	netfilter: nfnetlink_osf: add missing fmatch check commit `1a6a0951fc` upstream. When we check the tcp options of a packet and it doesn't match the current fingerprint, the tcp packet option pointer must be restored to its initial value in order to do the proper tcp options check for the next fingerprint. Here we can see an example. Assuming the following fingerprint base with two lines: S10:64:1:60:M,S,T,N,W6: Linux:3.0::Linux 3.0 S20:64:1:60:M,S,T,N,W7: Linux:4.19:arch:Linux 4.1 Where TCP options are the last field in the OS signature, all of them overlap except for the last one, i.e. 'W6' versus 'W7'. In case a packet for Linux 4.19 kicks in, the osf finds no match because the TCP options pointer is updated after checking for the TCP options in the first line. Therefore, reset the pointer back to where it should be. Fixes: `11eeef41d5` ("netfilter: passive OS fingerprint xtables match") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-02-27 10:09:03 +01:00
Pablo Neira Ayuso	a905b82e1e	netfilter: nft_compat: use-after-free when deleting targets commit `753c111f65` upstream. Fetch pointer to module before target object is released. Fixes: `29e3880109` ("netfilter: nf_tables: fix use-after-free when deleting compat expressions") Fixes: `0ca743a559` ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-02-27 10:09:03 +01:00
Pablo Neira Ayuso	1500d94e33	netfilter: nf_tables: fix flush after rule deletion in the same batch commit `23b7ca4f74` upstream. Flush after rule deletion bogusly hits -ENOENT. Skip rules that have already been deleted in nft_delrule_by_chain(), which is always called from the flush path. Fixes: `cf9dc09d09` ("netfilter: nf_tables: fix missing rules flushing per table") Reported-by: Phil Sutter <phil@nwl.cc> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-02-27 10:09:02 +01:00
Henry Yen	73aa8292ca	netfilter: nft_flow_offload: fix checking method of conntrack helper [ Upstream commit `2314e87974` ] This patch uses nfct_help() to detect whether an established connection needs conntrack helper instead of using test_bit(IPS_HELPER_BIT, &ct->status). The reason is that IPS_HELPER_BIT is only set when using explicit CT target. However, in the case that a device enables conntrack helper via command "echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper", the status of IPS_HELPER_BIT will not present any change, and consequently it loses the checking ability in the context. Signed-off-by: Henry Yen <henry.yen@mediatek.com> Reviewed-by: Ryder Lee <ryder.lee@mediatek.com> Tested-by: John Crispin <john@phrozen.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-02-27 10:08:55 +01:00
wenxu	6d26c375a4	netfilter: nft_flow_offload: fix interaction with vrf slave device [ Upstream commit `10f4e76587` ] In the forward chain, the iif is changed from slave device to master vrf device. Thus, flow offload does not find a match on the lower slave device. This patch uses the cached route, ie. dst->dev, to update the iif and oif fields in the flow entry. After this patch, the following example works fine: # ip addr add dev eth0 1.1.1.1/24 # ip addr add dev eth1 10.0.0.1/24 # ip link add user1 type vrf table 1 # ip l set user1 up # ip l set dev eth0 master user1 # ip l set dev eth1 master user1 # nft add table firewall # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; } # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; } # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1 # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1 Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-02-27 10:08:54 +01:00
wenxu	535be4692f	netfilter: nft_flow_offload: Fix reverse route lookup [ Upstream commit `a799aea098` ] Using the following example: client 1.1.1.7 ---> 2.2.2.7 which dnat to 10.0.0.7 server The first reply packet (ie. syn+ack) uses an incorrect destination address for the reverse route lookup since it uses: daddr = ct->tuplehash[!dir].tuple.dst.u3.ip; which is 2.2.2.7 in the scenario that is described above, while this should be: daddr = ct->tuplehash[dir].tuple.src.u3.ip; that is 10.0.0.7. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-02-27 10:08:53 +01:00
Taehee Yoo	95d4f951e7	netfilter: nf_tables: fix leaking object reference count [ Upstream commit `b91d903688` ] There is no code that decreases the reference count of stateful objects in the error path of nft_add_set_elem(). This causes a leak of the reference count of stateful objects. Test commands: $nft add table ip filter $nft add counter ip filter c1 $nft add map ip filter m1 { type ipv4_addr : counter \;} $nft add element ip filter m1 { 1 : c1 } $nft add element ip filter m1 { 1 : c1 } $nft delete element ip filter m1 { 1 } $nft delete counter ip filter c1 Result: Error: Could not process rule: Device or resource busy delete counter ip filter c1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ At the second 'nft add element ip filter m1 { 1 : c1 }', the reference count of 'c1' is increased, then it tries to insert into 'm1'. But 'm1' already has the same element, so it returns -EEXIST without decreasing the reference count of 'c1' in the error path. Due to the leaked reference count, 'c1' can't be removed by 'nft delete counter ip filter c1'. Fixes: `8aeff920dc` ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-02-27 10:08:53 +01:00
Stefano Brivio	ad7013cd6d	netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets [ Upstream commit `8cc4ccf583` ] There doesn't seem to be any reason to restrict MAC address matching to source MAC addresses in set types bitmap:ipmac, hash:ipmac and hash:mac. With this patch, and this setup: ip netns add A ip link add veth1 type veth peer name veth2 netns A ip addr add 192.0.2.1/24 dev veth1 ip -net A addr add 192.0.2.2/24 dev veth2 ip link set veth1 up ip -net A link set veth2 up ip netns exec A ipset create test hash:mac dst=$(ip netns exec A cat /sys/class/net/veth2/address) ip netns exec A ipset add test ${dst} ip netns exec A iptables -P INPUT DROP ip netns exec A iptables -I INPUT -m set --match-set test dst -j ACCEPT ipset will match packets based on destination MAC address: # ping -c1 192.0.2.2 >/dev/null # echo $? 0 Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-26 09:32:33 +01:00
Florian Westphal	6567515e4a	netfilter: nf_conncount: fix argument order to find_next_bit commit `a007232066` upstream. Size and 'next bit' were swapped, this bug could cause worker to reschedule itself even if system was idle. Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:29 +01:00
Pablo Neira Ayuso	b01b92417d	netfilter: nf_conncount: speculative garbage collection on empty lists commit `c80f10bc97` upstream. Instead of removing an empty list node that might be reintroduced soon thereafter, tentatively place the empty list node on the list passed to tree_nodes_free(), then re-check if the list is empty again before erasing it from the tree. [ Florian: rebase on top of pending nf_conncount fixes ] Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:29 +01:00
Pablo Neira Ayuso	aea1d19594	netfilter: nf_conncount: move all list iterations under spinlock commit `2f971a8f42` upstream. Two CPUs may race to remove a connection from the list, the existing conn->dead will result in a use-after-free. Use the per-list spinlock to protect list iterations. As all accesses to the list now happen while holding the per-list lock, we no longer need to delay free operations with rcu. Joint work with Florian. Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:29 +01:00
Florian Westphal	bdc6c893ba	netfilter: nf_conncount: merge lookup and add functions commit `df4a902509` upstream. 'lookup' is always followed by 'add'. Merge both and make the list-walk part of nf_conncount_add(). This also avoids one unneeded unlock/re-lock pair. Extra care needs to be taken in count_tree, as we only hold rcu read lock, i.e. we can only insert to an existing tree node after acquiring its lock and making sure it has a nonzero count. As a zero count should be rare, just fall back to insert_tree() (which acquires tree lock). This issue and its solution were pointed out by Shawn Bohrer during patch review. Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:29 +01:00
Florian Westphal	13c639424b	netfilter: nf_conncount: restart search when nodes have been erased commit `e8cfb372b3` upstream. Shawn Bohrer reported the following crash: |RIP: 0010:rb_erase+0xae/0x360 [..] Call Trace: nf_conncount_destroy+0x59/0xc0 [nf_conncount] cleanup_match+0x45/0x70 [ip_tables] ... Shawn tracked this down to a bogus 'parent' pointer: The problem is that when we insert a new node, there is a chance that the 'parent' that we found was also passed to tree_nodes_free() (because that node was empty) for erase+free. Instead of trying to be clever and detect when this happens, restart the search if we have evicted one or more nodes. To prevent frequent restarts, do not perform gc on the second round. Also, unconditionally schedule the gc worker. The condition gc_count > ARRAY_SIZE(gc_nodes)) cannot be true unless the tree grows very large, as the height of the tree will be low even with hundreds of nodes present. Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reported-by: Shawn Bohrer <sbohrer@cloudflare.com> Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:29 +01:00
Florian Westphal	d6b3ff0222	netfilter: nf_conncount: split gc in two phases commit `f7fcc98dfc` upstream. The lockless workqueue garbage collector can race with packet path garbage collector to delete list nodes, as it calls tree_nodes_free() with the addresses of nodes that might have been free'd already from another cpu. To fix this, split gc into two phases. One phase to perform gc on the connections: From a locking perspective, this is the same as count_tree(): we hold rcu lock, but we do not change the tree, we only change the nodes' contents. The second phase acquires the tree lock and reaps empty nodes. This avoids a race condition of the garbage collection vs. packet path: If a node has been free'd already, the second phase won't find it anymore. This second phase is, from locking perspective, same as insert_tree(). The former only modifies nodes (list content, count), latter modifies the tree itself (rb_erase or rb_insert). Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:28 +01:00
Florian Westphal	ef68fdb517	netfilter: nf_conncount: don't skip eviction when age is negative commit `4cd273bb91` upstream. age is signed integer, so result can be negative when the timestamps have a large delta. In this case we want to discard the entry. Instead of using age >= 2 || age < 0, just make it unsigned. Fixes: `b36e4523d4` ("netfilter: nf_conncount: fix garbage collection confirm race") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:28 +01:00
Shawn Bohrer	c5cbe95a4b	netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS commit `c78e7818f1` upstream. Most of the time these were the same value anyway, but when CONFIG_LOCKDEP was enabled we would use a smaller number of locks to reduce overhead. Unfortunately having two values is confusing and not worth the complexity. This fixes a bug where tree_gc_worker() would only GC up to CONNCOUNT_LOCK_SLOTS trees which meant when CONFIG_LOCKDEP was enabled not all trees would be GCed by tree_gc_worker(). Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-22 21:40:28 +01:00
Taehee Yoo	7fd995d3b4	netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node() [ Upstream commit `d4e7df1656` ] rbnode in insert_tree() is rcu protected pointer. So, in order to handle this pointer, _rcu function should be used. rb_link_node_rcu() is a rcu version of rb_link_node(). Fixes: `34848d5c89` ("netfilter: nf_conncount: Split insert and traversal") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-13 09:50:59 +01:00
Florian Westphal	ba364929ff	netfilter: nat: can't use dst_hold on noref dst [ Upstream commit `542fbda0f0` ] The dst entry might already have a zero refcount, waiting on rcu list to be free'd. Using dst_hold() transitions its reference count to 1, and next dst release will try to free it again -- resulting in a double free: WARNING: CPU: 1 PID: 0 at include/net/dst.h:239 nf_xfrm_me_harder+0xe7/0x130 [nf_nat] RIP: 0010:nf_xfrm_me_harder+0xe7/0x130 [nf_nat] Code: 48 8b 5c 24 60 65 48 33 1c 25 28 00 00 00 75 53 48 83 c4 68 5b 5d 41 5c c3 85 c0 74 0d 8d 48 01 f0 0f b1 0a 74 86 85 c0 75 f3 <0f> 0b e9 7b ff ff ff 29 c6 31 d2 b9 20 00 48 00 4c 89 e7 e8 31 27 Call Trace: nf_nat_ipv4_out+0x78/0x90 [nf_nat_ipv4] nf_hook_slow+0x36/0xd0 ip_output+0x9f/0xd0 ip_forward+0x328/0x440 ip_rcv+0x8a/0xb0 Use dst_hold_safe instead and bail out if we cannot take a reference. Fixes: `a4c2fd7f78` ("net: remove DST_NOCACHE flag") Reported-by: Martin Zaharinov <micron10@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-13 09:50:59 +01:00
Pan Bian	9e40410160	netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel [ Upstream commit `708abf74dd` ] In the error handling block, nla_nest_cancel(skb, atd) is called to cancel the nest operation. But then, ipset_nest_end(skb, atd) is unexpectedly called to end the nest operation. This patch calls ipset_nest_end only on the branch where nla_nest_cancel is not called. Fixes: `45040978c8` ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel") Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-13 09:50:59 +01:00
Florian Westphal	6bcf9ef86c	netfilter: seqadj: re-load tcp header pointer after possible head reallocation [ Upstream commit `530aad7701` ] When adjusting sack block sequence numbers, skb_make_writable() gets called to make sure tcp options are all in the linear area, and buffer is not shared. This can cause tcp header pointer to get reallocated, so we must reload it to avoid memory corruption. This bug pre-dates git history. Reported-by: Neel Mehta <nmehta@google.com> Reported-by: Shane Huntley <shuntley@google.com> Reported-by: Heather Adkins <argv@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-13 09:50:57 +01:00
Taehee Yoo	cee05c0371	netfilter: nf_tables: fix suspicious RCU usage in nft_chain_stats_replace() [ Upstream commit `4c05ec4738` ] basechain->stats is rcu protected data which is updated from nft_chain_stats_replace(). This function is executed from the commit phase which holds the pernet nf_tables commit mutex - not the global nfnetlink subsystem mutex. Test commands to reproduce the problem are: %iptables-nft -I INPUT %iptables-nft -Z %iptables-nft -Z This patch uses RCU calls to handle basechain->stats updates to fix a splat that looks like: [89279.358755] ============================= [89279.363656] WARNING: suspicious RCU usage [89279.368458] 4.20.0-rc2+ #44 Tainted: G W L [89279.374661] ----------------------------- [89279.379542] net/netfilter/nf_tables_api.c:1404 suspicious rcu_dereference_protected() usage! [...] [89279.406556] 1 lock held by iptables-nft/5225: [89279.411728] #0: 00000000bf45a000 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1f/0x70 [nf_tables] [89279.424022] stack backtrace: [89279.429236] CPU: 0 PID: 5225 Comm: iptables-nft Tainted: G W L 4.20.0-rc2+ #44 [89279.430135] Call Trace: [89279.430135] dump_stack+0xc9/0x16b [89279.430135] ? show_regs_print_info+0x5/0x5 [89279.430135] ? lockdep_rcu_suspicious+0x117/0x160 [89279.430135] nft_chain_commit_update+0x4ea/0x640 [nf_tables] [89279.430135] ? sched_clock_local+0xd4/0x140 [89279.430135] ? check_flags.part.35+0x440/0x440 [89279.430135] ? __rhashtable_remove_fast.constprop.67+0xec0/0xec0 [nf_tables] [89279.430135] ? sched_clock_cpu+0x126/0x170 [89279.430135] ? find_held_lock+0x39/0x1c0 [89279.430135] ? hlock_class+0x140/0x140 [89279.430135] ? is_bpf_text_address+0x5/0xf0 [89279.430135] ? check_flags.part.35+0x440/0x440 [89279.430135] ? __lock_is_held+0xb4/0x140 [89279.430135] nf_tables_commit+0x2555/0x39c0 [nf_tables] Fixes: `f102d66b33` ("netfilter: nf_tables: use dedicated mutex to guard transactions") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-01-13 09:50:57 +01:00
Taehee Yoo	e5f42e0617	netfilter: nf_tables: deactivate expressions in rule replacement routine [ Upstream commit `ca08987885` ] There is no expression deactivation call from the rule replacement path, hence, chain counter is not decremented. A few steps to reproduce the problem: %nft add table ip filter %nft add chain ip filter c1 %nft add chain ip filter c2 %nft add rule ip filter c1 jump c2 %nft replace rule ip filter c1 handle 3 accept %nft flush ruleset <jump c2> expression means immediate NFT_JUMP to chain c2. Reference count of chain c2 is increased when the rule is added. When rule is deleted or replaced, the reference counter of c2 should be decreased via nft_rule_expr_deactivate() which calls nft_immediate_deactivate(). Splat looks like: [ 214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Modules linked in: nf_tables nfnetlink [ 214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44 [ 214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8 [ 214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202 [ 214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0 [ 214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78 [ 214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000 [ 214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba [ 214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6 [ 214.398983] FS: 0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000 [ 214.398983] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0 [ 214.398983] Call Trace: [ 214.398983] ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables] [ 214.398983] ? __kasan_slab_free+0x145/0x180 [ 214.398983] ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables] [ 214.398983] ? kfree+0xdb/0x280 [ 214.398983] nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables] [ ... ] Fixes: `bb7b40aecb` ("netfilter: nf_tables: bogus EBUSY in chain deletions") Reported by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505 Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:38 +01:00
Taehee Yoo	8038f92df3	netfilter: nf_conncount: remove wrong condition check routine [ Upstream commit `53ca0f2fec` ] All lists that reach the tree_nodes_free() function have both a zero counter and a true dead flag. The reason for this is that lists to be released are selected by nf_conncount_gc_list(), which already decrements the list counter and sets the dead flag. Therefore, this if statement in tree_nodes_free() is unnecessary and wrong. Fixes: `31568ec09e` ("netfilter: nf_conncount: fix list_del corruption in conn_free") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:37 +01:00
Taehee Yoo	18218f827e	netfilter: add missing error handling code for register functions [ Upstream commit `584eab291c` ] register_{netdevice/inetaddr/inet6addr}_notifier may return an error value, this patch adds the code to handle these error paths. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:36 +01:00
Xin Long	28ad9091e1	ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf [ Upstream commit `2a31e4bd9a` ] ip_vs_dst_event is supposed to clean up all dst used in ipvs' destinations when a net dev is going down. But it works only when the dst's dev is the same as the dev from the event. Now with the same priority but late registration, ip_vs_dst_notifier is always called later than ipv6_dev_notf where the dst's dev is set to lo for NETDEV_DOWN event. As the dst's dev lo is not the same as the dev from the event in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work. Also, as these dst have to wait for dest_trash_timer to clean them up, it would cause some non-permanent kernel warnings: unregister_netdevice: waiting for br0 to become free. Usage count = 3 To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5. Note that for ipv4 route fib_netdev_notifier doesn't set dst's dev to lo in NETDEV_DOWN event, so this fix is only needed when IP_VS_IPV6 is defined. Fixes: `7a4f0761fc` ("IPVS: init and cleanup restructuring") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:35 +01:00
Taehee Yoo	fb0fc90cc2	netfilter: xt_hashlimit: fix a possible memory leak in htable_create() [ Upstream commit `b4e955e9f3` ] In htable_create(), hinfo is allocated by vmalloc(), so if an error occurs, hinfo should be freed. Fixes: `11d5f15723` ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:33 +01:00
Florian Westphal	00bac44c99	netfilter: nf_tables: fix use-after-free when deleting compat expressions [ Upstream commit `29e3880109` ] nft_compat ops do not have static storage duration, unlike all other expressions. When nf_tables_expr_destroy() returns, expr->ops might have been free'd already, so we need to store next address before calling expression destructor. For same reason, we can't deref match pointer after nft_xt_put(). This can be easily reproduced by adding msleep() before nft_match_destroy() returns. Fixes: `0ca743a559` ("netfilter: nf_tables: add compatibility layer for x_tables") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:31 +01:00
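The ordering problem described in the entry above (the address of the next expression must be computed before the destructor runs, because the destructor may free the ops the computation depends on) can be illustrated with a small userspace analogy. This is only a sketch, not the nf_tables code: `struct ops`, `struct expr`, `init_exprs()`, `expr_next()`, `expr_destroy()` and `destroy_range()` are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: ops live on the heap (like nft_compat ops),
 * and the step to the next expression depends on ops->size. */
struct ops {
	size_t size;
};

struct expr {
	struct ops *ops;
};

/* Allocate one heap ops object per expression; returns 0 on success. */
static int init_exprs(struct expr *arr, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		arr[i].ops = malloc(sizeof(*arr[i].ops));
		if (!arr[i].ops) {
			while (i--)
				free(arr[i].ops);
			return -1;
		}
		arr[i].ops->size = sizeof(struct expr);
	}
	return 0;
}

/* Reads e->ops, so it must not be called after expr_destroy(e). */
static struct expr *expr_next(struct expr *e)
{
	assert(e->ops != NULL);
	return (struct expr *)((char *)e + e->ops->size);
}

/* The destructor releases the ops; e->ops is invalid afterwards. */
static void expr_destroy(struct expr *e)
{
	free(e->ops);
	e->ops = NULL;
}

/* Safe ordering: remember the next address first, then destroy. */
static unsigned int destroy_range(struct expr *first, struct expr *last)
{
	struct expr *e = first, *next;
	unsigned int n = 0;

	while (e != last) {
		next = expr_next(e);	/* must happen before the destructor */
		expr_destroy(e);
		e = next;
		n++;
	}
	return n;
}
```

Swapping the two statements in the loop body would make `expr_next()` dereference freed memory, which is the class of bug the commit message describes.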
Taehee Yoo	e947f9aa9a	netfilter: xt_RATEEST: remove netns exit routine [ Upstream commit `0fbcc5b568` ] xt_rateest_net_exit() was added to check whether rules are flushed successfully. but ->net_exit() callback is called earlier than ->destroy() callback. So that ->net_exit() callback can't check that. test commands: %ip netns add vm1 %ip netns exec vm1 iptables -t mangle -I PREROUTING -p udp \ --dport 1111 -j RATEEST --rateest-name ap \ --rateest-interval 250ms --rateest-ewma 0.5s %ip netns del vm1 splat looks like: [ 668.813518] WARNING: CPU: 0 PID: 87 at net/netfilter/xt_RATEEST.c:210 xt_rateest_net_exit+0x210/0x340 [xt_RATEEST] [ 668.813518] Modules linked in: xt_RATEEST xt_tcpudp iptable_mangle bpfilter ip_tables x_tables [ 668.813518] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc7+ #21 [ 668.813518] Workqueue: netns cleanup_net [ 668.813518] RIP: 0010:xt_rateest_net_exit+0x210/0x340 [xt_RATEEST] [ 668.813518] Code: 00 48 8b 85 30 ff ff ff 4c 8b 23 80 38 00 0f 85 24 01 00 00 48 8b 85 30 ff ff ff 4d 85 e4 4c 89 a5 58 ff ff ff c6 00 f8 74 b2 <0f> 0b 48 83 c3 08 4c 39 f3 75 b0 48 b8 00 00 00 00 00 fc ff df 49 [ 668.813518] RSP: 0018:ffff8801156c73f8 EFLAGS: 00010282 [ 668.813518] RAX: ffffed0022ad8e85 RBX: ffff880118928e98 RCX: 5db8012a00000000 [ 668.813518] RDX: ffff8801156c7428 RSI: 00000000cb1d185f RDI: ffff880115663b74 [ 668.813518] RBP: ffff8801156c74d0 R08: ffff8801156633c0 R09: 1ffff100236440be [ 668.813518] R10: 0000000000000001 R11: ffffed002367d852 R12: ffff880115142b08 [ 668.813518] R13: 1ffff10022ad8e81 R14: ffff880118928ea8 R15: dffffc0000000000 [ 668.813518] FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000 [ 668.813518] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 668.813518] CR2: 0000563aa69f4f28 CR3: 0000000105a16000 CR4: 00000000001006f0 [ 668.813518] Call Trace: [ 668.813518] ? unregister_netdevice_many+0xe0/0xe0 [ 668.813518] ? xt_rateest_net_init+0x2c0/0x2c0 [xt_RATEEST] [ 668.813518] ? 
default_device_exit+0x1ca/0x270 [ 668.813518] ? remove_proc_entry+0x1cd/0x390 [ 668.813518] ? dev_change_net_namespace+0xd00/0xd00 [ 668.813518] ? __init_waitqueue_head+0x130/0x130 [ 668.813518] ops_exit_list.isra.10+0x94/0x140 [ 668.813518] cleanup_net+0x45b/0x900 [ 668.813518] ? net_drop_ns+0x110/0x110 [ 668.813518] ? swapgs_restore_regs_and_return_to_usermode+0x3c/0x80 [ 668.813518] ? save_trace+0x300/0x300 [ 668.813518] ? lock_acquire+0x196/0x470 [ 668.813518] ? lock_acquire+0x196/0x470 [ 668.813518] ? process_one_work+0xb60/0x1de0 [ 668.813518] ? _raw_spin_unlock_irq+0x29/0x40 [ 668.813518] ? _raw_spin_unlock_irq+0x29/0x40 [ 668.813518] ? __lock_acquire+0x4500/0x4500 [ 668.813518] ? __lock_is_held+0xb4/0x140 [ 668.813518] process_one_work+0xc13/0x1de0 [ 668.813518] ? pwq_dec_nr_in_flight+0x3c0/0x3c0 [ 668.813518] ? set_load_weight+0x270/0x270 [ ... ] Fixes: `3427b2ab63` ("netfilter: make xt_rateest hash table per net") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:31 +01:00
Florian Westphal	8fe8940ffc	netfilter: nf_tables: don't skip inactive chains during update [ Upstream commit `0fb39bbe43` ] There is no synchronization between packet path and the configuration plane. The packet path uses two arrays with rules, one contains the current (active) generation. The other either contains the last (obsolete) generation or the future one. Consider: cpu1 cpu2 nft_do_chain(c); delete c net->gen++; genbit = !!net->gen; rules = c->rg[genbit]; cpu1 ignores c when updating if c is not active anymore in the new generation. On cpu2, we now use rules from wrong generation, as c->rg[old] contains the rules matching 'c' whereas c->rg[new] was not updated and can even point to rules that have been free'd already, causing a crash. To fix this, make sure that 'current' to the 'next' generation are identical for chains that are going away so that c->rg[new] will just use the matching rules even if genbit was incremented already. Fixes: `0cbc06b3fa` ("netfilter: nf_tables: remove synchronize_rcu in commit phase") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:31 +01:00
Taehee Yoo	4a3b49f0ce	netfilter: nf_conncount: fix unexpected permanent node of list. [ Upstream commit `3c5cdb17c3` ] When list->count is 0, the list is deleted by GC. But list->count never reaches 0, because its initial value is 1 and it is incremented when a node is inserted. So the initial value of list->count should be 0. Originally, GC always found zero-count lists by deleting nodes and decrementing the count. However, a list may be left empty since node insertion may fail, e.g. on an allocation problem. In order to solve this problem, the GC routine also finds zero-count lists without deleting a node. Fixes: `cb2b36f5a9` ("netfilter: nf_conncount: Switch to plain list") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:30 +01:00
Taehee Yoo	ae60f4705f	netfilter: nf_conncount: fix list_del corruption in conn_free [ Upstream commit `31568ec09e` ] nf_conncount_tuple is an element of nft_connlimit and that is deleted by conn_free(). Elements can be deleted by both GC routine and data path functions (nf_conncount_lookup, nf_conncount_add) and they call conn_free() to free elements. But conn_free() only protects lists, not each element. So that list_del corruption could occurred. The conn_free() doesn't check whether element is already deleted. In order to protect elements, dead flag is added. If an element is deleted, dead flag is set. The only conn_free() can delete elements so that both list lock and dead flag are enough to protect it. test commands: %nft add table ip filter %nft add chain ip filter input { type filter hook input priority 0\; } %nft add rule filter input meter test { ip id ct count over 2 } counter splat looks like: [ 1779.495778] list_del corruption, ffff8800b6e12008->prev is LIST_POISON2 (dead000000000200) [ 1779.505453] ------------[ cut here ]------------ [ 1779.506260] kernel BUG at lib/list_debug.c:50! 
[ 1779.515831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 1779.516772] CPU: 0 PID: 33 Comm: kworker/0:2 Not tainted 4.19.0-rc6+ #22 [ 1779.516772] Workqueue: events_power_efficient nft_rhash_gc [nf_tables_set] [ 1779.516772] RIP: 0010:__list_del_entry_valid+0xd8/0x150 [ 1779.516772] Code: 39 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 89 ea 48 c7 c7 00 c3 5b 98 e8 0f dc 40 ff 0f 0b 48 c7 c7 60 c3 5b 98 e8 01 dc 40 ff <0f> 0b 48 c7 c7 c0 c3 5b 98 e8 f3 db 40 ff 0f 0b 48 c7 c7 20 c4 5b [ 1779.516772] RSP: 0018:ffff880119127420 EFLAGS: 00010286 [ 1779.516772] RAX: 000000000000004e RBX: dead000000000200 RCX: 0000000000000000 [ 1779.516772] RDX: 000000000000004e RSI: 0000000000000008 RDI: ffffed0023224e7a [ 1779.516772] RBP: ffff88011934bc10 R08: ffffed002367cea9 R09: ffffed002367cea9 [ 1779.516772] R10: 0000000000000001 R11: ffffed002367cea8 R12: ffff8800b6e12008 [ 1779.516772] R13: ffff8800b6e12010 R14: ffff88011934bc20 R15: ffff8800b6e12008 [ 1779.516772] FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000 [ 1779.516772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1779.516772] CR2: 00007fc876534010 CR3: 000000010da16000 CR4: 00000000001006f0 [ 1779.516772] Call Trace: [ 1779.516772] conn_free+0x9f/0x2b0 [nf_conncount] [ 1779.516772] ? nf_ct_tmpl_alloc+0x2a0/0x2a0 [nf_conntrack] [ 1779.516772] ? nf_conncount_add+0x520/0x520 [nf_conncount] [ 1779.516772] ? do_raw_spin_trylock+0x1a0/0x1a0 [ 1779.516772] ? do_raw_spin_trylock+0x10/0x1a0 [ 1779.516772] find_or_evict+0xe5/0x150 [nf_conncount] [ 1779.516772] nf_conncount_gc_list+0x162/0x360 [nf_conncount] [ 1779.516772] ? nf_conncount_lookup+0xee0/0xee0 [nf_conncount] [ 1779.516772] ? _raw_spin_unlock_irqrestore+0x45/0x50 [ 1779.516772] ? trace_hardirqs_off+0x6b/0x220 [ 1779.516772] ? trace_hardirqs_on_caller+0x220/0x220 [ 1779.516772] nft_rhash_gc+0x16b/0x540 [nf_tables_set] [ ... 
] Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:30 +01:00
Taehee Yoo	08c7e68ab2	netfilter: nf_conncount: use spin_lock_bh instead of spin_lock [ Upstream commit `fd3e71a9f7` ] conn_free() holds lock with spin_lock() and it is called by both nf_conncount_lookup() and nf_conncount_gc_list(). nf_conncount_lookup() is called from bottom-half context and nf_conncount_gc_list() from process context. So that spin_lock() call is not safe. Hence conn_free() should use spin_lock_bh() instead of spin_lock(). test commands: %nft add table ip filter %nft add chain ip filter input { type filter hook input priority 0\; } %nft add rule filter input meter test { ip saddr ct count over 2 } \ counter splat looks like: [ 461.996507] ================================ [ 461.998999] WARNING: inconsistent lock state [ 461.998999] 4.19.0-rc6+ #22 Not tainted [ 461.998999] -------------------------------- [ 461.998999] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 461.998999] kworker/0:2/134 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 461.998999] 00000000a71a559a (&(&list->list_lock)->rlock){+.?.}, at: conn_free+0x69/0x2b0 [nf_conncount] [ 461.998999] {IN-SOFTIRQ-W} state was registered at: [ 461.998999] _raw_spin_lock+0x30/0x70 [ 461.998999] nf_conncount_add+0x28a/0x520 [nf_conncount] [ 461.998999] nft_connlimit_eval+0x401/0x580 [nft_connlimit] [ 461.998999] nft_dynset_eval+0x32b/0x590 [nf_tables] [ 461.998999] nft_do_chain+0x497/0x1430 [nf_tables] [ 461.998999] nft_do_chain_ipv4+0x255/0x330 [nf_tables] [ 461.998999] nf_hook_slow+0xb1/0x160 [ ... ] [ 461.998999] other info that might help us debug this: [ 461.998999] Possible unsafe locking scenario: [ 461.998999] [ 461.998999] CPU0 [ 461.998999] ---- [ 461.998999] lock(&(&list->list_lock)->rlock); [ 461.998999] <Interrupt> [ 461.998999] lock(&(&list->list_lock)->rlock); [ 461.998999] [ 461.998999] * DEADLOCK * [ 461.998999] [ ... 
] Fixes: `5c789e131c` ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-12-17 09:24:30 +01:00
Florian Westphal	1cf11e7ca0	netfilter: nft_compat: ebtables 'nat' table is normal chain type [ Upstream commit `e4844c9c62` ] Unlike ip(6)tables, the ebtables nat table has no special properties. This bug causes 'ebtables -A' to fail when using a target such as 'snat' (ebt_snat target sets ".table = "nat"'). Targets that have no table restrictions work fine. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:03 +01:00
Jozsef Kadlecsik	2f6bf7917f	netfilter: ipset: Fix calling ip_set() macro at dumping [ Upstream commit `8a02bdd50b` ] The ip_set() macro is called when either ip_set_ref_lock held only or no lock/nfnl mutex is held at dumping. Take this into account properly. Also, use Pablo's suggestion to use rcu_dereference_raw(), the ref_netlink protects the set. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:03 +01:00
Taehee Yoo	e8b258ce87	netfilter: xt_IDLETIMER: add sysfs filename checking routine [ Upstream commit `54451f60c8` ] When IDLETIMER rule is added, sysfs file is created under /sys/class/xt_idletimer/timers/ But some label name shouldn't be used. ".", "..", "power", "uevent", "subsystem", etc... So that sysfs filename checking routine is needed. test commands: %iptables -I INPUT -j IDLETIMER --timeout 1 --label "power" splat looks like: [95765.423132] sysfs: cannot create duplicate filename '/devices/virtual/xt_idletimer/timers/power' [95765.433418] CPU: 0 PID: 8446 Comm: iptables Not tainted 4.19.0-rc6+ #20 [95765.449755] Call Trace: [95765.449755] dump_stack+0xc9/0x16b [95765.449755] ? show_regs_print_info+0x5/0x5 [95765.449755] sysfs_warn_dup+0x74/0x90 [95765.449755] sysfs_add_file_mode_ns+0x352/0x500 [95765.449755] sysfs_create_file_ns+0x179/0x270 [95765.449755] ? sysfs_add_file_mode_ns+0x500/0x500 [95765.449755] ? idletimer_tg_checkentry+0x3e5/0xb1b [xt_IDLETIMER] [95765.449755] ? rcu_read_lock_sched_held+0x114/0x130 [95765.449755] ? __kmalloc_track_caller+0x211/0x2b0 [95765.449755] ? memcpy+0x34/0x50 [95765.449755] idletimer_tg_checkentry+0x4e2/0xb1b [xt_IDLETIMER] [ ... ] Fixes: `0902b469bd` ("netfilter: xtables: idletimer target implementation") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:03 +01:00
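A filename check of the kind the entry above calls for might look like the following userspace sketch. It is not the code from the commit: `label_is_valid()` is a hypothetical name, the reserved list contains only the names quoted in the message (the message ends with "etc...", so the real list may be longer), and rejecting empty labels and labels containing '/' is an added assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: returns false for labels that would collide with
 * entries sysfs already creates in the directory. */
static bool label_is_valid(const char *label)
{
	/* Names taken from the commit message; possibly incomplete. */
	static const char *const reserved[] = {
		".", "..", "power", "uevent", "subsystem",
	};
	size_t i;

	/* Assumption: empty labels and path separators are rejected too. */
	if (!label || !*label || strchr(label, '/'))
		return false;

	for (i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++) {
		if (strcmp(label, reserved[i]) == 0)
			return false;
	}
	return true;
}
```

With such a check, the `--label "power"` rule from the test command would be refused before any attempt to create the duplicate sysfs file.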
Andrey Ryabinin	97fdf29f7d	netfilter: ipset: fix ip_set_list allocation failure [ Upstream commit `ed956f3947` ] ip_set_create() and ip_set_net_init() attempt to allocate physically contiguous memory for ip_set_list. If memory is fragmented, the allocations could easily fail: vzctl: page allocation failure: order:7, mode:0xc0d0 Call Trace: dump_stack+0x19/0x1b warn_alloc_failed+0x110/0x180 __alloc_pages_nodemask+0x7bf/0xc60 alloc_pages_current+0x98/0x110 kmalloc_order+0x18/0x40 kmalloc_order_trace+0x26/0xa0 __kmalloc+0x279/0x290 ip_set_net_init+0x4b/0x90 [ip_set] ops_init+0x3b/0xb0 setup_net+0xbb/0x170 copy_net_ns+0xf1/0x1c0 create_new_namespaces+0xf9/0x180 copy_namespaces+0x8e/0xd0 copy_process+0xb61/0x1a00 do_fork+0x91/0x320 Use kvcalloc() to fallback to 0-order allocations if high order page isn't available. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:02 +01:00
Eric Westbrook	cb3e590df4	netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net [ Upstream commit `886503f34d` ] Allow /0 as advertised for hash:net,port,net sets. For "hash:net,port,net", ipset(8) says that "either subnet is permitted to be a /0 should you wish to match port between all destinations." Make that statement true. Before: # ipset create cidrzero hash:net,port,net # ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0 ipset v6.34: The value of the CIDR parameter of the IP address is invalid # ipset create cidrzero6 hash:net,port,net family inet6 # ipset add cidrzero6 ::/0,12345,::/0 ipset v6.34: The value of the CIDR parameter of the IP address is invalid After: # ipset create cidrzero hash:net,port,net # ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0 # ipset test cidrzero 192.168.205.129,12345,172.16.205.129 192.168.205.129,tcp:12345,172.16.205.129 is in set cidrzero. # ipset create cidrzero6 hash:net,port,net family inet6 # ipset add cidrzero6 ::/0,12345,::/0 # ipset test cidrzero6 fe80::1,12345,ff00::1 fe80::1,tcp:12345,ff00::1 is in set cidrzero6. See also: https://bugzilla.kernel.org/show_bug.cgi?id=200897 `df7ff6efb0` Signed-off-by: Eric Westbrook <linux@westbrook.io> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:02 +01:00
Stefano Brivio	c75116e66e	netfilter: ipset: list:set: Decrease refcount synchronously on deletion and replace [ Upstream commit `439cd39ea1` ] Commit `45040978c8` ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel") postponed decreasing set reference counters to the RCU callback. An 'ipset del' command can terminate before the RCU grace period is elapsed, and if sets are listed before then, the reference counter shown in userspace will be wrong: # ipset create h hash:ip; ipset create l list:set; ipset add l # ipset del l h; ipset list h Name: h Type: hash:ip Revision: 4 Header: family inet hashsize 1024 maxelem 65536 Size in memory: 88 References: 1 Number of entries: 0 Members: # sleep 1; ipset list h Name: h Type: hash:ip Revision: 4 Header: family inet hashsize 1024 maxelem 65536 Size in memory: 88 References: 0 Number of entries: 0 Members: Fix this by making the reference count update synchronous again. As a result, when sets are listed, ip_set_name_byindex() might now fetch a set whose reference count is already zero. Instead of relying on the reference count to protect against concurrent set renaming, grab ip_set_ref_lock as reader and copy the name, while holding the same lock in ip_set_rename() as writer instead. Reported-by: Li Shuang <shuali@redhat.com> Fixes: `45040978c8` ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:02 +01:00
Pablo Neira Ayuso	fecf70b135	Revert "netfilter: nft_numgen: add map lookups for numgen random operations" [ Upstream commit `4269fea768` ] Laura found a better way to do this from userspace without requiring kernel infrastructure, revert this. Fixes: `978d8f9055` ("netfilter: nft_numgen: add map lookups for numgen random operations") Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2018-11-27 16:13:02 +01:00
Vasily Khoruzhick	1be1576a13	netfilter: conntrack: fix calculation of next bucket number in early_drop commit `f393808dc6` upstream. If there's no entry to drop in bucket that corresponds to the hash, early_drop() should look for it in other buckets. But since it increments hash instead of bucket number, it actually looks in the same bucket 8 times: hsize is 16k by default (14 bits) and hash is 32-bit value, so reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in most cases. Fix it by increasing bucket number instead of hash and rename _hash to bucket to avoid future confusion. Fixes: `3e86638e9a` ("netfilter: conntrack: consider ct netns in early_drop logic") Cc: <stable@vger.kernel.org> # v4.7+ Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-11-21 09:19:18 +01:00
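The arithmetic behind the entry above can be checked in userspace. The sketch below reimplements the multiply-and-shift used by `reciprocal_scale()` and counts how many distinct buckets eight probes reach, first by incrementing the hash and then by incrementing the bucket number. The helper names `distinct_buckets_by_hash()` and `distinct_buckets_by_bucket()` are hypothetical, and the second helper only approximates the shape of the fix rather than reproducing the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel helper: maps a 32-bit value onto
 * [0, ep_ro) using a multiply and a shift instead of a division. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Probing by incrementing the hash: consecutive hashes usually map to
 * the same bucket, so few distinct buckets are visited. */
static unsigned int distinct_buckets_by_hash(uint32_t hash, uint32_t hsize,
					     unsigned int tries)
{
	unsigned int i, distinct = 0;
	uint32_t prev = 0;

	assert(hsize > 0);
	for (i = 0; i < tries; i++) {
		uint32_t bucket = reciprocal_scale(hash + i, hsize);

		if (i == 0 || bucket != prev)
			distinct++;
		prev = bucket;
	}
	return distinct;
}

/* Probing by incrementing the bucket number: every probe lands in a
 * different bucket as long as tries <= hsize. */
static unsigned int distinct_buckets_by_bucket(uint32_t hash, uint32_t hsize,
					       unsigned int tries)
{
	unsigned int i, distinct = 0;
	uint32_t bucket = 0, prev = 0;

	assert(hsize > 0);
	for (i = 0; i < tries; i++) {
		if (i == 0)
			bucket = reciprocal_scale(hash, hsize);
		else
			bucket = (bucket + 1) % hsize;

		if (i == 0 || bucket != prev)
			distinct++;
		prev = bucket;
	}
	return distinct;
}
```

With hsize = 16384 (2^14), `reciprocal_scale(hash, hsize)` reduces to `hash >> 18`, so the bucket only changes when the hash crosses a multiple of 2^18. That is why hash..hash+7 almost always select the same bucket.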
Paolo Abeni	703acc3265	netfilter: xt_nat: fix DNAT target for shifted portmap ranges [ Upstream commit `cb20f2d2c0` ] The commit `2eb0f624b7` ("netfilter: add NAT support for shifted portmap ranges") did not set the checkentry/destroy callbacks for the newly added DNAT target. As a result, rulesets using only such nat targets are not effective, as the relevant conntrack hooks are not enabled. The above affect also nft_compat rulesets. Fix the issue adding the missing initializers. Fixes: `2eb0f624b7` ("netfilter: add NAT support for shifted portmap ranges") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-11-13 11:08:20 -08:00
Flavio Leitner	40e4f26e6a	netfilter: xt_socket: check sk before checking for netns. Only check for the network namespace if the socket is available. Fixes: `f564650106` ("netfilter: check if the socket netns is correct.") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-09-28 14:47:41 +02:00
Taehee Yoo	a13f814a67	netfilter: nft_set_rbtree: add missing rb_erase() in GC routine The nft_set_gc_batch_check() checks whether gc buffer is full. If gc buffer is full, gc buffer is released by the nft_set_gc_batch_complete() internally. In case of rbtree, the rb_erase() should be called before calling the nft_set_gc_batch_complete(). therefore the rb_erase() should be called before calling the nft_set_gc_batch_check() too. test commands: table ip filter { set set1 { type ipv4_addr; flags interval, timeout; gc-interval 10s; timeout 1s; elements = { 1-2, 3-4, 5-6, ... 10000-10001, } } } %nft -f test.nft splat looks like: [ 430.273885] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 430.282158] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 430.283116] CPU: 1 PID: 190 Comm: kworker/1:2 Tainted: G B 4.18.0+ #7 [ 430.283116] Workqueue: events_power_efficient nft_rbtree_gc [nf_tables_set] [ 430.313559] RIP: 0010:rb_next+0x81/0x130 [ 430.313559] Code: 08 49 bd 00 00 00 00 00 fc ff df 48 bb 00 00 00 00 00 fc ff df 48 85 c0 75 05 eb 58 48 89 d4 [ 430.313559] RSP: 0018:ffff88010cdb7680 EFLAGS: 00010207 [ 430.313559] RAX: 0000000000b84854 RBX: dffffc0000000000 RCX: ffffffff83f01973 [ 430.313559] RDX: 000000000017090c RSI: 0000000000000008 RDI: 0000000000b84864 [ 430.313559] RBP: ffff8801060d4588 R08: fffffbfff09bc349 R09: fffffbfff09bc349 [ 430.313559] R10: 0000000000000001 R11: fffffbfff09bc348 R12: ffff880100f081a8 [ 430.313559] R13: dffffc0000000000 R14: ffff880100ff8688 R15: dffffc0000000000 [ 430.313559] FS: 0000000000000000(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000 [ 430.313559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 430.313559] CR2: 0000000001551008 CR3: 000000005dc16000 CR4: 00000000001006e0 [ 430.313559] Call Trace: [ 430.313559] nft_rbtree_gc+0x112/0x5c0 [nf_tables_set] [ 430.313559] process_one_work+0xc13/0x1ec0 [ 430.313559] ? _raw_spin_unlock_irq+0x29/0x40 [ 430.313559] ? 
pwq_dec_nr_in_flight+0x3c0/0x3c0 [ 430.313559] ? set_load_weight+0x270/0x270 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x40/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x40/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x40/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __switch_to_asm+0x40/0x70 [ 430.313559] ? __switch_to_asm+0x34/0x70 [ 430.313559] ? __schedule+0x6d3/0x1f50 [ 430.313559] ? find_held_lock+0x39/0x1c0 [ 430.313559] ? __sched_text_start+0x8/0x8 [ 430.313559] ? cyc2ns_read_end+0x10/0x10 [ 430.313559] ? save_trace+0x300/0x300 [ 430.313559] ? sched_clock_local+0xd4/0x140 [ 430.313559] ? find_held_lock+0x39/0x1c0 [ 430.313559] ? worker_thread+0x353/0x1120 [ 430.313559] ? worker_thread+0x353/0x1120 [ 430.313559] ? lock_contended+0xe70/0xe70 [ 430.313559] ? __lock_acquire+0x4500/0x4500 [ 430.535635] ? do_raw_spin_unlock+0xa5/0x330 [ 430.535635] ? do_raw_spin_trylock+0x101/0x1a0 [ 430.535635] ? do_raw_spin_lock+0x1f0/0x1f0 [ 430.535635] ? _raw_spin_lock_irq+0x10/0x70 [ 430.535635] worker_thread+0x15d/0x1120 [ ... ] Fixes: `8d8540c4f5` ("netfilter: nft_set_rbtree: add timeout support") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-09-28 14:47:01 +02:00
zhong jiang	346fa83d10	netfilter: conntrack: get rid of double sizeof sizeof(sizeof()) is quite strange and does not seem to be what is wanted here. The issue is detected with the help of Coccinelle. Fixes: `3921584674` ("netfilter: conntrack: remove nlattr_size pointer from l4proto trackers") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-09-20 18:40:32 +02:00
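Why `sizeof(sizeof())` is almost never what is wanted: the inner `sizeof` yields a value of type `size_t`, so the outer one reports the size of `size_t` itself, regardless of the operand. A minimal sketch, with `struct example_attr` as a hypothetical stand-in for whatever structure the original code measured:

```c
#include <stddef.h>

/* Hypothetical payload type used only for this illustration. */
struct example_attr {
	char payload[64];
};

/* The doubled form: measures the size_t produced by the inner sizeof. */
static size_t doubled_sizeof(void)
{
	return sizeof(sizeof(struct example_attr));
}

/* The intended form: measures the structure itself. */
static size_t intended_sizeof(void)
{
	return sizeof(struct example_attr);
}
```

On a typical 64-bit build the doubled form returns 8 while the intended form returns 64, so any length computed from the former is silently too small.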

4612 Commits