Commit Graph

4612 Commits

Author SHA1 Message Date
Andrea Claudi
5ca2ef674d ipvs: fix dependency on nf_defrag_ipv6
[ Upstream commit 098e13f5b2 ]

ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
but lacks proper Kconfig dependencies and does not explicitly
request defrag features.

As a result, if netfilter hooks are not loaded, when IPv6 fragmented
packet are handled by ipvs only the first fragment makes through.

Fix it properly declaring the dependency on Kconfig and registering
netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().

Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:45 +01:00
Francesco Ruggeri
e0e6b0d7e0 netfilter: compat: initialize all fields in xt_init
[ Upstream commit 8d29d16d21 ]

If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
time, the following panic can be caused by running

% ebtables -t broute -F BROUTING

from a 32-bit user level on a 64-bit kernel. This patch replaces
kmalloc_array with kcalloc when allocating xt.

[  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
[  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
[  474.693838] Oops: 0000 [#1] SMP
[  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
[  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
[  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
[  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
[  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
[  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
[  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
[  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
[  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
[  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
[  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
[  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
[  474.802052] Call Trace:
[  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
[  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
[  474.815605]  ? try_module_get+0x37/0x42
[  474.819716]  compat_nf_setsockopt+0x4f/0x6d
[  474.824172]  compat_ip_setsockopt+0x7e/0x8c
[  474.828641]  compat_raw_setsockopt+0x16/0x3a
[  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
[  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
[  474.843343]  ? __check_object_size+0x76/0x19a
[  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
[  474.853276]  do_fast_syscall_32+0xaf/0xf6
[  474.857548]  entry_SYSENTER_compat+0x6b/0x7a

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:45 +01:00
Taehee Yoo
e6e0001791 netfilter: xt_TEE: add missing code to get interface index in checkentry.
[ Upstream commit 18c0ab8736 ]

checkentry(tee_tg_check) should initialize priv->oif from dev if possible.
But only netdevice notifier handler can set that.
Hence priv->oif is always -1 until notifier handler is called.

Fixes: 9e2f6c5d78 ("netfilter: Rework xt_TEE netdevice notifier")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:40 -07:00
Taehee Yoo
02d86085ca netfilter: xt_TEE: fix wrong interface selection
[ Upstream commit f24d2d4f95 ]

TEE netdevice notifier handler checks only interface name. however
each netns can have same interface name. hence other netns's interface
could be selected.

test commands:
   %ip netns add vm1
   %iptables -I INPUT -p icmp -j TEE --gateway 192.168.1.1 --oif enp2s0
   %ip link set enp2s0 netns vm1

Above rule is in the root netns. but that rule could get enp2s0
ifindex of vm1 by notifier handler.

After this patch, TEE rule is added to the per-netns list.

Fixes: 9e2f6c5d78 ("netfilter: Rework xt_TEE netdevice notifier")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:40 -07:00
Martynas Pumputis
5058447bf7 netfilter: nf_nat: skip nat clash resolution for same-origin entries
[ Upstream commit 4e35c1cb94 ]

It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.

To illustrate this, consider the following simplified scenario:

1. Packet A and B are sent at the same time from two different threads
   by same UDP socket.  No matching conntrack entry exists yet.
   Both packets cause allocation of a new conntrack entry.
2. get_unique_tuple gets called for A.  No clashing entry found.
   conntrack entry for A is added to main conntrack table.
3. get_unique_tuple is called for B and will find that the reply
   tuple of B is already taken by A.
   It will allocate a new UDP source port for B to resolve the clash.
4. conntrack entry for B cannot be added to main conntrack table
   because its ORIGINAL direction is clashing with A and the REPLY
   directions of A and B are not the same anymore due to UDP source
   port reallocation done in step 3.

This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.

[ Florian: simplify patch to not use .allow_clash setting
  and always ignore identical flows ]

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:37 -07:00
ZhangXiaoxu
e0b03a6bad ipvs: Fix signed integer overflow when setsockopt timeout
[ Upstream commit 53ab60baa1 ]

There is a UBSAN bug report as below:
UBSAN: Undefined behaviour in net/netfilter/ipvs/ip_vs_ctl.c:2227:21
signed integer overflow:
-2147483647 * 1000 cannot be represented in type 'int'

Reproduce program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/socket.h>

	#define IPPROTO_IP 0
	#define IPPROTO_RAW 255

	#define IP_VS_BASE_CTL		(64+1024+64)
	#define IP_VS_SO_SET_TIMEOUT	(IP_VS_BASE_CTL+10)

	/* The argument to IP_VS_SO_GET_TIMEOUT */
	struct ipvs_timeout_t {
		int tcp_timeout;
		int tcp_fin_timeout;
		int udp_timeout;
	};

	int main() {
		int ret = -1;
		int sockfd = -1;
		struct ipvs_timeout_t to;

		sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
		if (sockfd == -1) {
			printf("socket init error\n");
			return -1;
		}

		to.tcp_timeout = -2147483647;
		to.tcp_fin_timeout = -2147483647;
		to.udp_timeout = -2147483647;

		ret = setsockopt(sockfd,
				 IPPROTO_IP,
				 IP_VS_SO_SET_TIMEOUT,
				 (char *)(&to),
				 sizeof(to));

		printf("setsockopt return %d\n", ret);
		return ret;
	}

Return -EINVAL if the timeout value is negative or max than 'INT_MAX / HZ'.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:27 -07:00
Fernando Fernandez Mancera
0c1054e0e5 netfilter: nfnetlink_osf: add missing fmatch check
commit 1a6a0951fc upstream.

When we check the tcp options of a packet and it doesn't match the current
fingerprint, the tcp packet option pointer must be restored to its initial
value in order to do the proper tcp options check for the next fingerprint.

Here we can see an example.
Assumming the following fingerprint base with two lines:

S10:64:1:60:M*,S,T,N,W6:      Linux:3.0::Linux 3.0
S20:64:1:60:M*,S,T,N,W7:      Linux:4.19:arch:Linux 4.1

Where TCP options are the last field in the OS signature, all of them overlap
except by the last one, ie. 'W6' versus 'W7'.

In case a packet for Linux 4.19 kicks in, the osf finds no matching because the
TCP options pointer is updated after checking for the TCP options in the first
line.

Therefore, reset pointer back to where it should be.

Fixes: 11eeef41d5 ("netfilter: passive OS fingerprint xtables match")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:09:03 +01:00
Pablo Neira Ayuso
a905b82e1e netfilter: nft_compat: use-after-free when deleting targets
commit 753c111f65 upstream.

Fetch pointer to module before target object is released.

Fixes: 29e3880109 ("netfilter: nf_tables: fix use-after-free when deleting compat expressions")
Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:09:03 +01:00
Pablo Neira Ayuso
1500d94e33 netfilter: nf_tables: fix flush after rule deletion in the same batch
commit 23b7ca4f74 upstream.

Flush after rule deletion bogusly hits -ENOENT. Skip rules that have
been already from nft_delrule_by_chain() which is always called from the
flush path.

Fixes: cf9dc09d09 ("netfilter: nf_tables: fix missing rules flushing per table")
Reported-by: Phil Sutter <phil@nwl.cc>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:09:02 +01:00
Henry Yen
73aa8292ca netfilter: nft_flow_offload: fix checking method of conntrack helper
[ Upstream commit 2314e87974 ]

This patch uses nfct_help() to detect whether an established connection
needs conntrack helper instead of using test_bit(IPS_HELPER_BIT,
&ct->status).

The reason is that IPS_HELPER_BIT is only set when using explicit CT
target.

However, in the case that a device enables conntrack helper via command
"echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper", the status of
IPS_HELPER_BIT will not present any change, and consequently it loses
the checking ability in the context.

Signed-off-by: Henry Yen <henry.yen@mediatek.com>
Reviewed-by: Ryder Lee <ryder.lee@mediatek.com>
Tested-by: John Crispin <john@phrozen.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:55 +01:00
wenxu
6d26c375a4 netfilter: nft_flow_offload: fix interaction with vrf slave device
[ Upstream commit 10f4e76587 ]

In the forward chain, the iif is changed from slave device to master vrf
device. Thus, flow offload does not find a match on the lower slave
device.

This patch uses the cached route, ie. dst->dev, to update the iif and
oif fields in the flow entry.

After this patch, the following example works fine:

 # ip addr add dev eth0 1.1.1.1/24
 # ip addr add dev eth1 10.0.0.1/24
 # ip link add user1 type vrf table 1
 # ip l set user1 up
 # ip l set dev eth0 master user1
 # ip l set dev eth1 master user1

 # nft add table firewall
 # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
 # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
 # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
 # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:54 +01:00
wenxu
535be4692f netfilter: nft_flow_offload: Fix reverse route lookup
[ Upstream commit a799aea098 ]

Using the following example:

	client 1.1.1.7 ---> 2.2.2.7 which dnat to 10.0.0.7 server

The first reply packet (ie. syn+ack) uses an incorrect destination
address for the reverse route lookup since it uses:

	daddr = ct->tuplehash[!dir].tuple.dst.u3.ip;

which is 2.2.2.7 in the scenario that is described above, while this
should be:

	daddr = ct->tuplehash[dir].tuple.src.u3.ip;

that is 10.0.0.7.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:53 +01:00
Taehee Yoo
95d4f951e7 netfilter: nf_tables: fix leaking object reference count
[ Upstream commit b91d903688 ]

There is no code that decreases the reference count of stateful objects
in error path of the nft_add_set_elem(). this causes a leak of reference
count of stateful objects.

Test commands:
   $nft add table ip filter
   $nft add counter ip filter c1
   $nft add map ip filter m1 { type ipv4_addr : counter \;}
   $nft add element ip filter m1 { 1 : c1 }
   $nft add element ip filter m1 { 1 : c1 }
   $nft delete element ip filter m1 { 1 }
   $nft delete counter ip filter c1

Result:
   Error: Could not process rule: Device or resource busy
   delete counter ip filter c1
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the second 'nft add element ip filter m1 { 1 : c1 }', the reference
count of the 'c1' is increased then it tries to insert into the 'm1'. but
the 'm1' already has same element so it returns -EEXIST.
But it doesn't decrease the reference count of the 'c1' in the error path.
Due to a leak of the reference count of the 'c1', the 'c1' can't be
removed by 'nft delete counter ip filter c1'.

Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:53 +01:00
Stefano Brivio
ad7013cd6d netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets
[ Upstream commit 8cc4ccf583 ]

There doesn't seem to be any reason to restrict MAC address
matching to source MAC addresses in set types bitmap:ipmac,
hash:ipmac and hash:mac. With this patch, and this setup:

  ip netns add A
  ip link add veth1 type veth peer name veth2 netns A
  ip addr add 192.0.2.1/24 dev veth1
  ip -net A addr add 192.0.2.2/24 dev veth2
  ip link set veth1 up
  ip -net A link set veth2 up

  ip netns exec A ipset create test hash:mac
  dst=$(ip netns exec A cat /sys/class/net/veth2/address)
  ip netns exec A ipset add test ${dst}
  ip netns exec A iptables -P INPUT DROP
  ip netns exec A iptables -I INPUT -m set --match-set test dst -j ACCEPT

ipset will match packets based on destination MAC address:

  # ping -c1 192.0.2.2 >/dev/null
  # echo $?
  0

Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:33 +01:00
Florian Westphal
6567515e4a netfilter: nf_conncount: fix argument order to find_next_bit
commit a007232066 upstream.

Size and 'next bit' were swapped, this bug could cause worker to
reschedule itself even if system was idle.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:29 +01:00
Pablo Neira Ayuso
b01b92417d netfilter: nf_conncount: speculative garbage collection on empty lists
commit c80f10bc97 upstream.

Instead of removing a empty list node that might be reintroduced soon
thereafter, tentatively place the empty list node on the list passed to
tree_nodes_free(), then re-check if the list is empty again before erasing
it from the tree.

[ Florian: rebase on top of pending nf_conncount fixes ]

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:29 +01:00
Pablo Neira Ayuso
aea1d19594 netfilter: nf_conncount: move all list iterations under spinlock
commit 2f971a8f42 upstream.

Two CPUs may race to remove a connection from the list, the existing
conn->dead will result in a use-after-free. Use the per-list spinlock to
protect list iterations.

As all accesses to the list now happen while holding the per-list lock,
we no longer need to delay free operations with rcu.

Joint work with Florian.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:29 +01:00
Florian Westphal
bdc6c893ba netfilter: nf_conncount: merge lookup and add functions
commit df4a902509 upstream.

'lookup' is always followed by 'add'.
Merge both and make the list-walk part of nf_conncount_add().

This also avoids one unneeded unlock/re-lock pair.

Extra care needs to be taken in count_tree, as we only hold rcu
read lock, i.e. we can only insert to an existing tree node after
acquiring its lock and making sure it has a nonzero count.

As a zero count should be rare, just fall back to insert_tree()
(which acquires tree lock).

This issue and its solution were pointed out by Shawn Bohrer
during patch review.

Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:29 +01:00
Florian Westphal
13c639424b netfilter: nf_conncount: restart search when nodes have been erased
commit e8cfb372b3 upstream.

Shawn Bohrer reported a following crash:
 |RIP: 0010:rb_erase+0xae/0x360
 [..]
 Call Trace:
  nf_conncount_destroy+0x59/0xc0 [nf_conncount]
  cleanup_match+0x45/0x70 [ip_tables]
  ...

Shawn tracked this down to bogus 'parent' pointer:
Problem is that when we insert a new node, then there is a chance that
the 'parent' that we found was also passed to tree_nodes_free() (because
that node was empty) for erase+free.

Instead of trying to be clever and detect when this happens, restart
the search if we have evicted one or more nodes.  To prevent frequent
restarts, do not perform gc on the second round.

Also, unconditionally schedule the gc worker.
The condition

  gc_count > ARRAY_SIZE(gc_nodes))

cannot be true unless tree grows very large, as the height of the tree
will be low even with hundreds of nodes present.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reported-by: Shawn Bohrer <sbohrer@cloudflare.com>
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:29 +01:00
Florian Westphal
d6b3ff0222 netfilter: nf_conncount: split gc in two phases
commit f7fcc98dfc upstream.

The lockless workqueue garbage collector can race with packet path
garbage collector to delete list nodes, as it calls tree_nodes_free()
with the addresses of nodes that might have been free'd already from
another cpu.

To fix this, split gc into two phases.

One phase to perform gc on the connections: From a locking perspective,
this is the same as count_tree(): we hold rcu lock, but we do not
change the tree, we only change the nodes' contents.

The second phase acquires the tree lock and reaps empty nodes.
This avoids a race condition of the garbage collection vs.  packet path:
If a node has been free'd already, the second phase won't find it anymore.

This second phase is, from locking perspective, same as insert_tree().

The former only modifies nodes (list content, count), latter modifies
the tree itself (rb_erase or rb_insert).

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:28 +01:00
Florian Westphal
ef68fdb517 netfilter: nf_conncount: don't skip eviction when age is negative
commit 4cd273bb91 upstream.

age is signed integer, so result can be negative when the timestamps
have a large delta.  In this case we want to discard the entry.

Instead of using age >= 2 || age < 0, just make it unsigned.

Fixes: b36e4523d4 ("netfilter: nf_conncount: fix garbage collection confirm race")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:28 +01:00
Shawn Bohrer
c5cbe95a4b netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS
commit c78e7818f1 upstream.

Most of the time these were the same value anyway, but when
CONFIG_LOCKDEP was enabled we would use a smaller number of locks to
reduce overhead.  Unfortunately having two values is confusing and not
worth the complexity.

This fixes a bug where tree_gc_worker() would only GC up to
CONNCOUNT_LOCK_SLOTS trees which meant when CONFIG_LOCKDEP was enabled
not all trees would be GCed by tree_gc_worker().

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:28 +01:00
Taehee Yoo
7fd995d3b4 netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node()
[ Upstream commit d4e7df1656 ]

rbnode in insert_tree() is rcu protected pointer.
So, in order to handle this pointer, _rcu function should be used.
rb_link_node_rcu() is a rcu version of rb_link_node().

Fixes: 34848d5c89 ("netfilter: nf_conncount: Split insert and traversal")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-13 09:50:59 +01:00
Florian Westphal
ba364929ff netfilter: nat: can't use dst_hold on noref dst
[ Upstream commit 542fbda0f0 ]

The dst entry might already have a zero refcount, waiting on rcu list
to be free'd.  Using dst_hold() transitions its reference count to 1, and
next dst release will try to free it again -- resulting in a double free:

  WARNING: CPU: 1 PID: 0 at include/net/dst.h:239 nf_xfrm_me_harder+0xe7/0x130 [nf_nat]
  RIP: 0010:nf_xfrm_me_harder+0xe7/0x130 [nf_nat]
  Code: 48 8b 5c 24 60 65 48 33 1c 25 28 00 00 00 75 53 48 83 c4 68 5b 5d 41 5c c3 85 c0 74 0d 8d 48 01 f0 0f b1 0a 74 86 85 c0 75 f3 <0f> 0b e9 7b ff ff ff 29 c6 31 d2 b9 20 00 48 00 4c 89 e7 e8 31 27
  Call Trace:
  nf_nat_ipv4_out+0x78/0x90 [nf_nat_ipv4]
  nf_hook_slow+0x36/0xd0
  ip_output+0x9f/0xd0
  ip_forward+0x328/0x440
  ip_rcv+0x8a/0xb0

Use dst_hold_safe instead and bail out if we cannot take a reference.

Fixes: a4c2fd7f78 ("net: remove DST_NOCACHE flag")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-13 09:50:59 +01:00
Pan Bian
9e40410160 netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel
[ Upstream commit 708abf74dd ]

In the error handling block, nla_nest_cancel(skb, atd) is called to
cancel the nest operation. But then, ipset_nest_end(skb, atd) is
unexpected called to end the nest operation. This patch calls the
ipset_nest_end only on the branch that nla_nest_cancel is not called.

Fixes: 45040978c8 ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-13 09:50:59 +01:00
Florian Westphal
6bcf9ef86c netfilter: seqadj: re-load tcp header pointer after possible head reallocation
[ Upstream commit 530aad7701 ]

When adjusting sack block sequence numbers, skb_make_writable() gets
called to make sure tcp options are all in the linear area, and buffer
is not shared.

This can cause tcp header pointer to get reallocated, so we must
reaload it to avoid memory corruption.

This bug pre-dates git history.

Reported-by: Neel Mehta <nmehta@google.com>
Reported-by: Shane Huntley <shuntley@google.com>
Reported-by: Heather Adkins <argv@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-13 09:50:57 +01:00
Taehee Yoo
cee05c0371 netfilter: nf_tables: fix suspicious RCU usage in nft_chain_stats_replace()
[ Upstream commit 4c05ec4738 ]

basechain->stats is rcu protected data which is updated from
nft_chain_stats_replace(). This function is executed from the commit
phase which holds the pernet nf_tables commit mutex - not the global
nfnetlink subsystem mutex.

Test commands to reproduce the problem are:
   %iptables-nft -I INPUT
   %iptables-nft -Z
   %iptables-nft -Z

This patch uses RCU calls to handle basechain->stats updates to fix a
splat that looks like:

[89279.358755] =============================
[89279.363656] WARNING: suspicious RCU usage
[89279.368458] 4.20.0-rc2+ #44 Tainted: G        W    L
[89279.374661] -----------------------------
[89279.379542] net/netfilter/nf_tables_api.c:1404 suspicious rcu_dereference_protected() usage!
[...]
[89279.406556] 1 lock held by iptables-nft/5225:
[89279.411728]  #0: 00000000bf45a000 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1f/0x70 [nf_tables]
[89279.424022] stack backtrace:
[89279.429236] CPU: 0 PID: 5225 Comm: iptables-nft Tainted: G        W    L    4.20.0-rc2+ #44
[89279.430135] Call Trace:
[89279.430135]  dump_stack+0xc9/0x16b
[89279.430135]  ? show_regs_print_info+0x5/0x5
[89279.430135]  ? lockdep_rcu_suspicious+0x117/0x160
[89279.430135]  nft_chain_commit_update+0x4ea/0x640 [nf_tables]
[89279.430135]  ? sched_clock_local+0xd4/0x140
[89279.430135]  ? check_flags.part.35+0x440/0x440
[89279.430135]  ? __rhashtable_remove_fast.constprop.67+0xec0/0xec0 [nf_tables]
[89279.430135]  ? sched_clock_cpu+0x126/0x170
[89279.430135]  ? find_held_lock+0x39/0x1c0
[89279.430135]  ? hlock_class+0x140/0x140
[89279.430135]  ? is_bpf_text_address+0x5/0xf0
[89279.430135]  ? check_flags.part.35+0x440/0x440
[89279.430135]  ? __lock_is_held+0xb4/0x140
[89279.430135]  nf_tables_commit+0x2555/0x39c0 [nf_tables]

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-13 09:50:57 +01:00
Taehee Yoo
e5f42e0617 netfilter: nf_tables: deactivate expressions in rule replecement routine
[ Upstream commit ca08987885 ]

There is no expression deactivation call from the rule replacement path,
hence, chain counter is not decremented. A few steps to reproduce the
problem:

   %nft add table ip filter
   %nft add chain ip filter c1
   %nft add chain ip filter c1
   %nft add rule ip filter c1 jump c2
   %nft replace rule ip filter c1 handle 3 accept
   %nft flush ruleset

<jump c2> expression means immediate NFT_JUMP to chain c2.
Reference count of chain c2 is increased when the rule is added.

When rule is deleted or replaced, the reference counter of c2 should be
decreased via nft_rule_expr_deactivate() which calls
nft_immediate_deactivate().

Splat looks like:
[  214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Modules linked in: nf_tables nfnetlink
[  214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44
[  214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[  214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8
[  214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202
[  214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0
[  214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78
[  214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000
[  214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba
[  214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6
[  214.398983] FS:  0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000
[  214.398983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0
[  214.398983] Call Trace:
[  214.398983]  ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables]
[  214.398983]  ? __kasan_slab_free+0x145/0x180
[  214.398983]  ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables]
[  214.398983]  ? kfree+0xdb/0x280
[  214.398983]  nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables]
[ ... ]

Fixes: bb7b40aecb ("netfilter: nf_tables: bogus EBUSY in chain deletions")
Reported by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:38 +01:00
Taehee Yoo
8038f92df3 netfilter: nf_conncount: remove wrong condition check routine
[ Upstream commit 53ca0f2fec ]

All lists that reach the tree_nodes_free() function have both zero
counter and true dead flag. The reason for this is that lists to be
release are selected by nf_conncount_gc_list() which already decrements
the list counter and sets on the dead flag. Therefore, this if statement
in tree_nodes_free() is unnecessary and wrong.

Fixes: 31568ec09e ("netfilter: nf_conncount: fix list_del corruption in conn_free")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:37 +01:00
Taehee Yoo
18218f827e netfilter: add missing error handling code for register functions
[ Upstream commit 584eab291c ]

register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:36 +01:00
Xin Long
28ad9091e1 ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf
[ Upstream commit 2a31e4bd9a ]

ip_vs_dst_event is supposed to clean up all dst used in ipvs'
destinations when a net dev is going down. But it works only
when the dst's dev is the same as the dev from the event.

Now with the same priority but late registration,
ip_vs_dst_notifier is always called later than ipv6_dev_notf
where the dst's dev is set to lo for NETDEV_DOWN event.

As the dst's dev lo is not the same as the dev from the event
in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work.
Also as these dst have to wait for dest_trash_timer to clean
them up. It would cause some non-permanent kernel warnings:

  unregister_netdevice: waiting for br0 to become free. Usage count = 3

To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf
by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5.

Note that for ipv4 route fib_netdev_notifier doesn't set dst's
dev to lo in NETDEV_DOWN event, so this fix is only needed when
IP_VS_IPV6 is defined.

Fixes: 7a4f0761fc ("IPVS: init and cleanup restructuring")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:35 +01:00
Taehee Yoo
fb0fc90cc2 netfilter: xt_hashlimit: fix a possible memory leak in htable_create()
[ Upstream commit b4e955e9f3 ]

In the htable_create(), hinfo is allocated by vmalloc()
So that if error occurred, hinfo should be freed.

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:33 +01:00
Florian Westphal
00bac44c99 netfilter: nf_tables: fix use-after-free when deleting compat expressions
[ Upstream commit 29e3880109 ]

nft_compat ops do not have static storage duration, unlike all other
expressions.

When nf_tables_expr_destroy() returns, expr->ops might have been
free'd already, so we need to store next address before calling
expression destructor.

For same reason, we can't deref match pointer after nft_xt_put().

This can be easily reproduced by adding msleep() before
nft_match_destroy() returns.

Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:31 +01:00
Taehee Yoo
e947f9aa9a netfilter: xt_RATEEST: remove netns exit routine
[ Upstream commit 0fbcc5b568 ]

xt_rateest_net_exit() was added to check whether rules are flushed
successfully. but ->net_exit() callback is called earlier than
->destroy() callback.
So that ->net_exit() callback can't check that.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 iptables -t mangle -I PREROUTING -p udp \
	   --dport 1111 -j RATEEST --rateest-name ap \
	   --rateest-interval 250ms --rateest-ewma 0.5s
   %ip netns del vm1

splat looks like:
[  668.813518] WARNING: CPU: 0 PID: 87 at net/netfilter/xt_RATEEST.c:210 xt_rateest_net_exit+0x210/0x340 [xt_RATEEST]
[  668.813518] Modules linked in: xt_RATEEST xt_tcpudp iptable_mangle bpfilter ip_tables x_tables
[  668.813518] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc7+ #21
[  668.813518] Workqueue: netns cleanup_net
[  668.813518] RIP: 0010:xt_rateest_net_exit+0x210/0x340 [xt_RATEEST]
[  668.813518] Code: 00 48 8b 85 30 ff ff ff 4c 8b 23 80 38 00 0f 85 24 01 00 00 48 8b 85 30 ff ff ff 4d 85 e4 4c 89 a5 58 ff ff ff c6 00 f8 74 b2 <0f> 0b 48 83 c3 08 4c 39 f3 75 b0 48 b8 00 00 00 00 00 fc ff df 49
[  668.813518] RSP: 0018:ffff8801156c73f8 EFLAGS: 00010282
[  668.813518] RAX: ffffed0022ad8e85 RBX: ffff880118928e98 RCX: 5db8012a00000000
[  668.813518] RDX: ffff8801156c7428 RSI: 00000000cb1d185f RDI: ffff880115663b74
[  668.813518] RBP: ffff8801156c74d0 R08: ffff8801156633c0 R09: 1ffff100236440be
[  668.813518] R10: 0000000000000001 R11: ffffed002367d852 R12: ffff880115142b08
[  668.813518] R13: 1ffff10022ad8e81 R14: ffff880118928ea8 R15: dffffc0000000000
[  668.813518] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
[  668.813518] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  668.813518] CR2: 0000563aa69f4f28 CR3: 0000000105a16000 CR4: 00000000001006f0
[  668.813518] Call Trace:
[  668.813518]  ? unregister_netdevice_many+0xe0/0xe0
[  668.813518]  ? xt_rateest_net_init+0x2c0/0x2c0 [xt_RATEEST]
[  668.813518]  ? default_device_exit+0x1ca/0x270
[  668.813518]  ? remove_proc_entry+0x1cd/0x390
[  668.813518]  ? dev_change_net_namespace+0xd00/0xd00
[  668.813518]  ? __init_waitqueue_head+0x130/0x130
[  668.813518]  ops_exit_list.isra.10+0x94/0x140
[  668.813518]  cleanup_net+0x45b/0x900
[  668.813518]  ? net_drop_ns+0x110/0x110
[  668.813518]  ? swapgs_restore_regs_and_return_to_usermode+0x3c/0x80
[  668.813518]  ? save_trace+0x300/0x300
[  668.813518]  ? lock_acquire+0x196/0x470
[  668.813518]  ? lock_acquire+0x196/0x470
[  668.813518]  ? process_one_work+0xb60/0x1de0
[  668.813518]  ? _raw_spin_unlock_irq+0x29/0x40
[  668.813518]  ? _raw_spin_unlock_irq+0x29/0x40
[  668.813518]  ? __lock_acquire+0x4500/0x4500
[  668.813518]  ? __lock_is_held+0xb4/0x140
[  668.813518]  process_one_work+0xc13/0x1de0
[  668.813518]  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
[  668.813518]  ? set_load_weight+0x270/0x270
[ ... ]

Fixes: 3427b2ab63 ("netfilter: make xt_rateest hash table per net")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:31 +01:00
Florian Westphal
8fe8940ffc netfilter: nf_tables: don't skip inactive chains during update
[ Upstream commit 0fb39bbe43 ]

There is no synchronization between packet path and the configuration plane.

The packet path uses two arrays with rules, one contains the current (active)
generation.  The other either contains the last (obsolete) generation or
the future one.

Consider:
cpu1               cpu2
                   nft_do_chain(c);
delete c
net->gen++;
                   genbit = !!net->gen;
                   rules = c->rg[genbit];

cpu1 ignores c when updating if c is not active anymore in the new
generation.

On cpu2, we now use rules from wrong generation, as c->rg[old]
contains the rules matching 'c' whereas c->rg[new] was not updated and
can even point to rules that have been free'd already, causing a crash.

To fix this, make sure that 'current' to the 'next' generation are
identical for chains that are going away so that c->rg[new] will just
use the matching rules even if genbit was incremented already.

Fixes: 0cbc06b3fa ("netfilter: nf_tables: remove synchronize_rcu in commit phase")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:31 +01:00
Taehee Yoo
4a3b49f0ce netfilter: nf_conncount: fix unexpected permanent node of list.
[ Upstream commit 3c5cdb17c3 ]

When list->count is 0, the list is deleted by GC. But list->count is
never reached 0 because initial count value is 1 and it is increased
when node is inserted. So that initial value of list->count should be 0.

Originally GC always finds zero count list through deleting node and
decreasing count. However, list may be left empty since node insertion
may fail eg.  allocaton problem. In order to solve this problem, GC
routine also finds zero count list without deleting node.

Fixes: cb2b36f5a9 ("netfilter: nf_conncount: Switch to plain list")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:30 +01:00
Taehee Yoo
ae60f4705f netfilter: nf_conncount: fix list_del corruption in conn_free
[ Upstream commit 31568ec09e ]

nf_conncount_tuple is an element of nft_connlimit and that is deleted by
conn_free(). Elements can be deleted by both GC routine and data path
functions (nf_conncount_lookup, nf_conncount_add) and they call
conn_free() to free elements. But conn_free() only protects lists, not
each element. So that list_del corruption could occurred.

The conn_free() doesn't check whether element is already deleted. In
order to protect elements, dead flag is added. If an element is deleted,
dead flag is set. The only conn_free() can delete elements so that both
list lock and dead flag are enough to protect it.

test commands:
   %nft add table ip filter
   %nft add chain ip filter input { type filter hook input priority 0\; }
   %nft add rule filter input meter test { ip id ct count over 2 } counter

splat looks like:
[ 1779.495778] list_del corruption, ffff8800b6e12008->prev is LIST_POISON2 (dead000000000200)
[ 1779.505453] ------------[ cut here ]------------
[ 1779.506260] kernel BUG at lib/list_debug.c:50!
[ 1779.515831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 1779.516772] CPU: 0 PID: 33 Comm: kworker/0:2 Not tainted 4.19.0-rc6+ #22
[ 1779.516772] Workqueue: events_power_efficient nft_rhash_gc [nf_tables_set]
[ 1779.516772] RIP: 0010:__list_del_entry_valid+0xd8/0x150
[ 1779.516772] Code: 39 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 89 ea 48 c7 c7 00 c3 5b 98 e8 0f dc 40 ff 0f 0b 48 c7 c7 60 c3 5b 98 e8 01 dc 40 ff <0f> 0b 48 c7 c7 c0 c3 5b 98 e8 f3 db 40 ff 0f 0b 48 c7 c7 20 c4 5b
[ 1779.516772] RSP: 0018:ffff880119127420 EFLAGS: 00010286
[ 1779.516772] RAX: 000000000000004e RBX: dead000000000200 RCX: 0000000000000000
[ 1779.516772] RDX: 000000000000004e RSI: 0000000000000008 RDI: ffffed0023224e7a
[ 1779.516772] RBP: ffff88011934bc10 R08: ffffed002367cea9 R09: ffffed002367cea9
[ 1779.516772] R10: 0000000000000001 R11: ffffed002367cea8 R12: ffff8800b6e12008
[ 1779.516772] R13: ffff8800b6e12010 R14: ffff88011934bc20 R15: ffff8800b6e12008
[ 1779.516772] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
[ 1779.516772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1779.516772] CR2: 00007fc876534010 CR3: 000000010da16000 CR4: 00000000001006f0
[ 1779.516772] Call Trace:
[ 1779.516772]  conn_free+0x9f/0x2b0 [nf_conncount]
[ 1779.516772]  ? nf_ct_tmpl_alloc+0x2a0/0x2a0 [nf_conntrack]
[ 1779.516772]  ? nf_conncount_add+0x520/0x520 [nf_conncount]
[ 1779.516772]  ? do_raw_spin_trylock+0x1a0/0x1a0
[ 1779.516772]  ? do_raw_spin_trylock+0x10/0x1a0
[ 1779.516772]  find_or_evict+0xe5/0x150 [nf_conncount]
[ 1779.516772]  nf_conncount_gc_list+0x162/0x360 [nf_conncount]
[ 1779.516772]  ? nf_conncount_lookup+0xee0/0xee0 [nf_conncount]
[ 1779.516772]  ? _raw_spin_unlock_irqrestore+0x45/0x50
[ 1779.516772]  ? trace_hardirqs_off+0x6b/0x220
[ 1779.516772]  ? trace_hardirqs_on_caller+0x220/0x220
[ 1779.516772]  nft_rhash_gc+0x16b/0x540 [nf_tables_set]
[ ... ]

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:30 +01:00
Taehee Yoo
08c7e68ab2 netfilter: nf_conncount: use spin_lock_bh instead of spin_lock
[ Upstream commit fd3e71a9f7 ]

conn_free() holds lock with spin_lock() and it is called by both
nf_conncount_lookup() and nf_conncount_gc_list(). nf_conncount_lookup()
is called from bottom-half context and nf_conncount_gc_list() from
process context. So that spin_lock() call is not safe. Hence
conn_free() should use spin_lock_bh() instead of spin_lock().

test commands:
   %nft add table ip filter
   %nft add chain ip filter input { type filter hook input priority 0\; }
   %nft add rule filter input meter test { ip saddr ct count over 2 } \
	   counter

splat looks like:
[  461.996507] ================================
[  461.998999] WARNING: inconsistent lock state
[  461.998999] 4.19.0-rc6+ #22 Not tainted
[  461.998999] --------------------------------
[  461.998999] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  461.998999] kworker/0:2/134 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  461.998999] 00000000a71a559a (&(&list->list_lock)->rlock){+.?.}, at: conn_free+0x69/0x2b0 [nf_conncount]
[  461.998999] {IN-SOFTIRQ-W} state was registered at:
[  461.998999]   _raw_spin_lock+0x30/0x70
[  461.998999]   nf_conncount_add+0x28a/0x520 [nf_conncount]
[  461.998999]   nft_connlimit_eval+0x401/0x580 [nft_connlimit]
[  461.998999]   nft_dynset_eval+0x32b/0x590 [nf_tables]
[  461.998999]   nft_do_chain+0x497/0x1430 [nf_tables]
[  461.998999]   nft_do_chain_ipv4+0x255/0x330 [nf_tables]
[  461.998999]   nf_hook_slow+0xb1/0x160
[ ... ]
[  461.998999] other info that might help us debug this:
[  461.998999]  Possible unsafe locking scenario:
[  461.998999]
[  461.998999]        CPU0
[  461.998999]        ----
[  461.998999]   lock(&(&list->list_lock)->rlock);
[  461.998999]   <Interrupt>
[  461.998999]     lock(&(&list->list_lock)->rlock);
[  461.998999]
[  461.998999]  *** DEADLOCK ***
[  461.998999]
[ ... ]

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:30 +01:00
Florian Westphal
1cf11e7ca0 netfilter: nft_compat: ebtables 'nat' table is normal chain type
[ Upstream commit e4844c9c62 ]

Unlike ip(6)tables, the ebtables nat table has no special properties.
This bug causes 'ebtables -A' to fail when using a target such as
'snat' (ebt_snat target sets ".table = "nat"').  Targets that have
no table restrictions work fine.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:03 +01:00
Jozsef Kadlecsik
2f6bf7917f netfilter: ipset: Fix calling ip_set() macro at dumping
[ Upstream commit 8a02bdd50b ]

The ip_set() macro is called when either ip_set_ref_lock held only
or no lock/nfnl mutex is held at dumping. Take this into account
properly. Also, use Pablo's suggestion to use rcu_dereference_raw(),
the ref_netlink protects the set.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:03 +01:00
Taehee Yoo
e8b258ce87 netfilter: xt_IDLETIMER: add sysfs filename checking routine
[ Upstream commit 54451f60c8 ]

When IDLETIMER rule is added, sysfs file is created under
/sys/class/xt_idletimer/timers/
But some label name shouldn't be used.
".", "..", "power", "uevent", "subsystem", etc...
So that sysfs filename checking routine is needed.

test commands:
   %iptables -I INPUT -j IDLETIMER --timeout 1 --label "power"

splat looks like:
[95765.423132] sysfs: cannot create duplicate filename '/devices/virtual/xt_idletimer/timers/power'
[95765.433418] CPU: 0 PID: 8446 Comm: iptables Not tainted 4.19.0-rc6+ #20
[95765.449755] Call Trace:
[95765.449755]  dump_stack+0xc9/0x16b
[95765.449755]  ? show_regs_print_info+0x5/0x5
[95765.449755]  sysfs_warn_dup+0x74/0x90
[95765.449755]  sysfs_add_file_mode_ns+0x352/0x500
[95765.449755]  sysfs_create_file_ns+0x179/0x270
[95765.449755]  ? sysfs_add_file_mode_ns+0x500/0x500
[95765.449755]  ? idletimer_tg_checkentry+0x3e5/0xb1b [xt_IDLETIMER]
[95765.449755]  ? rcu_read_lock_sched_held+0x114/0x130
[95765.449755]  ? __kmalloc_track_caller+0x211/0x2b0
[95765.449755]  ? memcpy+0x34/0x50
[95765.449755]  idletimer_tg_checkentry+0x4e2/0xb1b [xt_IDLETIMER]
[ ... ]

Fixes: 0902b469bd ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:03 +01:00
Andrey Ryabinin
97fdf29f7d netfilter: ipset: fix ip_set_list allocation failure
[ Upstream commit ed956f3947 ]

ip_set_create() and ip_set_net_init() attempt to allocate physically
contiguous memory for ip_set_list. If memory is fragmented, the
allocations could easily fail:

        vzctl: page allocation failure: order:7, mode:0xc0d0

        Call Trace:
         dump_stack+0x19/0x1b
         warn_alloc_failed+0x110/0x180
         __alloc_pages_nodemask+0x7bf/0xc60
         alloc_pages_current+0x98/0x110
         kmalloc_order+0x18/0x40
         kmalloc_order_trace+0x26/0xa0
         __kmalloc+0x279/0x290
         ip_set_net_init+0x4b/0x90 [ip_set]
         ops_init+0x3b/0xb0
         setup_net+0xbb/0x170
         copy_net_ns+0xf1/0x1c0
         create_new_namespaces+0xf9/0x180
         copy_namespaces+0x8e/0xd0
         copy_process+0xb61/0x1a00
         do_fork+0x91/0x320

Use kvcalloc() to fallback to 0-order allocations if high order
page isn't available.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:02 +01:00
Eric Westbrook
cb3e590df4 netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net
[ Upstream commit 886503f34d ]

Allow /0 as advertised for hash:net,port,net sets.

For "hash:net,port,net", ipset(8) says that "either subnet
is permitted to be a /0 should you wish to match port
between all destinations."

Make that statement true.

Before:

    # ipset create cidrzero hash:net,port,net
    # ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
    ipset v6.34: The value of the CIDR parameter of the IP address is invalid

    # ipset create cidrzero6 hash:net,port,net family inet6
    # ipset add cidrzero6 ::/0,12345,::/0
    ipset v6.34: The value of the CIDR parameter of the IP address is invalid

After:

    # ipset create cidrzero hash:net,port,net
    # ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
    # ipset test cidrzero 192.168.205.129,12345,172.16.205.129
    192.168.205.129,tcp:12345,172.16.205.129 is in set cidrzero.

    # ipset create cidrzero6 hash:net,port,net family inet6
    # ipset add cidrzero6 ::/0,12345,::/0
    # ipset test cidrzero6 fe80::1,12345,ff00::1
    fe80::1,tcp:12345,ff00::1 is in set cidrzero6.

See also:

  https://bugzilla.kernel.org/show_bug.cgi?id=200897
  df7ff6efb0

Signed-off-by: Eric Westbrook <linux@westbrook.io>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:02 +01:00
Stefano Brivio
c75116e66e netfilter: ipset: list:set: Decrease refcount synchronously on deletion and replace
[ Upstream commit 439cd39ea1 ]

Commit 45040978c8 ("netfilter: ipset: Fix set:list type crash
when flush/dump set in parallel") postponed decreasing set
reference counters to the RCU callback.

An 'ipset del' command can terminate before the RCU grace period
is elapsed, and if sets are listed before then, the reference
counter shown in userspace will be wrong:

 # ipset create h hash:ip; ipset create l list:set; ipset add l
 # ipset del l h; ipset list h
 Name: h
 Type: hash:ip
 Revision: 4
 Header: family inet hashsize 1024 maxelem 65536
 Size in memory: 88
 References: 1
 Number of entries: 0
 Members:
 # sleep 1; ipset list h
 Name: h
 Type: hash:ip
 Revision: 4
 Header: family inet hashsize 1024 maxelem 65536
 Size in memory: 88
 References: 0
 Number of entries: 0
 Members:

Fix this by making the reference count update synchronous again.

As a result, when sets are listed, ip_set_name_byindex() might
now fetch a set whose reference count is already zero. Instead
of relying on the reference count to protect against concurrent
set renaming, grab ip_set_ref_lock as reader and copy the name,
while holding the same lock in ip_set_rename() as writer
instead.

Reported-by: Li Shuang <shuali@redhat.com>
Fixes: 45040978c8 ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:02 +01:00
Pablo Neira Ayuso
fecf70b135 Revert "netfilter: nft_numgen: add map lookups for numgen random operations"
[ Upstream commit 4269fea768 ]

Laura found a better way to do this from userspace without requiring
kernel infrastructure, revert this.

Fixes: 978d8f9055 ("netfilter: nft_numgen: add map lookups for numgen random operations")
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:02 +01:00
Vasily Khoruzhick
1be1576a13 netfilter: conntrack: fix calculation of next bucket number in early_drop
commit f393808dc6 upstream.

If there's no entry to drop in bucket that corresponds to the hash,
early_drop() should look for it in other buckets. But since it increments
hash instead of bucket number, it actually looks in the same bucket 8
times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.

Fix it by increasing bucket number instead of hash and rename _hash
to bucket to avoid future confusion.

Fixes: 3e86638e9a ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21 09:19:18 +01:00
Paolo Abeni
703acc3265 netfilter: xt_nat: fix DNAT target for shifted portmap ranges
[ Upstream commit cb20f2d2c0 ]

The commit 2eb0f624b7 ("netfilter: add NAT support for shifted
portmap ranges") did not set the checkentry/destroy callbacks for
the newly added DNAT target. As a result, rulesets using only
such nat targets are not effective, as the relevant conntrack hooks
are not enabled.
The above affect also nft_compat rulesets.
Fix the issue adding the missing initializers.

Fixes: 2eb0f624b7 ("netfilter: add NAT support for shifted portmap ranges")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:20 -08:00
Flavio Leitner
40e4f26e6a netfilter: xt_socket: check sk before checking for netns.
Only check for the network namespace if the socket is available.

Fixes: f564650106 ("netfilter: check if the socket netns is correct.")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-28 14:47:41 +02:00
Taehee Yoo
a13f814a67 netfilter: nft_set_rbtree: add missing rb_erase() in GC routine
The nft_set_gc_batch_check() checks whether gc buffer is full.
If gc buffer is full, gc buffer is released by
the nft_set_gc_batch_complete() internally.
In case of rbtree, the rb_erase() should be called before calling the
nft_set_gc_batch_complete(). therefore the rb_erase() should
be called before calling the nft_set_gc_batch_check() too.

test commands:
   table ip filter {
	   set set1 {
		   type ipv4_addr; flags interval, timeout;
		   gc-interval 10s;
		   timeout 1s;
		   elements = {
			   1-2,
			   3-4,
			   5-6,
			   ...
			   10000-10001,
		   }
	   }
   }
   %nft -f test.nft

splat looks like:
[  430.273885] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  430.282158] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  430.283116] CPU: 1 PID: 190 Comm: kworker/1:2 Tainted: G    B             4.18.0+ #7
[  430.283116] Workqueue: events_power_efficient nft_rbtree_gc [nf_tables_set]
[  430.313559] RIP: 0010:rb_next+0x81/0x130
[  430.313559] Code: 08 49 bd 00 00 00 00 00 fc ff df 48 bb 00 00 00 00 00 fc ff df 48 85 c0 75 05 eb 58 48 89 d4
[  430.313559] RSP: 0018:ffff88010cdb7680 EFLAGS: 00010207
[  430.313559] RAX: 0000000000b84854 RBX: dffffc0000000000 RCX: ffffffff83f01973
[  430.313559] RDX: 000000000017090c RSI: 0000000000000008 RDI: 0000000000b84864
[  430.313559] RBP: ffff8801060d4588 R08: fffffbfff09bc349 R09: fffffbfff09bc349
[  430.313559] R10: 0000000000000001 R11: fffffbfff09bc348 R12: ffff880100f081a8
[  430.313559] R13: dffffc0000000000 R14: ffff880100ff8688 R15: dffffc0000000000
[  430.313559] FS:  0000000000000000(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  430.313559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  430.313559] CR2: 0000000001551008 CR3: 000000005dc16000 CR4: 00000000001006e0
[  430.313559] Call Trace:
[  430.313559]  nft_rbtree_gc+0x112/0x5c0 [nf_tables_set]
[  430.313559]  process_one_work+0xc13/0x1ec0
[  430.313559]  ? _raw_spin_unlock_irq+0x29/0x40
[  430.313559]  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
[  430.313559]  ? set_load_weight+0x270/0x270
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x40/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x40/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x40/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __switch_to_asm+0x40/0x70
[  430.313559]  ? __switch_to_asm+0x34/0x70
[  430.313559]  ? __schedule+0x6d3/0x1f50
[  430.313559]  ? find_held_lock+0x39/0x1c0
[  430.313559]  ? __sched_text_start+0x8/0x8
[  430.313559]  ? cyc2ns_read_end+0x10/0x10
[  430.313559]  ? save_trace+0x300/0x300
[  430.313559]  ? sched_clock_local+0xd4/0x140
[  430.313559]  ? find_held_lock+0x39/0x1c0
[  430.313559]  ? worker_thread+0x353/0x1120
[  430.313559]  ? worker_thread+0x353/0x1120
[  430.313559]  ? lock_contended+0xe70/0xe70
[  430.313559]  ? __lock_acquire+0x4500/0x4500
[  430.535635]  ? do_raw_spin_unlock+0xa5/0x330
[  430.535635]  ? do_raw_spin_trylock+0x101/0x1a0
[  430.535635]  ? do_raw_spin_lock+0x1f0/0x1f0
[  430.535635]  ? _raw_spin_lock_irq+0x10/0x70
[  430.535635]  worker_thread+0x15d/0x1120
[ ... ]

Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-28 14:47:01 +02:00
zhong jiang
346fa83d10 netfilter: conntrack: get rid of double sizeof
sizeof(sizeof()) is quite strange and does not seem to be what
is wanted here.

The issue is detected with the help of Coccinelle.

Fixes: 3921584674 ("netfilter: conntrack: remove nlattr_size pointer from l4proto trackers")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:40:32 +02:00