Commit Graph

4251 Commits

Author SHA1 Message Date
Eric Dumazet
1b1cb1f78a net: ping: small changes
ping_table is not __read_mostly, since it contains one rwlock,
and is static to ping.c

ping_port_rover & ping_v4_lookup are static

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-15 01:22:21 -04:00
David S. Miller
7be799a70b ipv4: Remove rt->rt_dst reference from ip_forward_options().
At this point iph->daddr equals what rt->rt_dst would hold.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:31:02 -04:00
David S. Miller
8e36360ae8 ipv4: Remove route key identity dependencies in ip_rt_get_source().
Pass in the sk_buff so that we can fetch the necessary keys from
the packet header when working with input routes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:29:41 -04:00
David S. Miller
22f728f8f3 ipv4: Always call ip_options_build() after rest of IP header is filled in.
This will allow ip_options_build() to reliably look at the values of
iph->{daddr,saddr}

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:21:27 -04:00
David S. Miller
0374d9ceb0 ipv4: Kill spurious write to iph->daddr in ip_forward_options().
This code block executes when opt->srr_is_hit is set.  It will be
set only by ip_options_rcv_srr().

ip_options_rcv_srr() walks until it hits a matching nexthop in the SRR
option addresses, and when it matches one 1) looks up the route for
that nexthop and 2) on route lookup success it writes that nexthop
value into iph->daddr.

ip_forward_options() runs later, and again walks the SRR option
addresses looking for the option matching the destination of the route
stored in skb_rtable().  This route will be the same exact one looked
up for the nexthop by ip_options_rcv_srr().

Therefore "rt->rt_dst == iph->daddr" must be true.

All it really needs to do is record the route's source address in the
matching SRR option adddress.  It need not write iph->daddr again,
since that has already been done by ip_options_rcv_srr() as detailed
above.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:15:50 -04:00
Vasiliy Kulikov
c319b4d76b net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges.  In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.

ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets.  Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

PATCH v3:
    - switched to flowi4.
    - minor changes to be consistent with raw sockets code.

PATCH v2:
    - changed ping_debug() to pr_debug().
    - removed CONFIG_IP_PING.
    - removed ping_seq_fops.owner field (unused for procfs).
    - switched to proc_net_fops_create().
    - switched to %pK in seq_printf().

PATCH v1:
    - fixed checksumming bug.
    - CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
    - minor cleanups.
    - introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 16:08:13 -04:00
David S. Miller
72a8f97bf2 ipv4: Fix 'iph' use before set.
I swear none of my compilers warned about this, yet it is so
obvious.

> net/ipv4/ip_forward.c: In function 'ip_forward':
> net/ipv4/ip_forward.c:87: warning: 'iph' may be used uninitialized in this function

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 23:03:46 -04:00
David S. Miller
def57687e9 ipv4: Elide use of rt->rt_dst in ip_forward()
No matter what kind of header mangling occurs due to IP options
processing, rt->rt_dst will always equal iph->daddr in the packet.

So we can safely use iph->daddr instead of rt->rt_dst here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:34:30 -04:00
David S. Miller
c30883bdff ipv4: Simplify iph->daddr overwrite in ip_options_rcv_srr().
We already copy the 4-byte nexthop from the options block into
local variable "nexthop" for the route lookup.

Re-use that variable instead of memcpy()'ing again when assigning
to iph->daddr after the route lookup succeeds.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:30:58 -04:00
David S. Miller
10949550bd ipv4: Kill spurious opt->srr check in ip_options_rcv_srr().
All call sites conditionalize the call to ip_options_rcv_srr()
with a check of opt->srr, so no need to check it again there.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:26:57 -04:00
David S. Miller
3c709f8fb4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6
Conflicts:
	drivers/net/benet/be_main.c
2011-05-11 14:26:58 -04:00
Steffen Klassert
43a4dea4c9 xfrm: Assign the inner mode output function to the dst entry
As it is, we assign the outer modes output function to the dst entry
when we create the xfrm bundle. This leads to two problems on interfamily
scenarios. We might insert ipv4 packets into ip6_fragment when called
from xfrm6_output. The system crashes if we try to fragment an ipv4
packet with ip6_fragment. This issue was introduced with git commit
ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets
as needed). The second issue is, that we might insert ipv4 packets in
netfilter6 and vice versa on interfamily scenarios.

With this patch we assign the inner mode output function to the dst entry
when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner
mode is used and the right fragmentation and netfilter functions are called.
We switch then to outer mode with the output_finish functions.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:34 -07:00
Eric Dumazet
1fc19aff84 net: fix two lockdep splats
Commit e67f88dd12 (net: dont hold rtnl mutex during netlink dump
callbacks) switched rtnl protection to RCU, but we forgot to adjust two
rcu_dereference() lockdep annotations :

inet_get_link_af_size() or inet_fill_link_af() might be called with
rcu_read_lock or rtnl held, so use rcu_dereference_rtnl()
instead of rtnl_dereference()

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:01 -07:00
David S. Miller
8f01cb0827 ipv4: xfrm: Eliminate ->rt_src reference in policy code.
Rearrange xfrm4_dst_lookup() so that it works by calling a helper
function __xfrm_dst_lookup() that takes an explicit flow key storage
area as an argument.

Use this new helper in xfrm4_get_saddr() so we can fetch the selected
source address from the flow instead of from rt->rt_src

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:48 -07:00
David S. Miller
79ab053145 ipv4: udp: Eliminate remaining uses of rt->rt_src
We already track and pass around the correct flow key,
so simply use it in udp_send_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:47 -07:00
David S. Miller
9f6abb5f17 ipv4: icmp: Eliminate remaining uses of rt->rt_src
On input packets, rt->rt_src always equals ip_hdr(skb)->saddr

Anything that mangles or otherwise changes the IP header must
relookup the route found at skb_rtable().  Therefore this
invariant must always hold true.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:46 -07:00
David S. Miller
0a5ebb8000 ipv4: Pass explicit daddr arg to ip_send_reply().
This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:46 -07:00
David S. Miller
f5fca60865 ipv4: Pass flow key down into ip_append_*().
This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:24:07 -07:00
David S. Miller
77968b7824 ipv4: Pass flow keys down into datagram packet building engine.
This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:24:06 -07:00
David S. Miller
e474995f29 udp: Use flow key information instead of rt->rt_{src,dst}
We have two cases.

Either the socket is in TCP_ESTABLISHED state and connect() filled
in the inet socket cork flow, or we looked up the route here and
used an on-stack flow.

Track which one it was, and use it to obtain src/dst addrs.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:12:48 -07:00
stephen hemminger
b9f47a3aae tcp_cubic: limit delayed_ack ratio to prevent divide error
TCP Cubic keeps a metric that estimates the amount of delayed
acknowledgements to use in adjusting the window. If an abnormally
large number of packets are acknowledged at once, then the update
could wrap and reach zero. This kind of ACK could only
happen when there was a large window and huge number of
ACK's were lost.

This patch limits the value of delayed ack ratio. The choice of 32
is just a conservative value since normally it should be range of
1 to 4 packets.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:51:57 -07:00
David S. Miller
c5216cc70f tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:29 -07:00
David S. Miller
ea4fc0d619 ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit().
Now we can pick it out of the provided flow key.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:28 -07:00
David S. Miller
d9d8da805d inet: Pass flowi to ->queue_xmit().
This allows us to acquire the exact route keying information from the
protocol, however that might be managed.

It handles all of the possibilities, from the simplest case of storing
the key in inet->cork.fl to the more complex setup SCTP has where
individual transports determine the flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:28 -07:00
David S. Miller
0e73441992 ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.
Operation order is now transposed, we first create the child
socket then we try to hook up the route.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:03 -07:00
David S. Miller
77357a9552 ipv4: Create inet_csk_route_child_sock().
This is just like inet_csk_route_req() except that it operates after
we've created the new child socket.

In this way we can use the new socket's cork flow for proper route
key storage.

This will be used by DCCP and TCP child socket creation handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 14:34:22 -07:00
David S. Miller
b57ae01a8a ipv4: Use cork flow in ip_queue_xmit()
All invokers of ip_queue_xmit() must make certain that the
socket is locked.  All of SCTP, TCP, DCCP, and L2TP now make
sure this is the case.

Therefore we can use the cork flow during output route lookup in
ip_queue_xmit() when the socket route check fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 14:05:14 -07:00
David S. Miller
6e86913810 ipv4: Use cork flow in inet_sk_{reselect_saddr,rebuild_header}()
These two functions must be invoked only when the socket is locked
(because socket identity modifications are made non-atomically).

Therefore we can use the cork flow for output route lookups.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 14:05:13 -07:00
David S. Miller
3038eeac02 ipv4: Lock socket and use cork flow in ip4_datagram_connect().
This is to make sure that an l2tp socket's inet cork flow is
fully filled in, when it's encapsulated in UDP.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 13:48:57 -07:00
David S. Miller
da905bd1d5 tcp: Use cork flow in tcp_v4_connect()
Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 13:18:54 -07:00
David S. Miller
706527280e ipv4: Initialize cork->opt using NULL not 0.
Noticed by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-06 16:01:15 -07:00
David S. Miller
b80d72261a ipv4: Initialize on-stack cork more efficiently.
ip_setup_cork() explicitly initializes every member of
inet_cork except flags, addr, and opt.  So we can simply
set those three members to zero instead of using a
memset() via an empty struct assignment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-05-06 15:37:57 -07:00
David S. Miller
bdc712b4c2 inet: Decrease overhead of on-stack inet_cork.
When we fast path datagram sends to avoid locking by putting
the inet_cork on the stack we use up lots of space that isn't
necessary.

This is because inet_cork contains a "struct flowi" which isn't
used in these code paths.

Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
Only the latter of which has the "struct flowi" and is what is
stored in inet_sock.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-05-06 15:37:57 -07:00
David S. Miller
7143b7d412 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tg3.c
2011-05-05 14:59:02 -07:00
Jiri Pirko
1c5cae815d net: call dev_alloc_name from register_netdevice
Force dev_alloc_name() to be called from register_netdevice() by
dev_get_valid_name(). That allows to remove multiple explicit
dev_alloc_name() calls.

The possibility to call dev_alloc_name in advance remains.

This also fixes veth creation regresion caused by
84c49d8c3e

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-05 10:57:45 -07:00
Eric Dumazet
64f3b9e203 net: ip_expire() must revalidate route
Commit 4a94445c9a (net: Use ip_route_input_noref() in input path)
added a bug in IP defragmentation handling, in case timeout is fired.

When a frame is defragmented, we use last skb dst field when building
final skb. Its dst is valid, since we are in rcu read section.

But if a timeout occurs, we take first queued fragment to build one ICMP
TIME EXCEEDED message. Problem is all queued skb have weak dst pointers,
since we escaped RCU critical section after their queueing. icmp_send()
might dereference a now freed (and possibly reused) part of memory.

Calling skb_dst_drop() and ip_route_input_noref() to revalidate route is
the only possible choice.

Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 14:04:07 -07:00
David S. Miller
cbb1e85f9c ipv4: Kill rt->rt_{src, dst} usage in IP GRE tunnels.
First, make callers pass on-stack flowi4 to ip_route_output_gre()
so they can get at the fully resolved flow key.

Next, use that in ipgre_tunnel_xmit() to avoid the need to use
rt->rt_{dst,src}.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 12:55:07 -07:00
David S. Miller
9a1b9496cd ipv4: Pass explicit saddr/daddr args to ipmr_get_route().
This eliminates the need to use rt->rt_{src,dst}.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 12:18:54 -07:00
David S. Miller
dd927a2694 ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in.
Instead of rt->rt_{dst,src}

The only tricky part is source route option handling.

If the source route option is enabled we can't just use plain 'daddr',
we have to use opt->opt.faddr.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 12:03:30 -07:00
David S. Miller
69458cb194 ipv4: Use flowi4->{daddr,saddr} in ipip_tunnel_xmit().
Instead of rt->rt_{dst,src}

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 11:10:28 -07:00
David S. Miller
492f64ce12 ipv4: Use flowi4's {saddr,daddr} in igmpv3_newpack() and igmp_send_report()
Instead of rt->rt_{src,dst}

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03 20:53:12 -07:00
David S. Miller
31e4543db2 ipv4: Make caller provide on-stack flow key to ip_route_output_ports().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03 20:25:42 -07:00
David S. Miller
475949d8e8 ipv4: Renamt struct rtable's rt_tos to rt_key_tos.
To more accurately reflect that it is purely a routing
cache lookup key and is used in no other context.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03 19:45:15 -07:00
David S. Miller
417da66fa9 ipv4: Rework ipmr_rt_fib_lookup() flow key initialization.
Use information from the skb as much as possible, currently
this means daddr, saddr, and TOS.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03 19:42:43 -07:00
Lucian Adrian Grijincu
ff538818f4 sysctl: net: call unregister_net_sysctl_table where needed
ctl_table_headers registered with register_net_sysctl_table should
have been unregistered with the equivalent unregister_net_sysctl_table

Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02 16:12:14 -07:00
David S. Miller
5615787257 ipv4: Make sure flowi4->{saddr,daddr} are always set.
Slow path output route resolution always makes sure that
->{saddr,daddr} are set, and also if we trigger into IPSEC resolution
we initialize them as well, because xfrm_lookup() expects them to be
fully resolved.

But if we hit the fast path and flowi4->flowi4_proto is zero, we won't
do this initialization.

Therefore, move the IPSEC path initialization to the route cache
lookup fast path to make sure these are always set.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02 14:37:45 -07:00
Alexey Dobriyan
7cfd260910 ipv4: don't spam dmesg with "Using LC-trie" messages
fib_trie_table() is called during netns creation and
Chromium uses clone(CLONE_NEWNET) to sandbox renderer process.

Don't print anything.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-01 23:17:50 -07:00
Ben Hutchings
ad246c992b ipv4, ipv6, bonding: Restore control over number of peer notifications
For backward compatibility, we should retain the module parameters and
sysfs attributes to control the number of peer notifications
(gratuitous ARPs and unsolicited NAs) sent after bonding failover.
Also, it is possible for failover to take place even though the new
active slave does not have link up, and in that case the peer
notification should be deferred until it does.

Change ipv4 and ipv6 so they do not automatically send peer
notifications on bonding failover.

Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
notifications when the link is up, as many times as requested.  Since
it does not directly control which protocols send notifications, make
num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
the bonding version number and update its documentation.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-29 12:44:11 -07:00
David S. Miller
d4fb3d74d7 ipv4: Get route daddr from flow key in tcp_v4_connect().
Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 23:50:32 -07:00
David S. Miller
072d8c9414 ipv4: Get route daddr from flow key in inet_csk_route_req().
Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 23:50:09 -07:00