Based upon a lockdep report.
Since ->poll() can be invoked from netpoll with interrupts
disabled, we must not unconditionally enable interrupts
in napi_complete().
Instead we must use local_irq_{save,restore}().
Noticed by Peter Zijlstra:
<irqs disabled>
netpoll_poll()
poll_napi()
spin_trylock(&napi->poll_lock)
poll_one_napi()
napi->poll() := sky2_poll()
napi_complete()
local_irq_disable()
local_irq_enable() <--- *BUG*
<irq>
irq_exit()
do_softirq()
net_rx_action()
spin_lock(&napi->poll_lock) <--- Deadlock!
Because we still hold the lock....
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 BEET output function is incorrectly including the inner
header in the payload to be protected. This causes a crash as
the packet doesn't actually have that many bytes for a second
header.
The IPv4 BEET output on the other hand is broken when it comes
to handling an inner IPv6 header since it always assumes an
inner IPv4 header.
This patch fixes both by making sure that neither BEET output
function touches the inner header at all. All access is now
done through the protocol-independent cb structure. Two new
attributes are added to make this work, the IP header length
and the IPv4 option length. They're filled in by the inner
mode's output function.
Thanks to Joakim Koskela for finding this problem.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Proxy neighbors do not have any reference counting, so any caller
of pneigh_lookup (unless it's a netlink triggered add/del routine)
should _not_ perform any actions on the found proxy entry.
There's one exception from this rule - the ipv6's ndisc_recv_ns()
uses found entry to check the flags for NTF_ROUTER.
This creates a race between the ndisc and pneigh_delete - after
the pneigh is returned to the caller, the nd_tbl.lock is dropped
and the deleting procedure may proceed.
One of the fixes would be to add a reference counting, but this
problem exists for ndisc only. Besides such a patch would be too
big for -rc4.
So I propose to introduce a __pneigh_lookup() which is supposed
to be called with the lock held and use it in ndisc code to check
the flags on alive pneigh entry.
Changes from v2:
As David noticed, Exported the __pneigh_lookup() to ipv6 module.
The checkpatch generates a warning on it, since the EXPORT_SYMBOL
does not follow the symbol itself, but in this file all the
exports come at the end, so I decided no to break this harmony.
Changes from v1:
Fixed comments from YOSHIFUJI - indentation of prototype in header
and the pndisc_check_router() name - and a compilation fix, pointed
by Daniel - the is_routed was (falsely) considered as uninitialized
by gcc.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduced by 270637abff
("[SCTP]: Fix a race between module load and protosw access")
Reported by Gabriel C:
In file included from net/sctp/sm_statetable.c:50:
include/net/sctp/sctp.h: In function 'sctp_v6_pf_init':
include/net/sctp/sctp.h:392: warning: 'return' with a value, in function returning void
In file included from net/sctp/sm_statefuns.c:62:
include/net/sctp/sctp.h: In function 'sctp_v6_pf_init':
include/net/sctp/sctp.h:392: warning: 'return' with a value, in function returning void
...
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race is SCTP between the loading of the module
and the access by the socket layer to the protocol functions.
In particular, a list of addresss that SCTP maintains is
not initialized prior to the registration with the protosw.
Thus it is possible for a user application to gain access
to SCTP functions before everything has been initialized.
The problem shows up as odd crashes during connection
initializtion when we try to access the SCTP address list.
The solution is to refactor how we do registration and
initialize the lists prior to registering with the protosw.
Care must be taken since the address list initialization
depends on some other pieces of SCTP initialization. Also
the clean-up in case of failure now also needs to be refactored.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
ahci: Add Marvell 6121 SATA support
pata_ali: use atapi_cmd_type() to determine cmd type instead of transfer size
ahci: implement skip_host_reset parameter
ahci: request all PCI BARs
devres: implement pcim_iomap_regions_request_all()
libata-acpi: improve dock event handling
Some drivers need to reserve all PCI BARs to prevent other drivers
misusing unoccupied BARs. pcim_iomap_regions_request_all() requests
all BARs and iomap specified BARs.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
There is a race in virtio_net, dealing with disabling/enabling the callback.
I saw the following oops:
kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:218!
illegal operation: 0001 [#1] SMP
Modules linked in: sunrpc dm_mod
CPU: 2 Not tainted 2.6.25-rc1zlive-host-10623-gd358142-dirty #99
Process swapper (pid: 0, task: 000000000f85a610, ksp: 000000000f873c60)
Krnl PSW : 0404300180000000 00000000002b81a6 (vring_disable_cb+0x16/0x20)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
Krnl GPRS: 0000000000000001 0000000000000001 0000000010005800 0000000000000001
000000000f3a0900 000000000f85a610 0000000000000000 0000000000000000
0000000000000000 000000000f870000 0000000000000000 0000000000001237
000000000f3a0920 000000000010ff74 00000000002846f6 000000000fa0bcd8
Krnl Code: 00000000002b819a: a7110001 tmll %r1,1
00000000002b819e: a7840004 brc 8,2b81a6
00000000002b81a2: a7f40001 brc 15,2b81a4
>00000000002b81a6: a51b0001 oill %r1,1
00000000002b81aa: 40102000 sth %r1,0(%r2)
00000000002b81ae: 07fe bcr 15,%r14
00000000002b81b0: eb7ff0380024 stmg %r7,%r15,56(%r15)
00000000002b81b6: a7f13e00 tmll %r15,15872
Call Trace:
([<000000000fa0bcd0>] 0xfa0bcd0)
[<00000000002b8350>] vring_interrupt+0x5c/0x6c
[<000000000010ab08>] do_extint+0xb8/0xf0
[<0000000000110716>] ext_no_vtime+0x16/0x1a
[<0000000000107e72>] cpu_idle+0x1c2/0x1e0
The problem can be triggered with a high amount of host->guest traffic.
I think its the following race:
poll says netif_rx_complete
poll calls enable_cb
enable_cb opens the interrupt mask
a new packet comes, an interrupt is triggered----\
enable_cb sees that there is more work |
enable_cb disables the interrupt |
. V
. interrupt is delivered
. skb_recv_done does atomic napi test, ok
some waiting disable_cb is called->check fails->bang!
.
poll would do napi check
poll would do disable_cb
The fix is to let enable_cb not disable the interrupt again, but expect the
caller to do the cleanup if it returns false. In that case, the interrupt is
only disabled, if the napi test_set_bit was successful.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned up doco)
Commit a0c1e9073e added code to futex.c
to detect whether futex_atomic_cmpxchg_inatomic was implemented at run
time:
+ curval = cmpxchg_futex_value_locked(NULL, 0, 0);
+ if (curval == -EFAULT)
+ futex_cmpxchg_enabled = 1;
This is bogus on parisc, since page zero in kernel virtual space is the
gateway page for syscall entry, and should not be read from the kernel.
(That, and we really don't like the kernel faulting on its own address
space...)
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
Commit 721fdf3416 introduced a subtle bug
by accidently removing the "static" from iodc_dbuf. This resulted in, what
appeared to be, a trap without *current set to a task. Probably the result of
a trap in real mode while calling firmware.
Also do other misc clean ups. Since the only input from firmware is non
blocking, share iodc_dbuf between input and output, and spinlock the
only callers.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
The comments in the definition of struct export_operations don't match the
current members.
Add a comment for the 2 new functions and remove 2 comments for unused ones.
Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro wrote:
>
> After that commit in asm-h8300/uaccess.h we have
>
> #define get_user(x, ptr) \
> ({ \
> int __gu_err = 0; \
> uint32_t __gu_val = 0; \
> ^^^^^^^^^^^^^^^^^
> switch (sizeof(*(ptr))) { \
> case 1: \
> case 2: \
> case 4: \
> __gu_val = *(ptr); \
> break; \
> case 8: \
> memcpy(&__gu_val, ptr, sizeof (*(ptr))); \
> ^^^^^^^^^^^^^^^^
>
> which, of course, is FUBAR whenever we actually hit that case - memcpy of
> 8 bytes into uint32_t is obviously wrong. Why don't we simply do
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PCI busses can be registered multiple times, so we need to detect if we
have registered our bus structure in sysfs already. If so, don't do it
again.
Thanks to Guennadi Liakhovetski <g.liakhovetski@gmx.de> for reporting
the problem, and to Linus for poking me to get me to believe that it was
a real problem.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Comparing with kernel 2.6.24, tbench result has regression with
2.6.25-rc1.
1) On 2 quad-core processor stoakley: 4%.
2) On 4 quad-core processor tigerton: more than 30%.
bisect located below patch.
b4ce92775c is first bad commit
commit b4ce92775c
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue Nov 13 21:33:32 2007 -0800
[IPV6]: Move nfheader_len into rt6_info
The dst member nfheader_len is only used by IPv6. It's also currently
creating a rather ugly alignment hole in struct dst. Therefore this patch
moves it from there into struct rt6_info.
Above patch changes the cache line alignment, especially member
__refcnt. I did a testing by adding 2 unsigned long pading before
lastuse, so the 3 members, lastuse/__refcnt/__use, are moved to next
cache line. The performance is recovered.
I created a patch to rearrange the members in struct dst_entry.
With Eric and Valdis Kletnieks's suggestion, I made finer arrangement.
1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So
sizeof(dst_entry)=200 no matter if CONFIG_NET_CLS_ROUTE=y/n. I
tested many patches on my 16-core tigerton by moving tclassid to
different place. It looks like tclassid could also have impact on
performance. If moving tclassid before metrics, or just don't move
tclassid, the performance isn't good. So I move it behind metrics.
2) Add comments before __refcnt.
On 16-core tigerton:
If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18%
better than the one without the patch;
If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30%
better than the one without the patch.
With 32bit 2.6.25-rc1 on 8-core stoakley, the new patch doesn't
introduce regression.
Thank Eric, Valdis, and David!
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building drivers/macintosh/mediabay.c if CONFIG_ADB_PMU isn't
defined we get:
drivers/built-in.o: In function `media_bay_step':
mediabay.c:(.text+0x92b84): undefined reference to `pmu_suspend'
mediabay.c:(.text+0x92c08): undefined reference to `pmu_resume'
Create empty place holders in that scenario.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
pmu_sys_suspended is declared extern when:
defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
but only defined when:
defined(CONFIG_SUSPEND) && defined(CONFIG_PPC32)
which is wrong. Let's fix that.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
[SCTP]: Fix local_addr deletions during list traversals.
net: fix build with CONFIG_NET=n
[TCP]: Prevent sending past receiver window with TSO (at last skb)
rt2x00: Add new D-Link USB ID
rt2x00: never disable multicast because it disables broadcast too
libertas: fix the 'compare command with itself' properly
drivers/net/Kconfig: fix whitespace for GELIC_WIRELESS entry
[NETFILTER]: nf_queue: don't return error when unregistering a non-existant handler
[NETFILTER]: nfnetlink_queue: fix EPERM when binding/unbinding and instance 0 exists
[NETFILTER]: nfnetlink_log: fix EPERM when binding/unbinding and instance 0 exists
[NETFILTER]: nf_conntrack: replace horrible hack with ksize()
[NETFILTER]: nf_conntrack: add \n to "expectation table full" message
[NETFILTER]: xt_time: fix failure to match on Sundays
[NETFILTER]: nfnetlink_log: fix computation of netlink skb size
[NETFILTER]: nfnetlink_queue: fix computation of allocated size for netlink skb.
[NETFILTER]: nfnetlink: fix ifdef in nfnetlink_compat.h
[NET]: include <linux/types.h> into linux/ethtool.h for __u* typedef
[NET]: Make /proc/net a symlink on /proc/self/net (v3)
RxRPC: fix rxrpc_recvmsg()'s returning of msg_name
net/enc28j60: oops fix
...
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
[MIPS] Clocksource: Only install r4k counter as clocksource if present.
[MIPS] Lasat: fix LASAT_CASCADE_IRQ
[MIPS] Delete leftovers of old pcspeaker support.
[MIPS] BCM1480: Init pci controller io_map_base
[MIPS] Yosemite: Fix a few more section reference bugs.
[MIPS] Fix yosemite build error
[MIPS] Fix loads of section missmatches
[MIPS] IP27: Tighten up CPU description to fix warnings.
[MIPS] Fix plat_ioremap for JMR3927
[MIPS] Export __ucmpdi2 to modules.
[MIPS] Fix typo in comment
[MIPS] Use KBUILD_DEFCONFIG
[MIPS] Allow 48Hz to be selected if CONFIG_SYS_SUPPORTS_ARBIT_HZ is set.
[MIPS] Added missing cases for rdhwr emulation
[MIPS] Alchemy: Fix ids in Alchemy db dma device table
It was all wrapped in '#ifdef CONFIG_BLOCK' anyway, so userspace was
getting nothing useful out of it. And the special #ifndef __KERNEL__
version of 'struct partition' makes me inclined to promote an attitude
of violence...
Stick some comments on some of the #endifs too, while we're at it.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduced in commit-id 9e2779fa28 and
ifdef'ed out for nommu in 8ca3ed87db, both
approaches end up breaking the nommu build in different ways. An
impressive feat for a 2-liner.
Current is_vmalloc_addr() users fall in to two camps:
- Determining whether to use vfree()/kfree()
- Whether to do vmlist traversal (only /proc/kcore).
Since we don't support /proc/kcore on nommu, that leaves the
vfree()/kfree() determination use cases. nommu vfree() happens to be a
wrapper to kfree() anyways, so is_vmalloc_addr() can always return 0
and end up with the right behaviour.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
quicklists cause a serious memory leak on 32-bit x86,
as documented at:
http://bugzilla.kernel.org/show_bug.cgi?id=9991
the reason is that the quicklist pool is a special-purpose
cache that grows out of proportion. It is not accounted for
anywhere and users have no way to even realize that it's
the quicklists that are causing RAM usage spikes. It was
supposed to be a relatively small pool, but as demonstrated
by KOSAKI Motohiro, they can grow as large as:
Quicklists: 1194304 kB
given how much trouble this code has caused historically,
and given that Andrew objected to its introduction on x86
(years ago), the best option at this point is to remove them.
[ any performance benefits of caching constructed pgds should
be implemented in a more generic way (possibly within the page
allocator), while still allowing constructed pages to be
allocated by other workloads. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6:
drivers: fix dma_get_required_mask
firmware: provide stubs for the FW_LOADER=n case
nozomi: fix initialization and early flow control access
sysdev: fix problem with sysdev_class being re-registered
Additional input received from JMicron on MemoryStick host interfaces showed
that some assumtions in fifo handling code were incorrect. This patch also
fixes data corruption used to occure during PIO transfers.
Signed-off-by: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bus driver may need to be informed that host is being suspended/resumed.
Signed-off-by: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to some input from kind people at JMicron it is now possible to have
more correct definitions of protocol structures and bit field semantics.
Signed-off-by: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This macro is used to define tables, not to declare them.
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>