1. 17 Nov, 2008 10 commits
    • Eric Dumazet's avatar
      net: make sure struct dst_entry refcount is aligned on 64 bytes · 5635c10d
      Eric Dumazet authored
      As found in the past (commit f1dd9c37
      [NET]: Fix tbench regression in 2.6.25-rc1), it is really
      important that struct dst_entry refcount is aligned on a cache line.
      
      We cannot use __atribute((aligned)), so manually pad the structure
      for 32 and 64 bit arches.
      
      for 32bit : offsetof(truct dst_entry, __refcnt) is 0x80
      for 64bit : offsetof(truct dst_entry, __refcnt) is 0xc0
      
      As it is not possible to guess at compile time cache line size,
      we use a generic value of 64 bytes, that satisfies many current arches.
      (Using 128 bytes alignment on 64bit arches would waste 64 bytes)
      
      Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" dont
      break this alignment.
      
      "tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
      (2350 MB/s instead of 2250 MB/s)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5635c10d
    • Eric Dumazet's avatar
      rcu: documents rculist_nulls · 536533e6
      Eric Dumazet authored
      Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls'
      end-of-list can help in some RCU algos.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      536533e6
    • Eric Dumazet's avatar
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet authored
      RCU was added to UDP lookups, using a fast infrastructure :
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits to use few memory barriers.
      
      This patch uses same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
      using short lived TCP connections. A followup patch, converting
      rwlocks to spinlocks will even speedup this case.
      
      __inet_lookup_established() is pretty fast now we dont have to
      dirty a contended cache line (read_lock/read_unlock)
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ab5aee7
    • Eric Dumazet's avatar
      udp: Use hlist_nulls in UDP RCU code · 88ab1932
      Eric Dumazet authored
      This is a straightforward patch, using hlist_nulls infrastructure.
      
      RCUification already done on UDP two weeks ago.
      
      Using hlist_nulls permits us to avoid some memory barriers, both
      at lookup time and delete time.
      
      Patch is large because it adds new macros to include/net/sock.h.
      These macros will be used by TCP & DCCP in next patch.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      88ab1932
    • Eric Dumazet's avatar
      rcu: Introduce hlist_nulls variant of hlist · bbaffaca
      Eric Dumazet authored
      hlist uses NULL value to finish a chain.
      
      hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
      
      This allows to store many different end markers, so that some RCU lockless
      algos (used in TCP/UDP stack for example) can save some memory barriers in
      fast paths.
      
      Two new files are added :
      
      include/linux/list_nulls.h
        - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant
      
      include/linux/rculist_nulls.h
        - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
      
         Only four helpers are declared for the moment :
      
           hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
           hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
      
      prefetches() were removed, since an end of list is not anymore NULL value.
      prefetches() could trigger useless (and possibly dangerous) memory transactions.
      
      Example of use (extracted from __udp4_lib_lookup())
      
      	struct sock *sk, *result;
              struct hlist_nulls_node *node;
              unsigned short hnum = ntohs(dport);
              unsigned int hash = udp_hashfn(net, hnum);
              struct udp_hslot *hslot = &udptable->hash[hash];
              int score, badness;
      
              rcu_read_lock();
      begin:
              result = NULL;
              badness = -1;
              sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                      score = compute_score(sk, net, saddr, hnum, sport,
                                            daddr, dport, dif);
                      if (score > badness) {
                              result = sk;
                              badness = score;
                      }
              }
              /*
               * if the nulls value we got at the end of this lookup is
               * not the expected one, we must restart lookup.
               * We probably met an item that was moved to another chain.
               */
              if (get_nulls_value(node) != hash)
                      goto begin;
      
              if (result) {
                      if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                              result = NULL;
                      else if (unlikely(compute_score(result, net, saddr, hnum, sport,
                                        daddr, dport, dif) < badness)) {
                              sock_put(result);
                              goto begin;
                      }
              }
              rcu_read_unlock();
              return result;
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbaffaca
    • Balazs Scheidler's avatar
      TPROXY: implemented IP_RECVORIGDSTADDR socket option · e8b2dfe9
      Balazs Scheidler authored
      In case UDP traffic is redirected to a local UDP socket,
      the originally addressed destination address/port
      cannot be recovered with the in-kernel tproxy.
      
      This patch adds an IP_RECVORIGDSTADDR sockopt that enables
      a IP_ORIGDSTADDR ancillary message in recvmsg(). This
      ancillary message contains the original destination address/port
      of the packet being received.
      Signed-off-by: default avatarBalazs Scheidler <bazsi@balabit.hu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e8b2dfe9
    • Ben Greear's avatar
      ipv4: Fix ARP behavior with many mac-vlans · 8164f1b7
      Ben Greear authored
      Ben Greear wrote:
      > I have 500 mac-vlans on a system talking to 500 other
      > mac-vlans.  My problem is that the arp-table gets extremely
      > huge because every time an arp-request comes in on all mac-vlans,
      > a stale arp entry is added for each mac-vlan.  I have filtering
      > turned on, but that doesn't help because the neigh_event_ns call
      > below will cause a stale neighbor entry to be created regardless
      > of whether a replay will be sent or not.
      > Maybe the neigh_event code should be below the checks for dont_send,
      > and only create check neigh_event_ns if we are !dont_send?
      
      The attached patch makes it work much better for me.  The patch
      will cause the code to NOT create a stale neighbor entry if we
      are not going to respond to the ARP request.  The old code
      *would* create a stale entry even if we are not going to respond.
      Signed-off-by: default avatarBen Greear <greearb@candelatech.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8164f1b7
    • Alexander Duyck's avatar
      e1000e: enable ECC correction on 82571 silicon · 6ea7ae1d
      Alexander Duyck authored
      This change enables ECC correction for the packet buffer on all 82571
      silicon.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ea7ae1d
    • Paulius Zaleckas's avatar
      phylib: make mdio-gpio work without OF (v4) · f004f3ea
      Paulius Zaleckas authored
      make mdio-gpio work with non OpenFirmware gpio implementation.
      
      Aditional changes to mdio-gpio:
      - use gpio_request() and gpio_free()
      - place irq[] array in struct mdio_gpio_info
      - add module description, author and license
      - add note about compiling this driver as module
      - rename mdc and mdio function (were ugly names)
      - change MII to MDIO in bus name
      - add __init __exit to module (un)loading functions
      - probe fails if no phys added to the bus
      - kzalloc bitbang with sizeof(*bitbang)
      
      Changes since v3:
      - keep bus naming "%x" to be compatible with existing drivers.
      
      Changes since v2:
      - more #ifdefs reduction
      - platform driver will be registered on OF platforms also
      - unified platform and OF bus_id to phy%i
      
      Changes since v1:
      - removed NO_IRQ
      - reduced #idefs
      
      Laurent, please test this driver under OF.
      Signed-off-by: default avatarPaulius Zaleckas <paulius.zaleckas@teltonika.lt>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f004f3ea
    • Paulius Zaleckas's avatar
  2. 16 Nov, 2008 3 commits
  3. 14 Nov, 2008 4 commits
    • Eric Dumazet's avatar
      net: speedup dst_release() · ef711cf1
      Eric Dumazet authored
      During tbench/oprofile sessions, I found that dst_release() was in third position.
      
      CPU: Core 2, speed 2999.68 MHz (estimated)
      Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
      samples  %        symbol name
      483726    9.0185  __copy_user_zeroing_intel
      191466    3.5697  __copy_user_intel
      185475    3.4580  dst_release
      175114    3.2648  ip_queue_xmit
      153447    2.8608  tcp_sendmsg
      108775    2.0280  tcp_recvmsg
      102659    1.9140  sysenter_past_esp
      101450    1.8914  tcp_current_mss
      95067     1.7724  __copy_from_user_ll
      86531     1.6133  tcp_transmit_skb
      
      Of course, all CPUS fight on the dst_entry associated with 127.0.0.1 
      
      Instead of first checking the refcount value, then decrement it,
      we use atomic_dec_return() to help CPU to make the right memory transaction
      (ie getting the cache line in exclusive mode)
      
      dst_release() is now at the fifth position, and tbench a litle bit faster ;)
      
      CPU: Core 2, speed 3000.1 MHz (estimated)
      Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
      samples  %        symbol name
      647107    8.8072  __copy_user_zeroing_intel
      258840    3.5229  ip_queue_xmit
      258302    3.5155  __copy_user_intel
      209629    2.8531  tcp_sendmsg
      165632    2.2543  dst_release
      149232    2.0311  tcp_current_mss
      147821    2.0119  tcp_recvmsg
      137893    1.8767  sysenter_past_esp
      127473    1.7349  __copy_from_user_ll
      121308    1.6510  ip_finish_output
      118510    1.6129  tcp_transmit_skb
      109295    1.4875  tcp_v4_rcv
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef711cf1
    • Jarek Poplawski's avatar
      pkt_sched: Remove qdisc->ops->requeue() etc. · f30ab418
      Jarek Poplawski authored
      After implementing qdisc->ops->peek() and changing sch_netem into
      classless qdisc there are no more qdisc->ops->requeue() users. This
      patch removes this method with its wrappers (qdisc_requeue()), and
      also unused qdisc->requeue structure. There are a few minor fixes of
      warnings (htb_enqueue()) and comments btw.
      
      The idea to kill ->requeue() and a similar patch were first developed
      by David S. Miller.
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f30ab418
    • Petr Tesarik's avatar
      tcp: remove an unnecessary field in struct tcp_skb_cb · 38a7ddff
      Petr Tesarik authored
      The urg_ptr field is not used anywhere and is merely confusing.
      Signed-off-by: default avatarPetr Tesarik <ptesarik@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      38a7ddff
    • Harvey Harrison's avatar
      isdn: use %pI4, remove get_{u8/u16/u32} and put_{u8/u16/u32} inlines · 00bcd522
      Harvey Harrison authored
      They would have been better named as get_be16, put_be16, etc.
      as they were hiding an endian shift inside.
      
      They don't add much over explicitly coding the byteshifting
      and gcc sometimes has a problem with builtin_constant_p inside
      inline functions, so it may do a better job of byteswapping
      at compile time rather than runtime.
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      00bcd522
  4. 13 Nov, 2008 11 commits
  5. 12 Nov, 2008 9 commits
  6. 11 Nov, 2008 3 commits