Commits · afdc3238ec948531205f5c5f77d2de7bae519c71 · linux / linux-davinci

11 Jul, 2007 40 commits

[RTNETLINK]: Add nested compat attribute · afdc3238

Patrick McHardy authored Jun 25, 2007

Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.

The attribute looks like this:

struct {
        [ compat contents ]
        struct rtattr {
                .rta_len        = total size,
                .rta_type       = type,
        } rta;
        struct old_structure struct;

        [ nested top-level attribute ]
        struct rtattr {
                .rta_len        = nest size,
                .rta_type       = type,
        } nest_attr;

        [ optional 0 .. n nested attributes ]
        struct rtattr {
                .rta_len        = private attribute len,
                .rta_type       = private attribute type,
        } nested_attr;
        struct nested_data data;
};

Since both userspace and the kernel deal correctly with attributes that are
larger than expected, old versions will just parse the compat part and
ignore the rest.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
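
The layout above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct demo_rtattr`, `DEMO_ALIGN`, `struct demo_old_structure` and every `demo_*` function are hypothetical stand-ins, not the kernel's helpers, and the caller is assumed to pass a zeroed buffer that is large enough.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rtattr (header of one attribute). */
struct demo_rtattr {
	uint16_t rta_len;	/* header plus payload, without padding */
	uint16_t rta_type;
};

#define DEMO_ALIGN(len)	(((len) + 3u) & ~3u)	/* 4-byte alignment */
#define DEMO_HDRLEN	DEMO_ALIGN(sizeof(struct demo_rtattr))

/* Hypothetical legacy structure that old parsers expect as the payload. */
struct demo_old_structure {
	uint32_t flags;
};

/* Append one attribute at buf + off and return the new aligned offset. */
static size_t demo_put(unsigned char *buf, size_t off, uint16_t type,
		       const void *data, size_t len)
{
	struct demo_rtattr rta = { (uint16_t)(DEMO_HDRLEN + len), type };

	memcpy(buf + off, &rta, sizeof(rta));
	if (len)
		memcpy(buf + off + DEMO_HDRLEN, data, len);
	return off + DEMO_ALIGN(rta.rta_len);
}

/* Overwrite the rta_len field of the attribute header at buf + off. */
static void demo_set_len(unsigned char *buf, size_t off, size_t len)
{
	struct demo_rtattr rta;

	memcpy(&rta, buf + off, sizeof(rta));
	rta.rta_len = (uint16_t)len;
	memcpy(buf + off, &rta, sizeof(rta));
}

/* Build: compat part, nested top-level attribute, one nested attribute. */
static size_t demo_build_nested_compat(unsigned char *buf, uint16_t type,
				       const struct demo_old_structure *old,
				       uint16_t nested_type, uint32_t nested_val)
{
	size_t off, nest_start;

	off = demo_put(buf, 0, type, old, sizeof(*old));	/* compat contents */
	nest_start = off;
	off = demo_put(buf, off, type, NULL, 0);		/* nest header */
	off = demo_put(buf, off, nested_type, &nested_val, sizeof(nested_val));

	demo_set_len(buf, nest_start, off - nest_start);	/* nest size */
	demo_set_len(buf, 0, off);				/* total size */
	return off;
}

/* An old parser: accept anything at least as large as the old structure. */
static int demo_old_parse(const unsigned char *buf,
			  struct demo_old_structure *out)
{
	struct demo_rtattr rta;

	memcpy(&rta, buf, sizeof(rta));
	if (rta.rta_len < DEMO_HDRLEN + sizeof(*out))
		return -1;
	memcpy(out, buf + DEMO_HDRLEN, sizeof(*out));
	return 0;
}
```

In this sketch the old parser never looks past the compat part, which is the backwards-compatibility property the commit message relies on.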

[NETLINK]: attr: add nested compat attribute type · 1092cb21

Patrick McHardy authored Jun 25, 2007

Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[SKBUFF]: Keep track of writable header len of headerless clones · 334a8132

Patrick McHardy authored Jun 25, 2007

Currently NAT (and others) that want to modify cloned skbs copy them,
even if in the vast majority of cases it's not necessary because the
skb is a clone made by TCP and the portion NAT wants to modify is
actually writable because TCP releases the header reference before
cloning.

The problem is that there is no clean way for NAT to find out how
long the writable header area is, so this patch introduces skb->hdr_len
to hold this length. When a headerless skb is cloned skb->hdr_len
is set to the current headroom, for regular clones it is copied from
the original. A new function skb_clone_writable(skb, len) returns
whether the skb is writable up to len bytes from skb->data. To avoid
enlarging the skb the mac_len field is reduced to 16 bits and the
new hdr_len field is put in the remaining 16 bits.

I've done a few rough benchmarks of NAT (not with this exact patch,
but a very similar one). As expected it saves huge amounts of system
time in case of sendfile, bringing it down to basically the same
amount as without NAT. With sendmsg it only helps on loopback,
probably because of the large MTU.

Transmit a 1GB file using sendfile/sendmsg over eth0/lo with and
without NAT:

- sendfile eth0, no NAT:	sys     0m0.388s
- sendfile eth0, NAT:		sys     0m1.835s
- sendfile eth0, NAT + patch:	sys     0m0.370s	(~ -80%)

- sendfile lo, no NAT:		sys     0m0.258s
- sendfile lo, NAT:		sys     0m2.609s
- sendfile lo, NAT + patch:	sys     0m0.260s	(~ -90%)

- sendmsg eth0, no NAT:		sys     0m2.508s
- sendmsg eth0, NAT:		sys     0m2.539s
- sendmsg eth0, NAT + patch:	sys     0m2.445s	(no change)

- sendmsg lo, no NAT:		sys	0m2.151s
- sendmsg lo, NAT:		sys     0m3.557s
- sendmsg lo, NAT + patch:	sys     0m2.159s	(~ -40%)

I expect other users can see a similar performance improvement,
packet mangling iptables targets, ipip and ip_gre come to mind ..
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
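
The writability check described above can be modelled with a toy structure. This is a sketch under assumptions, not the kernel code: `struct demo_skb`, its fields and `demo_clone_writable()` are hypothetical, and the comparison is inferred from the description that hdr_len records the headroom at clone time.

```c
#include <stdbool.h>

/* Toy model of the fields involved; not the kernel's struct sk_buff. */
struct demo_skb {
	bool header_shared;	/* another owner still references the header */
	unsigned int headroom;	/* bytes between buffer start and data */
	unsigned int hdr_len;	/* writable header length recorded at clone */
};

/*
 * Returns true if the first len bytes starting at the data pointer may be
 * modified in place, i.e. they lie inside the writable header area.
 */
static bool demo_clone_writable(const struct demo_skb *skb, unsigned int len)
{
	return !skb->header_shared && skb->headroom + len <= skb->hdr_len;
}
```

With a recorded hdr_len of 128 and a current headroom of 74, for example, up to 54 bytes count as writable in this model, so a caller such as NAT could skip the copy when it only touches those bytes.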

[NET]: qdisc_restart - couple of optimizations. · e50c41b5

Krishna Kumar authored Jun 24, 2007

Changes :

- netif_queue_stopped need not be called inside qdisc_restart as
  it has been called already in qdisc_run() before the first skb
  is sent, and in __qdisc_run() after each intermediate skb is
  sent (note: we are the only sender, so the queue cannot get
  stopped while the tx lock is held in the ~LLTX case).

- BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
  meant more packets are available, and __qdisc_run used to loop
  when qdisc_restart() returned -1. During those days, it was
  necessary to make sure that qlen is never less than zero, since
  __qdisc_run would get into an infinite loop if no packets are on
  the queue and this bug in qdisc was there (and worse - no more
  skbs could ever get queued as we hold the queue lock too). With
  Herbert's recent change to return values, this check is not
  required.  Hopefully Herbert can validate this change. If this
  check is required at all, it should be added to skb_dequeue (in
  the failure case), and not to qdisc_qlen.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NET]: qdisc_restart - readability changes plus one bug fix. · 6c1361a6

Krishna Kumar authored Jun 24, 2007

New changes :

- Incorporated Peter Waskiewicz's comments.
- Re-added one warning message (on driver returning wrong value).

Previous changes :

- Converted to use switch/case code which looks neater.

- "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
  check should be removed, since driver will return NETDEV_TX_LOCKED only
  if lockless is true and driver has to do the locking. In the original
  code as well as the latest code, this code can result in a bug where
  if LLTX is not set for a driver (lockless == 0) but the driver is written
  wrongly to do a trylock (despite LLTX not being set), the driver returns
  LOCKED. But since lockless is zero, the packet is requeued instead of
  calling collision code which will issue warning and free up the skb.
  Instead this skb will be retried with this driver next time, and the same
  result will ensue. Removing this check will catch these driver bugs instead
  of hiding the problem. I am keeping this change to readability section
  since :
  	a. it is confusing to check two things as it is; and
  	b. it is difficult to keep this check in the changed 'switch' code.

- Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
  the work being done and easier to understand) and do_dev_requeue to
  dev_requeue_skb, merged handle_dev_cpu_collision and tx_islocked to
  dev_handle_collision (handle_dev_cpu_collision is a small routine with only
  one caller, so there is no need to have two separate routines, which also
  results in getting rid of two macros), etc.

- Removed an XXX comment as it should never fail (I suspect this was related
  to batch skb WIP, Jamal ?). Converted some functions to original coding
  style of having the return values and the function name on same line, eg
  prio2list.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
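
The switch/case dispatch on the driver's return value, without the lockless check, might look roughly like the sketch below. The enum names, their values and `demo_handle_tx_result()` are hypothetical and chosen only for illustration; they are not the kernel's constants or functions.

```c
/* Hypothetical return codes of a driver's transmit routine. */
enum demo_tx {
	DEMO_TX_OK,
	DEMO_TX_BUSY,
	DEMO_TX_LOCKED
};

/* Hypothetical follow-up actions taken by the caller. */
enum demo_action {
	DEMO_DONE,		/* packet was sent */
	DEMO_REQUEUE,		/* put the packet back on the queue */
	DEMO_COLLISION,		/* run the collision handling code */
	DEMO_WARN_REQUEUE	/* unexpected value: warn, then requeue */
};

static enum demo_action demo_handle_tx_result(int ret)
{
	switch (ret) {
	case DEMO_TX_OK:
		return DEMO_DONE;
	case DEMO_TX_LOCKED:
		/* No lockless check here, so a driver that wrongly
		 * returns LOCKED is caught by the collision code. */
		return DEMO_COLLISION;
	case DEMO_TX_BUSY:
		return DEMO_REQUEUE;
	default:
		return DEMO_WARN_REQUEUE;
	}
}
```

The point of the design choice is that a LOCKED result always reaches the collision path, so a misbehaving driver is reported rather than having its packet requeued forever.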

[CCID3]: Fix a bug in the send time processing · 49d66a70

Gerrit Renker authored Jun 16, 2007

ccid3_hc_tx_send_packet currently returns 0 when the time difference between
current time and t_nom is less than 1000 microseconds.

In this case the packet is sent immediately; but, unlike other packets that can
be emitted on first attempt, it will not have its window counter updated and
its options set as required. This is a bug.

Fix: Require the time difference to be at least 1000 microseconds. The
algorithm then converges: time differences > 1000 microseconds trigger the
timer in dccp_write_xmit; after timer expiry this function is tried again; when
the time difference is less than 1000, the packet will have its options added
and window counter updated as required.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
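
The fixed decision can be sketched as a small pure function. This is a simplified model under assumptions: `demo_send_delay_ms()` is a hypothetical name, times are plain microsecond counters, and the real function takes more state into account than is shown here.

```c
#include <stdint.h>

/*
 * Returns the delay in milliseconds the caller should wait before trying
 * again, or 0 if the packet may be sent now. In the 0 case the caller is
 * expected to go through the normal send path, which updates the window
 * counter and sets the options.
 */
static uint32_t demo_send_delay_ms(int64_t now_us, int64_t t_nom_us)
{
	int64_t delay_us = t_nom_us - now_us;

	if (delay_us >= 1000)		/* at least 1000 microseconds */
		return (uint32_t)(delay_us / 1000);
	return 0;			/* send now, with full processing */
}
```

Because any returned delay is at least one millisecond, the caller's timer always makes progress, and the retry eventually falls below the threshold and takes the send path.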

[CCID3]: Sending time: update to ktime_t · 8132da4d

Gerrit Renker authored Jun 16, 2007

This updates the computation of t_nom and t_last_win_count to use the newer
gettimeofday interface.

Committer note: used ktime_to_timeval to set the 'now' variable to t_ld in
                ccid3hctx_no_feedback_timer
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[KTIME]: Introduce ktime_add_us · 1e180f72

Arnaldo Carvalho de Melo authored Jun 16, 2007

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[KTIME]: Introduce ktime_us_delta · f1c91da4

Gerrit Renker authored Jun 16, 2007

This provides a reusable time difference function which returns the difference in
microseconds, as often used in the DCCP code.

Committer note: renamed ktime_delta to ktime_us_delta and put it in ktime.h.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
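
A microsecond difference helper of this kind can be sketched as follows, together with a matching add helper in the spirit of ktime_add_us. The sketch assumes a time value stored as a signed 64-bit nanosecond count; `demo_ktime_t` and both `demo_*` functions are hypothetical stand-ins, not the kernel definitions.

```c
#include <stdint.h>

/* Assumption for this sketch: time is a signed nanosecond counter. */
typedef int64_t demo_ktime_t;

#define DEMO_NSEC_PER_USEC 1000

/* Difference later - earlier, expressed in microseconds. */
static int64_t demo_ktime_us_delta(demo_ktime_t later, demo_ktime_t earlier)
{
	return (later - earlier) / DEMO_NSEC_PER_USEC;
}

/* Time value advanced by usec microseconds. */
static demo_ktime_t demo_ktime_add_us(demo_ktime_t kt, uint64_t usec)
{
	return kt + (int64_t)usec * DEMO_NSEC_PER_USEC;
}
```

Having the subtraction and the unit conversion in one helper keeps callers from repeating the same arithmetic at every call site.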

loss_interval: make struct dccp_li_hist_entry private · dd36a9ab

Arnaldo Carvalho de Melo authored May 28, 2007

net/dccp/ccids/lib/loss_interval.c is the only place where this struct is used.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

loss_interval: Nuke dccp_li_hist · cc4d6a3a

Arnaldo Carvalho de Melo authored May 28, 2007

It had just a slab cache, so for the sake of simplicity just make the
dccp_tfrc_lib module init routine create the slab cache; there is no need for
users of the lib to create a private loss_interval object.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

loss_interval: Make dccp_li_hist_entry_{new,delete} private · c70b729e

Arnaldo Carvalho de Melo authored May 28, 2007

Not used outside the loss_interval code anymore.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

loss_interval: unexport dccp_li_hist_interval_new · 8c281780

Arnaldo Carvalho de Melo authored May 28, 2007

Now it's only used inside the loss_interval code.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

[DCCP] loss_interval: Move ccid3_hc_rx_update_li to loss_interval · cc0a910b

Arnaldo Carvalho de Melo authored Jun 14, 2007

Renaming it to dccp_li_update_li.

Also based on previous work by Ian McDonald.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[CCID3]: Pass ccid3_li_hist to ccid3_hc_rx_update_li · 878ac600

Arnaldo Carvalho de Melo authored Jun 14, 2007

Now ccid3_hc_rx_update_li is ready to be moved to
net/dccp/ccids/lib/loss_interval, it uses the same interface as the other
functions there.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Remove accesses to ccid3_hc_rx_sock in ccid3_hc_rx_{update,calc_first}_li · d83258a3

Arnaldo Carvalho de Melo authored May 28, 2007

This is a preparatory patch for moving these loss interval functions from
net/dccp/ccids/ccid3.c to net/dccp/ccids/lib/loss_interval.c.

Based on a patch by Ian McDonald.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

loss_interval: Fix timeval initialisation · 6bc7efe8

Ian McDonald authored May 28, 2007

When compiling with EXTRA_CFLAGS=-W, noticed that tstamp is not initialised
correctly in dccp_li_calc_first_li.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>

Fix dccp_sum_coverage · e961811f

Ian McDonald authored May 28, 2007

When compiling with EXTRA_CFLAGS=-W, noticed that we have a signed/unsigned
issue in dccp.h.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>

ccid3: Update copyrights · b2f41ff4

Ian McDonald authored May 28, 2007

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

[VLAN]: Use rtnl_link API · 07b5b17e

Patrick McHardy authored Jun 13, 2007

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Introduce symbolic constants for flag values · a4bf3af4

Patrick McHardy authored Jun 13, 2007

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Keep track of number of QoS mappings · b020cb48

Patrick McHardy authored Jun 13, 2007

Keep track of the number of configured ingress/egress QoS mappings to
avoid iteration while calculating the netlink attribute size.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Use 32 bit value for skb->priority mapping · 734423cf

Patrick McHardy authored Jun 13, 2007

skb->priority has only 32 bits and even VLAN uses 32 bit values in its API.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Return proper error codes in register_vlan_device · 2ae0bf69

Patrick McHardy authored Jun 13, 2007

The returned device is unused, return proper error codes instead and avoid
having the ioctl handler guess the error.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Move device registration to separate function · e89fe42c

Patrick McHardy authored Jun 13, 2007

Move device registration and configuration of the underlying device to a
separate function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Split up device checks · c1d3ee99

Patrick McHardy authored Jun 13, 2007

Move the checks of the underlying device to a separate function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Move vlan_group allocation to separate function · 42429aae

Patrick McHardy authored Jun 13, 2007

Move group allocation to a separate function to clean up the code a bit
and allocate groups before registering the device. Device registration
is globally visible and causes netlink events, so we shouldn't fail
afterwards.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Move some device initialization code to dev->init callback · 2f4284a4

Patrick McHardy authored Jun 13, 2007

Move some device initialization code to new dev->init callback to make
it shareable with netlink. Additionally this fixes a minor bug: dev->iflink
was set after registration, which caused an incorrect value in the initial
netlink message.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[VLAN]: Convert name-based configuration functions to struct netdevice * · c17d8874

Patrick McHardy authored Jun 13, 2007

Move the device lookup and checks to the ioctl handler under the RTNL and
change all name-based interfaces to take a struct net_device * instead.

This allows them to be used from a netlink interface, which identifies devices
based on ifindex not name. It also avoids races between the ioctl interface
and the (upcoming) netlink interface since now all changes happen under the
RTNL.

As a nice side effect this greatly simplifies error handling in the helper
functions and fixes a number of incorrect error codes like -EINVAL for
device not found.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[IFB]: Use rtnl_link API · 9ba2cd65

Patrick McHardy authored Jun 13, 2007

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[IFB]: Keep ifb devices on list · 62b7ffca

Patrick McHardy authored Jun 13, 2007

Use a list instead of an array to allow creating new devices.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[DUMMY]: Use rtnl_link API · 5d5cb173

Patrick McHardy authored Jun 13, 2007

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[DUMMY]: Keep dummy devices on list · 206c9fb2

Patrick McHardy authored Jun 13, 2007

Use a list instead of an array to allow creating new devices.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[DUMMY]: Use dev->stats · 58651b24

Patrick McHardy authored Jun 13, 2007

Use dev->stats instead of netdev_priv().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[RTNETLINK]: Link creation API · 38f7b870

Patrick McHardy authored Jun 13, 2007

Add rtnetlink API for creating, changing and deleting software devices.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[RTNETLINK]: Split up rtnl_setlink · 0157f60c

Patrick McHardy authored Jun 13, 2007

Split up rtnl_setlink into a function performing validation and a function
performing the actual changes. This allows the modification logic to be
shared with rtnl_newlink, which is introduced by the next patch.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NET]: Mark struct net_device * argument to netdev_priv const · 6472ce60

Patrick McHardy authored Jun 13, 2007

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[MAC80211]: Add support for SIOCGIWRATE ioctl · b3d88ad4

Larry Finger authored Jun 10, 2007

At present, transmission rate information for mac80211 is available only
if verbose debugging is turned on, and then only in the logs. This patch
implements the SIOCGIWRATE ioctl, which adds the current transmission rate to
the output of iwconfig.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NET]: Kill eth_copy_and_sum(). · 8c7b7faa

David S. Miller authored Jul 10, 2007

It hasn't "summed" anything in over 7 years, and it's
just a straight memcpy a la skb_copy_to_linear_data(),
so just get rid of it.
Signed-off-by: David S. Miller <davem@davemloft.net>

[TCPv4]: Improve BH latency in /proc/net/tcp · a7ab4b50

Herbert Xu authored Jun 10, 2007

Currently the code for /proc/net/tcp disables BH while iterating
over the entire established hash table.  Even though we call
cond_resched_softirq for each entry, we still won't process
softirqs as regularly as we would otherwise do, which results
in poor performance when the system is loaded near capacity.

This anomaly comes from the 2.4 code where this was all in a
single function and the local_bh_disable might have made sense
as a small optimisation.

The cost of each local_bh_disable is so small when compared
against the increased latency in keeping it disabled over a
large but mostly empty TCP established hash table that we
should just move it to the individual read_lock/read_unlock
calls as we do in inet_diag.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
