Network Working Group                                           B. Zhang
Internet-Draft                                          Univ. of Arizona
Intended status: Informational                                  L. Zhang
Expires: September 10, 2009                                         UCLA
                                                           March 9, 2009


              Evolution Towards Global Routing Scalability
                      draft-zhang-evolution-01.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.  This document may contain material
   from IETF Documents or IETF Contributions published or made publicly
   available before November 10, 2008.  The person(s) controlling the
   copyright in some of this material may not have granted the IETF
   Trust the right to allow modifications of such material outside the
   IETF Standards Process.  Without obtaining an adequate license from
   the person(s) controlling the copyright in such materials, this
   document may not be modified outside the IETF Standards Process, and
   derivative works of it may not be created outside the IETF Standards
   Process, except to format it for publication as an RFC or to
   translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 10, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Zhang & Zhang          Expires September 10, 2009               [Page 1]

Internet-Draft                 Scaling BGP                    March 2009


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   Internet routing scalability has long been considered a serious
   problem.  Over the years many efforts have been devoted to addressing
   this problem; however, the IETF community as a whole has yet to reach
   a shared understanding of the best way forward.  We step up a level
   to re-examine the problem and the ongoing efforts, and conclude that,
   to effectively solve the routing scalability problem, we first need a
   clear understanding of how to introduce solutions into the Internet,
   which is a globally deployed system.  In this draft we sketch out our
   reasoning on the need for an evolutionary path towards scaling the
   global routing system, instead of attempting a new design.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Difficulties in Deploying New Solutions  . . . . . . . . . . .  4
   3.  An Evolutionary Path towards Scalable Routing  . . . . . . . .  6
     3.1.  Stage One: Reducing FIB Size . . . . . . . . . . . . . . .  7
     3.2.  Stage Two: Reducing Multi-AS Virtual Aggregation
           Overhead . . . . . . . . . . . . . . . . . . . . . . . . .  9
     3.3.  Stage Three: Reducing RIB Size . . . . . . . . . . . . . . 10
     3.4.  Stage Four: Insulating the Core from Edge Churns . . . . . 11
     3.5.  Summary  . . . . . . . . . . . . . . . . . . . . . . . . . 12
   4.  Evolution versus Incremental Deployability . . . . . . . . . . 12
   5.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 14
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 14
   7.  Informative References . . . . . . . . . . . . . . . . . . . . 15
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 16


1.  Introduction

   Internet routing scalability has been a long-standing problem.  Over
   the years many efforts, including our own, have been devoted to
   solving this problem.  Our earlier effort led to the development of a
   new routing architecture design dubbed SIRA, A Secure and Scalable
   Internet Routing Architecture [SIRA].  Since the 2006 IAB Workshop on
   Internet Routing and Addressing [RFC4984], new IRTF/IETF efforts have
   been devoted to developing a new routing architecture that can
   provide effective control over the routing system growth.  A number
   of proposals have been put on the table [RRG].  We worked out the
   specifics of SIRA and renamed it APT [APT].  Some proposals have even
   produced running code [LISP].  Yet no clear consensus has emerged in
   the community as to which proposal(s) may have a good chance of being
   deployed, or what the best way forward is.

   Assuming the routing scalability problem is real and we can find a
   new design that is technically sound, why is it so difficult to agree
   on deploying a new design that can solve the problem?  We put in the
   effort to understand the fundamental roadblocks in rolling out APT,
   and came to a new understanding of the problem at hand: when facing a
   problem, as engineers we naturally tend to design a new system to
   solve it, hoping that the new design will be rolled out to replace
   the old problematic one.  This kind of "solution by new design"
   approach can be effective in solving problems at small scale (e.g.,
   one could easily replace an old computer with a new one), but it does
   not work for the deployed Internet.  Instead, Internet-scale systems
   need to resolve problems through an evolutionary path, not a
   revolutionary new design.

   In this draft we first discuss the major difficulties in rolling out
   a new design to solve the global routing scalability problem.  These
   difficulties show that the Internet routing infrastructure needs an
   evolutionary path to move forward.  We then sketch out a solution
   scenario towards resolving the routing scalability problem.  We also
   draw a distinction between an evolutionary process towards a final
   solution direction and "an incrementally deployable new design".


2.  Difficulties in Deploying New Solutions

   Two of the few fundamental properties of the Internet are its
   distributed governance and its diversity along multiple dimensions.
   The Internet is an interconnection of tens of thousands of
   independently administered networks, each with its own budget,
   planning, business models and operational practices.  As a result,
   not everyone shares the same view as far as routing scalability is
   concerned.  For example, many customer networks and small regional
   providers do not carry the full BGP routing table internally; instead
   they only propagate internal routes inside their networks and use
   default routing to reach the rest of the Internet through one or a
   few exit points.  On the other hand, large networks in general carry
   the full routing table internally for efficient data delivery to a
   large number of destinations.  In this case, the former may not feel
   the pain of routing table growth but the latter may.

   Even among the networks that do carry the full routing table inside,
   some (such as content providers) are able to upgrade their routing
   infrastructure every few years to keep up with the demand of the
   ever-growing BGP table; others may not be able to afford doing so.
   For example, we have learned from a few large ISPs that, although
   they may be able to upgrade their (relatively small number of) core
   routers with the latest technology that can handle a million or more
   routes, they cannot afford to upgrade all their edge routers, which
   may number a thousand or more, even though some of them are 10 or
   more years old.  As a result, some networks may encounter the routing
   scalability problem years earlier than others; some may experience
   severe problems while others may not see the problem at all.  Even
   within the same network, some routers can handle the increasing
   routing table size while others cannot.  Several incidents have
   occurred recently that were caused by edge router RIB overflow.
   Although these incidents may have been triggered by other problems
   (e.g., route leaks) that inflated the RIB size, they did show that a
   large RIB can easily push old edge routers off the cliff.

   Therefore, although finding a way to put routing table size under
   control may be viewed as necessary in the long run, different
   networks can have different degrees of incentive to solve the
   problem, and some may not see a need to take any action towards
   fixing the problem for the time being.

   Yet another important issue is network economics.  A new solution
   design usually calls for a software upgrade or even new hardware,
   both of which require additional investment as well as new expertise
   in managing and troubleshooting the new technology.  The
   affordability of deploying a new design varies greatly among
   different networks.  Even if a network suffers pain from the growing
   routing table size, it still may not be able to deploy a new solution
   if the cost is considered prohibitively high.  Instead, people tend
   to look for simple tweaks to the existing systems that can provide
   effective relief from the RIB/FIB growth pressure.  One such simple
   patch was presented at the October 2008 NANOG meeting [NANOG44].

   Each network makes its own business decision on whether to deploy a
   new design, based on its evaluation of the severity of the problem
   and on whether it can afford to deploy the solution.  Given the scale
   and diversity of the Internet, it is certain that the buy-in for any
   new solution will not be uniform.  Even for those networks that
   require a solution to handle routing scalability, the deployment will
   likely be a gradual process with several stages.  Furthermore, the
   day when *all* networks have deployed a new solution may never come.

   To summarize: we see that

   o  Different parties have different perceptions regarding the routing
      scalability problem due to different operational practices; some
      are yet to be convinced that the routing scalability problem is
      serious [BGP2008].
   o  For networks where the routing scalability problem shows up, its
      severity can differ from router to router.
   o  Networks that experience routing scalability problems are also
      likely to have affordability concerns for new solutions.
   o  If any new solution gets rolled out, it is certain to start from
      one or a few parties first, and may or may not ever reach the
      entire Internet.

   The above argues that we should attack the routing scalability
   problem with an evolutionary approach.  By evolution we mean that the
   solution should enable routing table size reduction at only those
   routers whose capacity falls behind the FIB or RIB growth, and that
   the solution should be built on top of the existing system.  Building
   a solution on top of the existing system makes it much cheaper and
   easier to roll out, and makes it likely to work transparently with
   the rest of the system, which either does not make the changes or
   does not make them at the same time.


3.  An Evolutionary Path towards Scalable Routing

   Based on our current understanding of the problem and the solution
   space, in this section we sketch out an evolutionary path towards
   scalable inter-domain routing.  As the Internet continues to evolve
   over time, it is likely that our understanding will also evolve, thus
   the path we sketch out in this draft may change.  The main point we
   want to make is not any particular evolutionary path, but rather to
   show evidence that such an evolutionary path both exists and is
   feasible, and that efforts towards solving an existing problem should
   aim for an evolutionary path towards architectural change, rather
   than attempting a brand new design.

   At this time we can see several stages in evolving today's BGP
   routing system towards a controllable growth of the core routing
   table size.  We divide this evolution into stages by identifying the
   potentially most severe pain point at the time that seems to warrant
   a fix, and we identify a fix that has a reasonable cost, can be
   carried out by an individual network, and can be built on top of
   existing operations, so that it does not break any other part of the
   routing system.  Note that any such simple fix necessarily has its
   limitations.  As the fix gets widely deployed, its limitations are
   likely to become more pronounced, and can become the next problem to
   address, which will lead to the next stage of evolving the system
   forward.

3.1.  Stage One: Reducing FIB Size

   Over the last month or so we conducted a quick survey on routing
   scalability among a small group of people with operational expertise.
   The results identified the fast-growing FIB size as the highest
   priority concern in routing scalability; this is also consistent with
   the results from the 2006 IAB workshop on Routing and Addressing
   [RFC4984].  Therefore, we consider reducing FIB size to be the first
   issue to resolve towards scalable BGP routing, and we believe there
   is no major disagreement regarding this problem statement.

   The proposed solutions for resolving this FIB scalability problem, on
   the other hand, differ significantly.  Most of the proposals
   presented to the IRTF Routing Research Group (including our own, APT)
   took the direction of a basic architectural change.  An architectural
   change is likely to take a long time to go through the IETF
   standardization process and to be costly to roll out.  A more
   fundamental problem is that it is difficult to make such a change
   both bring immediate benefits to first movers and remain compatible
   with today's deployed base.  We discuss the issues in introducing a
   new design further in Section 4.

   A very different solution, Virtual Aggregation (VA), has been
   proposed by Francis and Xu [Virtual_Aggregation].  Briefly, Virtual
   Aggregation works as follows.  An ISP can reduce its routers' FIB
   size by configuring a router to announce a short prefix, say
   1.0.0.0/8, into its own network in place of the multiple longer
   prefixes that fall within 1.0.0.0/8.  This router is called an
   Aggregation Point Router (APR) and the short prefix is called a
   virtual prefix.  The APR maintains FIB entries for all the longer
   prefixes (e.g., 1.1.0.0/16) covered by the virtual prefix, while
   other routers in the network only need to maintain one FIB entry for
   the virtual prefix 1.0.0.0/8.  When a router receives a packet to be
   forwarded to 1.1.0.0/16, its FIB will direct the packet to the APR,
   which looks up its own FIB, finds the entry for 1.1.0.0/16, and then
   tunnels the packet to the egress router for the actual prefix
   1.1.0.0/16.
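
   As a rough illustration of the forwarding behavior just described,
   the sketch below models the two FIBs as longest-prefix-match tables.
   It is not taken from the Virtual Aggregation proposal: the router
   names ("APR", "egress-A", "egress-B"), the second specific prefix
   1.2.0.0/16, and the helper functions are assumptions made here for
   illustration only.

```python
import ipaddress


def lookup(fib, dst):
    """Longest-prefix match: next hop of the most specific covering entry."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in fib.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Ordinary router inside the ISP: a single entry for the virtual prefix.
regular_fib = {ipaddress.ip_network("1.0.0.0/8"): "APR"}

# APR: one entry per specific prefix covered by the virtual prefix.
# (1.2.0.0/16 and both egress names are hypothetical.)
apr_fib = {
    ipaddress.ip_network("1.1.0.0/16"): "egress-A",
    ipaddress.ip_network("1.2.0.0/16"): "egress-B",
}


def forward(dst):
    """Return the sequence of hops chosen for a packet inside the ISP."""
    path = [lookup(regular_fib, dst)]
    if path[0] == "APR":
        # The APR consults its own FIB and tunnels to the egress router.
        path.append(lookup(apr_fib, dst))
    return path
```

   In this sketch the ordinary router holds one FIB entry while the APR
   holds one entry per specific prefix, and forward("1.1.2.3") passes
   through the APR before reaching "egress-A".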


   We view Virtual Aggregation as an evolutionary step towards scaling
   the FIB size, because it can be done by an individual ISP to
   effectively shrink the FIB size for some of its routers, and the
   deployment only requires configuration changes to start with.  It has
   no impact on the routing operation of any other networks.  At the
   same time, since all packets destined to the prefixes that have been
   aggregated into a virtual prefix will go through the APR, this
   evolutionary fix introduces both additional delivery delay (i.e.,
   path stretch) and encapsulation cost.  Furthermore, the APR can also
   become a concentration point of traffic.

   Several operational steps can be applied to mitigate these problems.

   o  Do not aggregate prefixes that carry heavy volumes of traffic, so
      that this traffic neither suffers path stretch nor adds to the APR
      load.
   o  One can adjust APR load by adjusting the number of virtual
      prefixes, using more APRs to share the load.
   o  One may be able to configure an APR at the POP where adjacent
      prefixes are announced into one's network; properly positioning
      APRs can minimize the path stretch.
   o  Finally, if an APR receives heavy volumes of traffic from certain
      ingress routers, the APR can send to those ingress routers the FIB
      entries that their traffic is destined to, so that these ingress
      routers cache the FIB entries and encapsulate the packets towards
      the egress routers themselves.  This will both reduce the APR load
      and eliminate the path stretch for those cached FIB entries.

   This last technique makes an APR perform more or less in the same way
   as a Default Mapper (DM) in our APT design [APT], but with one
   fundamental difference.  Deploying an APR does not require any new
   protocol or a new functional box (the DM node) that the APT
   deployment would require.  Instead, an operator can simply configure
   a router to be an APR, without needing any changes to the other
   routers that benefit from the reduced FIB size.  Only once the APR
   rollout has succeeded and the APR load has become an issue would the
   operator need to consider additional changes to make the ingress
   routers handle caching.
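
   The ingress caching technique can be sketched as follows.  This is
   only one possible reading of the idea, under stated assumptions: the
   packet-count trigger, the threshold value, and the class and method
   names are all hypothetical, and no particular protocol for pushing
   FIB entries from the APR to an ingress router is implied.

```python
import ipaddress
from collections import Counter

HEAVY_THRESHOLD = 3  # packets; hypothetical value chosen for the example


class APR:
    def __init__(self, fib):
        self.fib = fib          # specific prefix -> egress router
        self.seen = Counter()   # (ingress name, prefix) -> packets handled

    def handle(self, ingress, dst):
        """Tunnel one packet; push a FIB entry to a heavy ingress router.

        Assumes dst is covered by some prefix in this APR's FIB.
        """
        addr = ipaddress.ip_address(dst)
        prefix = max((p for p in self.fib if addr in p),
                     key=lambda p: p.prefixlen)
        self.seen[(ingress.name, prefix)] += 1
        if self.seen[(ingress.name, prefix)] >= HEAVY_THRESHOLD:
            ingress.cache[prefix] = self.fib[prefix]
        return self.fib[prefix]


class Ingress:
    def __init__(self, name, apr):
        self.name, self.apr, self.cache = name, apr, {}

    def send(self, dst):
        """Return (egress, went_through_apr) for one packet."""
        addr = ipaddress.ip_address(dst)
        # First match is enough here: cached entries do not overlap.
        for prefix, egress in self.cache.items():
            if addr in prefix:
                return egress, False   # encapsulate directly to the egress
        return self.apr.handle(self, dst), True
```

   With the threshold of 3 used here, the first three packets from an
   ingress router go through the APR and the fourth is encapsulated
   directly towards the egress router.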

   We believe that the deployment of Virtual Aggregation (VA) can
   effectively reduce the FIB size at some routers.  Indeed, Virtual
   Aggregation is simply a poor man's map-encap within one AS.  The APR
   holds the mapping table from the virtual prefix to all the egress
   routers where the specific prefixes can be reached; this mapping
   information is directly derived from BGP routing updates.  The APR
   then encapsulates packets to those egress routers.

   How many ISPs would deploy VA?  How much time can VA buy us in
   curtailing the FIB size growth?  It seems only time can tell.  But if
   we look ahead one step, as the Internet continues to grow, and as
   IPv6 deployment starts rolling out, more networks may face the FIB
   size problem and adopt Virtual Aggregation as a solution.  When two
   or more adjacent ASes all deploy Virtual Aggregation, packets that
   traverse these ASes will experience the accumulated path stretch and
   encapsulation/decapsulation cost of all the ASes along their paths.
   The need to resolve this new problem (of accumulated path stretch and
   cost) can naturally lead to the next step of evolution towards better
   routing scalability.

3.2.  Stage Two: Reducing Multi-AS Virtual Aggregation Overhead

   Assume that the AS path a packet takes is W-X-Y-Z, and that both X
   and Y have deployed Virtual Aggregation.  We would then like X's APR
   to encapsulate the packet directly to the egress router of Y instead
   of X's own.  This will minimize the path stretch, and the packet will
   only need to be encapsulated/decapsulated once instead of twice.

   To enable such inter-AS Virtual Aggregation, X's APR needs to know
   Y's egress router for a given destination prefix P.  This mapping
   information (i.e., mapping from a destination prefix P to an egress
   router) needs to be propagated from Y to X.  The path of least
   resistance is to piggyback such mapping information on the existing
   BGP announcement for prefix P.  Francis and Xu have proposed such an
   extension to BGP, which carries the mapping information in a new BGP
   attribute [InterDomainVA]; the APT team was also looking into more or
   less the same design when the above-mentioned draft was published.
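
   The effect of exchanging mappings between X and Y can be illustrated
   with a small model.  It is a sketch under assumptions, not the
   encoding proposed in [InterDomainVA]: the field name egress_router is
   hypothetical, and the counting function assumes that one tunnel can
   span any run of adjacent VA ASes that exchange mappings, which
   generalizes the two-AS case described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Announcement:
    prefix: str
    as_path: tuple
    egress_router: Optional[str] = None  # hypothetical piggybacked mapping


def tunnel_endpoint(ann, own_egress):
    """Where X's APR tunnels packets for ann.prefix."""
    # Without a piggybacked mapping, fall back to X's own egress router.
    return ann.egress_router or own_egress


def encapsulations(as_path, va_ases, mapping_exchanged):
    """Count encapsulate/decapsulate rounds for a packet along as_path."""
    count, in_tunnel = 0, False
    for asn in as_path:
        if asn not in va_ases:
            in_tunnel = False          # tunnel has ended before this AS
        elif not (mapping_exchanged and in_tunnel):
            count += 1                 # this AS's APR starts a new tunnel
            in_tunnel = True
    return count
```

   For the path W-X-Y-Z with VA deployed in X and Y, this model counts
   two rounds of encapsulation without mapping exchange and one round
   with it.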

   We argue that this second step is feasible by the following
   reasoning.  First, this second step towards better routing
   scalability will take place only after at least two networks (X and
   Y) have deployed VA and benefited from it.  Therefore we reason that
   they would not want to move away from VA but would like to minimize
   VA's cost in path stretch and encapsulation, to improve the traffic
   performance for their customers.  Second, the required BGP
   implementation changes are backward compatible, meaning that networks
   that have deployed this solution and networks that have not deployed
   this solution can still communicate without problems.

   As a side note we would also like to point out that this virtual
   aggregation mapping exchange *closely* resembles the early design of
   APT mapping information exchange between Default Mappers that the APT
   team proposed earlier [APT-00].  The content of the mapping exchanges
   is somewhat different, but a more fundamental difference between what
   we discussed in this section and that early APT design back in 2007
   is the following: here we sketch out an evolutionary path forward,
   which does not require, as a starting point, any protocol change or
   information exchange across multiple ASes, as the early APT design
   does.  Rather, the need for mapping exchange arises only after the
   FIB size reduction has been achieved, and the mapping exchange can
   start with two adjacent ASes after each of them has deployed Virtual
   Aggregation.

3.3.  Stage Three: Reducing RIB Size

   Piggybacking the virtual aggregation mapping information on BGP can
   work well when the mapping table is small, i.e., the number of
   networks that have adopted Virtual Aggregation is small.  When more
   networks have adopted Virtual Aggregation, the mapping table will
   grow large, which may make it no longer feasible to piggyback all the
   mapping information on the existing BGP sessions.  The main problem,
   as we can perceive now, would be the RIB size growth: a BGP router
   will receive the same mapping information from multiple neighboring
   BGP routers, and store all of it in its Adj-RIBs-In and Loc-RIB.
   Thus every BGP router will have to store multiple copies of the
   mapping table.  This issue was pointed out back in 2007 when the
   early APT design was discussed, and one suggestion to get around the
   problem is to use separate BGP sessions for mapping information
   exchange.
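
   A back-of-the-envelope calculation shows why the copies add up.  The
   function below is a simplification made for illustration: it assumes
   every neighbor sends the complete mapping table, and the figures in
   the example are hypothetical, not measurements.

```python
def mapping_copies(table_entries, neighbors_sending_mapping):
    """Mapping entries held by one BGP router when the mapping rides on
    ordinary BGP sessions: one copy per neighbor in Adj-RIBs-In, plus
    the selected copy in Loc-RIB."""
    adj_ribs_in = table_entries * neighbors_sending_mapping
    loc_rib = table_entries if neighbors_sending_mapping else 0
    return adj_ribs_in + loc_rib


# Hypothetical figures: 100,000 mapping entries, 5 BGP neighbors.
piggybacked = mapping_copies(table_entries=100_000,
                             neighbors_sending_mapping=5)

# A regular router that no longer receives mapping information because
# it has been moved to separate sessions stores none of it.
separate_sessions = mapping_copies(100_000, 0)
```

   Under these assumed figures a router stores 600,000 mapping entries
   when the mapping is piggybacked on its regular BGP sessions, and none
   once mapping exchange is moved to sessions it does not take part in.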

   Another factor is that, after a network X has deployed virtual
   aggregation for a while and has gained sufficient operational
   experience, it may become clear that many of its routers no longer
   need to keep the full RIB table.  If an internal router has a small
   FIB and relies on APRs to route packets towards all other
   destinations, it does not need a full RIB to build its FIB.
   Theoretically speaking, all border routers of X that connect to
   legacy networks (i.e., those that have not deployed VA) would still
   need to keep the full RIB in order to make BGP announcements to the
   legacy neighbors.  In practice, however, only the customer-facing
   border routers need a full RIB.  The other border routers, those that
   face either peer or provider legacy neighbors, only need to announce
   X's own customer prefixes to them.  Careful engineering analysis and
   configuration can eliminate the need for many routers to keep the
   full RIB; the routers that do keep the full RIB will include those
   serving as APRs.
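
   The rule of thumb in this paragraph can be written down as a simple
   classification.  The router descriptions, field names, and router
   names below are hypothetical, and a real network would need the
   careful engineering analysis mentioned above rather than this
   one-line test.

```python
def needs_full_rib(router):
    """APRs and routers facing legacy customers keep the full RIB;
    under the rule of thumb in the text, other routers can shed it."""
    if router["is_apr"]:
        return True
    return any(n["legacy"] and n["relation"] == "customer"
               for n in router["neighbors"])


# Hypothetical routers of network X.
routers = [
    {"name": "apr-1", "is_apr": True, "neighbors": []},
    {"name": "cust-edge", "is_apr": False,
     "neighbors": [{"relation": "customer", "legacy": True}]},
    {"name": "peer-edge", "is_apr": False,
     "neighbors": [{"relation": "peer", "legacy": True}]},
    {"name": "internal", "is_apr": False, "neighbors": []},
]

full_rib = [r["name"] for r in routers if needs_full_rib(r)]
```

   In this example only "apr-1" and "cust-edge" keep the full RIB; the
   peer-facing border router and the internal router do not.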

   As we look further into the future, the picture becomes blurrier,
   hence what we forecast here may or may not accurately reflect what
   will happen.  Having said that, we expect that the combination of the
   aforementioned two factors (relieving regular routers from storing
   the mapping table and the full RIB table) would lead to moving the
   mapping dissemination from the regular BGP instance (which is used
   for inter-domain routing) to a separate BGP instance that runs only
   between APRs via multi-hop BGP sessions.  Though the protocol is
   still BGP for ease of deployment, APRs would run a different session
   (e.g., on a different TCP port) for mapping dissemination purposes
   only.  Other regular routers run the regular BGP instance for
   inter-domain routing purposes, but are relieved from bearing the
   overhead of storing and propagating mapping information or the full
   RIB table.

   When the RIB size for most routers (other than the APRs) is reduced,
   which prefixes get dropped out of the RIB?  Since APRs (or ingress
   routers, if they are upgraded to handle caching) must encapsulate
   packets towards egress routers that connect to the more specific
   prefixes that have been aggregated out, the ASes must exchange the
   reachability information about their own topologies, so that routers
   in different ASes know how to reach each other.  The prefixes that
   get aggregated out of the core routing system would be those that
   belong to the edge ASes.  As such, Virtual Aggregation plus mapping
   exchange effectively drives the overall routing system towards the
   separation of edge site prefixes from the transit network routing, a
   scalable routing architecture that the APT design has depicted.

   Again we cannot help but point out the close resemblance between the
   system we depicted above and the original APT design.  On the
   surface, it seems the only noticeable difference is just the names:
   here we have APRs, instead of APT's DMs, that use BGP to exchange
   mapping information.  But once again we must not forget an essential
   difference: we reach this perceived stage three towards scalable
   routing through an evolutionary path, instead of requiring
   installation of a new design from day one.

3.4.  Stage Four: Insulating the Core from Edge Churns

   In the current Internet, flaps of customer prefixes are propagated to
   the rest of the Internet in the form of BGP updates, i.e., routing
   churn.  With virtual aggregation and mapping exchange, this churn
   would be reflected as mapping updates, which are disseminated through
   the interconnection of APRs.  We perceive this as a benefit, as other
   non-APR routers can be sheltered from updates due to edge
   instabilities.

   Our earlier measurement and analysis study [TopologyGrowth] has shown
   that most Internet topology growth comes from the addition of
   customer edge ASes.  It is conceivable that as the number of customer
   sites continues to increase, the amount of churn may become too much
   to handle in a cost-effective way.  A solution to this edge churn
   problem is to insulate the mapping dissemination system from the edge
   dynamics.  Based on the current BGP data, our estimation shows that,
   if we could remove BGP updates induced by customer prefix
   instabilities, we would reduce the total amount of routing churn by
   an order of magnitude [eFIT_IPv6].  Ideally, when the link connecting
   a customer site to a provider fails, the mapping system should
   propagate this failure information only when the failure has a long
   duration, so that every network will be aware of this failure and
   choose an alternative path to reach the affected customer site.  But
   long-lasting failures probably do not happen frequently.  Short
   failures, which are frequent, should not be propagated through the
   mapping system.  Instead, they should be handled by other means.  For
   example, in the APT design, the failure handling actions are
   data-driven, i.e., a link failure to an edge network is not reported
   unless and until there are data packets that are heading towards the
   failed link.  We are actively working on an evolutionary solution
   that can provide handling of edge failures equivalent to the
   data-driven handling in APT.
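
   The idea of propagating only long-lasting failures can be sketched
   with a hold timer.  This is an illustration of the principle only,
   not the APT mechanism or the evolutionary solution mentioned above:
   the class, its method names, and the 60-second hold time are
   assumptions, and recovery announcements are left out for brevity.

```python
class MappingSystem:
    def __init__(self, hold_time):
        self.hold_time = hold_time   # seconds a failure must persist
        self.down_since = {}         # link -> time the failure began
        self.announced = []          # failures propagated so far

    def link_down(self, link, now):
        """Record the start of a failure without announcing it yet."""
        self.down_since.setdefault(link, now)

    def link_up(self, link, now):
        """A link that recovers before the hold time is never announced."""
        self.down_since.pop(link, None)

    def tick(self, now):
        """Announce failures that have lasted at least hold_time."""
        for link, start in self.down_since.items():
            if now - start >= self.hold_time and link not in self.announced:
                self.announced.append(link)
```

   With a hold time of 60 seconds, a link that fails and recovers within
   5 seconds generates no mapping update, while a link that stays down
   for 70 seconds is announced once.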

3.5.  Summary

   If we can imagine a picture where all the networks in the Internet
   had deployed all the stages of routing scalability improvement we
   sketched above, then the Internet routing system would have converged
   to a new map-encap routing architecture like APT.  Then what is the
   fundamental difference between the evolutionary path described in
   this draft and the deployment of APT?

   First, we clarify that the goal is to reduce the FIB/RIB table size;
   we emphasize that the separation of edges and core (or EIDs from
   Locators in LISP's terminology) itself should *not* be a required
   starting point.  Second, we show that the evolutionary path, which
   goes through several stages with clearly identified benefits and
   minimal cost at each stage, can naturally converge towards the
   separation as a result.  We make two points from this last statement:
   (1) This can also be taken as evidence that the evolution can indeed
   lead to architectural changes, as it moves the system towards the
   same point that a new design points to.  (2) We used the phrase
   "converge towards separation", rather than "achieving separation",
   because we believe that, even after a long time and after many
   networks have adopted the solution, it is most likely that some
   networks will remain at various early stages of the evolution, and
   some may not even have made a single change.  This is the nature of
   the Internet, due to its two properties that we mentioned at the
   beginning: its distributed governance, and its diversity along
   economic and operational practice dimensions.


4.  Evolution versus Incremental Deployability

   So far our discussion has focused on a possible evolution of the
   routing system towards a scalable design.  In this section we would
   like to broaden the discussion to a more general question: many new
   designs have plans for incremental deployment, so what is the
   difference between incremental deployability and an evolutionary
   path?

   We believe that all new designs have an implicit assumption that the
   entire system would eventually move to the new design.  No matter how
   much design effort one puts into the incremental deployment step of a
   new design, the design itself does not start with the assumption that
   not all parts of the system would adopt it.  Therefore, it is likely
   the case that the assumed benefit of the new design would be achieved
   only after a majority, if not the whole, of the system has deployed
   the design, and that the cost of incremental deployment would be
   minimized only then as well.  The incremental deployment machinery
   simply glues together the part that has made the change and the
   rest that has not, so that the system functions as a whole during
   the intermediate, and hopefully transient, stage; the system
   remains in a sub-optimal state until the new design is fully
   deployed.  In contrast, gradual evolution in a large system is a
   picture where changes happen here and there as needed, with no
   expectation that the system as a whole must make a change.

   A typical example is multicast deployment.  The main benefit of
   deploying IP multicast comes from the scalability and performance
   improvement for *large-scale* group communication applications.
   However, such applications will not emerge until the majority of the
   Internet has rolled out IP multicast, since the applications require
   it.  Therefore, although tunneling to the MBone provides the
   technical machinery to connect deployed sites, first movers do not
   see enough benefit to justify such a significant change to the
   network.  On the other hand, overlay multicast has been growing fast.
   It does not require changes to the network and can co-exist with any
   existing network protocol or application.  Its deployment cost is as
   small as individual users installing a piece of software.  More
   importantly, users who run it see immediate benefits while users
   who do not run it are not affected.

   The evolutionary approach recognizes that changes to the Internet can
   only be a gradual process with multiple stages.  At each stage,
   networks that make the changes must have the incentive to do so.
   More specifically,

   1.  Each stage focuses on an immediate problem with enough economic
       impact to warrant a fix.
   2.  Each stage offers a solution that solves the problem, does not
       break other parts of the Internet, and can be deployed with a
       reasonable cost considering the specific problem.
   3.  As the solution is being deployed by more and more networks, its
       downside may become more pronounced and eventually require a
       fix, which leads to the next stage of the evolution.

   Like many others, we too hoped that our new design, APT, could
   eventually be deployed everywhere to bring routing scalability under
   control.  We gradually realized that it is infeasible to attempt to
   roll out a new routing framework (i.e., a clean separation of edge
   prefixes from the core routing system) in a vast distributed system.
   The Internet Protocol, IP, was designed to accommodate heterogeneity
   at the subnet technology level.  Today, the intrinsic heterogeneity
   and distributed governance of the Internet call for accommodating
   heterogeneity at the network control plane as well.  Solutions to
   routing scalability should be control knobs on top of the deployed
   base, available to those parties who need them; they should not
   carry an expectation that the entire Internet would (eventually)
   move to a new design.

   An evolutionary process accommodates differences in different parts
   of the system, as new functions are built on top of, and hence can
   peacefully co-exist with, the deployed base.  A revolutionary new
   design, on the other hand, focuses on the final outcome once the
   replacement of the old by the new is done throughout the Internet.
   The latter would offer a clean picture of the overall system,
   assuming the final stage could be reached.  The former would
   present a much messier and more complex picture, both because we
   twist old protocols for new functions and because different parts
   may do different things.  As Haldane pointed out over 80 years ago
   [SizeImpact], in biological systems, "The higher animals are not
   larger than the lower because they are more complicated.  They are
   more complicated because they are larger."  We believe the same is
   true for man-made systems: as a system grows large in size, it
   necessarily becomes more complicated.


5.  Acknowledgements

   The authors are part of the APT team.  The APT project is funded by
   the US National Science Foundation.  The survey mentioned in
   Section 3.1 was conducted by Dan Jen.


6.  Security Considerations

   This draft discusses the need for the Internet to follow an
   evolutionary path towards the future.  It has no direct impact on
   Internet security.


Zhang & Zhang          Expires September 10, 2009              [Page 14]

Internet-Draft                 Scaling BGP                    March 2009


7.  Informative References

   [APT]      Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and
              L. Zhang, "APT: A Practical Transit Mapping Service",
              draft-jen-apt-01, November 2007.

   [APT-00]   Jen, D., Meisel, M., Massey, D., Wang, L., Zhang, B., and
              L. Zhang, "APT: A Practical Transit Mapping Service",
              draft-jen-apt-00, July 2007.

   [BGP2008]  Huston, G., "BGP IN 2008 - what's changed", APRICOT
              presentation, 2009, <http://apricot2009.net/
              index.php?option=content&task=view&id=51>.

   [InterDomainVA]
              Xu, X. and P. Francis, "Simple Tunnel Endpoint Signaling
              in BGP", draft-xu-tunnel-00, February 2009.

   [LISP]     Farinacci, D., Fuller, V., Meyer, D., and D. Lewis,
              "Location/ID Separation Protocol (LISP)",
              draft-farinacci-lisp-12, March 2009.

   [NANOG44]  Roisman, D., "Extending the Life of Layer 3 Switches in a
              256k+ Route World", NANOG44, October 2008, <http://
              www.nanog.org/meetings/nanog44/presentations/Monday/
              Roisman_lightning.pdf>.

   [RFC4984]  Meyer, D., Zhang, L., and K. Fall, "Report from the IAB
              Workshop on Routing and Addressing", RFC 4984,
              September 2007.

   [RRG]      RRG, "IRTF Routing Research Group Home Page", <http://
              tools.ietf.org/group/irtf/trac/wiki/RoutingResearchGroup>.

   [SIRA]     Zhang, B., et al., "A Secure and Scalable Internet
              Routing Architecture", ACM SIGCOMM 2006 Poster Session.

   [SizeImpact]
              Haldane, J., "On Being the Right Size", 1928,
              <http://irl.cs.ucla.edu/papers/right-size.html>.

   [TopologyGrowth]
              Oliveira, R., Zhang, B., and L. Zhang, "Observing the
              Evolution of Internet AS Topology", ACM SIGCOMM 2007.

   [Virtual_Aggregation]
              Francis, P., Xu, X., and H. Ballani, "FIB Suppression with
              Virtual Aggregation and Default Routes",
              draft-francis-idr-intra-va-01, September 2008.

   [eFIT_IPv6]
              Massey, D., et al., "A Scalable Routing System Design
              for Future Internet", ACM SIGCOMM 2007 IPv6 Workshop.


Authors' Addresses

   Beichuan Zhang
   Univ. of Arizona

   Email: bzhang@arizona.edu


   Lixia Zhang
   UCLA

   Email: lixia@cs.ucla.edu


Zhang & Zhang          Expires September 10, 2009              [Page 16]