Network Working Group J. Kunze Internet-Draft California Digital Library Expires: May 22, 2009 November 18, 2008 Oxum: Octet Stream Sum http://www.ietf.org/internet-drafts/draft-kunze-oxum-00.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on May 22, 2009. Copyright Notice Copyright (C) The IETF Trust (2008). Kunze Expires May 22, 2009 [Page 1] Internet-Draft Oxum November 2008 Abstract This document specifies "oxum", a two-part number, OCTETS.STREAMS, that is a kind of simple size summary for complex digital objects. In the mainstream case of a complex object that is a set of files, the STREAMS part is the total number of files and the OCTETS part is the total number of 8-bit bytes across all those files; for example, an oxum of 876543.21 could mean a total of 876,543 bytes across 21 files. Which set of streams comprises a complex object for an oxum computation depends in general on the object's type. One important type is the stream set defined by the set of files contained in a file hierarchy. An oxum is not a checksum in that, while a changed oxum means a changed object, an unchanged oxum does not mean an unchanged object. Kunze Expires May 22, 2009 [Page 2] Internet-Draft Oxum November 2008 1. The size of a digital object It can be hard to characterize the size of an arbitrary digital object. Word count, page count, image dimensions, video running time, or number of database records might all be useful metrics, depending on the type of the object. For a single file, one crude but easily obtained metric is the number of octets (8-bit bytes) in the file. This document introduces an analogous metric for a _complex digital object_, by which we mean an object that is not equivalent to a single file. A complex object may consist of a group of files or parts of one or more files (e.g., a database). Kunze Expires May 22, 2009 [Page 3] Internet-Draft Oxum November 2008 2. The octet stream sum (oxum) A complex digital object that has a well-defined set of octet streams, such as a document represented by a group of 14 text and image files, has a well-defined "oxum" (octet stream sum). The oxum is a two part number such as 567898.14 which corresponds to 567,898 octets spread over 14 files. The general form of an oxum is OCTETS.STREAMS where STREAMS is the total number of streams (e.g., files) and OCTETS is the total number of octets across all those streams. In general, these two numbers will be positive integers, although there may be situations (not described here) in which it makes sense for either one of them to be left unspecified with a hyphen ('-'). The period ('.') separator is required. Other examples: 1998.10 # 1998 octets spread over 10 streams 105.3 # 105 octets, 3 streams (not 105 and 3 tenths) 21436794142.831 # almost 19 Gigabytes spread over 831 streams 709895249489.8756 # about 661 Gb, or 710 Gb if you divide by 1000 -.1 # one stream, but number of octets not known yet The oxum is designed to be machine readable and to fit into a variety of syntactic contexts, such as command lines, file paths, URL [RFC3986] query strings, and XML [XML] tags. Note that the oxum is _not_ designed as a secure digest or checksum. While an oxum cannot change without a change to the object, an unchanged oxum absolutely does not imply an unchanged object. Do not use oxum in place of a cryptographic digest algorithm (cf. SHA1 [RFC3174]). Kunze Expires May 22, 2009 [Page 4] Internet-Draft Oxum November 2008 3. Oxum complex object types An _oxum object type_ is used to describe how to derive an object's stream set. For oxum to be meaningful for an object type, the type must have a well-defined, canonical stream set. Once the stream set is known, the oxum computation is straightforward and the streams can be processed in any order. One especially natural way to derive a stream set is to define a way to reduce an object type to a file group. Files are primal streams. In this document, a "regular" file is a contiguous sequence of octets with a well-defined start and end, whether the sequence is named in static storage (e.g., "memo.pdf") or is unnamed and recently retrieved (e.g., a web page) from a network socket. There are many filesystem entities that are not regular files, including directory nodes, block special files, and symbolic links. In this document, the word "file" usually refers to a regular file. A (regular) file is an oxum-ready stream. As a base case, a complex object consisting of exactly one file has an oxum of the form "OCTETS.1", as in 12345.1 Things get more interesting when dealing with more than one file. Any private or public agreement can be made about what constitutes a file group, hence a stream set, for the purposes of an oxum computation. A stream set might be declared to comprise all the attachments of an email message, or all the files resulting from a normalized dump procedure run against the tables of a database. An easily delineated group is all the files contained in a directory. Any recognized group of regular files can form on oxum stream set, including a simple manifest or list of filenames. For example, a transfer protocol might use oxum to help set the receiver's expectations in terms of total bytes and files contained in a transferred package [GRABIT]. 3.1. File hierarchy oxum The "file hierarchy" oxum type has the stream set defined by the group of all regular files rooted in a given directory (or folder), and including all regular files nested in all its subdirectories. This Linux shell script computes oxums for each of its arguments. Kunze Expires May 22, 2009 [Page 5] Internet-Draft Oxum November 2008 #!/bin/csh -f foreach f ($*) find $f -type f | sed "s/.*/'&'/" | xargs stat -t | \ awk -v f=$f '{s += $2} END {printf "%s.%s %s\n", s, NR, f}' end 3.2. Database oxum The "database" oxum type has the stream set defined by the group of files resulting from a canonical dump of all the database tables. This is [not worked out! XXX] related to the canonicalization phase in computing a Universal Numeric Fingerprint UNF [UNF]. Without UNF's cryptographic phase, this produces an oxum that is stable even if the raw data is moved to another platform or software system. Kunze Expires May 22, 2009 [Page 6] Internet-Draft Oxum November 2008 4. Security considerations Neither the oxum metric nor its computation pose any direct risk to computers and networks. Documentation that describes any use of oxum should caution implementors not to use an oxum in place of a cryptographically secure digest. Any such use would be trivial to spoof. Kunze Expires May 22, 2009 [Page 7] Internet-Draft Oxum November 2008 5. References [GRABIT] NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008, . [RFC3174] Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1 (SHA1)", RFC 3174, September 2001. [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005. [UNF] Altman and King, "A Proposed Standard for the Scholarly Citation of Data", March 2007, . [XML] Bray, "Extensible Markup Language (XML) 1.0 (Fourth Edition)", August 2006, . Kunze Expires May 22, 2009 [Page 8] Internet-Draft Oxum November 2008 Author's Address John A. Kunze California Digital Library 415 20th St, 4th Floor Oakland, CA 94612 US Fax: +1 510-893-5212 Email: jak@ucop.edu Kunze Expires May 22, 2009 [Page 9] Internet-Draft Oxum November 2008 Full Copyright Statement Copyright (C) The IETF Trust (2008). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Acknowledgment Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA). Kunze Expires May 22, 2009 [Page 10]