*****************************************************************************

--- Rationale

This is a forensic analysis undertaken to compare BusyBox 0.25 with current BusyBox (approximately svn 16124), for the purpose of finding and removing any code copyrighted by Bruce Perens.

Bruce Perens created BusyBox in 1995 as a utility for the Debian bootloader, and declared the project complete in 1996, at which point he abandoned further development. Forks of the code were subsequently maintained by Enrique Zanardi (for Debian) and Dave Cinege (for the Linux Router Project).

In 1998, Erik Andersen founded a new BusyBox project for Lineo, to create a general purpose utility package for embedded Linux systems. Erik unified the Zanardi and Cinege versions of BusyBox, and launched a website, CVS repository, and mailing list for the new project (asking for and receiving Bruce's permission to do so). After leaving Lineo, Erik continued this line of BusyBox development on his own.

Later, I (Rob Landley) began contributing to Erik's BusyBox project with the goal of upgrading BusyBox into a more efficient general-purpose replacement for the existing standard Linux command line utility packages (the gnu utils, etc), without sacrificing the existing simplicity or small size of BusyBox. My initial goal was to create a busybox-powered development environment (the Firmware Linux project) capable of rebuilding itself from source code without any other packages but a compiler toolchain, C library, and kernel. My eventual goal is to use BusyBox as the set of command line tools on my laptop.

My first contribution to BusyBox was in 2001 (svn 2128), and I was granted CVS access in 2003 (svn 8252). After BusyBox's 1.0 release (in October 2004), Erik turned his attention to his other embedded projects (uClibc and buildroot) which had not yet achieved their 1.0 releases.
In August 2005, I got Erik's permission to package and put out the BusyBox 1.01 bugfix release, and then turned my attention to stabilizing the development tree for a 1.1.0 release (in January 2006). This was not an attempt to become maintainer, merely to take some of the load off of Erik until he had more spare time. However, the increasing popularity of embedded Linux led Erik to instead hand off official maintainership of BusyBox in February 2006, to its de-facto maintainer (me) so he could focus on uClibc and buildroot.

Bruce Perens never even posted to the BusyBox mailing list during Erik's entire tenure as BusyBox maintainer (a period of over 7 years). In 2006, Bruce's web page still pointed to BusyBox as hosted by Lineo, a reference which was last current at the end of 2001.

Despite this, in September 2006 Bruce posted a series of increasingly confrontational messages to the BusyBox mailing list objecting to the plans of the current maintainer (me) to release new versions of BusyBox under GPL version 2 (rather than GPLv2 or later). This topic had been discussed on the list for 9 months; he showed up to interrupt its implementation. His confrontational attitude and lack of tact quickly burned through the respect and deference his historical contributions were due, and his repeated demands quickly turned into threats (despite being asked to fork the project from any of the existing releases if he felt that strongly about the issue, plus repeated assurance that existing releases remained under the licenses they had already been released under, and a persistent failure to explain how "GPLv2" wasn't a compatible subset of "GPLv2 or later").

In one of Bruce's messages, he stated that "you may attempt to prove that everything I've written has been filtered out over 6 years", and implied it would be the only way to get rid of him. Since he wouldn't take me up on my offer to fork off, I'm taking him up on his offer to demonstrate his irrelevance.
This is an effort to either show that Bruce has no copyrights on any of the current code, or to remove any code shown to have his copyrights, in hopes that he'll shut up and go away. It's also possible that a detailed analysis of the origins of BusyBox (predating the current source control system) will assist future license enforcement efforts, but the motivation is definitely making Bruce go away.

--- Methodology

This is a four-part forensic analysis comparing the current BusyBox development tree with BusyBox 0.25. The four parts of the analysis will:

1) Search for files containing Perens' copyright notice.
2) Search the source control repository history for lines unchanged since svn commit #5 (which checked BusyBox into the CVS tree, after uClibc).
3) Compare the 0.25 and current trees using Eric Raymond's comparator tool.
4) Manually inspect the 67 files in the 0.25 tree, comparing each with any corresponding or equivalent files in the current tree.

BusyBox 0.25 is the oldest version of BusyBox released by Erik Andersen. The project was already on its third maintainer since Perens left, and all of Perens' contributions predate Erik's tenure as maintainer, so code in 0.25 is the only potential source of a Perens copyright in the current version of BusyBox.

The first two passes cannot show the absence of copyrighted code, but can highlight areas where its presence must be carefully checked for. The last two passes are more exhaustive tests, capable of showing the absence of old code.

*****************************************************************************

--- Part 1: Copyright notices

The first test is to grep for Bruce Perens' copyright notices:

$ find . \
| grep -v \.svn | xargs grep -i perens
./docs/busybox.net/oldnews.html:     I have received permission from Bruce Perens (the original author of BusyBox)
./docs/busybox_footer.pod:Bruce Perens
./archival/tar.c: *  Copyright (C) 1995 Bruce Perens
./coreutils/df.c: * based on original code by (I think) Bruce Perens .
./coreutils/pwd.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./coreutils/sync.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./init/init.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./procps/kill.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./util-linux/more.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./util-linux/mount.c: * Copyright (C) 1995, 1996 by Bruce Perens .
./AUTHORS:Bruce Perens

The AUTHORS, oldnews.html, and busybox_footer.pod files did not exist in BusyBox 0.25. The AUTHORS mention is historical (Linus Torvalds has one too, despite never directly contributing to the project -- we went out and found his code and added it ourselves), oldnews.html notes that Erik asked for and received Bruce's permission to set up a new website for BusyBox (at Lineo, hence the mention on Bruce's web page), and the hit in busybox_footer.pod is because it contains a copy of the AUTHORS file. To be clear, those three are files mentioning Bruce, but not claiming to be covered by Bruce's copyright.

This leaves possible contamination in the following files:

tar.c, df.c, pwd.c, sync.c, init.c, kill.c, more.c, mount.c

These eight files will receive special attention in later passes.

--- Part 2: Source Control History

This test is to see which lines of current BusyBox code the "svn annotate" command traces back to the first checkin (which was svn-5 because uClibc went into source control first). This is a weak test because it can be confused by whitespace changes, but can still detect old code. To reproduce these results:

$ svn co svn://busybox.net/trunk/busybox
$ cd busybox
$ for i in `find . \
| grep -v \.svn`; do [ -f "$i" ] && echo && echo "$i" && svn annotate "$i"; done > everything.txt
$ sed -n -e '/^\./p' -e '/^[ \t]*5[ \t]/p' everything.txt > files.txt

Large chunks of archival/gzip.c are still the same, apparently code taken from gzip 1.2.4. This file will also receive special attention in later passes.

Several files have unchanged blank lines (or lines containing nothing but a curly bracket, #includes of standard header files, etc), but not one line of executable code: archival/tar.c, console-tools/{clear.c, loadkmap.c}, coreutils/{printf.c,sleep.c,dd.c,df.c,mknod.c,ln.c,date.c,rm.c,pwd.c, length.c,chroot.c,mkdir.c,cat.c,sync.c,rmdir.c,touch.c}, findutils/grep.c, init/init.c, init/halt.c, procps/kill.c, util-linux/{dmesg.c,mkswap.c,more.c,mount.c,umount.c}, and README.

Of those, coreutils/printf.c has a possibly copyrightable comment, but it's just a comment documenting the printf backslash options, and it's attributed to David MacKenzie of the GNU project.

Files with some executable lines in among the comments and blank lines include miscutils/mkdevs.c (one line: "switch (type[0]) {"), miscutils/mt.c (eight, small and scattered), applets/busybox.mkll (has one), applets/busybox.c (has two, one of which is "int main(int argc, char **argv)"), examples/busybox.spec (largely unchanged but the "Packager" field attributes it to Erik Andersen), and Makefile (still has the "clean:" and "distclean:" targets).

This pass shows only two files of further interest: archival/gzip.c and miscutils/mt.c, bringing the list up to 10 files total from the first 2 passes:

gzip.c, mt.c, tar.c, df.c, pwd.c, sync.c, init.c, kill.c, more.c, mount.c

*****************************************************************************

--- Part 3: Comparator.

After SCO's 2003 claims of copied code hidden in the Linux kernel, Eric Raymond wrote comparator. Ron Rivest (the inventor of MD5) created a custom 8-byte digest function for version 2.0 of this tool.
Results from comparator have been introduced as evidence in the SCO trial. This is the current best-of-breed tool for finding common code segments between two source trees.

Comparator checks an entire code tree against another entire code tree, to find copied code anywhere in the tree. It is not confused by whitespace changes, file renames, cut and paste between files, or most other code resequencing.

The project's web page is at "http://www.catb.org/~esr/comparator", and the manual page (describing its design) is at "http://www.catb.org/~esr/comparator/comparator.html".

For this test the ".svn" directories were filtered out of the current BusyBox tarball using "tar --exclude", to reduce the false positive rate:

  tar czC .. --exclude .svn busybox | tar xzv

The command used to compare the two trees was:

  comparator-2.5/comparator -N line-oriented,remove-whitespace,remove-braces \
    busybox-0.25 busybox > out.txt

This found 169 matching line ranges, all but 40 of which are between gzip.c or zcat.c and their modern equivalents in the archival directory. This is detecting the continued use of the inflate and deflate engines from gzip 1.2.4, which will be examined in more detail at the start of the manual inspection section.

Of the 40 remaining matches (identified by the 0.25 filename and line range detected in the modern version):

The following are spurious or unprotectable matches:

dd.c 11-14, 9-12, 14-16, 4-10, ls.c 5-11, 11-19, mnc.c 18-21, 10-16, 21-24, 20-22: Chunks of the long GPL boilerplate, matching against files still using the old long format of the GPL permission grant.

init.c 116-118: close(0); close(1); close(2);. Part of manual daemonizing, not protectable.

mkswap.c: 126-129, 121-127, 128-144, 146-154, 162-169: Most of these matches are against mkfs_minix.c, which was written by Linus Torvalds (who also wrote mkswap, but it's a separate file from a separate upstream source).
The current BusyBox doesn't use the old mkswap.c but a fresh implementation I wrote from scratch (svn 15704). There's also a match against the e2fsprogs code (again, separately sourced).

nl-f 117-119, 10-12: This file is just a range of numbers, possibly the output of "seq 1 360" back when it was buggy and stuck a tab at the beginning of each line. I have no idea why it was ever in the tree.

These matches are in code with known copyright attribution, added to the codebase by people who are known to have modified it after Bruce's time:

findmount.c 16-27, 31-36: These matches are against the current find_mount_point() function in libbb/find_mount_point.c, which is copyright 1999-2004 by Erik Andersen.

loadkmap.c 58-60, 59-62: The main loop of loadkmap and dumpkmap has a few lines in common with the old loadkmap. The current loadkmap is copyright Enrique Zanardi, and dumpkmap is copyright Arne Bernin.

ls.c 2-5, 19-23, 25-31, 36-42: This is the description and copyright notice of Brian Candler, along with a descriptive comment about the known limitations of "tiny-ls 0.1.0", which seems to have been an existing external project added to BusyBox. Note that no actual code is matched, only introductory comments.

mnc.c 50-54: These are variable declarations (not the code that uses them) from mini-netcat, which states it was "built from the ground up for LRP" (the Linux Router Project), postdating Bruce's tenure. Copyright Charles P. Wright and Dave Cinege.

swapoff.c 29-31, 25-30, umount.c 45-48, 50-62: These all match sections of the current libbb/mtab.c, which has one function erase_mtab() and is copyright 1999-2004 Erik Andersen. This file is only used for legacy mtab support, is somewhat functionally constrained by the mechanics of getmntent/setmntent and struct mntent, and now that I look at it is slightly insane (fall back to /proc/mounts while _removing_ an entry?) Denis Vlasenko's been pondering rewriting this anyway to avoid an unrelated race condition.
See svn 11099 for more detailed history on this entry.

Code of unknown provenance:

math.c 112-116, 118-133, 142-145: Portions of the function "stack_machine()", and the call to said function, in the modern dc.c. There is a license on this file, but no copyright statement.

more.c 44-46: these three lines initialize termio state. Although the original more.c was created by Bruce (and is one of the files with his copyright notice), the BB_MORE_TERM option appears to be a later addition.

mount.c 104-106: Considering I wrote the current code this match occurs in, I suspect it's just a fluke rather than actual leakage. Since I do remember trying this function several ways to see what was smallest (I was explicitly trying to remove the modifying-your-arguments bit to have the comparisons be length based, but that increased the size), I'm leaning towards "fluke". The name "options" for the option string and "comma" for the comma position indicator seem particularly creative, the increment can't happen before the assignment, and the test must come before both. It's essentially undoing a manual strtok(), which is a fairly standard cleanup operation.

mt.c 8-36, 38-50, 55-57, 81-85: magnetic tape control is obsolete and really doesn't belong in BusyBox. The last time anybody specifically touched mt.c (rather than as part of a large global search-and-replace style change that touched many files) was 2001. I'm happy to rip it out rather than even try to answer questions about it. It's a driver for special-purpose hardware that can have a special-purpose utility in its own darn package.

Depending on how paranoid we want to be, we may want to tweak small portions of dc, more, mount, and delete mt.c entirely as not worth fixing.

--- Manual inspection.

BusyBox 0.25 contained 310,063 bytes in 67 files. By far the two largest files are gzip.c (107,812 bytes) and zcat.c (71,936 bytes). The remaining total is 130,315 bytes, and the remaining files average 2k per file.
--- Changes in applet structure:

BusyBox is not the only program to combine multiple applets into a single executable, varying the behavior based on the filename. An example predating BusyBox would be the gzip 1.2.4 package which early versions of BusyBox took code from. (The gzip 1.2.4 main() function starting on line 424 of gzip.c assigns progname = basename(argv[0]), and then examines that to set behavior flags for gzip, gunzip, zcat, and other names.)

The "behavior flags" technique doesn't scale to a large number of applets (although BusyBox 0.25 was still attempting it). Instead, a superior organization is the function dispatch table: just as each standalone C program has a main() function with argc and argv[], when gluing several programs together as applets each applet can have an applet_main() function receiving argc and argv[]. This is a standard programming technique used everywhere from C++ virtual methods to the Linux VFS, and too obvious to even be patentable under a sane patent system; certainly not a sufficiently creative element of each applet to be protectable under copyright.

The applet_main() declarations also changed format from 0.25 to today. The type is now generally declared on the same line as the rest of the function. The "extern" declaration (which serves no purpose on a function definition, only on a function prototype) was removed. The remains of the old flag-parsing behavior (the extra "struct FileInfo *" argument) were also removed. The remaining arguments are defined by the C programming language, although the format in which they're declared is now more standard ("char **argv" or "char *argv[]", vs "char * * argv" which no longer occurs in the current BusyBox tree).

Each applet also has a help message. In 0.25 this was a string assigned to a global variable near the top of the applet's .c file.
The mechanism by which current BusyBox provides help text is very different: a central header file "include/usage.h" contains #defines for up to four different categories of help information for each applet (applet_trivial_usage, applet_full_usage, applet_example_usage, applet_notes_usage). This information is used to generate documentation (man page and web page), and to provide configurable granularity for help messages built into each applet. For all but the simplest applets, the contents of the old help strings are inappropriate for the new more detailed help system.

--- Data compression code (2 files):
gzip.c, zcat.c

An analysis of the two largest files, gzip.c and zcat.c, indicates that they are based on the external "gzip" project, version 1.2.4. Modification for BusyBox consisted almost entirely of removing code, not adding it.

The two files implement streaming deflate and inflate, respectively. They were created by concatenating files from gzip together, and are highly redundant (containing concatenations of the same files, and in the case of zcat at least one standard header file).

These two files are by far the largest remaining areas of similarity in the tree between BusyBox 0.25 and BusyBox 1.2.1, and thus deserve close analysis.

-- gzip.c

gzip.c contains the deflate engine, which in current versions of BusyBox lives in the file archival/gzip.c. The modern BusyBox version is 77,666 bytes, 30k smaller than the 0.25 version. The deflate data format is documented in http://www.faqs.org/rfcs/rfc1951.html

The 0.25 version of gzip.c was stripped down for BusyBox by Charles P. Wright. It is primarily a concatenation of the gzip 1.2.4 files gzip.h, lzw.h, revision.h, tailor.h, gzip.c, bits.c, deflate.c, gzip.c, trees.c, util.c, zip.c.

For comparison purposes, I concatenated those files together in that order from gzip 1.2.4, creating a file containing all but 39 of the lines from BusyBox 0.25's gzip.c according to "diff -u zcat.c test.c | grep '^-' | wc".
It also contains over 1800 lines which were removed to produce the BusyBox version (mostly in large contiguous sections, such as entire functions).

The added lines are comments, some preprocessor directives, an obsolete usage message (i.e. "gzip\nignores all command line arguments\ncompress stdin to stdout with -9 compression"), lines touched in the process of removing "pack_level", and a rename of the function main() to gzip_main().

There do not appear to be any nontrivial new creative elements. There is no evidence that Bruce Perens ever personally modified this file.

-- zcat.c

zcat.c contains the inflate engine, which in modern BusyBox is contained in the file archival/libunarchive/decompress_unzip.c. The modern BusyBox version is 26,247 bytes, almost 1/3 the size of the BusyBox 0.25 version.

The 0.25 version of zcat.c was stripped down for BusyBox by Sven Rudolph. It claims to be a concatenation of files from gzip 1.2.4, but instead of concatenating the files in order, the various #include directives in gzip.c and unzip.c were commented out and the appropriate files spliced into place. (This is an awkward arrangement that was cleaned up by Erik Andersen.) It has also had more code removed, which confuses diff enough to create false positives when checking for added lines.

zcat.c contains lines from gzip.c, tailor.h, gzip.h, lzw.h, revision.h, getopt.h (!), unzip.c, crypt.h, util.c, and inflate.c. The ordering can be roughly recreated from gzip 1.2.4 via:

  head -n 56 gzip.c
  cat tailor.h gzip.h lzw.h revision.h getopt.h
  sed -n -e '62,700p' -e '1139,$p' gzip.c
  sed -n '1,21p' unzip.c
  cat crypt.h
  sed -n '22,$p' unzip.c
  cat util.c inflate.c

This is only an approximate ordering, and a diff of zcat.c against this produces 86 new or changed lines according to diff -bu, most of which are false positives or trivially changed (generally one character). A better ordering script could reduce the false positives, but this is sufficient for manual inspection of what's left.
Aside from false positives, there's an introductory comment from Sven, blank lines, preprocessor directives, an obsolete usage message ("zcat\n\n\tuncompress gzipped data from stdin to stdout\n"), and some single character changes replacing gzip's built-in allocation functions with standard library functions (for example, fcalloc->calloc), and main() renamed to zcat_main().

Some stub code was added where code had been removed, for example the code to read the time_stamp field has been replaced by four calls to get_byte() with the comment "Ignore time stamp". See also the comment "Discard original name if any", where a while loop reads and discards characters until the next NUL byte. This code does not apply to the current BusyBox, since the stubbed out functionality has long since been re-implemented.

There are no other nontrivial new creative elements. There is no evidence that Bruce Perens ever personally modified this file.

-- Conclusion about zcat.c and gzip.c

Bruce Perens cannot claim a copyright on current BusyBox code through these two files unless he contributed the code to gzip 1.2.4 before BusyBox merged it.

--- Files that are not source code (6 files):
busybox.obj, busybox.mkll, LICENSE, nl-f, README, Makefile

Makefile - Bears no resemblance to the current BusyBox makefile. The comment "This will choke on a non-debian system" says it all, really. Incidentally, we're investigating a switch from our current Makefiles to the Linux kernel's kconfig. This is one of the things the v2-only licensing paves the way for.

busybox.obj - A small shell script used by the old Makefile to extract the list of objects from busybox.def.h. No longer used, nor present.

busybox.mkll - Shell script to make the list of symlinks when installing BusyBox. The current BusyBox still has a file by this name, but it's a complete new implementation by Larry Doolittle extracting information from different files using different tools (the new one is based on awk instead of sed).
nl-f - a list of numbers from 1 to 360, one per line. Why? I have no friggin clue. Not present today. Not actually copyrightable, for that matter.

LICENSE - This is actually functionally similar to the current AUTHORS file. Contains an uncopyrightable list of facts, and is not used to build the BusyBox binary. Note that Bruce Perens' copyright notice ends at 1996. More on that in the next file.

README - Ancient and no relation to the current README. Note that it says patches should be sent to Dave Cinege and Enrique Zanardi, who were the maintainers Erik Andersen inherited the project from. Bruce Perens had already abandoned the project long before Erik got involved, and much of the development of 0.25 had occurred after Bruce lost interest.

--- Header files (3 files):
busybox.def.h, busybox_functions.h, internal.h

busybox.def.h - a list of configuration symbols, to be edited by hand. These days we use menuconfig from the Linux kernel, which generates symbol names in a different format, to be parsed by different infrastructure, and even uses different symbol names (CONFIG_XXX instead of BB_XXX).

busybox_functions.h - function prototypes for mkswap() and fdflush(). Trivial. (And note that mkswap() is declared in internal.h, so even though the file does almost nothing, half of it's redundant anyway.)

internal.h - This file is entirely obsolete. It performs a function similar to the current include/applets.h, but in a much more primitive manner.

The bulk of this file (lines 62-124, and 139-190) is repeated prototypes and declarations for the main() and usage[] for each applet. These days this is done by include/applets.h, using one macro for each applet.

Much of the remaining content (such as the FileInfo structure on lines 11-34, and declaration of parsing functions on lines 109-124 and 130-138) is for the obsolete command line option parsing that BusyBox long ago replaced.
A few function prototypes (lines 45-60, 130-138) would be in include/libbb.h if we still had equivalent functions, but these days this would mostly be static within individual .c files.

The only part of this file of any real interest is struct Applet, which is a precursor to the current struct BB_applet. Only 2 of the 5 members still have the same name (char *name and the function pointer main()), and only char *name is unchanged; the function pointer is of a different type (its arguments have changed). The remaining fields in the old Applet are a string and two integers dealing with the obsolete argument parsing, while the remaining fields in the new BB_applet are bitfields for install location and set-user-id bit privilege escalation.

*****************************************************************************

Still todo:

--- Tar file handling (4 files):
tarfn.h, star.c, tarcat.c, tarfn.c

--- Small files under 1000 bytes each (25 files):
sync.c, true.c, halt.c, false.c, reboot.c, clear.c, length.c, sleep.c, rmdir.c, date.c, touch.c, pwd.c, find.c, fdflush.c, dyadic.c, cat.c, swapon.c, rm.c, chgrp.c, utility.c, postprocess.c, update.c, mkdir.c, mknod.c, mv.c.

--- Files over 1000 bytes long (27 files):
ls.c, init.c, mount.c, mkswap.c, main.c, dd.c, losetup.c, mnc.c, chmod.c, umount.c, kill.c, descend.c, cp.c, monadic.c, df.c, math.c, mt.c, more.c, makedevs.c, dmesg.c, loadkmap.c, block_device.c, swapoff.c, chown.c, ln.c, findmount.c, dump.c.

sync.c: This is the smallest source file in the tree (smaller than true.c). Despite this file carrying Bruce Perens' copyright notice (added not by Bruce but by Erik Andersen in svn-49), not a single line is the same from 0.25 to current. The original file was a trivial wrapper around one line of code: a call to the sync() C library function, which was probably not copyrightable. The only similarity the current code has is that it also calls the sync() function out of the C library, from a function called sync_main().
Given that information I could rewrite this file blind. The copyright notice is in error.

true.c: This file is one function that returns 0. All function arguments It also #includes a file that no longer exists,

mnc.c: mini-netcat states it was "built from the ground up for LRP" (the Linux Router Project), postdating Bruce's tenure. Copyright Charles P. Wright and Dave Cinege. The current version has been extensively modified by Matt Kraai (a rewrite in 1464) and myself (svn 15675). I could have replaced this netcat with the version of netcat.c I wrote from scratch in 2001, but it seemed politer to upgrade the existing one instead.

I've extensively rewritten mount.c already, I don't think there's any of the original code left in it but I'm happy to let shredcompare tell me for sure. I'd been planning to fix df because the current one's broken. pwd and sync are trivial, and neither kill nor more are brain surgery. I once extensively rewrote init, but that rewrite was lost when Erik froze the tree for 1.0.0-pre and then migrated the repository from cvs to svn afterwards, and I switched laptops while waiting something like nine months for the checkin window to reopen. I've been meaning to rewrite it again anyway, this might give me some incentive, I'll have to investigate more closely.

The only one requiring actual _effort_ on that list is tar.c, and I know that's had a huge number of changes since the first checkin. Perhaps it needs more, I'll see what comparator says...