TITLE: Duplication in Many Forms
AUTHOR: Eugene Wallingford
DATE: October 03, 2004 5:00 PM
DESC: Duplication of all sorts can hurt you, but recognizing it in its subtler forms isn't so easy.
-----
BODY:
We all know that, all other things being equal, duplication in a program is bad. The Pragmatic Programmer uses the phrase DRY -- "Don't Repeat Yourself" -- to capture the essence of this idea. Unchecked, duplication creates a maintenance nightmare, because changes to the system may be needed in several, hard to find places.

When programming test-first, duplication is a natural part of the development process. First, we craft a test for a requirement, and then we do the simplest thing possible to make the test pass. That simplest thing often involves duplicating some other piece of code, with perhaps a few tweaks that distinguish the new case from the existing one. Test-first development doesn't leave the duplication unchecked, though, because it calls for us to refactor our code as soon as the code for a test passes.

What sort of duplication should we look for when refactoring code, whether as a part of TDD, as part of taming a legacy system, or as part of improving our non-TDD system? Duplication can occur in many guises, not all of which are immediately obvious when examining a body of code.

The simplest form of duplication is when two pieces of code look the same. Such textual duplication results from copy and paste, but it can also occur when solving related problems independently. When we duplicate text via copy and paste, we usually know to eliminate the duplication in the upcoming refactoring phase. Even when we generate it independently, it's easy enough to recognize as we move around our code base. Common refactorings such as factoring out a method or superclass address textual duplication.

A particular sort of textual duplication arises in how we sometimes name things.
Consider this piece of Java code, based on a thread on the refactoring discussion list:

      sendMessageToServer( Message m, Server s )

There's a not-so-subtle duplication in the name of the method and its arguments. I like explicit names, to the point of using longer names than most folks like to type or read, but this example repeats the intent of the method in the method name sendMessageToServer and the argument types Message and Server. The duplication rises to another level when used in this too-common way:

      Message message = ...;
      Server  server  = ...;
      ...
      sendMessageToServer( message, server );

That's triplication, not duplication! Let your language do some work for you. And you don't need to work in a statically-typed language to see how good names can eliminate such repetition. A typical Smalltalk method signature for the above would probably read:

      send: aMessage to: aServer
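
Java can get close to the same effect. The sketch below is one possibility, not the code from the discussion-list thread: the Message, Server, and Client classes and their members are hypothetical stand-ins invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the message type.
final class Message {
    final String text;

    Message(String text) {
        this.text = text;
    }
}

// Hypothetical stand-in for the server; it only records what arrives.
final class Server {
    final List<Message> inbox = new ArrayList<>();

    void accept(Message message) {
        inbox.add(message);
    }
}

// Hypothetical sender.
final class Client {
    // Named for the action alone. The parameter types already say
    // "message" and "server", so the method name does not repeat them.
    void send(Message message, Server server) {
        server.accept(message);
    }
}

final class NamingDemo {
    public static void main(String[] args) {
        Client client = new Client();
        Server server = new Server();
        Message message = new Message("hello");

        // Each idea appears once in the call.
        client.send(message, server);

        System.out.println(server.inbox.size()); // prints 1
    }
}
```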

We can eliminate this sort of name duplication by choosing better names. :-) Name methods for their actions, and let the names of argument objects participate in the readability of a statement.

A related form of duplication occurs when two pieces of code behave alike. We can call this functional duplication. Sometimes functional duplication begins as textual duplication, but it can happen quite innocently as programmers working on different parts of a system reinvent one another's solutions to common problems. When two methods or two classes do the same thing, we run into the same maintenance problem as in textual duplication. When requirements change, one of the methods or classes may be modified or enhanced, leaving some part of the system using an older version.

Functional duplication is hard to find, because the code may not jump out at you. One of the less-mentioned benefits of small methods and small classes is that functional duplication has a harder time hiding in them than in complex code. You are more likely to notice that two pieces of code do the same thing when both are simple. XP's encouragement that all programmers work on all parts of the system over time, through promiscuous pairing and non-exclusive attachment to particular sub-systems, also helps us avoid this problem, as we are more likely to come into contact with all of the system's functionality sooner or later.

Once we have identified functional duplication, we can eliminate it using many of the same factoring techniques as we use on textual duplication. But we may also need to redesign some of our interfaces when the same functionality goes by different names in the system.

Dave Astels points out another kind of duplication in his article on bad code: temporal duplication, when work is repeated unnecessarily in a program. I see this sort of duplication when both client code and server code perform safety checks on the values of variables, say, to verify a pre- or post-condition.
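
The client-and-server case might look like the following Java sketch. Account, BillPayer, and their methods are hypothetical names invented for the illustration. The first client method repeats a check that the account is about to perform anyway; the second leaves the check to the account and reacts to the outcome.

```java
// Hypothetical "server" class that owns the precondition.
final class Account {
    private long balanceInCents;

    Account(long openingBalanceInCents) {
        this.balanceInCents = openingBalanceInCents;
    }

    // The account verifies its own precondition on every call.
    void withdraw(long amountInCents) {
        if (amountInCents <= 0 || amountInCents > balanceInCents) {
            throw new IllegalArgumentException("invalid amount: " + amountInCents);
        }
        balanceInCents -= amountInCents;
    }

    long balanceInCents() {
        return balanceInCents;
    }
}

// Hypothetical client code.
final class BillPayer {
    // Temporal duplication: this check runs here and then again
    // inside withdraw().
    static boolean payBillWithDuplicateCheck(Account account, long amountInCents) {
        if (amountInCents <= 0 || amountInCents > account.balanceInCents()) {
            return false;
        }
        account.withdraw(amountInCents);
        return true;
    }

    // The check runs once, in the object that owns the rule.
    static boolean payBill(Account account, long amountInCents) {
        try {
            account.withdraw(amountInCents);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Both client methods behave the same from the caller's point of view. Only the first one has to change when the account's rule changes.
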
But it can happen in other ways, too. For example, student code often asks a collection if it contains an entry with a particular key, and when the collection says 'yes' it asks for the entry. This may involve searching the underlying collection instance variable twice.

Temporal duplication is harder to find, because it requires a deeper feel for what the code is doing. One way to eliminate temporal duplication is to decide who is responsible for an invariant condition and then have only that object enforce it. Another is to rethink the interface of an object -- why ask the collection if it contains the key; why not just ask for the desired entry and behave appropriately when the collection can't find it? A third way is to cache the result of the first effort and then return the value immediately upon future requests. Choosing which of these techniques to use is a matter of balancing different forces. Someone should write some patterns...

There are some other forms of duplication that show up as a result of how we design our code. Kevin Rutherford wrote an article or two on how many if statements duplicate knowledge held elsewhere in the system. This is a sort of epistemological duplication that lies at the heart of good system design. In object-oriented programming, we don't need to use an if statement to recover what the system knows or used to know. At the moment the system knows something about its future behavior, it can create an object that has that behavior. Joe Bergin and I have been encouraging this as a way for students and instructors to design programs that make better use of polymorphic dispatch than explicit selection. The advantage of polymorphic dispatch over if statements is, of course, that we can customize a program's behavior by plugging a new kind of object into the system, rather than editing the program code to address another case in the if statement.
And, where there is one such if, there tends to be more than one, and we end up with a form of textual duplication if only in the structure of the choices being made! I like this quote from the Rutherford article mentioned above as a concrete criterion for recognizing epistemological duplication in choices:

Therefore it seems to me that there are two kinds of conditional statement in a code base: The first kind tests an aspect of the running system's external environment (did the user push that button? does the database hold that value? does that file exist?). And the second kind tests something that some other part of the system already knows. Let's ban the second kind...
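
A small Java sketch may make the second kind of conditional concrete. The Shipping interface, its two implementations, and the Order class are hypothetical, invented for this illustration rather than taken from Rutherford's article. The decision is made once, at the moment the system learns the order total. After that, no code needs to ask again, because the object created at that moment carries the behavior.

```java
// Hypothetical example; all names and prices are illustrative only.
interface Shipping {
    long costInCents(long weightInGrams);
}

final class StandardShipping implements Shipping {
    @Override
    public long costInCents(long weightInGrams) {
        return 500 + weightInGrams / 10;
    }
}

final class FreeShipping implements Shipping {
    @Override
    public long costInCents(long weightInGrams) {
        return 0;
    }
}

final class Order {
    private final Shipping shipping;
    private final long weightInGrams;

    Order(long totalInCents, long weightInGrams) {
        // The only place the choice is made: when the total first becomes known.
        this.shipping = totalInCents >= 5000 ? new FreeShipping() : new StandardShipping();
        this.weightInGrams = weightInGrams;
    }

    long shippingCostInCents() {
        // No "if (qualifiesForFreeShipping)" here; the object already knows.
        return shipping.costInCents(weightInGrams);
    }
}

final class DispatchDemo {
    public static void main(String[] args) {
        System.out.println(new Order(6000, 1000).shippingCostInCents()); // prints 0
        System.out.println(new Order(1000, 1000).shippingCostInCents()); // prints 600
    }
}
```

Adding a third shipping policy under this design means writing one more class and touching the constructor, not hunting down every conditional that re-tests the order total.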

Duplication in all its forms can come back to hurt a programmer in the long run. I think that one of the reasons we feel so good when we read the code of the masters is that even the less obvious forms of duplication are nowhere to be found. We may not recognize this reason, but it's there.

Look for these kinds of duplication the next time a piece of code makes you say "Ahh!" or "Ugh." You may be surprised by what you find -- and what you don't. Then think about these kinds of duplication the next time you are refactoring your code. You will surprise yourself by the opportunities you have to improve your program.
-----