14 January 1994
5,500 words

Text retrieval:
How to use and abuse search programs

By Norman Bauman

The Bay Area Rapid Transit system pushed transportation technology to the limit in the 1970s. Trains didn't have conductors; they ran automatically, like self-service elevators.

Unfortunately, BART also pushed technology beyond the limit. A robot car came into the Fremont station at the end of the line, didn't stop, and kept going--through the station and into the parking lot. The lawsuits over this and other problems came to about $250 million.

The ensuing lawsuits pushed litigation support technology to the limit too. Litigators for Bechtel, one of the parties in the suits, fed every document into the IBM STAIRS system, a full-text document retrieval system.

"If you put the full text of the document on the computer, what more could you want?" That was the paradigm of the day, said David C. Blair, now Associate Professor of Computer and Information Systems, Graduate School of Business, at the University of Michigan. It would seem that, with the right search, you could find anything you want.

But STAIRS retrieved only 20% of the relevant documents in the 350,000-page database. Even worse, the attorneys thought they were getting most of the relevant documents when they weren't, Blair concluded in a frequently cited paper, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," that he wrote with M.E. Maron, Professor (Emeritus) of Library and Information Sciences at the University of California, Berkeley (Communications of the ACM, March 1985, 28:3).

(When Blair and Maron wrote their article in Communications of the ACM, they were under a confidentiality agreement, and could not identify the BART case, or disclose specific details of the searches, which are revealed here for the first time.)

The unfortunate incident

The language that people used to refer to a given concept was just too variable. The lawyers would say, "Get me all the documents that talk about the Fremont accident," recalled Blair, who was then a document retrieval specialist at Bechtel, assigned to help the corporate law department. So paralegals would search for terms like "Fremont," "accident," and "train." But most of the relevant documents didn't contain those words.

"People who were really angry about this situation," said Blair, "were very direct in the way they referred to the situation: the 'Fremont accident.'"

"People who felt vulnerable or culpable tended to refer to it in oblique or euphemistic terms: 'The incident of last Tuesday,' 'the unfortunate incident,' 'the unfortunate occurrence of last week.'" The minutes of one meeting didn't mention "Fremont" or "accident" or "train" at all. The opening statement of the minutes was, "We all know why we're here."

"How would you search on that?" asked Blair.

The final judgment on full-text systems

Some attorneys use this study to reject text retrieval entirely, but Blair doesn't go that far. "You have to be a cautious user. It will depend a lot on the kind of question you have to answer." Names are easy, concepts are hard.

The much-cited 20% figure really doesn't represent an inherent limit of free text search either, as attorneys sometimes claim. There were other problems of document management in the BART case that would have reduced the success of any search technique. For example, an Austin, Texas, firm, Brown McCarroll & Oaks Hartline, which routinely handles similar cases, ran into the same problems and figured out how to deal with them.

"The real conclusion of that paper was that full text retrieval was not a simple solution to the problem, it was one of many tools, and had to be used very cautiously," said Blair.

There should be a way

Computers can do wonderful things. They can search for words; they can offer synonyms. Maybe computers could read the words of every document in a database, and select all the documents that refer to a given subject. True, the same concept can be expressed in different words and different ways, but there are only a finite number of synonyms and grammatical constructions. Then we could tell the computer what the issues are, it could scan millions of documents, and give us every relevant document. That way, humans wouldn't have to go through the tedious task of reading, abstracting and coding discovery documents for weeks on end, and lawyers wouldn't miss anything. There should be a way to do it.

That was the optimism of a decade ago. Since then, computers have become more powerful, lawsuits have been getting bigger, and we have the experience of the refined multi-million dollar commercial services like Westlaw and Lexis. Can't we just throw all our documents into a computer and search for the ones we want?

Lessons learned

Unfortunately, daily language turned out to be too complex and variable for that model to work. But it's not a total failure. Years of trying have produced more modest successes, and experience has given us a few rules for what works, and what doesn't work, with text search programs.

The most important lesson is that small databases work better than large ones, and irrelevant documents clutter your searches. So you shouldn't fill up your database with junk documents that you'll never use, merely because disk storage is relatively cheap.

The second lesson is that the better you understand the structure of your case, and the origin and significance of your documents, the more easily you can find your documents. You can search large full-text databases, provided you organize them well at the beginning.

"You don't get something for nothing," said John C. Tredennick, Jr., litigation partner at Holland & Hart, Denver, CO, and past chair of the American Bar Association's Litigation Interest Group. "No computer is going to do the work for you. You've got to study the documents and make decisions about what's relevant."

Beyond that, you can search for legal documents more easily than for discovery documents. You can search for your own documents more easily than for documents you have never seen before. You can search for names and dates more easily than concepts. Boolean searches are clever--but sometimes too clever, and they don't work as well as people thought they would.

3 good applications

Lawyers have found a few good text search applications for their own in-house documents.

First, one of the most effective applications seems to be document assembly. You can retrieve your own personal or firm documents for drafting new documents. Alternatively, you could use an electronic form book, but for many lawyers, it's easier to find previous language with a text search program and paste it into a new document.

Second, after you have organized your documents into a structured collection, such as a trial notebook, you can use text search programs as another tool to find things that you know are in there.

And third, attorneys usually have their computer and paper files in a reasonably well-organized filing system, but sometimes you still have trouble finding a document, and text searching gives you another way to look. If you remember a name, you can find the document.

A fourth application is more problematic and certainly more difficult: Scanning through discovery document collections that are too big to read, and finding useful information--maybe even a smoking gun. You can scan for names and standard phrases, but you can't reliably find concepts. Computers can help you skim, but they can't read for you.

In huge document cases, lawyers want to find every document that's relevant to every point in their case. Computers can't do that.

Armies of munchkins

Text search programs are limited, but how effective are the alternatives? Some lawyers don't trust full text retrieval, so they're willing to fall back on brute force. Attorneys tell Blair, "If the case gets big enough, we'll just hire armies of munchkins to look through this stuff." The munchkin can write abstracts using a controlled vocabulary, and assign codings and categories to the documents for easy retrieval. That way, one word will correspond to one concept, and you can find anything you want.

"There are two problems with that," said Blair.

First, the munchkin has to make decisions about evidence and relevance, so the lawyer is delegating legal judgment to non-lawyers.

Second, the lawyer is giving directions to the munchkin that are "basically the same kind of direction you would give to a full text retrieval system." So the coding and abstracting are subject to the same kind of limitations. And there are inherent limitations to indexing.

"It's very difficult to say precisely what you're looking for," said Blair. "You can't tell a munchkin, 'Look for a smoking gun, look for a letter that says they knew what they were doing.' Evidence is something that you can't often describe ahead of time. It has to be a recognition: 'Aha! I didn't know he said that.'"

Neither one is superior

"Back in the '70s and early '80s, there was a large amount of research comparing free text with controlled vocabulary indexing," said Howard Turtle, Ph.D., Principal Research Scientist at West Publishing Co. "A summary of the research is that neither is superior. They're both about as good, but you find different cases using each technique. Your best bet is to use both."

"Performance will depend on the quality of the indexing, and the quality of the searches," said Turtle. "A really well-indexed document collection, being searched by someone who knows the vocabulary very well, will work much better than a novice on the same collection, or a searcher in a poorly indexed collection."

"Manual indexing is guaranteed to be hard to use for big collections," said Turtle, "and to some extent any kind of indexing is going to be hard to use for big collections."

"You can never be sure you've got it all," said Turtle. "You need to do your best to try to state your question in different ways."

The final compromise

The BART litigation was 20 years ago. Hardware and software have improved, and lawyers have more experience. Brown McCarroll & Oaks Hartline, a 120-lawyer defense litigation firm in Austin, TX, faced some of the same problems, and worked out solutions and more efficient compromises.

"We handle pretty big cases," said Leslie Webb, Software Support Specialist. "They can be in the millions of documents," she said. For litigation support, they index their documents with standard fields in Paradox or Foxpro, and then import the data into Folio Views.

"For a while we were scanning everything," said Webb. An outside consultant scanned and imported the documents into Folio Views, which creates a searchable file called an Infobase. "If you have full text, you have everything you need," they assumed.

Or at least, that was the original idea. When the attorneys and paralegals searched, "they would get so much that they didn't know how to narrow it down," said Webb. "And there's a cost consideration." The consultant charges 75 cents to $1.50 per page for full text, depending on the quantity and number of fields.

Now, their scanning is selective. The full text of depositions goes into the Infobase, as do the Paradox and Foxpro indexes, and they image medical records, but they scan into full text only the documents that are particularly important.

Sometimes they can get word processing text directly during discovery. One of their discovery questions is, "Do you have this available on disk?" They don't get it very often, "but it has been done," said Webb.

How do you want your menu to look?

"We group things together," said Webb. A Folio Views Infobase contains the full indexed text (compressed to as little as 30% of the size of the original file). The opening screen contains a menu of groups of files, and a text searching window. With a Boolean search engine, you can search the entire text, or you can search one of the groups. They typically group documents according to document type, such as deposition, pleading, correspondence, discovery, or source of documents, and sometimes plaintiff and defendant. "It's not just dumped into an Infobase and searched," said Webb.

"We usually sit down with the attorney and ask what they want," said Webb. "They give us an outline of the menu they want to have, and we put the document in wherever it fits."

Folio Views is easy to use, said Webb. "But to author a real good Infobase is difficult, because if you don't know what you're doing, if you don't know how to group things, how to link things, you'll never find what you're looking for. On the front end there's a ton of work involved. You're only going to get out of it what you put into it."

Typically the attorneys use a fairly easy search, and so they get bigger hits, said Webb. Attorneys might search medical records for "headache" or "head." Legal assistants "might use a bit more complicated search, to try to narrow it down," she said. "They'll search for names, phrases, and use proximity searches a lot."
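
A proximity search of the kind Webb's legal assistants rely on matches two terms only when they fall within a set distance of each other. The sketch below is a minimal illustration in Python, not the behavior of Folio Views or any other product named here; the function name `within`, the word-based distance, the single-word terms, and the sample sentence are all assumptions made for the example.

```python
import re


def within(text, left_terms, right_terms, distance):
    """Return True if any left term occurs within `distance` words of any right term.

    Simplification: terms are single words, and distance is counted in words.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    left = [i for i, w in enumerate(words) if w in left_terms]
    right = [i for i, w in enumerate(words) if w in right_terms]
    return any(abs(i - j) <= distance for i in left for j in right)


# Hypothetical sentence from a pleading.
pleading = (
    "Dismissal of the employee for lack of diligence in service "
    "bars the claim against the employer."
)

loose = within(pleading, {"dismissal"}, {"employer"}, 40)
tight = within(pleading, {"dismissal"}, {"employer"}, 5)
either = within(pleading, {"dismissal"}, {"employer", "employee"}, 5)
```

Here `loose` and `either` match, while `tight` does not: a narrow window cuts down the hits, but it also drops documents in which the two words happen to sit far apart.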

"The most important thing for full-text searching is discovery responses and searching for a witness' name," said Webb. Attorneys also search for repeated themes in the depositions.

Some attorneys like to keep everything they can in an Infobase. But that takes a lot of work. "Usually they have a full-time person working on keeping everything grouped appropriately," said Webb.

Rings a distant bell

"How to deal with massive amounts of text is still an unsolved problem," said David Thede, a programmer who became a lawyer and wrote a popular shareware search program called dtSearch. "There is no perfect solution to the millions-of-documents problem where you're trying to find one key concept. All you can do is try different things, and different strategies work in different situations."

"Text searching works well if you're looking for a document with a person's name," said Thede. "Then it works like a charm." It works well if you're looking for a subject matter where people tend to use the same terms consistently, such as terms of art in law, he said. "It works less well if you're looking for a concept that people express in lots of different ways."

Thede drew a distinction between, first, searching your own personal documents, and second, searching a database that is unfamiliar.

The "looking-for-something-I-forgot" search of your own work is much easier, said Thede. "Somebody asks you a question, and it rings a distant bell," he said. "You remember you did something on it a few years ago. You remember enough to search on a name, or a concept."

"In many cases," said Thede, "I can find the answer while I've got the person on the phone."

And indeed searching your own documents is one of the most popular and successful applications of text search programs in the law office.

Finding your own previous documents

David's wife Elizabeth Thede was staff attorney for the Federal Reserve Board in Washington, in the Banking Structure Section of the legal division, for four years until she quit to market dtSearch. At the Fed, she implemented the Bank Holding Company Act, and related statutes. She reviewed applications of bank holding companies to merge, to acquire, and to engage in new activity. And she tested out her husband's program.

"I don't know anything about computers, so I was the computer-illiterate tester," she said. "I don't even know DOS."

She would typically review an application for compliance with the law, and then write a memorandum to her supervising attorney, or to the Board, discussing any relevant legal issues that arose.

"Suppose the Community Reinvestment Act (CRA) were an issue in an application," she said. "Suppose a community group protested an application alleging that the Home Mortgage Disclosure Act (HMDA) data indicates discrimination against minorities." She would then write a memo addressing that issue.

"I would want to find the language I used in the last order on that CRA issue," she said. Not only is it easier, but the Fed wanted her to be "as conservative as possible and use as much precedent as possible," so she tried to use the same language as much as possible. Using dtSearch, she would search for:

(community reinvestment act or cra) and (home mortgage disclosure act or hmda)

"A little box pops up with search results sorted by name, date or number of hits," she said. "I would run it by reverse chronological order, so the newest ones were on top." She could view the documents in a window, copy the language from different files, and save the text to an ASCII file for import into WordPerfect.

This illustrates two of the ways text search programs work best: first, searching your personal documents, second, searching for concepts that can be defined by standard legal terms.
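
The query above has a simple shape: alternatives are ORed inside each pair of parentheses, and the two groups are ANDed together. As a rough sketch of that logic in Python (not dtSearch's actual implementation; the function names, file names, and sample memos are hypothetical):

```python
import re


def normalize(text):
    # Lowercase and keep only letters and digits, padded with spaces,
    # so that "CRA," matches the phrase "cra" on word boundaries.
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def matches_all_groups(text, groups):
    """True if the text contains at least one phrase from every group."""
    padded = normalize(text)
    return all(any(normalize(p) in padded for p in group) for group in groups)


# Each inner list is an OR group; the groups are ANDed together.
QUERY = [
    ["community reinvestment act", "cra"],
    ["home mortgage disclosure act", "hmda"],
]

# Hypothetical memos standing in for a directory of prior orders.
memos = {
    "order_a.txt": "The protest cites HMDA data in its CRA challenge.",
    "order_b.txt": "The applicant's CRA record is satisfactory.",
}

hits = [name for name, text in memos.items() if matches_all_groups(text, QUERY)]
```

Only `order_a.txt` satisfies both groups. The search works here because the statutes have fixed names and fixed abbreviations, so a short list of alternatives covers nearly every way an attorney would refer to them.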

Form book on the fly

John Goudge, litigator and system administrator at Rodriguez and Villalobos, Chicago, uses another popular search program, ZyIndex, to retrieve his work product, and his colleagues' work product, from the firm database. "It gives me a combination corporate memory, and form book, and style book, without having to construct one," he said. The firm has three databases: one on municipal bond questions; one on general contract law, with form contracts, releases, and a sample book; and one on litigation, with "every significant pleading or research memo and certain significant cases downloaded from Westlaw or Lexis," he said.

For example, Goudge wanted to look up a rule about prompt service to an employer and employee. One was promptly served; the other was not promptly served and is therefore not liable. Is the remaining party liable? So he searched for:

(diligence and dismissal) w/40 (employer or employee or principal agent or respondent superior) and w/40 (estoppel or res judicata)

"ZyIndex has a thesaurus," said Goudge. "You cursor on 'employee' and get a whole bunch of synonyms." In five minutes, he found a pleading on point, even though it was done four years ago by an attorney who has since left the firm. The answer: "Neither of the two can be liable," said Goudge.

"Lawyers tend to talk with certain buzzwords," said Goudge. "'Open and obvious.' 'Trespasser.' 'Licensee.' 'Invitee.' If you're talking about a bus, it's 'highest duty of care.' You hit these key words and it comes on out. I know what the jargon is that's used by lawyers that practice in this area, and by judges when they write opinions in this area."

Fact situation: Swings

"Typically we do work for school boards," said Goudge. "They have certain immunities involving playgrounds, and standards of misconduct involved, so we can very quickly look up previous motions for summary judgment or motions to dismiss." He would typically want to argue the issue of whether it was willful, wanton, or negligent, and whether the school was immune. He would call up "Knox College," an important case, and that would suggest useful keywords. On the fact issues he could search for "monkey bars," "slides," or "swings." On the legal issue he could search for "willful misconduct," "negligence," and "immune," perhaps narrowing it with "playground." After copying from his previous work and recent cases, "I have 90% of my motion complete," he said. "Maybe I hit Shepard's."

This illustrates another way in which text searching works well: finding fact situations, like those associated with "swings."

Research search

In contrast is the research search. "Instead of looking for my stuff, I'm looking for other people's stuff," said Thede. "It's a lot more difficult, because I have less of an idea of what I'm looking for, so I try different things till I find things that are on point."

"The classic example is the entire body of U.S. case law," said Thede. "It is just tremendous."

A common technique is to perform one search, retrieve some documents, see what phrases appear, and use those phrases for another search. This is another technique that looks easy, and works well on documents that are written by lawyers, but can break down in large discovery databases.

"The general concept is to look at the pattern of words in the document and turn that into a manageable thing that you can store and search on," said Thede. You examine the documents, and try to find words that are relatively frequent in those documents but relatively infrequent in the entire database, he explained. It seems possible to write an algorithm that would allow a computer to do that automatically, but that goal has been elusive.

"It takes more intelligence than computers have today to recognize those documents, because you have to know the English language," said Thede. "You have to know what the words mean. The next frontier in text retrieval is to actually have the computer understand, in some sense, the content of the document, rather than treating it as a bunch of words it doesn't understand at all."

That frontier is a long way off. But, said Thede, "I think there are intermediate points."
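
The word-pattern idea Thede describes, finding words that are frequent in a handful of on-point documents but rare in the database as a whole, can be approximated with a plain frequency ratio. The following is only a sketch under stated assumptions: the scoring formula, the function names, and the four sample documents are invented for illustration and are not taken from dtSearch or any other program, and the sample documents are assumed to be part of the full collection.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(sample_docs, all_docs, top_n=5):
    """Rank words by how much more often they occur in the sample than overall.

    Assumes every sample document is also in all_docs, so each sample word
    has a nonzero count in the overall collection.
    """
    sample = Counter(w for d in sample_docs for w in tokenize(d))
    overall = Counter(w for d in all_docs for w in tokenize(d))
    sample_total = sum(sample.values())
    overall_total = sum(overall.values())
    scores = {
        w: (count / sample_total) / (overall[w] / overall_total)
        for w, count in sample.items()
    }
    # Highest ratio first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical collection; the first two documents are the ones found on point.
all_docs = [
    "the brake design review was delayed",
    "the brake supplier revised the design",
    "the steel order was delayed",
    "the parking lot was repaved",
]
on_point = all_docs[:2]

suggested = distinctive_terms(on_point, all_docs)
```

Common words such as "the" and "was" score low because they appear everywhere, while "brake" and "design" rise to the top. A ratio like this can suggest new search terms, but it still treats the text as a bag of words; it does nothing to address the understanding problem Thede raises.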

Uncanny searching

One program that seemed to find associated words automatically was Lotus Agenda, which is still sold and supported but not updated. David Beckman, of Beckman Hirsch & Ell, Burlington, IA, used Agenda for litigation databases. "We had a burn victim in an explosion," he recalled. A year or so into the case the issue of bone pins came up. The doctors had drilled holes in the client's knee and ankle to insert pins while the bones healed, and the pins had to come out. "They were painful," said Beckman. "We wanted to know how painful. What are these pins, where are they and what did they do with them?"

"We searched through our Agenda data base," said Beckman. "All kinds of things came up--interviews with the nurses, depositions, interviews with families. One of the nurses had to leave the room because she couldn't stomach to watch it." This material was very useful in negotiation.

"We came up with things that didn't have the word 'pins' in them," said Beckman. "I couldn't figure out how Agenda found them. The manual says that, if there are items out there that don't match the word 'pin,' but have a lot of other items you've selected, it'll pick that item." For example, there were names of instruments used in taking the pins out.

"Nowhere in this data base did we set up a keyword that said 'pin,'" said Beckman.

Find "trap correction"

Information scientists have been trying in vain to create programs that would operate like that consistently. At least Beckman was searching lawyer-written documents, which have considerable redundancy.

The documents in a large unstructured discovery database, like the one used in the BART litigation, are far more difficult to search, because engineering terms are not as standardized, and engineers don't write as redundantly as attorneys. Engineers will frequently write to each other about mutually understood concepts which need not be explicitly mentioned at all.

Blair often found himself following "a trail of linguistic creativity through the database" as he described in his paper. In searching for documents discussing "trap correction," they discovered other documents referring to it as "wire warp." Other documents referred to it as the "shunt correction system." The inventor was named "Coxwell" and documents he had written were in the database, but Coxwell referred to it as the "Roman circle method." The system had been tested in another city, where it was referred to as an "air truck." Finally, after 40 hours of searching, with "no reason to believe that we had reached the end of the trail," they ran out of time and quit.

Find "Quantities of steel"

Anyone who does text searches knows that synonyms are a problem. But the problem is harder than it seems.

One allegation was that a construction company had ordered excessive quantities of steel, and reabsorbed the steel into their inventory, said Blair. The lawyers wanted to search all the documents for the key phrase, "steel quantity." But engineers don't use the term "steel quantity." They will refer to "girders," "beams," "braces," or "frames." You have to know that these are all steel, and you have to translate your requests into those terms, he said.

"OK," some lawyers say, "you need a thesaurus." Many text search programs will expand your search term with a default dictionary of synonyms, or let you build a custom dictionary. But that didn't work either with this large database.

STAIRS had a thesaurus, said Blair. "They hired an engineer to spend a year and a half," to manually decide that one word was related to another word. "He did a pretty good job." But even with the thesaurus, the system "was not able to retrieve a single document that was relevant that could not be retrieved in the other way."

"The problem was that the engineers and lawyers used somewhat different vocabularies to talk about the same things," said Blair. "A thesaurus based on the engineers' vocabulary missed many of the words and semantic relationships that existed in the lawyers' vocabulary," he said. The engineer linked the terms "girders," "beams," "braces," and "frames," but he didn't link them to "steel quantity." Engineers, said Blair, don't talk about "quantities of steel."

It turned out that the documents relevant to the steel issue were the bills of materials, which are listings of the material delivered by a subcontractor. "What you didn't want was people just talking in general about steel," said Blair.
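
The thesaurus gap is easy to reproduce in miniature. In the sketch below, which is an illustration only and not a model of how STAIRS stored or applied its thesaurus, the dictionary layout, the function names, and the two sample documents are all assumptions. A thesaurus built from the engineers' vocabulary expands "beams" nicely but has no entry at all for the lawyers' phrase.

```python
STEEL_PARTS = {"girders", "beams", "braces", "frames"}

# Hypothetical engineer-built thesaurus: each part is linked to the other parts.
engineer_thesaurus = {term: STEEL_PARTS - {term} for term in STEEL_PARTS}


def expand(term, thesaurus):
    """Return the term plus any terms the thesaurus links to it."""
    return {term} | thesaurus.get(term, set())


def search(term, docs, thesaurus):
    # Simple substring matching on lowercased text, for illustration only.
    terms = expand(term, thesaurus)
    return [
        name for name, text in docs.items()
        if any(t in text.lower() for t in terms)
    ]


docs = {
    "bill_of_materials.txt": "Delivered: 40 girders, 12 braces",
    "memo.txt": "Schedule for station paving",
}

engineer_hits = search("beams", docs, engineer_thesaurus)
lawyer_hits = search("steel quantity", docs, engineer_thesaurus)

# Adding the missing link from the lawyers' phrase to the engineers' terms.
bridged = dict(engineer_thesaurus)
bridged["steel quantity"] = set(STEEL_PARTS)
bridged_hits = search("steel quantity", docs, bridged)
```

With the engineers' thesaurus alone, `lawyer_hits` comes back empty even though the bill of materials is exactly what the lawyers want. Only someone who knows both vocabularies can add the bridging entry, which is the upfront intellectual work Blair argues cannot be skipped.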

But, had they known, one search in the BART database would have been easy. People occasionally referred to a critical, embarrassing issue as a "smoking gun." So all you had to do was search for the key words "smoking gun," and you'd have it.

3 lessons learned from the BART litigation

There are three basic principles in document searching, said Blair. "One is to keep the document collection as small as possible, and include only the relevant documents," he said. "Two is to do some serious upfront work about what the intellectual structure of the lawsuit is, what is relevant and what isn't, what the issues are. Three is, a lot of lawsuits nowadays are just big. If you've got a big document collection, try to partition it into sets of small document collections."

"When a lawsuit occurs revolving around the activities of people in a particular department, very often the law firm will say, 'Put everything that came out of that department during this time frame onto the computer system.' That's a bad move. Decide what's relevant, don't just gather everything," said Blair. "Everything can't be relevant."

If the documents start out with any kind of organization, keep that organization available, said Blair. For example, if you get documents in response to discovery questions, code them in a field to indicate the questions.

Following the structure of the case

Losing the original organization was one of the mistakes of the BART suit. For example, the complaint had 13 specific issues, such as the Fremont accident or the steel quantity, and the discovery documents were in response to each of those 13 issues, said Blair. Instead of keeping the documents distinct by issue, they merged the documents into one big collection, without being able to separate them again. There was no "issue" field in the record.

If they had kept the 13 issues distinct, then instead of searching the entire 40,000 documents, they could have directed their queries to a much smaller partition of documents, all relevant to, say, the Fremont accident, explained Blair. Searching the partition, "you can tolerate more ambiguity in the queries, and more sloshing around in the searching, because everything in the smaller database is related to an issue."
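
The effect of an "issue" field can be sketched in a few lines. This is a hypothetical record layout in Python, not the structure of the BART database or of any litigation support product; the class, the field names, the issue labels, and the sample documents are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: int
    issue: str  # hypothetical field, recorded when the document is produced
    text: str


def search(docs, word, issue=None):
    """Return ids of documents containing `word`, optionally within one issue."""
    word = word.lower()
    return [
        d.doc_id for d in docs
        if (issue is None or d.issue == issue) and word in d.text.lower()
    ]


docs = [
    Document(1, "accident", "Minutes: the incident of last Tuesday was discussed."),
    Document(2, "steel", "An unfortunate incident delayed delivery of the girders."),
    Document(3, "accident", "Report on the train that overran the station."),
]

whole_collection = search(docs, "incident")
accident_only = search(docs, "incident", issue="accident")
```

A vague word like "incident" pulls in the steel memo when it is run against the whole collection, but inside the accident partition every hit is at least about the right subject. That is why a looser query becomes tolerable once the partition is small.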

Documents are often clustered together to perform a particular activity, such as a contract negotiation, said Blair. So you should be able to link those documents together, even though they may not have a distinctive key word in common. In the BART suit, for example, brake design was an issue, and the attorneys wanted to follow the negotiations between prime contractor and subcontractor over brake design. For the delivery of steel, the cluster of documents that recorded the quantity and price was the bills of materials.

Big is bad

"The size of the system is going to seriously affect performance," said Blair. "If you have a good document retrieval system, the best way to turn it into a bad one is just make it bigger."

If you submit a query to a collection of 1,000 documents, you might get 50 documents back. "That's not a problem," he said. "I'll just paw my way through." But if you submit that same query to a collection of 100,000 documents, you might get 5,000 documents back, which is useless. This problem of information retrieval is known as "output overload."

The software enables you to add restrictive Boolean terms to a query until it reduces the responses to a manageable number. But with each new term, relevant documents are excluded. Most people realize that they're sacrificing something, but when Blair worked out the mathematics, the extent was "quite startling." With five search terms, using some reasonable assumptions, a query should yield only 1 relevant document in 1,000, he calculated.
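
Both effects come down to simple arithmetic. The sketch below works through them under loudly stated assumptions: the hit rate is taken from the 50-in-1,000 example above, and the chance that any one search term appears in a given relevant document is set at 25%, treated as independent from term to term. That 25% is an illustrative value chosen for this example, not a figure from Blair's calculation, and the function names are hypothetical.

```python
def expected_hits(collection_size, hit_rate):
    """Documents returned if the same fraction of the collection matches."""
    return round(collection_size * hit_rate)


def recall_after_and(per_term_recall, n_terms):
    """Fraction of relevant documents that survive n ANDed terms,
    assuming each term independently appears in `per_term_recall`
    of the relevant documents."""
    return per_term_recall ** n_terms


# Output overload: the same 5% hit rate on a collection 100 times larger.
hit_rate = 50 / 1_000
small = expected_hits(1_000, hit_rate)
large = expected_hits(100_000, hit_rate)

# Narrowing with AND: illustrative 25% per-term figure, five terms.
surviving = recall_after_and(0.25, 5)
```

With these inputs, `large` is 5,000 documents, and `surviving` is about 0.001, or roughly one relevant document in a thousand. Each added term shrinks the output to a readable size, but it multiplies away the relevant documents at the same time.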

Searching at trial

Search programs work best on familiar documents that are well-organized, and few people are better organized and more familiar with their material than a good trial lawyer in court. "I've got to know every aspect of this case cold before I go to trial," said Harold Goldner, of Ellis Jacobs & Associates, Philadelphia. The computer gives him another tool for organizing and finding things. "With the computer you can do a real quick keyword search," he said. "The witness says the light was green, you can look at the page where he said the light was red."

Goldner has used InMagic on a Gateway 2000 Nomad laptop at depositions for over a year now. He's been preparing his computer for trial, but so far the opponents have settled first.

The last one was the "music video case," Frank v. Whitesnake, in the Eastern District of Pennsylvania. Goldner's client was a photographer who claimed to have fallen on electrical cabling which was concealed beneath the stage during the shooting of a music video. She fractured her wrist, which did not heal properly and interfered with her work.

The key word was "cable," said Goldner. "When I typed in 'cable', I was overwhelmed with responses," he said. He narrowed it down by looking for references to the color of the cable. "The witness said the cable was orange. Everyone [on the defense] denied that there was anything but black cable."

Revenge of the nerds

Goldner took 11 depositions in Los Angeles, got them on floppy disks, and loaded them into 11 databases in InMagic. He used that information to, first, prepare for trial and second, search during cross-examination instead of thumbing through deposition digests. "It's the equivalent of having somebody sitting beside you and saying, 'Here it is.' But you've got to know that the document exists."

"My opponents were aware of my clicking through the deposition digests, and, with my Sharp Wizard, and cellular phone, they regarded me as something of a nerd," said Goldner. They made jokes about what would happen if he were struck by lightning. But juries are prepared for computers, he thinks. "LA Law routinely shows lawyers at counsel table with laptops," he notes. When the judge said, "Let's pick a jury," he said, "Fine, I'm ready to go." The computer helped him call their bluff, and they settled, Goldner says.

Goldner keeps a log of requests and answers to interrogatories, and other trial details, in his InMagic database, so when someone claims he never sent an answer, he can point to the file. "A lot of insurance adjusters do that on their computers too," he said. "Occasionally we'll get into a log-reading war."

Contradictions at trial

John C. Tredennick, Jr., litigation partner at Holland & Hart, Denver, CO, described how he defended the owners of a nursing home who were sued by their former management company. At trial, a witness unexpectedly testified, "I don't know why they claim we were competing. We sent 66 patients to their nursing home in the past two years." Tredennick, who had all of his 17 discovery pleadings in a Summation II database on a Zenith notebook, was sure he had asked about the number of patients and gotten a different answer. His co-counsel objected, and Tredennick tried to find it.

With the judge waiting, Tredennick searched for "patients" within 4 lines of "sent"--and found nothing. So he searched again, this time simply for "patients," and found interrogatory number 17, in which the plaintiff's president had sworn that they had only sent nine patients. He could have found the same interrogatory manually, overnight, but it had much more impact to point it out immediately, he felt. The judge was displeased at the contradiction. The plaintiffs lost and Tredennick's client won a $14 million counterclaim.