Two of the main problems with existing approaches are the reliance on virtual printers as a pagination mechanism and the use of TIFF as the primary paginated format. The virtual printer mechanism introduces an unnecessary bottleneck through the Windows print system. It also introduces anomalies in the output, particularly for e-mails in the commonly used PST file format: the paginated e-mails carry a "dummy" user account name as a header because of the way Outlook natively handles printing. TIFF files present a one-way problem: a more text-appropriate format such as PDF can be converted to TIFF without issue, but converting TIFF to PDF leaves you with the same file size and OCR problems the TIFF had to begin with. In comparison, PDF files containing primarily textual information are relatively small and closer in size to the native ESI.
In addressing the pagination problem with Discere, I chose PDF as the primary pagination format for the reasons just explained. PDF has numerous benefits, including widespread support in many applications, small file sizes for primarily textual documents, and easy conversion to various image formats if required. In lieu of a virtual printer architecture, I use the native associated application itself to do the conversion where possible, and otherwise build the paginated PDF output programmatically where the native application does not support direct conversion.
Where existing approaches utilize proprietary applications, I instead use open source components. This substitution produces benefits in both cost and functionality. In terms of cost, open source components remove the need for additional software licenses per node. Technologically, open source components make it possible to interface directly with the application instead of relying on a specific vendor to allow programmatic incorporation. Open source also allows Discere to leverage the significant development investment of the open source ecosystem to build a robust processing application without reimplementing rendering components for each file type. In addition, Discere benefits from the continued development of the open source components it relies on whenever those components are updated.
9.3.1
In litigation, e-mail is predominantly found in Personal Storage Table ("PST") format files, generated either by Outlook as a local mail store or as an export from an Exchange server instance. PST files created before the 2003 switch to 64-bit addressing were limited to 2GB, while post-2003 files have a much larger maximum (20-50GB by default). By using an appropriate library implementation[22], the internal content and folder structure of the PST file can be navigated without an Outlook installation being present. In existing industry approaches, the use of Outlook for PST processing restricts such tools to the Windows platform. Reliance on Outlook also introduces formatting artifacts: Outlook prints the user profile name at the top of each email, and the use of dummy profiles for discovery processing causes these artifacts to appear regardless of the email's source.
PST files contain emails, appointments, contacts, and other information. The contents, such as emails, may themselves be thought of as containers, as is the case when emails have attachments. Processing PST files requires iterating through the contents while preserving parent-child relationships, then treating each contained item as a potential container (see figure 9.3.1 on the previous page). Emails consist of more than just visual elements. In rendering a paginated email, the visual cues of contained data sources (such as attachments) must be preserved, the individual child attachments must be extracted for further processing, and the data sources must be parsed into the human-readable fields (to, from, cc, bcc, subject, date, body, etc.).
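The recursive treatment of containers described above can be sketched as a depth-first walk that records each item's parent. This is a minimal illustration only: the `Item`, `ItemRecord`, and `ContainerWalker` types are hypothetical names introduced here, not part of Discere or of any PST library, and a real implementation would obtain the children from the library's folder and attachment accessors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node in a container tree (PST file, folder, email, attachment). */
final class Item {
    final String name;
    final List<Item> children = new ArrayList<>();

    Item(String name) { this.name = name; }

    /** Adds a child and returns this item so trees can be built fluently. */
    Item add(Item child) { children.add(child); return this; }
}

/** Flattened result: the item's path and its parent's path (null for the root). */
final class ItemRecord {
    final String path;
    final String parentPath;

    ItemRecord(String path, String parentPath) {
        this.path = path;
        this.parentPath = parentPath;
    }
}

final class ContainerWalker {
    /** Depth-first walk that treats every item as a potential container. */
    static List<ItemRecord> walk(Item root) {
        List<ItemRecord> out = new ArrayList<>();
        visit(root, null, out);
        return out;
    }

    private static void visit(Item item, String parentPath, List<ItemRecord> out) {
        String path = (parentPath == null) ? item.name : parentPath + "/" + item.name;
        out.add(new ItemRecord(path, parentPath)); // parent link is preserved here
        for (Item child : item.children) {
            visit(child, path, out);               // each child may itself contain items
        }
    }
}
```

Storing the parent path on every record allows an attachment to be traced back to its email, and an email to its folder, after the tree has been flattened for downstream processing.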
MSG files must be handled in a similar way to PST files in that they too are email containers. Although each MSG file holds a single message, that message may contain other MSG files as well as other attachments. As with PST files, existing libraries provide access to the underlying data structures in the MSG file, so they can be rendered more efficiently than by relying on Outlook.[3]
9.3.2 Office Documents, Images, & Drawings
Office documents, file types originally associated with Microsoft Office and now more widely supported by various productivity suites, make up the bulk of the file types typically handled in discovery besides emails. In many ways, email and office documents go hand in hand, as such files are plentiful in email corpora as attachments. There are substantial differences among their formats: the pre-2007 binary formats in the old Object Linking and Embedding ("OLE") structure, the 2007/2010 XML-based format, and the "open formats" such as the Open Document Format for Office Applications (ODF), first created by the OASIS consortium and later adopted as an ISO standard.
Originally, Discere relied on OpenOffice for Office documents via the Universal Network Objects (UNO) API. Subsequently, Discere incorporated an open source library, JODConverter, which is centered on controlling OpenOffice processes for conversion purposes.[23] Later still, following the acquisition of Sun by Oracle, the OpenOffice project split into OpenOffice (now an Apache project) and LibreOffice (the fork). Discere currently works with either project, and in some cases uses both, because one or the other cannot handle a particular file owing to its content or an existing bug. With each release of these projects, new features become supported and performance improvements are realized.
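Using both suites for the same file amounts to trying conversion backends in order until one succeeds. The sketch below illustrates that pattern under stated assumptions: `PdfBackend` and `FallbackConverter` are hypothetical names, and each backend stands in for a wrapper around one office suite; the actual JODConverter or UNO calls are deliberately not shown.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical conversion backend, e.g. a wrapper around one office suite. */
interface PdfBackend {
    /** Returns PDF bytes, or throws if this backend cannot handle the input. */
    byte[] convert(byte[] input) throws Exception;
}

final class FallbackConverter {
    /** Tries each backend in order and returns the first non-empty result. */
    static Optional<byte[]> convert(byte[] input, List<PdfBackend> backends) {
        for (PdfBackend backend : backends) {
            try {
                byte[] pdf = backend.convert(input);
                if (pdf != null && pdf.length > 0) {
                    return Optional.of(pdf);
                }
            } catch (Exception e) {
                // This backend failed on the file; fall through to the next one.
            }
        }
        return Optional.empty(); // no backend could convert the file
    }
}
```

An empty result signals that the caller should route the file to another handler rather than treat the failure as fatal.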
LibreOffice and OpenOffice also support certain file types that are not traditional Office documents, including certain vector image formats such as DXF files (see footnote 8). While individual image files are generally better handled directly, DXF support is very important for data sets from engineering companies, where such files are prevalent.
9.3.3 ASCII or Unicode Text
Text itself is easily handled through the PDF libraries themselves.[2, 19, 21] By handling the text as a direct addition, it remains searchable in contrast to approaches which convert such information into images. This both decreases the resulting file size, and removes errors associated with later OCR used to recapture the lost textual information.
9.3.4 Binary Files
Binary files that are not presently supported, or whose format does not lend itself to pagination, are handled in three ways. First, the Apache Tika library[5] provides support for extracting textual and metadata information from a wide variety of file types. Many of its supported file types overlap with those better supported by other parts of the system; in such cases Tika is used only as a fallback for failed conversions.
Second, where textual data cannot be extracted from the binary file, an extraction of ASCII string data is attempted. Third, as a final failure handler, placeholder files are used for files not amenable to either Tika or string extraction.
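The ASCII string extraction step can be illustrated with a short routine in the spirit of the Unix `strings` utility. This is a sketch rather than Discere's actual implementation; the class name `AsciiStrings` and the `minLength` parameter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class AsciiStrings {
    /** Returns runs of at least minLength printable ASCII bytes (0x20 to 0x7E). */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            // Java bytes are signed, so values of 0x80 and above are negative
            // and fail the lower bound check.
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);
            } else {
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0); // a non-printable byte ends the current run
            }
        }
        if (run.length() >= minLength) {
            found.add(run.toString()); // keep a run that reaches the end of the data
        }
        return found;
    }
}
```

If the returned list is empty, the file would fall through to the placeholder handler described above.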
9.3.5 Archives
Archive files, or file containers, contain other files but are not themselves clearly suitable for meaningful pagination. I have chosen to render these files by constructing PDF listings of their available metadata and of the files contained within them. The parent/child information is recorded so that contained files can be located from a given container entry, and so that the container can be located when a specific file of interest is encountered. Archive support for a variety of formats is available as part of the base Java API.
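As one illustration of archive support in the base Java API, the sketch below lists the entries of a ZIP stream with `java.util.zip` and prefixes each with its container name so the parent/child link is kept. The `ArchiveListing` class and the `container!/entry` line format are assumptions for the example; rendering the listing to PDF is omitted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ArchiveListing {
    /** Returns one line per entry, formatted as "container!/entryName (kind)". */
    static List<String> list(String containerName, InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String kind = entry.isDirectory() ? "dir" : "file";
                // The container name in each line records the parent of the entry.
                lines.add(containerName + "!/" + entry.getName() + " (" + kind + ")");
            }
        }
        return lines;
    }
}
```

Entry sizes are left out because, when reading a ZIP as a stream, the size may not be known until the entry's data has been consumed.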
Footnote 8: DXF files are commonly exports of AutoCAD formats that are more widely supported due to the proprietary nature of the .DWG format.