III: Small: Issues in Understanding, Indexing, Querying, and Visualizing Spatio-Textual Spreadsheets on the Web

Numerous organizations, including government agencies, are sitting on mountains of spreadsheet data. Such data are becoming increasingly common on the web, but their contents remain out of reach of search engines because direct links to the contents of their constituent cells are rare. Spreadsheet data thus represent legacy databases, especially since many of their underlying schemas are no longer accessible. The goal of this research is to discover the schema according to which a spreadsheet is constructed. The focus is on the spatio-textual spreadsheet, which is a spreadsheet in which the values of the spatial attributes are specified textually. Such spreadsheets support spatial searches whose output is visual and whose utility is enhanced by the ability to handle spatial synonyms. This is done, in part, by devising methods to automatically discover the spatial attributes of the spreadsheet and to distinguish between several instances of them, which arise due to the presence of a containment hierarchy. In particular, use is made of spatial coherence: spatial data in the same column are usually of the same spatial type, while spatial data in the same row usually exhibit a containment relationship. Moreover, adjacent or nearby rows exhibit spreadsheet coherence in that they are usually similar. The broad impact of this research is to make spreadsheet data a first-class citizen on the web, with the same chances of being discovered and accessed as data found in other documents.
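
The two spatial coherence observations above can be illustrated with a small sketch. It is not the project's method, only one possible reading of the idea under stated assumptions: the gazetteer, its type labels, the 0.6 threshold, the function names, and the sample rows are all hypothetical and made up for illustration. Column coherence is approximated by checking that most cells in a column resolve to the same spatial type, and row coherence by checking how often the value in one column contains the value in another.

```python
from collections import Counter

# Hypothetical toy gazetteer: name -> (spatial type, containing place or None).
# A real system would consult a full gazetteer and handle ambiguous names.
GAZETTEER = {
    "USA": ("country", None),
    "Maryland": ("state", "USA"),
    "Virginia": ("state", "USA"),
    "College Park": ("city", "Maryland"),
    "Baltimore": ("city", "Maryland"),
    "Richmond": ("city", "Virginia"),
}


def column_type(values, threshold=0.6):
    """Return the dominant spatial type of a column, or None.

    The threshold (an assumed value) is the minimum fraction of all cells
    in the column that must share the dominant type.
    """
    types = [GAZETTEER[v][0] for v in values if v in GAZETTEER]
    if not types:
        return None
    best, count = Counter(types).most_common(1)[0]
    return best if count / len(values) >= threshold else None


def spatial_columns(rows):
    """Map column index -> spatial type for columns that look spatial."""
    result = {}
    for i, col in enumerate(zip(*rows)):
        t = column_type(col)
        if t is not None:
            result[i] = t
    return result


def contains(outer, inner):
    """True if `outer` is an ancestor of `inner` in the toy hierarchy."""
    parent = GAZETTEER.get(inner, (None, None))[1]
    while parent is not None:
        if parent == outer:
            return True
        parent = GAZETTEER[parent][1]
    return False


def row_containment_score(rows, outer_col, inner_col):
    """Fraction of rows where the outer_col value contains the inner_col value."""
    hits = sum(contains(r[outer_col], r[inner_col]) for r in rows)
    return hits / len(rows)


# Made-up sample rows: two textual spatial columns and one non-spatial column.
rows = [
    ("College Park", "Maryland", "12"),
    ("Baltimore", "Maryland", "7"),
    ("Richmond", "Virginia", "5"),
]

cols = spatial_columns(rows)
state_contains_city = row_containment_score(rows, outer_col=1, inner_col=0)
city_contains_state = row_containment_score(rows, outer_col=0, inner_col=1)
```

In this toy example, columns 0 and 1 are recognized as spatial, and the containment score is high only in the state-contains-city direction, which is how a containment hierarchy could help tell apart several spatial attributes in the same spreadsheet.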

Reports describing results of this and related research will be available at http://www.cs.umd.edu/~hjs/spreadsheets.html

Principal Investigators