6.2.1 Overview and Aims
Semantic Level 1 is predominantly intended to enhance the consistency of local data, that is, to assist analysis that does not require the consideration of other datasets. It aims to achieve two important goals:
1. Increase the internal consistency of the dataset,
2. Standardize aspects of the dataset structure to facilitate further semantic processing.
More specifically, it will:
• Ensure that only one term is used for a given concept,
• Standardize the structure by which Context dates are recorded (but not the dates themselves),
• Ensure that only one ‘Find’ is described per table row,
• Prepare the data for Semantic Level 2.
The primary benefits are:
• Quick visualization of a single dataset using visualization software and services such as spreadsheet graphs or Many Eyes,9
• Easier comparison with datasets that have also been processed in this way.
This Semantic Level makes use of spreadsheet software with which most archaeologists are already likely to be familiar. Microsoft Excel, OpenOffice Calc and Google Spreadsheets are all viable tools for the recipes in Semantic Level 1. None of the recipes should be difficult for those acquainted with spreadsheets, but the amount of time required will largely depend on the complexity of the original data. A typical timeframe would be 1–2 hours from start to finish.
9 http://www-958.ibm.com/software/data/cognos/manyeyes/
6.2.2 Recipes
Recipe 1: Create a Copy
None of the recipes are intended to change the original data, and recipes should only be performed on a copy of the dataset. This recipe creates a copy of the data and uploads it to a specific directory of the PBworks website, which has an extremely simple interface for this function. The original data is typically, although not always, in the form of a spreadsheet. In cases where the data is held in a database, it is recommended that it be exported as a CSV file before being imported into a spreadsheet package. The user is also cautioned to check for common problems such as missing column names, character encoding issues (especially for accented or non-Latin letters) and the conversion of decimal numbers to integers (or vice versa). Naturally it is impossible to provide concrete guidance for the wide variety of formats in which the data may originate, so the main intention is to raise awareness of such problems and, if they are encountered, to direct the user to the appropriate software documentation.
Before uploading, the user is encouraged to use an appropriate naming convention for the file. Note that additional metadata is not incorporated here, as the goal is ultimately to encourage users to create self-describing data as efficiently as possible. Extensive discussion of what to include, and under what circumstances, would inevitably add considerable complexity from the outset. However, users are expected to follow appropriate metadata procedures for both their original dataset and any outputs they intend to make public.
Recipe 2: Table Axes
In order to have a common structure for processing it is necessary to ensure that the axes of the spreadsheet are such that finds — the fundamental entity to which other information will be attached — are associated with rows rather than columns. Finds, discussed in Section 4.4.1, are the smallest unit which it makes sense to compare across sites. In order to mitigate the inevitable ambiguity of using common archaeological terms across different disciplinary traditions, the following definitions are also provided:
Find: The aggregation of all fragments of one amphora Class within one Context
Class: The combination of an Amphora’s Form and Fabric
Form: The physical shape of an amphora, specified by category in a typology system
Fabric: The geographic origin of the material with which an amphora is composed
Context: A unit of an archaeological Excavation
Excavation: A season of archaeological excavation at a particular site
The subsequent recipes require one row in the spreadsheet for each Find or Context and its associated information (amphora Form, Fabric, Context, dating, location and so on). Some datasets, especially summary tables, list the amphora Forms as rows and the Contexts as columns for concise viewing. If this is the case with the user’s data then the tables need to be transposed (i.e. their axes need to be swapped) by following this recipe. Sometimes a category label will appear only in the first row of a group and be left implicit in the rows that follow. If this is the case, users are also instructed to fill in such rows explicitly.
Example Input10
form       context 1  context 2  context 3
Keay 5     5          3
Dressel 1             1          7
LRA 3      2
Example Output
context    keay 5  dressel 1  lra 3
Context 1  5                  2
Context 2  3       1
Context 3          7
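The transposition itself can also be sketched outside a spreadsheet. The following is a minimal Python illustration, assuming the table has already been read into a rectangular list of rows with empty strings for blank cells; the variable and function names are hypothetical and not part of the recipe.

```python
# Hypothetical summary table: Forms as rows, Contexts as columns ("" = blank cell).
table = [
    ["form", "context 1", "context 2", "context 3"],
    ["Keay 5", "5", "3", ""],
    ["Dressel 1", "", "1", "7"],
    ["LRA 3", "2", "", ""],
]


def transpose(rows):
    """Swap the axes of a rectangular table given as a list of rows."""
    return [list(column) for column in zip(*rows)]


transposed = transpose(table)
# The first column now lists Contexts rather than Forms, so relabel it.
transposed[0][0] = "context"

for row in transposed:
    print(row)
```

Note that `zip` silently truncates ragged rows, so the sketch assumes every row has the same number of cells.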
Recipe 3: Filter Undesired Content
As the amount of time needed to complete each Semantic Level depends heavily on the amount of variation in the data, rather than the amount of content, it is helpful to remove irrelevant material as early as possible. The nature of such extraneous content is of course impossible to predict but is typically Find material that is not a subset of the data category of interest. In the case of Roman Port Networks this might be Finds other than amphorae. It is important to re-emphasise that the goal here is not to semantically describe the complete original dataset, but only those parts which are relevant to the specific research agenda. This recipe requires the user to filter and delete rows appropriately (or columns in cases where Find classifications are separated this way). Summary data, such as count totals, is also removed.
10 In the interests of simplicity, input and output examples only show information specifically relevant to the recipe under discussion.
Example Input
context  category  type     count
1        amphora   Ostia V  3
1        fineware  ARS      6
2        amphora   Keay 20  2
Example Output
context  category  type     count
1        amphora   Ostia V  3
2        amphora   Keay 20  2
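As a rough illustration of the filtering step, the sketch below keeps only the rows whose category matches the material of interest. It assumes the rows are held as Python dictionaries keyed by column name; `keep_category` is a hypothetical helper, not part of the recipe.

```python
# Rows from the example input, one dictionary per spreadsheet row.
rows = [
    {"context": "1", "category": "amphora", "type": "Ostia V", "count": 3},
    {"context": "1", "category": "fineware", "type": "ARS", "count": 6},
    {"context": "2", "category": "amphora", "type": "Keay 20", "count": 2},
]


def keep_category(rows, wanted):
    """Return only the rows whose category matches `wanted` (case-insensitive)."""
    wanted = wanted.strip().lower()
    return [row for row in rows if row["category"].strip().lower() == wanted]


amphorae = keep_category(rows, "amphora")
for row in amphorae:
    print(row)
```

The original list is left untouched, in keeping with the principle that recipes work on a copy.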
Recipe 4: Add and Normalize Context Dates
Time is a longstanding issue of debate in archaeological database management and there is a great deal of variation in the way that Context dating is recorded. Frequently an approximate era will be given, or a terminus post and ante quem. In order to facilitate the production of time-series graphs and broad comparisons with other data, it is necessary to produce year-based dates for each Context. The only way to move from an imprecise dating term (such as ‘third century’) to a specific year without introducing inaccuracy is to extract the widest date range possible. The two essential dates are:
Terminus post quem (required where available): The earliest date at which a Context’s creation began.
Terminus ante quem (required where available): The latest date at which a Context’s creation ended.
If it is important to capture the original level of precision the following dates may also be included:
Inner terminus post quem (optional): The latest date at which a Context’s creation began.
Inner terminus ante quem (optional): The earliest date at which a Context’s creation ended.
It is rarely possible, however, to perform meaningful analyses over large numbers of Contexts in this way, so the external temporal bounds are, practically speaking, the most useful even though they are the most vague. An important issue highlighted here is that a period date (e.g. 2nd century CE) will give different values depending on whether it is being considered as a terminus post quem (101 CE) or terminus ante quem (200 CE).
No instruction is given as to the precise definitions of period terms (‘Late Middle Bronze Age’, ‘mid-sixth century’, etc.). This is because the user will inevitably be in the best position to judge the definition of those phrases in their own context. Imposing arbitrary generic definitions neither clarifies the original data compiler’s intention nor avoids ambiguity during analysis.
Unfortunately there are few time visualization technologies with good support for BCE dates, but by recording these as negative integers it is possible to plot them adequately on a graph. Using integers also effectively converts years into point dates rather than periods. This is (almost) always sufficient for archaeological use. Date formats, in contrast, tend to add complexity in how they are recorded and interpreted because they must ultimately be reduced to a specific calendar-independent moment in time. This high level of granularity is seldom of use to archaeologists when comparing inter-site distributions over large time periods.
Example Input
context  earliest      latest
1        1st C.        Late 2nd C.
2        Late 2nd BCE  Early 1st C.
Example Output
context  earliest      latest        tpq   inner tpq  inner taq  taq
1        1st C.        Late 2nd C.   1     100        150        200
2        Late 2nd BCE  Early 1st C.  -150  -100       1          50
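The widest-range principle can be sketched for the simplest case of whole centuries. The Python below is illustrative only: `century_bounds` and `outer_bounds` are hypothetical names, and the BCE convention (no year zero, so the 2nd century BCE runs from -200 to -101) is an assumption that a given dataset may round differently. Qualifiers such as 'Late' or 'Early' are deliberately not handled, since their definition is left to the user.

```python
def century_bounds(century, bce=False):
    """Return (first_year, last_year) of a whole century as integers.

    BCE years are negative. Assumes there is no year zero.
    """
    if century < 1:
        raise ValueError("century must be 1 or greater")
    first_year = (century - 1) * 100 + 1
    last_year = century * 100
    if bce:
        return -last_year, -first_year
    return first_year, last_year


def outer_bounds(earliest, latest):
    """Widest possible range: start of the earliest period, end of the latest.

    Each argument is a (century, bce) pair.
    """
    tpq = century_bounds(*earliest)[0]
    taq = century_bounds(*latest)[1]
    return tpq, taq


print(century_bounds(2))                      # 2nd century CE
print(century_bounds(2, bce=True))            # 2nd century BCE
print(outer_bounds((1, False), (2, False)))   # 1st C. to 2nd C.
```

For the 2nd century CE this gives 101 as terminus post quem and 200 as terminus ante quem, the values quoted in the text.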
Recipe 5: Make Each Row Equivalent to a Single Find
Although useful for visualization in summary tables, it is very difficult to combine tables in which Finds from different Classes of amphora are given in separate columns and thus referred to in the same row. Where this is the case, it is necessary to produce a new row for each Class of amphora. This is, unfortunately, a relatively convoluted process, involving a considerable amount of copying and pasting. Fortunately, summary tables of this nature tend to be comparatively small as they are generally laid out for visualization across a single page. At the end of this recipe, any rows with no content (such as that of Late Roman Amphora 3 in Context B in the example below) are deleted. Likewise, any border formatting that may have been present is removed.
Example Input
context  keay 5  lra 3
A        1       2
B        1
Example Output
context  count  form
A        1      Keay 5
A        2      LRA 3
B        1      Keay 5
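The copy-and-paste procedure amounts to what is often called unpivoting a table. A minimal Python sketch follows, assuming each spreadsheet row is a dictionary and that blank cells are represented as None or an empty string; the `unpivot` helper and the column-to-label mapping are hypothetical.

```python
# Wide table from the example input: one column per amphora Form.
wide = [
    {"context": "A", "keay 5": 1, "lra 3": 2},
    {"context": "B", "keay 5": 1, "lra 3": None},
]

# Maps each Form column to the label written into the new 'form' field.
form_columns = {"keay 5": "Keay 5", "lra 3": "LRA 3"}


def unpivot(rows, form_columns):
    """Produce one output row per (Context, Form) pair that has a count."""
    long_rows = []
    for row in rows:
        for column, form in form_columns.items():
            count = row.get(column)
            if count is None or count == "":
                continue  # rows with no content are dropped
            long_rows.append(
                {"context": row["context"], "count": count, "form": form}
            )
    return long_rows


long_table = unpivot(wide, form_columns)
for row in long_table:
    print(row)
```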
Recipe 6: Consolidate Counts
Raw archaeological data will often be recorded at the level of individual rims, handles, bases and sherds, or small assemblages of them. Although this can be very important for intra-site analysis, it is at too high a level of granularity for inter-site analysis. As a Find is the sum of all fragments of a certain ceramics Class, we only need totals.
Two recipes are given here, depending on the nature of the data. If sherd types are split into separate columns, and there is only a single row for each Find, the process is relatively straightforward and can be done using the SUM() function found in all spreadsheet packages. The second recipe, for cases in which multiple rows need to be consolidated for each Find, is a little more complicated and time consuming, and uses several spreadsheet functions. The essential principle of consolidating counts should nevertheless be relatively easy for the user to grasp. In either case, the archaeologist may alternatively wish to calculate a standard metric, such as Estimated Vessel Equivalent, and use that figure in place of the raw count.
Example Input I
context  rim  base  handle  sherd  form
A        1    3     0       5      LRA 3
Example Output I
context  count  form
A        9      LRA 3
Example Input II
context  form         fabric   fragment type  fragment count
A        Dressel 20   Baetica  handle         3
A        Dressel 20   Baetica  rim            1
B        Dressel 2-4  Gaul     handle         3
Example Output II
context  form         fabric   fragment count
A        Dressel 20   Baetica  4
B        Dressel 2-4  Gaul     3
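Both consolidation cases can be sketched in a few lines of Python. This is an illustration of the principle rather than the spreadsheet procedure itself; the variable names are hypothetical and the data is taken from the examples above.

```python
from collections import defaultdict

# Case I: fragment types in separate columns, one row per Find.
row = {"context": "A", "rim": 1, "base": 3, "handle": 0, "sherd": 5, "form": "LRA 3"}
fragment_columns = ["rim", "base", "handle", "sherd"]
total = sum(row[column] for column in fragment_columns)
consolidated_row = {"context": row["context"], "count": total, "form": row["form"]}
print(consolidated_row)

# Case II: several rows per Find, one per fragment type.
fragments = [
    ("A", "Dressel 20", "Baetica", "handle", 3),
    ("A", "Dressel 20", "Baetica", "rim", 1),
    ("B", "Dressel 2-4", "Gaul", "handle", 3),
]
totals = defaultdict(int)
for context, form, fabric, _fragment_type, count in fragments:
    # A Find is identified by its Context, Form and Fabric together.
    totals[(context, form, fabric)] += count

for (context, form, fabric), count in totals.items():
    print(context, form, fabric, count)
```

Grouping on the combination of Context, Form and Fabric follows the definition of a Find as all fragments of one Class within one Context.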
Recipe 7: Uncertainty
Uncertainty, like time, is another complex topic when it comes to digital representation. Modeling uncertainty is extremely difficult largely because there are so many kinds and degrees of it. Most frequent, however, are i) possibility: an indication that a value may or may not be correct, and ii) disjunction: an indication that an attribute has either one value or another (but not both).
The first is often expressed with a modal operator such as a question mark (‘?’) or adverb (‘perhaps’, ‘possibly’, etc.) added to the description. As these operators cannot be distinguished by the computer as being independent of the values themselves, it is necessary to record them separately. The simplest way to deal with uncertainty of this nature is to create a new ‘uncertain’ field which can be either true or false. This way we can choose to filter out uncertain Finds if we wish to. We must also remove the question mark from the description so that it will not remain distinct from other Finds of the same Class. This is by no means a perfect solution as by separating the uncertainty from a specific field it raises it to the level of the entire entity. For example, the semantics of the input may state that ‘3 amphorae were found in Context A. It is possible that they are of Form Keay 5’. The output, in contrast, must only be interpreted as: ‘It is possible that 3 amphorae of type Keay 5 were found in Context A.’ Nevertheless this greatly simplifies the filtering process later and is in line with our philosophy of prioritizing accuracy over precision.
Multiple possibilities (e.g. ‘Keay 13 or Dressel 20’) are in some ways more problematic. Representing them as separate statements would make it likely that they will be interpreted as two separate Finds unless complicated logical operators are introduced. Again, following the priority of accuracy over precision, the user may choose one of two options:
1. They may choose one of the alternatives for the value and mark it as uncertain (in the manner described above).
2. If the value is categorical they may change to ‘unidentified’ or a similarly ‘null’ value. In this case the Normalize Terms recipe (Recipe 9) should be followed.
The first option is generally preferable, as less information is lost, but psychologically the user may feel more comfortable giving no value at all than ‘hazarding a guess’.
Example Input
context form
A LRA 3?
A Keay 5
Example Output
context form uncertain
A LRA 3 TRUE
A Keay 5 FALSE
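Separating the possibility marker from the value can be sketched as follows. The set of markers (a question mark and the adverbs 'perhaps' and 'possibly') is taken from the examples in the text; any other markers in a real dataset would need to be added by the user, and `split_uncertainty` is a hypothetical helper name.

```python
import re

# Markers of possibility named in the text; extend as the dataset requires.
MARKERS = re.compile(r"\?|\b(?:perhaps|possibly)\b", re.IGNORECASE)


def split_uncertainty(value):
    """Return (cleaned value, uncertain flag) for one description."""
    uncertain = bool(MARKERS.search(value))
    cleaned = MARKERS.sub("", value)
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
    return cleaned, uncertain


for form in ["LRA 3?", "Keay 5", "possibly Keay 5"]:
    print(split_uncertainty(form))
```

Disjunctions such as 'Keay 13 or Dressel 20' are not handled here, because choosing between the two options above is a judgement for the user.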
Recipe 8: Separate Form and Fabric
In some datasets the description of Form and Fabric is recorded in the same column. These need to be separated into individual columns so that they can be interpreted separately. This recipe simply replicates the column; new terminology is established in the following recipe (‘Normalize Terms’).
Example Input
context  class
A        Dressel 2-4 (Gaul)
A        Dressel 20 (Baetica)
Example Output
context  form                  fabric
A        Dressel 2-4 (Gaul)    Dressel 2-4 (Gaul)
A        Dressel 20 (Baetica)  Dressel 20 (Baetica)
Recipe 9: Normalize Terms
This is perhaps the most fundamental recipe in Semantic Level 1. A great deal of legacy data, and raw data in particular, uses varying terms to describe the same concept, often without realising it. This can be caused by a range of factors which may or may not be apparent to the human eye. Examples include:
• Capitalization (‘beltrán 2’)
• Numerals (‘Beltrán II’)
• Abbreviation (‘Bel. 2’)
• Whitespace (‘Beltrán2’)
• Accents and character encodings (‘Beltran 2’)
• Typographical errors (‘Bletrán 2’)
• Qualifying information (‘Beltrán 2 (local)’)
Machine-readability, with or without URIs, is inherently based on symbol-matching, i.e. the assumption that identical strings of characters refer to the same value. It is therefore necessary to ensure that terms are normalized so that there is only one term for each concept within the dataset. Note that for Semantic Level 1 it does not matter what term is used as long as consistency is maintained. Depending on the complexity and ‘messiness’ of the data this may be easy to do within a spreadsheet package or not.
In simple cases the user is given instructions to sort the data based on each category field, checking to ensure that there are not multiple terms for the same concept. Single instances of terms are often a good indicator of spelling mistakes and similar anomalies. This process is also used to define relevant terms for Form and Fabric if these have been separated out in Recipe 8.
If the data seems particularly complex it is recommended that the user download and make use of the Google Refine package, which is explicitly intended for such work. Although alternative software packages exist, such as the Stanford DataWrangler,11 Refine will also be used for URI mapping to Freebase in Semantic Level 2, so the additional time spent learning the software here is not lost. Specific instructions are given for a number of common tasks, including: installation; creating a new Refine project; undoing and redoing actions; filtering and sorting; creating ‘Text Facets’; clustering terms together; removing whitespace and capitalization; editing multiple terms; exporting back to a spreadsheet. Refine is a relatively sophisticated tool, however, and providing a complete tutorial to the user would be beyond the scope of the Cookbook. The user is therefore also directed to helpful online documentation12 as well as the download site.13
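The idea of spotting variant spellings by reducing each term to a comparison key can be sketched in Python. This is a simplified illustration and not the algorithm used by Refine: the `fingerprint` function is hypothetical, and it only catches differences in capitalization, whitespace, punctuation and accents. Numeral variants, abbreviations and typographical errors still have to be resolved by eye.

```python
import unicodedata
from collections import defaultdict


def fingerprint(term):
    """Reduce a term to a key that ignores case, accents, spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch for ch in without_accents.lower() if ch.isalnum())


terms = ["Beltrán 2", "beltrán 2", "Beltrán2", "Beltran 2", "Beltrán II", "Bletrán 2"]

clusters = defaultdict(list)
for term in terms:
    clusters[fingerprint(term)].append(term)

# Keys with more than one distinct spelling are candidates for merging.
for key, variants in clusters.items():
    print(key, variants)
```

Terms that end up alone under their key, such as the misspelling here, correspond to the single instances that the text flags as likely anomalies.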
Recipe 10: Add Find ID
At this stage the data should now share a common structure in which each row is equivalent to a Find and there are not multiple local terms for the same concept. This recipe requires the user to make a final consistency check to make sure that no duplicate records have been created and adds a unique local Find identifier to each row.
Example Input
context  form
A        Dressel 20
A        Dressel 20
A        LRA 3
Example Output
find  context  form
1023  A        Dressel 20
1024  A        LRA 3
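The duplicate check and numbering can be sketched as below. The sketch assumes that rows identical in every field are duplicates, which the user should still confirm by hand; `add_find_ids` is a hypothetical helper, and the starting number 1023 is simply the one shown in the example.

```python
def add_find_ids(rows, first_id=1):
    """Drop exact duplicate rows and number the remainder sequentially."""
    seen = set()
    numbered = []
    next_id = first_id
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # identical to an earlier row, treated as a duplicate
        seen.add(key)
        numbered.append({"find": next_id, **row})
        next_id += 1
    return numbered


rows = [
    {"context": "A", "form": "Dressel 20"},
    {"context": "A", "form": "Dressel 20"},
    {"context": "A", "form": "LRA 3"},
]

finds = add_find_ids(rows, first_id=1023)
for find in finds:
    print(find)
```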
11 http://vis.stanford.edu/wrangler/
12 http://code.google.com/p/google-refine/wiki/GettingStarted
13 http://code.google.com/p/google-refine/downloads/list
Recipe 11: Clean Up and Upload
The final recipe checks that the data now conforms to Semantic Level 1 and provides instructions for uploading it to the appropriate directory on the PBworks server. The user should clear any extraneous formatting and ensure that there is only one row of headings. The following columns should remain, although the labels used for them may vary depending on the language and conventions of the user.
REQUIRED:
• find id (integer)
• context (charvar)
• terminus post quem (integer)
• terminus ante quem (integer)
• form (charvar)
• fabric (charvar)
ONE OR MORE OF:
• fragment count (integer)
• weight (integer)
• minimum number of individuals (integer)
• estimated vessel equivalence (integer)
OPTIONALLY:
• uncertain (boolean)
• inner terminus post quem (integer)
• inner terminus ante quem (integer)
The file is then saved using a filenaming template similar to that in Recipe 1 and uploaded to a directory of other datasets that conform to Semantic Level 1.
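A check of the column list can be sketched in Python. The column names below are the English labels from the list above; since the text allows labels to vary with the user's language and conventions, they would need adapting in practice. `check_level1` is a hypothetical helper and checks only the header row, not the cell values or their types.

```python
REQUIRED = {
    "find id",
    "context",
    "terminus post quem",
    "terminus ante quem",
    "form",
    "fabric",
}
QUANTITIES = {
    "fragment count",
    "weight",
    "minimum number of individuals",
    "estimated vessel equivalence",
}


def check_level1(header):
    """Return a list of problems found in a header row (empty if none)."""
    columns = {name.strip().lower() for name in header}
    problems = []
    missing = sorted(REQUIRED - columns)
    if missing:
        problems.append("missing required columns: " + ", ".join(missing))
    if not QUANTITIES & columns:
        problems.append("at least one quantity column is needed")
    return problems


header = ["find id", "context", "terminus post quem", "terminus ante quem",
          "form", "fabric", "fragment count", "uncertain"]
print(check_level1(header))
print(check_level1(["context", "form"]))
```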
6.2.3 Visualization and Analysis
On completion of Semantic Level 1 the data is now considerably easier to visualize. Combining it with other datasets that have been processed to this level is also easier, although still not possible to do automatically. Two approaches are suggested for visualization. The first is the creation of an accumulation graph using spreadsheet software. The second is the use of freely available online visualization toolkits such as Many Eyes.
Visualization and Analysis 1: Accumulation graph
Although the data in Semantic Level 1 is much more consistent in its layout and terminology, it still presents a number of challenges for graphing temporal trends. Most automated graphing techniques are poorly suited to visualizing irregular time intervals, especially those with uncertain dates. There are at least two problems to tackle:
1. Most graphs require a value for each amphora Class at each time interval.
2. Most graphs expect equal intervals between time points.
Figure 6.2: Example Accumulation Graph
The combination of these requirements makes temporal uncertainty very difficult to represent on most graphs. However, a linear scatterplot graph can provide a reasonable approximation to a time curve for multiple irregular series of values (Figure 6.2). With this method we can display both the terminus post and ante quem times for Contexts containing specific categories of amphora and project an approximate ‘time corridor’ for their deposition. This shows the accumulation of particular amphora Forms over time periods of increased or decreased deposition. By using normalized Context dates
and normalized weights and/or fragment counts, it is also possible to compare temporal distributions across different Excavations, even within the same graph. Caveats must be borne in mind however: First that this is of course just the deposition time — the