Within the simulation framework, profiles were represented as a collection of alleles determined to be present at each of the 15 autosomal loci contained in the AmpFℓSTR® Identifiler® Amplification Kit (Applied Biosystems, Foster City, CA): D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, and FGA [43]. Amelogenin, the gender-determining locus, was not considered since it is not a hypervariable locus providing discriminatory power.
The alleles observed and tabulated by Butler et al. [56] in their 2003 population study using Identifiler® were taken as the universe of realizable alleles for the purposes of simulating profiles. Butler et al. observed 59 distinct allele calls over all autosomal STR loci: 5, 6, 7, 8, 8.1, 9, 9.3, 10, 10.3, 11, 12, 12.2, 13, 13.2, 14, 14.2, 15, 15.2, 16, 16.2, 17, 17.2, 18, 18.2, 19, 19.2, 20, 21, 21.2, 22, 22.2, 22.3, 23, 23.2, 24, 24.2, 25, 25.2,
26, 27, 28, 29, 29.2, 30, 30.2, 31, 31.2, 32, 32.2, 33, 33.1, 33.2, 34, 34.2, 35, 36, 37, 38, 39. The collection of alleles observed at each particular locus, along with each allele’s associated subpopulation frequency among Caucasians, is diagrammed in Figure 1.
Figure 1: Dirac Delta Function Plots of Genetic Model Used in Simulating Profiles
The genetic model employed when simulating profiles was based on Butler et al.’s 2003 subpopulation study of Caucasians using the Identifiler® kit [56]. All observed alleles (i.e., the common alleles) at each of the 15 autosomal STR loci are represented by the integer topping each vertical line, and each allele’s associated subpopulation frequency is represented by the height of its respective vertical line.
Profiles of individuals were generated by randomly selecting two (not necessarily distinct) alleles for each of the 15 autosomal loci in the Identifiler® kit. For any given locus, a list of alleles was constructed that consisted of all alleles with non-zero
subpopulation frequencies with respect to the 302 Caucasians observed by Butler et al. Two alleles were selected at random from this locus-specific allele list according to the subpopulation frequencies observed by Butler et al. (and represented in Figure 1). For
instance, for the CSF1PO locus, the following alleles were observed (with the associated subpopulation frequencies in Caucasians): allele 8 (with a frequency of 0.00497); allele 9 (0.01159); allele 10 (0.21689); allele 11 (0.30132); allele 12 (0.36093); allele 13
(0.09603); and allele 14 (0.00828). The simulation therefore selects a 9 allele 1.159% of the time and a 12 allele 36.093% of the time. A genotype consisting of alleles 8 and 13 at the CSF1PO locus would be selected 2 × 0.00497 × 0.09603 = 0.0009545, or 0.09545% of the time,² whereas a homozygous CSF1PO genotype consisting of alleles 11 and 11 would be selected 0.30132 × 0.30132 = 0.09079, or 9.079% of the time. This random selection is performed for each of the 15 autosomal loci in the Identifiler® kit according to Butler et al.’s observed allele frequencies for each allele at each locus.
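The selection step and the two CSF1PO genotype probabilities quoted above can be checked with a minimal Python sketch. It uses only the frequencies given in the text; the function names, the fixed seed, and the use of Python rather than the study's MATLAB environment are assumptions made for illustration, not the original implementation.

```python
import random

# CSF1PO allele frequencies among Caucasians, as quoted in the text.
CSF1PO_FREQS = {
    "8": 0.00497, "9": 0.01159, "10": 0.21689, "11": 0.30132,
    "12": 0.36093, "13": 0.09603, "14": 0.00828,
}

def draw_genotype(freqs, rng):
    """Draw two alleles independently (with replacement), weighted by frequency."""
    alleles = list(freqs)
    weights = list(freqs.values())
    return tuple(rng.choices(alleles, weights=weights, k=2))

def genotype_probability(freqs, a, b):
    """Probability of the unordered genotype {a, b} under independent draws."""
    if a == b:
        return freqs[a] ** 2          # homozygote: p^2
    return 2 * freqs[a] * freqs[b]    # heterozygote: 2pq (either order)

rng = random.Random(0)  # hypothetical seed, used only for repeatability
genotype = draw_genotype(CSF1PO_FREQS, rng)

p_het = genotype_probability(CSF1PO_FREQS, "8", "13")   # about 0.00095 (0.095%)
p_hom = genotype_probability(CSF1PO_FREQS, "11", "11")  # about 0.0908 (9.08%)
```

Repeating `draw_genotype` once per locus, each with its own frequency table, would yield one complete simulated profile.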
Each individual profile, then, consists of a collection of two alleles at each of the fifteen loci. Recasting this information in the form of a matrix served the dual purpose of neatly summarizing an individual’s profile in a logical manner as well as setting up the data in a computationally efficient manner. Matrix representation of profiles leveraged MATLAB’s vectorized analysis environment to facilitate fast data manipulation in performing essential operations, such as “summing” individuals’ profiles to simulate mixtures and comparing two profiles for the presence of common alleles.
Thus, an amplification result for an individual was represented as a 59 × 15 matrix, with the 59 rows corresponding to the universe of possible alleles at all loci and the 15 columns corresponding to the loci themselves. (Row 1 corresponded to allele 5, and the rows proceeded in monotonically increasing fashion through row 59, which corresponded to allele 39. The loci order corresponded to the ordering listed above, with column 1 corresponding to the D8S1179 locus and column 15 corresponding to the FGA locus.)
² The factor of 2 is included in the product of heterozygous allele frequencies to account for the combinatorial fact that a genotype consisting of allele 8 and allele 13 (in that order) is equivalent to a genotype consisting of allele 13 and allele 8 (in that order).
The matrix entries represented relative allele prevalences for that profile. The relative presence of an allele allows for a simple model of allele expression—as either absent, present heterozygously, or present homozygously for single-source profiles while not taking into account signal intensity, the number of contributors, or relative contributor ratios for mixed profiles. For example, a given single-source profile matrix consisted of entries of relative prevalences of 0, 1, or 2. An entry of zero corresponded to the absence of that allele for that particular locus (e.g., a zero in the 8th row and 1st column indicates that the reference did not have a 10 allele at the D8S1179 locus); an entry of unity corresponded to the presence of a heterozygote allele at a particular locus (e.g., a one in the 42nd row and 2nd column indicates that the reference’s D21S11 locus is heterozygous and that one—and only one—of the two alleles possessed at this locus is a 29); an entry of 2 represents a homozygous allele at the specified locus (e.g., a two in the 11th row and 3rd column indicates that the reference has two 12 alleles at the D7S820 locus). The profiles of all individuals were assumed to have exactly two alleles at a given locus.
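The 59 × 15 encoding can be sketched in Python as follows. The allele and locus orderings are taken from the lists given earlier in this section; the helper name `profile_matrix`, the partial genotype, and the partner allele 30 at D21S11 are hypothetical and serve only to reproduce the row and column positions cited in the examples above.

```python
# Row order: the 59 allele calls listed in the text, in increasing order.
ALLELES = (
    "5 6 7 8 8.1 9 9.3 10 10.3 11 12 12.2 13 13.2 14 14.2 15 15.2 16 16.2 "
    "17 17.2 18 18.2 19 19.2 20 21 21.2 22 22.2 22.3 23 23.2 24 24.2 25 25.2 "
    "26 27 28 29 29.2 30 30.2 31 31.2 32 32.2 33 33.1 33.2 34 34.2 35 36 37 38 39"
).split()

# Column order: the 15 autosomal loci, in the order listed in the text.
LOCI = (
    "D8S1179 D21S11 D7S820 CSF1PO D3S1358 TH01 D13S317 D16S539 "
    "D2S1338 D19S433 vWA TPOX D18S51 D5S818 FGA"
).split()

def profile_matrix(genotypes):
    """Build a 59 x 15 matrix of relative allele prevalences (0, 1, or 2).

    `genotypes` maps a locus name to a pair of allele labels.
    """
    matrix = [[0] * len(LOCI) for _ in ALLELES]
    for locus, pair in genotypes.items():
        col = LOCI.index(locus)
        for allele in pair:
            matrix[ALLELES.index(allele)][col] += 1
    return matrix

# Hypothetical partial genotype (only two loci filled in, for illustration).
example = profile_matrix({
    "D21S11": ("29", "30"),   # heterozygous: one 29 allele
    "D7S820": ("12", "12"),   # homozygous: two 12 alleles
})

# Indices are zero-based here, so row 42 / column 2 in the text is [41][1].
entry_d21_29 = example[41][1]
entry_d7_12 = example[10][2]
entry_d8_10 = example[7][0]
```

With these inputs, `entry_d21_29` is 1, `entry_d7_12` is 2, and `entry_d8_10` is 0, matching the three cases described in the paragraph above.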
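For an individual profile I with 15 loci L, each consisting of two alleles $\alpha_{L,1}$ and $\alpha_{L,2}$, the complete profile is modeled by the sum shown in Equation 5.
\[
I = \sum_{L=1}^{15} \left( I_{\alpha_{L,1}} + I_{\alpha_{L,2}} \right) = \sum_{L=1}^{15} \sum_{A=1}^{2} I_{\alpha_{L,A}}
\]
Equation 5: Model of Individual Profile
$I_{\alpha_{L,A}}$ ≡ allele A contained at locus L for individual I
$I$ ≡ (complete) individual profile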
Figure 2 shows a graphic representation of an example single-source profile using the Dirac delta function.
Figure 2: Dirac Delta Function Representation of an Example Single-Source Profile
The alleles at each of the 15 autosomal loci are represented by the integer topping each vertical line, and each allele’s associated subpopulation frequency is represented by the height of its respective vertical line.
Figure 3: Graphical Matrix Representation of a Representative Single-Source Profile
The loci names on the x-axis have been abbreviated. The colors of the abbreviated loci names correspond to their respective fluorescent dye colors in the Identifiler® kit (except in the case of the black font, which corresponds to a dye color of yellow). Because of space limitations, only the first and last allele values are identified on the y-axis. In place of potentially illegible numbers, a relative allele prevalence of 0 is represented as white-space, while a red box indicates a relative prevalence of 1, and a blue box indicates a relative prevalence of 2.
Mixtures were generated by summing the matrices of a given number of contributors. A mixture of two people could have matrix entries corresponding to relative allele prevalence in the range 0 – 4 depending on the degree of allelic overlap between contributors. (For example, if two contributors were homozygous for the same allele at a particular locus, the resulting mixture matrix entry for that locus’s allele would be 4.)
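Summing contributor matrices is an element-wise operation, which the following Python sketch illustrates. To keep the example short it uses toy 3 × 2 matrices (three alleles, two loci) rather than the full 59 × 15 representation; the function names and the toy data are assumptions for illustration only.

```python
def mix(*profiles):
    """Element-wise sum of same-shaped prevalence matrices."""
    rows, cols = len(profiles[0]), len(profiles[0][0])
    return [
        [sum(p[r][c] for p in profiles) for c in range(cols)]
        for r in range(rows)
    ]

def shared_alleles(a, b):
    """(row, column) positions where both profiles carry the allele."""
    return [
        (r, c)
        for r in range(len(a))
        for c in range(len(a[0]))
        if a[r][c] > 0 and b[r][c] > 0
    ]

# Toy single-source profiles; every column sums to 2 (two alleles per locus).
person1 = [[2, 0],
           [0, 1],
           [0, 1]]
person2 = [[2, 1],
           [0, 1],
           [0, 0]]

mixture = mix(person1, person2)
overlap = shared_alleles(person1, person2)
```

Here `mixture` is `[[4, 1], [0, 2], [0, 1]]`: the entry of 4 arises because both toy contributors are homozygous for the same allele at the first locus, and each column of the mixture sums to 4, the total number of contributed alleles per locus.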
Therefore, in general, for a mixture $M_1$ created by contributions from two individuals $I_1$ and $I_2$ with 15 loci L, each consisting of two alleles $\alpha_{L,1}$ and $\alpha_{L,2}$, the resulting alleles expressed at each locus in the mixture profile are modeled by the simple sum shown in Equation 6.
\[
M_1 = I_1 + I_2 = \sum_{L=1}^{15} \left( I_{S_1,\alpha_{L}} + I_{S_2,\alpha_{L}} \right) = \sum_{c=1}^{2} \sum_{L=1}^{15} \sum_{A=1}^{2} I_{S_c,\alpha_{L,A}}
\]
Equation 6: Model of a Two-Person Mixture Profile
$\alpha_{L,A}$ ≡ allele A contained at locus L
$S_c$ ≡ single-source profile for individual c
$M_1$ ≡ (complete) profile for mixture 1
Figure 4 and Figure 5 show graphic representations of an example mixture of Person 1 and Person 2’s profiles as a collection of Dirac delta function plots and as a matrix, respectively.
Figure 4: Dirac Delta Function Representation of Example Mixture Profile from Person 1 and Person 2
The alleles at each of the 15 autosomal loci are represented by the integer topping each vertical line, and each allele’s associated subpopulation frequency is represented by the height of its respective vertical line.
Figure 5: Graphical Matrix Representation of Example Mixture Profile from Person 1 and Person 2
The loci names on the x-axis have been abbreviated. The colors of the abbreviated loci names correspond to their respective fluorescent dye color in the Identifiler® kit (except in the case of the black font, which corresponds to a dye color of yellow). The first and last common alleles are identified on the y-axis, where a relative allele prevalence of 0 is represented as white-space; a red box indicates a relative prevalence of 1; a blue box indicates a relative prevalence of 2; a green box indicates a relative prevalence of 3; and a magenta box indicates a relative prevalence of 4.
In this scenario, allele detection has effectively been reduced to a binary system such that each allele is deterministically either present or absent. This ultimately manifests itself as a relative prevalence number instead of a peak height or area. Modeling different
mixture ratios between contributors would be required to fully encompass the potential effects of allelic drop-out. Since the drop-out model employed in this study incorporates consideration of relative allele prevalence, all results assume a 1:1 mixture ratio.
2.2.2.1 Modeling Allele Drop-out
If no drop-out is assumed, summation of the single-source matrices would result in a mixture profile with all contributed alleles detected; this was considered a “pristine mixture” profile and is akin to instances in casework in which the mixture proportion ratio is 1:1 with a total DNA mass input of greater than 0.5 ng into the amplification process [57]. To account for instances where lower targets of DNA are amplified, allele drop-out needed to be modeled. To accomplish this, pristine mixtures were perturbed for varying proportions of drop-out for a heterozygous allele from 0 to 1 in increments of 0.1. Here, a drop-out level of 0 means that all of the alleles were detected; in this case, the “perturbed mixture” is identical to the “pristine mixture.” A non-zero level of drop-out corresponded to the proportion of time a heterozygously-present allele (i.e., an allele with a relative prevalence of 1) was not detected. For example, for a single-source sample with a drop-out proportion of 0.1, each allele with a relative prevalence of 1 stood a 10% chance of not being detected. Random numbers drawn separately for each allele at every locus according to the specified proportion of drop-out determined whether a given allele actually dropped out.
For alleles within a profile that were contributed multiple times (i.e., had a relative prevalence greater than unity), whether because an individual contributor was homozygous at that locus or because contributors shared the allele, the increased prevalence diminished the probability that the allele would drop out. Therefore, an allele that is twice as prevalent in a mixture is half as likely to drop out completely, while an allele that is four times as prevalent is one-quarter as likely to drop out.
Thus, for a particular mixture allele α at a particular locus L with a relative prevalence φ, the probability of drop-out for that mixture allele, $\Pr(D_{\varphi})_{\alpha_{L,A}}$, given a specified probability of drop-out for a heterozygous allele, $\Pr(D_{\varphi=1})$, is given by Equation 7.
\[
\Pr(D_{\varphi})_{\alpha_{L,A}} = \frac{\Pr(D_{\varphi=1})}{\varphi}
\]
Equation 7: Probability of Allele Drop-out
$\alpha_{L,A}$ ≡ allele A at locus L
$\varphi$ ≡ relative prevalence of allele (e.g., 0, 1, 2, 3, 4)
$\Pr(D_{\varphi=1})$ ≡ specified probability of drop-out for a heterozygous allele
$\Pr(D_{\varphi})_{\alpha_{L,A}}$ ≡ realized probability of drop-out for allele A at locus L
Whether an allele actually dropped out or remained observable was determined through a random number draw weighted with the appropriate probability of drop-out $\Pr(D_{\varphi})_{\alpha_{L,A}}$.
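The perturbation procedure can be sketched in Python under a few stated assumptions. The scaling follows the description above (an allele twice as prevalent is half as likely to drop out); representing a dropped-out allele by setting its matrix entry to 0, the toy 3 × 2 mixture, the function names, and the fixed seed are all hypothetical choices for illustration rather than details confirmed by the study.

```python
import random

def dropout_probability(prevalence, p_het):
    """Heterozygous drop-out probability scaled by 1 / relative prevalence."""
    if prevalence < 1:
        raise ValueError("only alleles that are present can drop out")
    return p_het / prevalence

def perturb(mixture, p_het, rng):
    """Return a copy of `mixture` with dropped-out alleles set to 0 (assumed)."""
    perturbed = []
    for row in mixture:
        new_row = []
        for prevalence in row:
            # Absent alleles (prevalence 0) are skipped; one draw per present allele.
            dropped = (
                prevalence > 0
                and rng.random() < dropout_probability(prevalence, p_het)
            )
            new_row.append(0 if dropped else prevalence)
        perturbed.append(new_row)
    return perturbed

# Toy pristine mixture (3 alleles x 2 loci) instead of the full 59 x 15 matrix.
pristine = [[4, 1],
            [0, 2],
            [0, 1]]

levels = [i / 10 for i in range(11)]   # drop-out levels 0, 0.1, ..., 1
rng = random.Random(1)                 # hypothetical seed for repeatability
perturbed_by_level = {level: perturb(pristine, level, rng) for level in levels}
```

At level 0 the perturbed mixture equals the pristine mixture, and at level 1 every allele with a relative prevalence of 1 is removed, while alleles with higher prevalence survive with correspondingly higher probability.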
2.2.3 Generating Populations for Comparison