FAQ

This page attempts to answer common questions about our model and methodology that we hear from reviewers and after presentations.

What is the Texture Tiling Model?
What are "mongrels"?
How do we know we have the right image statistics?
How is the Texture Tiling Model different from Portilla & Simoncelli (2000)?
Would any set of image statistics work?
Is the Texture Tiling Model the same as that used by Freeman & Simoncelli (2011)?
How are the "summary statistics" of the Texture Tiling Model different from "ensemble statistics"?
What is the logic behind the experiments with mongrels?
How can you say you have a predictive model when you have a human observer in the loop?
Why don't you use [insert favorite methodology] instead?
What hypotheses underlie your work on the Texture Tiling Model?
What assumptions underlie the work of Freeman & Simoncelli (2011)?
What is Freeman & Simoncelli's methodology? What are the risks associated with it?

What is the Texture Tiling Model?

The Texture Tiling Model is a model of representation in early vision, which we initially developed in an attempt to explain visual crowding. The model hypothesizes that the visual system uses a compressed representation in terms of a rich set of summary statistics, computed over local pooling regions. This representation may provide the visual system with a general-purpose strategy for getting as much information as possible through a limited-capacity "bottleneck".

In particular, we have proposed, as an intelligent guess, that the visual system might measure: the marginal distribution of luminance; luminance autocorrelation; correlations of the magnitude of responses of oriented V1-like wavelets across differences in orientation, neighboring positions, and scale; and phase correlation across scale. This perhaps sounds complicated, but it really is not; computing a given second-order correlation merely requires taking the responses of a pair of V1-like filters, pointwise multiplying them, and averaging over a “pooling region”. These statistics were previously proposed as important for capturing the appearance of textures, for a texture synthesis algorithm (Portilla & Simoncelli, 2000).
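To make this concrete, here is a minimal sketch of one such correlation, in Python with NumPy and SciPy. The Gabor kernel, its size and wavelength, the function names, and the boolean pooling mask are illustrative placeholders for this sketch, not the model's actual filters or pooling regions (which follow Portilla & Simoncelli, 2000).

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(size, wavelength, orientation):
        # A simple odd-phase Gabor, standing in for a V1-like oriented filter.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        rotated_x = x * np.cos(orientation) + y * np.sin(orientation)
        envelope = np.exp(-(x**2 + y**2) / (2.0 * (size / 4.0)**2))
        return envelope * np.sin(2.0 * np.pi * rotated_x / wavelength)

    def second_order_correlation(image, orientation_a, orientation_b, pooling_mask):
        # Magnitude responses of a pair of V1-like oriented filters.
        resp_a = np.abs(convolve2d(image, gabor_kernel(15, 6.0, orientation_a), mode='same'))
        resp_b = np.abs(convolve2d(image, gabor_kernel(15, 6.0, orientation_b), mode='same'))
        # Pointwise multiply the two responses, then average over the pooling region.
        return (resp_a * resp_b)[pooling_mask].mean()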

Pooling regions are presumed to be elongated radially, to overlap, to tile the visual field, and to grow approximately linearly with distance from the point of fixation.
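As a rough illustration of the linear-growth assumption, the sketch below lays out pooling regions along one radial line. The scaling constant, minimum width, and 50% overlap are made-up placeholders for the sketch, not fitted parameters of the model.

    def pooling_regions_along_radius(max_eccentricity_deg, scaling=0.5, min_width_deg=0.5):
        # Region width grows roughly linearly with eccentricity: width ~ scaling * eccentricity.
        regions = []
        ecc = min_width_deg  # start just outside the fovea
        while ecc < max_eccentricity_deg:
            width = max(scaling * ecc, min_width_deg)
            regions.append((ecc, width))
            ecc += width / 2.0  # step by half a width, so neighboring regions overlap
        return regions

    # Example: region centers and widths out to 20 degrees of eccentricity.
    for center, width in pooling_regions_along_radius(20.0):
        print(f"center {center:5.2f} deg, width {width:4.2f} deg")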

Our work is agnostic about whether the proposed computations occur in a single brain region or level of processing.

Back to top

What are "mongrels"?

A mongrel is an image synthesized to have the same summary statistics as a given original stimulus. We use the term both for syntheses which match the summary statistics of only a single local patch (Balas, Nakano, & Rosenholtz, 2009), and for "full-field" mongrels, which iteratively apply constraints from local summary statistics measured over local "pooling regions," which overlap, grow with eccentricity, and tile the visual field (Rosenholtz, 2011). Any given stimulus lends itself to many mongrels, i.e. many images that share the same image statistics.
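To give a flavor of the synthesis, here is a toy sketch in which a noise image is iteratively coerced to share just the per-region mean and variance of an original. The actual mongrel syntheses impose the full statistic set of Portilla & Simoncelli (2000) with more sophisticated projection steps, so the function name, the statistics, and the update rule here are deliberate simplifications.

    import numpy as np

    def toy_mongrel(original, pooling_masks, n_iters=20, seed=None):
        # Start from noise and repeatedly impose per-region constraints (here just
        # mean and standard deviation) measured from the original image.
        rng = np.random.default_rng(seed)
        image = rng.standard_normal(original.shape)
        for _ in range(n_iters):
            for mask in pooling_masks:  # overlapping boolean masks, applied in turn
                target_mean = original[mask].mean()
                target_std = original[mask].std()
                patch = image[mask]
                patch = (patch - patch.mean()) / (patch.std() + 1e-8)
                image[mask] = patch * target_std + target_mean
        return image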

Back to top

How do we know we have the right image statistics?

While additional work is required to pin down the right statistical measurements, our present set provides a good initial guess. First, they certainly seem quite plausible as a visual system representation. Early stages of standard feed-forward models of object recognition typically measure responses of oriented, V1-like feature detectors, as does our model. They then build up progressively more complex features by looking for co-occurrence of simple structures over a small pooling region (Fukushima, 1980; Riesenhuber & Poggio, 1999). These co-occurrences, computed over a larger pooling region, can approximate the correlations computed by our model.

Second, our summary statistics appear to be quite close to sufficient. Balas (2006) showed that observers are barely above chance at parafoveal discrimination between a grayscale texture synthesized with this set of statistics and an original patch of texture. More recent results have shown a similar sufficiency of these summary statistics for capturing the appearance of real scenes. Freeman & Simoncelli (2011) synthesized full-field versions of natural scenes. These syntheses were generated to satisfy constraints based on local summary statistics in regions that tile the visual field and grow linearly with eccentricity. When viewing at the appropriate fixation point, observers had great difficulty discriminating real from synthetic scenes. That the proposed statistics are close to sufficient for capturing both texture and scene appearance is impressive; much information has been thrown away, and yet observers have difficulty telling the difference between an original image and a noise image coerced to have the same statistics.

Finally, significant subsets of the proposed summary statistics are also necessary. If a subset of statistics is necessary, then textures synthesized without that subset should be easily distinguishable from the original texture. Balas (2006) showed that observers become much better at parafoveal discrimination between real and synthesized textures when the syntheses do not make use of either the marginal statistics of luminance or the correlations of the magnitudes of responses of V1-like oriented filters.

Back to top

How is the Texture Tiling Model different from Portilla & Simoncelli (2000)?

The Texture Tiling Model is a model of early visual processing. Portilla & Simoncelli (2000) is a model of "visual texture", which one might take to mean "representation of texture in vision." In addition, Portilla & Simoncelli provided a very good texture synthesis algorithm for testing this model. Portilla & Simoncelli was not presented as a model of vision more generally.

Back to top

Would any set of image statistics work?

Ultimately, no. At present, it may be true that the limited range of stimuli and tasks which have been used for the behavioral study of vision is not sufficient to finely discriminate between one set of statistics and another. One can perhaps go a long way toward explaining a number of phenomena simply with a model that makes a rich set of measurements which are summary-statistical in nature, i.e. which measure important details at the expense of a loss of location information. Nonetheless, it is promising that thus far we have been able to predict performance quite accurately on crowded letter recognition (Balas, Nakano, & Rosenholtz, 2009) and visual search (Rosenholtz et al., in submission) with the given set of statistics.

Back to top

Is the Texture Tiling Model the same as that used by Freeman & Simoncelli (2011)?

Essentially, yes, the model used by Freeman & Simoncelli (2011) is the same as the one we proposed in Balas, Nakano, & Rosenholtz (2009) and Rosenholtz (2011). The statistics they compute within each pooling region are the same as the ones we compute. The two models may differ in some parameter settings; for instance, their pattern of pooling regions differs somewhat from ours. Their procedure for generating what we call full-field mongrels and they call "metamers" is similar but not the same.

Back to top

How are the "summary statistics" of the Texture Tiling Model different from "ensemble statistics"?

Work on ensemble and summary statistic models of vision is similar in that both hypothesize that the visual system might use a compressed, statistical representation, perhaps in order to deal with an information bottleneck in visual processing.

However, as the term "ensemble statistics" is typically used, it refers to statistics of the features of a number of items. Our summary statistics, by contrast, are computed over the image, and require no prior segmentation into "items". Furthermore, the Texture Tiling Model is sufficiently specified that it can make testable predictions for arbitrary images and tasks, whereas the "ensemble statistics" have yet to be fully specified, and are hampered from making general predictions by their reliance on computing statistics of things rather than of measurements on the image.

Back to top

What is the logic behind the experiments with mongrels?

Work in the Rosenholtz lab focuses on using the Texture Tiling Model to predict visual task performance. The logic is as follows: we can use synthesis techniques to generate a number of mongrels which share the same local summary statistics as the original. The model cannot tell these mongrels apart from the original, nor from each other. If these images are indistinguishable, how hard would a given task be?

By synthesizing images which are confusable, according to the model, we can generate testable model predictions for a wide range of tasks. Most powerfully, we can predict performance on higher-level visual tasks without needing a model of higher-level vision. We don’t need to build a top-lit vs. side-lit cube discriminator to tell from the corresponding mongrels that our model predicts this task will be easy. In practice, we ask subjects to view a number of synthesized images, and measure their task performance with those mongrels as a measure of the informativeness of the summary statistics for a given task. (Balas, Nakano, & Rosenholtz, 2009, gives details.)

Back to top

How can you say you have a predictive model when you have a human observer in the loop?

Most behavioral models (at least initially) relate the results of one set of experiments to the results of another set of experiments. Measure a bunch of contrast sensitivity thresholds and see if your model can relate those to the visibility of a more complex pattern. Measure the results of some simple search experiments, and see if your model can predict on that basis where people look in natural images. A difference, of course, is that one only has to run a few contrast sensitivity experiments, then potentially predict the results of many visibility experiments, whereas we continue to run a new mongrel experiment for each high-level task. But this is because low-level vision is relatively simple and well understood, being well modeled by some linear filtering plus a few simple non-linearities, whereas higher-level vision is complex and barely understood at all.

Modeling the effects of low-level vision, and sticking a human in the loop to do the high-level vision, has precedents as well, for example: (a) Want to know if acuity loss is responsible for some percept? Apply a blur to the image which mimics the effects of loss of contrast sensitivity, look at the results, and see if loss of information explains the effect of interest. (b) Want to know if center-surround filtering explains brightness contrast illusions? Apply the center-surround filtering, and have a person examine the results. (c) Does your grouping algorithm work? Visualize the groups the algorithm finds, and have someone look at them. (In many cases researchers do "demo psychophysics" rather than a real experiment, but formally or not there is a human in the loop.)

Back to top

Why don't you use [insert favorite methodology] instead?

...ask observers to discriminate mongrels from originals in the periphery?
It is true that such an experiment would be a strong test of whether our model is exactly correct (and our synthesis procedures perfect). However, this experimental methodology does not degrade gracefully.

Difficulty discriminating between original and mongrel in the periphery is unlikely to be a good measure of how good a model is:

  1. Models which throw away too little information can lead to just as good a discriminability as models which throw away the right amount. Peripheral vision will throw away any extra information the model has preserved. (In the extreme, if our model is "peripheral vision is perfect," the "mongrel" will just be the original image. Peripheral mongrel-original discriminability will not disprove this obviously incorrect model.)
  2. Making a small error in selecting the model statistics may well yield discriminability from the original as good as that produced by statistics which are quite wrong. Pretty much any change to the kind of statistics computed will lead to changes that are many JNDs above visibility threshold.

In our methodology, we relate performance on a range of tasks to difficulty discriminating task-relevant mongrels in central vision. Our measure of the goodness of our model is the degree to which mongrel discriminability is predictive of task performance. For a rich enough set of tasks, this methodology likely degrades more gracefully: if the stats are a little wrong, one won't do as good a job of predicting task performance; if the stats are very wrong, one will do a much worse job of predicting task performance.
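Concretely, this analysis amounts to asking how well performance with mongrels viewed in central vision predicts performance on the original task in the periphery, across conditions. The sketch below shows the comparison; the condition accuracies are placeholders for illustration, not real data.

    import numpy as np

    # Per-condition accuracies: judging the task-relevant property from mongrels
    # (viewed foveally), and actual task performance with the original stimuli
    # viewed in the periphery. Values are illustrative only.
    mongrel_accuracy = np.array([0.92, 0.81, 0.70, 0.55])
    peripheral_accuracy = np.array([0.89, 0.78, 0.66, 0.58])

    # The goodness of the model is the degree to which mongrel performance
    # predicts peripheral task performance.
    r = np.corrcoef(mongrel_accuracy, peripheral_accuracy)[0, 1]
    print(f"mongrel vs. peripheral-task correlation: r = {r:.2f}")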

...do mongrel discrimination in the periphery instead of in the fovea?
This methodology suffers from a similar issue to that above: it does not distinguish between getting the statistics right and not throwing away enough information. However, otherwise it likely degrades more gracefully than the above suggestion.

...replace the original stimuli with mongrelized versions, and see if it affects performance?
This would make a nice demo, but it has issues similar to those listed above.

...just run a machine classifier on the summary statistic vectors?
With approximately 1000-dimensional vectors of statistics, pretty much everything is highly discriminable by a machine classifier. The visual system, of course, has measurement noise, which is why task performance isn't perfect. We could add noise to the vectors, and see how discriminable they are. But then we would essentially be fitting 1000 noise parameters with (at this stage) very few data points. It would ultimately be great if we could take the human out of the loop, but we don’t feel it’s well-founded at this point.
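For concreteness, the sort of analysis described above might look like the sketch below: add Gaussian measurement noise to the statistic vectors and ask how well a classifier separates them. The logistic-regression classifier, the function name, and the single shared noise level are assumptions of the sketch; the noise level is exactly the kind of free parameter (one per dimension, in the general case) that we would rather not fit with so few data points.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def noisy_discriminability(stats_a, stats_b, noise_sd, seed=None):
        # stats_a, stats_b: arrays of summary-statistic vectors (n_samples x ~1000).
        # With noise_sd = 0, nearly any two stimuli are separable.
        rng = np.random.default_rng(seed)
        X = np.vstack([stats_a, stats_b])
        X = X + rng.normal(scale=noise_sd, size=X.shape)
        y = np.concatenate([np.zeros(len(stats_a)), np.ones(len(stats_b))])
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, y, cv=5).mean()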

Back to top

What hypotheses underlie your work on the Texture Tiling Model?

In the strong version of our hypothesis, the visual system represents its inputs in a compressed way via a rich set of local summary statistics, and after that stage of processing, the visual system does the best it can with the available information. This implies that the effects of the summary statistic representation will be observable in a wide range of visual phenomena.

Our overall hypothesis is ambitious and high-risk, with the potential for high reward: that one can explain a wide range of visual phenomena using the same model of early visual representation.

Back to top

What assumptions underlie the work of Freeman & Simoncelli (2011)?

Freeman & Simoncelli (2011) assume that no processing level in the visual system is special; all levels have a similar loss of information, and we do not have conscious access to any early levels. This is a fairly standard hypothesis in vision. If their assumption is correct, then with a model of processing at a given level, they can in theory create metamers for that level of processing. The model used to generate these metamers can inform us about visual processing at that stage, as well as giving us clues to where in the visual system that processing might occur.

If their assumption is not correct, the metamer trick may not always work. If, for instance, we had conscious access to the outputs of V1, one would not be able to "trick" viewers with V2 metamers; observers could query V1 and tell that a metamer was different from the original.

Freeman & Simoncelli’s work assumes that the proposed model is an accurate model of the computations in a particular level of the visual system. They (implicitly) assume that the only unknown is how the pooling regions scale with eccentricity, and attempt to measure this scaling behaviorally. By mapping the scaling coefficient to physiological and behavioral data, they hypothesize that these computations happen in area V2.

In the Rosenholtz lab, we are less interested in matching our summary statistic model of vision to a specific brain area than we are in determining what limits such a representation puts on visual processing. Our aim is to see whether our proposed model can explain perceptual phenomena that are not currently well understood.

Back to top

What is Freeman & Simoncelli's methodology? What are the risks associated with it?

Freeman & Simoncelli’s work involves creating metamers: stimuli which are physically different but perceptually identical. In order to make the metamers perceptually indistinguishable from the original image, one must correctly set the rate at which pooling regions grow with distance from the point of fixation. Freeman & Simoncelli find the optimal rate by showing observers original and synthesized images, and determining the rate at which they are indistinguishable. The rate parameter lends insight into which brain region regards the metamer as identical to the original image.
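In outline, the procedure sweeps the scaling rate and looks for the largest rate at which observers cannot tell original from synthesized images. The sketch below uses a crude near-chance criterion and made-up numbers purely for illustration; the published work instead fits a psychometric model to estimate the critical scaling.

    import numpy as np

    # Proportion correct at discriminating original from synthesized images, as a
    # function of the pooling-region scaling rate. Values are placeholders.
    scaling_rates = np.array([0.25, 0.50, 0.75, 1.00, 1.25])
    proportion_correct = np.array([0.52, 0.55, 0.68, 0.83, 0.94])

    # Crude estimate of the critical scaling: the largest rate at which
    # performance remains near chance (here, within 5% of 0.5).
    near_chance = scaling_rates[proportion_correct <= 0.55]
    critical_rate = near_chance.max() if near_chance.size else scaling_rates.min()
    print(f"estimated critical scaling: {critical_rate:.2f}")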

Experimentally, this methodology is straightforward, as is the hypothesis that one can, in theory, construct metamers and use them to study visual processing stages. The risk with this work is that artifacts in the synthesis procedure, and incorrect assumptions, may lead to misjudging in which brain region the hypothesized computations occur. If the computations actually happen in more than one brain region, this methodology will likely pinpoint the earliest region, with the smallest receptive field sizes. If the synthesis routine contains artifacts, smaller pooling regions may be required to render those artifacts invisible, leading again to conclusions which place the computations too early in the ventral stream.

Back to top