When we try to reconcile models and data, we construct fields of currents, temperature, phytoplankton biomass, etc., that nearly fit the data and nearly solve the model equations. Why must we admit that the models might be imperfect?

- Their external forcing fields may be erroneous, either through errors in the methods used to generate them or through undersampling of critical scales
- The dynamics are only approximate and include unresolved scales, simplified parameterizations of critical processes, etc.
- The numerical model computes a grid value; how should this be related to an observation somewhere in the surrounding cells? How do we close the equations after averaging over cells?

The Kalman filter is an excellent algorithm for sequential assimilation of data into imperfect models: it provides a well-recognized way to insert a best estimate of the covariance of the errors in the dynamics plus forcing. Although the sequential aspect is valuable, the price is high: one must compute the full error covariance matrix of the solution, which has made the algorithm prohibitive for large problems.
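The cost structure described above can be sketched in a few lines. This is a generic linear Kalman step with hypothetical matrices (the model operator M, dynamics-plus-forcing error covariance Q, observation operator H, and observation error covariance R are illustrative placeholders, not any particular ocean model):

```python
import numpy as np

def kalman_step(x, P, M, Q, H, R, y):
    """One predict/update cycle of the linear Kalman filter.

    x : state estimate, shape (n,)
    P : state error covariance, shape (n, n) -- the full matrix whose
        propagation makes the filter prohibitive for large n
    M : linear model (dynamics) operator, (n, n)
    Q : covariance of the dynamics-plus-forcing error, (n, n)
    H : observation operator, (m, n)
    R : observation error covariance, (m, m)
    y : observations, shape (m,)
    """
    # Forecast: propagate state and covariance; the model error Q enters here
    x_f = M @ x
    P_f = M @ P @ M.T + Q              # O(n^3) per step -- the dominant cost

    # Analysis: blend forecast with data through the Kalman gain
    S = H @ P_f @ H.T + R
    K = P_f @ H.T @ np.linalg.inv(S)
    x_a = x_f + K @ (y - H @ x_f)
    P_a = (np.eye(len(x)) - K @ H) @ P_f
    return x_a, P_a
```

For an ocean state vector with millions of grid values, storing and propagating P is the bottleneck the text refers to; the state update itself is comparatively cheap.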

The adjoint method is usually implemented such that distributed dynamical errors are suppressed. This need not be the case, but including the fields of model errors would make the dimension of the search so large that unpreconditioned searches for a best fit would be prohibitive. By suppressing the fields of dynamical errors, one need only search for the best initial fields and a few tuning parameters in the dynamics. However, over a long enough assimilation interval the initial conditions exert insufficient control, and the results are poor because the later data have no impact, or an incorrect impact, on model behavior. Clearly, dynamical errors must be allowed to appear later in the assimilation.
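The restricted search described above can be made concrete for a linear model. In this sketch the model is assumed perfect (the strong-constraint case, with dynamical errors suppressed), so the cost depends on the initial state alone, and a single backward adjoint sweep carries the later data misfits back to the gradient with respect to the initial conditions; the operators M, H and the inverse observation covariance Rinv are illustrative placeholders:

```python
import numpy as np

def fourdvar_gradient(x0, M, H, Rinv, ys):
    """Gradient of a strong-constraint 4D-Var cost with respect to the
    initial state x0 alone (distributed dynamical errors suppressed).

    Cost: J(x0) = 0.5 * sum_k (H x_k - y_k)^T Rinv (H x_k - y_k),
    subject to the (assumed perfect) dynamics x_{k+1} = M x_k.
    """
    # Forward sweep: integrate the model from the initial state
    xs = [x0]
    for _ in range(len(ys) - 1):
        xs.append(M @ xs[-1])

    # Backward (adjoint) sweep: accumulate each data misfit back to t = 0
    lam = np.zeros_like(x0)
    for x_k, y_k in zip(reversed(xs), reversed(ys)):
        lam = M.T @ lam + H.T @ (Rinv @ (H @ x_k - y_k))
    return lam  # dJ/dx0
```

Note how every observation reaches x0 only through repeated applications of M.T: for a long interval the product of model transposes attenuates or distorts the later misfits, which is the loss of control the text describes.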

The use of representers allows fields of dynamical errors as well as adjustments to the parameters. It properly preconditions the search by limiting it to the space of observable degrees of freedom. Two concerns have been raised about the representer approach: first, that it applies only to linear models, and second, that the workload increases dramatically as the data set grows. However, the representer method has been applied iteratively to nonlinear quasi-geostrophic models and to three-dimensional, time-dependent primitive equation models, and convergence in both cases was rapid. Concerning the size of the data sets, it is now possible to assimilate very large data sets.
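The key dimensional reduction can be illustrated for a linear, Gaussian problem. In this sketch (a simplified, static analogue, not a full time-dependent representer calculation) each column of P @ H.T plays the role of a representer field, one per datum, and the search reduces to one coefficient per observation rather than one per state variable:

```python
import numpy as np

def representer_analysis(x_b, P, H, R, y):
    """Representer-style analysis for a linear, Gaussian problem.

    x_b : background (prior) state, shape (n,)
    P   : prior covariance, (n, n)
    H   : observation operator, (m, n)
    R   : observation error covariance, (m, m)
    y   : observations, (m,)

    The linear system solved has dimension m (the number of data),
    not n (the state dimension): the search is confined to the
    observable degrees of freedom.
    """
    reps = P @ H.T                  # columns: representer fields, one per datum
    Rmat = H @ reps                 # representer matrix, (m, m)
    b = np.linalg.solve(Rmat + R, y - H @ x_b)   # m coefficients only
    return x_b + reps @ b           # analysis = prior + combination of representers
```

The workload concern raised in the text is visible here: the representer matrix is m-by-m, so it grows with the data set, which is why large-data implementations require iterative solvers or data reduction.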

Along with incomplete models, our understanding of the error fields associated with the data sets to be assimilated is also inadequate. In many cases, we have little quantitative knowledge of how biogeochemical fields vary in time and space. This is most apparent with rate measurements such as primary productivity. Abbott et al. (1995) showed that the simple bio-optical models used to convert instrument measurements into biologically relevant quantities, such as chlorophyll, can have significant temporal and spatial variability. The variations in the model parameters must be understood before such measurements from drifters and moorings can be assimilated into a model. Characterization of these error fields requires extensive, repetitive measurements at the appropriate scales.

These problems are well known, yet in our haste to develop coupled models we sometimes ignore other, equally important issues. First, although circulation models are more mature (and the equations of motion are better defined than the equivalent ecological equations), some obvious problems remain. Circulation models have problems with resolution and parameterizations, just as coupled models do, and forcing fields and initial conditions are no less a problem for purely physical models than for coupled models. Second, our understanding of the physics is still limited: processes such as fronts, bifurcating jets, and other intermittent events are difficult to express in mathematical form. Third, serious technical challenges remain in the numerical methods as well. Examples include variational methods applied to nonsmooth or singular problems (e.g., abrupt switches in parameterized processes, outcroppings in layered models), least squares methods applied to non-Gaussian processes such as intermittent ocean fields (fronts, spring blooms, etc.), the estimation of error fields in least squares estimators, and preconditioning techniques.

As one example of physical/biological coupling, biogeochemical models need fields of vertical velocity produced by circulation models. E. Harrison (pers. comm.) has compared vertical velocities produced by present circulation models using different wind forcing data sets. The differences in the vertical velocities are as large as the differences between ENSO and non-ENSO conditions. This observation suggests that serious problems remain in both the models and the forcing fields. Coupling biogeochemical models with physical models is not straightforward.

Despite these obstacles, many lessons have been learned from past modeling efforts. It appears that biological models should include at least two types of phytoplankton (a diatom and a smaller form that does not require silica), two types of grazers, and two nutrients. One could model iron effects as a regulator of the rate of nitrate uptake, but so far no one has developed an explicit iron-cycling model. The food web structure is important to model behavior, especially in terms of grazing and export production. Some differences in light utilization are related to phytoplankton size. Thus simple nutrient-phytoplankton-zooplankton models do not capture all of the essential behavior. New techniques in parameter estimation are being applied to biological models, but such tuning must be viewed with caution, since our knowledge of the functional forms that we are trying to fit may be inadequate. The first steps in data assimilation in biogeochemical models show promise and should be pursued (Prunet et al., 1996a, 1996b).
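The recommended model structure can be sketched as a set of coupled rate equations. Everything below is an illustrative assumption rather than a published model: the Michaelis-Menten and Ivlev functional forms, the parameter names, and the values in the usage note are hypothetical choices used only to show the two-phytoplankton, two-grazer, two-nutrient architecture, with silica limiting the diatom alone:

```python
import numpy as np

def food_web_rhs(state, p):
    """Time derivatives for a minimal food-web model of the kind the text
    recommends: two phytoplankton (a diatom limited by nitrate and silica,
    and a small form limited by nitrate only), two grazers, two nutrients.
    All functional forms and parameters are illustrative assumptions.

    state = [P1 (diatom), P2 (small phyto), Z1, Z2 (grazers),
             N (nitrate), Si (silicate)], in nitrogen-equivalent units
    except Si; p is a dict of rate parameters.
    """
    P1, P2, Z1, Z2, N, Si = state

    # Michaelis-Menten nutrient limitation; the diatom takes the most
    # limiting of nitrate and silica (Liebig's law)
    lim1 = min(N / (p['kN'] + N), Si / (p['kSi'] + Si))
    lim2 = N / (p['kN'] + N)
    growth1 = p['mu1'] * lim1 * P1
    growth2 = p['mu2'] * lim2 * P2

    # Ivlev-style grazing, one grazer specialized on each phytoplankton
    graze1 = p['g1'] * (1.0 - np.exp(-p['lam'] * P1)) * Z1
    graze2 = p['g2'] * (1.0 - np.exp(-p['lam'] * P2)) * Z2

    dP1 = growth1 - graze1 - p['mP'] * P1
    dP2 = growth2 - graze2 - p['mP'] * P2
    dZ1 = p['a'] * graze1 - p['mZ'] * Z1
    dZ2 = p['a'] * graze2 - p['mZ'] * Z2

    # Nitrate: uptake balanced by regeneration from mortality and the
    # unassimilated fraction of grazing (nitrogen is conserved)
    dN = (-growth1 - growth2 + p['mP'] * (P1 + P2)
          + (1.0 - p['a']) * (graze1 + graze2) + p['mZ'] * (Z1 + Z2))
    # Silica cycles through the diatom only
    dSi = p['rSi'] * (-growth1 + p['mP'] * P1 + graze1)
    return np.array([dP1, dP2, dZ1, dZ2, dN, dSi])
```

Splitting the grazers and nutrients this way is what lets the model distinguish a diatom-driven export pathway from a regenerated-production pathway, which is the food-web sensitivity the text emphasizes; a single NPZ loop cannot represent that distinction.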