Model refinement is a key step in crystallographic structure determination that ensures the final atomic structure of the macromolecule represents the measured diffraction data as well as possible. Several decades of effort have gone into developing methods and computational tools to streamline this step. In this manuscript we provide a brief overview of the major milestones of crystallographic computing and methods development pertinent to structure refinement.


Crystallographic structure determination is a complex procedure that involves a number of very diverse steps, shown in

Crystallographic models are constructed as a solution to an inverse problem (a notion formally introduced by Ambartsumian,

The goal of crystallographic studies is to recover the electron density distribution from measured intensities and to interpret it in terms of individual atoms. For small molecules and high-resolution diffraction data, it is possible to recover the atomic information directly from the amplitudes. For large molecules and less well diffracting crystals, one first tries to solve an intermediate problem of obtaining phases corresponding to the measured amplitudes. The measured amplitudes and recovered approximate phases are then used to calculate the corresponding Fourier synthesis, which is a finite-resolution image of the electron density. This image of the electron density is then interpreted in terms of an atomic model. Depending on the data resolution and the quality of the initial phases, the quality of this image may vary substantially (
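This step can be sketched in a few lines. The toy below (plain NumPy, one-dimensional, with hypothetical function names; not the algorithm of any particular program) combines amplitudes with approximate phases and zeroes reflections beyond a chosen resolution cutoff, producing a finite-resolution image:

```python
import numpy as np

def fourier_synthesis(amplitudes, phases, keep):
    """Finite-resolution image of the density: combine measured amplitudes
    with (approximate) phases and invert; Fourier coefficients outside the
    chosen resolution sphere are zeroed, truncating the image's resolution."""
    coeffs = amplitudes * np.exp(1j * phases)
    coeffs = np.where(keep, coeffs, 0.0)   # resolution cutoff
    return np.fft.ifft(coeffs).real
```

With the full set of coefficients the original density is recovered exactly; truncating the high-resolution terms blurs the image while preserving its mean value.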

This means that models obtained by building atoms into maps calculated using initial approximate phases are often inexact and in most cases insufficient to derive the required structural conclusions: “

Improving a model means modifying its parameters so that the resulting model better describes the experimental data. A way to link the parameters describing the model to the available experimental data is to define some function (the target) whose value consistently decreases (or increases) as the model improves. Thus, refinement of atomic models (simply “refinement” in what follows) can be thought of as an optimization problem (Hughes,

Since refinement can be formulated as an optimization problem, the following needs to be defined: a) the

Model parameters are the variables that describe the crystal and its content. For example, these may be the coordinates of the atoms, parameters describing atomic vibrations and disorder, descriptors of the solvent continuum and so on. Once these parameters are defined, they can be used to calculate structure factors from the model. The amplitudes of the calculated structure factors are then matched against the experimentally measured structure factor amplitudes, and the target function is evaluated. An optimization method is then used to decide how the model parameters can be changed such that the target function value decreases (in the case of minimization). Once this decision is made, a new set of structure factors is calculated from the updated model and matched against the measured values again. This is repeated several times until convergence (
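The loop just described can be illustrated with a deliberately tiny model. In the sketch below (plain NumPy; a single hypothetical parameter b with amplitudes exp(-b·h²); not any program's actual code) the target is re-evaluated after every trial change, and only improving moves are kept:

```python
import numpy as np

def f_calc(b, h):
    # toy 'structure factor' amplitudes: one Gaussian atom, B-like parameter b
    return np.exp(-b * h ** 2)

def target(b, f_obs, h):
    # least-squares match between calculated and measured amplitudes
    return np.sum((f_obs - f_calc(b, h)) ** 2)

def refine(b, f_obs, h, n_cycles=60, step=0.005):
    """Toy refinement loop: propose a small change of the parameter, keep it
    only if the target decreases, repeat; when no move helps, the refinement
    has converged."""
    t = target(b, f_obs, h)
    for _ in range(n_cycles):
        for db in (-step, step):
            t_new = target(b + db, f_obs, h)
            if t_new < t:            # accept only improving moves
                b, t = b + db, t_new
                break
    return b, t
```

In this toy setting, starting from b = 0.2 against amplitudes simulated at b = 0.1, the loop walks the parameter down to the correct value, after which no trial move improves the target, i.e. it has converged.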

Diffraction theory (Ewald, ) expresses each structure factor as a sum over all atoms of the model. If the number of measured reflections is of the order of 10^{4} or larger, and we have a comparable number of atoms (10^{4} or larger), then calculating a single set of structure factors from the model and comparing their amplitudes with the experimental values requires of the order of 10^{10} computer operations.
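The cost estimate follows from the direct summation formula, in which every reflection requires a sum over every atom. A minimal sketch (NumPy; fractional coordinates and point scattering factors; hypothetical names):

```python
import numpy as np

def structure_factors_direct(hkl, xyz, f0):
    """Direct summation F(h) = sum_j f0_j * exp(2*pi*i * h.x_j): the cost is
    O(n_reflections * n_atoms), which motivated faster (FFT-based) schemes."""
    phases = 2.0 * np.pi * (hkl @ xyz.T)        # (n_refl, n_atoms) phase matrix
    return (f0 * np.exp(1j * phases)).sum(axis=1)
```

For 10^4 reflections and 10^4 atoms the phase matrix alone has 10^8 entries, which is the quadratic cost referred to above.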

Steepest descent is a simple and powerful optimization method that can be employed to minimize a refinement target. Using this method requires the vector of partial derivatives (the gradient) of the refinement target with respect to the model parameters. These derivatives can be calculated either from formulae or by finite differences. It is worth noting that the calculation of each partial derivative is as computationally expensive as the calculation of the target function itself. In both cases, the number of operations required to calculate a single gradient is proportional to the number of model parameters (with respect to which this gradient is calculated), giving of the order of 10^{15} ≈ 10 ∙ 10^{4} ∙ 10^{10} operations (each atom being characterized by 5-10 parameters). Since optimization of the target typically requires many iterations to converge to a local minimum, the number of operations may easily rise to 10^{20}.
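A minimal steepest-descent sketch for a one-parameter toy model (amplitudes exp(-x·h²); hypothetical names, not a production implementation) shows the two ingredients: a target and its analytic gradient, the latter costing about as much to evaluate as the target itself:

```python
import numpy as np

def lsq_and_grad(x, f_obs, h):
    """Least-squares target and its analytic gradient for toy amplitudes
    f_calc = exp(-x * h^2); evaluating the gradient costs roughly as much
    as evaluating the target."""
    f = np.exp(-x * h ** 2)
    resid = f_obs - f
    t = np.sum(resid ** 2)
    g = np.sum(2.0 * resid * h ** 2 * f)   # d(t)/dx, since d(f)/dx = -h^2 * f
    return t, g

def steepest_descent(x, f_obs, h, n_iter=500, rate=0.005):
    # repeatedly step against the gradient; converges to the nearest minimum
    for _ in range(n_iter):
        _, g = lsq_and_grad(x, f_obs, h)
        x -= rate * g
    return x
```

Note the fixed step size: too large a rate makes the iteration oscillate and diverge, a practical difficulty of plain steepest descent.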

As a quick illustration we cite Hughes (

“

Obviously, computer power has increased dramatically since that time but the same is true for the size of the structural problems.

These computational difficulties (computational cost) are compounded by methodological obstacles. For example, optimization methods have a limited convergence radius: the steepest descent method converges to the closest local minimum, which may be far from the global minimum we are interested in. Even if the global minimum of the least-squares target is reached, it may correspond to an incorrect structure, since this target does not take model incompleteness into account. Indeed, since structure factors depend on the whole set of atoms, atoms missing from the model at intermediate refinement steps make the direct comparison of calculated and measured structure factors by a least-squares target inappropriate, and minimization of the least-squares target may make the structure worse (Lunin, Afonine and Urzhumtsev,

The development of refinement programs closely followed the progress of macromolecular structure solution. By the mid-1960s the first macromolecular atomic structures had been reported and computers had become accessible to crystallographers. In 1971 R. Diamond reported the first common-use refinement program, which employed a number of methodological advances. First, to reduce the number of independent parameters and to avoid distortion of covalent bonding geometry due to insufficient experimental data resolution, Diamond used torsion angles as the variable atomic model parameters. While requiring fewer variables, parameterization in torsion-angle space may limit refinement convergence, as any change in atomic coordinates needs to be propagated along the chain. Second, to avoid the time-consuming calculation of structure factors from the model, Diamond suggested using Fourier maps calculated with the experimental amplitudes and the best available approximate phases as the target for fitting the atomic model parameters. The phases, being approximate, may be inaccurate enough to lead to an incorrectly refined model.

The availability of Diamond’s refinement program designed specifically for macromolecules, and progress in macromolecular structure solution in general (Watenpaugh

Some of these programs (Sussman

The first major advance was a result of the intuition of D. Sayre (

A practical algorithm for performing Fourier transforms efficiently, known as the Fast Fourier Transform (FFT), had been suggested by Cooley and Tukey (
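The gain is easy to demonstrate in one dimension: sampling the density on a grid and applying a single FFT produces every Fourier coefficient at once. The sketch below (NumPy; point atoms placed on exact grid points so that the two routes agree; hypothetical names) compares the FFT route with direct summation:

```python
import numpy as np

def sf_direct(h, xs):
    # direct summation over point atoms in one dimension
    return np.array([np.sum(np.exp(2j * np.pi * hh * xs)) for hh in h])

def sf_fft(xs, n_grid=64):
    """FFT route: sample the (point-atom) density on a grid, then one FFT
    yields all Fourier coefficients at once, O(n log n) instead of
    O(n_reflections * n_atoms)."""
    rho = np.zeros(n_grid)
    for x in xs:
        rho[int(round(x * n_grid)) % n_grid] += 1.0
    # np.fft.fft uses exp(-2*pi*i...); for a real map, conjugating its
    # transform matches the crystallographic exp(+2*pi*i...) convention
    return np.conj(np.fft.fft(rho))
```

Real programs place Gaussian (or more elaborate) atom shapes on the grid rather than point scatterers, but the cost structure is the same.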

The rapid progress of computer hardware stimulated the development of other numerical methods including those of optimization. In particular, Hestenes and Stiefel (

The problem of fast gradient calculation was not specific to crystallographic refinement. Baur and Strassen (

The principal idea is that, for the steps involved in the calculation of the target function, the calculation of the gradient involves the same steps but traversed in the backward direction. This means that as soon as a fast algorithm for calculating an arbitrary target function is available, a fast algorithm for calculating its gradients is guaranteed to be available too.
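This is the idea behind what is now called reverse-mode differentiation. The sketch below (NumPy, a toy one-parameter model; not code from any refinement program) computes the target through three forward steps and then obtains the exact gradient by traversing the same steps backwards, which a finite-difference check confirms:

```python
import numpy as np

def forward(x, h, f_obs):
    # forward pass: parameter -> amplitudes -> residuals -> scalar target,
    # keeping the intermediates needed for the backward sweep
    f = np.exp(-x * h ** 2)     # step 1
    r = f - f_obs               # step 2
    t = np.sum(r ** 2)          # step 3
    return t, (f, r)

def backward(x, h, cache):
    # backward pass: the same steps traversed in reverse order, each one
    # mapping 'gradient w.r.t. my output' to 'gradient w.r.t. my input'
    f, r = cache
    dr = 2.0 * r                       # undo step 3: dt/dr
    df = dr                            # undo step 2: dt/df
    dx = np.sum(df * (-h ** 2) * f)    # undo step 1: dt/dx
    return dx
```

The backward pass reuses the intermediates of the forward pass, so its cost is proportional to the cost of one target evaluation, independently of the number of parameters.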

This important result had a number of implications for the development of refinement programs and methods. First, it showed that Agarwal’s algorithm was a particular case of a general approach. Second, it indicated that there was no need for refinement targets to be quadratic; fast gradients can be calculated for any function, and crystallographers could therefore focus on the best choice of targets from a structural rather than a computational point of view. Third, it showed that a crystallographic target can include any type of restraint, not only quadratic functions of coordinates or distances. Overall, this principle allowed deconvolution of the three basic components of the optimization problem: the choice of the model parameters, the choice of the target, and the model optimization method.

A next important question is whether it is possible to propose a general way of developing refinement programs given the variety of models and refinement targets. The considerations above suggest that the determining step is the calculation of the target from the initial independent parameters.

The key step of refinement is generating structure factors from a crystal model (here we assume an atomic model) and comparing them with the experimental data via evaluation of the target function. Using constraints means that atomic parameters (coordinates and/or scattering parameters) are not independent but are obtained from some other parameters that are varied (refined) independently. An example is rigid-body refinement, where groups of atoms are considered rigid (Scheringer,
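As an illustration of constraints, the sketch below (NumPy; hypothetical names; rotation about the group's centroid chosen for simplicity) generates all atomic coordinates of a rigid group from just six independent parameters, three rotation angles and three translations:

```python
import numpy as np

def rigid_body_coords(xyz0, angles, shift):
    """Positions of a rigid group from six refinable parameters (three
    rotation angles, three translations) instead of 3N free coordinates."""
    ax, ay, az = angles
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az), np.cos(az), 0],
                   [0, 0, 1]])
    center = xyz0.mean(axis=0)          # rotate about the group's centroid
    return (xyz0 - center) @ (rz @ ry @ rx).T + center + shift
```

Whatever the six parameters, all interatomic distances within the group are preserved, which is exactly the constraint being imposed.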

Once parameters of all atoms are known, the density map in the crystal is generated using spherical or multipolar (Hansen and Coppens,

The next step is also common to most refinement programs: the density map is converted into a set of its Fourier coefficients (structure factors). For this, any efficient Fourier transform algorithm, and not necessarily the FFT (Cooley and Tukey,

If a real-space target is used, for example one comparing a model map with a known cryo-EM map point by point, one more step is needed: calculating a model map from the subset of structure factors obtained at the previous step, namely those within a sphere of limited resolution. Since this requires additional calculations, none of the known programs does this in such a strict way. One possibility to avoid this calculation is to convert the experimental map into its Fourier coefficients and use them for comparison with the model structure factors. Another possibility is to estimate the shape of individual atoms in maps of the same resolution as the experimental one (Diamond,

At each step shown above we have different kinds of crystal description,

Following the steps above that describe the

The overall calculation algorithm is a chain of transitions between different kinds of crystal description; each transition depends neither on the previous steps nor on subsequent ones. Obviously, a transition may pass through its own internal intermediate steps. For each kind of parameter, various targets can be introduced that are fully independent of the parameters of other kinds. For example, one can envision targets (restraints) on the independent parameters in the case of constraints (Urzhumtsev, Lunin and Vernoslova,

The fact that the global target to be optimized is a sum of component targets allows independent calculation of their gradients with respect to the independent parameters. The algorithms to calculate the gradient of each of them with respect to its own variables are obtained by inverting each transition one by one. Then, using the chain rule, these gradients are recalculated into gradients with respect to the independent parameters (Lunin and Urzhumtsev,
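For the simplest possible transition, a common shift applied to a whole group of atoms, the chain rule reduces to summing per-atom gradients. A sketch (NumPy, a toy harmonic target; hypothetical names), verified against finite differences:

```python
import numpy as np

def target_and_atom_grads(xyz, xyz_ref):
    # a toy harmonic target on atomic coordinates, with per-atom gradients
    d = xyz - xyz_ref
    return np.sum(d ** 2), 2.0 * d

def target_and_shift_grad(xyz0, shift, xyz_ref):
    """Chain rule across the transition 'shift -> coordinates': since
    d(xyz_i)/d(shift) is the identity for every atom, the gradient with
    respect to the independent parameter is the sum of per-atom gradients."""
    t, g_atoms = target_and_atom_grads(xyz0 + shift, xyz_ref)
    return t, g_atoms.sum(axis=0)
```

Inverting a richer transition (e.g. rotations, or torsion angles) replaces the identity Jacobian with that transition's own derivatives, but the composition pattern is the same.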

There have been many advances in macromolecular refinement due to computational progress and methodological understanding; we give only a few examples. Decoupling the choices of model, target and optimization procedure for crystallographic refinement allowed Brünger, Kuriyan and Karplus (

This also allowed for the straightforward introduction of a number of new diffraction targets such as a maximum-likelihood (Pannu and Read,

Finally, new parameters could be introduced without the need to reformulate either the target calculation, or the minimization procedure. In particular, this concerns non-atomic parameters such as new bulk-solvent models (for example, Jiang and Brunger,

As can be seen from the preceding sections, a refinement program is typically a large suite composed of many modules, each designed to perform a specific task. While older programs are mostly written in FORTRAN (of which D. Sayre was one of the principal developers), most recently developed tools such as

In spite of the effort required to develop an efficient and robust refinement program, the crystallographic community nowadays has access to a large number of them. The most popular are SHELXL (Sheldrick and Schneider,

A refinement run for a macromolecular model nowadays takes from a few minutes (for a small protein) to several hours (for structures as large as the ribosome). This acceleration is due both to the availability of new, powerful computers and to the efficient algorithms implemented in refinement programs.

Steps from phasing to final structure report (

Improvements in crystallization and data collection techniques have increased the number of low-resolution datasets being collected. Typically these data correspond to crystals of large molecules that may have substantial mobility. Low-resolution maps combined with the size of the problem (large models result in a large amount of data) make model building and refinement extremely challenging. First, low-resolution maps do not readily permit accurate model building, so initial models often possess poor geometry and may have gross stereochemical imperfections. Given unfavorable data-to-parameter ratios, subsequent refinement often may not yield significant improvement. The lack of experimental data at these resolutions means that successful refinement is highly dependent on prior knowledge – the restraints. While the traditional stereochemistry restraints used in refinement are sufficient at medium to high resolution, they do not provide enough additional information at low resolution (Headd

As mentioned above, the geometry restraints used in refinement programs can be simple and relatively naïve, mostly designed to preserve basic model geometry and prevent a model from deteriorating in the case of insufficient-quality data, which is almost always the case for macromolecules. As a result, these restraints tend to generate unrealistic models if data resolution is limited. An alternative to extending these restraints with additional information is to design better potential functions, which may be not as sophisticated as those used in the molecular simulation field but more tailored to the context of structure refinement. Another approach is the use of QM/MM (quantum mechanics/molecular mechanics) methods to generate accurate structures of small molecules in macromolecular structures, or even of whole macromolecular structures (Canfield

Hydrogen is a weak X-ray scatterer and is therefore barely observable in maps derived from X-ray diffraction experiments. Historically this has prompted macromolecular crystallographers to build models without H atoms. Only in ultra-high resolution X-ray diffraction experiments is it possible to visualize some, but typically not all, hydrogen atoms. These ultra-high resolution structures constitute only 0.002% of all structures in the PDB. At the same time, H atoms constitute nearly half of the atoms in a protein structure; they mediate most interatomic contacts, often play key roles in the catalytic activities of enzymes, and participate in ligand binding. While most hydrogen positions can be inferred from the local geometry, 10-15% of H atoms have rotational degrees of freedom and thus cannot be predicted from local stereochemistry alone. Neutron diffraction is therefore a technique of some importance (for review, see for example Afonine

While it is rather rare that crystals of macromolecules diffract to ultra-high resolution, better than approximately 0.9 Å, there are ~500 structures in the PDB solved at this resolution. The typical Gaussian model parameterization used at lower resolutions is insufficient in these extreme cases. Instead, more complex models are needed, such as multipolar representations of the electron density distribution. However, this approach approximately triples the number of parameters per atom. This poses some fundamental problems. One is that the FFT-based method of structure factor and gradient calculation cannot be readily used for a non-Gaussian (multipolar) parameterization, although progress has been reported (Schnieders

Recent improvements in the field of cryo-electron microscopy (cryo-EM) have made it possible to generate structural information at resolutions approaching those of low-resolution X-ray crystallography (3.5 Å and lower). The result of a cryo-EM experiment is a map that can be used to build and refine an atomic model. Most refinement tools available today were designed for X-ray or neutron crystallography, and therefore perform complete model refinement against diffraction data (amplitudes or intensities of measured structure factors) and not maps. Also, cryo-EM structures are typically very large. Since the resolution is low, the maps are challenging to interpret and provide limited information for model refinement. Therefore, new methods need to be developed, such as real-space refinement approaches that can efficiently perform rapid refinement of large macromolecules to generate models of high chemical quality.
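A basic real-space target of this kind is the correlation between a map computed from the model and the experimental map evaluated on a common grid. A minimal sketch (NumPy; hypothetical name; real programs add masking, weighting and local, per-residue evaluation):

```python
import numpy as np

def map_cc(model_map, exp_map):
    """Real-space target for cryo-EM style refinement: the correlation
    coefficient between a model-derived map and the experimental map,
    compared point by point on a common grid."""
    a = model_map.ravel() - model_map.mean()
    b = exp_map.ravel() - exp_map.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))
```

Because the correlation is invariant to the overall scale and offset of the maps, it avoids the scaling step that amplitude-based targets require.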

Structure validation is a process that aims to perform a thorough assessment of model quality. Traditionally, structure validation was performed at the very end of structure determination. However, it is now accepted that this is suboptimal, because errors introduced and unnoticed at the beginning of structure determination may propagate and become very difficult to detect and address later on. Therefore, active structure validation should be performed continuously throughout the entire process of structure determination, not only at the very end. This changes the paradigm of the structure determination workflow and thus requires significant changes in the corresponding software.

This work was supported by the NIH (Project 1P01 GM063210) and the Phenix Industrial Consortium. This work was supported in part by the US Department of Energy under Contract No. DE-AC02-05CH11231. AU thanks the French Infrastructure for Integrated Structural Biology (FRISBI) ANR-10-INSB-05-01 and Instruct, part of the European Strategy Forum on Research Infrastructures (ESFRI).