Here we provide a web interface to help in the normalization of cDNA microarray data. The method implemented is the print-tip loess as explained in Yang, Dudoit, Luu, Lin, Peng, Ngai, and Speed (2002). "Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation". Nucleic Acids Research, Vol. 30, No. 4, e15. Additional information can be found in Smyth & Speed (2003). "Normalization of cDNA microarray data", in "METHODS: Selecting Candidate Genes from DNA Array Screens: Application to Neuroscience", D. Carter (ed.). The paper by Smyth, Yang, and Speed (2003). "Statistical issues in microarray data analysis", in "Functional Genomics: Methods and Protocols", M. J. Brownstein and A. B. Khodursky (eds.), Methods in Molecular Biology Volume 224, Humana Press, Totowa, NJ, pages 111-136, contains additional information on quality measures (especially section 6) and is very useful. The above three references focus on the print-tip loess approach; Huber, von Heydebreck, and Vingron (2003). "Analysis of microarray gene expression data", in "Handbook of Statistical Genetics, 2nd edition", Wiley, provide more background and a discussion of alternative normalization methods.
We strongly encourage you to read the above references, because we do not give many details on this page. Essentially, the objective of normalization (here we closely follow the introduction in Smyth & Speed (2003)) is to adjust for effects that are due to variations in the technology rather than the biology. In particular, we try to adjust for differences between the red and green labeling caused, for example, by differences in the binding of the labels; a widespread phenomenon is dye bias that depends on intensity (the typical curvilinear MA plots). Since these differences can be related to which print-tip printed each spot, the adjustment is generally carried out for each print-tip separately. Thus, the basic normalization is a print-tip loess, which fits a robust local regression to the relation between M (the log ratio, i.e., the difference between the log intensities of the two channels) and A (the "average" staining, i.e., the mean of the two log intensities). The normalized M value is the original one minus the loess fitted one, and thus should correct for spatial effects (as reflected by print-tips) and for effects related to intensity.
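The idea of "normalized M = original M minus fitted M, computed separately for each print-tip" can be sketched as follows. This is only an illustration, not the code DNMAD runs: it uses a plain tricube-weighted local linear regression without the robustness iterations of a real loess, and the function names and the `span` default are our own assumptions.

```python
from collections import defaultdict


def local_linear_fit(a_values, m_values, span=0.4):
    """Fitted M at every A, using tricube-weighted local linear regression.

    Simplified, NON-robust stand-in for loess (no outlier down-weighting).
    """
    n = len(a_values)
    k = max(2, int(round(span * n)))  # number of neighbours in each window
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[min(k, n) - 1] or 1e-12  # bandwidth: k-th smallest distance
        s0 = s1 = s2 = t0 = t1 = 0.0
        for a, m in zip(a_values, m_values):
            u = abs(a - a0) / h
            w = (1 - u ** 3) ** 3 if u < 1 else 0.0  # tricube weight
            x = a - a0  # centre on a0, so the intercept is the fitted value
            s0 += w
            s1 += w * x
            s2 += w * x * x
            t0 += w * m
            t1 += w * x * m
        det = s0 * s2 - s1 * s1
        if abs(det) < 1e-12:
            fitted.append(t0 / s0)  # degenerate window: weighted mean
        else:
            fitted.append((s2 * t0 - s1 * t1) / det)
    return fitted


def print_tip_normalize(a_values, m_values, tip_groups, span=0.4):
    """Subtract, from each M, the curve fitted within its own print-tip group."""
    by_tip = defaultdict(list)
    for i, tip in enumerate(tip_groups):
        by_tip[tip].append(i)
    normalized = [None] * len(m_values)
    for idx in by_tip.values():
        fit = local_linear_fit([a_values[i] for i in idx],
                               [m_values[i] for i in idx], span)
        for i, f in zip(idx, fit):
            normalized[i] = m_values[i] - f
    return normalized
```

If the dye bias within each print-tip group were an exact straight line in A, this sketch would remove it completely; real data need the robust fit, so that truly differentially expressed spots do not drag the curve.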
You also want to look at some additional plots. We provide images of the arrays, including the red and green background, and the unnormalized and normalized M. These plots should help you spot damaged arrays, spatial patterns, or miscellaneous strange patterns.
The histograms of the raw pixel intensities show the log (base 2) of the red and green mean foregrounds. These values will often range between 0 and 16. You do not want to see values piling up at the high end (that would probably mean a lot of saturation because the scanner was set too high), nor at the low end (the hybridization did not work well).
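A rough numerical counterpart of looking at these histograms is sketched below. It assumes a 16-bit scanner (so log 2 intensities fall between 0 and 16); the function names are hypothetical, and what counts as "too many" spots in the top bin is left for you to judge.

```python
import math


def log2_histogram(intensities, n_bins=16):
    """Count spots per unit-wide bin of log2(intensity): [0,1), [1,2), ..."""
    counts = [0] * n_bins
    for x in intensities:
        if x <= 0:
            continue  # the log of a non-positive value is undefined
        b = math.floor(math.log2(x))
        b = max(0, min(b, n_bins - 1))  # clamp into the available bins
        counts[b] += 1
    return counts


def fraction_in_top_bin(counts):
    """Share of spots in the highest bin; a large share suggests saturation."""
    total = sum(counts)
    return counts[-1] / total if total else 0.0
```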
Finally, you also want to check that the normalization is working in terms of scale (approximately, the variance). Thus, we provide intra-array box plots to help assess whether there are differences in scale among print-tips within an array, and inter-array box plots to assess whether there are differences in scale among arrays. If there are large differences in scale among arrays, you might want to normalize for scale; this is rarely required, and introduces additional noise.
Block Column Row Name ID F635 Mean B635 Median F532 Mean B532 Median Flags
If you normalize in different passes, you can still check that the variances of different arrays are similar by looking at the box plots of the M values after normalization.
You must enter the layout.
The layout refers to the number of rows and columns in the main grid and the number of rows and columns within each subgrid (each of the squares or rectangles defined by the rows and columns of the main grid). Your arraying facility should tell you how the array is structured. We deliberately do not try to guess the layout because, if we made a mistake here, all print-tip based methods and all image plots would be wrong. Thus, we emphasize, make sure you get the layout right. (Yes, there is a simple way to see the dimensions of the subgrid, and that is by looking at the largest number for rows and columns. However, there is often no unambiguous way to guess the main grid: suppose you get a 36 for the largest block number; now, does that come from a 6 x 6 or from a 9 x 4 array?).
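The ambiguity in the main grid is easy to see by listing every layout compatible with a given largest block number. The sketch below is only an illustration; in particular, `block_position` assumes that blocks are numbered row by row starting at 1, which you should confirm with your arraying facility.

```python
def candidate_main_grids(max_block):
    """All (rows, columns) main-grid layouts that contain max_block blocks."""
    return [(rows, max_block // rows)
            for rows in range(1, max_block + 1)
            if max_block % rows == 0]


def block_position(block, grid_columns):
    """0-based (row, column) of a block, ASSUMING row-by-row numbering from 1."""
    return (block - 1) // grid_columns, (block - 1) % grid_columns
```

For a largest block number of 36 there are nine candidate layouts, including both 6 x 6 and 9 x 4, and the same block ends up in different places depending on which one is assumed: block 5 sits in the first row of a 6 x 6 grid but in the second row of a 9 x 4 grid.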
At CNIO, our layout for the human oncochip is often 12 x 4 (main grid) and 16 x 16 (sub-grid); for the mouse oncochip it is 12 x 4 and 22 x 22.
You can set here what type of sample goes in the numerator and what in the denominator of the log ratio. With reference designs, people often place the sample in the numerator and the control or reference sample in the denominator. Thus, for instance, if your reference was always labeled with Cy3, you should choose "red(Cy5)/green(Cy3) ratio."
Please note that you can set different colouring options for different arrays if you enter array by array. This is very handy when you have dye-swaps. If you entered a compressed file, the same scheme of ratios and colors is used for all the arrays.
DNMAD now (as of August 2005) provides more flexible options for the handling of flags. These options allow you to exclude spots from being used for normalization (i.e., exclude spots from being used for finding the loess fits), exclude spots from being normalized at all (essentially, turning them into missing values), and/or use only a subset of spots for finding the loess fits. At the same time, rather than offer a whole new interface for this part, we wanted to provide something reasonably familiar to previous users of DNMAD, so as not to break established ways of using the tool.
You must check this option to use any flags at all. If you check it, and you have any spots with negative flags, those spots will not be used for fitting the loess curves. In addition you can:
At CNIO most people who exclude a point from the normalization also want to exclude that point from any further analysis (for example, because the spot had a bad shape, or there was a scratch). If you check this option, spots that are not used for the normalization will be returned as "NA" (missing values) in the final result.
However, you might want to exclude a spot from the normalization, but have it returned. In such a case, the spots with negative flags are not used for the normalization, but their "normalized M value" is computed (subtracting from their original M the fitted value from the loess curve of the rest of the points) and returned in the final results.
If you check this option and you have any spots with positive flags, only spots with positive flags will be used for computing the loess fits. In other words, only spots with positive flags are used for computing the normalization. Spots without flags are normalized: their "normalized M value" is computed (subtracting from their original M the fitted value from the loess curve of the rest of the points) and returned in the final results. Spots with negative flags are either normalized or returned as missing depending on how you set the option for "Return negative flagged points as NA".
The following table might help:
| What you want | How your flags should be set in your files | DNMAD setting: Use negative flags | DNMAD setting: Return negative flagged points as NA | DNMAD setting: Use positive flags |
|---|---|---|---|---|
| Use all spots for fitting normalization and normalize all spots | Does not matter (as long as none of the boxes is checked!) | | | |
| Do not use some spots for normalization, but normalize all | Flag with a negative flag the spots to exclude | X | | |
| Do not use some spots for normalization, and set as missing the ones not used for normalization | Flag with negative flags the spots to set as missing | X | X | |
| Use only some spots for normalization, set as missing some of the ones not used for normalization, normalize (but not use for normalization) other spots | Flag with negative flags the spots to set as missing, with positive flags the spots to use for normalization, and with a flag of 0 those to normalize but not use for normalization | X | X | X |
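
The combinations in the table can also be written down as a small decision function. This is our reading of the rules described on this page, not DNMAD's actual code; the function and argument names are made up for the illustration.

```python
def classify_spots(flags, use_negative=False, negative_as_na=False,
                   use_positive=False):
    """Return two lists of booleans: (used_for_fit, returned_as_na).

    Sketch of the three check boxes as described in this documentation.
    """
    n = len(flags)
    if not use_negative:
        # "Use negative flags" must be checked for flags to matter at all.
        return [True] * n, [False] * n
    if use_positive and any(f > 0 for f in flags):
        used = [f > 0 for f in flags]   # only positively flagged spots
    else:
        used = [f >= 0 for f in flags]  # everything except negative flags
    na = [negative_as_na and f < 0 for f in flags]
    return used, na
```

For three spots flagged -50, 0 and 1, checking all three boxes uses only the third spot for the fit, returns the first as NA, and normalizes (without using it for the fit) the second.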
Which spots should you exclude? Recall that when you normalize there is no assumption that the spots being used are spots that are useful for your biological question. For example, a spot labeled "gene A" but which, because of some mishap, really contains some other sample, is probably unsuitable for future analyses (you don't know what is in there) but it is probably perfectly suitable for normalization: it contains valid information on the relationship between M and A. Likewise, there might be spots which contain "housekeeping" genes, or other samples you do not care about. You probably do not want to exclude these spots from the normalization (even if you do not use them later in your analyses). On the other hand, there might be a spot that you cannot trust because it has an obvious halo, it is doughnut shaped, or it has any other physical characteristic that would lead you to mistrust any mean, median, or other statistic computed by your software. In general, excluding a spot just because you have no experimental interest in it is NOT a good idea. Please think about what to exclude, if anything. In addition, not everybody trusts automatically set flags (e.g., the -50 in GenePix); so some people leave the -50, but then use -75 or -100 to identify manually set flags. If you want to exclude the -75 but not the -50, you will need to do some work on your own, like setting all -50 flags to 0. (Gawk, GNU awk, is a great tool for these types of changes).
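If you prefer Python to gawk, resetting one flag value could look like the sketch below. It assumes a tab-separated table whose first line is the column header and which has a column called "Flags" (any preamble lines your scanner software writes before the header would have to be removed first); the function name is hypothetical.

```python
import csv
import io


def reset_flag(tsv_text, old_flag=-50, new_flag=0, column="Flags"):
    """Return the table with every `old_flag` in `column` set to `new_flag`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if int(row[column]) == old_flag:
            row[column] = str(new_flag)  # e.g. automatic -50 becomes 0
        writer.writerow(row)
    return out.getvalue()
```

With the defaults, spots flagged -50 become unflagged (0), while manually set flags such as -75 are left untouched and can still be excluded.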
Which spots should you use to normalize? (Using the positive flags option.) Here the idea is often that you have a set of points (those with a positive flag) that you know (or think you know) should be the ones to use for the normalization, while all other points are not to be trusted for normalization. Please, please, think twice before choosing these "good" points. In our experience, it might be more natural and easier to justify such a choice with aCGH arrays than with gene expression arrays.
You can use the background to correct the foreground values. There are several possible approaches. The simplest is background subtraction, where the median of the background for each spot is subtracted from the foreground of that spot. One consequence of this approach is that you can end up with many spots with negative values, and the log of a negative value is not defined. Thus, you end up creating lots of "missing values" where you actually had a true reading (granted, a reading where foreground was less than background, but definitely a reading that gives you more information than a true missing value from, say, a spot where you had a bubble). Alternatively, and probably a much better option, you can use the "half" method, where any intensity less than 0.5 after subtracting the background is reset to 0.5 (and thus you end up with a very small value, but one that will not lead to a missing value when taking logs).
If you don't use background subtraction, the values used are the means of the foreground. It used to be that most people used background subtraction, but there is some evidence that background subtraction might do more harm than good (e.g., Qin and Kerr, 2003), another example of the variance-bias trade-off.
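The three options just described (no correction, plain subtraction, and the "half" method) differ only in what happens when the foreground is at or below the background. A minimal sketch, with function and option names of our own choosing:

```python
import math


def background_correct(foreground, background, method="half"):
    """Corrected intensity of one spot in one channel."""
    if method == "none":
        return foreground
    difference = foreground - background
    if method == "subtract":
        return difference            # may be zero or negative
    if method == "half":
        return max(difference, 0.5)  # anything below 0.5 is reset to 0.5
    raise ValueError("unknown method: " + method)


def safe_log2(value):
    """log2 of a positive value; None (a missing value) otherwise."""
    return math.log2(value) if value > 0 else None
```

A spot with foreground 100 and background 120 becomes a missing value under plain subtraction, but keeps a (very low) log intensity of -1 under the "half" method.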
The design of your array might be such that the assumptions behind the use of loess do not hold within each print-tip group. In such a case, you might want to use loess over the whole array. Note that, if you use global loess, you are no longer taking advantage of the implicit spatial correction that is carried out with print-tip loess.
First, if there were warnings or other types of recoverable problems, you will see a listing of them. A typical situation is having few points in one of the subgrids when using print-tip loess. A warning will be given if there are fewer than 100; this is a somewhat arbitrary threshold, but it lets you know that your loess curve might be unreliable, and thus the normalization suspect (we thank Gordon Smyth for this comment).
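You can run the same kind of check on your own data before uploading. The sketch below counts, for each print-tip group, the spots that would be available for the fit, and lists the groups that fall under the threshold; the names are hypothetical and the default of 100 simply mirrors the threshold mentioned above.

```python
from collections import Counter


def sparse_print_tips(tip_groups, usable, min_spots=100):
    """Print-tip groups with fewer than `min_spots` spots usable for the fit."""
    counts = Counter(tip for tip, ok in zip(tip_groups, usable) if ok)
    return sorted(tip for tip in set(tip_groups) if counts[tip] < min_spots)
```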
You get box plots of the log (base 2) ratio of each array, before and after print-tip loess normalization. This allows you to assess the need for slide scale normalization (a normalization that will ensure the same scale for all arrays).
For each array, we also show box plots of the log ratio for each print-tip group, before and after normalization. This allows you to assess whether the scale of the different print-tip groups is comparable.
If there are differences in scale among arrays, you might want to consider slide scale normalization. Using slide scale normalization adds noise to the final values (you are dividing by random variables), and the references above often discourage this normalization unless there are strong reasons for using it. We assume you know what you are doing.
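To make "dividing by random variables" concrete, here is a minimal sketch of a scale normalization based on the median absolute deviation (MAD), in the spirit of the scale adjustment discussed in the references above. It is not necessarily what DNMAD computes internally, and the function names are ours.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median (no scaling constant)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def scale_normalize(arrays):
    """Divide the M values of each array by its MAD relative to the
    geometric mean of the MADs of all arrays.

    `arrays` is a list of lists of M values, one list per array.
    Assumes every array has a MAD greater than zero.
    """
    mads = [mad(m_values) for m_values in arrays]
    geometric_mean = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[m / (s / geometric_mean) for m in m_values]
            for m_values, s in zip(arrays, mads)]
```

Each divisor is itself estimated from the data, which is where the extra noise in the final values comes from.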
The first plots shown are histograms of the raw intensities of each of the foreground channels.
Similar (not identical) information is also displayed in the density plots of the arrays. The densities shown are those for the raw or background corrected (depending on the option you choose) red and green (shown as blue) channels. On the left, and for comparison, we show the distribution of these channels for all the arrays. These types of plots are more common in single-channel normalization, but we have found that they can be useful in some cases to identify problematic arrays.
You will also want to examine the image plots, to check the quality of your arrays and possible spatial problems. To make life easier for color-blind people, the M plot is not shown with the customary green-to-red scale, but rather with a blue-to-red scale. We show here image plots of the log 2 intensities of the foreground and background for both channels, as well as the transformed and untransformed M values (the log [base 2] differential ratio: log(R/G)).
The MA plots show the relationship between A (the "average signal" [0.5 * (log R + log G)], where R is the background subtracted red [mean of F635 - median of B635] and G the background subtracted green [mean of F532 - median of B532]) and M (the log [base 2] differential ratio: log(R/G)). These plots are shown both before and after normalization, with a different color for the loess line of each print-tip.
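As a small worked example of these two definitions, the sketch below computes M and A for one spot from its red and green intensities, using base 2 logs for both. The function name is hypothetical, and the intensities are assumed to be already background corrected (or not) according to the option you chose.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot, or None if a log cannot be taken."""
    if red <= 0 or green <= 0:
        return None  # would be a missing value
    log_red, log_green = math.log2(red), math.log2(green)
    m = log_red - log_green          # log2(R/G)
    a = 0.5 * (log_red + log_green)  # average log intensity
    return m, a
```

For instance, R = 1024 and G = 256 give M = 2 (a four-fold red/green ratio) and A = 9.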
You can also either save the file (only the normalized data, plus the gene names), or send it to preP or to the deprecated preprocessor (Herrero, J., Díaz-Uriarte, R., and Dopazo, J. (2003). Gene Expression Data Preprocessing. Bioinformatics. 19: 655-656).
For most plots, you can either view a set of plots per array, or plots for all the arrays. You can also download all plots.
In general, we suggest you don't use those filters. The first one is rarely a good idea: first, you are using a global background when you have a much more reliable measure of the same quantity which is the local background; second, and much more importantly, you have already subtracted the local background from the foreground, and spots with small foreground values will be close to the background, thus yielding final M values close to 0. You want these values, and not missings, since you do have information that these spots show little differential expression. The second approach is somewhat arbitrary (what threshold of A?), and if values with low A are really not very reliable, they will either tend to clump around 0, or show high variance, so in either case they will not bias the results of your analyses.
This program was developed by Juan M. Vaquerizas and Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool is, essentially, a web interface to the limma and marrayNorm Bioconductor packages, with a few kluges added by us here and there (Bioconductor is a set of R packages). Limma has been developed by Gordon Smyth and marrayNorm by Sandrine Dudoit and Yee Hwa (Jean) Yang. Sandrine Dudoit also kindly answered our questions about normalization. The CGI part is greatly facilitated by the CGIwithR package, by David Firth. We want to thank all these authors for the great tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R Foundation for Statistical Computing.
We also want to thank the people of the CNIO Bioinformatics Unit for beta testing and for comments on the design of the tool. In particular, thanks to J. Santoyo and J. Herrero for help with the graphical design and with catching errors.
Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.
Please, before calling/emailing us if "the program doesn't work", make
sure you have read this documentation. In particular, make sure
your data meet the requirements. Juanma has spent a lot
of time trying to catch many mistakes, and providing
meaningful error messages; please read them and act accordingly
(i.e., fix the reported problems before asking us).
Please, also make sure that javascript is enabled on your web browser.
This tool has been tested in Firefox (1.0.1 and 1.0.6),
Internet Explorer 6.0, Netscape 7.0 and Konqueror 3.1.1.
DNMAD has been running, with only minor changes, since 2003, with between 60 and 200 uses a month. We stopped active development of DNMAD a long time ago. However, because of the apparent popularity of the application, we have moved it from our old servers to the newer ones, so it can continue to be used. Beware, though, that we are using "old" software: backwards compatibility problems with some packages have prevented us from updating to the latest R version and to the latest versions of some packages. What follows is the most current information regarding the version of R and the different packages used.
_
platform x86_64-unknown-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 5.1
year 2007
month 06
day 27
svn rev 42083
language R
version.string R version 2.5.1 (2007-06-27)
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the names of the directories are not trivial, so it is not easy for a third person to access your data.
In any case, you should keep in mind that communications between the client
(your computer) and the server are not encrypted at all, thus it is also
possible for somebody else to look at your data while you are uploading or
downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the CNIO to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.