The evidence of GMO harm in pig study is pretty flimsy

The latest really scary paper regarding GMOs has been circulated widely on Twitter today, primarily by the usual suspects (Bittman, Pollan, and many others). The paper (available here, on the blog of the primary author Judy Carman) is titled “A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet.” The study has already been criticized for various reasons by David Tribe and Mark Lynas. The authors of the study fed pigs for 22.7 weeks with either a “GM” or a “non-GM” diet. The GM diet was a mixture of corn and soybean that had Bt and glyphosate-resistance traits. The non-GM diet apparently had a similar amount of corn and soybean, but used non-GM conventional varieties instead. The authors measured LOTS of things, and mostly found there were no statistical differences between the GM and non-GM diets.

I don’t have time to do a full critique, but there is at least one statistical choice that I found odd, and thought I’d throw it out there for others to discuss. The authors claim to have found 2 differences between groups of pigs fed the different diets, and that has been the basis for the widespread interest in this study, particularly among folks who are anti-biotechnology. From the abstract:

“There were no differences between pigs fed the GM and non-GM diets for feed intake, weight gain, mortality, and routine blood biochemistry measurements. The GM diet was associated with gastric and uterine differences in pigs. GM-fed pigs had uteri that were 25% heavier than non-GM fed pigs (p=0.025). GM-fed pigs had a higher rate of severe stomach inflammation with a rate of 32% of GM-fed pigs compared to 12% of non-GM-fed pigs (p=0.004). The severe stomach inflammation was worse in GM-fed males compared to non-GM fed males by a factor of 4.0 (p=0.041), and GM-fed females compared to non-GM fed females by a factor of 2.2 (p=0.034).”

So the GM diet apparently resulted in increased uterus weight, and increased stomach inflammation compared to the non-GM diet. Table 2 in the manuscript presents organ weights, including, as described in the abstract, the uterus. Since I don’t have the benefit of raw data, I suppose we’ll have to trust the authors on this one, that the uterus weight was greater in GM-fed vs non-GM fed pigs. Group means were 0.12 and 0.10 for pigs in the GM and non-GM groups, respectively. Table 2 doesn’t list the units for any of the numbers, so I don’t know if the weights are in grams, kilograms, ounces, metric tons… As a plant scientist, I really have no concept of what a normal pig uterus should weigh. Or any uterus, for that matter. But I digress. [UPDATE: it was recently pointed out to me that the numbers are a percentage of total body weight. So 0.10% and 0.12% of body weight, I guess. I still don’t really know if that is good, bad, or normal. Go ahead and let me know in the comments if you like.]

The second major finding of this study relates to stomach inflammation. The authors present in Table 3 of the manuscript “gross pathologies” related to various organs. For the stomach, the authors list 4 different categories related to inflammation:

  • Nil inflammation
  • Mild inflammation
  • Moderate inflammation
  • Severe inflammation

The authors compared the number of pigs that fell into each category independently, and found no differences between GM and non-GM groups with respect to Nil, Mild, or Moderate inflammation categories. But the authors found that there were more pigs from the GM-fed group with “Severe inflammation” compared to the non-GM group. And this is the major finding of the study; that “GM-fed pigs had a higher rate of severe stomach inflammation.”

But this seems to me a very strange way to analyze this data. The 4 categories the authors used to classify stomach inflammation are what is known as ordinal categorical data, and are pretty common in the literature. The typical way to analyze ordinal data is to give values to each category, and conduct either a t-test or a Mann-Whitney (also called Wilcoxon) test. [EDIT: many other tests are possible, the Mann-Whitney being among the simplest.] The reason for this is that there is structure to the data; that is, Mild inflammation is worse than Nil inflammation. And Severe is worse than the other three categories. We lose that information by separating them for analysis the way the authors of the pig study did. All 4 categories give information about stomach inflammation, and if we look only at “severe” inflammation, we lose the additional information the other categories provide. A proper analysis would include the structure of these data.

Since the authors present the number of animals in each category, we can analyze the data in a more standard way. I’ve provided the R code for doing so if you’d like to follow along at home. We’re going to use both a t-test and the Wilcoxon test and see if the results are similar to what Carman et al. concluded.

## Coding: Nil = 0, Mild = 1, Moderate = 2, Severe = 3
## enter the non-GM diet data:
nonGM.fed<-c(rep(0,4),rep(1,31),rep(2,29),rep(3,9))

## enter the GM diet data:
GM.fed<-c(rep(0,8),rep(1,23),rep(2,18),rep(3,23))
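
As a quick sanity check (a sketch, not part of the original analysis), the coded vectors can be tabulated to confirm they reproduce the counts entered above. The helper `expand.counts` and the names ending in `.check` are made up here for illustration; the counts themselves are the ones used in the code above.

```r
## Hypothetical helper: expand four category counts into a coded vector
## Coding: Nil = 0, Mild = 1, Moderate = 2, Severe = 3
expand.counts <- function(counts) rep(0:3, times = counts)

inflam.labels <- c("Nil", "Mild", "Moderate", "Severe")
nonGM.check <- expand.counts(c(4, 31, 29, 9))
GM.check    <- expand.counts(c(8, 23, 18, 23))

## Count the pigs in each category; c() turns the table into a named vector
count.levels <- function(x) {
  c(table(factor(x, levels = 0:3, labels = inflam.labels)))
}

counts.tab <- rbind(nonGM = count.levels(nonGM.check),
                    GM    = count.levels(GM.check))
counts.tab
rowSums(counts.tab)  # group sizes
```

If the row sums do not come out as 73 non-GM and 72 GM pigs, something was mistyped when entering the data.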

[TABLE OMITTED in response to a valid criticism in the comments by Steve Kass.]
This table shows the number of pigs in each treatment group, and the mean and median values for stomach inflammation, based on the coding we used (Nil = 0, Mild = 1, Moderate = 2, Severe = 3). The mean inflammation values basically tell us that, on average, pigs on the non-GM diet had mild to moderate stomach inflammation, and the GM-fed pigs were only slightly different (1.59 vs 1.78). But are these values statistically different? Below is the code (and output) using a t-test and a Wilcoxon (Mann-Whitney) test:
[NOTE: I’ve left the code for t-test below, but as pointed out by several commenters, the Wilcoxon test is more appropriate for this data.]

t.test(nonGM.fed,GM.fed)
# Welch Two Sample t-test
# t = -1.248, df = 132.574, p-value = 0.2142

wilcox.test(nonGM.fed,GM.fed)
# Wilcoxon rank sum test with continuity correction
# W = 2325, p-value = 0.2081

Notice the p-values in the t-test and Mann-Whitney test. Much higher than those reported by the authors who only analyzed the severe group. But does it hold up by running the males and females separately, as the authors did in Table 4?

## Males
male.nonGM.fed<-c(rep(0,1),rep(1,16),rep(2,17),rep(3,2))
male.GM.fed<-c(rep(0,4),rep(1,12),rep(2,12),rep(3,8))
#t.test(male.nonGM.fed,male.GM.fed)

wilcox.test(male.nonGM.fed,male.GM.fed)
#	Wilcoxon rank sum test with continuity correction
# W = 600, p-value = 0.5669

## Females
female.nonGM.fed<-c(rep(0,3),rep(1,15),rep(2,12),rep(3,7))
female.GM.fed<-c(rep(0,4),rep(1,11),rep(2,6),rep(3,15))
#t.test(female.nonGM.fed,female.GM.fed)

wilcox.test(female.nonGM.fed,female.GM.fed)
#	Wilcoxon rank sum test with continuity correction
# W = 564, p-value = 0.2408

If I were to have analyzed these data, using the statistical techniques that I was taught were appropriate for the type of data, I would have concluded there was no statistical difference in stomach inflammation between the pigs fed the two different diets. To analyze these data the way the authors did makes it seem like they’re trying to find a difference where none really exists.
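
To see how the rank-based test uses the ordering of the categories, here is a rough sketch that rebuilds the W statistic by brute force: compare every non-GM pig with every GM pig, count the pairs where the non-GM pig has the higher score, and count ties as half. The variable names are made up for this illustration; the data and coding are the same as above.

```r
## Coding: Nil = 0, Mild = 1, Moderate = 2, Severe = 3
x <- c(rep(0,4), rep(1,31), rep(2,29), rep(3,9))    # non-GM diet
y <- c(rep(0,8), rep(1,23), rep(2,18), rep(3,23))   # GM diet

## Compare every non-GM pig with every GM pig
n.greater <- sum(outer(x, y, ">"))    # non-GM pig has the higher (worse) score
n.tied    <- sum(outer(x, y, "=="))   # both pigs fall in the same category
W.by.hand <- n.greater + 0.5 * n.tied
W.by.hand   # 2325, the same W reported by wilcox.test(nonGM.fed, GM.fed)

## Share of pairs in which the GM pig has the worse score (ties count half)
1 - W.by.hand / (length(x) * length(y))
```

Because the statistic depends on which score is larger within each pair, it changes if the categories are reordered, which is exactly the information a category-by-category analysis throws away.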

 

UPDATE: June 13, 2013

I’ve been accused by whoever runs the gmoseralini.org whoops… I mean the gmojudycarman.org website of failing “kindergarten-level statistics.” I think that may be a slight exaggeration. Nonetheless, I will very briefly address their criticism. Bill Price has already addressed this to some extent in the comments below.

Now, reasonable people can certainly disagree on how data should be analyzed. If there were only one correct way to analyze data, there would be far fewer statisticians in the world. But I stand by my view (and Dr. Price seems to agree) that it is inappropriate to collect data by categorizing into 4 ordinal categories, but then ignore that structure in the analysis. I concede that the Mann-Whitney (or Wilcoxon) test is more appropriate for this data compared to the t-test (both of which I presented above), but both tests above show the same result: very little evidence that the diets caused different amounts of stomach inflammation.

In the response at gmojudycarman.org, they state:

“Categorical data are data that fit into categories, such as male / female or pregnant / not pregnant.  [Kniss] has tried to turn this sort of data into data that is continuous, like you get with body weight or height.  This is really bad statistical methodology.  It is like taking pregnant / not pregnant data and trying to twist that data into groups that could be described as: pregnant, half pregnant and fully pregnant.  And you are right, it doesn’t make sense to even try to do something like that.”

Well, that’s an interesting statement… because that is exactly what the Carman et al. authors did, right? They “twisted” inflamed/not inflamed into Nil, Mild, Moderate, and Severe inflammation. Personally, I don’t have a problem with using these categories (although the authors now seem to think it is “bad statistical methodology”??). My problem is with analyzing them separately.

There are different types of categorical data. The data described in the quote above is of a binomial nature (on/off, pregnant/not pregnant, present/absent, alive/dead). The data presented in the Carman paper is more than that; it is Nil, Mild, Moderate, Severe. There are four different categories that have a distinct order. Each category has meaning, and is linked to the others (Moderate is greater than Mild, but less than Severe). But this bit of criticism brings up an interesting question: what if we look at the data as a binary categorization (inflamed/not inflamed)? Let’s do that!

### Inflamed or not inflamed
## enter the non-GM diet data:
nonGM.fed<-c(rep(0,4),rep(1,69))
## enter the GM diet data:
GM.fed<-c(rep(0,8),rep(1,64))

N.obs<-c(length(nonGM.fed),length(GM.fed))
num.inflam<-c(sum(nonGM.fed),sum(GM.fed))
pct.inflam<-round(num.inflam/N.obs*100,0)
data.frame(N.obs,num.inflam,pct.inflam,row.names=c("nonGM","GM"))

 

        Number of pigs   Number with stomach inflammation   Percentage of animals with stomach inflammation
nonGM   73               69                                  95
GM      72               64                                  89

Looking at the data this way, the GM-fed pigs had LESS inflammation! A whopping 95% of the animals fed non-GM feed had stomach inflammation, compared to 89% of the animals fed GM diets. That’s a lot of stomach inflammation. Is this difference statistically significant?

wilcox.test(nonGM.fed,GM.fed)

	Wilcoxon rank sum test with continuity correction

data:  nonGM.fed and GM.fed 
W = 2776, p-value = 0.2216
alternative hypothesis: true location shift is not equal to 0 

The p-value is 0.22, so not much evidence that there is a difference. And I don’t care what type of fancy statistical test you use, you simply can’t make the case that the GM-fed pigs were worse off if they had LESS stomach inflammation compared to the non-GM fed pigs.
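
For anyone who would rather compare the two proportions directly, here is a minimal sketch using `prop.test()` without the continuity correction. This is an illustration added alongside the analysis above, not something taken from the paper, and the variable names are made up; for a 2×2 layout like this one it is equivalent to an uncorrected Pearson chi-square test.

```r
## Pigs with any stomach inflammation (Mild, Moderate, or Severe)
inflamed <- c(nonGM = 69, GM = 64)
n.pigs   <- c(nonGM = 73, GM = 72)

## Percentage inflamed in each group, as in the table above
round(inflamed / n.pigs * 100, 0)

## Two-sample test of equal proportions, no continuity correction
binary.test <- prop.test(inflamed, n.pigs, correct = FALSE)
binary.test$p.value
```

The p-value lands in the same neighborhood as the Wilcoxon result, so the conclusion does not hinge on which of these two tests is used for the binary version of the data.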
 

Comments

  1. I’m confused because in the methods they mention doing the Mann-Whitney, but then the results you got were completely different. Your methods are transparent where theirs are not; do you have an explanation for the discrepancy? I note that in Table 3, on the p notation, there is an “a”, which signifies from the footnote that they used uncorrected Chi squared rather than the Mann-Whitney. Perhaps that’s the test that got them the significant result whereas the others failed?

    1. Hi Mark, I honestly haven’t had time to figure out exactly how they analyzed the data to get the results as presented. The discrepancy, though, almost certainly arises due to the authors separating the categories and running 4 separate analyses. Which really seems like a strange choice to me. It is far more likely to find a difference where none exists by analyzing the data 4 different times. Maybe if I have time over the weekend, I will try to recreate their analysis to figure out how they arrived at the p-values they presented. Until then, though, I’ve got some weed killin to do. ;-) Thanks for stopping by to comment. -AK
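
      A back-of-the-envelope sketch of why running four separate tests inflates the risk of a false positive. It assumes a 0.05 threshold per test and, for the first calculation, that the tests are independent. They are not truly independent here, since each pig falls in exactly one category, so treat that number as a rough guide only. All names are made up for illustration.

      ```r
      alpha   <- 0.05  # assumed per-test significance threshold
      n.tests <- 4     # Nil, Mild, Moderate, Severe analyzed separately

      ## Chance of at least one false positive if the tests were independent
      fwer.indep <- 1 - (1 - alpha)^n.tests
      fwer.indep

      ## Bonferroni bound, which holds without the independence assumption
      fwer.bound <- min(1, n.tests * alpha)
      fwer.bound

      ## Per-test threshold that keeps the overall rate at or below alpha
      alpha / n.tests
      ```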

      1. No problem, I too am roundly confused about how they performed their tests, and also how they chose them at any given moment. Look at Table 5 for instance. They seem to flip between the t-test and MW for any given value.

        This seems to be a recurring theme we also saw in Seralini’s paper. Usually the statistical choices don’t even require a second thought, because it should be obvious based on the type of data, but the anti-GMO authors seem to always complicate their analyses, and in Seralini’s case, I hadn’t even ever heard of some of the tests they used.

        1. I’ve been able to get p-values that are pretty close to what Carman et al. present in their table using the 2×2 Chisq and the Mann-Whitney test. So it seems that, even though they criticized me for not knowing anything about statistics, they used a test that gives nearly identical results. They just applied it to their data in an incorrect way.

          Category                Carman p-value   Mann-Whitney p-value   Chi-square p-value
          Nil inflammation        0.218            0.2216                 0.2185
          Mild inflammation       0.190            0.1924                 0.1901
          Moderate inflammation   0.058            0.0594                 0.0582
          Severe inflammation     0.004            0.0046                 0.0044

          Code to get Mann-Whitney results:

          > ## Nil vs Others
          > GM<-c(rep(0,8),rep(1,64))
          > nonGM<-c(rep(0,4),rep(1,69))
          > wilcox.test(GM,nonGM)
          	Wilcoxon rank sum test with continuity correction
          W = 2480, p-value = 0.2216
          
          > ## Mild vs Others
          > GM<-c(rep(0,49),rep(1,23))
          > nonGM<-c(rep(0,42),rep(1,31))
          > wilcox.test(GM,nonGM)
          	Wilcoxon rank sum test with continuity correction
          W = 2351.5, p-value = 0.1924
          
          > ## Moderate vs Others
          > GM<-c(rep(0,54),rep(1,18))
          > nonGM<-c(rep(0,44),rep(1,29))
          > wilcox.test(GM,nonGM)
          	Wilcoxon rank sum test with continuity correction
          W = 2241, p-value = 0.05939
          
          > ## Severe vs Others
          > GM<-c(rep(0,49),rep(1,23))
          > nonGM<-c(rep(0,64),rep(1,9))
          > wilcox.test(GM,nonGM)
          	Wilcoxon rank sum test with continuity correction
          W = 3143.5, p-value = 0.00458
          

          Code to get Chi-square results:

          > ## Nil Chisq test
          > ## rows are categories, columns are diets (matrix() fills by column)
          > nil<-matrix(c(8,64,4,69),ncol=2)
          > rownames(nil)<-c("Nil","Inflamed")
          > colnames(nil)<-c("GM","nonGM")
          > nil
                   GM nonGM
          Nil       8     4
          Inflamed 64    69
          > chisq.test(nil,correct=F)
          	Pearson's Chi-squared test
          X-squared = 1.5145, df = 1, p-value = 0.2185
          
          > ## Mild Chisq test
          > mild<-matrix(c(49,23,42,31),ncol=2)
          > rownames(mild)<-c("Other Categories","Mild")
          > colnames(mild)<-c("GM","nonGM")
          > mild
                           GM nonGM
          Other Categories 49    42
          Mild             23    31
          > chisq.test(mild,correct=F)
          	Pearson's Chi-squared test
          X-squared = 1.7168, df = 1, p-value = 0.1901
          
          > ## Moderate Chisq test
          > mod<-matrix(c(54,18,44,29),ncol=2)
          > rownames(mod)<-c("Other Categories","Moderate")
          > colnames(mod)<-c("GM","nonGM")
          > mod
                           GM nonGM
          Other Categories 54    44
          Moderate         18    29
          > chisq.test(mod,correct=F)
          	Pearson's Chi-squared test
          X-squared = 3.5882, df = 1, p-value = 0.05819
          
          > ## Severe Chisq test
          > severe<-matrix(c(49,23,64,9),ncol=2)
          > rownames(severe)<-c("Other Categories","Severe")
          > colnames(severe)<-c("GM","nonGM")
          > severe
                           GM nonGM
          Other Categories 49    64
          Severe           23     9
          > chisq.test(severe,correct=F)
          	Pearson's Chi-squared test
          X-squared = 8.1096, df = 1, p-value = 0.004403
          
          1. Thanks, Andrew, for doing this analysis. Right away, when I saw the Chi-squared tests being used on individual categories of a continuous series of related categories, I thought something was odd. They ran their statistics off of whether the pigs were in each category or not, which, as you point out, completely ignores the data in the other categories. One of the student groups I taught in my Spring semester class made the same kind of mistake by turning quantitative data into discrete categories, and then ran individual statistics and trends off of that, rather than just doing it off of the quantitative data.

            Trying to shoe-horn individual categories that aren’t binary data into a statistical test designed for binary data is the wrong approach.

            What you demonstrate is that when you apply the appropriate statistics for the type of data they collected, you do not get a statistically significant difference between the GE-fed and non-GE fed pigs.

            The other alternative, if they are wedded to the Chi-squared test, is to turn their 4 categories into binary data, by grouping the nil and mild groups into one, and the moderate and severe groups into the other one. As others have pointed out, this also gives a non-significant difference.

            I think the ones who are getting the statistics lesson are the ones who wrote that paper, and the ones who set up the “gmojudycarman.org” fansite to attack level-headed criticism. Which turns out to be the same people, I guess.
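
            The collapsed comparison described above (nil and mild in one group, moderate and severe in the other) can be sketched as follows. This is only an illustration using the category counts from the post; the object names are made up, and the uncorrected chi-square is used to match the tests shown earlier in this thread.

            ```r
            ## Counts per category, in the order Nil, Mild, Moderate, Severe
            nonGM <- c(4, 31, 29, 9)
            GM    <- c(8, 23, 18, 23)

            ## Collapse to two groups: Nil + Mild versus Moderate + Severe
            collapsed <- rbind(
              "Nil or Mild"        = c(GM = sum(GM[1:2]), nonGM = sum(nonGM[1:2])),
              "Moderate or Severe" = c(GM = sum(GM[3:4]), nonGM = sum(nonGM[3:4]))
            )
            collapsed

            ## Uncorrected Pearson chi-square test on the collapsed 2x2 table
            collapsed.test <- chisq.test(collapsed, correct = FALSE)
            collapsed.test
            ```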

          2. Sorry if my question sounds stupid, but what is the point in doing the analysis this way?
            I do understand why one would want to divide the data in a binary way, for example inflammation vs. no inflammation, but not why one would want to divide the data into mild inflammation in one group and the other three (one of which is “severe”) in the other.
            Thanks in advance.

    2. I would like to make an observation that I haven’t seen mentioned yet about this study: it involves degrees of freedom. I realize that most folks critiquing this study are plant, not animal, folks. I’m an animal scientist. I’m not a statistician, but took a fair number of stats courses in grad school. One of the first things we learned was the importance of determining your experimental unit. The question usually asked is: what was the treatment applied to? For animal feeding trials such as this, the answer is the pen, not the pig. There is no way to determine what each pig ate, only what was given to the pen. There were 2 pens per treatment in this study, so by my calculation, that gives you 1 degree of freedom, at best. To put it in “plant terms” . . . Using pigs instead of pens would be like analyzing a study where you were looking at the effect of fertilizer on grass in a large field trial by counting the individual blades of grass and using that as your n rather than the number of plots to which the treatment was applied. Please correct me if I am misremembering what I was taught those many years ago.

      1. Great point, Diane. For the “growing and finishing phase” the pigs were housed 42 animals per pen; this means they only had 2 pens per treatment. Now that you mention it, I recall our animal science researchers being really excited when we got a GrowSafe system installed at our research center, because it allowed them to use single animals as experimental units. And that was a very big deal, because otherwise treatments can only be applied to whole pens. Which for the Carman study, would be N=2 for each treatment. I think you’ve probably identified a fatal flaw in the study design. I’d be interested in hearing from some more animal researchers, but it seems like this issue would probably preclude this research from being published in a reputable journal more familiar with animal feeding trials. Which may explain the odd choice of journal for this “groundbreaking” research.

        1. Andrew, Diane is correct. The experimental unit should have been the pen. From our perspective, when we run a field trial, we might make several counts per plot, but the plot is the experimental unit, so we average the counts to get the value for the plot and run the statistical tests from there. We don’t run statistics using all the individual counts in the plot.

          In this case, the animals in each pen were treated as a whole rather than separately. So they shouldn’t become individual items in the statistical tests.

        2. I’m not particularly concerned that this is a problem here. To explain why, perhaps it’s easiest to describe one of my experiments where this sort of thing IS a problem, and why.

          My group works broadly on neurocognitive and neurodegenerative disorders (think aging and Alzheimer’s disease), using mice as a model organism. A recent experiment involved treating mice with a drug (or vehicle), via the food, and looking at a variety of memory-related behaviors. Mice are rather like pigs, in that they have relatively large litters of offspring. What one finds, then, is that results will vary not only due to experimental treatment, but also between litters within an experimental treatment. The reasons for this are relatively straight-forward. The things we measure are affected by a variety of developmental influences, such as quality and amount of maternal care, as well as various epigenetic influences. Some of that can be controlled for, others can’t, so we track litter of origin since it can confound results. These sorts of confounders exist regardless of whether the “unit of treatment” is an entire litter or cage (i.e., we can treat mice individually and single-house them, but mice from the same litter will still co-vary).

          The lesson, then, is not that our effective N should be the “experimental unit”, but that the effective N should accurately reflect the structure of the data and uncontrolled confounders. For stomach inflammation (which others have suggested wasn’t even measured correctly!), I guess it’s possible that you could get a batch-effect for pen, but I would naively not expect anything major like we commonly see with cognitive phenotypes. One can actually check for these sorts of things by clustering data (this is SOP for me) to see if there might be a confounder that’s effectively decreasing your N.

          Honestly, I would only concern myself with this sort of thing if the rest of their study weren’t already crap.

          1. Devon,

            I can’t comment on your studies, but I would argue that mice are not rather like pigs.

            Just a quick note on pigs versus rodents. The group penned feeding situation is very different for the 2 species. Group housed rodents can be ad lib fed, but I don’t know if this would be the case if mice were housed 42 to a cage. However, pigs housed in groups (even as small as 2 per pen, but especially as big as 42) are effectively being limit fed (on average, probably about 80% for a large group, some pigs in the group get more, some less), with a definite pecking order. All pigs would not be receiving the same amount of feed and they would also be under different levels of stress. The pen effect can be quite large, especially with such small numbers of pens and when the animals were not housed in a controlled environment. The pigs in this study experienced natural temperatures and the position of the pen in the barn can affect what the animals experienced with regards to wind, sun, rain, and snow, etc. The feeding trial lasted over 5 months and Sioux Center, Iowa is not known for its mild climate. :-) When doing feeding trials with pigs, researchers sometimes avoid putting experimental pigs in the end pens to reduce variation, especially if they are against an outside wall.

            I agree that there is so much wrong with the study as to make this almost moot (e.g., the pigs were apparently not even healthy and yes incorrect methods were used for determining inflammation). That said, if all else were to have been done properly, having one degree of freedom would on its own have been a fatal flaw in the study. Poor statistical design should always be flagged, so that others don’t make the same mistakes. Even if this had been a “worthy study”, the 168 pigs were subjected to an experiment that had no statistical power.

            FYI – not relevant for this, but genetically speaking mice (or rats) are MUCH more similar to each other than pigs are, even within a litter. Sometimes pigs within the same litter may even have different sires.

  2. The paper is characterized by many weaknesses in materials and methods, such as no clear description/analysis of feeds and mixed feeds, unacceptable animal losses (13 and 14%!!!), no feeding study adequate to the rules of animal nutritionists (only a field study with large animal groups, 48 per pen), some statistical weaknesses, etc.
    But nevertheless, such studies seem to be necessary under clearly defined conditions.

  3. Andrew, given the post by Judy today, I replied there, but am not confident of the response being posted. Hence, I am posting my comment here as well. Hope you do not mind.

    Bill Price

    — Start Post —

    Judy,
    I’m afraid you have a lack of understanding of statistical procedures in this case. The procedure Dr. Kniss uses is a crude measure but, in fact, it is legit. It is known as a non-parametric rank test, which, if you really did have experience in statistics, would be elementary knowledge for you. You would also understand that such tests, being non-parametric, are designed to mitigate the normality assumptions of the statistical tests. I also note ALL of your tests make the same normality assumptions, although the paper does not state if this was tested for the responses analyzed nor what corrective measures were taken, if any. More to the point, the severity data, as presented, is not only categorical, but multinomial ordinal data. Separating categories out for separate independent tests as you have done in the paper is not a valid analysis option. The categories are functionally and stochastically related. Separate analysis of categories will lead to incorrect conclusions. A proper multinomial analysis would account for these properties. Such analyses would be a simple two-way contingency chi-square, or a linear categorical model, or a logistic regression model, or possibly a generalized mixed model. I have tried the first three of these options, all of which make use of the correct inherent probability distribution (multinomial). All tests were highly non-significant.

    Far more troublesome in this paper are the extremely high rates of sick animals and mortality. More than half of all animals displayed pneumonia symptoms. 13-15% mortality rates were also observed. Contrary to the claims made in the paper, these are not normal rates to be found in swine production. It is quite clear that the animals in this study (both GM and non GM) were in very poor health. This severely confounds any and all data collected and makes subsequent interpretations of potential treatment effects meaningless. The fact that you ignore this, even with your claims of statistical knowledge, is troubling. This does not even begin to address the ethical concerns regarding the welfare of the animals during the study. It would be standard practice in animal science to terminate a study with such drastic health issues in order to properly care for the animals. That you did not is, quite frankly, disgusting.

    Bill Price
    Statistician
    University of Idaho
    — End Post –

      1. Wow. Carman et al. report:

        “Mortalities were 13% and 14% for the non-GM-fed and GM-fed groups respectively, which are within expected rates for US commercial piggeries.”

        The sources you’ve posted indicate “expected” mortality would actually be between 2.9 and 8.0%. That’s a pretty alarming difference. Was there no Institutional Review Board involved with this project?

    1. I noticed I had overlooked a statement in their paper and was incorrect in a portion of my previous comment to them. I subsequently added another comment to correct this. Neither comment has been approved as yet. The comment I added is below:

      — start comment —
      Amendment to my previous comment: I see now a section of the paper in which you do state that you tested for normality and used modified procedures if these tests were rejected. I apologize for that oversight and retract that portion of my comment.
      — end comment –

    2. I’m curious about the simple two-way contingency table for ordinal data. My old college statistics text said you could do that, but conceded that this wasn’t ideal. It seems to me that it doesn’t take into account the ordinality of the categories. Namely, if you swap the data from two rows in the table, the chi square value doesn’t change. In that sense, it seems to me that the chi squared test leaves much to be desired (even if you’re not combining cells).

      Then, I’m not a statistician, so I’m interested in your take on this.

      1. Adam,
        I think you have the gist of it. I think the contingency table approach is best viewed as a test for homogeneous distributions, e.g. is the distribution of counts across the severity categories equivalent between GM and non-GM. As you point out, this does not account for the ordinal nature of the categories. You can pound in a nail with a screwdriver, but it is not always the best tool for that job :)

        Be careful in talking about “the chi-square test”, however. Many procedures relevant to this data produce chi-square tests, but they differ in how they are obtained and what they test. It would be better to say the contingency table chi-square test leaves something to be desired in this case.

        Note also that other procedures, such as the linear categorical model or logistic regression model can account for the ordinal nature of the categories. They do this using a special transformation of the category probabilities called a cumulative logit. That is too involved to describe here, but suffice it to say, it assumes an ordinal structure and its interpretation does as well.

      2. Hi Adam, you are correct that you could run the contingency table approach (Chi-square) with all four categories, but as you point out it doesn’t include the ordinal structure of the data. Therefore it really isn’t any better than separating the categories and analyzing them separately. It would be similar to analyzing the data with {brown | green | yellow | red} as categories, ignoring the fact that there is a definite order to the categories of Nil, Mild, Moderate, and Severe.

        The Wilcoxon test keeps the order of the data intact for analysis, as I describe in a response to Nat below. Here is some code demonstrating that if you flip the order of the coding with the Wilcoxon test, the test statistic and p-value change dramatically. This would not occur if using a contingency table approach that ignores the ordinal structure.

        ## Coding: Nil = 0, Mild = 1, Moderate = 2, Severe = 3
        nonGM.c1<-c(rep(0,4),rep(1,31),rep(2,29),rep(3,9))
        GM.c1<-c(rep(0,8),rep(1,23),rep(2,18),rep(3,23))
        wilcox.test(nonGM.c1,GM.c1)
        	Wilcoxon rank sum test with continuity correction
        
        data:  nonGM.c1 and GM.c1 
        W = 2325, p-value = 0.2081
        

        Now let's change the order (switch mild and severe)

        ## Coding: Nil = 0, Severe = 1, Moderate = 2, Mild = 3
        nonGM.c2<-c(rep(0,4),rep(3,31),rep(2,29),rep(1,9))
        GM.c2<-c(rep(0,8),rep(3,23),rep(2,18),rep(1,23))
        wilcox.test(nonGM.c2,GM.c2)
        	Wilcoxon rank sum test with continuity correction
        
        data:  nonGM.c2 and GM.c2 
        W = 3227, p-value = 0.01275
        
  4. Just a quick note: it would have been possible to get values for the assessments in a variety of ways. They could have used molecular markers of inflammation and their levels of expression, for example. They could have used cell counts of tissue samples or other histopathology. They could have even done evaluation of the image colors with simple and free biological image processing software. That would have been a bit challenging with the folds in this case and may not have been the best choice, but at least would have been some values to work with, besides their method.

    1. I’m actually not opposed to the method they used for assessing inflammation. Sometimes categories like the ones they chose are the best balance of practicality and useful information. But since the categories are related, you can’t just simply analyze the categories separately.

      1. I understand your point on the statistical methods. But this is what I meant–that redness is not a valid score–they should have used other assessments:

        Dr. Robert Friendship is a professor in the Department of Population Medicine at the Ontario Veterinary College, University of Guelph, and a swine health management specialist. He has reviewed the research report and concluded that it was incorrect for the researchers to conclude that one group had more stomach inflammation than the other group because the researchers did not examine stomach inflammation.

        “The researchers did a visual scoring of the colour of the lining of the stomach of pigs at the abattoir and misinterpreted redness to indicate evidence of inflammation. It does not,” Friendship said. “There is no relationship between the colour of the stomach in the dead, bled-out pig at a slaughter plant and inflammation.”

        http://www.letstalkfarmanimals.ca/2013/06/13/canadian-experts-convinced-gmo-swine-feed-study-is-deeply-flawed/

        I suspected that color wasn’t valid, but I have not studied pig stomach tissue myself so I needed to see more from an expert on that.

        1. That is pretty damning… Perhaps this is why the study wasn’t published in a journal with a focus on vet science/animal science? Because their only “desired” finding was bunk?

  5. Comment on “A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet” Carman et alii : scientific scam

    A) First, lies about the conclusions of a research paper, Brasil et alii 2009 : “The Impact of Dietary Organic and Transgenic Soy on the Reproductive System of Female Adult Rat” THE ANATOMICAL RECORD 292:587–594 (2009).

    Carman has found an increase of uterine weights in her pigs (“However by weighing organs we found a significant 25% increase in uterine weights in the GM-fed pigs”) and to make us believe that it is something big, she lies about Brasil 2009.
    Carman writes : « The link between an increase in uterine weights and GM feeding is supported by other authors (Brasil et al., 2009) »…
    But Brasil et alii had written : « There was NO significant difference in the ovary or uterus absolute and relative weights (mg of tissue/g body weight) for the GMSG (=GM Soy fed group) and OSG (=organic soy fed group) compared with the CG (=control group). »

    The only differences seen in Brasil 2009 are an increase of glandular epithelium thickness in GM fed rat group and differences in corpora lutea.
    Carman lied too about the percentage :
    She writes : “[Brasil et alii] who found that GM soy-fed rats had a statistically significant 59% increase in the density of the uterine endometrial glandular epithelium compared to rats fed an equivalent organic soy diet”.

    But Brasil states : “The volume density of endometrial glandular epithelium was greater in the GMSG group (29.5 ± 7.17, P < 0.001) when compared with the CG (18.5 ± 7.4) and OSG (20.3 ± 10.6) groups."
    So it's not 59% but 45% if compared with the OSG group (29.5 / 20.3 ≈ 1.45); the 59% figure only appears when comparing with the CG (29.5 / 18.5 ≈ 1.59).

    Conclusions from Brasil are : "both organic and transgenic soy reduced the body weight and estradiol serum levels. Both soy treatments also improved the lipid profile by reducing cholesterol and triglycerides serum levels. Probably the reduction in estradiol serum levels reflects the capacity of isoflavones to bind the estrogen receptor and blocking the actions of endogenous estrogens. The alterations presented here were more marked in the transgenic group, which showed the lowest body weight and cholesterol and triglycerides levels."

    B) Show must go on…

    1)Infectious diseases were not tested !!!
    a) pneumonia : Carman writes some of her pigs got pneumonia but which pathogen ??? There are several ones that can induce pig pneumonia (viruses and Bacteria). We think about Mycoplasma hyopneumoniae that affects lungs and induce an increase of pro-inflammatory cytokines. But you can also test PRRS virus…

    Moreover when Carman writes in discussion « We suggest that the following may be better measures: the red blood cell count and haematocrit to measure anaemia and iron deficiency from possible blood loss, C-reactive protein and white blood cell count to measure inflammation », it would be totally useless here to test CRP and other systemic inflammation biomarkers because their levels will without any doubt be modified by pneumonia !

    b) stomach pathogens : to eliminate false positive results, they should also have tested for the presence of stomach Helicobacter sp. and Candida albicans, common in pigs. They didn't.
    c) uterine pathogens : fluid presence may be due to an infectious disease too.

    2)Where are histological sections ???
    They show us macroscopic stomachs without any histological data !
    How could you tell if there's any inflammation if you don't detect polynuclear neutrophil invasion and local cytokines (by performing immunohistochemistry : IL 8, TNF alpha,…) ?

    They should have performed :
    1) Optic microscopy : stomach sections with HE and WS silver staining (to detect Helicobacter) to detect inflammation local biomarkers
    uterine sections
    2)TEM stomach sections to show epithelial cell junctions, epithelial cell morphology, and link between basal membrane and epithelial cells,

    3)classify all the inflammation markers found in stomach sections (PN number, macrophage number, cytokine rates, epithelial cell morphology …) and to each parameter, give a severity score, after that, add scores for each pig… and you get a total « stomach inflammation score » for each pig.

    Conclusion : They studied a lot of parameters that were useless but totally « forgot » the ones that were absolutely necessary !
    As histological and infectious data are missing, we can't confirm anything about inflammation, and yes, it's a scientific scam !

    1. Some very interesting observations. Many beyond my ability to judge (since I’m a plant guy, and don’t deal with stomachs). Thanks for commenting. -AK

  6. Andrew, I agree with Bill and yourself about the ordinal data on stomach inflammation. Such data should never be tested independently within categories. It seems to me, the only reason for doing such a test (other than ignorance) is to increase the likelihood of Type I errors.

    I think your analysis of inflamed/not inflamed is a much better approach to the data set. It demonstrates no difference in inflammation. With no difference, there is no reason to further mine the data.
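
    For readers following along, the inflamed/not inflamed comparison can be sketched in R as below. This assumes "inflamed" means any category other than Nil, and it uses the category counts quoted in the other comments here (non-GM: 4, 31, 29, 9; GM: 8, 23, 18, 23); the object names are just for illustration.

    ```r
    ## Counts per category: Nil, Mild, Moderate, Severe
    nonGM <- c(4, 31, 29, 9)
    GM    <- c(8, 23, 18, 23)

    ## Collapse to two categories: Nil vs anything else
    ## (assumed definition of "inflamed")
    tab <- rbind(nonGM = c(NotInflamed = nonGM[1], Inflamed = sum(nonGM[-1])),
                 GM    = c(NotInflamed = GM[1],    Inflamed = sum(GM[-1])))
    tab

    ## Two common tests for a 2 x 2 table
    fisher_res <- fisher.test(tab)
    chisq_res  <- chisq.test(tab)   # applies Yates' continuity correction by default
    fisher_res$p.value
    chisq_res$p.value
    ```

    Either test only asks whether the proportion of pigs with any inflammation differs between diets, so it says nothing about how severity is distributed among the inflamed pigs.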

    1. I’m not at all convinced that you can compare across categories like this. When doing it this way, you are implicitly assuming that ‘severe’ inflammation (=3) is EXACTLY 50% worse than moderate (=2), which is EXACTLY twice as bad as mild (1). I certainly don’t have the veterinary expertise to state that that is appropriate.
      How would these conclusions change if those valuations of -yes ordered, but NOT quantitative- categories were changed to [0 1 3 10]? … or any other distribution. Can anyone here make the medical argument for why [0 1 2 3] is more appropriate?

      1. Hi Nat. If using a t-test, your concern would be valid. But as others have pointed out, a t-test for ordinal data isn’t a good choice. In a comment below, Steve Kass scolded me for calculating means of the ordinal data, and he is very much on point. Calculating the means in my post was primarily to show that when you look at the central tendency of nil to severe inflammation, there wasn’t much difference. But Steve is correct this probably should have been omitted.

        But the nice thing about the Wilcoxon test is that it is a rank-based test (a rank-sum test), and therefore the numbers we use to code the order are not an issue. All that matters is the order, not the numbers used to create the order. For example, in the code below, I’ll run the same analysis using my original coding (0,1,2,3) and again using something similar to what you’ve proposed (0,1,5,10). You’ll notice the W statistic and p-value are identical. This is because when analyzing ordinal data, only the order matters, not the actual value. Which is why Steve is on target when he reprimands me for calculating the mean for these data.

        ## Coding: Nil = 0, Mild = 1, Moderate = 2, Severe = 3
        nonGM.c1<-c(rep(0,4),rep(1,31),rep(2,29),rep(3,9))
        GM.c1<-c(rep(0,8),rep(1,23),rep(2,18),rep(3,23))
        wilcox.test(nonGM.c1,GM.c1)
        
        	Wilcoxon rank sum test with continuity correction
        
        data:  nonGM.c1 and GM.c1 
        W = 2325, p-value = 0.2081
        alternative hypothesis: true location shift is not equal to 0 
        
        
        ## Coding: Nil = 0, Mild = 1, Moderate = 5, Severe = 10
        nonGM.c2<-c(rep(0,4),rep(1,31),rep(5,29),rep(10,9))
        GM.c2<-c(rep(0,8),rep(1,23),rep(5,18),rep(10,23))
        wilcox.test(nonGM.c2,GM.c2)
        
        	Wilcoxon rank sum test with continuity correction
        
        data:  nonGM.c2 and GM.c2 
        W = 2325, p-value = 0.2081
        alternative hypothesis: true location shift is not equal to 0 
        
  7. Came across this after reading the paper as well. Nice job on the number crunching. One thing I noticed was that GM-fed pigs had “at least 50% less abnormalities of the Heart and Liver” as compared to the other group.

    Obviously, this is not statistically significant with such a small group of pigs, and it being a lone study, but it is a nice little fun fact to throw back when presented with illogical counter-arguments.

    1. Yes, it seems like when taken as a whole, this paper presents far more evidence in support of GM feed safety than against it.

  8. Hi Andrew,

    Unfortunately, you’ve misunderstood a fundamental issue about ordinal data: Ordinal data is ordered, but non-numeric. It’s coded as numbers to reflect the ordering, but the codes are solely for that purpose – to indicate the ordering.

    In no case are the means (mathematical averages) of numeric codes useful for statistical analysis. This is why you should never use a t-test on coded ordinal data, because a t-test compares means. (Means can be used if there are only two categories, like male/female or inflamed/not inflamed, and there are ordinal-looking scales that have been shown to be effectively scale data, like Likert scales. Neither of these exceptions is the case here.)

    The fact that you computed the mean values of the coded values of inflammation levels shows that you don’t understand this fundamental aspect of ordinal data. The means of ordinal codings are meaningless. Don’t compute them. Don’t test them. Don’t discuss them.

    It’s good that you and others are responding to the recently-published paper, but it would be good to see better statistics in the responses.

    1. Hi Steve, Thanks for commenting. Your comment is on point, even if a little harsh. Luckily, as a scientist who submits papers to the peer review process regularly, I’m quite used to being harshly told how wrong I am. :-) I’ve responded to this issue in my response to Nat above. I will also update the post to draw attention to your comment, because I do think this is an important point, and I shouldn’t have calculated means in the first place.

      I think the argument could be made that the means may be OK for this data set, assuming that the categories are more-or-less equally spaced. But since the authors didn’t really provide any quantitative information to describe their categories, you are correct that computing means and a t-test are not appropriate here.

      1. Thanks, and glad to know you’re used to brusque criticism. Of course, you’re right that the mean is a valid measure of central tendency if the coding reflects the spacing. My sharp answer probably comes out of my own frustration at the end of each semester I teach statistics, as I’m seemingly unable to teach some of these concepts (like “The mean is not a valid measure of central tendency for ordinal data.”) in a way that sticks.

    2. Ok, I’ve tried to avoid this, but to no avail :) There are many ways to approach data with statistics, each with their own set of assumptions. Sometimes the assumptions are tolerable, and other times they may not be. Andrew’s t-test analysis of “scores” does make some big assumptions, namely that the proportional change in scores corresponds to proportional changes in severity. As Steve points out and Andrew acknowledges, that assumption here is a bit much.

      I’ve had some requests regarding another analysis of this data, so I’ll put it here. Data of this type is often handled using logistic regression, and the ordinal aspect using a cumulative logit transformation. Below is such an analysis (in SAS. Sorry, I am not up to speed in R). I have also incorporated gender into the treatment structure as a 2 x 2 factorial. Is this the correct analysis? Maybe, if you accept the assumptions :)

      data pig;
      	input severity trt$ gender $ count;
      	cards;
      	0	NoGMO	M	1
      	1	NoGMO	M	16
      	2	NoGMO	M	17
      	3	NoGMO	M	2
      	0	GMO	M	4
      	1	GMO	M	12
      	2	GMO	M	12
      	3	GMO	M	8
      	0	NoGMO	F	3
      	1	NoGMO	F	15
      	2	NoGMO	F	12
      	3	NoGMO	F	7
      	0	GMO	F	4
      	1	GMO	F	11
      	2	GMO	F	6
      	3	GMO	F	15
      ;
      
      proc logistic;
      	class trt gender;
       	weight count;
      	model severity =gender trt gender*trt;
      	oddsratio trt ;
      run;
      

      — Partial Results —

                     Type 3 Analysis of Effects
      
                                             Wald
                   Effect          DF     Chi-Square    Pr > ChiSq
      
                   gender           1        0.9547        0.3285
                   trt              1        1.7931        0.1805
                   trt*gender       1        0.4476        0.5035
      
             Odds Ratio Estimates and Wald Confidence Intervals
      
      Label                                Estimate         95% CL
      trt GMO vs NoGMO at gender=F         0.543       0.234     1.259
      trt GMO vs NoGMO at gender=M         0.815       0.350     1.896
      

      — End Partial Results –

      1. Funny, I was scrolling through the comments to see if anyone had just used an ordinal logistic regression, since that’s really the correct way to do things. The R code is below (I use the “ordinal” package, since it gives p-values, unlike polr in MASS):

        library(ordinal)
        d <- data.frame(
            Severity = as.factor(c(rep(c(0,1,2,3),4))),
            Treatment = c(rep(c(rep("NoGMO",4),rep("GMO",4)),2)),
            Gender = c(rep("M", 8), rep("F", 8)),
            Count = c(1,16,17,2,4,12,12,8,3,15,12,7,4,11,6,15)
        )
        summary(clm(Severity ~ Gender*Treatment, weights=Count, data=d, link="logit"))
        

        The results are about the same as those from SAS. I'm unsure that a Gender:Treatment interaction makes biological sense, so one can simply use a "+" instead of the "*", but the results don't really change. Likewise, you can simply ignore gender and the results are still similarly non-significant (unsurprisingly).

        1. Thanks for adding R code Devon. Small differences in numeric results could arise from the estimation methods and specific numeric algorithms used. I assume “logit” in this case defaults to the cumulative logit for multinomial data (this is what Proc Logistic does). Interesting note: a generalized logit, which is not really a good choice for ordinal data, gives a significant difference for the TRT effect in this data.

          IMO, it is necessary to include gender in the model or analyze them separately. Gender and its interaction with treatment is a potential source of variability and should be accounted for in the model. There is no reason to expect that genders would react similarly to the treatments. Some people will suggest that, because gender is non-significant, we can ignore it when modeling. I do not subscribe to that line of thought, however. Non-significance does not imply no effect, AKA, never accept the null hypothesis. It was a factor in the design and should be accounted for in the model.

          Glad to see so many people familiar with R out there. Although it is not a package I know well, it is a powerful tool.

          1. Hi Bill,

            Yes, logit is as you described (I think it’s the default, but I didn’t want to chance that the default was “probit”).

            I don’t have any big disagreements regarding accounting for a gender:treatment interaction. I only mentioned that above since, as I’m not familiar with pig gastric inflammation (I’m a neuroscientist by training…), I don’t know if there’s some compelling prior information for including or excluding it.

        2. Thanks Devon! I had played a little with polr (MASS) but did not know of the ordinal package. Although I agree with Bill that we should probably account for gender in the analysis, I’ve ignored it in the example below just so it is similar to the previous analyses I’ve been conducting.

          ### Ordinal Logistic Regression
          inflam<-c(rep("Nil",4), rep("Mild",31), rep("Moderate",29), rep("Severe",9),
                    rep("Nil",8), rep("Mild",23), rep("Moderate",18), rep("Severe",23))
          inflam<-factor(inflam,levels=c("Nil","Mild","Moderate","Severe"),ordered=T)
          trt<-c(rep("nonGM",73),rep("GM",72))
          pigs<-data.frame(trt,inflam)
          table(pigs$trt,pigs$inflam)
          
          library(ordinal)
          olreg<-clm(inflam~trt, data=pigs, link="logit")
          summary(olreg)
          

          ##-- Output: --##

          > table(pigs$trt,pigs$inflam)
                 
                  Nil Mild Moderate Severe
            GM      8   23       18     23
            nonGM   4   31       29      9
          
          > summary(olreg)
          formula: inflam ~ trt
          data:    pigs
          
           link  threshold nobs logLik  AIC    niter max.grad cond.H 
           logit flexible  145  -183.73 375.46 5(0)  5.29e-08 1.5e+01
          
          Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
          trtnonGM  -0.3893     0.3060  -1.272    0.203
          

          So, another analysis (probably the "best" analysis for this data) shows no significant difference between treatment groups.

  9. (1) A central assumption of the MWW test is that the data are continuous. Even though tie-breaking approaches can be used, I believe they are intended for cases where the data are discrete by nature of reporting to insufficient significant figures. For categorical data that are full of ties, you certainly will not obtain significant p values, if the MWW test even makes sense at all. And yes, really bad move trying to apply a t-test.
    (2) There is nothing wrong with making inflammation categorical if the authors used a test appropriate for categorical data, which they did. Your error was in using a test whose central assumption is continuity i.e. no ties. If there is reason to suspect severe and mild inflammation have different causes, what the authors did was perfectly valid. Perhaps multiple hypothesis correction may have been in order, but that is about it as far as I can see.

    1. According to the MWW’s wikipedia page ( http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Assumptions_and_formal_statement_of_hypotheses ), while continuity was assumed in the authors’ original publication of the test, there are more general assumptions under which it is valid. They give as the four assumptions:

      1. Independence
      2. Ordinality
      3. Symmetry under the null
      4. P(inflammation of GMO pig > inflammation of non-GMO pig) + 0.5 P(inflammation of GMO pig = inflammation of non-GMO pig) > 0.5

      Assumption # 4 seems to be quite a bit more general than continuity (in which you’d expect P(GMO = non-GMO) = 0).

      Is there something I’m missing here?
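
      One way to get a feel for assumption 4 is to estimate that quantity directly from the observed counts. The sketch below assumes the category counts quoted elsewhere in this thread and treats every GM pig / non-GM pig pair as equally likely; the variable names are made up for illustration.

      ```r
      ## Counts per category: Nil, Mild, Moderate, Severe
      nonGM <- c(4, 31, 29, 9)
      GM    <- c(8, 23, 18, 23)

      ## Joint proportions over all GM x non-GM pairs
      ## (rows = GM category, columns = non-GM category)
      joint <- outer(GM, nonGM) / (sum(GM) * sum(nonGM))

      p_greater <- sum(joint[row(joint) > col(joint)])  # GM pig in a worse category
      p_equal   <- sum(diag(joint))                     # both pigs in the same category
      p_hat     <- p_greater + 0.5 * p_equal
      p_hat
      ```

      With these counts p_hat works out to 2931/5256, about 0.56, which carries the same information as the W = 2325 from wilcox.test earlier in the thread (W / (73 × 72) = 1 − p_hat). Continuity is not needed to compute it; the ties simply enter through the 0.5 weight.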

  10. When ~ 90% of the pigs show inflammation regardless of the GM category, it would appear that there is a more fundamental issue. Should the question be not GM vs non-GM but corn vs some other feed?
