Making Sense of Research on Complementary Cancer Therapies.

By Nancy Hepp, CancerChoices Lead Researcher and Program Manager,
with review by Laura Pole, RN, MSN, OCNS, and Andy Jackson, ND.

Perhaps you’ve seen headlines such as these:

  1. Vitamin D Prevents Cancer
  2. No Benefit from Taking Vitamin D Supplements
  3. Exercise Recommended for People with Cancer
  4. Ease Back on Exercise during Cancer Treatment
  5. Metformin Improves Cancer Survival
  6. Metformin Has No Effect on Cancer

Is it any wonder that many people are confused? Or that some people have even stopped listening to scientists because they make such contradictory statements? Medical research is in some ways a separate place, almost like a separate country with a different language and culture. Let me be your tour guide for a bit, helping you understand what researchers are actually saying.

Building evidence over time

It may help right off the bat to think of medical research studies as a bit like election results. As soon as the polls close on election night, we start seeing returns. I may get excited because my candidate is ahead by 2 points, but if only 2% of precincts have reported, it’s far too early to claim victory. As more and more results come in, we get a more definitive picture of the final result, and we have increasing confidence in the results we’re seeing.

Medical research follows a similar path of mounting evidence with each new study. When a study comes out with one conclusion, it’s almost always far too early to declare that this is the final word. Several studies drawing similar conclusions give us more confidence in the results. One important difference between elections and research is that research never reaches 100% of the count. We can always have another study on a topic which will add to the overall picture, either strengthening the current understanding or raising questions about it.

Clinical or preclinical research?

A very important consideration is whether a therapy or treatment was tested in human beings (clinical research) or in isolated tissues or animals (preclinical research). Each has its place: researchers would be very hesitant to try a new drug or procedure on people if its safety and effectiveness haven’t first been shown in cell and animal studies. But preclinical studies are only a first step toward finding how a drug or other therapy will act in living human beings. They’re a little like a pre-election poll. Polls give an indication of how an election will go, but election results can be different from polls for a lot of reasons. Preclinical research can also find different results from clinical evidence for a number of reasons. Animals and isolated cells in a petri dish may respond very differently to an herb or drug than a living person does. This may be a reason your doctor would not be as enthusiastic as you had hoped about a preclinical study showing that an herb killed cancer cells. Results of preclinical studies are a first indication of medical effects, but only that.

Benefit or not?

It’s important to realize that in virtually all clinical studies, some people will show a benefit from a treatment and some will not, no matter whether the treatment is a chemotherapy drug, a supplement, a change in lifestyle, or anything else. Almost no treatment will help 100% of the people in the study. This means that researchers have to conclude that a treatment is beneficial if enough people get better. How “enough” is determined, and how “better” is defined, are just 2 reasons that research studies can be so complex to interpret, and why even when studies show similar results, the researchers may come to different conclusions.

Researchers can measure study outcomes in many different ways. Even something as apparently definitive as survival can be measured variously—1-year survival, 5-year survival, progression-free survival, recurrence-free survival, and more. When comparing 2 studies, we have to be sure that the same measures are used.

This becomes even more complex when measuring conditions such as pain or fatigue or sleep quality, where no hard-and-fast measures exist. We have to rely on each person’s report of the severity or duration of their symptoms. Some tools are available to help quantify pain or fatigue, and researchers may also look at how much pain medication was used or how much a person moved while in bed (an indirect measure of sleep), but then how do you measure whether something is “better”? Researchers have to determine how much of a change means “better” to them. Does 10% less pain medication use mean less pain? Or would it have to be 40% less use? Does moving from a “9” rating of fatigue to a “6” show enough improvement? This is an important issue for interpreting study results. Imagine a study defining an improvement in pain as a 50% reduction in the pain score (a very realistic scenario). A 40% reduction would count for nothing in the study, and the researchers would conclude that the therapy showed no effect with only a 40% reduction in pain, although this might have been meaningful to the people experiencing it. We must always look carefully at what a study is measuring when interpreting the results.
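To see how much the definition of “better” can change a study’s tally, here is a small sketch. The pain scores are made up for illustration (they are not from any study), and the function name is our own invention; the 50% and 40% cutoffs simply echo the scenario above.

```python
# Hypothetical pain scores (0-10 scale) before and after a therapy; not real data.
before = [8, 6, 9, 7, 10]
after = [4, 4, 5, 6, 3]


def count_responders(before_scores, after_scores, min_reduction):
    """Count people whose pain score fell by at least `min_reduction` (a fraction)."""
    responders = 0
    for b, a in zip(before_scores, after_scores):
        reduction = (b - a) / b  # fractional drop from this person's starting score
        if reduction >= min_reduction:
            responders += 1
    return responders


strict = count_responders(before, after, 0.50)   # "better" means at least 50% less pain
lenient = count_responders(before, after, 0.40)  # "better" means at least 40% less pain
print(f"50% cutoff: {strict} of {len(before)} improved")
print(f"40% cutoff: {lenient} of {len(before)} improved")
```

With these invented numbers, the same five people yield a different count of “improved” participants depending only on where the cutoff is set.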

Statistical analysis gives us a standard to follow to determine how much is “enough” improvement across all the people in the study. Most people would agree that if a therapy is tested on 100 people, and 2 show an improvement, we don’t want the researchers claiming that this therapy is effective. But how many people would need to show improvement to make that claim? 30? 50? 90? The field of statistics has developed rules and guidelines for interpreting study results to distinguish events that occur randomly from events that are due to the effect of a treatment or intervention. These involve complex mathematical formulas to determine whether a result is more likely a result of random chance or due to the therapy being studied.

Studies use either “p values” or “confidence intervals” (or both, as they are related) to assess how likely it is that the results of this particular study reflect what would happen if the entire world had been studied. P values (probability values) are shown as a number between 0 and 1, and the smaller the number, the more confidence we have that the result was not a random fluke. Most researchers agree that a p value less than 0.05 gives us enough confidence to say that we saw a medical effect in this study. But some researchers are content with a p value as high as 0.1, which is a less rigorous standard.
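For readers who like to see the arithmetic, here is one simplified way the “how many is enough?” question can be answered. This is only a sketch under stated assumptions: it uses a one-sided exact binomial calculation, and it assumes that 30% of people would improve on their own without the therapy. That 30% figure and the function names are hypothetical, chosen for illustration; real studies usually compare against a control group and use more involved methods.

```python
from math import comb


def upper_tail_p(improved, n, baseline_rate):
    """One-sided exact binomial p value: the chance of seeing at least
    `improved` improvements among n people if only the baseline rate
    (improvement with no therapy effect) were at work."""
    return sum(
        comb(n, k) * baseline_rate**k * (1 - baseline_rate) ** (n - k)
        for k in range(improved, n + 1)
    )


def smallest_significant_count(n, baseline_rate, alpha=0.05):
    """Smallest number of improved people whose p value falls below alpha."""
    for improved in range(n + 1):
        if upper_tail_p(improved, n, baseline_rate) < alpha:
            return improved
    return None


# Assumption for illustration only: 30% of people improve without the therapy.
threshold = smallest_significant_count(100, 0.30)
print("Improvements needed among 100 people:", threshold)
print("p value if only 2 improve:", upper_tail_p(2, 100, 0.30))
```

Changing the assumed baseline rate or the 0.05 cutoff moves the threshold, which is one reason two careful research teams can set different bars for “enough.”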

It’s also possible to have a situation in which a statistically significant improvement is found, but the clinical effect is negligible. As always when we assign numbers to complex living beings and processes, we gain some understanding and lose some of the nuance and intricacy. For example, a 5% reduction in insomnia may have a p value of less than 0.05, but this could mean only 1 extra person is sleeping well after treatment (20 with insomnia among people who used the therapy compared to 21 with insomnia among people who didn’t). We would say that this therapy has minimal clinical benefit, even though it shows statistical benefit.
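The insomnia example can be worked through in a few lines. The counts of 20 and 21 come from the example above; the group size of 100 people in each group is our assumption, since the example does not state one, and the function name is made up for this sketch.

```python
def risk_summary(events_treated, events_control, group_size):
    """Return (absolute difference, relative reduction) in the share of
    people who still have the problem, treated versus untreated."""
    absolute_difference = (events_control - events_treated) / group_size
    relative_reduction = (events_control - events_treated) / events_control
    return absolute_difference, relative_reduction


# 20 people with insomnia among users, 21 among non-users.
# Assumption: 100 people in each group (not stated in the example).
absolute, relative = risk_summary(20, 21, 100)
extra_people_helped = 21 - 20

print(f"Relative reduction: {relative:.1%}")   # roughly the "5% reduction"
print(f"Absolute difference: {absolute:.1%}")  # share of the whole group helped
print(f"Extra people sleeping well: {extra_people_helped}")
```

The relative figure sounds larger than the absolute one, which is why it helps to ask how many actual people a percentage represents.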

Both the statistical significance and the clinical significance need to be considered, and then also any risk of harm from using a therapy.

Compared to what?

When looking for whether a therapy has benefit, what the therapy is being compared to matters. A weak comparison is to assess a medical situation before treatment and compare it with the situation after treatment. For example, suppose we want to see if applying a medicinal cream to a wound helps healing. We apply the cream every day for 2 weeks, and then we see that the wound is indeed much more healed than it was at the start. But wouldn’t we expect a wound to show substantial healing after 2 weeks, even without the cream? If we’re looking only at “before” and “after” status, we won’t be able to tell if the cream had much effect at all. We’d really need at least one other person with a similar wound and health status. We’d use the cream on one person and not the other, and then we’d see if we could find much difference between the wounds after 2 weeks. This is even stronger if we have groups of people instead of just two.

This situation of treating some people in a study with a therapy and then comparing their outcomes to other, comparable people who weren’t treated is called having a control group, and the people who didn’t receive the treatment are called “controls.” Controls are needed to have much confidence at all that any improvement (or harm) is due to the therapy we’re investigating and not simply natural healing (or deterioration). If controls are given a placebo—an inactive therapy such as a look-alike pill that users can’t tell from “the real thing”—we have a very good comparison and can have even more confidence that our therapy had an effect.

In research studies, what a treatment is compared to is important, and any therapies that have only before-and-after comparisons show only weak evidence, at best.

The study size matters

Large, rigorously designed studies are somewhat like getting election results from a big city, while a small study is more like getting the results from one neighborhood. A case study would be similar to asking one person how they voted. In general, the more people in a medical study, the more confident we can be that we’ll get the overall trend of a therapy’s effects.

In small studies (fewer than 100 people), a change in the outcome of just a few people can alter the conclusion, similar to how in a small election, just a few votes can swing the results.

Example: We’re trying out a new sleep therapy among 50 people. We’ve determined by statistical guidelines that we need to see an improvement in sleep among at least 35 people to conclude that the therapy is effective. So what do we do if we measure everything carefully, and we find that 34 people reported better sleep after using the therapy? We have to conclude that the therapy had no effect. If the identical study with another group of people found that 36 people reported better sleep, the researchers would say the therapy is effective. The opposite conclusions from these 2 studies rely on just a small difference in the outcomes.

When we at CancerChoices evaluate studies such as these, we look carefully at the p values. In this example, the p values may have been 0.07 (just a little too high) in the first study and less than 0.05 (right where we want it) in the second study. When we interpret these results for our readers, we would not describe them as conflicting findings. We would say that the first study showed a weak trend toward better sleep, and the second study showed a significant effect. Both studies found changes in the same direction, and with at least a weak level of confidence.

I want to be clear: we do not want to overstate the first study’s results. If it were the only study available, we would call this only weak evidence of an effect. But we think a study showing a weak trend toward a positive effect is not the same as a study without any evidence of benefit. Looking at these weak trends has helped us resolve apparently conflicting evidence across studies. We define a weak trend as coming close to, but not quite achieving, statistical significance. This means having a p value of 0.05 or more but less than 0.10 or so. Anything with a higher p value is not good enough to call even weak evidence. For those studies, we, along with most researchers, conclude there was no evidence of an effect.


Because small studies are more influenced by the outcomes of each participant than large studies, they are considered less reliable. But that doesn’t mean their results cannot be valuable. By combining the 2 studies in the sleep study example above, we can consider them as though they were 1 study with 100 people. If we need to see improvement among 70 people to see a result, we’ve achieved that. We’ve taken 2 studies and combined them to help us interpret the overall results. This is called a meta-analysis, and it is a very useful tool for finding the best likely interpretation of a collection of similar studies. A meta-analysis can combine the results of many large and small studies to give us a clearer picture of the findings, especially when individual studies came to different conclusions.
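Here is a rough sketch of that pooling idea using the two sleep studies. It rests on assumptions we are adding for illustration: a one-sided exact binomial calculation, and a hypothetical baseline in which 57% of people would report better sleep without the therapy. We picked that baseline only so that 34 of 50 falls a little short of the 0.05 cutoff and 36 of 50 passes it, echoing the example. Simply adding the counts together is also a simplification; a formal meta-analysis weights each study rather than treating them as one.

```python
from math import comb


def upper_tail_p(improved, n, baseline_rate):
    """One-sided exact binomial p value: the chance of at least `improved`
    improvements among n people if the therapy had no effect."""
    return sum(
        comb(n, k) * baseline_rate**k * (1 - baseline_rate) ** (n - k)
        for k in range(improved, n + 1)
    )


# Hypothetical baseline improvement rate, assumed for illustration only.
BASELINE = 0.57

p_first = upper_tail_p(34, 50, BASELINE)             # first study: 34 of 50
p_second = upper_tail_p(36, 50, BASELINE)            # second study: 36 of 50
p_pooled = upper_tail_p(34 + 36, 50 + 50, BASELINE)  # both together: 70 of 100

print(f"First study p value:  {p_first:.3f}")
print(f"Second study p value: {p_second:.3f}")
print(f"Pooled p value:       {p_pooled:.4f}")
```

Under these assumptions the pooled p value is smaller than either study’s on its own, which illustrates why two small studies pointing in the same direction can be more persuasive together than apart.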

Of course, a meta-analysis is only as good as the studies being combined, and sometimes studies have important differences that a meta-analysis may not be able to reconcile. Studies may have used different doses or durations of a therapy, applied it at different phases of cancer treatment, or studied different populations, and these details may be lost in the meta-analysis. A meta-analysis is helpful, but its application can still be limited and open to interpretation.

How much evidence is enough?

Especially for therapies that are generally safe and relatively inexpensive, finding even preliminary evidence of an effect may be helpful for people who are suffering and may be able to find relief from a therapy. For example, our evaluation of research on yoga found preliminary evidence of a higher rate of return to work 6 months after starting chemotherapy among people with breast cancer who practice yoga compared to people who don’t practice yoga. “Preliminary evidence” isn’t very strong, but yoga is safe for many people and generally not very expensive, so the risks are low, while the benefit could be meaningful. If you would like to improve your recovery after chemotherapy and your doctor doesn’t advise against yoga, preliminary evidence may be enough for you to decide to start practicing yoga. For therapies that are more expensive or potentially harmful, a higher level of evidence is a good idea.

The study population matters

Recall that at the beginning of this post I presented apparently opposite findings from studies:

  1. Vitamin D Prevents Cancer
  2. No Benefit from Taking Vitamin D Supplements
  3. Exercise Recommended for People with Cancer
  4. Ease Back on Exercise during Cancer Treatment
  5. Metformin Improves Cancer Survival
  6. Metformin Has No Effect on Cancer

Each of these pairs of studies found conflicting results, but by looking carefully at who was under investigation in each study, we can see that both conclusions could be valid, but for different groups of people.

Our analysis of the benefits of vitamin D found that supplements can reduce your risk of cancer, but only if you are deficient in vitamin D. For people who have adequate vitamin D levels, taking supplements probably won’t bring any benefit for cancer risk (and getting your vitamin D levels up too high could even be a problem).

We found that exercise is recommended for people with cancer, but if you are experiencing substantial nausea or vomiting, some kinds of pain, or other short-term symptoms, you may need to stop vigorous exercise and switch to gentle movement for a time.

When we analyzed hundreds of studies about the impact of the diabetes drug metformin on cancer outcomes, we found that metformin may indeed improve survival and reduce cancer risk among people with diabetes, prediabetes, or high blood sugar, but very little evidence shows much cancer benefit among people with normal blood sugar levels.

Who is being studied is at least as important as which therapy is being used. Characteristics of the people in the study can make a difference in the outcome and can be a big clue as to whether a therapy might have a similar effect for you. Consider their age, sex, type and stage of cancer, whether they have diabetes or other medical conditions, their genetics, and other characteristics. Many studies find no effect of a therapy among the whole study population, but a particular subgroup may experience better outcomes when using a therapy. Perhaps only women showed a response, or only people under age 35, or only people with a certain genetic profile. It’s important to pay attention to the study population and any subgroups when evaluating a study. However, having several subgroup analyses can also increase the risk of “false positive” correlations—those that occur due to random chance and not due to any effect from the therapy. Subgroup analyses need to be interpreted with some caution.
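The caution about subgroups can be put in numbers. This sketch assumes that the therapy truly has no effect, that each subgroup is tested separately at the usual 0.05 cutoff, and that the subgroup tests are independent of one another. Real subgroups often overlap, so treat this as a rough guide; the function name is our own.

```python
def chance_of_false_positive(num_subgroups, alpha=0.05):
    """Chance that at least one subgroup looks 'significant' purely by luck,
    assuming no real effect and independent tests at the given cutoff."""
    return 1 - (1 - alpha) ** num_subgroups


for subgroups in (1, 5, 10, 20):
    chance = chance_of_false_positive(subgroups)
    print(f"{subgroups:2d} subgroups: {chance:.0%} chance of at least one fluke")
```

With 10 subgroups, the chance of at least one fluke “finding” is about 40% under these assumptions, which is why a single positive subgroup deserves a second look before we rely on it.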

Someone to interpret studies for you

After this brief visit to the world of medical research, you would be justified in feeling overwhelmed at the thought of reading and competently interpreting all the studies on a complementary therapy you’re interested in. This is why many organizations read and interpret the results for you.

We at CancerChoices review the cancer-related use of complementary therapies, including herbs and supplements, mind-body therapies, off-label drugs, diets, and more. We consider all the aspects of the research described in this post, and then we summarize and interpret the results for our readers. We categorize the overall evidence on any medical benefit of a therapy into increasing levels. These categories consider not only the number and size of studies, but also the study design and quality:

  1. No evidence of an effect
  2. Insufficient evidence of any effect
  3. Weak evidence
  4. Preliminary evidence
  5. Modest evidence
  6. Good evidence
  7. Strong evidence

Once we’ve compiled all the evidence, we rate the therapy on 7 dimensions: how strong the evidence is for each of 4 medical benefits, plus how integrative oncology experts recommend its use, its safety, and its affordability and access.

All the levels of evidence and the ratings are described at Therapy Ratings. We footnote and link to all the studies that we have evaluated so that you can go to the source documents if you’d like.

To date, we have published 39 full therapy reviews and ratings, with another 20 partial reviews of complementary therapies. Many more reviews are planned.

We invite you to visit us to find rigorous and transparent assessments of the research behind complementary therapies; lifestyle practices such as eating, exercising, managing stress, and managing your body weight; and much more.