PyProcessMacro, the Python library to estimate moderation, mediation, and conditional process analysis has been updated to version 0.9.6!

A few changes this time, mostly under the hood (bug fixes). The main change is that the library now correctly reports the Moderated Mediation Index for models with only one moderator on the indirect path.

You can update the current version of your library using pip: pip install pyprocessmacro.

In this blog post, I will demonstrate a more advanced use of the DistributionBuilder: keeping track of the participants' allocation process. In a few lines of code, we will get better insights in how the participants are constructing their distributions.

A typical use of DistributionBuilder

A distribution builder is generally used like this:

  1. Add it to a survey
  2. Let the participant provides her allocation
  3. Recover the allocation.
// 1. Create the DistributionBuilder
var distbuilder = new DistributionBuilder()
distbuilder.render("container")
distbuilder.labelize()

// 2. Participants interacting
// Add 3 balls there, remove 2 here, 1 more over here...

// 3. Recover the allocation
console.log(distbuilder.distribution)

But what if you want to know what happened in step 2? What if you want to keep track of all the intermediary steps between the empty distribution builder, and the final allocation provided by participants?

Monitoring the changes in the allocation

The DistributionBuilder library provides a very useful property: onChange. Any function that you supply to this onChange will be called every time the distribution changes.

function iHaveChanged() {
    // This function just says "I have changed!"
    alert("I have changed!")
}

var distbuilder = new DistributionBuilder({"onChange":iHaveChanged})
distbuilder.render("container")
distbuilder.labelize()

The distribution builder now outputs "I have changed!" every time the user makes a change to the allocation.

This is not very useful, but it shows how we can use the onChange property to trigger specific actions when the participants are changing their allocation of balls.

Recovering and storing the intermediary allocations

Now let's create a better function that will recover the current allocation from the distribution builder, and stores it in an array.

var hist = [] // Empty array that will store the history.

function updateHistory(distbuilder) {
    // This function takes a a DistributionBuilder object as argument
    var allocation = distbuilder.getDistribution().slice() // Make a copy of the current allocation
    hist.push(allocation) // Push this copy to the history array.
}

var distbuilder = new DistributionBuilder({"onChange":() => updateHistory(distbuilder)})
// See below for an explanation on this pattern
distbuilder.render("container")
distbuilder.labelize()

Now this is more useful! Every time a respondent interacts with the distribution builder, the resulting allocation will be stored in the history array.

A few notes on the code above in case you want to better understand what's going on:

  • Unlike in the previous example, we did not bind updateHistory to onChange. Instead, we bind a wrapper function that calls updateHistory with the object distbuilder already supplied as argument. This is what () => updateHistory(distbuilder) means.
  • The function updateHistory makes a copy (with slice()) of the allocation before pushing it to the array. If you do not make a copy, the content of history will change every time the allocation changes, and this array will contain X copies of the current allocation instead of the history. If you have the newest version of the library (v1.1), you can omit the slice().

In the example below, you can observe the previous allocations that have been stored.

See the Pen DistBuilderHistory by Quentin Andre (@QuentinAndre) on CodePen.

Store the allocation in Qualtrics

Qualtrics' embedded data fields only accept strings. To store the history in Qualtrics, you will therefore have to convert this array of allocations into one large string.

function historyToString(arr) {
  return arr.map(x => x.join(",")).join("*")
}

Qualtrics.SurveyEngine.setEmbdeddedData("HistoryOfAllocations", historyToString(hist))

The function above separates the values within each allocation by commas ',' and the different allocations within the history by stars '*'.

Now you are all set! You can observe how participants construct their distributions, and get better insights into their thinking process. In the next post, I will discuss results obtained from this paradigm.

I have posted a minor update of DistributionBuilder on GitHub. It implements some cosmetic improvement to the DistributionBuilder, and prevents a potential bug in more complicated use cases of the library. If you had downloaded the library, I invite you to update your current version. If you were using the link to the library, the code has been automatically updated, and you can still use the library as you used to!

Detailed changelog:

  • The width of the distribution builder is now automatically adjusted using CSS flexbox.
  • The argument resize of DistributionBuilder.render() will be deprecated in future versions. For compatibility reasons, using the resize argument does not raise an error, but it no longer affects the behavior of the distribution builder.
  • Changed the HTML structure: the <div class="cell"></div> now includes an inner <div class="ball"></div>. The appearance of the "balls" in the distribution builder can now be changed more easily.
  • The method getDistribution() now returns a copy of the current allocation. This is to avoid accidental side-effects.

You have used DistributionBuilder in one of your experiment, and you just downloaded the data from Qualtrics. Now what? This small guide will cover the basics of analyzing responses from DistributionBuilder with Python.

You can also view this notebook on NBViewer, which will allow you to download, run and edit the code yourself.

In [1]:
import numpy as np
import pandas as pd
from io import StringIO
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf
import warnings

data = """part_id,condition,allocation
A,Low,"3,1,0,2,1,0,0,0,1,2"
B,High,"0,0,2,0,2,0,1,2,0,3"
C,High,"3,0,0,2,1,0,0,0,3,1"
D,Low,"2,1,0,2,0,1,2,0,1,1"
E,Low,"3,2,0,0,0,1,0,3,0,1"
F,High,"0,0,2,0,0,1,2,0,2,3"
G,Low,"1,3,0,3,0,2,0,1,0,0"
H,High,"0,1,3,0,0,1,2,2,1,0"
I,High,"2,2,0,0,0,2,1,0,0,3"
J,Low,"1,3,1,0,2,2,0,0,0,1"
K,Low,"1,3,1,0,0,1,2,1,1,0"
L,Low,"3,1,2,1,0,1,0,1,0,1"
M,Low,"0,2,2,0,1,3,0,2,0,0"
N,Low,"1,3,0,2,0,1,1,1,1,0"
O,High,"0,0,1,3,1,0,0,3,0,2"
P,High,"1,1,2,0,0,0,2,0,2,2"
Q,High,"1,1,0,0,0,0,3,2,2,1"
R,High,"2,0,0,0,0,2,3,0,3,0"
S,High,"0,1,3,2,0,0,0,0,1,3"
T,Low,"1,3,1,0,1,0,3,0,0,1"
"""

You have just completed a pilot on a Qualtrics. Participants have been randomly assigned to one of two conditions ("Low" or "High"), and have learned a distribution of numbers ranging from 11 to 20.

After learning the distribution, they have been asked to predict the next 10 numbers drawn from this distribution. To do so, they had to allocate 10 balls on a distribution builder with 10 buckets, labeled [11, 12, 13, 14, 15, 16, 17, 18, 19, 20].

Here is the data for the first twenty participants in your study (in practice, always collect more than this. Please).

In [2]:
df = pd.read_csv(StringIO(data), sep=",")
df
Out[2]:
part_id condition allocation
0 A Low 3,1,0,2,1,0,0,0,1,2
1 B High 0,0,2,0,2,0,1,2,0,3
2 C High 3,0,0,2,1,0,0,0,3,1
3 D Low 2,1,0,2,0,1,2,0,1,1
4 E Low 3,2,0,0,0,1,0,3,0,1
5 F High 0,0,2,0,0,1,2,0,2,3
6 G Low 1,3,0,3,0,2,0,1,0,0
7 H High 0,1,3,0,0,1,2,2,1,0
8 I High 2,2,0,0,0,2,1,0,0,3
9 J Low 1,3,1,0,2,2,0,0,0,1
10 K Low 1,3,1,0,0,1,2,1,1,0
11 L Low 3,1,2,1,0,1,0,1,0,1
12 M Low 0,2,2,0,1,3,0,2,0,0
13 N Low 1,3,0,2,0,1,1,1,1,0
14 O High 0,0,1,3,1,0,0,3,0,2
15 P High 1,1,2,0,0,0,2,0,2,2
16 Q High 1,1,0,0,0,0,3,2,2,1
17 R High 2,0,0,0,0,2,3,0,3,0
18 S High 0,1,3,2,0,0,0,0,1,3
19 T Low 1,3,1,0,1,0,3,0,0,1

1. Data Wrangling: Transforming the data for statistical analysis and visualization.

If you look at the data, you can see that you have the allocation provided by each participant.

However, those allocations are not interpretable right now:

  1. The allocations are stored as a string (and not as a list of numbers).
  2. This numbers in this allocation are not the values of the distribution: those are the number of balls in each bucket. You can see that all the values are below 10, while the buckets ranged from 11 to 20.

Let's review how you should transform this data, and what you can do with it.

The first step is to write a function and convert those allocations to distributions.

a. Converting the allocations into distributions

In [3]:
def convert_allocation_to_distribution(allocation, buckets):
    """
    Takes an allocation of balls to buckets, and a list of buckets.
    Return the corresponding distribution of values.
    
    Example: 
        buckets = [1, 2, 3, 4, 5]
        x = "0, 3, 1, 2, 1"
        dist = convert_allocation_to_distribution(x, buckets)
        print(dist) -> (2, 2, 2, 3, 4, 4, 5)
    """
    arr = allocation.split(",")
    
    if len(arr) != len(buckets):
        raise ValueError("The number of buckets should match the length of the allocations.")
    values = np.repeat(buckets, arr)
    return values

Now we apply this function to the column "allocation", specifying that the buckets ranged from 11 to 20.

In [4]:
df["distribution"] = df["allocation"].apply(convert_allocation_to_distribution, 
                                            buckets=[11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
df.head()
Out[4]:
part_id condition allocation distribution
0 A Low 3,1,0,2,1,0,0,0,1,2 [11, 11, 11, 12, 14, 14, 15, 19, 20, 20]
1 B High 0,0,2,0,2,0,1,2,0,3 [13, 13, 15, 15, 17, 18, 18, 20, 20, 20]
2 C High 3,0,0,2,1,0,0,0,3,1 [11, 11, 11, 14, 14, 15, 19, 19, 19, 20]
3 D Low 2,1,0,2,0,1,2,0,1,1 [11, 11, 12, 14, 14, 16, 17, 17, 19, 20]
4 E Low 3,2,0,0,0,1,0,3,0,1 [11, 11, 11, 12, 12, 16, 18, 18, 18, 20]

Good! The raw allocation strings have now been converted to the actual distribution provided by participants. However, the data is in a nice shape: the data for each participant is stored as a list of number in a single column. Ideally, we'd like a format that we can use to do analysis and graphs.

b. Pivoting the distributions in long form

We are now going to reshape the data in long form, such that one record corresponds to one value entered by one participant.

In [5]:
part_values = df.set_index('part_id')['distribution'].apply(pd.Series).stack().reset_index()
part_values.columns = ["part_id", "value_index", "value"]

Let's first check that we have the expected number of records. We have 20 participants and 10 values by participants:

In [6]:
part_values.shape[0]
Out[6]:
200

That's a total of 200 observations. Good. Now if we inspect the head of the dataset...

In [7]:
part_values.head()
Out[7]:
part_id value_index value
0 A 0 11
1 A 1 11
2 A 2 11
3 A 3 12
4 A 4 14

...we see that we have lost some information along the way: we have the list of all values, but no longer have any participant-level information beyond the identifier part_id.

To correct this, we finally need to merge those values with the original dataset on the column part_id.

c. Merging back the rest of the data

Before merging, we can also drop the columns "distribution" and "allocation", which are no longer needed.

In [8]:
df_long = part_values.merge(df.drop(["distribution", "allocation"], axis=1), on="part_id")
df_long.head()
Out[8]:
part_id value_index value condition
0 A 0 11 Low
1 A 1 11 Low
2 A 2 11 Low
3 A 3 12 Low
4 A 4 14 Low

We now have everything in long form:

  • The participant-level information
  • All the values of the distributions provided by the participants

We are now ready for some visualization and statistical analysis.

2. Visualizing the data

The first thing you can do is to inspect the distributions provided by the participants: do they look nice? Let's inspect the distributions provided by those 20 respondents.

In [9]:
bins = np.arange(10.5, 20.6, 1)
xticks = np.arange(11, 20.1, 1)
g = sns.FacetGrid(data=df_long, col="part_id", col_wrap=4, hue="condition", sharex=False, size=1.5, aspect=1.7)
g.map(plt.hist, "value", bins=bins, histtype="bar", rwidth=0.8)
g.set_ylabels("Number of Balls")
g.set_xlabels("Bucket")
g.set(xticks=xticks)
g.set_titles("Participant {col_name}")
g.set(xlim=(10, 21))
g.add_legend(title="Condition");

Since those allocations were randomly generated, they do not look very nice. Oh well. But at least we have an interesting representation of participants' responses!

Note that we could also look at the distribution of responses by condition.

In [10]:
g = sns.FacetGrid(data=df_long, col="condition", hue="condition", size=4)
g.map(plt.hist, "value", bins=bins, lw=1, rwidth=0.8)
g.set_ylabels("Total Number of Balls")
g.set_xlabels("Bucket")
g.set(xticks=xticks)
g.set_titles("Condition {col_name}")
g.set(xlim=(10, 21));

The visualization above is informative within each condition, but you cannot compare the two conditions (unless you have the exact same number of participant in each condition).

If if is not the case, we can visualize the mean number of balls in each bucket and each condition by applying a simple transformation:

In [11]:
med_values = (df_long.groupby(["part_id", "value", "condition"]).count()
             .reset_index().groupby(["condition", "value"])
             .mean().reset_index())
med_values.columns = ["condition", "value", "median"]
med_values.head()
Out[11]:
condition value median
0 High 11 1.800000
1 High 12 1.200000
2 High 13 2.166667
3 High 14 2.333333
4 High 15 1.333333
In [12]:
g = sns.factorplot(x="value", y="median", hue="condition", hue_order=["Low", "High"],
                   data=med_values, kind="bar", ec="white", size=4, aspect=2)
g.set_ylabels("Median Number of Balls")
g.set_xlabels("Bucket")
g.set_titles("Condition {col_name}");

3. Analyzing summary statistics

The analysis of the data always depends on the research question.

So far, we have considered that we have collected 10 values from each participant, and have treated the data as such.

However, we could have collected the data with the specific goal of testing people's perception of a statistics (e.g. mean/standard deviation/skewness) of a distribution.

In that case, the relevant unit of analysis is the statistics computed at the distribution's level. We need to aggregate the long form data on part_id, and compute the statistics of interest, to obtain one record point per participant.

In [13]:
agg_values = df_long.groupby("part_id")["value"].aggregate(["mean","std", "skew"]).reset_index()
df_agg = agg_values.merge(df.drop(["allocation", "distribution"], axis=1), on="part_id")
df_agg.head()
Out[13]:
part_id mean std skew condition
0 A 14.7 3.713339 0.555959 Low
1 B 16.9 2.766867 -0.264377 High
2 C 15.3 3.683296 0.035688 High
3 D 15.1 3.212822 0.088451 Low
4 E 14.7 3.622461 0.197400 Low

We can now test the impact of our condition on people's perceived mean (or standard deviation, or skewness) of the distribution:

In [14]:
summ = smf.ols("mean~condition", data=df_agg).fit().summary()
summ
Out[14]:
OLS Regression Results
Dep. Variable: mean R-squared: 0.715
Model: OLS Adj. R-squared: 0.700
Method: Least Squares F-statistic: 45.23
Date: Fri, 13 Apr 2018 Prob (F-statistic): 2.64e-06
Time: 17:12:36 Log-Likelihood: -15.678
No. Observations: 20 AIC: 35.36
Df Residuals: 18 BIC: 37.35
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 16.2000 0.177 91.714 0.000 15.829 16.571
condition[T.Low] -1.6800 0.250 -6.725 0.000 -2.205 -1.155
Omnibus: 0.287 Durbin-Watson: 2.336
Prob(Omnibus): 0.866 Jarque-Bera (JB): 0.395
Skew: 0.235 Prob(JB): 0.821
Kurtosis: 2.498 Cond. No. 2.62

Here, we see that participants in the "Low" condition reported a significantly lower mean that participants in the "High" condition (which is exactly how I created the data, so not so surprising...).

3. Wrapping up

I have demonstrated how to convert the raw data outputted by DistributionBuilder to a standard long form dataset. A few words of caution to finish:

  • Always check that the list of buckets you use in the analysis matches the list of buckets that was shown to participants.
  • Always check your number of records. If something is off, you might have missing data, or you might not have forced participants to submit a full distribution.
  • Always ask yourself: what is my unit of analysis? Is it the values provided by participants, or is it a summary statistics of their total reported distribution.

Let's say you are interested in how well participants can learn a distribution of a specific type (I'm not judging, we all have weird hobbies). How should you select the values from the distribution?

In this blog post, I present a simple method to ensure that the values you use as stimuli will always match the properties of the distribution you are interested in.

You can also view this notebook on NBViewer, which will allow you to download, run and edit the code yourself.

In [44]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import numpy as np

Imagine that you want to test whether consumers can learn a given distribution and reproduce it on a distribution builder.

Specifically, you want to test if participants can learn a normal distribution with parameters $\mu = 25$ and $\sigma = 5$.

For the sake of illustration, let's assume the following:

  • Each participant will learn 50 numbers
  • Those 50 numbers to be integer (to facilitate learning).
  • The numbers will be reported on a distribution builder with bins [0, 2, ..., 48, 50].

How to create distribution of 50 values that, when reported into the distribution builder, will still look like "normal-like"?

A bad solution: the "random draw" approach

It might seem obvious: let's just draw those numbers from the distribution!

In [45]:
np.random.seed(25330)
MU = 25
SIGMA = 5
N = 50
BINS = np.arange(1, 51.5, 2)
r = np.arange(0, 50, 0.001)
dist = stats.norm(MU, SIGMA)
numbers = np.random.normal(MU, SIGMA, size=N) # Random draw of 50 numbers
round_numbers = np.round(numbers, 0) # Rounded to nearest integet

When we visualize it however....

In [46]:
plt.hist(round_numbers, bins=BINS, ec="white", align="mid");
plt.axvline(np.mean(round_numbers),color="darkred")
plt.annotate(r"$\mu = {:.2f}, \sigma = {:.2f}$".format(round_numbers.mean(), round_numbers.std()),(5, 7))
plt.ylabel("Number of values")
plt.xlabel("Bins")
sns.despine();

The mean is lower than what we'd like, the standard deviation is too high, and the distribution would not look normal at all when reported in a distribution builder.

This was expected: random numbers are, by definition, random. Can we do better?

A slightly better approach: the iterative approach

Yes! We could repeated this sampling process several times, until we are sufficiently close to the parameters of the distribution that we want to obtain.

In [47]:
ERR_MU = 0.01
ERR_SIGMA = 0.1
SKEW_SIGMA = 0.01
numbers = np.random.normal(MU, SIGMA, size=N)
round_numbers = np.round(numbers, 0)
m = numbers.mean()
sd = numbers.std()
skew = stats.skew(numbers)
i = 1
while (np.abs(m-MU) > ERR_MU) or (np.abs(sd-SIGMA) > ERR_SIGMA) or (np.abs(skew) > SKEW_SIGMA):
    i += 1
    numbers = np.random.normal(MU, SIGMA, size=N)
    round_numbers = np.round(numbers, 0)
    m = numbers.mean()
    sd = numbers.std()
    skew = stats.skew(numbers)
print("After {} iterations, we have a satisfying distribution".format(i))
After 6902 iterations, we have a satisfying distribution

That took a few seconds. Let's visualize it...

In [48]:
plt.hist(round_numbers, bins=BINS, ec="white", align="mid")
plt.axvline(np.mean(round_numbers), color="darkred")
plt.annotate(r"$\mu = {:.2f}, \sigma = {:.2f}$".format(round_numbers.mean(), round_numbers.std()),(3, 7))
plt.ylabel("Number of values")
plt.xlabel("Bins")
sns.despine();

The distribution now has the mean, variance and skew that we want... But it still not perfectly normal. In particular, the mode does not correspond to the mean... Do we really want to give reviewer B something to nitpick about?

The correct method: binning a continuous distribution

The trick is to follow these steps:

  1. Obtain the CDF of the distribution that we want to copy (here the CDF of $\mathcal{N}(25, 5)$
  2. Use this CDF to compute the probability of each random value falling in each bucket of the distribution builder. Formally, we compute for each bucket $P(l \leq X \leq h)$, where $l$ and $h$ are the lower and upper bounds of each bucket.
  3. Convert those probabilities in a number of observations, rounding them to the nearest integer.
  4. If this creates less observations than what we want, increase the probability of each observation by a very small amount.

Putting this together into a function:

In [49]:
def bin_dist(dist, buckets, n):
    """
    Generate a discrete distribution of numbers that match a target distribution.
    dist: 
        The Distribution object from which the CDF will be computed. 
        Can be any distribution, as long as it has support on the `buckets` provided
    buckets: 
        The buckets of the distribution builder that will be used
    n: 
        The number of observations that should be presented.
        
    Returns:
        An array of length n containing the values.
    """
    spacing = (buckets[1]-buckets[0])/2 # Space between each of the buckets
    lbounds = buckets-spacing # Lower bound of each bucket
    rbounds = buckets + spacing # Upper bound
    lcdf = dist.cdf(lbounds) # CDF up to lower bound
    rcdf = dist.cdf(rbounds) # CDF up to upper bound
    p = rcdf-lcdf # Probability of value being in the bucket
    nballs = np.round(p*n, 0) # Multiplying by expected number of values, and rounding to nearest integer
    mult = 1
    while (nballs.sum() < n): # In case we don't have enough observations...
        mult += 0.05
        nballs = np.round(p*n*mult)
    return np.repeat(buckets, nballs.astype(int))

Now if we apply this method:

In [50]:
dist = stats.norm(MU, SIGMA)
round_numbers = bin_dist(dist, np.arange(0, 51, 2), 50)
plt.hist(round_numbers, bins=BINS, align="mid", ec="white")
plt.axvline(np.mean(round_numbers), color="darkred")
plt.annotate(r"$\mu = {:.2f}, \sigma = {:.2f}$".format(round_numbers.mean(), round_numbers.std()),(3, 6))
plt.ylabel("Number of values")
plt.xlabel("Bins")
sns.despine();

This is exactly what we want ! A normal-like distribution of integers.

This method is also very flexible: it can be applied to any continuous distribution and any number of buckets.

Here are a few illustrations of the function for different distributions, varying the number of buckets.

In [51]:
# Normal, Chi and Beta distributions.
dists = [stats.norm(25, 9), stats.chi(1, 5, 20), stats.beta(0.5, 0.5, -1, 52)]

# 6, 11 and 26 buckets.
buckets = [np.arange(0, 51, 10),  np.arange(0, 51, 5), np.arange(0, 51, 2)]

# Corresponding bins and widths
bins = [np.arange(-5, 56, 10), np.arange(-2.5, 53.5, 5), np.arange(-1, 52, 2)]
widths = [10, 5, 2]


fig, axes = plt.subplots(3, 3, figsize=(8, 4), sharey=True, dpi=300)
for i, ax in enumerate(axes):
    for j, a in enumerate(ax):
        d = dists[i]
        balls = bin_dist(d, buckets[j], 50)
        a.hist(balls, bins=bins[j], width=widths[j], align="mid", rwidth=1, density=True, ec="white")
        a.plot(r, d.pdf(r), ls=":", color="red")

        sns.despine(left=True)
        a.set_yticks([])
        a.set_xticks(buckets[j])
        a.tick_params(axis="x", labelrotation=90, labelsize=5)
        a.tick_params(axis="y", labelrotation=0, labelsize=5)
        
for a, t in zip(axes, [r"$\mathcal{N}(25, 9)$", r"$\chi(1)$ (scaled)", r"$\beta(0.5, 0.5)$ (scaled)"]):
    a[0].set_ylabel("{}".format(t), size=8)
    
for a, t in zip(axes[0], ["6", "11", "26"]):
    a.set_title("{} Buckets".format(t), size=8)
    
plt.tight_layout()

A few rules to finish:

  • Don't use too few buckets. The fewer buckets you have, the less faithful the representation of the distribution will be.
  • Don't present too few observations. You also need a good number of them to approximate the distribution.
  • Make sure that your buckets cover the "full" distribution: your distribution should have support on all buckets, and the buckets should cover the majority of the support of the distribution.