Transformed Data

VegaFusion supports extracting the transformed data for an Altair Chart using the vegafusion.transformed_data() function. This is particularly useful when building a chart that includes a pipeline of transforms, because it makes the intermediate result of each transform visible.

Example: Top K

Here is an example, based on the Top-K plot with Others example from the Altair documentation, of how transformed_data() can be helpful when building a complex chart.

First, create an Altair Chart wrapping the data source URL.

import altair as alt
import vegafusion as vf

source = "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
chart = alt.Chart(source)

The transformed_data() function can be used on this empty chart to access a preview of the data that is available at the URL. Here the row_limit argument is used to limit the result to 3 rows and the DataFrame is transposed to make it easier to read.

vf.transformed_data(chart, row_limit=3).T

                        0               1                       2
Title                   The Land Girls  First Love, Last Rites  I Married a Strange Person
US_Gross                146083          10876                   203134
Worldwide_Gross         146083          10876                   203134
Production_Budget       8000000         300000                  250000
Release_Date            Jun 12 1998     Aug 07 1998             Aug 28 1998
MPAA_Rating             R               R
Distributor             Gramercy        Strand                  Lionsgate
IMDB_Rating             6.1             6.9                     6.8
IMDB_Votes              1071.0          207.0                   865.0
Major_Genre                             Drama                   Comedy
Rotten_Tomatoes_Rating  nan             nan                     nan
Source
Creative_Type
Director
US_DVD_Sales            nan             nan                     nan
Running_Time_min        nan             nan                     nan

The first step of making this chart is to compute the average worldwide gross of all the movies for each director. This can be accomplished with the Altair Aggregate Transform.

chart = (
    alt.Chart(source)
    .transform_aggregate(
        aggregate_gross='mean(Worldwide_Gross)',
        groupby=["Director"],
    )
)
vf.transformed_data(chart, row_limit=5)

   Director           aggregate_gross
0                     3.59284e+07
1  Christopher Nolan  3.44251e+08
2  Roman Polanski     5.13407e+07
3  Richard Fleischer  2.27635e+07
4  Blake Edwards      5e+06
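
To make the aggregate step concrete, here is a minimal plain-Python sketch of a group-by mean. The rows and the helper name mean_by_group are made up for illustration and are not part of Altair or VegaFusion, and the sketch assumes that no gross values are missing.

```python
from collections import defaultdict

# Hypothetical rows standing in for the movies dataset (values are made up).
movies = [
    {"Director": "A", "Worldwide_Gross": 100.0},
    {"Director": "A", "Worldwide_Gross": 300.0},
    {"Director": "B", "Worldwide_Gross": 50.0},
    {"Director": None, "Worldwide_Gross": 10.0},
    {"Director": None, "Worldwide_Gross": 30.0},
]


def mean_by_group(rows, key, field):
    """Return one row per distinct `key` value holding the mean of `field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[field])
    return [
        {key: group, "aggregate_gross": sum(values) / len(values)}
        for group, values in groups.items()
    ]


aggregated = mean_by_group(movies, "Director", "Worldwide_Gross")
for row in aggregated:
    print(row)
```

In this sketch, rows without a director form their own group keyed by None. That is presumably also what the row with an empty Director in the result above represents.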

Next, the directors are ranked by average gross in descending order. This can be accomplished with the Altair Window Transform.

chart = (
    alt.Chart(source)
    .transform_aggregate(
        aggregate_gross='mean(Worldwide_Gross)',
        groupby=["Director"],
    ).transform_window(
        rank='row_number()',
        sort=[alt.SortField("aggregate_gross", order="descending")],
    )
)
vf.transformed_data(chart, row_limit=5)

   Director         aggregate_gross  rank
0  David Yates      9.37984e+08      1
1  James Cameron    8.29781e+08      2
2  Carlos Saldanha  7.69293e+08      3
3  Pete Docter      7.31305e+08      4
4  Andrew Stanton   7.00319e+08      5
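
The effect of row_number() over a descending sort can be sketched in plain Python as follows. The input rows and the helper name add_row_number are hypothetical and only illustrate the idea; this is not how VegaFusion evaluates the transform.

```python
# Hypothetical aggregated rows (values are made up for illustration).
aggregated = [
    {"Director": "A", "aggregate_gross": 200.0},
    {"Director": "B", "aggregate_gross": 50.0},
    {"Director": "C", "aggregate_gross": 900.0},
]


def add_row_number(rows, field, out="rank"):
    """Sort rows by `field` in descending order and attach a 1-based row number."""
    ordered = sorted(rows, key=lambda row: row[field], reverse=True)
    return [{**row, out: i} for i, row in enumerate(ordered, start=1)]


ranked = add_row_number(aggregated, "aggregate_gross")
for row in ranked:
    print(row)
```

Because a row number is just a position in the sorted order, every row receives a distinct value, even when two directors have the same average gross.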

Then, a new column is added that contains the director’s name for the top 9 ranked directors and “All Others” for the remaining directors. This can be accomplished using the Altair Calculate Transform.

chart = (
    alt.Chart(source)
    .transform_aggregate(
        aggregate_gross='mean(Worldwide_Gross)',
        groupby=["Director"],
    ).transform_window(
        rank='row_number()',
        sort=[alt.SortField("aggregate_gross", order="descending")],
    ).transform_calculate(
        ranked_director="datum.rank < 10 ? datum.Director : 'All Others'"
    )
)
vf.transformed_data(chart, row_limit=12)

    Director          aggregate_gross  rank  ranked_director
0   David Yates       9.37984e+08      1     David Yates
1   James Cameron     8.29781e+08      2     James Cameron
2   Carlos Saldanha   7.69293e+08      3     Carlos Saldanha
3   Pete Docter       7.31305e+08      4     Pete Docter
4   Andrew Stanton    7.00319e+08      5     Andrew Stanton
5   David Slade       6.88155e+08      6     David Slade
6   George Lucas      6.73577e+08      7     George Lucas
7   Andrew Adamson    6.43134e+08      8     Andrew Adamson
8   Peter Jackson     5.95566e+08      9     Peter Jackson
9   Richard Marquand  5.727e+08        10    All Others
10  Eric Darnell      5.66099e+08      11    All Others
11  Roland Emmerich   4.5506e+08       12    All Others
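
The calculate expression is a ternary on the rank column. A rough plain-Python equivalent is sketched below; the helper name label_top_k and the sample rows are assumptions made for illustration, with k=9 corresponding to the datum.rank < 10 condition used in the chart.

```python
# Hypothetical ranked rows (made up for illustration).
ranked = [
    {"Director": "C", "rank": 1},
    {"Director": "A", "rank": 2},
    {"Director": "B", "rank": 3},
]


def label_top_k(rows, k, other="All Others"):
    """Keep the director name for ranks 1..k and use `other` for the rest."""
    return [
        {**row, "ranked_director": row["Director"] if row["rank"] <= k else other}
        for row in rows
    ]


labelled = label_top_k(ranked, k=2)
for row in labelled:
    print(row)
```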

Finally, this dataset is ready to be encoded as a bar mark:

import altair as alt
import vegafusion as vf

vf.enable()

source = "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
chart = (
    alt.Chart(source)
    .transform_aggregate(
        aggregate_gross='mean(Worldwide_Gross)',
        groupby=["Director"],
    ).transform_window(
        rank='row_number()',
        sort=[alt.SortField("aggregate_gross", order="descending")],
    ).transform_calculate(
        ranked_director="datum.rank < 10 ? datum.Director : 'All Others'"
    ).mark_bar().encode(
        x=alt.X("aggregate_gross:Q", aggregate="mean", title=None),
        y=alt.Y(
            "ranked_director:N",
            sort=alt.Sort(op="mean", field="aggregate_gross", order="descending"),
            title=None,
        ),
    )
)
chart

Figure: Top-K directors

The exact value of each bar can be accessed by applying transformed_data() to the final chart (which includes the implicit transforms in the bar mark encoding).

vf.transformed_data(chart)

   ranked_director  mean_aggregate_gross
0  David Yates      9.37984e+08
1  James Cameron    8.29781e+08
2  Carlos Saldanha  7.69293e+08
3  Pete Docter      7.31305e+08
4  Andrew Stanton   7.00319e+08
5  David Slade      6.88155e+08
6  George Lucas     6.73577e+08
7  Andrew Adamson   6.43134e+08
8  Peter Jackson    5.95566e+08
9  All Others       8.87602e+07
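
The implicit transform here is the mean aggregate in the x encoding, grouped by ranked_director. The sketch below uses made-up values to show the consequence for the "All Others" bar: it is the mean of the per-director averages, not the mean over all individual movies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labelled rows, one per director (values are made up).
labelled = [
    {"ranked_director": "C", "aggregate_gross": 900.0},
    {"ranked_director": "A", "aggregate_gross": 200.0},
    {"ranked_director": "All Others", "aggregate_gross": 50.0},
    {"ranked_director": "All Others", "aggregate_gross": 30.0},
]

# Collect the per-director averages that share a label, then average them.
groups = defaultdict(list)
for row in labelled:
    groups[row["ranked_director"]].append(row["aggregate_gross"])

bars = {label: mean(values) for label, values in groups.items()}
print(bars)
```

Directors in the top group each have a single row, so their bar equals their own average; only the "All Others" bar combines several rows.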

Datetime Timezone

Datetime columns are returned in the timezone reported by the vegafusion.get_local_tz() function. Unless it is overridden with vegafusion.set_local_tz(), this is the local timezone of the Python kernel.

For example:

import vegafusion as vf
import altair as alt
from vega_datasets import data

# Manually set the timezone to Seattle's, since this is a Seattle weather
# dataset
vf.set_local_tz("America/Los_Angeles")

source = data.seattle_weather()

chart = alt.Chart(source).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    x='month(date):O',
    y='count():Q',
    color='weather:N'
)
chart

Figure: stacked bar chart of record counts per month, colored by weather

tx_df = vf.transformed_data(chart, row_limit=5)
tx_df

   weather  month_date                 __count  __count_start  __count_end
0  drizzle  2012-01-01 00:00:00-08:00  10       114            124
1  rain     2012-01-01 00:00:00-08:00  35       41             76
2  sun      2012-01-01 00:00:00-08:00  33       0              33
3  snow     2012-01-01 00:00:00-08:00  8        33             41
4  rain     2012-02-01 00:00:00-08:00  40       33             73

tx_df.dtypes
weather                                       object
month_date       datetime64[ns, America/Los_Angeles]
__count                                        int64
__count_start                                  int64
__count_end                                    int64
dtype: object
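
The -08:00 suffix on the month_date values is the UTC offset of the configured timezone. As a small standard-library illustration (not VegaFusion code), the sketch below shows that local midnight on 2012-01-01 at UTC-8 is the same instant as 08:00 UTC. It uses a fixed UTC-8 offset as a simplified stand-in for America/Los_Angeles, which is valid in January but ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset: a simplified stand-in for America/Los_Angeles in winter.
pst = timezone(timedelta(hours=-8), "PST")

local_midnight = datetime(2012, 1, 1, 0, 0, tzinfo=pst)
as_utc = local_midnight.astimezone(timezone.utc)

print(local_midnight.isoformat())  # 2012-01-01T00:00:00-08:00
print(as_utc.isoformat())          # 2012-01-01T08:00:00+00:00
```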

Supported Transforms

Here is the current set of supported Vega-Lite/Vega transforms:

Unsupported Transforms

VegaFusion’s coverage of Vega transforms is not complete, but it is growing with each release. If a chart makes use of a transform that is not yet supported, an error will be raised by the transformed_data() function.

Note: Charts with unsupported transforms will still render properly using the mime and widget renderers as these transforms will be pushed to the client for evaluation by the Vega JavaScript library.
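
Since transformed_data() raises an error for unsupported transforms, calling code may want to handle that case explicitly. The following is only a sketch of one way to do so: the helper name try_transformed_data is hypothetical, the exception type raised by VegaFusion is not specified here (so the sketch catches Exception broadly), and a stand-in function is used in place of vf.transformed_data so the control flow can be shown without a real chart.

```python
def try_transformed_data(chart, extract):
    """Return (result, None) on success or (None, error message) on failure.

    `extract` is the function that does the work, for example
    vf.transformed_data in real use.
    """
    try:
        return extract(chart), None
    except Exception as err:
        # The exception type is not specified here, so catch broadly.
        return None, str(err)


# Stand-in extractor used only to demonstrate the failure path.
def fake_extract(chart):
    raise ValueError("unsupported transform")


result, error = try_transformed_data("chart placeholder", fake_extract)
print(result, error)
```

In real use, the second argument would be vf.transformed_data, and a failure could be followed by simply displaying the chart, since rendering does not depend on this function.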