Module model_selection (1.26.0)

Functions for test/train split and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection.

Classes

KFold

KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)

K-Fold cross-validator.

Splits the dataset into k consecutive folds to produce train/test sets.

Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set.

Parameters
Name (type): Description

n_splits (int): Number of folds. Must be at least 2. Defaults to 5.

random_state (Optional[int]): A seed to use for randomly choosing the rows of the split. If not set, a different random split is generated each time. Defaults to None.
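The fold logic described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name k_fold_indices is hypothetical, and shuffling the row positions before cutting them into folds is an assumption based on the random_state description.

```python
import random
from typing import Iterator, List, Optional, Tuple


def k_fold_indices(
    n_rows: int, n_splits: int = 5, random_state: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_rows:
        raise ValueError("n_splits cannot exceed the number of rows")

    indices = list(range(n_rows))
    # Assumption: rows are shuffled first; a fixed seed makes the shuffle repeatable.
    random.Random(random_state).shuffle(indices)

    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]                # this fold validates once
        train = indices[:start] + indices[stop:]  # the other k - 1 folds train
        yield train, test
        start = stop


folds = list(k_fold_indices(n_rows=10, n_splits=5, random_state=42))
for train, test in folds:
    print(len(train), len(test))
```

Every row position appears in exactly one test fold, and reusing the same seed reproduces the same folds.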

Functions

cross_validate

cross_validate(
    estimator,
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y: typing.Optional[
        typing.Union[
            bigframes.dataframe.DataFrame,
            bigframes.series.Series,
            pandas.core.frame.DataFrame,
            pandas.core.series.Series,
        ]
    ] = None,
    *,
    cv: typing.Optional[typing.Union[int, bigframes.ml.model_selection.KFold]] = None
) -> dict[str, list]

Evaluate metric(s) by cross-validation and also record fit/score times.

Parameters
Name (type): Description

X (bigframes.dataframe.DataFrame or bigframes.series.Series): The data to fit.

y (bigframes.dataframe.DataFrame, bigframes.series.Series or None): The target variable to predict in the case of supervised learning. Defaults to None.

cv (int, bigframes.ml.model_selection.KFold or None): Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- int, to specify the number of folds in a KFold,
- a bigframes.ml.model_selection.KFold instance.
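The three accepted forms of cv can be normalized into a single fold specification. The sketch below is illustrative only: FoldSpec is a hypothetical stand-in for a KFold instance, and resolve_cv is a hypothetical helper, not a function of this module.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class FoldSpec:
    """Hypothetical stand-in for a KFold instance; not the library class."""

    n_splits: int = 5
    random_state: Optional[int] = None


def resolve_cv(cv: Union[None, int, FoldSpec] = None) -> FoldSpec:
    """Map the three documented cv inputs onto one fold specification."""
    if cv is None:
        # None selects the default 5-fold cross-validation.
        return FoldSpec()
    if isinstance(cv, FoldSpec):
        # An existing splitter is used as-is.
        return cv
    if isinstance(cv, int) and not isinstance(cv, bool):
        # An int is interpreted as the number of folds.
        return FoldSpec(n_splits=cv)
    raise TypeError(f"Unsupported cv value: {cv!r}")


print(resolve_cv())
print(resolve_cv(3))
print(resolve_cv(FoldSpec(n_splits=4, random_state=7)))
```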

Returns
Type: Description

Dict[str, List]: A dict of arrays containing the score and time arrays for each scorer. The keys of this dict are:
- test_score: the score array for test scores on each cv split.
- fit_time: the time for fitting the estimator on the train set for each cv split.
- score_time: the time for scoring the estimator on the test set for each cv split.
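To show the shape of the returned dict, here is a small self-contained sketch of a cross-validation loop that records the three keys listed above. It uses plain Python lists rather than BigQuery DataFrames. MeanRegressor and cross_validate_sketch are hypothetical names, the folds are taken consecutively without shuffling for simplicity, and the scoring rule (negative mean squared error) is an assumption chosen for the example.

```python
import time
from typing import Dict, List


class MeanRegressor:
    """Hypothetical toy estimator: predicts the mean of the training targets."""

    def fit(self, X: List[List[float]], y: List[float]) -> "MeanRegressor":
        self.mean_ = sum(y) / len(y)
        return self

    def score(self, X: List[List[float]], y: List[float]) -> float:
        # Negative mean squared error, so that a higher score is better.
        return -sum((target - self.mean_) ** 2 for target in y) / len(y)


def cross_validate_sketch(
    estimator, X: List[List[float]], y: List[float], n_splits: int = 5
) -> Dict[str, List[float]]:
    """Fit and score once per fold, recording scores and wall-clock times."""
    results: Dict[str, List[float]] = {
        "test_score": [],
        "fit_time": [],
        "score_time": [],
    }
    n_rows = len(X)
    for fold in range(n_splits):
        # Consecutive, unshuffled folds (a simplification for this sketch).
        start = fold * n_rows // n_splits
        stop = (fold + 1) * n_rows // n_splits
        X_train, y_train = X[:start] + X[stop:], y[:start] + y[stop:]
        X_test, y_test = X[start:stop], y[start:stop]

        t0 = time.perf_counter()
        estimator.fit(X_train, y_train)
        t1 = time.perf_counter()
        score = estimator.score(X_test, y_test)
        t2 = time.perf_counter()

        results["test_score"].append(score)
        results["fit_time"].append(t1 - t0)
        results["score_time"].append(t2 - t1)
    return results


X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
results = cross_validate_sketch(MeanRegressor(), X, y, n_splits=5)
print(sorted(results))
```

Each list has one entry per fold, so averaging results["test_score"] gives a single summary score for the estimator.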

train_test_split

train_test_split(
    *arrays: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    test_size: typing.Optional[float] = None,
    train_size: typing.Optional[float] = None,
    random_state: typing.Optional[int] = None,
    stratify: typing.Optional[bigframes.series.Series] = None
) -> typing.List[typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]

Splits dataframes or series into random train and test subsets.

Parameters
Name (type): Description

*arrays (bigframes.dataframe.DataFrame or bigframes.series.Series): A sequence of BigQuery DataFrames or Series that can be joined on their indexes.

test_size (default None): The proportion of the dataset to include in the test split. If None, it defaults to the complement of train_size. If both are None, it is set to 0.25.

train_size (default None): The proportion of the dataset to include in the train split. If None, it defaults to the complement of test_size.

random_state (default None): A seed to use for randomly choosing the rows of the split. If not set, a different random split is generated each time.

Returns
Type: Description

List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]: A list of BigQuery DataFrames or Series.
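The defaulting rules for test_size and train_size can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: resolve_split_sizes and split_rows are hypothetical helpers, the input is an ordinary list rather than a BigQuery DataFrame, and row counts are rounded to the nearest integer.

```python
import random
from typing import List, Optional, Sequence, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_fraction, test_fraction) after applying the defaults."""
    if test_size is None and train_size is None:
        test_size = 0.25                # both unset: test split gets 25%
    if test_size is None:
        test_size = 1.0 - train_size    # complement of train_size
    if train_size is None:
        train_size = 1.0 - test_size    # complement of test_size
    for name, value in (("test_size", test_size), ("train_size", train_size)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size and test_size must not sum to more than 1")
    return train_size, test_size


def split_rows(
    rows: Sequence,
    test_size: Optional[float] = None,
    train_size: Optional[float] = None,
    random_state: Optional[int] = None,
) -> Tuple[List, List]:
    """Shuffle the rows with an optional seed, then cut them into train and test."""
    train_fraction, test_fraction = resolve_split_sizes(test_size, train_size)
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


train_rows, test_rows = split_rows(range(8), random_state=0)
print(len(train_rows), len(test_rows))
```

Passing the same random_state reproduces the same split, while leaving it unset gives a different split on each call.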