pyspark.sql.functions.split
- pyspark.sql.functions.split(str, pattern, limit=-1)
Splits str around matches of the given pattern.
New in version 1.5.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - str : Column or str
    a string expression to split.
  - pattern : Column or str
    a string representing a regular expression. The regex string should be a Java regular expression.
  - limit : Column or str or int
    an integer which controls the number of times pattern is applied.
    - limit > 0: the resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched pattern.
    - limit <= 0: pattern will be applied as many times as possible, and the resulting array can be of any size.
    Changed in version 3.0: split now takes an optional limit field. If not provided, the default limit value is -1.
    Changed in version 4.0.0: pattern now accepts a Column. A column name is not accepted, because a plain string is still interpreted as a regular expression for backwards compatibility. In addition to int, limit now accepts a Column and a column name.
- Returns
  - Column
    array of separated strings.
Examples
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select(sf.split(df.s, '[ABC]', 2).alias('s')).show()
+-----------------+
|                s|
+-----------------+
|[one, twoBthreeC]|
+-----------------+

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select(sf.split(df.s, '[ABC]', -1).alias('s')).show()
+-------------------+
|                  s|
+-------------------+
|[one, two, three, ]|
+-------------------+

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame(
...     [('oneAtwoBthreeC', '[ABC]'), ('1A2B3C', '[1-9]+'), ('aa2bb3cc4', '[1-9]+')],
...     ['s', 'pattern']
... )
>>> df.select(sf.split(df.s, df.pattern).alias('s')).show()
+-------------------+
|                  s|
+-------------------+
|[one, two, three, ]|
|        [, A, B, C]|
|     [aa, bb, cc, ]|
+-------------------+

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame(
...     [('oneAtwoBthreeC', '[ABC]', 2), ('1A2B3C', '[1-9]+', -1)],
...     ['s', 'pattern', 'expected_parts']
... )
>>> df.select(sf.split(df.s, df.pattern, df.expected_parts).alias('s')).show()
+-----------------+
|                s|
+-----------------+
|[one, twoBthreeC]|
|      [, A, B, C]|
+-----------------+
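
The 4.0.0 change noted under pattern has a practical consequence the examples above do not spell out: a plain string passed as pattern is always interpreted as a regular expression, never as a column name. The sketch below is illustrative rather than part of the upstream documentation; it reuses the spark session from the examples above, invents its own sample data, and uses pyspark.sql.functions.col to reference the pattern column.

>>> import pyspark.sql.functions as sf
>>> # Illustrative data: the second column holds a per-row regex.
>>> df = spark.createDataFrame([('a1b2c3', '[0-9]')], ['s', 'pattern'])
>>> # The string 'pattern' is treated as a regex matching the literal text
>>> # "pattern", not as a reference to the column named 'pattern',
>>> # so nothing is split here.
>>> df.select(sf.split(df.s, 'pattern').alias('s')).show()
+--------+
|       s|
+--------+
|[a1b2c3]|
+--------+
>>> # Passing a Column applies each row's own regex ('[0-9]' here).
>>> df.select(sf.split(df.s, sf.col('pattern')).alias('s')).show()
+-----------+
|          s|
+-----------+
|[a, b, c, ]|
+-----------+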