pyspark.sql.functions.split

pyspark.sql.functions.split(str, pattern, limit=-1)

Splits str around matches of the given pattern.

New in version 1.5.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

str : Column or str
    a string expression to split

pattern : Column or str
    a string representing a regular expression. The regex string should be a Java regular expression.

limit : Column or str or int
    an integer which controls the number of times pattern is applied.

    • limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched pattern.
    • limit <= 0: pattern will be applied as many times as possible, and the resulting array can be of any size.

Changed in version 3.0: split now takes an optional limit argument. If not provided, the default limit value is -1.

Changed in version 4.0.0: pattern now accepts a Column. It does not accept a column name, because plain strings remain interpreted as regular-expression literals for backwards compatibility. In addition to int, limit now accepts a Column and a column name (see the final example below).

Returns

Column
    array of separated strings.

Examples

Example 1: split with a constant pattern and a positive limit.

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select(sf.split(df.s, '[ABC]', 2).alias('s')).show()
+-----------------+
|                s|
+-----------------+
|[one, twoBthreeC]|
+-----------------+

Example 2: split with limit -1, so the pattern is applied as many times as possible; the trailing empty string appears because the input ends with a match.

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select(sf.split(df.s, '[ABC]', -1).alias('s')).show()
+-------------------+
|                  s|
+-------------------+
|[one, two, three, ]|
+-------------------+

Example 3: split with a per-row pattern taken from another column, which accepts a Column since 4.0.0.

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame(
...     [('oneAtwoBthreeC', '[ABC]'), ('1A2B3C', '[1-9]+'), ('aa2bb3cc4', '[1-9]+')],
...     ['s', 'pattern']
... )
>>> df.select(sf.split(df.s, df.pattern).alias('s')).show()
+-------------------+
|                  s|
+-------------------+
|[one, two, three, ]|
|        [, A, B, C]|
|     [aa, bb, cc, ]|
+-------------------+

Example 4: split with a per-row limit taken from another column.

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame(
...     [('oneAtwoBthreeC', '[ABC]', 2), ('1A2B3C', '[1-9]+', -1)],
...     ['s', 'pattern', 'expected_parts']
... )
>>> df.select(sf.split(df.s, df.pattern, df.expected_parts).alias('s')).show()
+-----------------+
|                s|
+-----------------+
|[one, twoBthreeC]|
|      [, A, B, C]|
+-----------------+
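
Example 5 (a minimal sketch, assuming the 4.0.0 semantics described above): a constant pattern can still be passed as a Column by wrapping it in sf.lit, and limit can be given as a column name string instead of a Column.

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame(
...     [('oneAtwoBthreeC', 2), ('1A2B3C', -1)],
...     ['s', 'expected_parts']
... )
>>> df.select(sf.split(df.s, sf.lit('[ABC]'), 'expected_parts').alias('s')).show()
+-----------------+
|                s|
+-----------------+
|[one, twoBthreeC]|
|      [1, 2, 3, ]|
+-----------------+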