Python Regex Split String Using re.split()

Vinay Khatri
Last updated on June 29, 2022

    The Python regular expression module re provides a split() method that splits a string into a list of substrings based on a regular expression pattern. The re.split() method is similar to the Python string's split() method, but it is more flexible and powerful: instead of a fixed separator, it splits the string at every place where the regex pattern matches.

    This tutorial discusses the re.split() method in detail with the help of some examples. By the end of this article, you will have a solid understanding of how to use re.split() to split a string in Python.

    Here is a short overview of the operations that we are going to cover.

    Operation                                Description
    re.split(pattern, string)                Splits the string into a list of substrings at each occurrence of the pattern.
    re.split(pattern, string, maxsplit=3)    Performs at most 3 splits; the rest of the string becomes the last element.
    re.split(r'a|b', string)                 Splits the string at either a or b.
    re.split(r'(a|b)', string)               Splits the string at either a or b and also includes the separators in the result.

    How to use the Regex re.split() function in Python?

    Similar to the string's split() method, we use the re.split() method to split a string into a list based on a separator. In re.split(), the separator is a regular expression pattern, and the function splits the target string at each occurrence of that pattern.

    re.split() Function Syntax

    import re
    re.split(pattern, string, maxsplit=0, flags=0)

    The split() method has only two mandatory arguments, pattern and string. The other two, maxsplit and flags, are optional.

    Arguments

    pattern: (regular expression): The regular expression that is used as the separator for the split.

    string: (str): The target string that we want to split.

    maxsplit: (int): An optional argument that sets the maximum number of splits. Its default value is 0, which means the string is split at every occurrence of the pattern. We can set it to a positive integer to limit the number of splits.

    flags: (regex flags): An optional argument that modifies how the pattern is matched. Its default value is 0, which means no flags are applied. We can set it to something like re.I for case-insensitive matching, or re.A for ASCII-only matching.
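
    The following is a minimal sketch of the four arguments working together. The sample text and the variable names are made up for illustration; maxsplit and flags are passed by keyword here simply because that makes the call easier to read.

```python
import re

text = "one, two, THREE and four"

# Only the two mandatory arguments: split at every comma plus optional spaces.
basic = re.split(r",\s*", text)
print(basic)      # ['one', 'two', 'THREE and four']

# flags=re.IGNORECASE lets the upper-case "AND" in the pattern match "and".
all_parts = re.split(r",\s*|\s+AND\s+", text, flags=re.IGNORECASE)
print(all_parts)  # ['one', 'two', 'THREE', 'four']

# maxsplit=2 stops after two splits; the remainder stays in one piece.
limited = re.split(r",\s*|\s+AND\s+", text, maxsplit=2, flags=re.IGNORECASE)
print(limited)    # ['one', 'two', 'THREE and four']
```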

    Return Value of re.split() method

    The re.split() method returns a list of the substrings obtained by using each occurrence of the pattern as a separator. If the regular expression does not match anywhere in the target string, re.split() returns a list containing the target string as its only element.

    Note: If the pattern contains a capturing group (parentheses), the text matched by that group is also included in the returned list, so the separators appear between the split substrings.
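
    As a small illustration of the return value, the sketch below uses made-up sample strings. It also shows one behaviour worth knowing: when the separator matches at the very start or end of the string, the list contains empty strings at those positions.

```python
import re

# No match: the whole string comes back as the only element.
no_match = re.split(r";", "no semicolons here")
print(no_match)   # ['no semicolons here']

# Separator at the start and at the end: empty strings mark the edges.
edges = re.split(r",", ",a,b,")
print(edges)      # ['', 'a', 'b', '']

# Dropping the empty strings, if they are not wanted.
cleaned = [part for part in edges if part]
print(cleaned)    # ['a', 'b']
```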

    Examples

    Example 1: Use the re.split() method to split a string into words.

    To split a string into a list of words, we have to separate the individual words at the white space between them. This means we need a white space pattern as the separator for the re.split() method. The regular expression \s matches a single white space character, and \s+ matches one or more consecutive white space characters. We will use \s+ as the pattern to split our string into a list of words.

    import re
    
    #pattern sequence for multiple white space
    pattern = r'\s+'
    
    #targeted string
    string = "Hello Welcome to TechGeekBuzz  Python RegEX   Tutorial"
    
    #split the string by white spaces
    word_list  =  re.split(pattern, string)
    
    print(word_list)

    Output

    ['Hello', 'Welcome', 'to', 'TechGeekBuzz', 'Python', 'RegEX', 'Tutorial']

    How to limit the number of splits?

    The re.split() method accepts a maxsplit argument that sets a limit on the number of splits. By default, the value of maxsplit is 0, which means the string is split at every occurrence of the pattern. We can also set it to a positive integer to limit the number of splits.

    For instance, if we set the maxsplit argument to 3, then at most three splits are performed on the string. Let's see the maxsplit argument in action with an example. Say we have a string that contains details in id-name-DD-MM-YY format, and we need to split it into a list of id, name, and DD-MM-YY.

    Here we only want to split the string at the first two hyphens.

    import re
    
    #pattern sequence for hyphen or non alphanumeric character
    pattern = r'\W+'
    
    #targeted string
    string = "10-Rahul-23-09-1999"
    
    #split the string by first 2 hyphens
    detail  =  re.split(pattern, string, maxsplit=2)
    
    print(detail)

    Output

    ['10', 'Rahul', '23-09-1999']

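    In the above example, \W+ is the regular expression pattern for one or more non-word characters, that is, characters that are not letters, digits, or underscores. As - is a non-word character, \W+ matches it. Because maxsplit is 2, the hyphens in the date are left untouched.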
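
    To make the effect of maxsplit easier to see, here is a short sketch that runs the same split with several limits. It reuses the sample record from above with a plain hyphen pattern; the loop and the variable names are only for illustration.

```python
import re

record = "10-Rahul-23-09-1999"

# Run the same split with different limits; 0 means "no limit".
results = {limit: re.split(r"-", record, maxsplit=limit) for limit in (0, 1, 2, 3)}

for limit, parts in results.items():
    print(limit, parts)
# 0 ['10', 'Rahul', '23', '09', '1999']
# 1 ['10', 'Rahul-23-09-1999']
# 2 ['10', 'Rahul', '23-09-1999']
# 3 ['10', 'Rahul', '23', '09-1999']
```

    With a positive maxsplit of n, the returned list has at most n + 1 elements.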

    How to split a string that has multiple delimiter characters?

    With the Python string's split() method we can split a string by one fixed delimiter, but with the re.split() method we can split a string that has several different separators or delimiter characters.

    For instance, say we have a string in the format 'Name,Department,Salary,DD-MM-YY', and we want to extract the information from it as the list [Name, Department, Salary, DD, MM, YY]. How would we do that?

    The answer is simple: we use the re.split() method, and it returns a list with the required items. In the given string there are two delimiters, comma (,) and hyphen (-).

    So the pattern has to be a regular expression that matches either a comma or a hyphen.

    Solution 1

    import re
    
    #pattern sequence for hyphen or comma matching
    pattern = r',|-'
    
    #targeted string
    string = "Rahul,Sales,20000,23-09-1999"
    
    #split the string by commas or hyphen
    detail  =  re.split(pattern, string)
    
    print(detail)

    Output

    ['Rahul', 'Sales', '20000', '23', '09', '1999']

    In the above example, the pattern r',|-' is a raw string whose regular expression matches either , or -. For this particular string, we could also perform the same task using the r'\W+' pattern.

    import re
    
    #pattern sequence for non-alphanumeric characters
    pattern = r'\W+'
    
    #targeted string
    string = "Rahul,Sales,20000,23-09-1999"
    
    #split the string by non alphanumeric characters
    detail  =  re.split(pattern, string)
    
    print(detail)

    Output

    ['Rahul', 'Sales', '20000', '23', '09', '1999']
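
    The two patterns give the same result for this string, but they are not interchangeable in general. The sketch below, which uses made-up sample strings, shows two differences: r',|-' produces an empty string between consecutive delimiters, and r'\W+' also splits at characters such as @ and the dot. A character class like r'[,-]+' is one possible middle ground.

```python
import re

doubled = "Rahul,,Sales-20000"

# ',|-' matches one delimiter at a time, so ',,' leaves an empty string.
single = re.split(r",|-", doubled)
print(single)        # ['Rahul', '', 'Sales', '20000']

# '[,-]+' treats a run of commas and hyphens as one separator.
run = re.split(r"[,-]+", doubled)
print(run)           # ['Rahul', 'Sales', '20000']

contact = "rahul@example.com,Sales"

# '\W+' splits at every non-word character, including '@' and '.'.
non_word = re.split(r"\W+", contact)
print(non_word)      # ['rahul', 'example', 'com', 'Sales']

# '[,-]+' only splits at the delimiters we listed.
listed_only = re.split(r"[,-]+", contact)
print(listed_only)   # ['rahul@example.com', 'Sales']
```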

    How to split the string and include the separator as well?

    In the re.split() method, if the regular expression pattern is enclosed in a capturing group, that is, in parentheses (), the method also includes the matched separators in the returned list.

    Example: Let's repeat the above example, where we split a string in the Name,Department,Salary,DD-MM-YY format. This time we enclose \W+ in a capturing group, as (\W+).

    import re
    
    #pattern sequence for non-alphanumeric characters as capturing group
    pattern = r'(\W+)'
    
    #targeted string
    string = "Rahul,Sales,20000,23-09-1999"
    
    #split the string by non alphanumeric characters and include them in the list
    detail  =  re.split(pattern, string)
    
    print(detail)

    Output

    ['Rahul', ',', 'Sales', ',', '20000', ',', '23', '-', '09', '-', '1999']

    In this output, you can see that specifying the pattern as the capturing group (\W+) not only splits the string at each match but also includes the matched separators in the list. This type of split comes in very handy when we want both the split substrings and the separators.
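
    Sometimes parentheses are needed only for grouping, and the separators should not appear in the result. In that case a non-capturing group, written (?:...), can be used instead. The sketch below compares the two forms on a made-up string and also shows that joining the pieces of a capturing split gives back the original string.

```python
import re

date_text = "23-09,1999"

# Capturing group: the matched separators are kept in the list.
with_separators = re.split(r"(,|-)", date_text)
print(with_separators)     # ['23', '-', '09', ',', '1999']

# Non-capturing group: same grouping, separators are dropped.
without_separators = re.split(r"(?:,|-)", date_text)
print(without_separators)  # ['23', '09', '1999']

# Because nothing was discarded, the capturing result can be joined back.
rebuilt = "".join(with_separators)
print(rebuilt == date_text)  # True
```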

    Flags argument in the re.split() method

    The last argument of the split() method is flags. It is an optional argument whose default value is 0, which means no flags are applied. The Python re module provides several flags that change how a pattern is matched. Let's say we have a string that contains only digits and letters, and we wish to split the string at the letters.

    import re
    
    #pattern sequence lowercase alphabets
    pattern = r'[a-z]+'
    
    #targeted string
    string = "2a2bdh3HjdhHH8jd9pD3"
    
    #split the string by given pattern
    result  =  re.split(pattern, string)
    
    print(result)

    Output

    ['2', '2', '3H', 'HH8', '9', 'D3']

    In this example, the pattern [a-z]+ matches only the lowercase letters and leaves the uppercase ones in the result. But we want to split the string into numbers regardless of the case of the letters.

    In such cases, instead of defining a new pattern such as [a-z]|[A-Z] (which will also do the trick), we can set the flags argument to re.I, which ignores case while matching the pattern for splitting.

    import re
    
    #pattern sequence lowercase alphabets
    pattern = r'[a-z]+'
    
    #targeted string
    string = "2a2bdh3HjdhHH8jd9pD3"
    
    #split the string by given pattern and ignore case
    result  =  re.split(pattern, string, flags=re.I)
    
    print(result)

    Output

    ['2', '2', '3', '8', '9', '3']
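
    There are a few other ways to get the same case-insensitive split. The sketch below shows three that should be equivalent for this string: a pre-compiled pattern created with re.compile(), an explicit character class, and the inline (?i) flag. The variable names are illustrative.

```python
import re

string = "2a2bdh3HjdhHH8jd9pD3"

# Compile once with the flag, then call split() on the pattern object.
letters = re.compile(r"[a-z]+", re.I)
compiled_result = letters.split(string)

# Spell out both cases in the character class; no flag needed.
explicit_result = re.split(r"[a-zA-Z]+", string)

# Put the flag inside the pattern itself.
inline_result = re.split(r"(?i)[a-z]+", string)

print(compiled_result)  # ['2', '2', '3', '8', '9', '3']
print(compiled_result == explicit_result == inline_result)  # True
```

    Compiling the pattern is mainly useful when the same pattern is applied to many strings.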

    Difference between String’s split() and regex split() methods

    String split(): Splits the string into a list of substrings by a single fixed delimiter or separator (or by runs of white space when no separator is given). The separator cannot be included in the resulting list.

    RegEx split(): Can split a string into a list of substrings by multiple delimiters or separators. The separators can be included in the resulting list by using capturing groups.

    Example: string split()

    string = "Hello-World, Welcome to TechgeekBuzz.com"
    
    #split by white space
    result = string.split()
    
    print(result)

    Output

    ['Hello-World,', 'Welcome', 'to', 'TechgeekBuzz.com']

    Example: regex split()

    import re
    
    string = "Hello-World, Welcome to TechgeekBuzz.com"
    
    #split by hyphen, comma, dot and white space
    #(inside [] the | character is literal, so it is not used as a separator between the items)
    result = re.split(r"[-,.\s]+", string)
    
    print(result)

    Output

    ['Hello', 'World', 'Welcome', 'to', 'TechgeekBuzz', 'com']

    Some common examples of re.split() function

    Example 1: Split the string by five delimiters

    To split a string by five delimiters, we can put all five delimiters in a character class, that is, inside square brackets [].

    import re
    
    string = "Hello-World;Welcome,to TechgeekBuzz.com"
    
    #split by hyphen, comma, white space, dot and semicolon (raw string keeps \s intact)
    result = re.split(r"[-,\s.;]+", string)
    
    print(result)

    Output

    ['Hello', 'World', 'Welcome', 'to', 'TechgeekBuzz', 'com']

    Example 2: Split the string by specific words

    Let's split a string into a list at the words “and” or “or”.

    import re
    
    string = "12or1315or16and17and18"
    
    #split by and or
    result = re.split("and|or", string)
    
    print(result)

    Output

    ['12', '1315', '16', '17', '18']
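
    The pattern above works because the string contains only digits around the words. In ordinary text, "and" and "or" can also appear inside longer words, so a bare and|or pattern may split in unexpected places. One possible approach, sketched below with a made-up sentence, is to require white space on both sides of the word.

```python
import re

sentence = "red and orange or sandy brown"

# Bare alternation also matches the "or" in "orange" and the "and" in "sandy".
naive = re.split(r"and|or", sentence)
print(naive)  # ['red ', ' ', 'ange ', ' s', 'y brown']

# Requiring white space around the word splits only at the standalone words.
whole_words = re.split(r"\s+(?:and|or)\s+", sentence)
print(whole_words)  # ['red', 'orange', 'sandy brown']
```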

    Conclusion

    This Python tutorial discussed the Python re.split() method. The call re.split(pattern, string, maxsplit=0, flags=0) splits the given string into a list, using each match of the given pattern as a separator.

    In this method, pattern and string are the mandatory arguments, whereas maxsplit and flags are optional. The re.split() method is more powerful and flexible than the normal string split() method, which can only split a string by a fixed separator.
