Removing quotes from CSV created by PowerShell

In a previous post, I demonstrate how to reshape a comma delimited file (CSV) using PowerShell. I noted that the results put all of the column values in double quotes:

"ID","Name","State"
"5","Stephanie","Arizona"
"4","Melanie","Oregon"
"2","Katie","Texas"

I used this code to produce this result:

cd c:\MyFolder
Import-Csv -Path SampleData.csv|Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  |Sort-Object State `
  |Export-Csv SampleDataSorted.csv `
  -NoTypeInformation; 

In this post, I remove the double quotes.

A Microsoft blog suggests one way to remove these double quotes. They suggest using the ConvertTo-CSV cmdlet rather than Export-CSV and doing a string replace of the double quotes with empty string (code from the article).

dir c:\fso -Filter *.csv | ? {$_.basename -like 'users?'} `
 | Import-Csv `
 |  sort lname,fname `
 | convertto-csv -NoTypeInformation `
 | % { $_ -replace '"', ""} `
 | out-file c:\fso\usersconsolidated.csv -fo -en ascii

I convert my command to:

cd c:\MyFolder
Import-Csv -Path SampleData.csv `
  | Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  | Sort-Object State `
  | ConvertTo-CSV -NoTypeInformation `
  | % {$_ -replace '"',""} `
  | Out-File SampleDataSorted.csv -fo -en ascii
  ; 
Import-Csv -Path SampleData.csv Read source csv
Where {$_.Country -eq "USA"} Filter rows
SELECT ID, Name, State Set output columns
ConvertTo-CSV -NoTypeInformation Create CSV output
% {$_ -replace '"',""} For each row, replace double quotes with empty string
Out-File SampleDataSorted.csv -fo -en ascii Output to file

This produces the results I expect.

ID,Name,State
5,Stephanie,Arizona
4,Melanie,Oregon
2,Katie,Texas

But here is the problem. ETL scenarios can be messy. CSV files frequently need the double quotes. When a column contains a comma, you need to put the quotes around the column value. For example a file like this is problematic.

ID,Name,State,Country
1,Russ,Sandys,Bermuda
3,Gail,Sandys,Bermuda
4,Melanie,Oregon,USA
5,Stephanie,Arizona,USA
6,"Katie, ""Scott"", Arianna",Texas,USA

I want to keep the quotes around the second column that last row.

Regular expressions to the rescue. The replace command can do regular expression replacement without any additional work. That isn’t to say that building the regular expression isn’t painful.

cd c:\MyFolder
Import-Csv -Path SampleData.csv `
  | Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  | Sort-Object State `
  | ConvertTo-CSV -NoTypeInformation `
  | % {$_ -replace  `
  '\G(?^|,)(("(?[^,"]*?)"(?=,|$))|(?".*?(?

Let's break down the expression. The first expression (which I will explain in a moment) is the expression to be matched. The second ('${start}${output}'} is what that expression is replaced with. I use named captures (start, output) to identify what I want the expression to be replaced with. One way to view named captures is like variables. If the pattern is matched then assign the text to that cvariable. Note that the single quotes around this expression are important http://www.johndcook.com/blog/powershell_perl_regex/#capture. So every match that is found will be replaced with the capture called "start" followed by the capture called "output."

The first expression is interpreted as follows.

\G
Must occur after the previous match. That means that the expression won't match if there isn't an uninterrupted sequences of matches, one right after the other.

(?<start>^|,)
This defines the "start" capture, which is either the beginning of the string (^) or comma

((...)|(...))
Try to match first expression and if it fails, try the second. The pipe | between the two expressions in parentheses means to try the first pattern and as soon as it fails, try the second pattern.

("(?<output>[^,"]*?)"(?=,|$))
  ("(?<output>[^,"]*?)"(?=,|$))
Starts and ends with double quote
  ("(?<output>[^,"]*?)"(?=,|$))
After the second double quote, a comma must be present or it must be the end of the string. The construction ?= is not part of the match. That is important since I need to use that comma in the next match.
  ("(?<output>[^,"]*?)"(?=,|$))
This is what will be put in the "output" capture. The way I understand this is that there is a variable called "output." If this pattern matches, then the "output" variable will be assigned the text that is found (represented by the pattern in the rest of the parentheses). The quotes will not be included in my replace statement above if this pattern is matched because they are outside of the "output" parentheses.
  ("(?<output>[^,"]*?)"(?=,|$))
This is a sequence of zero or more characters that are not comma or double quote. The * means to get zero or more. The ? after the * prevents the match from going to the end of the string. This is what is assigned to the "output" capture (variable) if the pattern is matched.

(?<output>".*?(?<!")("")*?"(?=,|$))
This is the pattern that is matched if the first fails. Note that the whole pattern, including the surrounding double quotes, is in the "output" capture and thus will be output by the replace statement.
  (?<output>".*?(?<!")("")*?"(?=,|$))
Match every character. It will stop if it finds a double quote followed by the comma or end of string.
  (?<output>".*?(?<!")("")*?"(?=,|$))
This handles the case where there are two double quotes before a comma. I want to leave them alone. If there are an even number of double quotes before the comma then continue looking for the end of the column.
  (?<output>".*?(?<!")("")*?"(?=,|$))
Double quotes cannot precede the next pattern. What I am trying to avoid is interpreting exactly two (or any even number of) double quotes preceding a comma being interpreted as the end of the match. If there are two quotes, then treat the comma as part of the current column. If there is one quote, then treat the comma as the break between two columns.
  (?<output>".*?(?<!")("")*?"(?=,|$))
This represents a sequence of two double quotes. Basically, if there are an even number of double quotes before a comma then treat that comma as part of the current column and continue looking for the next sequence of double quotes followed by comma to end the column.

Let me know if I can improve my regular expression. And definitely post any questions.

Leave a Reply

%d