Warm tip: This article is reproduced from serverfault.com, please click

Call Python script per PowerShell & passing PSObject and return the parsed data

发布于 2020-11-28 11:21:57

some background: currently I am querying 4Mio rows (with 50 columns) from a MS SQL server with dbatools into a PSObject (in Batch 10.000 rows each query), processing the data with PowerShell (a lot of RegEx stuff) and writing back into a MariaDb with SimplySql. In average i get approx. 150 rows/sec. Had to use a lot of tricks (Net's Stringbuilder etc.) for this performance, its not that bad imho

As new requirements I want to detect the language of some text cells and I have to remove personal data (name & address). I found some good python libs (spacy and pycld2) for that purpose. I made tests with pycld2 - pretty good detection.

Simplified code for clarification (hint:I am a python noob):

#get data from MS SQL
$data = Invoke-DbaQuery -SqlInstance $Connection -Query $Query -As PSObject -QueryTimeout 1800
for ($i=0;$i -lt $data.length;$i++){
  #do a lot of other stuff here
  #...
  #finally make lang detection
  if ($LangDetect.IsPresent){
    $strLang = $tCaseDescription -replace "([^\p{L}\p{N}_\.\s]|`t|`n|`r)+",""
    $arg = "import pycld2 as cld2; isReliable, textBytesFound, details = cld2.detect('" + $strLang + "', isPlainText = True, bestEffort = True);print(details[0][1])"
    $tCaseLang = & $Env:Programfiles\Python39\python.exe -c $arg
  } else {
    $tCaseLang = ''
  }
}
#write to MariaDB
Invoke-SqlUpdate -ConnectionName $ConnectionName -Query $Query

This python call each time works, but it destroys the performance (12rows/sec) due the loop-call and importing pycld2 lib each time. So, this is a lame solution :) In addition, as mentioned above - I want to use spacy - where some more columns has to parsed for getting rid of the personal data.

I am not sure, if I have the mood to convert the whole PS Parser to python :|

I believe, a better solution might be to pass the whole PSObject from PowerShell to python (before the PS loop starts) and return it as well as PSObject - after it has been processed in python - but I don't know, how I can realize this with python / python function.

What would be your approach/suggestions, any other ideas? Thanks :)

Questioner
TefoD
Viewed
0
mklement0 2020-11-30 05:30:03

The following simplified example shows you how you can pass multiple [pscustomobject] ([psobject]) instances from PowerShell to a Python script (passed as a string via -c in this case):

  • by using JSON as the serialization format, via ConvertTo-Json...

  • ... and passing that JSON via the pipeline, which Python can read via stdin (standard input).

Important:

  • Character encoding:

    • PowerShell uses the encoding specified in the $OutputEncoding preference variable when sending data to external programs (such as Python), which commendably defaults to BOM-less UTF-8 in PowerShell [Core] v6+, but regrettably to ASCII(!) in Windows PowerShell.

    • Just like PowerShell limits you to sending text to an external program, it also invariably interprets what it receives as text, namely based on the encoding stored in [Console]::OutputEncoding; regrettably, both PowerShell editions as of this writing default to the system's OEM code page.

    • To both send and receive (BOM-less) UTF-8 in both PowerShell editions, (temporarily) set $OutputEncoding and [Console]::OutputEncoding as follows:
      $OutputEncoding = [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new($false)

  • If you want your Python script to also output objects, again consider using JSON, which on the PowerShell you can parse into objects with ConvertFrom-Json.

# Sample input objects.
$data = [pscustomobject] @{ one = 1; two = 2 }, [pscustomobject] @{ one = 10; two = 20 }

# Convert to JSON and pipe to Python.
ConvertTo-Json $data | python -c @'

import sys, json

# Parse the JSON passed via stdin into a list of dictionaries.
dicts = json.load(sys.stdin)

# Sample processing: print the 'one' entry of each dict.
for dict in dicts:
  print(dict['one'])

'@

If the data to pass is a collection of single-line strings, you don't need JSON:

$data = 'foo', 'bar', 'baz'

$data | python -c @'

import sys

# Sample processing: print each stdin input line enclosed in [...]
for line in sys.stdin:
  print('[' + line.rstrip('\r\n') + ']')

'@