Note: This article is reproduced from serverfault.com.

apache spark - Pyspark error when I try to modify a column based on a condition with when/otherwise

Posted on 2020-11-29 12:24:47

I am using PySpark 3.0.1.

I want to modify a column when a condition is met; otherwise, I want to keep the value it already has.

df.printSchema()

root
 |-- ID: decimal(4,0) (nullable = true)
 |-- Provider: string (nullable = true)
 |-- Principal: float (nullable = false)
 |-- PRINCIPALBALANCE: float (nullable = true)
 |-- STATUS: integer (nullable = true)
 |-- Installment Rate: float (nullable = true)
 |-- Yearly Percentage: float (nullable = true)
 |-- Processing Fee Percentage: double (nullable = true)
 |-- Disb Date: string (nullable = true)
 |-- ZOHOID: integer (nullable = true)
 |-- UPFRONTPROCESSINGFEEBALANCE: float (nullable = true)
 |-- WITHHOLDINGTAXBALANCE: float (nullable = true)
 |-- UPFRONTPROCESSINGFEEPERCENTAGE: float (nullable = true)
 |-- UPFRONTPROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
 |-- PROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
 |-- PROCESSINGFEEVATPERCENTAGE: float (nullable = true)
 |-- BUSINESSSHORTCODE: string (nullable = true)
 |-- EXCTRACTIONDATE: timestamp (nullable = true)
 |-- fake Fee: double (nullable = false)
 |-- fake WHT: string (nullable = true)
 |-- fake Fee_WHT: string (nullable = true)
 |-- Agency Fee CP: string (nullable = true)
 |-- Agency VAT CP: string (nullable = true)
 |-- Agency WHT CP: string (nullable = true)
 |-- Agency Fee_VAT_WHT CP: string (nullable = true)

df.head(1)

[Row(ID=Decimal('16'), Provider='fake', Principal=2000.01, PRINCIPALBALANCE=0.2, STATUS=4, Installment Rate=0.33333333, Yearly Percentage=600.0, Processing Fee Percentage=0.20, Disb Date=None, ZOHOID=3000, UPFRONTPROCESSINGFEEBALANCE=None, WITHHOLDINGTAXBALANCE=None, UPFRONTPROCESSINGFEEPERCENTAGE=None, UPFRONTPROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEVATPERCENTAGE=16.0, BUSINESSSHORTCODE='20005', EXCTRACTIONDATE=datetime.datetime(2020, 11, 25, 5, 7, 58, 6000), fake Fee=1770.7, fake WHT='312.48', fake Fee_WHT='2,083.18', Agency Fee CP='566.62', Agency VAT CP='566.62', Agency WHT CP='186.39', Agency Fee_VAT_WHT CP='5,394.41')]

I have read that this can be done with when and otherwise, but when I run it with the following code I get this error:

from pyspark.sql.functions import when
df.withColumn('Gross Loan Amount',when(((df['Disb Date'] <='2018-03-19') &(df['ID']!=457))
                                       ,(df['Principal']+df['Agency Fee CP']+df['Agency VAT CP']).otherwise(df['Gross Loan Amount'])))

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-18-abd559e65640> in <module>
      1 from pyspark.sql.functions import when
      2 df.withColumn('Gross Loan Amount',when(((df['Disb Date'] <='2018-03-19') &(df['ID']!=457))
----> 3                                        ,(df['Principal']+df['Agency Fee CP']+df['Agency VAT CP']).otherwise(df['Gross Loan Amount'])))

/usr/local/spark/python/pyspark/sql/dataframe.py in __getitem__(self, item)
   1378         """
   1379         if isinstance(item, basestring):
-> 1380             jc = self._jdf.apply(item)
   1381             return Column(jc)
   1382         elif isinstance(item, Column):

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    132                 # Hide where the exception came from that shows a non-Pythonic
    133                 # JVM exception message.
--> 134                 raise_from(converted)
    135             else:
    136                 raise

/usr/local/spark/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: Cannot resolve column name "Gross Loan Amount" among (ID, Provider, Principal, PRINCIPALBALANCE, STATUS, Installment Rate, Yearly Percentage, Processing Fee Percentage, Disb Date, ZOHOID, UPFRONTPROCESSINGFEEBALANCE, WITHHOLDINGTAXBALANCE, UPFRONTPROCESSINGFEEPERCENTAGE, UPFRONTPROCESSINGFEEWHTPERCENTAGE, PROCESSINGFEEWHTPERCENTAGE, PROCESSINGFEEVATPERCENTAGE, BUSINESSSHORTCODE, EXCTRACTIONDATE, fake Fee, fake WHT, fake Fee_WHT, Agency Fee CP, Agency VAT CP, Agency WHT CP, Agency Fee_VAT_WHT CP);

I created a fake dataframe to test that the approach actually works:

df_test = spark.createDataFrame(
    [
        (1, 'foo','a'), # create your data here, be consistent in the types.
        (2, 'bar','b'),
        (2, 'fee','c'),
    ],
    ['id', 'txt','letter'] # add your columns label here
)

df_test.show()

+---+---+------+
| id|txt|letter|
+---+---+------+
|  1|foo|     a|
|  2|bar|     b|
|  2|fee|     c|
+---+---+------+

df_test.withColumn('txt',when(df_test['id']==1,'change').otherwise(df_test['txt'])).show()

+---+------+------+
| id|   txt|letter|
+---+------+------+
|  1|change|     a|
|  2|   bar|     b|
|  2|   fee|     c|
+---+------+------+

Am I doing something wrong, or is there something I am not considering?

Questioner: J.C Guzman

Answer (mck, 2020-11-29 20:33:16):

Gross Loan Amount is not a column in the dataframe, so the otherwise clause in the first example cannot be resolved.

In the second example, however, txt is an existing column, so the otherwise clause can be resolved.

Are you sure you want to modify a column, Gross Loan Amount, that does not exist?